CN112633491A - Method and device for training neural network

Method and device for training neural network

Info

Publication number
CN112633491A
CN112633491A (application number CN201910951167.9A)
Authority
CN
China
Prior art keywords
agent
data
training
environment
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910951167.9A
Other languages
Chinese (zh)
Inventor
徐晨
王坚
皇甫幼睿
李榕
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910951167.9A
Publication of CN112633491A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The application provides a method and a device for training a neural network, relating to the field of artificial intelligence and in particular to the field of neural network training. The method comprises the following steps: determining training data of a first agent according to first data obtained by interaction between the first agent and an environment and second data obtained by interaction between a second agent and the environment, where the environment is the environment corresponding to a radio resource scheduling task; and performing reinforcement learning training on the first agent using the training data of the first agent. Because the training data of the first agent takes into account not only the data of the interaction between the first agent and the environment but also the data of the interaction between the second agent and the environment, the stability and accuracy of the training data of the first agent are improved. Therefore, the convergence capability of the reinforcement learning training can be improved, and the problem that the training process enters a local optimum point when reinforcement learning is applied to a non-stationary environment can be alleviated or avoided.

Description

Method and device for training neural network
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a method and apparatus for training a neural network.
Background
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Machine learning is the core of artificial intelligence. Machine learning methods include supervised learning and reinforcement learning.
Given a training data set, the goal of supervised learning is to learn the mapping between the inputs and outputs in the training data set, and it is also desirable that this mapping applies to data outside the training data set, i.e., when new data arrives, the result can be predicted from the mapping. Here, the training data set is a set of correct input-output pairs. Supervised learning therefore requires a labeled training data set, and for decision problems it is generally difficult to obtain labeled training data.
Reinforcement learning is proposed for problems, such as decision problems, for which it is difficult to obtain a labeled training data set. In reinforcement learning, an agent learns in a "trial and error" manner: rewards (reward) obtained through actions (action) that interact with the environment guide its behavior, and the goal is for the agent to obtain the maximum reward. Reinforcement learning differs from supervised learning mainly in that it does not require a training data set.
Because reinforcement learning learns by trial and error, its convergence capability and convergence speed are far lower than those of supervised learning. In particular, when reinforcement learning is applied to a task whose environment is non-stationary, for example a radio resource scheduling task in the communication field, the training process converges very slowly or even fails to converge, for example by entering a local optimum point.
Therefore, improving the convergence capability of reinforcement learning is an urgent problem.
Disclosure of Invention
The application provides a method and a device for training a neural network, which can improve the convergence capacity of reinforcement learning.
In a first aspect, a method of training a neural network is provided, the method comprising: determining training data of a first agent according to first data obtained by interaction between the first agent and the environment and second data obtained by interaction between a second agent and the environment, wherein the environment is the environment corresponding to a wireless resource scheduling task; and training the reinforcement learning of the first agent by utilizing the training data of the first agent.
The first data represents data obtained by interacting the first agent with the environment. The second data represents data obtained by interacting the second agent with the environment.
The first data includes states and actions resulting from interaction of the first agent with the environment.
Optionally, the first data may further include a performance index obtained by interaction between the first agent and the environment.
Optionally, the first data may further include a reward corresponding to a state and an action resulting from interaction of the first agent with the environment. The reward may be derived from performance metrics resulting from interaction of the first agent with the environment.
The second data includes states and actions resulting from interaction of the second agent with the environment.
Optionally, the second data may further include a performance index obtained by interaction between the second agent and the environment.
Optionally, the second data may further include a reward corresponding to a state and an action resulting from interaction of the second agent with the environment. The reward may be derived from performance metrics resulting from interaction of the second agent with the environment.
It should be understood that, since the training data of the first agent not only considers the data of the interaction between the first agent and the environment, but also considers the data of the interaction between the second agent and the environment, the stability and accuracy of the training data of the first agent can be improved, and therefore, the convergence capability of the reinforcement learning training of the first agent can be improved to a certain extent, so that the convergence speed is improved.
Therefore, the scheme provided by the application can improve the convergence capability of the reinforcement learning algorithm, so that the problem that the training process enters a local optimal point when the reinforcement learning is applied to an unstable environment is favorably alleviated or avoided.
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the training data of the first agent according to the first data obtained by interaction between the first agent and the environment and the second data obtained by interaction between the second agent and the environment includes: using the training data corresponding to the first data and the training data corresponding to the second data as the training data of the first agent.
The training data corresponding to the first data represents training data obtained from the first data. The training data corresponding to the second data represents training data obtained from the second data.
If the first data includes the state and the action obtained by the interaction between the first agent and the environment, and does not include the reward, the acquisition mode of the training data corresponding to the first data is as follows: obtaining a corresponding reward according to a performance index obtained by interaction of the first agent and the environment; and obtaining training data corresponding to the first data according to the obtained reward and the first data.
If the first data includes the state and action obtained by the interaction between the first agent and the environment and also includes the corresponding reward, the training data corresponding to the first data is the first data itself.
And if the first data comprises the state and the action obtained by the interaction of the first agent and the environment and also comprises the corresponding reward and also comprises the performance index obtained by the interaction of the first agent and the environment, the training data corresponding to the first data is the state, the action and the reward which are contained in the first data.
The above explanation of the training data corresponding to the first data is also applicable to the explanation of the training data corresponding to the second data, and is not repeated for brevity.
It should be appreciated that by expanding the training data of the second agent to the training data of the first agent, the richness of the training data of the first agent can be improved, thereby improving the stability and accuracy of the training data of the first agent, and therefore, the convergence ability of the reinforcement learning training of the first agent can be improved to a certain extent, thereby improving the convergence speed.
In addition, the training data of the second agent is expanded into the training data of the first agent, so that the second agent can be considered to be learned by the first agent, the non-stationary problem can be converted into the quasi-stationary problem, and the convergence capability of the reinforcement learning algorithm can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first data includes a performance index obtained by interaction between the first agent and the environment, and the second data includes a performance index obtained by interaction between the second agent and the environment; rewards in training data of the first agent are obtained based on performance indicators in the first data and performance indicators in the second data.
As an alternative implementation, the reward in the training data of the first agent is obtained by linearly weighting the difference between the performance indicator in the first data and the performance indicator in the second data.
As another alternative implementation, the reward in the training data of the first agent is obtained by normalizing the difference between the performance indicator in the first data and the performance indicator in the second data.
It should be understood that, the reward in the training data of the first agent is determined according to the performance index obtained by the interaction between the first agent and the environment and the performance index obtained by the interaction between the second agent and the environment, so that the stability and the accuracy of the reward in the training data of the first agent can be improved, and therefore, the convergence capability of the reinforcement learning training of the first agent can be improved to a certain extent, and the convergence speed is improved.
In addition, according to the performance index obtained by the interaction between the first agent and the environment and the performance index obtained by the interaction between the second agent and the environment, the reward in the training data of the first agent is determined, and the reward can be regarded as a base for establishing the performance index of the non-stationary environment, so that the non-stationary problem can be converted into a quasi-stationary problem, and the convergence capability of the reinforcement learning algorithm can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the acquisition time of the second data is close to the acquisition time of the first data.
It will be appreciated that an acquisition time of the second data that is close to the acquisition time of the first data helps improve the accuracy of the training data of the first agent.
With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: determining training data of a second agent according to the first data and the second data; and training the reinforcement learning of the second agent by utilizing the training data of the second agent.
As an alternative implementation, performing reinforcement learning training on the second agent using the training data of the second agent includes: performing reinforcement learning training on the second agent using the training data of the second agent in every training round of the training process of the first agent.
For example, the training of the second agent alternates with the training of the first agent.
It should be understood that, in this embodiment, through the first agent and the second agent interacting with the environment in turn, and the first agent and the second agent training in turn, mutual learning between the first agent and the second agent can be realized, thereby being beneficial to converting the non-stationary problem into the quasi-stationary problem, and being beneficial to improving the convergence ability of the reinforcement learning algorithm.
As another alternative implementation, performing reinforcement learning training on the second agent using the training data of the second agent includes: performing reinforcement learning training on the second agent using the training data of the second agent only in alternate (interval) training rounds of the training process of the first agent.
It should be appreciated that, in the present embodiment, the computational load of the reinforcement learning training of the agents may be reduced by training the second agent with the training data of the second agent in alternate training rounds during the training of the first agent, rather than training the second agent in each training round of the first agent.
With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: and acquiring a resource scheduling decision of the environment according to the trained first agent.
Optionally, the current state of the environment is taken as the input of the trained first agent to obtain the action output by the first agent, and the current resource scheduling decision of the resource scheduling task is determined according to the action output by the first agent.
Optionally, the action output by the first agent may be used directly as the resource scheduling decision of the current resource scheduling.
Optionally, a first scheduling decision for the radio resource scheduling environment is obtained according to a conventional scheduling decision method, and the resource scheduling decision of the current resource scheduling is determined according to the first scheduling decision and a second scheduling decision corresponding to the action output by the first agent.
With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: and acquiring a resource scheduling decision of the environment according to the trained first agent and the trained second agent.
Optionally, the current state of the radio resource scheduling environment is used as the input of the trained first agent and second agent, respectively, to obtain action 1 output by the first agent and action 2 output by the second agent. And determining the resource scheduling decision of the current resource scheduling according to the scheduling decision corresponding to the action 1 output by the first agent and the scheduling decision corresponding to the action 2 output by the second agent.
Optionally, the resource scheduling decision of the current resource scheduling is determined according to the scheduling decision corresponding to the action 1 output by the first agent, the scheduling decision corresponding to the action 2 output by the second agent, and the scheduling decision obtained by the conventional scheduling decision method.
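As an illustrative sketch only, the following shows one possible way to combine the action of the trained first agent, the action of the trained second agent, and a conventional scheduler's decision into one scheduling priority; the simple averaging rule and the interfaces are assumptions, not a combination method specified by the application:

```python
import numpy as np

def combined_scheduling_priority(state, first_agent, second_agent, conventional_priority):
    """Sketch: merge the priorities from two trained agents and a conventional scheduler.

    first_agent(state) and second_agent(state) are assumed to return an N x M array of
    per-user, per-resource scheduling priorities (the action A_{N x M} in the text);
    conventional_priority is an N x M array from a traditional scheduling method.
    """
    a1 = np.asarray(first_agent(state))     # action 1 output by the first agent
    a2 = np.asarray(second_agent(state))    # action 2 output by the second agent
    return (a1 + a2 + np.asarray(conventional_priority)) / 3.0

# Users are then scheduled in decreasing order of the combined priority.
```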
It should be understood that the method provided by the application is beneficial to alleviating or avoiding the problem that when reinforcement learning is applied to an unstable environment, the training process of the reinforcement learning enters a local optimal point, so that a relatively accurate intelligent agent can be trained and obtained by the method provided by the application, and a relatively reasonable resource scheduling strategy can be obtained based on the intelligent agent.
In a second aspect, a communication device is provided, which may be configured to perform the method of the first aspect.
Optionally, the communication device may comprise means for performing the method of the first aspect.
Optionally, the communication device is a network device.
Optionally, the communication device is a chip.
For example, the communication device is a chip or a circuit configured in the network device. For example, the communication device may be referred to as an AI module.
In a third aspect, a communication device is provided, which comprises a processor coupled to a memory for storing computer programs or instructions, the processor being configured to execute the computer programs or instructions stored by the memory such that the method of the first aspect is performed.
For example, the processor is for executing a memory-stored computer program or instructions causing the communication device to perform the method of the first aspect.
Optionally, the communication device comprises one or more processors.
Optionally, a memory coupled to the processor may also be included in the communication device.
Optionally, the communication device may include one or more memories.
Alternatively, the memory may be integral with the processor or provided separately.
Optionally, a transceiver may also be included in the communication device.
In a fourth aspect, a chip is provided, where the chip includes a processing module and a communication interface, the processing module is configured to control the communication interface to communicate with the outside, and the processing module is further configured to implement the method in the first aspect.
In a fifth aspect, a computer readable storage medium is provided, on which a computer program (also referred to as instructions or code) for implementing the method in the first aspect is stored.
The computer program, when executed by a computer, causes the computer to perform the method of the first aspect, for example. The computer may be a communication device.
In a sixth aspect, a computer program product is provided, which comprises a computer program (also referred to as instructions or code) that, when executed by a computer, causes the computer to carry out the method of the first aspect. The computer may be a communication device.
In summary, in this application, because the training data of the first agent takes into account not only the data of the interaction between the first agent and the environment but also the data of the interaction between the second agent and the environment, the stability and accuracy of the training data of the first agent are improved. Therefore, the convergence capability and convergence speed of the reinforcement learning training of the first agent can be improved to a certain extent, which helps alleviate or avoid the problem that the training process enters a local optimum point when reinforcement learning is applied to a non-stationary environment.
Drawings
Figure 1 is a schematic diagram of a Markov decision process.
Fig. 2 is a schematic diagram of reinforcement learning.
Fig. 3 is a schematic diagram of a radio resource scheduling scenario to which the embodiment of the present application may be applied.
Fig. 4 is a schematic flow diagram of a method of training a neural network according to an embodiment of the present application.
FIG. 5 is another schematic flow chart diagram of a method of training a neural network in accordance with an embodiment of the present application.
FIG. 6 is yet another schematic flow chart diagram of a method of training a neural network in accordance with an embodiment of the present application.
Fig. 7 is a schematic block diagram of an apparatus for training a neural network provided in an embodiment of the present application.
Fig. 8 is a schematic block diagram of an apparatus for training a neural network according to another embodiment of the present application.
Fig. 9 is a schematic block diagram of a communication device provided in an embodiment of the present application.
Fig. 10 is a schematic block diagram of a network device provided in an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Research in the field of artificial intelligence includes robotics, speech recognition, image recognition, natural language processing, decision and reasoning, human-computer interaction, recommendation and search, and the like.
Machine learning is the core of artificial intelligence. Those skilled in the art define machine learning as follows: to accomplish a task T, the performance of a model, measured by P, is gradually improved through a training process E. For example, suppose a model is to recognize whether a picture shows a cat or a dog (task T). To improve the accuracy of the model (performance measure P), pictures are continuously provided to the model so that it learns the differences between cats and dogs (training process E). Through this learning process, the final model obtained is the product of machine learning, and ideally the final model can identify cats and dogs in pictures. The training process is the learning process of machine learning.
The machine learning method comprises supervised learning and reinforcement learning.
Given a training data set, the goal of supervised learning is to learn the mapping between the inputs and outputs in the training data set, and it is also desirable that this mapping applies to data outside the training data set, i.e., when new data arrives, the result can be predicted from the mapping. Here, the training data set is a set of correct input-output pairs. Typically, the correct input-output pairs in the training data set are labeled by humans. In other words, supervised learning can be regarded as providing a learning algorithm with a data set consisting of "correct answers".
It can be seen that supervised learning requires a labeled training data set (i.e., a data set consisting of "correct answers"). However, for some tasks, such as decision-making problems, it is difficult to obtain a labeled training data set.
Reinforcement learning is presented for tasks (e.g., decision-making problems) that have difficulty obtaining labeled training data sets.
Reinforcement Learning (RL), also known as evaluative learning, is used to describe and solve the problem of an agent (agent) learning a strategy during its interaction with the environment so as to maximize the return or achieve a specific goal.
A common model for reinforcement learning is the Markov Decision Process (MDP). The MDP is a mathematical model for analyzing decision problems, as shown in fig. 1. It assumes that the environment has the Markov property (the conditional probability distribution of the future states of the environment depends only on the current state). The decision maker periodically observes the state of the environment (corresponding to the label s in fig. 1), makes a decision (also called an action, corresponding to the label a in fig. 1) according to the current state of the environment, and obtains a new state (label s in fig. 1) and a reward (label r in fig. 1) after interacting with the environment.
In reinforcement learning, the agent learns in a "trial and error" manner: rewards (reward) obtained through actions (action) that interact with the environment guide its behavior, and the goal is for the agent to obtain the maximum reward. Reinforcement learning differs from supervised learning mainly in that no training data set is needed. The reinforcement signal (i.e., the reward) provided by the environment evaluates how good a generated action is, rather than telling the reinforcement learning system how to generate the correct action. Since the information provided by the external environment is very limited, the agent must learn from its own experience. In this way, the agent gains knowledge in the action-evaluation (i.e., reward) loop and improves its action policy to adapt to the environment. Common reinforcement learning algorithms include Q-learning, policy gradient, actor-critic, and the like.
As shown in fig. 2, reinforcement learning mainly includes four elements: agent, environment state (state), action (action), and reward (reward), where the input of the agent is a state and the output is an action.
In the prior art, the training process of reinforcement learning is as follows: the agent interacts with the environment multiple times, and the state, action and reward of each interaction are obtained; the agent is then trained once using these (state, action, reward) tuples as training data. This process is repeated for the next round of training of the agent until the convergence condition is met.
As shown in fig. 2, the process of obtaining the state, action and reward of one interaction is as follows: the current state s0 of the environment is input to the agent to obtain an action a0 output by the agent, and the reward r0 of the current interaction is calculated according to the relevant performance indicators of the environment under action a0, thereby obtaining the state s0, action a0 and reward r0 of the current interaction. The state s0, action a0 and reward r0 of this interaction are recorded so that they can be used later for training the agent. The next state s1 of the environment under action a0 is also recorded so that the agent can perform the next interaction with the environment.
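For illustration, the interaction loop described above can be sketched in Python as follows; the environment interface (reset and step returning the next state and the performance indicators) and the reward function are assumed interfaces used only to show how the (state, action, reward) tuples are collected and recorded.

```python
def collect_interactions(env, agent, reward_fn, num_steps):
    """Sketch of one round of agent-environment interaction (assumed interfaces).

    env.reset() returns an initial state; env.step(action) is assumed to return the
    next state and the performance indicators of the environment under that action;
    reward_fn maps performance indicators to a scalar reward.
    """
    trajectory = []
    state = env.reset()                      # current state s0 of the environment
    for _ in range(num_steps):
        action = agent.act(state)            # action a0 output by the agent
        next_state, perf = env.step(action)  # next state s1 and performance indicators
        reward = reward_fn(perf)             # reward r0 derived from the indicators
        trajectory.append((state, action, reward))  # recorded for later training
        state = next_state                   # used for the next interaction
    return trajectory
```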
Because reinforcement learning learns by trial and error, its convergence capability and convergence speed are far lower than those of supervised learning.
In practical applications, the environment of some decision tasks is non-stationary, for example the radio resource scheduling task in the communication field. When reinforcement learning is applied to the radio resource scheduling task, the feedback of the environment to an action (a scheduling action) is affected by factors such as the positions of the users and variations of the radio channel; for example, at different times the same state and action correspond to different rewards. This may make the convergence of the reinforcement learning training process very slow, or the process may not converge at all, for example by entering a local optimum point.
Artificial intelligence technology, which is becoming increasingly mature, will strongly promote the evolution of future mobile communication network technology. Currently, there is extensive academic research on applying artificial intelligence to the network layer (e.g., network optimization, mobility management, resource allocation) and to the physical layer (e.g., channel coding and decoding, channel prediction, receivers). Applying reinforcement learning to the radio resource scheduling task is expected to be a future trend.
Therefore, improving the convergence capability of reinforcement learning is an urgent problem.
In view of the above problem, the present application provides a method and an apparatus for training a neural network, which can improve the convergence ability of reinforcement learning.
In this application, when the training data of one agent is obtained, the data of the interactions between multiple agents and the environment is considered comprehensively, so that the stability of the training data of the agent can be enhanced, and the adverse effect of a non-stationary environment on reinforcement learning can be reduced.
The embodiments of the present application may be applied to various communication systems, for example, a Long Term Evolution (LTE) system, a 5th Generation (5G) system, a machine to machine (M2M) system, or other communication systems of future evolution. The wireless air interface technology of 5G is called new air interface (NR), and the 5G system can also be called NR system.
The terminal device referred to in the embodiments of the present application may refer to a User Equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or a user equipment. The terminal device may also be a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a vehicle mounted device, a wearable device, a terminal device in a 5G network or a terminal device in a Public Land Mobile Network (PLMN) for future evolution, etc.
The network device in the embodiments of the present application may be configured to communicate with one or more terminals, and may also be configured to communicate with one or more base stations having partial terminal functions (for example, communication between a macro base station and a micro base station, such as an access point). The base station may be an evolved Node B (eNB) in an LTE system, or a base station (gNB) in a 5G system, an NR system. In addition, a base station may also be an Access Point (AP), a transport point (TRP), a Central Unit (CU), or other network entity, and may include some or all of the functions of the above network entities.
Fig. 3 is a schematic diagram of a radio resource scheduling scenario to which the embodiment of the present application may be applied. In the application scenario shown in fig. 3, including the network device 310 and the terminal device 320, the network device 310 may allocate scheduling resources to the terminal device 320. For example, the network device 310 receives information reported by the terminal device 320, such as a channel status. The network device 310 buffers the user queue information sent by the upper layer and waits for sending. In each Transmission Time Interval (TTI), the network device 310 performs priority ordering on users whose user queue information is not empty according to the information reported by the terminal device 320 and/or the information obtained by the network device, allocates resources to the users according to the priority order, and transmits data.
Network device 310 may include a resource scheduling module to schedule resources for users, among other things.
It should be understood that fig. 3 is exemplary only and not limiting. For example, fig. 3 schematically shows 4 terminal devices 320, and in practical applications, the network device 310 may schedule resources for a plurality of terminal devices.
It should be noted that the reinforcement learning training method provided in the present application can be applied to all tasks suitable for solving problems by using reinforcement learning algorithms. For example, the task has a stable environment. Alternatively, the task has an unstable environment, for example, the task is a radio resource scheduling task as shown in fig. 3.
For convenience of description and understanding, the application scenario is described as an example of radio resource scheduling.
As described above, reinforcement learning includes four elements shown in fig. 2: agent, environmental state, action, reward. In the present application, an application scenario is taken as an example of wireless resource scheduling, and a training method of reinforcement learning is described. Hereinafter, the state of the environment (environment corresponding to the radio resource scheduling task) and the action of the agent in the reinforcement learning in the case of applying the reinforcement learning to the radio resource scheduling task are defined.
The state of the radio resource scheduling environment is defined as S = (E_{N×M}, V_{1×M}, T_{1×M}, B_{1×M}), where E_{N×M} denotes the estimated transmission code rate of each user on each minimum scheduling unit (e.g., RBG), N denotes the number of resources, and M denotes the maximum number of users; V_{1×M} denotes the windowed average code rate of each user; T_{1×M} denotes the waiting time of the longest-waiting packet in each user's buffer; and B_{1×M} denotes the size of each user's buffer. The entries of E_{N×M}, V_{1×M}, T_{1×M} and B_{1×M} are all real numbers.
The action of the agent in reinforcement learning is defined as A_{N×M}, where A_{N×M} indicates the priorities of user scheduling. The entries of A_{N×M} are real numbers.
In the case of applying reinforcement learning to the radio resource scheduling task, a reward in reinforcement learning may be determined based on a performance index of the environment. For example, the performance indicators of the environment corresponding to the radio resource scheduling task include, but are not limited to: throughput, fairness, packet loss rate, etc.
Throughput refers to, among other things, the amount of data successfully transmitted per unit time for a network, device, port, virtual circuit, or other facility. For example, throughput may be measured in units of bits, bytes, or packets.
Fairness refers to an index that measures how resources are allocated among the users in a system. For example, a commonly used fairness metric is Jain's fairness index.
The packet loss rate refers to the ratio of the packets discarded due to buffer overflow or timeout to the total number of packets sent from the user's buffer.
It should be understood that the above definitions of the states and actions of the radio resource scheduling environment are only examples and not limitations, and in practical applications, the states and actions of the radio resource scheduling environment may be defined according to application requirements.
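Purely as an illustration of the shapes involved (the dimensions below and the zero initialization are placeholders, not values taken from the application), the state and action can be represented as NumPy arrays:

```python
import numpy as np

N, M = 17, 32   # assumed example values: number of resources (e.g., RBGs) and maximum number of users

# State S = (E, V, T, B) of the radio resource scheduling environment
E = np.zeros((N, M))   # estimated transmission code rate of each user on each RBG
V = np.zeros(M)        # windowed average code rate of each user
T = np.zeros(M)        # waiting time of the longest-waiting packet in each user's buffer
B = np.zeros(M)        # size of each user's buffer
state = (E, V, T, B)

# Action A: an N x M array of real-valued user scheduling priorities
action = np.zeros((N, M))
```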
The method for training the neural network can be applied to network equipment.
Optionally, the software of the network device may be improved according to the method for training a neural network provided by the present application. For example, a reinforcement learning algorithm in the network device is determined based on the method for training a neural network provided in the present application.
Optionally, the hardware of the network device may be improved according to the method for training a neural network provided by the present application. For example, an apparatus that can implement the method for training a neural network provided by the present application is used as the resource scheduling policy module in the network device.
It should be noted that the training of the agent mentioned in the following embodiments refers to training of reinforcement learning for the agent.
Fig. 4 is a schematic flow chart diagram of a method of training a neural network according to an embodiment of the present application. For example, the execution subject of the method may be a network device, or may be a resource scheduling module configured in the network device. The method comprises the following steps.
S410, determining training data of the first agent according to first data obtained by interaction between the first agent and the environment and second data obtained by interaction between the second agent and the environment. Wherein, the environment is the environment corresponding to the radio resource scheduling task.
The first data represents data obtained by interacting the first agent with the environment. The second data represents data obtained by interacting the second agent with the environment.
It should be noted that in the present application, it is referred to a plurality of agents interacting with the same environment. In this embodiment, a first agent and a second agent interact with the same environment.
For example, in step S410, every time the first agent interacts with the environment, the state, action and performance indicators of that interaction are recorded; likewise, every time the second agent interacts with the environment, the state, action and performance indicators of that interaction are recorded. Here, the performance indicators are the performance indicators of the environment that are of interest when calculating the reward.
The first data includes states and actions resulting from interaction of the first agent with the environment.
Optionally, the first data may further include a performance index obtained by interaction between the first agent and the environment.
Optionally, the first data may further include a reward corresponding to a state and an action resulting from interaction of the first agent with the environment. The reward may be derived from performance metrics resulting from interaction of the first agent with the environment. The following will describe a method of acquiring a prize, which will not be described here for the moment.
The second data includes states and actions resulting from interaction of the second agent with the environment.
Optionally, the second data may further include a performance index obtained by interaction between the second agent and the environment.
Optionally, the second data may further include a reward corresponding to a state and an action resulting from interaction of the second agent with the environment. The reward may be derived from performance metrics resulting from interaction of the second agent with the environment.
As described above, training data in reinforcement learning includes three elements: state, action and reward.
The training data of the first agent therefore includes these three elements: state, action and reward.
The training data of an agent mentioned in the embodiments of the present application refers to training data in reinforcement learning and includes the three elements of state, action and reward. This explanation is not repeated when the training data of an agent is mentioned again.
The training data of the first agent is determined from the first data and the second data, indicating that the determination of the training data of the first agent takes into account not only the data of the interaction of the first agent with the environment, but also the data of the interaction of the second agent with the environment. Step S410 will be further described below.
And S420, training reinforcement learning on the first agent by using the training data of the first agent.
It should be understood that, since the training data of the first agent not only considers the data of the interaction between the first agent and the environment, but also considers the data of the interaction between the second agent and the environment, the stability and accuracy of the training data of the first agent can be improved, and therefore, the convergence capability of the reinforcement learning training of the first agent can be improved to a certain extent, so that the convergence speed is improved.
Therefore, the scheme provided by the application can improve the convergence capability of the reinforcement learning algorithm, so that the problem that the training process enters a local optimal point when the reinforcement learning is applied to an unstable environment is favorably alleviated or avoided.
The second agent represents other agents than the first agent.
For example, there may be a difference between the second agent and the first agent in any one or more of the following attributes: neural network structure, reinforcement learning algorithm.
Neural network structures include, but are not limited to: the number of layers of the neural network and the weight configuration of each layer of the neural network.
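For illustration only, the two agents may differ in network depth and width; the layer sizes, dimensions and framework below are assumptions, not values specified by the application.

```python
import torch.nn as nn

def make_policy_network(num_layers, hidden_size, state_dim, action_dim):
    """Sketch: build a simple fully connected policy network."""
    layers, in_dim = [], state_dim
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_size), nn.ReLU()]
        in_dim = hidden_size
    layers.append(nn.Linear(in_dim, action_dim))
    return nn.Sequential(*layers)

# First and second agents with different neural network structures (assumed sizes).
first_agent_net = make_policy_network(num_layers=3, hidden_size=256, state_dim=128, action_dim=64)
second_agent_net = make_policy_network(num_layers=2, hidden_size=128, state_dim=128, action_dim=64)
```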
In the present application, the training data of the first agent may be determined in either of the following implementations.
A first implementation. The training data of the other agents is augmented to the training data of the first agent.
A second implementation. Rewards in the training data of the first agent are obtained by comprehensively considering the interaction data of the first agent with the environment and the interaction data of other agents with the environment.
The first implementation will be described below.
Optionally, in step S410, training data corresponding to the first data and training data corresponding to the second data are used as the training data of the first agent.
The training data corresponding to the first data represents training data obtained from the first data. The training data corresponding to the second data represents training data obtained from the second data.
If the first data includes the state and the action obtained by the interaction between the first agent and the environment, and does not include the reward, the acquisition mode of the training data corresponding to the first data is as follows: obtaining a corresponding reward according to a performance index obtained by interaction of the first agent and the environment; and obtaining training data corresponding to the first data according to the obtained reward and the first data.
If the first data includes the state and action obtained by the interaction between the first agent and the environment and also includes the corresponding reward, the training data corresponding to the first data is the first data itself.
And if the first data comprises the state and the action obtained by the interaction of the first agent and the environment and also comprises the corresponding reward and also comprises the performance index obtained by the interaction of the first agent and the environment, the training data corresponding to the first data is the state, the action and the reward which are contained in the first data.
The above explanation of the training data corresponding to the first data is also applicable to the explanation of the training data corresponding to the second data, and is not repeated for brevity.
For example, by having the first agent interact with the environment, training data 1 {s_0, a_0, r_0; s_1, a_1, r_1; …; s_n1, a_n1, r_n1} is obtained, where n1 denotes the number of times the first agent interacted with the environment in a round of training, and s_i, a_i, r_i denote the state, action and reward in the i-th interaction of the first agent with the environment, i = 0, 1, …, n1. By having the second agent interact with the environment, training data 2 {s'_0, a'_0, r'_0; s'_1, a'_1, r'_1; …; s'_n2, a'_n2, r'_n2} is obtained, where n2 denotes the number of times the second agent interacted with the environment in that round of training, and s'_j, a'_j, r'_j denote the state, action and reward in the j-th interaction of the second agent with the environment, j = 0, 1, …, n2. The training data of the first agent includes training data 1 and training data 2, i.e., the training data of the first agent is {s_0, a_0, r_0; …; s_n1, a_n1, r_n1; s'_0, a'_0, r'_0; …; s'_n2, a'_n2, r'_n2}.
In a first implementation, the training data of the first agent includes both the state, action, and reward resulting from the first agent interacting with the environment, and the state, action, and reward resulting from the second agent interacting with the environment. It can be seen that in a first implementation, the training data of the second agent is augmented to the training data of the first agent.
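A minimal sketch of this first implementation, assuming the per-interaction records are stored as (state, action, reward) tuples:

```python
def build_first_agent_training_data(trajectory_first, trajectory_second):
    """Sketch of the first implementation: the training data of the first agent is the
    union of the (state, action, reward) tuples obtained by the first agent and those
    obtained by the second agent when interacting with the same environment.
    """
    # trajectory_first:  [(s_0, a_0, r_0), ..., (s_n1, a_n1, r_n1)]
    # trajectory_second: [(s'_0, a'_0, r'_0), ..., (s'_n2, a'_n2, r'_n2)]
    return list(trajectory_first) + list(trajectory_second)
```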
It should be appreciated that by expanding the training data of the second agent to the training data of the first agent, the richness of the training data of the first agent can be improved, thereby improving the stability and accuracy of the training data of the first agent, and therefore, the convergence ability of the reinforcement learning training of the first agent can be improved to a certain extent, thereby improving the convergence speed.
In addition, the training data of the second agent is expanded into the training data of the first agent, so that the second agent can be considered to be learned by the first agent, the non-stationary problem can be converted into the quasi-stationary problem, and the convergence capability of the reinforcement learning algorithm can be improved.
Therefore, the scheme provided by the application can improve the convergence capability of the reinforcement learning algorithm, so that the problem that the training process enters a local optimal point when the reinforcement learning is applied to an unstable environment is favorably alleviated or avoided.
The second implementation will be described below.
Optionally, in the embodiment shown in fig. 4, the first data includes a performance index obtained by the first agent interacting with the environment, and the second data includes a performance index obtained by the second agent interacting with the environment. In step S410, training data of the first agent is obtained based on the first data and the second data, wherein the reward in the training data of the first agent is obtained according to the performance index in the first data and the performance index in the second data.
Optionally, the reward in the training data of the first agent is obtained by linearly weighting a difference of a performance indicator in the first data and a performance indicator in the second data.
For example, reward in the training data of the first agent is calculated according to the following equation (1):
reward = α·Δe_0 + β·Δe_1 + γ·Δe_2 + …    (1)
where Δe_i represents the difference between the i-th performance indicator of the first agent and the i-th performance indicator of the second agent, {e_0, e_1, e_2, …} denotes the performance indicators of the environment that are of interest, and {α, β, γ, …} denotes the weighting coefficients.
It should be understood that the ith performance indicator of the first agent refers to the ith performance indicator of the environment after interaction with the first agent. The ith performance indicator of the second agent refers to the ith performance indicator of the environment after interaction with the second agent.
For example, the weighting coefficients α, β, γ may also all be equal to 1.
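A sketch of equation (1), assuming the performance indicators of the two agents are supplied as equally ordered lists and the weighting coefficients are provided by the caller (the indicator values in the usage example are made up for illustration):

```python
def reward_from_indicator_differences(perf_first, perf_second, weights):
    """Sketch of equation (1): reward = alpha*(Δe_0) + beta*(Δe_1) + gamma*(Δe_2) + ...

    perf_first[i] and perf_second[i] are the i-th performance indicator of the environment
    after interaction with the first and the second agent respectively; weights[i] is the
    corresponding weighting coefficient (for example, all weights equal to 1).
    """
    return sum(w * (e1 - e2) for w, e1, e2 in zip(weights, perf_first, perf_second))

# Usage with throughput, fairness and packet loss rate as the indicators of interest:
reward = reward_from_indicator_differences(
    perf_first=[5.1, 0.92, 0.01], perf_second=[4.8, 0.90, 0.02], weights=[1.0, 1.0, 1.0])
```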
Optionally, the reward in the training data of the first agent is obtained by normalizing a difference of the performance indicator in the first data and the performance indicator in the second data.
For example, the reward reward_j in the training data of the first agent is calculated according to the following equation (2):
reward_j = c · Σ_{k ∈ K-1\j} Σ_{i ∈ I} r_jki    (2)
where c represents a normalization factor, K represents the number of agents, I represents the number of performance indicators of interest, K-1\j represents the set of the other K-1 agents excluding agent j, and r_jki represents the reward of the j-th agent for the i-th performance indicator relative to the k-th agent. In this embodiment, the j-th agent is the first agent and K = 2.
As an example, r_jki satisfies a formula that compares the i-th performance indicator obtained by the j-th agent with the i-th performance indicator obtained by the k-th agent (the formula is given as an image in the original publication and is not reproduced here).
For example, when K = I = 2, α_0 = α_1 = 1, and β_0 = β_1 = 0.5, equation (2) can be expressed using Table 1 below.
TABLE 1 (the table is given as an image in the original publication and its contents are not reproduced here)
It should be understood that the above equations (1) and (2) are merely examples and not limitations. In practical applications, the reward in the training data of the first agent may be obtained according to the performance index in the first data and the performance index in the second data in other feasible manners according to application requirements.
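The following sketch is only in the spirit of equation (2). Because the application gives the formula for r_jki and the normalization factor only as images, the normalization 1/((K-1)·I) and the simple win/tie comparison used for r_jki below are assumptions made for illustration and are not the formula of the application:

```python
def normalized_reward(perf_by_agent, j, alpha=1.0, beta=0.5):
    """Sketch of an equation-(2)-style normalized reward for agent j (assumptions inside).

    perf_by_agent[k][i] is the i-th performance indicator obtained by agent k.
    r_jki is ASSUMED here to equal alpha when agent j beats agent k on indicator i,
    beta on a tie, and 0 otherwise; the application's own definition of r_jki is
    given only as an image and is not reproduced in this text.
    """
    num_agents = len(perf_by_agent)          # K
    num_indicators = len(perf_by_agent[j])   # I
    total = 0.0
    for k in range(num_agents):
        if k == j:
            continue
        for i in range(num_indicators):
            if perf_by_agent[j][i] > perf_by_agent[k][i]:
                total += alpha
            elif perf_by_agent[j][i] == perf_by_agent[k][i]:
                total += beta
    return total / ((num_agents - 1) * num_indicators)   # assumed normalization factor
```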
In a second implementation manner, the reward in the training data of the first agent is determined according to the performance index obtained by the interaction between the first agent and the environment and the performance index obtained by the interaction between the second agent and the environment, so that the stability and accuracy of the reward in the training data of the first agent can be improved, and therefore, the convergence capability of the reinforcement learning training of the first agent can be improved to a certain extent, and the convergence speed is improved.
In addition, according to the performance index obtained by the interaction between the first agent and the environment and the performance index obtained by the interaction between the second agent and the environment, the reward in the training data of the first agent is determined, and the reward can be regarded as a base for establishing the performance index of the non-stationary environment, so that the non-stationary problem can be converted into a quasi-stationary problem, and the convergence capability of the reinforcement learning algorithm can be improved.
Therefore, the scheme provided by the application can improve the convergence capability of the reinforcement learning algorithm, so that the problem that the training process enters a local optimal point when the reinforcement learning is applied to an unstable environment is favorably alleviated or avoided.
Optionally, in the first implementation, the reward in the training data of the agent may adopt the method for acquiring the reward in the training data of the first agent in the second implementation.
Alternatively, in a first implementation, rewards in the training data of the agent may be obtained in other ways.
For example, rewards in the training data of the first agent are obtained based on a linearly weighted sum of the respective performance indicators.
For example, reward in the training data of the first agent may be obtained by the following equation:
reward = α·e_0 + β·e_1 + γ·e_2 + …
where {e_0, e_1, e_2, …} denotes the performance indicators of the environment that are of interest, and {α, β, γ, …} denotes the weights.
Optionally, in a second implementation, the training data of other agents may also be augmented to the training data of the first agent. The reward in the training data of other agents may be obtained by the method for obtaining the reward in the training data of the first agent in the second implementation manner, or may be obtained by other feasible methods.
Optionally, in step S410, the acquisition time of the second data is close to the acquisition time of the first data.
For example, in one training round, the first agent and the second agent interact with the environment in turn, as follows.
At time t, the first agent interacts with the environment to obtain state 1, action 1 and performance indicator 1;
at time t+1, the second agent interacts with the environment to obtain state 2, action 2 and performance indicator 2;
at time t+2, the first agent interacts with the environment to obtain state 3, action 3 and performance indicator 3;
at time t+3, the second agent interacts with the environment to obtain state 4, action 4 and performance indicator 4;
and so on.
Assume that the training data of the first agent includes (state 1, action 1, reward 1; state 3, action 3, reward 3; …), where reward 1 may be obtained based on performance indicator 1 and performance indicator 2, and reward 3 may be obtained based on performance indicator 3 and performance indicator 4.
It will be appreciated that an acquisition time of the second data that is close to the acquisition time of the first data helps improve the accuracy of the training data of the first agent.
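A sketch of the interleaved collection schedule above, assuming the two agents take turns interacting with the environment at successive times and that the reward for each of the first agent's steps is computed from its own performance indicators together with the second agent's indicators from the following step (the env/agent interfaces are assumed):

```python
def collect_interleaved(env, first_agent, second_agent, reward_fn, num_pairs):
    """Sketch: the agents interact with the environment in turn (times t, t+1, t+2, ...).

    reward_fn(perf_first, perf_second) computes the first agent's reward from two
    temporally adjacent sets of performance indicators (e.g., as in equation (1)).
    """
    first_agent_data = []
    s = env.reset()
    for _ in range(num_pairs):
        # time t: the first agent interacts with the environment
        a1 = first_agent.act(s)
        s_after_1, perf1 = env.step(a1)
        # time t+1: the second agent interacts with the environment
        a2 = second_agent.act(s_after_1)
        s_after_2, perf2 = env.step(a2)
        # reward 1 is obtained from performance indicators 1 and 2
        first_agent_data.append((s, a1, reward_fn(perf1, perf2)))
        s = s_after_2
    return first_agent_data
```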
In summary, in this application, because the training data of the first agent takes into account not only the data of the interaction between the first agent and the environment but also the data of the interaction between the second agent and the environment, the stability and accuracy of the training data of the first agent are improved. Therefore, the convergence capability and convergence speed of the reinforcement learning training of the first agent can be improved to a certain extent, which helps alleviate or avoid the problem that the training process enters a local optimum point when reinforcement learning is applied to a non-stationary environment.
In the prior art, in order to solve the problem that the training process enters a local optimum point when reinforcement learning is applied to a non-stationary environment, it has been proposed to model the environment in reinforcement learning as a hidden Markov model (HMM) or a partially observable Markov decision process (POMDP).
In this application, the environment in reinforcement learning is modeled as an ordinary Markov decision process (MDP), and the problem that the training process enters a local optimum point when reinforcement learning is applied to a non-stationary environment can still be alleviated or avoided. Therefore, compared with the prior art, the scheme provided by the application can reduce the complexity of the reinforcement learning algorithm.
It should be appreciated that the training process of the first agent typically includes multiple rounds of training, wherein the number of rounds of training is determined by a convergence condition. For example, when the first agent after the nth round of training satisfies the convergence condition, the training process of the first agent is ended, and in this example, the training process of the first agent includes N rounds.
It should also be appreciated that the process of each round of training of the first agent includes: acquiring training data of a first agent; the first agent is trained using the acquired training data.
It should also be appreciated that after each round of training, it is determined whether a convergence condition is satisfied, if so, the training process of the first agent is ended, and if not, the next round of training is continued.
The convergence condition mentioned in the present application includes, but is not limited to, any of the following: the number of training rounds exceeds a threshold, the training time exceeds a threshold, the loss function falls below a threshold, or the test performance reaches a threshold.
In the present application, the above step S410 may be adopted to obtain the training data of the first agent in part or all of the training rounds of the first agent.
Optionally, in each round of training of the first agent, the training data of the first agent is obtained by using the step S410.
Optionally, in some of the training rounds of the first agent, the training data of the first agent is obtained by using step S410, and in the remaining training rounds of the first agent, the training data of the first agent may be obtained by using a conventional method.
For example, in the first N1 training rounds of the first agent, the training data of the first agent is obtained by using step S410 described above, where N1 is less than N and N represents the total number of training rounds of the first agent.
For another example, in the interval training round of the first agent, the training data of the first agent is obtained by using the step S410.
As an example, in the 1st round of training of the first agent, the training data of the first agent is obtained by using step S410; in the 2nd round, the training data of the first agent is obtained by using a conventional method; in the 3rd round, the training data of the first agent is obtained by using step S410; in the 4th round, the training data of the first agent is obtained by using a conventional method; and so on until the convergence condition is satisfied.
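A minimal sketch of this alternating acquisition schedule follows; get_data_s410, get_data_conventional, train_one_round, and converged are hypothetical helpers introduced only to illustrate the control flow.

```python
def training_loop(first_agent, env, get_data_s410, get_data_conventional,
                  train_one_round, converged):
    round_idx = 0
    while True:
        round_idx += 1
        if round_idx % 2 == 1:        # rounds 1, 3, 5, ... use step S410
            data = get_data_s410(first_agent, env)
        else:                          # rounds 2, 4, 6, ... use a conventional method
            data = get_data_conventional(first_agent, env)
        train_one_round(first_agent, data)
        if converged(first_agent, round_idx):
            break
```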
The above-mentioned acquisition of the training data of the first agent using a conventional method means that the training data of the first agent may be acquired using any existing method or any method developed in the future, for example, the method described above in connection with FIG. 1.
In the present application, when the training data of the first agent is acquired, the interaction data of the second agent and the environment is considered, where the second agent may have completed training in advance before the first agent is trained, or may be trained during the training process of the first agent.
Optionally, in the embodiment shown in fig. 4, the method further includes: determining training data of a second agent according to the first data and the second data; and training the reinforcement learning of the second agent by utilizing the training data of the second agent.
Optionally, training reinforcement learning for the second agent using training data of the second agent includes: and in all training rounds in the training process of the first agent, training reinforcement learning is carried out on the second agent by utilizing the training data of the second agent.
For example, the training of the second agent alternates with the training of the first agent.
It should be understood that, in this embodiment, by having the first agent and the second agent interact with the environment in turn and be trained in turn, mutual learning between the two agents can be realized, which helps to convert a non-stationary problem into a quasi-stationary problem and to improve the convergence capability of the reinforcement learning algorithm.
Optionally, training reinforcement learning for the second agent using training data of the second agent includes: in the interval training round in the training process of the first agent, the training data of the second agent is utilized to carry out reinforcement learning training on the second agent.
It should be appreciated that, in the present embodiment, the computational load of the reinforcement learning training of the agents may be reduced by training the second agent with the training data of the second agent in alternate training rounds during the training of the first agent, rather than training the second agent in each training round of the first agent.
The convergence criteria for training the second agent may or may not be the same as the convergence criteria for training the first agent.
As for the training of the second agent, the method of step S410 in the embodiment of the present application may be used to obtain the training data of the second agent, or the conventional method may be used to obtain the training data of the second agent.
To facilitate a better understanding of the solution provided by the present application, an example is given below in connection with fig. 5. The environment in fig. 5 may be the environment corresponding to a radio resource scheduling task.
The respective elements shown in fig. 5 are explained first. The portion enclosed by the dashed box labeled "train 1" represents the training process of the first agent, and the portion enclosed by the dashed box labeled "train 2" represents the training process of the second agent. State 1 and action 1 represent the state and action resulting from the interaction of the first agent with the environment. State 2 and action 2 represent the state and action resulting from the interaction of the second agent with the environment. Performance index 1 represents the performance index resulting from the interaction of the first agent with the environment, and reward 1 represents a reward obtained according to performance index 1. Performance index 2 represents the performance index resulting from the interaction of the second agent with the environment, and reward 2 represents a reward obtained according to performance index 2. The next state 1 represents the next state of the environment after action 1 output by the first agent is applied to the environment, and the next state 2 represents the next state of the environment after action 2 output by the second agent is applied to the environment.
In this example, the training process of reinforcement learning includes the following steps.
Step one, initializing the agent, so that the input of the agent is a state and the output is an action.
The initialization method of the first agent is similar to that of the second agent; for simplicity of description, the first agent is taken as an example below.
For example, a first agent contains a set of neural networks, which may be referred to as a policy network. The input of the policy network is state and the output is action. It should be appreciated that the policy network causes the input of the first agent to be a state and the output to be an action.
Optionally, the first agent may also contain another set of neural networks whose outputs are estimated rewards and inputs are actions, states or intermediate results.
It should be understood that, depending on the reinforcement learning algorithm employed by the first agent, the first agent may contain one or more additional sets of neural networks in addition to the aforementioned policy network. This is not a limitation of the present application.
It is also to be understood that the description in step one, which takes the first agent as an example, also applies to the second agent.
The structure and reinforcement learning algorithm of the first agent and the second agent may be the same or different. Wherein, the structure of the agent includes but is not limited to: the number of groups of neural networks contained by the agent, and the number of layers of neural networks in each group.
It should be noted that even though the first agent and the second agent have the same structure, the weights on the neural networks of the first agent and the second agent are not exactly the same.
In other words, the first agent and the second agent are not exactly the same two agents.
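To make step one more concrete, a minimal policy-network sketch is given below. The fully connected architecture, the dimensions, and the names are assumptions for illustration only; the application does not prescribe a particular network structure.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps an environment state vector to one scheduling-priority score
    per terminal device (illustrative architecture)."""
    def __init__(self, state_dim, num_terminals, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_terminals),
        )

    def forward(self, state):
        return self.net(state)

# Example: an agent scoring 5 terminal devices from a 16-dimensional state.
policy = PolicyNetwork(state_dim=16, num_terminals=5)
action = policy(torch.randn(1, 16))
```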
Step two, alternately training the first agent and the second agent until a convergence condition is reached.
Step two includes the following substeps.
Substep 1): the first agent and the second agent acquire the wireless environment state and perform scheduling actions on the environment according to their own policy networks. The performance indicators of interest (such as throughput, fairness, and packet loss rate) and the next state of the environment are counted, and the state, action, performance indicators, and next state are recorded respectively. This step may be performed multiple times until the data obtained are sufficient to support one training session.
Substep 2): rewards for each interaction are calculated according to the performance indicators recorded in substep 1), and the neural network of the first agent is trained once according to the data obtained from all interactions.
Substep 3): the first agent and the second agent acquire the wireless environment state and perform scheduling actions on the environment according to their own policy networks. The performance indicators of interest (such as throughput, fairness, and packet loss rate) and the next state of the environment are counted, and the state, action, performance indicators, and next state are recorded respectively. This step may be performed multiple times until the data obtained are sufficient to support one training session.
Substep 4): rewards for each interaction are calculated according to the performance indicators recorded in substep 3), and the neural network of the second agent is trained once according to the data obtained from all interactions.
Optionally, substep 2) is not performed simultaneously with substep 4).
For example, in sub-step 2) the neural network parameters of the second agent are frozen, and in sub-step 4) the neural network parameters of the first agent are frozen.
Alternatively, substep 2) and substep 4) may be performed simultaneously.
In this case, substep 1) and substep 3) may be the same step, or may be two steps performed in a sequential order.
For example, each round of training in step two includes substeps 1) through 4) described above.
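The following sketch outlines substeps 1) to 4) as a single training loop; collect_interactions, compute_rewards, train_once, and converged are hypothetical helpers used only to illustrate the alternation, not interfaces defined by this application.

```python
def alternate_training(agent1, agent2, env, collect_interactions,
                       compute_rewards, train_once, converged):
    while not converged(agent1, agent2):
        # Substep 1): both agents act on the environment; record
        # (state, action, performance indices, next state) until enough data.
        batch = collect_interactions(agent1, agent2, env)
        # Substep 2): train the first agent (second agent's parameters frozen).
        train_once(agent1, compute_rewards(batch, for_agent=1))
        # Substep 3): collect fresh interaction data.
        batch = collect_interactions(agent1, agent2, env)
        # Substep 4): train the second agent (first agent's parameters frozen).
        train_once(agent2, compute_rewards(batch, for_agent=2))
```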
It should be noted that, in the present application, other agents participating in the acquisition of the training data of the first agent may include one agent or may include a plurality of agents, and the present application is not limited thereto.
For example, other agents participating in the acquisition of training data for a first agent may include a second agent and a third agent. The description herein for the second agent may apply to the third agent.
In this example, if the first implementation described above is used to obtain the training data of the first agent, the training data of the second agent and the training data of the third agent may be added to the training data of the first agent.
In this example, if the training data of the first agent is obtained by the second implementation manner, the reward in the training data of the first agent can be obtained based on performance indexes obtained by the interaction between the first agent, the second agent and the third agent and the environment respectively.
It is to be understood that this application is described by taking the participation of the second agent in the acquisition of the training data of the first agent as an example; a person skilled in the art can logically deduce the scenario in which the other agents participating in the acquisition of the training data of the first agent include more agents. To avoid redundancy, that scenario is not described in detail herein.
After obtaining the trained first agent, a resource scheduling policy in a radio resource scheduling environment may be obtained based on the first agent.
The agents mentioned in the following embodiments refer to agents after training is completed.
Optionally, as shown in fig. 6, in the embodiment of the present application, the method further includes the following steps.
S430, taking the current state of the environment as the input of the trained first agent, and obtaining the action output by the first agent.
S440, determining the current resource scheduling decision of the resource scheduling task according to the action output by the first agent.
Optionally, in step S440, the action output by the first agent is taken as a resource scheduling decision of the current resource scheduling.
For example, the current state of the radio resource scheduling environment is used as the input of the first agent, and the action output by the first agent is A = {m0, m1, m2, m3, m4}. The action A = {m0, m1, m2, m3, m4} denotes the scheduling priorities of 5 terminal devices on 1 resource block. For example, the terminal device with the highest scheduling priority is selected to be scheduled on this resource block.
As an example, the terminal device with the highest scheduling priority is determined according to the formula argmax(m0, m1, m2, m3, m4).
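For illustration only, the argmax selection above could be realized as follows (the priority values are placeholders, not values from this application):

```python
# Pick the terminal device whose priority score output by the trained
# first agent is highest (illustrative values).
m = [0.1, 0.7, 0.3, 0.5, 0.2]                               # A = {m0, ..., m4}
scheduled_terminal = max(range(len(m)), key=lambda i: m[i])  # argmax
print(scheduled_terminal)                                    # -> 1
```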
Optionally, in step S440, a first scheduling decision of the radio resource scheduling environment is obtained according to a conventional scheduling decision method; and determining the resource scheduling decision of the current resource scheduling according to the first scheduling decision and a second scheduling decision corresponding to the action output by the first agent.
For example, the resource scheduling decision of the current resource scheduling is determined by performing a trade-off (compromise) between the first scheduling decision and the second scheduling decision.
As one example, the current state of the wireless resource scheduling environment is used as the input of the first agent, and the action output by the first agent is A = {m0, m1, m2, m3, m4}, which denotes the scheduling priorities of 5 terminal devices on 1 resource block. For the current radio resource scheduling environment, the scheduling priorities of the 5 terminal devices obtained according to the conventional scheduling decision method are A' = {p0, p1, p2, p3, p4}. The finally scheduled terminal device is determined according to the following formula:
argmax(m0 + p0, m1 + p1, m2 + p2, m3 + p3, m4 + p4)
For example, each of pi and mi (i = 0, 1, 2, 3, 4) in the above formula can be scaled by a coefficient to adjust its weight in the final scheduling result.
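A minimal sketch of this combination is given below; the priority values and the weighting coefficients are assumptions introduced only for illustration.

```python
# Combine the agent's priorities m_i with the conventional scheduler's
# priorities p_i, scale each by an illustrative coefficient, then take
# the argmax to pick the terminal device to schedule.
m = [0.1, 0.7, 0.3, 0.5, 0.2]    # from the trained first agent
p = [0.6, 0.2, 0.4, 0.1, 0.3]    # from a conventional scheduling decision method
w_agent, w_conv = 1.0, 0.5       # weights are assumptions, not values from the text

scores = [w_agent * mi + w_conv * pi for mi, pi in zip(m, p)]
scheduled_terminal = max(range(len(scores)), key=lambda i: scores[i])
```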
Optionally, in an embodiment in which training of the second agent is further included in the process of training the first agent, the resource scheduling policy in the radio resource scheduling environment may be obtained based on the first agent that has completed training and the second agent that has completed training.
Optionally, in step S430, the current state of the radio resource scheduling context is used as the input of the trained first agent and second agent, respectively, to obtain action 1 output by the first agent and action 2 output by the second agent. In step S440, a resource scheduling decision for the current resource scheduling is determined according to the scheduling decision corresponding to action 1 output by the first agent and the scheduling decision corresponding to action 2 output by the second agent.
Optionally, in step S440, a resource scheduling decision for the current resource scheduling is determined according to the scheduling decision corresponding to action 1 output by the first agent, the scheduling decision corresponding to action 2 output by the second agent, and the scheduling decision obtained by the conventional scheduling decision method.
As one example, the current state of the radio resource scheduling environment is used as the input of the first agent and the second agent respectively; the action output by the first agent is A1 = {m0, m1, m2, m3, m4}, and the action output by the second agent is A2 = {q0, q1, q2, q3, q4}. Action A1 = {m0, m1, m2, m3, m4} denotes the scheduling priorities of 5 terminal devices on 1 resource block, and action A2 = {q0, q1, q2, q3, q4} also denotes the scheduling priorities of the 5 terminal devices on the resource block. The finally scheduled terminal device is determined according to the following formula:
argmax(m0 + q0, m1 + q1, m2 + q2, m3 + q3, m4 + q4)
For example, each of mi and qi (i = 0, 1, 2, 3, 4) in the above formula can be scaled by a coefficient to adjust its weight in the final scheduling result.
As another example, the current state of the radio resource scheduling environment is used as the input of the first agent and the second agent respectively; the action output by the first agent is A1 = {m0, m1, m2, m3, m4}, and the action output by the second agent is A2 = {q0, q1, q2, q3, q4}, each denoting the scheduling priorities of 5 terminal devices on 1 resource block. For the current radio resource scheduling environment, the scheduling priorities of the 5 terminal devices obtained according to the conventional scheduling decision method are A3 = {p0, p1, p2, p3, p4}. The finally scheduled terminal device is determined according to the following formula:
argmax(m0 + p0 + q0, m1 + p1 + q1, m2 + p2 + q2, m3 + p3 + q3, m4 + p4 + q4)
For example, each of pi, mi and qi (i = 0, 1, 2, 3, 4) in the above formula can be scaled by a coefficient to adjust its weight in the final scheduling result.
The various embodiments described herein may be implemented as stand-alone solutions or combined in accordance with inherent logic and are intended to fall within the scope of the present application.
Embodiments of the methods provided herein are described above, and embodiments of the apparatus provided herein are described below. It should be understood that the description of the apparatus embodiments corresponds to the description of the method embodiments, and therefore, for brevity, details are not repeated here, since the details that are not described in detail may be referred to the above method embodiments.
In the embodiment of the present application, according to the method example, functional modules of a device for training a neural network may be divided, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present application is schematic, and is only one logical function division, and other feasible division manners may be available in actual implementation. The following description will be given taking the example of dividing each functional module corresponding to each function.
Fig. 7 is a schematic block diagram of an apparatus 700 for training a neural network provided herein. The apparatus 700 may be used to perform the method of training a neural network provided by the above embodiments. The apparatus 700 may be a network device, or may also be a chip or a circuit configured in a network device. Alternatively, the apparatus 700 is a chip or a circuit configured in a network device, in which case the apparatus 700 may be referred to as an AI module.
As shown in fig. 7, the apparatus 700 includes a processing unit 710 and a training unit 720. The processing unit 710 is configured to determine training data of a first agent according to first data obtained by interaction between the first agent and an environment and second data obtained by interaction between a second agent and the environment, where the environment is an environment corresponding to a radio resource scheduling task. The training unit 720 is configured to perform reinforcement learning training on the first agent by using the training data of the first agent.
Optionally, the processing unit 710 is configured to use training data corresponding to the first data and training data corresponding to the second data as the training data of the first agent.
Optionally, the first data includes a performance index obtained by interaction between the first agent and the environment, and the second data includes a performance index obtained by interaction between the second agent and the environment; rewards in training data of the first agent are obtained based on performance indicators in the first data and performance indicators in the second data. That is, the processing unit 710 is configured to obtain a reward in the training data of the first agent based on a performance indicator obtained by interaction between the first agent and the environment and a performance indicator obtained by interaction between the second agent and the environment.
For example, rewards in the training data of the first agent are obtained by linearly weighting the difference of performance indicators in the first data and performance indicators in the second data.
As another example, rewards in the training data of the first agent are obtained by normalizing differences in performance indicators in the first data and performance indicators in the second data.
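The following Python sketch shows two hypothetical realizations of these reward constructions; the weighting coefficient, bias, and normalization form are assumptions, since no specific formula is prescribed here.

```python
# Hypothetical reward constructions based on the difference between the
# first agent's and the second agent's performance indicators.

def reward_linear(perf1, perf2, weight=1.0, bias=0.0):
    """Linearly weighted difference of the two performance indicators."""
    return weight * (perf1 - perf2) + bias

def reward_normalized(perf1, perf2, eps=1e-8):
    """Difference normalized by the magnitudes of the two indicators."""
    return (perf1 - perf2) / (abs(perf1) + abs(perf2) + eps)
```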
Optionally, the acquisition time of the second data is close to the acquisition time of the first data.
Optionally, the processing unit 710 is further configured to determine training data of a second agent according to the first data and the second data; the training unit 720 is further configured to perform reinforcement learning training on the second agent using the training data of the second agent.
Optionally, the processing unit 710 is configured to perform reinforcement learning training on the second agent by using the training data of the second agent in all training rounds in the training process of the first agent.
Optionally, the processing unit 710 is configured to perform reinforcement learning training on the second agent by using training data of the second agent in an interval training round in the training process of the first agent.
Optionally, the processing unit 710 is configured to obtain a resource scheduling decision of the radio resource scheduling task according to the trained first agent.
Optionally, the processing unit 710 is configured to obtain a resource scheduling decision of the radio resource scheduling task according to the trained first agent and the trained second agent.
Alternatively, the apparatus 700 provided in the above embodiment is a chip or a circuit configured in a network device.
For example, in this case, the apparatus 700 may be referred to as an AI module.
Optionally, the apparatus 700 provided in the above embodiment is a network device.
Optionally, as shown in fig. 8, the apparatus 700 may further include a transceiving unit 730. The processing unit 710 is further configured to obtain a resource scheduling policy according to the trained agent (such as the first agent in the foregoing embodiments) obtained by the training unit 720, and the transceiving unit 730 is configured to send resource scheduling configuration information to the terminal device based on the resource scheduling policy.
The processing unit 710 and the training unit 720 in the above embodiments may be implemented by a processor or processor-related circuits. The transceiver unit 730 may be implemented by a transceiver or transceiver-related circuitry. The transceiving unit 730 may also be referred to as a communication unit or a communication interface.
As shown in fig. 9, an embodiment of the present application further provides a communication apparatus 900. The communication device 900 comprises a processor 910, the processor 910 is coupled to a memory 920, the memory 920 is used for storing computer programs or instructions, and the processor 910 is used for executing the computer programs or instructions stored in the memory 920, so that the method in the above method embodiment is executed.
Optionally, as shown in fig. 9, the communication apparatus 900 may further include a memory 920.
Optionally, as shown in fig. 9, the communication device 900 may further include a transceiver 930, and the transceiver 930 is used for receiving and/or transmitting signals. For example, processor 910 may be configured to control transceiver 930 to receive and/or transmit signals.
For example, the processor 910 is configured to implement the processing-related operations performed by the terminal device in the above method embodiments, and the transceiver 930 is configured to implement the transceiving-related operations performed by the terminal device in the above method embodiments.
The communication device 900 is configured to implement the method in the above method embodiment. For example, processor 910 is configured to implement the processing-related operations in the methods of the above method embodiments, and transceiver 930 is configured to implement the transceiving-related operations in the methods of the above method embodiments.
Optionally, the communication apparatus 900 is a chip or a circuit configured in a network device.
In this case, for example, the communication apparatus 900 may be referred to as an AI module.
Optionally, the communication apparatus 900 provided in the above embodiment is a network device.
For example, in this case, the processor 910 is configured to obtain a resource scheduling policy according to the trained agent (e.g., the first agent in the foregoing embodiments), and the transceiver 930 is configured to transmit resource scheduling configuration information to the terminal device based on the resource scheduling policy.
The embodiment of the present application further provides a communication apparatus 1000, where the communication apparatus 1000 may be a network device or a chip. The communication device 1000 may be adapted to perform the method in the above-described method embodiments.
When the communication apparatus 1000 is a network device, it is, for example, a base station. Fig. 10 shows a simplified base station structure. The base station includes portions 1010 and 1020. Portion 1010 is mainly used for receiving and transmitting radio frequency signals and for conversion between radio frequency signals and baseband signals; portion 1020 is mainly used for baseband processing, base station control, and the like. Portion 1010 may be generally referred to as a transceiver unit, a transceiver, a transceiving circuit, or the like. Portion 1020 is generally the control center of the base station, and may be generally referred to as a processing unit, configured to control the base station to perform the processing operations on the network device side in the foregoing method embodiments.
The transceiver unit of part 1010, which may also be referred to as a transceiver or transceiver, includes an antenna and a radio frequency circuit, wherein the radio frequency circuit is mainly used for radio frequency processing. Alternatively, a device for implementing a receiving function in the part 1010 may be regarded as a receiving unit, and a device for implementing a transmitting function may be regarded as a transmitting unit, that is, the part 1010 includes a receiving unit and a transmitting unit. A receiving unit may also be referred to as a receiver, a receiving circuit, or the like, and a transmitting unit may be referred to as a transmitter, a transmitting circuit, or the like.
Section 1020 may include one or more boards, each of which may include one or more processors and one or more memories. The processor is used to read and execute programs in the memory to implement baseband processing functions and control of the base station. If there are multiple boards, the boards may be interconnected to enhance the processing capability. As an alternative implementation, multiple boards may share one or more processors, multiple boards may share one or more memories, or multiple boards may share one or more processors and one or more memories at the same time.
For example, in one implementation, the part 1020 is used to perform the steps S410 and S420 in fig. 4, or to perform the steps S410 to S440 in fig. 5, and/or the part 1020 is also used to perform the steps related to the processing in the method of the above embodiment. The transceiver unit of part 1010 is configured to perform the steps related to transceiving operations in the method of the above embodiments, for example, the transceiver unit of part 1010 is configured to transmit resource scheduling configuration information to the terminal device according to a resource scheduling policy determined based on an output of the trained agent.
It should be understood that fig. 10 is only an example and not a limitation, and the network device including the transceiving unit and the processing unit may not depend on the structure shown in fig. 10.
When the communication device 1000 is a chip, the chip includes a transceiver unit and a processing unit. The transceiver unit can be an input/output circuit and a communication interface; the processing unit is a processor or a microprocessor or an integrated circuit integrated on the chip.
Embodiments of the present application also provide a computer-readable storage medium on which computer instructions for implementing the method in the above method embodiments are stored.
For example, the computer program, when executed by a computer, causes the computer to implement the methods in the above-described method embodiments.
Embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer, cause the computer to implement the method in the above method embodiments.
For the explanation and beneficial effects of the related content in any of the communication apparatuses provided above, reference may be made to the corresponding method embodiments provided above, and details are not repeated here.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Optionally, the network device related in this embodiment of the present application includes a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer. The hardware layer may include hardware such as a Central Processing Unit (CPU), a Memory Management Unit (MMU), and a memory (also referred to as a main memory). The operating system of the operating system layer may be any one or more computer operating systems that implement business processing through processes (processes), such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a windows operating system. The application layer may include applications such as a browser, an address book, word processing software, and instant messaging software.
The embodiment of the present application does not particularly limit a specific structure of an execution subject of the method provided by the embodiment of the present application, as long as communication can be performed by the method provided by the embodiment of the present application by running a program in which codes of the method provided by the embodiment of the present application are recorded. For example, an execution main body of the method provided by the embodiment of the present application may be a terminal device or a network device, or a functional module capable of calling a program and executing the program in the terminal device or the network device.
Various aspects or features of the disclosure may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer-readable media may include, but are not limited to: magnetic storage devices (e.g., hard disk, floppy disk, or magnetic tape), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD), etc.), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), card, stick, or key drive, etc.).
Various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, but is not limited to: wireless channels and various other media capable of storing, containing, and/or carrying instruction(s) and/or data.
It should be understood that the processor mentioned in the embodiments of the present application may be a Central Processing Unit (CPU), and may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory referred to in the embodiments of the application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM). For example, RAM can be used as external cache memory. By way of example and not limitation, RAM may include the following forms: static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and direct bus RAM (DR RAM).
It should be noted that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (memory module) may be integrated into the processor.
It should also be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. Furthermore, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, or portions thereof, may be embodied in the form of a computer software product stored in a storage medium, the computer software product including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing storage media may include, but are not limited to: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (27)

1. A method of training a neural network, comprising:
determining training data of a first agent according to first data obtained by interaction between the first agent and an environment and second data obtained by interaction between a second agent and the environment, wherein the environment is an environment corresponding to a wireless resource scheduling task;
and training reinforcement learning on the first agent by utilizing the training data of the first agent.
2. The method of claim 1, wherein determining training data for a first agent based on first data obtained from interaction of the first agent with an environment and second data obtained from interaction of a second agent with the environment comprises:
and taking the training data corresponding to the first data and the training data corresponding to the second data as the training data of the first agent.
3. The method of claim 1, wherein the first data comprises performance metrics of the first agent interacting with the environment, and wherein the second data comprises performance metrics of the second agent interacting with the environment;
rewards in training data for the first agent are obtained based on performance indicators in the first data and performance indicators in the second data.
4. The method of claim 3, wherein the reward in the training data of the first agent is obtained by linearly weighting a difference between a performance indicator in the first data and a performance indicator in the second data.
5. The method of claim 3, wherein the reward in the training data of the first agent is obtained by normalizing a difference between a performance metric in the first data and a performance metric in the second data.
6. The method of any of claims 1 to 5, wherein the acquisition time of the second data is close to the acquisition time of the first data.
7. The method according to any one of claims 1 to 6, further comprising:
determining training data of the second agent according to the first data and the second data;
and performing reinforcement learning training on the second agent by using the training data of the second agent.
8. The method of claim 7, wherein the training for reinforcement learning of the second agent using the training data of the second agent comprises:
and in all training rounds in the training process of the first agent, training reinforcement learning is carried out on the second agent by using the training data of the second agent.
9. The method of claim 7, wherein the training for reinforcement learning of the second agent using the training data of the second agent comprises:
and in the interval training round in the training process of the first agent, performing reinforcement learning training on the second agent by using the training data of the second agent.
10. The method according to any one of claims 1 to 9, further comprising:
and acquiring a resource scheduling decision of the environment according to the trained first agent.
11. The method according to any one of claims 7 to 9, further comprising:
and acquiring a resource scheduling decision of the environment according to the trained first agent and the trained second agent.
12. An apparatus for training a neural network, comprising:
the processing unit is used for determining training data of a first agent according to first data obtained by interaction between the first agent and an environment and second data obtained by interaction between a second agent and the environment, wherein the environment is an environment corresponding to a wireless resource scheduling task;
and the training unit is used for training the first agent in reinforcement learning by utilizing the training data of the first agent.
13. The apparatus of claim 12, wherein the processing unit is configured to use training data corresponding to the first data and training data corresponding to the second data as the training data of the first agent.
14. The apparatus of claim 12, wherein the first data comprises performance metrics of the first agent interacting with the environment, and wherein the second data comprises performance metrics of the second agent interacting with the environment;
rewards in training data for the first agent are obtained based on performance indicators in the first data and performance indicators in the second data.
15. The apparatus of claim 14, wherein the reward in the training data of the first agent is obtained by linearly weighting a difference between a performance indicator in the first data and a performance indicator in the second data.
16. The apparatus of claim 14, wherein the reward in the training data of the first agent is obtained by normalizing a difference between a performance metric in the first data and a performance metric in the second data.
17. The apparatus of any of claims 12 to 16, wherein the acquisition time of the second data is close to the acquisition time of the first data.
18. The apparatus according to any of claims 12 to 17, wherein the processing unit is further configured to determine training data for the second agent based on the first data and the second data;
the training unit is further configured to perform reinforcement learning training on the second agent using the training data of the second agent.
19. The apparatus according to claim 18, wherein the processing unit is configured to,
and in all training rounds in the training process of the first agent, training reinforcement learning is carried out on the second agent by using the training data of the second agent.
20. The apparatus according to claim 18, wherein the processing unit is configured to,
and in the interval training round in the training process of the first agent, performing reinforcement learning training on the second agent by using the training data of the second agent.
21. The apparatus according to any of claims 12 to 20, wherein the processing unit is configured to obtain a resource scheduling decision of the radio resource scheduling task according to the trained first agent.
22. The apparatus according to any one of claims 18 to 20, wherein the processing unit is configured to,
and acquiring a resource scheduling decision of the wireless resource scheduling task according to the trained first agent and the trained second agent.
23. The apparatus according to any of claims 12 to 22, wherein the apparatus is a network device.
24. A network device comprising an apparatus as claimed in any one of claims 12 to 22.
25. A network device, comprising:
a memory for storing executable instructions;
a processor for invoking and executing the executable instructions in the memory to perform the method of any one of claims 1-11.
26. A computer-readable storage medium, in which program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1 to 11.
27. A computer program product, characterized in that it comprises computer program code for implementing the method of any one of claims 1 to 11 when said computer program code is run on a computer.
CN201910951167.9A 2019-10-08 2019-10-08 Method and device for training neural network Pending CN112633491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910951167.9A CN112633491A (en) 2019-10-08 2019-10-08 Method and device for training neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910951167.9A CN112633491A (en) 2019-10-08 2019-10-08 Method and device for training neural network

Publications (1)

Publication Number Publication Date
CN112633491A true CN112633491A (en) 2021-04-09

Family

ID=75283251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910951167.9A Pending CN112633491A (en) 2019-10-08 2019-10-08 Method and device for training neural network

Country Status (1)

Country Link
CN (1) CN112633491A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469372A (en) * 2021-07-02 2021-10-01 北京市商汤科技开发有限公司 Reinforcement learning training method, device, electronic equipment and storage medium
WO2023098586A1 (en) * 2021-11-30 2023-06-08 维沃移动通信有限公司 Information interaction method and apparatus, and communication device


Similar Documents

Publication Publication Date Title
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
US20200296741A1 (en) Virtual radio access network control
CN113747462A (en) Information processing method and related equipment
CN109710404B (en) Task scheduling method in distributed system
US20220104027A1 (en) Method for sharing spectrum resources, apparatus, electronic device and storage medium
CN112633491A (en) Method and device for training neural network
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
CN114980339B (en) C-V2X multi-service downlink resource allocation method based on variable time slot scheduling
Balakrishnan et al. Deep reinforcement learning based traffic-and channel-aware OFDMA resource allocation
Sun et al. Edge learning with timeliness constraints: Challenges and solutions
CN114546608A (en) Task scheduling method based on edge calculation
CN114723057A (en) Neural network collaborative reasoning method for multi-access edge computing system
CN114785397A (en) Unmanned aerial vehicle base station control method, flight trajectory optimization model construction and training method
US20230104220A1 (en) Radio resource allocation
CN115002123B (en) System and method for rapidly adapting task offloading based on mobile edge computation
CN116187483A (en) Model training method, device, apparatus, medium and program product
CN116321255A (en) Compression and user scheduling method for high-timeliness model in wireless federal learning
CN114095381A (en) Multitask model training method, multitask prediction method and related products
CN112445617A (en) Load strategy selection method and system based on mobile edge calculation
CN116484976A (en) Asynchronous federal learning method in wireless network
CN115174419A (en) Industrial Internet of things scheduling method based on information age under limitation of cut-off time delay
CN114022731A (en) Federal learning node selection method based on DRL
CN114727323A (en) Unmanned aerial vehicle base station control method and device and model training method and device
He et al. A CLSTM and transfer learning based CFDAMA strategy in satellite communication networks
CN114697974B (en) Network coverage optimization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination