CN112734048A - Reinforced learning method - Google Patents

Reinforcement learning method

Info

Publication number
CN112734048A
Authority
CN
China
Prior art keywords
action
value
agent
weight matrix
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110101401.6A
Other languages
Chinese (zh)
Inventor
李纪先
安涛
王瑞杰
朱青山
谭绪祥
刘烜宏
刘宇生
聂琳静
于湃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Feiteng Information Technology Co ltd
Original Assignee
Tianjin Feiteng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Feiteng Information Technology Co ltd filed Critical Tianjin Feiteng Information Technology Co ltd
Priority to CN202110101401.6A priority Critical patent/CN112734048A/en
Publication of CN112734048A publication Critical patent/CN112734048A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a reinforcement learning method, which comprises the following steps: a CPU side initializes agent parameters, where the agent parameters comprise the model parameters of an agent and the input weight matrix and output weight matrix of the agent's multilayer perceptron; the CPU side uses the agent to interact with the environment so as to collect initial experience values and store them in a buffer; the CPU side transmits the initialized agent parameters to an FPGA side; and the FPGA side takes the initialized agent parameters as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer. Because the agent receives its initial training on the CPU side and its subsequent training on the FPGA side, the computing resources and memory space required of the FPGA can be effectively reduced.

Description

Reinforcement learning method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a reinforcement learning method.
Background
Reinforcement learning, which mainly studies what actions an agent takes to maximize cumulative rewards in a random environment, differs from typical deep learning in that the agent itself explores the environment in which it is located and learns the appropriate actions. Therefore, reinforcement learning requires a high generalization ability so as not to be affected by low quality data.
To reduce the dependency on the input data sequence, an experience replay technique is generally applied to DQN (Deep Q-Network) learning: experience values (including the state, action, and reward) are recorded in a buffer, and experience values are then selected at random for training. At present, performing DQN reinforcement learning on edge devices faces the following problems: 1) research on deep learning and reinforcement learning algorithms based on GPU devices incurs high computation power consumption; 2) FPGA-based acceleration platforms exist, but training the weight parameters takes a long time and requires relatively large data-transfer overhead and storage capacity; 3) the TRPO algorithm has been applied to deep reinforcement learning on an FPGA platform, but the structure of its reinforcement neural network is so complex that it consumes considerable resources and power. Consequently, hardware platforms such as FPGAs struggle to run reinforcement learning on their own because of their limited computing and storage resources.
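As an illustration of the experience replay buffer described above, the following Python sketch stores (state, action, reward, next state, done) tuples and supports both random sampling and retrieval of the latest entry; the class and method names are hypothetical and not taken from the patent.
    import random
    from collections import deque

    class ReplayBuffer:
        """Minimal experience replay buffer for (state, action, reward, next_state, done) tuples."""
        def __init__(self, capacity=10000):
            self.memory = deque(maxlen=capacity)   # oldest experiences are discarded automatically

        def push(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size=1):
            # Random selection breaks the temporal correlation of the input sequence.
            return random.sample(list(self.memory), batch_size)

        def latest(self):
            # The sequential update described later uses only the most recent experience.
            return self.memory[-1]

        def __len__(self):
            return len(self.memory)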
Disclosure of Invention
An embodiment of the invention provides a reinforcement learning method that can effectively reduce the computing resources and memory space required of an FPGA (field-programmable gate array) by performing initialization training of the agent at the CPU (central processing unit) side and performing subsequent training at the FPGA side.
An embodiment of the present invention provides a reinforcement learning method, including:
initializing agent parameters at a CPU side, wherein the agent parameters comprise model parameters of an agent, and an input weight matrix and an output weight matrix of a multilayer perceptron of the agent;
the CPU side adopts the agent to interact with the environment so as to collect initial experience values and store the initial experience values into a buffer;
the CPU side transmits the initialized agent parameters to the FPGA side;
and the FPGA side takes the initialized agent parameters as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer.
In some embodiments, the FPGA side iteratively updates the agent by:
reading the initialized agent parameters from the CPU;
acquiring a current state from the environment, and determining a current action according to the current state;
outputting, with the agent, the current action to the environment to obtain a next state from the environment in response to the action, a current reward, and a current episode end marker;
organizing said current state, said current action, said next state, and said current reward into experience values to update data of said buffer when said current episode end flag indicates that a current episode is over;
judging whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, performing initialization training on the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer; if not, and when the number of neurons is greater than the number of network nodes, optimizing the initialized output weight matrix, and updating the action Q value corresponding to the next state by using the current experience value stored in the buffer;
and when the end of the current operation is detected, updating the model parameters of the agent.
In some embodiments, said determining a current action from said current state comprises:
acquiring a random action in the current state;
inputting the random action to the multilayer perceptron for prediction, the multilayer perceptron outputting an action Q value responsive to a next state;
and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
In some embodiments, said determining a current action from said current state comprises:
acquiring a random value of the reward;
judging whether the random value of the reward is smaller than a preset random initial value of an agent running at the CPU end;
if so, inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action;
otherwise, acquiring a random action in the current state, inputting the random action into the multilayer perceptron for prediction, outputting an action Q value responding to the next state by the multilayer perceptron, and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
In some embodiments, the CPU initializes the input weight matrix and the output weight matrix by:
initializing an input weight matrix by adopting a random value, and keeping the input weight matrix unchanged to obtain an initialized input weight matrix;
performing initialization training on the output weight matrix;
and optimizing the output weight matrix after the initialization training to obtain the initialized output weight matrix.
In some embodiments, the method further comprises:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0)^(-1)
β_0 = P_0 H_0^T t_0
the output weight matrix is optimized through the following formulas:
P_i = P_{i-1} - P_{i-1} H_i^T (I + H_i P_{i-1} H_i^T)^(-1) H_i P_{i-1}
β_i = β_{i-1} + P_i H_i^T (t_i - H_i β_{i-1})
wherein β_i is the output weight matrix; P_i ∈ R^{Ñ×Ñ}; H_i ≡ G(x_i·α + b), i ≥ 0; x_i ∈ R^{k_i×n} is the i-th input data set; k_i is the batch size; t_i is the i-th m-dimensional target data; n is the number of input-layer nodes of the multilayer perceptron; m is the number of output-layer nodes of the multilayer perceptron; G is the activation function of the multilayer perceptron; α ∈ R^{n×Ñ} is the input weight matrix; Ñ is the number of hidden-layer nodes of the multilayer perceptron; b ∈ R^{Ñ} is the bias vector of the hidden layer; and I is the identity matrix. The batch size k_i is set to 1.
In some embodiments, the method further comprises L2 regularization of the output weight matrix, which includes the following steps:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0 + δI)^(-1)
β_0 = P_0 H_0^T t_0
where δ represents a regularization parameter that controls the significance of regularization.
In some embodiments, the model parameters of the agent include a first model parameter of the agent running on the CPU side and a second model parameter of the agent running on the FPGA side, and the CPU side initializes the model parameters of the agent by:
initializing model parameters of the agent by:
L(θ_1) = E[(r_t + γ max_a Q(s_{t+1}, a; θ_2) - Q(s_t, a_t; θ_1))^2]
wherein θ_1 is the first model parameter, θ_2 is the second model parameter, and L(θ_1) is the loss function of θ_1; Q(s_t, a_t; θ_1) denotes the Q value of action a_t occurring when the state is s_t under θ_1; r_t is the reward when the state is s_t; γ is a proportional parameter that controls the importance of the next step; and Q(s_{t+1}, a; θ_2) denotes the Q value of action a occurring when the state is s_{t+1} under θ_2.
In some embodiments, the method further comprises:
setting the Lipschitz constant of the multilayer perceptron to 1.
In some embodiments, the FPGA side further includes a step of clipping the action Q value output by the multilayer perceptron, specifically as follows:
and clipping the action Q value output by the multilayer perceptron by the following formula:
d_t = min(max(r_t + γ max_a Q(s_{t+1}, a; θ_2), -1/(1-γ)), 1/(1-γ))
wherein r_t is the reward of this update, d_t is the value stored in the buffer for this update, γ is a proportional parameter that controls the importance of the next step, Q(s_{t+1}, a; θ_2) is the Q value under θ_2, s_{t+1} is the next state, and a is an action.
Compared with the prior art, the embodiment of the invention discloses a reinforcement learning method. Agent parameters are initialized at the CPU side, the agent parameters comprising the model parameters of the agent and the input weight matrix and output weight matrix of the agent's multilayer perceptron. The agent then interacts with the environment to collect initial experience values, which are stored in a buffer. The CPU side transmits the initialized agent parameters to the FPGA side, and the FPGA side takes them as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer. By performing the initialization training of the agent at the CPU side and accelerating the subsequent training and other logic operations at the FPGA side, the amount of computation on the FPGA can be effectively reduced; at the same time, because the experience values are stored in the buffer, the memory space required of the FPGA is greatly reduced, so that the FPGA can realize reinforcement learning with the assistance of the CPU.
In the embodiment, the output weight matrix is initially trained or optimized through whether the number of the neurons in the multilayer perceptron is equal to the number of the network nodes, and then the action Q value corresponding to the next state is updated by using the latest experience value obtained from the buffer based on the trained output weight matrix, so that the dependence on an input data sequence is reduced, and the parameters of the intelligent body can be updated by using a smaller batch processing size sequence, thereby effectively reducing the calculation resources and the memory space, and ensuring that the reinforcement learning can be realized on the FPGA with limited resources. The neural network can be updated according to the sequence (namely the latest experience value) without retraining according to the past result, so that the calculation amount is greatly reduced. Meanwhile, the state variables and the actions are used as the input of the multilayer perceptron, and the corresponding action Q value is used as the output, so that the intelligent agent model is greatly simplified, and the calculated amount is effectively reduced.
In the embodiment, the current action is randomly acquired, the experience value is trained, so that the latest experience value in the buffer is selected in the subsequent updating operation to update the action Q value corresponding to the next state, the random updating strategy is realized, the requirements on the computing resources and the storage resources of the FPGA platform can be greatly reduced, and the state updating is realized on the equipment platform with limited computing resources and storage resources.
By combining the mode of acquiring the current action according to the maximum action Q value with the mode of acquiring the current action at random, this embodiment ensures the maximum reward for each action and also allows the action strategy to be selected according to the specific platform, thereby obtaining the advantages of both modes.
The embodiment adopts the random value to initialize the input weight matrix, keeps unchanged thereafter, and optimizes only the output weight matrix, thereby greatly reducing the calculation amount and the calculation complexity.
The above-described embodiments greatly reduce the amount of computation by optimizing the output weight matrix in order (i.e., the result of the previous training) without requiring retraining based on past results. Secondly, the optimization algorithm of the embodiment is not iterative, and an optimal output weight matrix can be obtained at one time without repeated iterative computation of parameters. And then, in the process of optimizing the output weight matrix, calculating the pseudo-inverse matrix by adopting the batch processing size fixed to 1 so as to eliminate the calculation of the pseudo-inverse matrix, eliminate the calculation caused by matrix singular value decomposition in optimization training and effectively reduce the use of hardware calculation resources and storage capacity.
The embodiment greatly reduces the overfitting problem of the MLP and improves generalization capability of the MLP by carrying out L2 regularization in the initialization training process of the output weight matrix.
According to the embodiment, the fixed target Q network is used at the FPGA end, updating is carried out in a larger period, the stability of reinforcement learning of the intelligent agent at the FPGA end is improved, and the method and the device are suitable for hardware equipment with limited resources and insufficient calculation.
This embodiment limits the Lipschitz constant to 1, which effectively improves the stability of reinforcement learning based on the multilayer perceptron.
In this embodiment, the output value is clipped when the action Q value is updated at the FPGA side, so that abnormal Q values can be effectively suppressed and stable reinforcement learning can be performed.
Drawings
Fig. 1 is a schematic flow chart of a reinforcement learning method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a lightweight reinforcement learning platform based on edge devices according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a reinforcement learning method according to a first embodiment of the present invention;
FIG. 4 is a flowchart illustrating a reinforcement learning method according to a second embodiment of the present invention;
fig. 5 is a flowchart illustrating a reinforcement learning method according to a third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of a reinforcement learning method according to an embodiment of the present invention is shown, where the method includes steps S101 to S104.
In the invention, reinforcement learning is a reward guidance behavior that the intelligent agent learns in a trial-and-error mode and obtains through interaction between actions and the environment, and the goal is to enable the intelligent agent to obtain the maximum reward. Reinforcement learning is the evaluation of the performance of an agent by rewards provided by the environment, so that the agent gains knowledge in the action-reward (evaluation) environment and improves the action scheme to adapt to the environment. Wherein, the training process of reinforcement learning is as follows: and performing multiple interactions with the environment through the intelligent agent, obtaining the action, the state, the reward and the next state of each interaction as the experience value, and training the intelligent agent by using multiple groups of experience values as training data. The above training process is used iteratively until a convergence condition is met. Because hardware platforms such as FPGA are limited in computing resources and storage resources and are difficult to independently operate reinforcement learning, the intelligent agent is initially trained at the CPU end, and logic operations such as subsequent training and the like are accelerated at the FPGA end.
S101, initializing intelligent agent parameters by a CPU (Central processing Unit) end, wherein the intelligent agent parameters comprise model parameters of the intelligent agent, and an input weight matrix and an output weight matrix of a multilayer perceptron of the intelligent agent.
It should be noted that the multilayer perceptron is, specifically, the rule for selecting behaviors that the agent uses in reinforcement learning. In this embodiment, the multilayer perceptron introduces one or more hidden layers on the basis of a single-layer neural network, the hidden layers being located between the input layer and the output layer. Specifically, the multilayer perceptron network is initialized in this embodiment, including but not limited to the number of layers of the network, the number of nodes in each layer, the arrangement mode, the input weight matrix, and the output weight matrix. Illustratively, this embodiment assumes that the numbers of input-layer, hidden-layer, and output-layer nodes are n, Ñ, and m respectively. The input data set is then x ∈ R^{k×n}, where the batch size, i.e. the number of samples processed at a time, is k, and the m-dimensional output data set y ∈ R^{k×m} can be expressed as:
y = G(x·α + b)β
wherein G represents the activation function, α ∈ R^{n×Ñ} represents the input weight matrix, β ∈ R^{Ñ×m} represents the output weight matrix, and b ∈ R^{Ñ} represents the bias vector of the hidden layer.
Assuming the multilayer perceptron approximates the m-dimensional target data t ∈ R^{k×m}, it satisfies the following equation:
G(x·α + b)β = t
Thus, the hidden-layer matrix can be defined as H ≡ G(x·α + b), and the optimal output weight matrix β̂ can be calculated from:
β̂ = H† t
wherein H† denotes the pseudo-inverse matrix of H, which can be calculated by a matrix decomposition algorithm such as singular value decomposition (SVD) or QR decomposition.
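The batch-mode solution above can be illustrated with the following NumPy sketch, in which α and b are drawn at random and β is obtained from the pseudo-inverse of H; the dimensions and the tanh activation are assumptions chosen for illustration only.
    import numpy as np

    n, n_hidden, m, k = 4, 32, 1, 64            # input nodes, hidden nodes, output nodes, batch size
    rng = np.random.default_rng(0)

    alpha = rng.standard_normal((n, n_hidden))  # input weight matrix α, fixed after random init
    b = rng.standard_normal(n_hidden)           # hidden-layer bias vector b

    def G(z):
        return np.tanh(z)                       # activation function

    x = rng.standard_normal((k, n))             # input data set
    t = rng.standard_normal((k, m))             # m-dimensional target data

    H = G(x @ alpha + b)                        # hidden-layer matrix H = G(x·α + b)
    beta = np.linalg.pinv(H) @ t                # β = H† t (pinv relies on SVD internally)
    y = H @ beta                                # network output y = G(x·α + b)β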
In addition, in this embodiment the model parameters of the agent are updated by the agent at a preset interval, wherein the model parameters of the agent include a first model parameter θ_1 of the agent running on the CPU side and a second model parameter θ_2 of the agent running on the FPGA side. Illustratively, at time t, Q(s_t, a_t; θ_1) represents the Q value of behavior a_t occurring when the state is s_t under the model parameters θ_1 of the agent running on the CPU side. θ_1, as the result of training, determines how accurately the neural network predicts Q(s_t, a_t). Thus, the CPU side performs initialization training on the first model parameter θ_1 and the second model parameter θ_2, and sends the second model parameter θ_2 to the FPGA side so that the FPGA side runs the agent with the second model parameter θ_2.
S102, the CPU side adopts the agent to interact with the environment so as to collect an initial experience value and store the initial experience value into a buffer.
Specifically, referring to fig. 2, which is a schematic diagram of the lightweight reinforcement-learning platform based on an edge device according to an embodiment of the present invention, the CPU side initializes the input weight matrix α and the output weight matrix β of the multilayer perceptron; normalizes the input weight matrix; predicts the Q value of behavior a at time t; interacts with the environment and predicts the Q value of behavior a at time t+1; performs L2 regularization on β; and initially trains the reinforcement neural network. The reinforcement learning of the agent adopts an experience replay technique to suppress the influence of time dependency on the input data: in the invention, the CPU side stores the initial experience values obtained from the initialization training (such as (s_t, a_t, r_t, s_{t+1})) in a buffer, so that the FPGA side can randomly extract experience values from the buffer to update the weight parameters of the agent's multilayer perceptron. In addition, the buffer is a device for storing training parameters such as the experience values or the agent parameters and can effectively relieve the memory pressure on the FPGA side; the buffer may be disposed at the FPGA side, at the CPU side, or on another device.
And S103, the CPU end transmits the initialized intelligent agent parameters to the FPGA end.
And S104, the FPGA terminal takes the initialized intelligent agent parameters as initial values of training, and the intelligent agent is subjected to iterative updating according to the initial empirical values read from the buffer.
Specifically, please refer to fig. 2, which is implemented by an FPGA end, and the weight matrices α and β are read from a CPU end; predicting the Q value of the behavior a at the moment t; and interacting with the environment to predict the Q value of the behavior a occurring at time t + 1.
The embodiment of the invention provides a reinforcement learning method. Agent parameters are initialized at the CPU side, the agent parameters comprising the model parameters of the agent and the input weight matrix and output weight matrix of the agent's multilayer perceptron. The agent then interacts with the environment to collect initial experience values, which are stored in a buffer. The CPU side transmits the initialized agent parameters to the FPGA side, and the FPGA side takes them as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer. By initially training the agent at the CPU side and accelerating the subsequent training and other logic operations at the FPGA side, the amount of computation on the FPGA can be effectively reduced; at the same time, because the experience values are stored in the buffer, the memory space required of the FPGA is greatly reduced, so that the FPGA can realize reinforcement learning with the assistance of the CPU.
In some embodiments, the FPGA side iteratively updates the agent by:
reading the initialized agent parameters from the CPU;
acquiring a current state from the environment, and determining a current action according to the current state;
outputting, with the agent, the current action to the environment to obtain a next state from the environment in response to the action, a current reward, and a current episode end marker;
organizing said current state, said current action, said next state, and said current reward into experience values to update data of said buffer when said current episode end flag indicates that a current episode is over;
judging whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, performing initialization training on the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer; if not, optimizing the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer;
and when the end of the current operation is detected, updating the model parameters of the agent.
In this embodiment, the agent running on the FPGA side and its multilayer perceptron are set according to the initialized agent parameters from the CPU side. A current state of the environment is input to the agent's multilayer perceptron, so that the multilayer perceptron generates an action responsive to the current state. The current action is output to the environment with the agent, and a next state responsive to the action, a current reward, and a current-episode end flag are obtained from the environment. It should be noted that whether the current-episode end flag indicates that the current episode is over is determined by judging whether the flag equals 1. When the current-episode flag is 1, the current episode is considered to be over, a new episode is started, and the experience value (s_t, a_t, r_t, s_{t+1}) obtained this time is stored in the buffer. When the current-episode flag is not 1, the current episode is considered not to be over, and the current action is determined again according to the current state until the current-episode flag is 1. Further, in the updating state, it is judged whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; when the number of neurons is equal to the number of network nodes N (t = N), the output weight matrix is initially trained using the experience values stored in the buffer; when t > N, the output weight matrix is optimized and updated sequentially using the latest experience value in the buffer.
It should be noted that reinforcement learning of an agent typically trains its neural network parameters in batches and forms the batches at random using an experience replay technique so as to mitigate the dependency on the input data sequence. The multilayer perceptron, on the other hand, is trained sequentially, so its neural network parameters can be updated in sequence with a small batch size k. The main bottleneck in implementing the multilayer perceptron on a resource-constrained FPGA is the pseudo-inverse matrix operation, which may require a singular value decomposition kernel. Therefore, the pseudo-inverse matrix operation is eliminated in the invention by fixing k to 1, so that the neural network can run on a resource-limited device.
Furthermore, in order to reduce the dependency on the input data sequence while keeping the small batch size k at 1, the invention randomly determines whether to update the neural network parameters at each step (i.e., randomly chooses whether to train on the latest experience value from the buffer). Training uses the latest experience of the sequence obtained from random actions, including but not limited to a set of observations, the action a_t, and the state s_t. With the batch size fixed to 1, the pseudo-inverse matrix operation can be eliminated. The first, initial training is done in software on the CPU, while all subsequent training is computed in the FPGA. In this way, the multilayer perceptron model with the random update strategy and a batch size of 1 can effectively reduce the computing resources and the memory space.
Furthermore, in conventional reinforcement learning of an agent, the i-th node of the output layer of the multilayer perceptron represents the Q value of the i-th action and is trained so that the i-th node can accurately predict Q(s, a_i). In that case, its input and output sizes are equal to the number of state variables and the number of actions, respectively. In the invention, a more simplified input/output pair is adopted: a set of state variables and an action value are given as the input of the agent, and the corresponding Q value is a scalar output. In this form, the Q-learning model of the agent is greatly simplified, thereby effectively reducing the amount of computation.
Therefore, in the embodiment, the output weight matrix is initially trained or optimized according to whether the number of neurons in the multi-layer perceptron is equal to the number of network nodes, and then based on the trained output weight matrix, the action Q value corresponding to the next state is updated by using the latest experience value obtained from the buffer, so that the dependence on the input data sequence is reduced, and the parameters of the agent can be updated by using a smaller batch processing size sequence, so that the calculation resources and the memory space are effectively reduced, and the reinforcement learning can be realized on the FPGA with limited resources. The neural network can be updated according to the sequence (namely the latest experience value) without retraining according to the past result, so that the calculation amount is greatly reduced. Meanwhile, the state variables and the actions are used as the input of the multilayer perceptron, and the corresponding action Q value is used as the output, so that the intelligent agent model is greatly simplified, and the calculated amount is effectively reduced.
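The control flow described above can be summarized in the following hedged Python sketch of the FPGA-side loop; the environment, the action-selection stub, and the two training stubs are toy placeholders (the training formulas themselves are given further below), not an implementation of the patented hardware, and all names and dimensions are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    STATE_DIM, N_ACTIONS, N = 4, 2, 10          # N: number of network (hidden-layer) nodes

    def env_reset():
        return rng.standard_normal(STATE_DIM)

    def env_step(state, action):
        # Returns the next state, the current reward and the current-episode end flag.
        return rng.standard_normal(STATE_DIM), float(rng.uniform(-1, 1)), bool(rng.random() < 0.2)

    def choose_action(state):
        # Action selection; the specific embodiments below give greedy and combined variants.
        return int(rng.integers(N_ACTIONS))

    def initial_train(experiences):
        pass    # initial training of the output weight matrix (formulas given further below)

    def sequential_update(experience):
        pass    # one-sample sequential optimization of the output weight matrix (batch size 1)

    buffer = []                                  # experience buffer shared with the CPU side
    state = env_reset()
    for step in range(500):
        action = choose_action(state)                       # determine the current action
        next_state, reward, done = env_step(state, action)
        if done:
            # Current-episode end flag set: organize (s, a, r, s') into an experience value.
            buffer.append((state, action, reward, next_state))
            if len(buffer) == N:
                initial_train(buffer)                       # neuron count equals node count
            elif len(buffer) > N:
                sequential_update(buffer[-1])               # update with the latest experience value
            state = env_reset()
        else:
            state = next_state
        # The fixed target parameters would be refreshed at a larger period; omitted from this sketch.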
On the basis of the foregoing embodiments, in some embodiments, referring to fig. 3, a flowchart of a first specific embodiment of a reinforcement learning method according to an embodiment of the present invention is shown, where the determining a current action according to the current state is:
and inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action.
In this embodiment, the CPU side initializes the input weight matrix and the output weight matrix with random values, initializes the model parameters of the agent, and initializes the parameter matrix cache. It is then judged whether the episode counter is less than 1; if so, the parameter matrix is transmitted to the FPGA side, otherwise the iteration is skipped. The FPGA side then updates the behavior value according to the maximum Q value and iteratively updates the agent according to the reward data, so that the action with the maximum reward is selected for each action, which improves training efficiency and accuracy. It is judged whether the current-episode end flag is 1; if so, the updated experience value is imported into the buffer, otherwise the behavior value is updated again by the maximum Q value. It is then judged whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, a set of experience values (which may be the latest experience value) is obtained from the buffer to update the action Q value of the next state; otherwise, when the number of neurons is greater than the number of network nodes, the output weight matrix is optimized and the action Q value of the next state is updated with the latest experience value. Then, when the episode counter is divisible by the step counter of the iteration cycle, the model parameters of the agent are updated. Finally, it is judged whether the step counter of the iteration cycle is less than 1; if so, the agent-updating steps are repeated at the FPGA side, otherwise the operation ends.
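A minimal sketch of this greedy selection, assuming the simplified (state, action) → scalar Q perceptron described earlier, might look as follows; the dummy Q function is purely illustrative.
    import numpy as np

    def greedy_action(mlp_q, state, n_actions):
        # mlp_q: callable (state, action) -> scalar Q value, matching the simplified
        # (state variables, action) -> Q input/output pair used by the perceptron here.
        q_values = [mlp_q(state, a) for a in range(n_actions)]
        return int(np.argmax(q_values))          # action corresponding to the maximum Q value

    # Usage with a dummy Q function (illustrative only):
    dummy_q = lambda s, a: float(np.dot(s, s)) * (a + 1)
    print(greedy_action(dummy_q, np.array([0.1, -0.2]), n_actions=3))   # prints 2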
On the basis of the foregoing embodiments, in some embodiments, referring to fig. 4, a flowchart of a reinforcement learning method according to a second specific embodiment of the present invention is shown, where the determining a current action according to the current state is:
acquiring a random action in the current state;
inputting the random action to the multilayer perceptron for prediction, the multilayer perceptron outputting an action Q value responsive to a next state;
and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
Similarly, in this embodiment, the behavior value is updated randomly, and the experience value is trained, so that the latest experience value in the buffer is selected in the subsequent updating operation to update the action Q value corresponding to the next state, thereby implementing a random updating strategy, greatly reducing the requirements on the computation resource and the storage resource of the FPGA platform, and implementing state updating on the device platform with limited computation resource and storage resource.
On the basis of the foregoing embodiments, in some embodiments, referring to fig. 5, a flowchart of a reinforcement learning method according to a third specific embodiment of the present invention is shown, where the determining a current action according to the current state is:
acquiring a random value of the reward;
judging whether the random value of the reward is smaller than a preset random initial value of an agent running at the CPU end;
if so, inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action;
otherwise, acquiring a random action in the current state, inputting the random action into the multilayer perceptron for prediction, outputting an action Q value responding to the next state by the multilayer perceptron, and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
Similarly, by combining the mode of obtaining the current action according to the maximum action Q value with the mode of obtaining the current action at random, this embodiment ensures the maximum reward for each action and also allows the action strategy to be selected according to the specific platform, thereby obtaining the advantages of both modes.
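A hedged sketch of this combined strategy follows; the threshold name and value are assumptions standing in for the preset random initial value of the agent running at the CPU side, and the exploration branch is simplified to returning the random action rather than re-evaluating it through the perceptron.
    import random
    import numpy as np

    def combined_action(mlp_q, state, n_actions, preset_threshold=0.9):
        if random.random() < preset_threshold:
            # Greedy branch: evaluate the perceptron for every candidate action in the
            # current state and keep the action with the maximum Q value.
            q_values = [mlp_q(state, a) for a in range(n_actions)]
            return int(np.argmax(q_values))
        # Otherwise: fall back to a random action in the current state (exploration).
        return random.randrange(n_actions)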
In some embodiments, the CPU initializes the input weight matrix and the output weight matrix by:
initializing an input weight matrix by adopting a random value, and keeping the input weight matrix unchanged to obtain an initialized input weight matrix;
performing initialization training on the output weight matrix;
and optimizing the output weight matrix after the initialization training to obtain the initialized output weight matrix.
In the embodiment, the input weight matrix is initialized by random values and then is kept unchanged, and only the output weight matrix is optimized, so that the calculation amount and the calculation complexity are greatly reduced.
On the basis of the above embodiments, in some embodiments, the method further comprises:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0)^(-1)
β_0 = P_0 H_0^T t_0
The output weight matrix is optimized through the following formulas:
P_i = P_{i-1} - P_{i-1} H_i^T (I + H_i P_{i-1} H_i^T)^(-1) H_i P_{i-1}
β_i = β_{i-1} + P_i H_i^T (t_i - H_i β_{i-1})
wherein β_i is the output weight matrix; P_i ∈ R^{Ñ×Ñ}; H_i ≡ G(x_i·α + b), i ≥ 0; x_i ∈ R^{k_i×n} is the i-th input data set; k_i is the batch size; t_i is the i-th m-dimensional target data; n is the number of input-layer nodes of the multilayer perceptron; m is the number of output-layer nodes of the multilayer perceptron; G is the activation function of the multilayer perceptron; α ∈ R^{n×Ñ} is the input weight matrix; Ñ is the number of hidden-layer nodes of the multilayer perceptron; b ∈ R^{Ñ} is the bias vector of the hidden layer; and I is the identity matrix. The batch size k_i is set to 1.
In this embodiment, the amount of computation is greatly reduced by optimizing the output weight matrix in order (i.e., the result of the previous training) without retraining based on the past results. Secondly, the optimization algorithm of the embodiment is not iterative, and an optimal output weight matrix can be obtained at one time without repeated iterative computation of parameters. And then, in the process of optimizing the output weight matrix, calculating the pseudo-inverse matrix by adopting the batch processing size fixed to 1 so as to eliminate the calculation of the pseudo-inverse matrix, eliminate the calculation caused by matrix singular value decomposition in optimization training and effectively reduce the use of hardware calculation resources and storage capacity.
In a preferred embodiment, the method further includes L2 regularizing the output weight matrix, and the specific steps are as follows:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0 + δI)^(-1)
β_0 = P_0 H_0^T t_0
where δ represents a regularization parameter that controls the significance of regularization.
In the embodiment, the L2 regularization is performed during the initial training process of the output weight matrix, so that the overfitting problem of the MLP is greatly reduced and the generalization capability of the MLP is improved.
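The initial training, the one-sample sequential optimization, and the δ-regularized variant described above can be sketched together in NumPy as follows; the class name, dimensions, and tanh activation are assumptions, and the update equations follow the reconstructed formulas given above.
    import numpy as np

    class SequentialOutputWeights:
        """Sketch of the sequential training scheme for the output weight matrix β."""

        def __init__(self, n_inputs, n_hidden, n_outputs, delta=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.alpha = rng.standard_normal((n_inputs, n_hidden))   # input weight matrix α (kept fixed)
            self.b = rng.standard_normal(n_hidden)                   # hidden-layer bias vector b (kept fixed)
            self.beta = np.zeros((n_hidden, n_outputs))              # output weight matrix β
            self.P = None
            self.delta = delta                                       # L2 regularization parameter δ

        def _hidden(self, x):
            return np.tanh(np.atleast_2d(x) @ self.alpha + self.b)   # H = G(x·α + b)

        def initial_train(self, x0, t0):
            # Initial training: P_0 = (H_0^T H_0 + δI)^(-1), β_0 = P_0 H_0^T t_0.
            H0 = self._hidden(x0)
            self.P = np.linalg.inv(H0.T @ H0 + self.delta * np.eye(H0.shape[1]))
            self.beta = self.P @ H0.T @ t0

        def sequential_update(self, xi, ti):
            # One-sample update (batch size k_i = 1): no pseudo-inverse or SVD is needed.
            Hi = self._hidden(xi)
            PHt = self.P @ Hi.T
            self.P = self.P - PHt @ np.linalg.inv(np.eye(Hi.shape[0]) + Hi @ PHt) @ Hi @ self.P
            self.beta = self.beta + self.P @ Hi.T @ (np.atleast_2d(ti) - Hi @ self.beta)

        def predict(self, x):
            return self._hidden(x) @ self.beta                       # y = G(x·α + b)β

    # Illustrative usage (dimensions are assumptions):
    model = SequentialOutputWeights(n_inputs=3, n_hidden=16, n_outputs=1)
    r = np.random.default_rng(1)
    model.initial_train(r.standard_normal((16, 3)), r.standard_normal((16, 1)))
    model.sequential_update(r.standard_normal(3), r.standard_normal(1))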
In some embodiments, the model parameters of the agent include a first model parameter of the agent running on the CPU side and a second model parameter of the agent running on the FPGA side, and the CPU side initializes the model parameters of the agent by:
initializing model parameters of the agent by:
L(θ_1) = E[(r_t + γ max_a Q(s_{t+1}, a; θ_2) - Q(s_t, a_t; θ_1))^2]
wherein θ_1 is the first model parameter, θ_2 is the second model parameter, and L(θ_1) is the loss function of θ_1; Q(s_t, a_t; θ_1) denotes the Q value of action a_t occurring when the state is s_t under θ_1; r_t is the reward when the state is s_t; γ is a proportional parameter that controls the importance of the next step; and Q(s_{t+1}, a; θ_2) denotes the Q value of action a occurring when the state is s_{t+1} under θ_2.
In this embodiment, a fixed target Q network is used at the FPGA side and is updated with a larger period, which improves the stability of the agent's reinforcement learning at the FPGA side and suits hardware devices with limited resources and insufficient computing power; it also avoids the instability of reinforcement learning that would be caused by θ_1 changing continuously at every time point.
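A minimal sketch of the loss computed with a separate, fixed target parameter set is given below; the callable signatures are assumptions, and the expectation is reduced to a single experience, consistent with the batch size of 1 used elsewhere.
    def q_loss_with_target(q_theta1, q_theta2, s_t, a_t, r_t, s_next, actions, gamma=0.99):
        # q_theta1: Q under the first model parameter θ1 (trained continuously);
        # q_theta2: Q under the second model parameter θ2 (fixed target, refreshed at a larger period).
        target = r_t + gamma * max(q_theta2(s_next, a) for a in actions)
        td_error = target - q_theta1(s_t, a_t)
        return td_error ** 2                     # squared error for this experience (batch size 1)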
In some embodiments, the method further comprises:
setting the Lipschitz constant of the multilayer perceptron to 1.
Specifically, in order to stably acquire the Q value of a specific action, the invention uses a regularization method from deep learning: the range of the output of the multilayer perceptron should stay within a constant multiple of its input so as to maintain stability. More specifically, assuming the input values are x_1 and x_2, the corresponding output values f(x_1) and f(x_2) should satisfy the following constraint:
‖f(x_1) - f(x_2)‖ ≤ K‖x_1 - x_2‖
where K is referred to as the Lipschitz constant. The Lipschitz constant of the multilayer perceptron is obtained from the product of the Lipschitz constants of all layers, each of which is the product of a norm of that layer's weight matrix (e.g., its maximum singular value) and the Lipschitz constant of the activation function (e.g., ReLU or tanh). For stable Q learning, this constant should be suppressed. Spectral regularization can be used to suppress the Lipschitz constant of the multilayer perceptron, in which the sum of the largest singular values of each weight matrix is added as a loss term to the loss function.
A common extension of spectral regularization is spectral normalization, in which the output of the multilayer perceptron is computed from the input data with each weight matrix divided by its maximum singular value. The invention limits the Lipschitz constant to 1 to stabilize reinforcement learning based on the multilayer perceptron.
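A simple sketch of spectral normalization is shown below, assuming a 1-Lipschitz activation such as ReLU or tanh; dividing each weight matrix by its largest singular value keeps each layer's Lipschitz constant at most 1.
    import numpy as np

    def spectral_normalize(weight, eps=1e-12):
        # Divide a weight matrix by its largest singular value so the layer's
        # Lipschitz constant (with a 1-Lipschitz activation) is at most 1.
        sigma_max = np.linalg.norm(weight, ord=2)    # largest singular value
        return weight / (sigma_max + eps)

    # Usage: normalize the perceptron's weight matrices before (or after) each update.
    w = np.random.default_rng(0).standard_normal((8, 4))
    w_sn = spectral_normalize(w)
    assert np.linalg.norm(w_sn, ord=2) <= 1.0 + 1e-9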
In some embodiments, multilayer perceptron networks tend to be unstable, especially when an unfamiliar input is fed into the network, in which case the output values may become abnormal. Since such outliers are very large and exceed the range of normal reward values, they hinder reinforcement learning. In a typical setting, the environment gives a maximum reward of 1 and a minimum reward of -1. Therefore, when the agent performs initialization training and state updating, it clips the output according to the following strategy, so that abnormal Q values can be effectively suppressed and stable reinforcement learning can be performed.
Specifically, the FPGA side further includes a step of clipping the action Q value output by the multilayer perceptron, which includes the following steps:
and clipping the action Q value output by the multilayer perceptron by the following formula:
d_t = min(max(r_t + γ max_a Q(s_{t+1}, a; θ_2), -1/(1-γ)), 1/(1-γ))
wherein r_t is the reward of this update, d_t is the value stored in the buffer for this update, γ is a proportional parameter that controls the importance of the next step, Q(s_{t+1}, a; θ_2) is the Q value under θ_2, s_{t+1} is the next state, and a is an action.
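The clipping strategy can be sketched as follows; because the exact formula comes from an image placeholder, the bounds ±1/(1-γ) (derived from rewards limited to [-1, 1]) are a reconstruction and should be treated as an assumption.
    import numpy as np

    def clipped_target(r_t, q_next_max, gamma=0.99, r_min=-1.0, r_max=1.0):
        # Bounds r_min/(1-γ) and r_max/(1-γ) follow from rewards limited to [r_min, r_max];
        # treat them as an assumption about the patent's clipping rule.
        d_t = r_t + gamma * q_next_max
        return float(np.clip(d_t, r_min / (1.0 - gamma), r_max / (1.0 - gamma)))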
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A reinforcement learning method, comprising:
initializing agent parameters at a CPU side, wherein the agent parameters comprise model parameters of an agent, and an input weight matrix and an output weight matrix of a multilayer perceptron of the agent;
the CPU side adopts the agent to interact with the environment so as to collect initial experience values and store the initial experience values into a buffer;
the CPU side transmits the initialized agent parameters to the FPGA side;
and the FPGA side takes the initialized agent parameters as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer.
2. The reinforcement learning method of claim 1, wherein the FPGA side iteratively updates the agent by:
reading the initialized agent parameters from the CPU;
acquiring a current state from the environment, and determining a current action according to the current state;
outputting, with the agent, the current action to the environment to obtain a next state from the environment in response to the action, a current reward, and a current episode end marker;
organizing said current state, said current action, said next state, and said current reward into experience values to update data of said buffer when said current episode end flag indicates that a current episode is over;
judging whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, performing initialization training on the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer; if not, and when the number of neurons is greater than the number of network nodes, optimizing the initialized output weight matrix, and updating the action Q value corresponding to the next state by using the current experience value stored in the buffer;
and when the end of the current operation is detected, updating the model parameters of the agent.
3. The reinforcement learning method of claim 2, wherein the determining a current action from the current state:
acquiring a random action in the current state;
inputting the random action to the multilayer perceptron for prediction, the multilayer perceptron outputting an action Q value responsive to a next state;
and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
4. The reinforcement learning method of claim 2, wherein the determining a current action from the current state:
acquiring a random value of the reward;
judging whether the random value of the reward is smaller than a preset random initial value of an agent running at the CPU end;
if so, inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action;
otherwise, acquiring a random action in the current state, inputting the random action into the multilayer perceptron for prediction, outputting an action Q value responding to the next state by the multilayer perceptron, and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
5. The reinforcement learning method of claim 1, wherein the CPU initializes the input weight matrix and the output weight matrix by:
initializing an input weight matrix by adopting a random value, and keeping the input weight matrix unchanged to obtain an initialized input weight matrix;
performing initialization training on the output weight matrix;
and optimizing the output weight matrix after the initialization training to obtain the initialized output weight matrix.
6. The reinforcement learning method according to claim 2 or 5, wherein the method further comprises:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0)^(-1)
β_0 = P_0 H_0^T t_0
the output weight matrix is optimized through the following formulas:
P_i = P_{i-1} - P_{i-1} H_i^T (I + H_i P_{i-1} H_i^T)^(-1) H_i P_{i-1}
β_i = β_{i-1} + P_i H_i^T (t_i - H_i β_{i-1})
wherein β_i is the output weight matrix; P_i ∈ R^{Ñ×Ñ}; H_i ≡ G(x_i·α + b), i ≥ 0; x_i ∈ R^{k_i×n} is the i-th input data set; k_i is the batch size; t_i is the i-th m-dimensional target data; n is the number of input-layer nodes of the multilayer perceptron; m is the number of output-layer nodes of the multilayer perceptron; G is the activation function of the multilayer perceptron; α ∈ R^{n×Ñ} is the input weight matrix; Ñ is the number of hidden-layer nodes of the multilayer perceptron; b ∈ R^{Ñ} is the bias vector of the hidden layer; and I is the identity matrix. The batch size k_i is set to 1.
7. The reinforcement learning method of claim 6, further comprising L2 regularization of the output weight matrix by the steps of:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0 + δI)^(-1)
β_0 = P_0 H_0^T t_0
where δ represents a regularization parameter that controls the significance of regularization.
8. The reinforcement learning method of claim 6, wherein the model parameters of the agent comprise a first model parameter of the agent running on the CPU side and a second model parameter of the agent running on the FPGA side, and the CPU side initializes the model parameters of the agent by:
initializing model parameters of the agent by:
L(θ_1) = E[(r_t + γ max_a Q(s_{t+1}, a; θ_2) - Q(s_t, a_t; θ_1))^2]
wherein θ_1 is the first model parameter, θ_2 is the second model parameter, and L(θ_1) is the loss function of θ_1; Q(s_t, a_t; θ_1) denotes the Q value of action a_t occurring when the state is s_t under θ_1; r_t is the reward when the state is s_t; γ is a proportional parameter that controls the importance of the next step; and Q(s_{t+1}, a; θ_2) denotes the Q value of action a occurring when the state is s_{t+1} under θ_2.
9. The reinforcement learning method of claim 5, wherein the method further comprises:
setting the Lipschitz constant of the multilayer perceptron to 1.
10. The reinforcement learning method of claim 8, wherein the FPGA further comprises a step of clipping an action Q value output by the multi-layer perceptron, the specific steps being as follows:
and clipping the action Q value output by the multilayer perceptron by the following formula:
d_t = min(max(r_t + γ max_a Q(s_{t+1}, a; θ_2), -1/(1-γ)), 1/(1-γ))
wherein r_t is the reward of this update, d_t is the value stored in the buffer for this update, γ is a proportional parameter that controls the importance of the next step, Q(s_{t+1}, a; θ_2) is the Q value under θ_2, s_{t+1} is the next state, and a is an action.
CN202110101401.6A 2021-01-26 2021-01-26 Reinforced learning method Pending CN112734048A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110101401.6A CN112734048A (en) 2021-01-26 2021-01-26 Reinforced learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110101401.6A CN112734048A (en) 2021-01-26 2021-01-26 Reinforced learning method

Publications (1)

Publication Number Publication Date
CN112734048A true CN112734048A (en) 2021-04-30

Family

ID=75595326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110101401.6A Pending CN112734048A (en) 2021-01-26 2021-01-26 Reinforced learning method

Country Status (1)

Country Link
CN (1) CN112734048A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
CN111582311A (en) * 2020-04-09 2020-08-25 华南理工大学 Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN111832723A (en) * 2020-07-02 2020-10-27 四川大学 Multi-target neural network-based reinforcement learning value function updating method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
CN111582311A (en) * 2020-04-09 2020-08-25 华南理工大学 Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN111832723A (en) * 2020-07-02 2020-10-27 四川大学 Multi-target neural network-based reinforcement learning value function updating method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROHISA WATANABE et al.: "AN FPGA-BASED ON-DEVICE REINFORCEMENT LEARNING APPROACH USING ONLINE SEQUENTIAL LEARNING", arXiv *

Similar Documents

Publication Publication Date Title
US10992541B2 (en) Methods and apparatus for communication network
CN111556461A (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
WO2018112699A1 (en) Artificial neural network reverse training device and method
EP3889846A1 (en) Deep learning model training method and system
KR102490060B1 (en) Method for restoring a masked face image by using the neural network model
US20180293486A1 (en) Conditional graph execution based on prior simplified graph execution
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
Özalp et al. A review of deep reinforcement learning algorithms and comparative results on inverted pendulum system
Chen et al. Computing offloading decision based on DDPG algorithm in mobile edge computing
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN114449584A (en) Distributed computing unloading method and device based on deep reinforcement learning
CN114090108A (en) Computing task execution method and device, electronic equipment and storage medium
Zhao et al. Offline supervised learning vs online direct policy optimization: A comparative study and a unified training paradigm for neural network-based optimal feedback control
KR102471514B1 (en) Method for overcoming catastrophic forgetting by neuron-level plasticity control and computing system performing the same
CN112734048A (en) Reinforced learning method
CN115220818A (en) Real-time dependency task unloading method based on deep reinforcement learning
CN113721655B (en) Control period self-adaptive reinforcement learning unmanned aerial vehicle stable flight control method
CN114138493A (en) Edge computing power resource scheduling method based on energy consumption perception
US20230214725A1 (en) Method and apparatus for multiple reinforcement learning agents in a shared environment
KR102546176B1 (en) Post-processing method for video stability
Antony et al. Q-Learning: Solutions for Grid World Problem with Forward and Backward Reward Propagations
US11715036B2 (en) Updating weight values in a machine learning system
Seth et al. A scalable species-based genetic algorithm for reinforcement learning problems
KR102570771B1 (en) Method for parameter adjustment of reinforcement learning algorithm
US20220147821A1 (en) Computing device, computer system, and computing method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination