CN112734048A - Reinforced learning method - Google Patents

Reinforcement learning method

Info

Publication number
CN112734048A
Authority
CN
China
Prior art keywords
action
value
agent
weight matrix
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110101401.6A
Other languages
Chinese (zh)
Inventor
李纪先
安涛
王瑞杰
朱青山
谭绪祥
刘烜宏
刘宇生
聂琳静
于湃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Feiteng Information Technology Co ltd
Original Assignee
Tianjin Feiteng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Feiteng Information Technology Co ltd filed Critical Tianjin Feiteng Information Technology Co ltd
Priority to CN202110101401.6A priority Critical patent/CN112734048A/en
Publication of CN112734048A publication Critical patent/CN112734048A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a reinforcement learning method, which comprises the following steps: a CPU side initializes agent parameters, where the agent parameters comprise the model parameters of an agent and the input weight matrix and output weight matrix of the agent's multilayer perceptron; the CPU side uses the agent to interact with the environment so as to collect initial experience values and store them in a buffer; the CPU side transmits the initialized agent parameters to an FPGA side; and the FPGA side takes the initialized agent parameters as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer. Because the agent receives its initial training on the CPU side and its subsequent training on the FPGA side, the computing resources and memory space required of the FPGA can be effectively reduced.

Description

Reinforcement learning method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a reinforcement learning method.
Background
Reinforcement learning, which mainly studies what actions an agent takes to maximize cumulative rewards in a random environment, differs from typical deep learning in that the agent itself explores the environment in which it is located and learns the appropriate actions. Therefore, reinforcement learning requires a high generalization ability so as not to be affected by low quality data.
To reduce the dependency on the input data sequence, an experience replay technique is generally applied to DQN (Deep Q-Network) learning: experience values (including the state, action, and reward) are recorded in a buffer, and experience values are then selected at random for training. At present, performing DQN reinforcement learning on edge devices faces the following problems: 1) research on deep learning and reinforcement learning algorithms based on GPU devices incurs high computation power consumption; 2) FPGA-based acceleration platforms exist, but training the weight parameters takes a long time and requires relatively large data-transfer overhead and storage capacity; 3) the TRPO algorithm has been applied to deep reinforcement learning on an FPGA platform, but the structure of its reinforcement neural network is so complex that it consumes considerable resources and power. Consequently, hardware platforms such as FPGAs struggle to run reinforcement learning on their own because of their limited computing and storage resources.
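As an illustration of the experience replay buffer described above, the following Python sketch stores (state, action, reward, next state, done) tuples and supports both random sampling and retrieval of the latest entry; the class and method names are hypothetical and not taken from the patent.
    import random
    from collections import deque

    class ReplayBuffer:
        """Minimal experience replay buffer for (state, action, reward, next_state, done) tuples."""
        def __init__(self, capacity=10000):
            self.memory = deque(maxlen=capacity)   # oldest experiences are discarded automatically

        def push(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size=1):
            # Random selection breaks the temporal correlation of the input sequence.
            return random.sample(list(self.memory), batch_size)

        def latest(self):
            # The sequential update described later uses only the most recent experience.
            return self.memory[-1]

        def __len__(self):
            return len(self.memory)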
Disclosure of Invention
An embodiment of the invention provides a reinforcement learning method that can effectively reduce the computing resources and memory space required of an FPGA (field-programmable gate array) by performing initialization training of the agent at the CPU (central processing unit) side and performing subsequent training at the FPGA side.
An embodiment of the present invention provides a reinforcement learning method, including:
initializing agent parameters at a CPU side, wherein the agent parameters comprise model parameters of an agent, and an input weight matrix and an output weight matrix of a multilayer perceptron of the agent;
the CPU side adopts the agent to interact with the environment so as to collect initial experience values and store the initial experience values into a buffer;
the CPU side transmits the initialized agent parameters to the FPGA side;
and the FPGA side takes the initialized agent parameters as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer.
In some embodiments, the FPGA side iteratively updates the agent by:
reading the initialized agent parameters from the CPU;
acquiring a current state from the environment, and determining a current action according to the current state;
outputting, with the agent, the current action to the environment to obtain a next state from the environment in response to the action, a current reward, and a current episode end marker;
organizing said current state, said current action, said next state, and said current reward into experience values to update data of said buffer when said current episode end flag indicates that a current episode is over;
judging whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, performing initialization training on the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer; if not, and when the number of neurons is greater than the number of network nodes, optimizing the initialized output weight matrix, and updating the action Q value corresponding to the next state by using the current experience value stored in the buffer;
and when the end of the current operation is detected, updating the model parameters of the agent.
In some embodiments, said determining a current action from said current state comprises:
acquiring a random action in the current state;
inputting the random action to the multilayer perceptron for prediction, the multilayer perceptron outputting an action Q value responsive to a next state;
and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
In some embodiments, said determining a current action from said current state comprises:
acquiring a random value of the reward;
judging whether the random value of the reward is smaller than a preset random initial value of an agent running at the CPU end;
if so, inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action;
otherwise, acquiring a random action in the current state, inputting the random action into the multilayer perceptron for prediction, outputting an action Q value responding to the next state by the multilayer perceptron, and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
In some embodiments, the CPU initializes the input weight matrix and the output weight matrix by:
initializing an input weight matrix by adopting a random value, and keeping the input weight matrix unchanged to obtain an initialized input weight matrix;
performing initialization training on the output weight matrix;
and optimizing the output weight matrix after the initialization training to obtain the initialized output weight matrix.
In some embodiments, the method further comprises:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0)^(-1)
β_0 = P_0 H_0^T t_0
the output weight matrix is optimized through the following formulas:
P_i = P_{i-1} - P_{i-1} H_i^T (I + H_i P_{i-1} H_i^T)^(-1) H_i P_{i-1}
β_i = β_{i-1} + P_i H_i^T (t_i - H_i β_{i-1})
wherein β_i is the output weight matrix; P_i ∈ R^{Ñ×Ñ}; H_i ≡ G(x_i·α + b), i ≥ 0; x_i ∈ R^{k_i×n} is the i-th input data set; k_i is the batch size; t_i is the i-th m-dimensional target data; n is the number of input-layer nodes of the multilayer perceptron; m is the number of output-layer nodes of the multilayer perceptron; G is the activation function of the multilayer perceptron; α ∈ R^{n×Ñ} is the input weight matrix; Ñ is the number of hidden-layer nodes of the multilayer perceptron; b ∈ R^{Ñ} is the bias vector of the hidden layer; and I is the identity matrix. The batch size k_i is set to 1.
In some embodiments, the method further comprises L2 regularization of the output weight matrix, which includes the following steps:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0 + δI)^(-1)
β_0 = P_0 H_0^T t_0
where δ represents a regularization parameter that controls the significance of regularization.
In some embodiments, the model parameters of the agent include a first model parameter of the agent running on the CPU side and a second model parameter of the agent running on the FPGA side, and the CPU side initializes the model parameters of the agent by:
initializing model parameters of the agent by:
L(θ_1) = E[(r_t + γ max_a Q(s_{t+1}, a; θ_2) - Q(s_t, a_t; θ_1))^2]
wherein θ_1 is the first model parameter, θ_2 is the second model parameter, and L(θ_1) is the loss function of θ_1; Q(s_t, a_t; θ_1) denotes the Q value of action a_t occurring when the state is s_t under θ_1; r_t is the reward when the state is s_t; γ is a proportional parameter that controls the importance of the next step; and Q(s_{t+1}, a; θ_2) denotes the Q value of action a occurring when the state is s_{t+1} under θ_2.
In some embodiments, the method further comprises:
setting the Lipschitz constant of the multilayer perceptron to 1.
In some embodiments, the FPGA side further includes a step of clipping the action Q value output by the multilayer perceptron, specifically as follows:
and clipping the action Q value output by the multilayer perceptron by the following formula:
d_t = min(max(r_t + γ max_a Q(s_{t+1}, a; θ_2), -1/(1-γ)), 1/(1-γ))
wherein r_t is the reward of this update, d_t is the value stored in the buffer for this update, γ is a proportional parameter that controls the importance of the next step, Q(s_{t+1}, a; θ_2) is the Q value under θ_2, s_{t+1} is the next state, and a is an action.
Compared with the prior art, the embodiment of the invention discloses a reinforcement learning method. Agent parameters are initialized at the CPU side, the agent parameters comprising the model parameters of the agent and the input weight matrix and output weight matrix of the agent's multilayer perceptron. The agent then interacts with the environment to collect initial experience values, which are stored in a buffer. The CPU side transmits the initialized agent parameters to the FPGA side, and the FPGA side takes them as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer. By performing the initialization training of the agent at the CPU side and accelerating the subsequent training and other logic operations at the FPGA side, the amount of computation on the FPGA can be effectively reduced; at the same time, because the experience values are stored in the buffer, the memory space required of the FPGA is greatly reduced, so that the FPGA can realize reinforcement learning with the assistance of the CPU.
In the embodiment, the output weight matrix is initially trained or optimized through whether the number of the neurons in the multilayer perceptron is equal to the number of the network nodes, and then the action Q value corresponding to the next state is updated by using the latest experience value obtained from the buffer based on the trained output weight matrix, so that the dependence on an input data sequence is reduced, and the parameters of the intelligent body can be updated by using a smaller batch processing size sequence, thereby effectively reducing the calculation resources and the memory space, and ensuring that the reinforcement learning can be realized on the FPGA with limited resources. The neural network can be updated according to the sequence (namely the latest experience value) without retraining according to the past result, so that the calculation amount is greatly reduced. Meanwhile, the state variables and the actions are used as the input of the multilayer perceptron, and the corresponding action Q value is used as the output, so that the intelligent agent model is greatly simplified, and the calculated amount is effectively reduced.
In the embodiment, the current action is randomly acquired, the experience value is trained, so that the latest experience value in the buffer is selected in the subsequent updating operation to update the action Q value corresponding to the next state, the random updating strategy is realized, the requirements on the computing resources and the storage resources of the FPGA platform can be greatly reduced, and the state updating is realized on the equipment platform with limited computing resources and storage resources.
By combining the mode of acquiring the current action according to the maximum action Q value with the mode of acquiring the current action at random, this embodiment ensures the maximum reward for each action and also allows the action strategy to be selected according to the specific platform, thereby obtaining the advantages of both modes.
The embodiment adopts the random value to initialize the input weight matrix, keeps unchanged thereafter, and optimizes only the output weight matrix, thereby greatly reducing the calculation amount and the calculation complexity.
The above-described embodiments greatly reduce the amount of computation by optimizing the output weight matrix in order (i.e., the result of the previous training) without requiring retraining based on past results. Secondly, the optimization algorithm of the embodiment is not iterative, and an optimal output weight matrix can be obtained at one time without repeated iterative computation of parameters. And then, in the process of optimizing the output weight matrix, calculating the pseudo-inverse matrix by adopting the batch processing size fixed to 1 so as to eliminate the calculation of the pseudo-inverse matrix, eliminate the calculation caused by matrix singular value decomposition in optimization training and effectively reduce the use of hardware calculation resources and storage capacity.
The embodiment greatly reduces the overfitting problem of the MLP and improves generalization capability of the MLP by carrying out L2 regularization in the initialization training process of the output weight matrix.
According to the embodiment, the fixed target Q network is used at the FPGA end, updating is carried out in a larger period, the stability of reinforcement learning of the intelligent agent at the FPGA end is improved, and the method and the device are suitable for hardware equipment with limited resources and insufficient calculation.
This embodiment limits the Lipschitz constant to 1, which effectively improves the stability of reinforcement learning based on the multilayer perceptron.
In this embodiment, the output value is clipped when the action Q value is updated at the FPGA side, so that abnormal Q values can be effectively suppressed and stable reinforcement learning can be performed.
Drawings
Fig. 1 is a schematic flow chart of a reinforcement learning method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a lightweight reinforcement learning platform based on edge devices according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a reinforcement learning method according to a first embodiment of the present invention;
FIG. 4 is a flowchart illustrating a reinforcement learning method according to a second embodiment of the present invention;
fig. 5 is a flowchart illustrating a reinforcement learning method according to a third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of a reinforcement learning method according to an embodiment of the present invention is shown, where the method includes steps S101 to S104.
In the invention, reinforcement learning is a reward guidance behavior that the intelligent agent learns in a trial-and-error mode and obtains through interaction between actions and the environment, and the goal is to enable the intelligent agent to obtain the maximum reward. Reinforcement learning is the evaluation of the performance of an agent by rewards provided by the environment, so that the agent gains knowledge in the action-reward (evaluation) environment and improves the action scheme to adapt to the environment. Wherein, the training process of reinforcement learning is as follows: and performing multiple interactions with the environment through the intelligent agent, obtaining the action, the state, the reward and the next state of each interaction as the experience value, and training the intelligent agent by using multiple groups of experience values as training data. The above training process is used iteratively until a convergence condition is met. Because hardware platforms such as FPGA are limited in computing resources and storage resources and are difficult to independently operate reinforcement learning, the intelligent agent is initially trained at the CPU end, and logic operations such as subsequent training and the like are accelerated at the FPGA end.
S101, initializing intelligent agent parameters by a CPU (Central processing Unit) end, wherein the intelligent agent parameters comprise model parameters of the intelligent agent, and an input weight matrix and an output weight matrix of a multilayer perceptron of the intelligent agent.
It should be noted that the multilayer perceptron is, specifically, the rule for selecting behaviors that the agent uses in reinforcement learning. In this embodiment, the multilayer perceptron introduces one or more hidden layers on the basis of a single-layer neural network, the hidden layers being located between the input layer and the output layer. Specifically, the multilayer perceptron network is initialized in this embodiment, including but not limited to the number of layers of the network, the number of nodes in each layer, the arrangement mode, the input weight matrix, and the output weight matrix. Illustratively, this embodiment assumes that the numbers of input-layer, hidden-layer, and output-layer nodes are n, Ñ, and m respectively. The input data set is then x ∈ R^{k×n}, where the batch size, i.e. the number of samples processed at a time, is k, and the m-dimensional output data set y ∈ R^{k×m} can be expressed as:
y = G(x·α + b)β
wherein G represents the activation function, α ∈ R^{n×Ñ} represents the input weight matrix, β ∈ R^{Ñ×m} represents the output weight matrix, and b ∈ R^{Ñ} represents the bias vector of the hidden layer.
Assuming the multilayer perceptron approximates the m-dimensional target data t ∈ R^{k×m}, it satisfies the following equation:
G(x·α + b)β = t
Thus, the hidden-layer matrix can be defined as H ≡ G(x·α + b), and the optimal output weight matrix β̂ can be calculated from:
β̂ = H† t
wherein H† denotes the pseudo-inverse matrix of H, which can be calculated by a matrix decomposition algorithm such as singular value decomposition (SVD) or QR decomposition.
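The batch-mode solution above can be illustrated with the following NumPy sketch, in which α and b are drawn at random and β is obtained from the pseudo-inverse of H; the dimensions and the tanh activation are assumptions chosen for illustration only.
    import numpy as np

    n, n_hidden, m, k = 4, 32, 1, 64            # input nodes, hidden nodes, output nodes, batch size
    rng = np.random.default_rng(0)

    alpha = rng.standard_normal((n, n_hidden))  # input weight matrix α, fixed after random init
    b = rng.standard_normal(n_hidden)           # hidden-layer bias vector b

    def G(z):
        return np.tanh(z)                       # activation function

    x = rng.standard_normal((k, n))             # input data set
    t = rng.standard_normal((k, m))             # m-dimensional target data

    H = G(x @ alpha + b)                        # hidden-layer matrix H = G(x·α + b)
    beta = np.linalg.pinv(H) @ t                # β = H† t (pinv relies on SVD internally)
    y = H @ beta                                # network output y = G(x·α + b)β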
In addition, in this embodiment the model parameters of the agent are updated by the agent at a preset interval, wherein the model parameters of the agent include a first model parameter θ_1 of the agent running on the CPU side and a second model parameter θ_2 of the agent running on the FPGA side. Illustratively, at time t, Q(s_t, a_t; θ_1) represents the Q value of behavior a_t occurring when the state is s_t under the model parameters θ_1 of the agent running on the CPU side. θ_1, as the result of training, determines how accurately the neural network predicts Q(s_t, a_t). Thus, the CPU side performs initialization training on the first model parameter θ_1 and the second model parameter θ_2, and sends the second model parameter θ_2 to the FPGA side so that the FPGA side runs the agent with the second model parameter θ_2.
S102, the CPU side adopts the agent to interact with the environment so as to collect an initial experience value and store the initial experience value into a buffer.
Specifically, referring to fig. 2, which is a schematic diagram of the lightweight reinforcement-learning platform based on an edge device according to an embodiment of the present invention, the CPU side initializes the input weight matrix α and the output weight matrix β of the multilayer perceptron; normalizes the input weight matrix; predicts the Q value of behavior a at time t; interacts with the environment and predicts the Q value of behavior a at time t+1; performs L2 regularization on β; and initially trains the reinforcement neural network. The reinforcement learning of the agent adopts an experience replay technique to suppress the influence of time dependency on the input data: in the invention, the CPU side stores the initial experience values obtained from the initialization training (such as (s_t, a_t, r_t, s_{t+1})) in a buffer, so that the FPGA side can randomly extract experience values from the buffer to update the weight parameters of the agent's multilayer perceptron. In addition, the buffer is a device for storing training parameters such as the experience values or the agent parameters and can effectively relieve the memory pressure on the FPGA side; the buffer may be disposed at the FPGA side, at the CPU side, or on another device.
And S103, the CPU end transmits the initialized intelligent agent parameters to the FPGA end.
And S104, the FPGA terminal takes the initialized intelligent agent parameters as initial values of training, and the intelligent agent is subjected to iterative updating according to the initial empirical values read from the buffer.
Specifically, please refer to fig. 2, which is implemented by an FPGA end, and the weight matrices α and β are read from a CPU end; predicting the Q value of the behavior a at the moment t; and interacting with the environment to predict the Q value of the behavior a occurring at time t + 1.
The embodiment of the invention provides a reinforcement learning method. Agent parameters are initialized at the CPU side, the agent parameters comprising the model parameters of the agent and the input weight matrix and output weight matrix of the agent's multilayer perceptron. The agent then interacts with the environment to collect initial experience values, which are stored in a buffer. The CPU side transmits the initialized agent parameters to the FPGA side, and the FPGA side takes them as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer. By initially training the agent at the CPU side and accelerating the subsequent training and other logic operations at the FPGA side, the amount of computation on the FPGA can be effectively reduced; at the same time, because the experience values are stored in the buffer, the memory space required of the FPGA is greatly reduced, so that the FPGA can realize reinforcement learning with the assistance of the CPU.
In some embodiments, the FPGA side iteratively updates the agent by:
reading the initialized agent parameters from the CPU;
acquiring a current state from the environment, and determining a current action according to the current state;
outputting, with the agent, the current action to the environment to obtain a next state from the environment in response to the action, a current reward, and a current episode end marker;
organizing said current state, said current action, said next state, and said current reward into experience values to update data of said buffer when said current episode end flag indicates that a current episode is over;
judging whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, performing initialization training on the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer; if not, optimizing the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer;
and when the end of the current operation is detected, updating the model parameters of the agent.
In this embodiment, the agent running on the FPGA side and its multilayer perceptron are set according to the initialized agent parameters from the CPU side. A current state of the environment is input to the agent's multilayer perceptron, so that the multilayer perceptron generates an action responsive to the current state. The current action is output to the environment with the agent, and a next state responsive to the action, a current reward, and a current-episode end flag are obtained from the environment. It should be noted that whether the current-episode end flag indicates that the current episode is over is determined by judging whether the flag equals 1. When the current-episode flag is 1, the current episode is considered to be over, a new episode is started, and the experience value (s_t, a_t, r_t, s_{t+1}) obtained this time is stored in the buffer. When the current-episode flag is not 1, the current episode is considered not to be over, and the current action is determined again according to the current state until the current-episode flag is 1. Further, in the updating state, it is judged whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; when the number of neurons is equal to the number of network nodes N (t = N), the output weight matrix is initially trained using the experience values stored in the buffer; when t > N, the output weight matrix is optimized and updated sequentially using the latest experience value in the buffer.
It should be noted that reinforcement learning of an agent typically trains its neural network parameters in batches and forms the batches at random using an experience replay technique so as to mitigate the dependency on the input data sequence. The multilayer perceptron, on the other hand, is trained sequentially, so its neural network parameters can be updated in sequence with a small batch size k. The main bottleneck in implementing the multilayer perceptron on a resource-constrained FPGA is the pseudo-inverse matrix operation, which may require a singular value decomposition kernel. Therefore, the pseudo-inverse matrix operation is eliminated in the invention by fixing k to 1, so that the neural network can run on a resource-limited device.
Furthermore, in order to reduce the dependency on the input data sequence while keeping the small batch size k at 1, the invention randomly determines whether to update the neural network parameters at each step (i.e., randomly chooses whether to train on the latest experience value from the buffer). Training uses the latest experience of the sequence obtained from random actions, including but not limited to a set of observations, the action a_t, and the state s_t. With the batch size fixed to 1, the pseudo-inverse matrix operation can be eliminated. The first, initial training is done in software on the CPU, while all subsequent training is computed in the FPGA. In this way, the multilayer perceptron model with the random update strategy and a batch size of 1 can effectively reduce the computing resources and the memory space.
Furthermore, in conventional reinforcement learning of an agent, the i-th node of the output layer of the multilayer perceptron represents the Q value of the i-th action and is trained so that the i-th node can accurately predict Q(s, a_i). In that case, its input and output sizes are equal to the number of state variables and the number of actions, respectively. In the invention, a more simplified input/output pair is adopted: a set of state variables and an action value are given as the input of the agent, and the corresponding Q value is a scalar output. In this form, the Q-learning model of the agent is greatly simplified, thereby effectively reducing the amount of computation.
Therefore, in the embodiment, the output weight matrix is initially trained or optimized according to whether the number of neurons in the multi-layer perceptron is equal to the number of network nodes, and then based on the trained output weight matrix, the action Q value corresponding to the next state is updated by using the latest experience value obtained from the buffer, so that the dependence on the input data sequence is reduced, and the parameters of the agent can be updated by using a smaller batch processing size sequence, so that the calculation resources and the memory space are effectively reduced, and the reinforcement learning can be realized on the FPGA with limited resources. The neural network can be updated according to the sequence (namely the latest experience value) without retraining according to the past result, so that the calculation amount is greatly reduced. Meanwhile, the state variables and the actions are used as the input of the multilayer perceptron, and the corresponding action Q value is used as the output, so that the intelligent agent model is greatly simplified, and the calculated amount is effectively reduced.
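The control flow described above can be summarized in the following hedged Python sketch of the FPGA-side loop; the environment, the action-selection stub, and the two training stubs are toy placeholders (the training formulas themselves are given further below), not an implementation of the patented hardware, and all names and dimensions are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    STATE_DIM, N_ACTIONS, N = 4, 2, 10          # N: number of network (hidden-layer) nodes

    def env_reset():
        return rng.standard_normal(STATE_DIM)

    def env_step(state, action):
        # Returns the next state, the current reward and the current-episode end flag.
        return rng.standard_normal(STATE_DIM), float(rng.uniform(-1, 1)), bool(rng.random() < 0.2)

    def choose_action(state):
        # Action selection; the specific embodiments below give greedy and combined variants.
        return int(rng.integers(N_ACTIONS))

    def initial_train(experiences):
        pass    # initial training of the output weight matrix (formulas given further below)

    def sequential_update(experience):
        pass    # one-sample sequential optimization of the output weight matrix (batch size 1)

    buffer = []                                  # experience buffer shared with the CPU side
    state = env_reset()
    for step in range(500):
        action = choose_action(state)                       # determine the current action
        next_state, reward, done = env_step(state, action)
        if done:
            # Current-episode end flag set: organize (s, a, r, s') into an experience value.
            buffer.append((state, action, reward, next_state))
            if len(buffer) == N:
                initial_train(buffer)                       # neuron count equals node count
            elif len(buffer) > N:
                sequential_update(buffer[-1])               # update with the latest experience value
            state = env_reset()
        else:
            state = next_state
        # The fixed target parameters would be refreshed at a larger period; omitted from this sketch.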
On the basis of the foregoing embodiments, in some embodiments, referring to fig. 3, a flowchart of a first specific embodiment of a reinforcement learning method according to an embodiment of the present invention is shown, where the determining a current action according to the current state is:
and inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action.
In this embodiment, the CPU side initializes the input weight matrix and the output weight matrix with random values, initializes the model parameters of the agent, and initializes the parameter matrix cache. It is then judged whether the episode counter is less than 1; if so, the parameter matrix is transmitted to the FPGA side, otherwise the iteration is skipped. The FPGA side then updates the behavior value according to the maximum Q value and iteratively updates the agent according to the reward data, so that the action with the maximum reward is selected for each action, which improves training efficiency and accuracy. It is judged whether the current-episode end flag is 1; if so, the updated experience value is imported into the buffer, otherwise the behavior value is updated again by the maximum Q value. It is then judged whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, a set of experience values (which may be the latest experience value) is obtained from the buffer to update the action Q value of the next state; otherwise, when the number of neurons is greater than the number of network nodes, the output weight matrix is optimized and the action Q value of the next state is updated with the latest experience value. Then, when the episode counter is divisible by the step counter of the iteration cycle, the model parameters of the agent are updated. Finally, it is judged whether the step counter of the iteration cycle is less than 1; if so, the agent-updating steps are repeated at the FPGA side, otherwise the operation ends.
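A minimal sketch of this greedy selection, assuming the simplified (state, action) → scalar Q perceptron described earlier, might look as follows; the dummy Q function is purely illustrative.
    import numpy as np

    def greedy_action(mlp_q, state, n_actions):
        # mlp_q: callable (state, action) -> scalar Q value, matching the simplified
        # (state variables, action) -> Q input/output pair used by the perceptron here.
        q_values = [mlp_q(state, a) for a in range(n_actions)]
        return int(np.argmax(q_values))          # action corresponding to the maximum Q value

    # Usage with a dummy Q function (illustrative only):
    dummy_q = lambda s, a: float(np.dot(s, s)) * (a + 1)
    print(greedy_action(dummy_q, np.array([0.1, -0.2]), n_actions=3))   # prints 2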
On the basis of the foregoing embodiments, in some embodiments, referring to fig. 4, a flowchart of a reinforcement learning method according to a second specific embodiment of the present invention is shown, where the determining a current action according to the current state is:
acquiring a random action in the current state;
inputting the random action to the multilayer perceptron for prediction, the multilayer perceptron outputting an action Q value responsive to a next state;
and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
Similarly, in this embodiment, the behavior value is updated randomly, and the experience value is trained, so that the latest experience value in the buffer is selected in the subsequent updating operation to update the action Q value corresponding to the next state, thereby implementing a random updating strategy, greatly reducing the requirements on the computation resource and the storage resource of the FPGA platform, and implementing state updating on the device platform with limited computation resource and storage resource.
On the basis of the foregoing embodiments, in some embodiments, referring to fig. 5, a flowchart of a reinforcement learning method according to a third specific embodiment of the present invention is shown, where the determining a current action according to the current state is:
acquiring a random value of the reward;
judging whether the random value of the reward is smaller than a preset random initial value of an agent running at the CPU end;
if so, inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action;
otherwise, acquiring a random action in the current state, inputting the random action into the multilayer perceptron for prediction, outputting an action Q value responding to the next state by the multilayer perceptron, and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
Similarly, by combining the mode of obtaining the current action according to the maximum action Q value with the mode of obtaining the current action at random, this embodiment ensures the maximum reward for each action and also allows the action strategy to be selected according to the specific platform, thereby obtaining the advantages of both modes.
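A hedged sketch of this combined strategy follows; the threshold name and value are assumptions standing in for the preset random initial value of the agent running at the CPU side, and the exploration branch is simplified to returning the random action rather than re-evaluating it through the perceptron.
    import random
    import numpy as np

    def combined_action(mlp_q, state, n_actions, preset_threshold=0.9):
        if random.random() < preset_threshold:
            # Greedy branch: evaluate the perceptron for every candidate action in the
            # current state and keep the action with the maximum Q value.
            q_values = [mlp_q(state, a) for a in range(n_actions)]
            return int(np.argmax(q_values))
        # Otherwise: fall back to a random action in the current state (exploration).
        return random.randrange(n_actions)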
In some embodiments, the CPU initializes the input weight matrix and the output weight matrix by:
initializing an input weight matrix by adopting a random value, and keeping the input weight matrix unchanged to obtain an initialized input weight matrix;
performing initialization training on the output weight matrix;
and optimizing the output weight matrix after the initialization training to obtain the initialized output weight matrix.
In the embodiment, the input weight matrix is initialized by random values and then is kept unchanged, and only the output weight matrix is optimized, so that the calculation amount and the calculation complexity are greatly reduced.
On the basis of the above embodiments, in some embodiments, the method further comprises:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0)^(-1)
β_0 = P_0 H_0^T t_0
The output weight matrix is optimized through the following formulas:
P_i = P_{i-1} - P_{i-1} H_i^T (I + H_i P_{i-1} H_i^T)^(-1) H_i P_{i-1}
β_i = β_{i-1} + P_i H_i^T (t_i - H_i β_{i-1})
wherein β_i is the output weight matrix; P_i ∈ R^{Ñ×Ñ}; H_i ≡ G(x_i·α + b), i ≥ 0; x_i ∈ R^{k_i×n} is the i-th input data set; k_i is the batch size; t_i is the i-th m-dimensional target data; n is the number of input-layer nodes of the multilayer perceptron; m is the number of output-layer nodes of the multilayer perceptron; G is the activation function of the multilayer perceptron; α ∈ R^{n×Ñ} is the input weight matrix; Ñ is the number of hidden-layer nodes of the multilayer perceptron; b ∈ R^{Ñ} is the bias vector of the hidden layer; and I is the identity matrix. The batch size k_i is set to 1.
In this embodiment, the amount of computation is greatly reduced by optimizing the output weight matrix in order (i.e., the result of the previous training) without retraining based on the past results. Secondly, the optimization algorithm of the embodiment is not iterative, and an optimal output weight matrix can be obtained at one time without repeated iterative computation of parameters. And then, in the process of optimizing the output weight matrix, calculating the pseudo-inverse matrix by adopting the batch processing size fixed to 1 so as to eliminate the calculation of the pseudo-inverse matrix, eliminate the calculation caused by matrix singular value decomposition in optimization training and effectively reduce the use of hardware calculation resources and storage capacity.
In a preferred embodiment, the method further includes L2 regularizing the output weight matrix, and the specific steps are as follows:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0 + δI)^(-1)
β_0 = P_0 H_0^T t_0
where δ represents a regularization parameter that controls the significance of regularization.
In the embodiment, the L2 regularization is performed during the initial training process of the output weight matrix, so that the overfitting problem of the MLP is greatly reduced and the generalization capability of the MLP is improved.
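The initial training, the one-sample sequential optimization, and the δ-regularized variant described above can be sketched together in NumPy as follows; the class name, dimensions, and tanh activation are assumptions, and the update equations follow the reconstructed formulas given above.
    import numpy as np

    class SequentialOutputWeights:
        """Sketch of the sequential training scheme for the output weight matrix β."""

        def __init__(self, n_inputs, n_hidden, n_outputs, delta=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.alpha = rng.standard_normal((n_inputs, n_hidden))   # input weight matrix α (kept fixed)
            self.b = rng.standard_normal(n_hidden)                   # hidden-layer bias vector b (kept fixed)
            self.beta = np.zeros((n_hidden, n_outputs))              # output weight matrix β
            self.P = None
            self.delta = delta                                       # L2 regularization parameter δ

        def _hidden(self, x):
            return np.tanh(np.atleast_2d(x) @ self.alpha + self.b)   # H = G(x·α + b)

        def initial_train(self, x0, t0):
            # Initial training: P_0 = (H_0^T H_0 + δI)^(-1), β_0 = P_0 H_0^T t_0.
            H0 = self._hidden(x0)
            self.P = np.linalg.inv(H0.T @ H0 + self.delta * np.eye(H0.shape[1]))
            self.beta = self.P @ H0.T @ t0

        def sequential_update(self, xi, ti):
            # One-sample update (batch size k_i = 1): no pseudo-inverse or SVD is needed.
            Hi = self._hidden(xi)
            PHt = self.P @ Hi.T
            self.P = self.P - PHt @ np.linalg.inv(np.eye(Hi.shape[0]) + Hi @ PHt) @ Hi @ self.P
            self.beta = self.beta + self.P @ Hi.T @ (np.atleast_2d(ti) - Hi @ self.beta)

        def predict(self, x):
            return self._hidden(x) @ self.beta                       # y = G(x·α + b)β

    # Illustrative usage (dimensions are assumptions):
    model = SequentialOutputWeights(n_inputs=3, n_hidden=16, n_outputs=1)
    r = np.random.default_rng(1)
    model.initial_train(r.standard_normal((16, 3)), r.standard_normal((16, 1)))
    model.sequential_update(r.standard_normal(3), r.standard_normal(1))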
In some embodiments, the model parameters of the agent include a first model parameter of the agent running on the CPU side and a second model parameter of the agent running on the FPGA side, and the CPU side initializes the model parameters of the agent by:
initializing model parameters of the agent by:
L(θ_1) = E[(r_t + γ max_a Q(s_{t+1}, a; θ_2) - Q(s_t, a_t; θ_1))^2]
wherein θ_1 is the first model parameter, θ_2 is the second model parameter, and L(θ_1) is the loss function of θ_1; Q(s_t, a_t; θ_1) denotes the Q value of action a_t occurring when the state is s_t under θ_1; r_t is the reward when the state is s_t; γ is a proportional parameter that controls the importance of the next step; and Q(s_{t+1}, a; θ_2) denotes the Q value of action a occurring when the state is s_{t+1} under θ_2.
In this embodiment, a fixed target Q network is used at the FPGA side and is updated with a larger period, which improves the stability of the agent's reinforcement learning at the FPGA side and suits hardware devices with limited resources and insufficient computing power; it also avoids the instability of reinforcement learning that would be caused by θ_1 changing continuously at every time point.
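A minimal sketch of the loss computed with a separate, fixed target parameter set is given below; the callable signatures are assumptions, and the expectation is reduced to a single experience, consistent with the batch size of 1 used elsewhere.
    def q_loss_with_target(q_theta1, q_theta2, s_t, a_t, r_t, s_next, actions, gamma=0.99):
        # q_theta1: Q under the first model parameter θ1 (trained continuously);
        # q_theta2: Q under the second model parameter θ2 (fixed target, refreshed at a larger period).
        target = r_t + gamma * max(q_theta2(s_next, a) for a in actions)
        td_error = target - q_theta1(s_t, a_t)
        return td_error ** 2                     # squared error for this experience (batch size 1)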
In some embodiments, the method further comprises:
setting the Lipschitz constant of the multilayer perceptron to 1.
Specifically, in order to stably acquire the Q value of a specific action, the invention uses a regularization method from deep learning: the range of the output of the multilayer perceptron should stay within a constant multiple of its input so as to maintain stability. More specifically, assuming the input values are x_1 and x_2, the corresponding output values f(x_1) and f(x_2) should satisfy the following constraint:
‖f(x_1) - f(x_2)‖ ≤ K‖x_1 - x_2‖
where K is referred to as the Lipschitz constant. The Lipschitz constant of the multilayer perceptron is obtained from the product of the Lipschitz constants of all layers, each of which is the product of a norm of that layer's weight matrix (e.g., its maximum singular value) and the Lipschitz constant of the activation function (e.g., ReLU or tanh). For stable Q learning, this constant should be suppressed. Spectral regularization can be used to suppress the Lipschitz constant of the multilayer perceptron, in which the sum of the largest singular values of each weight matrix is added as a loss term to the loss function.
A common extension of spectral regularization is spectral normalization, in which the output of the multilayer perceptron is computed from the input data with each weight matrix divided by its maximum singular value. The invention limits the Lipschitz constant to 1 to stabilize reinforcement learning based on the multilayer perceptron.
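A simple sketch of spectral normalization is shown below, assuming a 1-Lipschitz activation such as ReLU or tanh; dividing each weight matrix by its largest singular value keeps each layer's Lipschitz constant at most 1.
    import numpy as np

    def spectral_normalize(weight, eps=1e-12):
        # Divide a weight matrix by its largest singular value so the layer's
        # Lipschitz constant (with a 1-Lipschitz activation) is at most 1.
        sigma_max = np.linalg.norm(weight, ord=2)    # largest singular value
        return weight / (sigma_max + eps)

    # Usage: normalize the perceptron's weight matrices before (or after) each update.
    w = np.random.default_rng(0).standard_normal((8, 4))
    w_sn = spectral_normalize(w)
    assert np.linalg.norm(w_sn, ord=2) <= 1.0 + 1e-9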
In some embodiments, multilayer perceptron networks tend to be unstable, especially when an unfamiliar input is fed into the network, in which case the output values may become abnormal. Since such outliers are very large and exceed the range of normal reward values, they hinder reinforcement learning. In a typical setting, the environment gives a maximum reward of 1 and a minimum reward of -1. Therefore, when the agent performs initialization training and state updating, it clips the output according to the following strategy, so that abnormal Q values can be effectively suppressed and stable reinforcement learning can be performed.
Specifically, the FPGA side further includes a step of clipping the action Q value output by the multilayer perceptron, which includes the following steps:
and clipping the action Q value output by the multilayer perceptron by the following formula:
d_t = min(max(r_t + γ max_a Q(s_{t+1}, a; θ_2), -1/(1-γ)), 1/(1-γ))
wherein r_t is the reward of this update, d_t is the value stored in the buffer for this update, γ is a proportional parameter that controls the importance of the next step, Q(s_{t+1}, a; θ_2) is the Q value under θ_2, s_{t+1} is the next state, and a is an action.
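The clipping strategy can be sketched as follows; because the exact formula comes from an image placeholder, the bounds ±1/(1-γ) (derived from rewards limited to [-1, 1]) are a reconstruction and should be treated as an assumption.
    import numpy as np

    def clipped_target(r_t, q_next_max, gamma=0.99, r_min=-1.0, r_max=1.0):
        # Bounds r_min/(1-γ) and r_max/(1-γ) follow from rewards limited to [r_min, r_max];
        # treat them as an assumption about the patent's clipping rule.
        d_t = r_t + gamma * q_next_max
        return float(np.clip(d_t, r_min / (1.0 - gamma), r_max / (1.0 - gamma)))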
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A reinforcement learning method, comprising:
initializing agent parameters at a CPU side, wherein the agent parameters comprise model parameters of an agent, and an input weight matrix and an output weight matrix of a multilayer perceptron of the agent;
the CPU side adopts the agent to interact with the environment so as to collect initial experience values and store the initial experience values into a buffer;
the CPU side transmits the initialized agent parameters to the FPGA side;
and the FPGA side takes the initialized agent parameters as initial values for training and iteratively updates the agent according to the initial experience values read from the buffer.
2. The reinforcement learning method of claim 1, wherein the FPGA side iteratively updates the agent by:
reading the initialized agent parameters from the CPU;
acquiring a current state from the environment, and determining a current action according to the current state;
outputting, with the agent, the current action to the environment to obtain a next state from the environment in response to the action, a current reward, and a current episode end marker;
organizing said current state, said current action, said next state, and said current reward into experience values to update data of said buffer when said current episode end flag indicates that a current episode is over;
judging whether the number of neurons in the multilayer perceptron is equal to the number of network nodes; if so, performing initialization training on the initialized output weight matrix, and updating an action Q value corresponding to the next state by using the current experience value stored in the buffer; if not, and when the number of neurons is greater than the number of network nodes, optimizing the initialized output weight matrix, and updating the action Q value corresponding to the next state by using the current experience value stored in the buffer;
and when the end of the current operation is detected, updating the model parameters of the agent.
3. The reinforcement learning method of claim 2, wherein the determining a current action from the current state:
acquiring a random action in the current state;
inputting the random action to the multilayer perceptron for prediction, the multilayer perceptron outputting an action Q value responsive to a next state;
and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
4. The reinforcement learning method of claim 2, wherein the determining a current action from the current state:
acquiring a random value of the reward;
judging whether the random value of the reward is smaller than a preset random initial value of an agent running at the CPU end;
if so, inputting the current state into the multilayer perceptron, and outputting an action Q value responding to the current state by the multilayer perceptron to obtain an action corresponding to the maximum value of the action Q value of the current state as the current action;
otherwise, acquiring a random action in the current state, inputting the random action into the multilayer perceptron for prediction, outputting an action Q value responding to the next state by the multilayer perceptron, and acquiring the action corresponding to the maximum value of the action Q value of the next state as the current action.
5. The reinforcement learning method of claim 1, wherein the CPU initializes the input weight matrix and the output weight matrix by:
initializing an input weight matrix by adopting a random value, and keeping the input weight matrix unchanged to obtain an initialized input weight matrix;
performing initialization training on the output weight matrix;
and optimizing the output weight matrix after the initialization training to obtain the initialized output weight matrix.
6. The reinforcement learning method according to claim 2 or 5, wherein the method further comprises:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0)^(-1)
β_0 = P_0 H_0^T t_0
the output weight matrix is optimized through the following formulas:
P_i = P_{i-1} - P_{i-1} H_i^T (I + H_i P_{i-1} H_i^T)^(-1) H_i P_{i-1}
β_i = β_{i-1} + P_i H_i^T (t_i - H_i β_{i-1})
wherein β_i is the output weight matrix; P_i ∈ R^{Ñ×Ñ}; H_i ≡ G(x_i·α + b), i ≥ 0; x_i ∈ R^{k_i×n} is the i-th input data set; k_i is the batch size; t_i is the i-th m-dimensional target data; n is the number of input-layer nodes of the multilayer perceptron; m is the number of output-layer nodes of the multilayer perceptron; G is the activation function of the multilayer perceptron; α ∈ R^{n×Ñ} is the input weight matrix; Ñ is the number of hidden-layer nodes of the multilayer perceptron; b ∈ R^{Ñ} is the bias vector of the hidden layer; and I is the identity matrix. The batch size k_i is set to 1.
7. The reinforcement learning method of claim 6, further comprising L2 regularization of the output weight matrix by the steps of:
performing initial training on the output weight matrix through the following formula:
P_0 = (H_0^T H_0 + δI)^(-1)
β_0 = P_0 H_0^T t_0
where δ represents a regularization parameter that controls the significance of regularization.
8. The reinforcement learning method of claim 6, wherein the model parameters of the agent comprise a first model parameter of the agent running on the CPU side and a second model parameter of the agent running on the FPGA side, and the CPU side initializes the model parameters of the agent by:
initializing model parameters of the agent by:
L(θ_1) = E[(r_t + γ max_a Q(s_{t+1}, a; θ_2) - Q(s_t, a_t; θ_1))^2]
wherein θ_1 is the first model parameter, θ_2 is the second model parameter, and L(θ_1) is the loss function of θ_1; Q(s_t, a_t; θ_1) denotes the Q value of action a_t occurring when the state is s_t under θ_1; r_t is the reward when the state is s_t; γ is a proportional parameter that controls the importance of the next step; and Q(s_{t+1}, a; θ_2) denotes the Q value of action a occurring when the state is s_{t+1} under θ_2.
9. The reinforcement learning method of claim 5, wherein the method further comprises:
setting the Lipschitz constant of the multilayer perceptron to 1.
10. The reinforcement learning method of claim 8, wherein the FPGA further comprises a step of clipping an action Q value output by the multi-layer perceptron, the specific steps being as follows:
and clipping the action Q value output by the multilayer perceptron by the following formula:
d_t = min(max(r_t + γ max_a Q(s_{t+1}, a; θ_2), -1/(1-γ)), 1/(1-γ))
wherein r_t is the reward of this update, d_t is the value stored in the buffer for this update, γ is a proportional parameter that controls the importance of the next step, Q(s_{t+1}, a; θ_2) is the Q value under θ_2, s_{t+1} is the next state, and a is an action.
CN202110101401.6A 2021-01-26 2021-01-26 Reinforced learning method Pending CN112734048A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110101401.6A CN112734048A (en) 2021-01-26 2021-01-26 Reinforced learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110101401.6A CN112734048A (en) 2021-01-26 2021-01-26 Reinforced learning method

Publications (1)

Publication Number Publication Date
CN112734048A true CN112734048A (en) 2021-04-30

Family

ID=75595326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110101401.6A Pending CN112734048A (en) 2021-01-26 2021-01-26 Reinforced learning method

Country Status (1)

Country Link
CN (1) CN112734048A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
CN111582311A (en) * 2020-04-09 2020-08-25 华南理工大学 Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN111832723A (en) * 2020-07-02 2020-10-27 四川大学 Multi-target neural network-based reinforcement learning value function updating method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
CN111582311A (en) * 2020-04-09 2020-08-25 华南理工大学 Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN111832723A (en) * 2020-07-02 2020-10-27 四川大学 Multi-target neural network-based reinforcement learning value function updating method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROHISA WATANABE et al.: "AN FPGA-BASED ON-DEVICE REINFORCEMENT LEARNING APPROACH USING ONLINE SEQUENTIAL LEARNING", arXiv *

Similar Documents

Publication Publication Date Title
US10992541B2 (en) Methods and apparatus for communication network
CN111556461A (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
WO2018112699A1 (en) Artificial neural network reverse training device and method
EP3889846A1 (en) Deep learning model training method and system
KR102490060B1 (en) Method for restoring a masked face image by using the neural network model
US20180293486A1 (en) Conditional graph execution based on prior simplified graph execution
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
Özalp et al. A review of deep reinforcement learning algorithms and comparative results on inverted pendulum system
Chen et al. Computing offloading decision based on DDPG algorithm in mobile edge computing
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN114449584A (en) Distributed computing unloading method and device based on deep reinforcement learning
CN114090108A (en) Computing task execution method and device, electronic equipment and storage medium
Zhao et al. Offline supervised learning vs online direct policy optimization: A comparative study and a unified training paradigm for neural network-based optimal feedback control
KR102471514B1 (en) Method for overcoming catastrophic forgetting by neuron-level plasticity control and computing system performing the same
CN112734048A (en) Reinforced learning method
CN115220818A (en) Real-time dependency task unloading method based on deep reinforcement learning
CN113721655B (en) Control period self-adaptive reinforcement learning unmanned aerial vehicle stable flight control method
CN114138493A (en) Edge computing power resource scheduling method based on energy consumption perception
US20230214725A1 (en) Method and apparatus for multiple reinforcement learning agents in a shared environment
KR102546176B1 (en) Post-processing method for video stability
Antony et al. Q-Learning: Solutions for Grid World Problem with Forward and Backward Reward Propagations
US11715036B2 (en) Updating weight values in a machine learning system
Seth et al. A scalable species-based genetic algorithm for reinforcement learning problems
KR102570771B1 (en) Method for parameter adjustment of reinforcement learning algorithm
US20220147821A1 (en) Computing device, computer system, and computing method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination