CN115983358A - Hardware implementation method of Bellman equation based on strategy iteration


Info

Publication number
CN115983358A
Authority
CN
China
Prior art keywords
value
equation
strategy
bellman
memristor
Prior art date
Legal status
Pending
Application number
CN202310055769.2A
Other languages
Chinese (zh)
Inventor
朱云来
郭文斌
冯哲
吴祖恒
徐祖雨
代月花
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Publication of CN115983358A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a hardware implementation method of a Bellman equation based on strategy iteration. The method first inputs a reward value into a Bellman expectation equation circuit and solves for the strategy value of that reward value; the obtained strategy value is then input into a Bellman optimal equation circuit for strategy-iteration solving to obtain the optimal value; finally, the obtained optimal value is mapped onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and the moving direction of each state is determined according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation. The method implements the Bellman equation in hardware through the multiply-add operation of a memristor array, greatly optimizing the performance of a reinforcement learning hardware system.

Description

Hardware implementation method of Bellman equation based on strategy iteration
Technical Field
The invention relates to the technical field of memristors, in particular to a hardware implementation method of a Bellman equation based on strategy iteration.
Background
Reinforcement learning is a branch of machine learning that aims to learn a maximum-reward strategy through interaction between an agent and its environment. With the rapid development of artificial intelligence technology, reinforcement learning has broad application prospects in as many as 12 fields. Compared with unsupervised and supervised learning, reinforcement learning demands more computing power, and therefore requires a fully hardware implementation and comprehensive optimization of hardware accelerators. To date, many memristor-based reinforcement learning acceleration systems have appeared. For example, the classical reinforcement learning algorithm DQN has been implemented with a memristor array, but that work covers only the neural network part of the Deep Q-Network (DQN) algorithm, while the strategy solving of reinforcement learning is still performed on a CPU; another approach uses phase-change memory (PCM) to implement eligibility traces, but that system merely generates the strategy map on the PCM array, and the main strategy-iteration computation is still performed on a computer.
In summary, using a memristor array as a reinforcement learning accelerator improves computational efficiency to a certain extent, but most of the computation is still carried out on a CPU (central processing unit). In particular, for the Bellman expectation equation, a cornerstone of reinforcement learning, no corresponding hardware circuit has been found after investigation. During the trial-and-error exploration of reinforcement learning, strategy-iteration computation occupies most of the resources; how to realize strategy iteration in hardware using the matrix dot product of a memristor array and complete the solution of the optimal strategy is the problem that must be solved for a fully hardware-implemented reinforcement learning system.
Disclosure of Invention
The invention aims to provide a hardware implementation method of a Bellman equation based on strategy iteration, which implements the Bellman equation in hardware through the multiply-add operation of a memristor array, thereby greatly optimizing the performance of a reinforcement learning hardware system.
The purpose of the invention is realized by the following technical scheme:
A hardware implementation method of a Bellman equation based on strategy iteration, the method comprising:
Step 1, inputting a reward value into a Bellman expectation equation circuit, and solving for the strategy value of the reward value;
Step 2, inputting the strategy value obtained in step 1 into a Bellman optimal equation circuit for strategy-iteration solving, and obtaining the optimal value;
Step 3, mapping the optimal value obtained in step 2 onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and determining the moving direction of each state according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation.
With the technical solution provided by the invention, the Bellman equation is implemented in hardware through the multiply-add operation of a memristor array, greatly optimizing the performance of a reinforcement learning hardware system.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a hardware implementation method of a Bellman equation based on strategy iteration according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an iterative solution transformation process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of the Bellman expectation equation circuit according to an embodiment of the invention;
FIG. 4 is a schematic diagram of an implementation of the Bellman optimal equation circuit according to an embodiment of the invention;
Fig. 5 is a circuit diagram of the full hardware implementation of the Bellman equation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention, and they do not limit the present invention. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a hardware implementation method of a Bellman equation based on strategy iteration according to an embodiment of the present invention, where the method includes:
Step 1, inputting a reward value into a Bellman expectation equation circuit to obtain the strategy value of the reward value;
In this step, because the Bellman expectation equation is a fixed-point equation in expectation form and does not naturally have a matrix multiply-add structure, the embodiment of the present invention optimizes and derives the structure of the Bellman equation: using the fixed-point principle, the expectation form of the Bellman expectation equation is converted into an iterative solution in matrix multiply-add form, as shown in Fig. 2, a schematic diagram of the iterative solution conversion process according to an embodiment of the present invention:
The original Bellman equation is expressed as V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′), where R(s) is the input reward value, γ Σ_{s′∈S} P(s′|s) V(s′) is the discounted sum of future rewards, γ is the discount factor, and P(s′|s) is the transition matrix from the current state to the next state;
The Bellman equation thus expresses the value function of the current state in terms of the value function of the next state; writing it in matrix form and solving yields the iterative analytic solution V = (I − γP)⁻¹ R;
Based on this iterative analytic solution, the two parts on the right-hand side of the equation can each be replaced by matrix multiplication on the memristor array, thereby realizing the Bellman equation in hardware.
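For clarity, the following is a minimal NumPy sketch of the two equivalent solution forms derived above: the closed form V = (I − γP)⁻¹R and the fixed-point iteration V ← R + γPV that the memristor array realizes as multiply-adds. The transition matrix, reward vector, and γ = 0.5 are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

gamma = 0.5                                   # discount factor (assumed)
P = np.array([[0.7, 0.3, 0.0, 0.0],           # hypothetical transition matrix P(s'|s)
              [0.1, 0.6, 0.3, 0.0],
              [0.0, 0.2, 0.5, 0.3],
              [0.0, 0.0, 0.4, 0.6]])
R = np.array([1.0, 0.0, 0.0, 2.0])            # hypothetical reward values R(s)

# Closed-form analytic solution: V = (I - gamma * P)^(-1) R
V_direct = np.linalg.solve(np.eye(4) - gamma * P, R)

# Fixed-point iteration V <- R + gamma * P @ V, the matrix multiply-add
# form that the memristor crossbar evaluates in the analog domain.
V = np.zeros(4)
while True:
    V_next = R + gamma * P @ V
    if np.max(np.abs(V_next - V)) < 1e-6:     # output no longer changing
        break
    V = V_next

assert np.allclose(V, V_direct, atol=1e-4)    # both forms agree
```

Both forms converge to the same strategy values; the iterative form suits hardware because it needs only repeated multiply-adds rather than a matrix inversion.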
The memristor array is formed by stacking single crossbar memristors in a two-dimensional row-column grid; one memristor is located at each cross point of the array, and each memristor has a tunable conductance value. When a voltage is applied to a row of the array, each memristor in that row produces the current of its cross point by multiplying the input voltage by its conductance; when multiple rows are driven, each column outputs the sum of the currents of all memristors in that column according to Kirchhoff's current law, thereby achieving matrix multiply-add.
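The short sketch below expresses this crossbar multiply-add in NumPy terms: Ohm's law gives each device a current I = G·V, and Kirchhoff's current law sums the currents per column, so one analog read equals one matrix-vector product. The array size and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(0.1, 1.0, size=(4, 4))    # conductance of each cross point (assumed)
V_in = np.array([0.2, 0.5, 0.1, 0.3])     # voltages driven onto the four rows

# Column j output current: I_j = sum_i V_i * G[i, j]  (Ohm + Kirchhoff)
I_out = V_in @ G                          # one analog matrix-vector multiply
```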
In this embodiment, the memristor array uses a 1T1R structure, i.e., one transistor integrated with one memristor, with the drain of the transistor connected to the top electrode of the memristor. The transistor switch is controlled by applying a voltage to the gate, which suppresses current crosstalk between different rows and columns, thereby reducing read errors and improving the recognition accuracy of the hardware system.
Fig. 3 is a schematic diagram of an implementation of the Bellman expectation equation circuit according to an embodiment of the present invention, where the input voltage signal represents the reward value R(s) input to the system; the conductance value of each memristor in the memristor array represents a state transition probability P(s′|s), and the inference and mapping of the memristor array are carried out by a peripheral FPGA board circuit; the output current is converted into a voltage signal by a constant-resistance array and represents the output value, i.e., the strategy value of the input reward value.
In a specific implementation, when a row of parallel level signals is fed into the memristor array as reward values, the peripheral circuit of the array first enables the first column; the input level signals then undergo a vector multiply-add operation, via Ohm's law and Kirchhoff's law, with the first column of nonvolatile devices written with probability values (0.1-1), and the current of the first column is output. This current is converted into a voltage signal, attenuated by 50%, and, under peripheral-circuit control, fed back into the array together with the reward level signals as the input of the first row, while the array is switched to output the current signal of the second column, completing one cycle of the operation;
After traversing all the columns, observe whether the output current has stopped changing; if it no longer changes, take the output current value as the value obtained in the first epoch, and update the weights in the array, i.e., the strategy value of the whole system, according to this value.
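The following is a behavioural sketch of this circuit loop, under the (assumed) reading that the 50% attenuation of the fed-back voltage realizes the discount factor γ = 0.5 and that the stopping condition corresponds to the output current no longer changing. It emulates the described dataflow; it is not the patent's circuit itself.

```python
import numpy as np

def expectation_circuit(R, P, attenuation=0.5, tol=1e-6, max_epochs=1000):
    """Emulate the Bellman expectation equation circuit: drive the rows
    with the reward signals plus the attenuated fed-back value signal,
    read out the column currents, and stop once they no longer change."""
    V = np.zeros_like(R)
    for _ in range(max_epochs):
        I_out = R + P @ (attenuation * V)    # one full column traversal
        if np.max(np.abs(I_out - V)) < tol:  # output current stopped changing
            return I_out                     # strategy value of this epoch
        V = I_out
    return V
```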
Step 2, inputting the strategy value obtained in step 1 into a Bellman optimal equation circuit for strategy-iteration solving, and obtaining the optimal value;
In this step, the Bellman optimal equation circuit repeatedly recurses on the strategy value obtained from the Bellman expectation equation circuit, updates the value probability matrix using a greedy algorithm, and repeatedly solves for the strategy value until it converges to a definite value, i.e., the optimal value, as shown in Fig. 4, a schematic diagram of an implementation of the Bellman optimal equation circuit according to an embodiment of the present invention;
the circuit input voltage signal is a strategy value matrix calculated by a Bellman expectation equation circuit; the method comprises the steps that an array conductance value in a memristor array represents a value probability transfer matrix, corresponding voltage values are output after matrix multiplication and addition operation of the memristor array, then input values are returned again for repeated recursion operation, and a value probability matrix is updated through an algorithm to output new optimized values; the output value gradually tends to be stable after repeated iteration according to a fixed point iteration method, and finally the maximum value is obtained by using a winner eating-all circuit, namely the optimal value.
Step 3, mapping the optimal value obtained in step 2 onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and determining the moving direction of each state according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation.
Fig. 5 is a circuit diagram of the full hardware implementation of the Bellman equation according to an embodiment of the present invention. The implementation consists of three layers of array circuits: the first and second layers are memristor arrays of size 25 × 25, whose main task is to take as input the strategy values computed from the reward values and to iterate until the values converge;
The output current values are then converted into voltage signals and fed into the third-layer memristor array of size 25 × 100, where a matrix multiply-add operation is performed via Ohm's law and Kirchhoff's law, and the optimal values are finally output and returned to the host computer;
The conductance values of the three-layer array are updated with the final output values, and the computation of the whole circuit is completed after 6 update operations.
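The sketch below mirrors this three-layer dataflow in software: two 25 × 25 stages iterate the values to convergence, then a 25 × 100 stage expands them into the final outputs. The conductance matrices are random column-normalized placeholders chosen only so the iteration contracts; the real circuit writes trained probability values and performs the 6 conductance updates, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar(rows, cols):
    """Hypothetical crossbar: column-normalized random conductances."""
    G = rng.uniform(0.1, 1.0, (rows, cols))
    return G / G.sum(axis=0, keepdims=True)

G1, G2 = crossbar(25, 25), crossbar(25, 25)   # layers 1-2: iterative stages
G3 = crossbar(25, 100)                        # layer 3: 25 x 100 output stage
gamma = 0.5
rewards = rng.uniform(0.0, 1.0, 25)           # hypothetical reward inputs

V = np.zeros(25)
for _ in range(1000):                         # iterate layers 1-2 to convergence
    V_next = rewards + gamma * (V @ G1) @ G2
    if np.max(np.abs(V_next - V)) < 1e-6:
        break
    V = V_next

optimal = V @ G3                              # layer 3 outputs the final values
```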
It is noted that what is not described in detail in the embodiments of the present invention is well known to those skilled in the art.
In summary, the Bellman equation is one of the important cornerstones of reinforcement learning and occupies a large proportion of the system's running time. By applying matrix transformation and iteration to the Bellman equation, an analytic solution can be obtained; this solution exploits the memristor array's inherent acceleration of matrix multiply-add to solve for the optimal strategy of the Bellman equation, and implementing it in hardware with a memristor array can greatly optimize the performance of a reinforcement learning hardware system.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Claims (4)

1. A hardware implementation method of a Bellman equation based on strategy iteration is characterized by comprising the following steps:
Step 1, inputting a reward value into a Bellman expectation equation circuit to obtain the strategy value of the reward value;
Step 2, inputting the strategy value obtained in step 1 into a Bellman optimal equation circuit for strategy-iteration solving, and obtaining the optimal value;
Step 3, mapping the optimal value obtained in step 2 onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and determining the moving direction of each state according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation.
2. The hardware implementation method of the Bellman equation based on strategy iteration according to claim 1, wherein in step 1, the Bellman expectation equation circuit converts the expectation form of the Bellman expectation equation into an iterative solution in matrix multiply-add form using the fixed-point principle;
The original Bellman equation is expressed as V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′), where R(s) is the input reward value, γ Σ_{s′∈S} P(s′|s) V(s′) is the discounted sum of future rewards, γ is the discount factor, and P(s′|s) is the transition matrix from the current state to the next state;
The Bellman equation thus expresses the value function of the current state in terms of the value function of the next state; writing it in matrix form and solving yields the iterative analytic solution V = (I − γP)⁻¹ R;
Based on the iterative analytic solution, the left and right parts of the equation are respectively replaced by matrix multiplication on the memristor array, thereby realizing the Bellman equation in hardware;
wherein the input voltage signal represents the reward value R(s) input to the system; the conductance value of each memristor in the memristor array represents a state transition probability P(s′|s), and the inference and mapping of the memristor array are carried out by a peripheral FPGA board circuit; the output current is converted into a voltage signal by a constant-resistance array and represents the output value, i.e., the strategy value of the input reward value.
3. The hardware implementation method of the Bellman equation based on strategy iteration according to claim 2, wherein:
the memristor array is formed by stacking single crossbar memristors in a two-dimensional row-column grid, one memristor is provided at each cross point of the array, and each memristor has a tunable conductance value;
when voltages are input to the memristor array row by row, each memristor in the same row produces the current of its cross point by multiplying the input voltage by its conductance; when the array receives multi-row input, each column outputs the sum of the currents of all memristors in that column according to Kirchhoff's current law, thereby achieving matrix multiply-add.
4. The hardware implementation method of the Bellman equation based on strategy iteration according to claim 1, wherein in step 2, the Bellman optimal equation circuit repeatedly recurses on the strategy value obtained by the Bellman expectation equation circuit, updates the value probability matrix using a greedy algorithm, and repeatedly solves for the strategy value until it converges to a definite value, i.e., the optimal value;
the circuit's input voltage signal is the strategy value matrix computed by the Bellman expectation equation circuit; the conductance values in the memristor array represent the value probability transfer matrix. After the matrix multiply-add operation of the memristor array, the corresponding voltage values are output and then fed back as inputs for repeated recursion, and the value probability matrix is updated by the algorithm to output new optimized values; following the fixed-point iteration method, the output values gradually stabilize after repeated iterations, and finally a winner-take-all circuit selects the maximum value, i.e., the optimal value.
CN202310055769.2A 2022-07-12 2023-01-18 Hardware implementation method of Bellman equation based on strategy iteration Pending CN115983358A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210817242 2022-07-12
CN2022108172424 2022-07-12

Publications (1)

Publication Number Publication Date
CN115983358A true CN115983358A (en) 2023-04-18

Family

ID=85960966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310055769.2A Pending CN115983358A (en) 2022-07-12 2023-01-18 Hardware implementation method of Bellman equation based on strategy iteration

Country Status (1)

Country Link
CN (1) CN115983358A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination