CN115983358A - Hardware implementation method of Bellman equation based on strategy iteration


Info

Publication number
CN115983358A
Authority
CN
China
Prior art keywords
value
equation
strategy
bellman
memristor
Prior art date
Legal status
Pending
Application number
CN202310055769.2A
Other languages
Chinese (zh)
Inventor
朱云来
郭文斌
冯哲
吴祖恒
徐祖雨
代月花
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Publication of CN115983358A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a hardware implementation method of a Bellman equation based on strategy iteration. The method first inputs a reward value into a Bellman expectation equation circuit and solves for the strategy value of that reward value; the obtained strategy value is then input into a Bellman optimal equation circuit for strategy-iteration solving to obtain the optimal value; finally, the obtained optimal value is mapped onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and the moving direction of each state is determined according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation. The method implements the Bellman equation in hardware through the multiply-add operation of a memristor array, greatly optimizing the performance of a reinforcement learning hardware system.

Description

Hardware implementation method of Bellman equation based on strategy iteration
Technical Field
The invention relates to the technical field of memristors, in particular to a hardware implementation method of a Bellman equation based on strategy iteration.
Background
Reinforcement learning is a branch of machine learning that aims to learn a maximum-reward strategy through interaction between an agent and its environment. With the rapid development of artificial intelligence technology, reinforcement learning has broad application prospects in as many as 12 fields. Compared with unsupervised and supervised learning, reinforcement learning demands more computing power, and therefore requires a fully hardware implementation and comprehensive optimization of hardware accelerators. To date, many memristor-based reinforcement learning acceleration systems have appeared. For example, the classical reinforcement learning algorithm DQN has been implemented with a memristor array, but that work covers only the neural network part of the Deep Q-Network (DQN) algorithm, while the strategy solving of reinforcement learning is still performed on a CPU; another approach uses phase-change memory (PCM) to implement eligibility traces, but that system merely generates the strategy map on the PCM array, and the main strategy-iteration computation is still performed on a computer.
In summary, using a memristor array as a reinforcement learning accelerator improves computational efficiency to a certain extent, but most of the computation is still carried out on a CPU (central processing unit). In particular, for the Bellman expectation equation, a cornerstone of reinforcement learning, no corresponding hardware circuit has been found after investigation. During the trial-and-error exploration of reinforcement learning, strategy-iteration computation occupies most of the resources; how to realize strategy iteration in hardware using the matrix dot product of a memristor array and complete the solution of the optimal strategy is the problem that must be solved for a fully hardware-implemented reinforcement learning system.
Disclosure of Invention
The invention aims to provide a hardware implementation method of a Bellman equation based on strategy iteration, which implements the Bellman equation in hardware through the multiply-add operation of a memristor array, thereby greatly optimizing the performance of a reinforcement learning hardware system.
The purpose of the invention is realized by the following technical scheme:
A hardware implementation method of a Bellman equation based on strategy iteration, the method comprising:
Step 1, inputting a reward value into a Bellman expectation equation circuit, and solving for the strategy value of the reward value;
Step 2, inputting the strategy value obtained in step 1 into a Bellman optimal equation circuit for strategy-iteration solving, and obtaining the optimal value;
Step 3, mapping the optimal value obtained in step 2 onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and determining the moving direction of each state according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation.
With the technical solution provided by the invention, the Bellman equation is implemented in hardware through the multiply-add operation of a memristor array, greatly optimizing the performance of a reinforcement learning hardware system.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a hardware implementation method of a Bellman equation based on strategy iteration according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an iterative solution transformation process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of the Bellman expectation equation circuit according to an embodiment of the invention;
FIG. 4 is a schematic diagram of an implementation of the Bellman optimal equation circuit according to an embodiment of the invention;
Fig. 5 is a circuit diagram of the full hardware implementation of the Bellman equation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention, and they do not limit the present invention. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a hardware implementation method of a Bellman equation based on strategy iteration according to an embodiment of the present invention, where the method includes:
Step 1, inputting a reward value into a Bellman expectation equation circuit to obtain the strategy value of the reward value;
In this step, because the Bellman expectation equation is a fixed-point equation in expectation form and does not naturally have a matrix multiply-add structure, the embodiment of the present invention optimizes and derives the structure of the Bellman equation: using the fixed-point principle, the expectation form of the Bellman expectation equation is converted into an iterative solution in matrix multiply-add form, as shown in Fig. 2, a schematic diagram of the iterative solution conversion process according to an embodiment of the present invention:
The original Bellman equation is expressed as V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′), where R(s) is the input reward value, γ Σ_{s′∈S} P(s′|s) V(s′) is the discounted sum of future rewards, γ is the discount factor, and P(s′|s) is the transition matrix from the current state to the next state;
The Bellman equation thus expresses the value function of the current state in terms of the value function of the next state; writing it in matrix form and solving yields the iterative analytic solution V = (I − γP)⁻¹ R;
Based on this iterative analytic solution, the two parts on the right-hand side of the equation can each be replaced by matrix multiplication on the memristor array, thereby realizing the Bellman equation in hardware.
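For clarity, the following is a minimal NumPy sketch of the two equivalent solution forms derived above: the closed form V = (I − γP)⁻¹R and the fixed-point iteration V ← R + γPV that the memristor array realizes as multiply-adds. The transition matrix, reward vector, and γ = 0.5 are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

gamma = 0.5                                   # discount factor (assumed)
P = np.array([[0.7, 0.3, 0.0, 0.0],           # hypothetical transition matrix P(s'|s)
              [0.1, 0.6, 0.3, 0.0],
              [0.0, 0.2, 0.5, 0.3],
              [0.0, 0.0, 0.4, 0.6]])
R = np.array([1.0, 0.0, 0.0, 2.0])            # hypothetical reward values R(s)

# Closed-form analytic solution: V = (I - gamma * P)^(-1) R
V_direct = np.linalg.solve(np.eye(4) - gamma * P, R)

# Fixed-point iteration V <- R + gamma * P @ V, the matrix multiply-add
# form that the memristor crossbar evaluates in the analog domain.
V = np.zeros(4)
while True:
    V_next = R + gamma * P @ V
    if np.max(np.abs(V_next - V)) < 1e-6:     # output no longer changing
        break
    V = V_next

assert np.allclose(V, V_direct, atol=1e-4)    # both forms agree
```

Both forms converge to the same strategy values; the iterative form suits hardware because it needs only repeated multiply-adds rather than a matrix inversion.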
The memristor array is formed by stacking single crossbar memristors in a two-dimensional row-column grid; one memristor is located at each cross point of the array, and each memristor has a tunable conductance value. When a voltage is applied to a row of the array, each memristor in that row produces the current of its cross point by multiplying the input voltage by its conductance; when multiple rows are driven, each column outputs the sum of the currents of all memristors in that column according to Kirchhoff's current law, thereby achieving matrix multiply-add.
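The short sketch below expresses this crossbar multiply-add in NumPy terms: Ohm's law gives each device a current I = G·V, and Kirchhoff's current law sums the currents per column, so one analog read equals one matrix-vector product. The array size and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(0.1, 1.0, size=(4, 4))    # conductance of each cross point (assumed)
V_in = np.array([0.2, 0.5, 0.1, 0.3])     # voltages driven onto the four rows

# Column j output current: I_j = sum_i V_i * G[i, j]  (Ohm + Kirchhoff)
I_out = V_in @ G                          # one analog matrix-vector multiply
```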
In this embodiment, the memristor array uses a 1T1R structure, i.e., one transistor integrated with one memristor, with the drain of the transistor connected to the top electrode of the memristor. The transistor switch is controlled by applying a voltage to the gate, which suppresses current crosstalk between different rows and columns, thereby reducing read errors and improving the recognition accuracy of the hardware system.
Fig. 3 is a schematic diagram of an implementation of the Bellman expectation equation circuit according to an embodiment of the present invention, where the input voltage signal represents the reward value R(s) input to the system; the conductance value of each memristor in the memristor array represents a state transition probability P(s′|s), and the inference and mapping of the memristor array are carried out by a peripheral FPGA board circuit; the output current is converted into a voltage signal by a constant-resistance array and represents the output value, i.e., the strategy value of the input reward value.
In a specific implementation, when a row of parallel level signals is fed into the memristor array as reward values, the peripheral circuit of the array first enables the first column; the input level signals then undergo a vector multiply-add operation, via Ohm's law and Kirchhoff's law, with the first column of nonvolatile devices written with probability values (0.1-1), and the current of the first column is output. This current is converted into a voltage signal, attenuated by 50%, and, under peripheral-circuit control, fed back into the array together with the reward level signals as the input of the first row, while the array is switched to output the current signal of the second column, completing one cycle of the operation;
After traversing all the columns, observe whether the output current has stopped changing; if it no longer changes, take the output current value as the value obtained in the first epoch, and update the weights in the array, i.e., the strategy value of the whole system, according to this value.
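The following is a behavioural sketch of this circuit loop, under the (assumed) reading that the 50% attenuation of the fed-back voltage realizes the discount factor γ = 0.5 and that the stopping condition corresponds to the output current no longer changing. It emulates the described dataflow; it is not the patent's circuit itself.

```python
import numpy as np

def expectation_circuit(R, P, attenuation=0.5, tol=1e-6, max_epochs=1000):
    """Emulate the Bellman expectation equation circuit: drive the rows
    with the reward signals plus the attenuated fed-back value signal,
    read out the column currents, and stop once they no longer change."""
    V = np.zeros_like(R)
    for _ in range(max_epochs):
        I_out = R + P @ (attenuation * V)    # one full column traversal
        if np.max(np.abs(I_out - V)) < tol:  # output current stopped changing
            return I_out                     # strategy value of this epoch
        V = I_out
    return V
```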
Step 2, inputting the strategy value obtained in step 1 into a Bellman optimal equation circuit for strategy-iteration solving, and obtaining the optimal value;
In this step, the Bellman optimal equation circuit repeatedly recurses on the strategy value obtained from the Bellman expectation equation circuit, updates the value probability matrix using a greedy algorithm, and repeatedly solves for the strategy value until it converges to a definite value, i.e., the optimal value, as shown in Fig. 4, a schematic diagram of an implementation of the Bellman optimal equation circuit according to an embodiment of the present invention;
the circuit input voltage signal is a strategy value matrix calculated by a Bellman expectation equation circuit; the method comprises the steps that an array conductance value in a memristor array represents a value probability transfer matrix, corresponding voltage values are output after matrix multiplication and addition operation of the memristor array, then input values are returned again for repeated recursion operation, and a value probability matrix is updated through an algorithm to output new optimized values; the output value gradually tends to be stable after repeated iteration according to a fixed point iteration method, and finally the maximum value is obtained by using a winner eating-all circuit, namely the optimal value.
Step 3, mapping the optimal value obtained in step 2 onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and determining the moving direction of each state according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation.
Fig. 5 is a circuit diagram of the full hardware implementation of the Bellman equation according to an embodiment of the present invention. The implementation consists of three layers of array circuits: the first and second layers are memristor arrays of size 25 × 25, whose main task is to take as input the strategy values computed from the reward values and to iterate until the values converge;
The output current values are then converted into voltage signals and fed into the third-layer memristor array of size 25 × 100, where a matrix multiply-add operation is performed via Ohm's law and Kirchhoff's law, and the optimal values are finally output and returned to the host computer;
The conductance values of the three-layer array are updated with the final output values, and the computation of the whole circuit is completed after 6 update operations.
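The sketch below mirrors this three-layer dataflow in software: two 25 × 25 stages iterate the values to convergence, then a 25 × 100 stage expands them into the final outputs. The conductance matrices are random column-normalized placeholders chosen only so the iteration contracts; the real circuit writes trained probability values and performs the 6 conductance updates, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar(rows, cols):
    """Hypothetical crossbar: column-normalized random conductances."""
    G = rng.uniform(0.1, 1.0, (rows, cols))
    return G / G.sum(axis=0, keepdims=True)

G1, G2 = crossbar(25, 25), crossbar(25, 25)   # layers 1-2: iterative stages
G3 = crossbar(25, 100)                        # layer 3: 25 x 100 output stage
gamma = 0.5
rewards = rng.uniform(0.0, 1.0, 25)           # hypothetical reward inputs

V = np.zeros(25)
for _ in range(1000):                         # iterate layers 1-2 to convergence
    V_next = rewards + gamma * (V @ G1) @ G2
    if np.max(np.abs(V_next - V)) < 1e-6:
        break
    V = V_next

optimal = V @ G3                              # layer 3 outputs the final values
```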
It is noted that what is not described in detail in the embodiments of the present invention is well known to those skilled in the art.
In summary, the Bellman equation is one of the important cornerstones of reinforcement learning and occupies a large proportion of the system's running time. By applying matrix transformation and iteration to the Bellman equation, an analytic solution can be obtained; this solution exploits the memristor array's inherent acceleration of matrix multiply-add to solve for the optimal strategy of the Bellman equation, and implementing it in hardware with a memristor array can greatly optimize the performance of a reinforcement learning hardware system.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Claims (4)

1. A hardware implementation method of a Bellman equation based on strategy iteration is characterized by comprising the following steps:
Step 1, inputting a reward value into a Bellman expectation equation circuit to obtain the strategy value of the reward value;
Step 2, inputting the strategy value obtained in step 1 into a Bellman optimal equation circuit for strategy-iteration solving, and obtaining the optimal value;
Step 3, mapping the optimal value obtained in step 2 onto a strategy map composed of memristor arrays, completing the optimal-value solution for each state, and determining the moving direction of each state according to the magnitude of the optimal value, thereby solving the optimal value with a hardware-accelerated Bellman equation.
2. The hardware implementation method of the Bellman equation based on strategy iteration according to claim 1, wherein in step 1, the Bellman expectation equation circuit converts the expectation form of the Bellman expectation equation into an iterative solution in matrix multiply-add form using the fixed-point principle;
The original Bellman equation is expressed as V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′), where R(s) is the input reward value, γ Σ_{s′∈S} P(s′|s) V(s′) is the discounted sum of future rewards, γ is the discount factor, and P(s′|s) is the transition matrix from the current state to the next state;
The Bellman equation thus expresses the value function of the current state in terms of the value function of the next state; writing it in matrix form and solving yields the iterative analytic solution V = (I − γP)⁻¹ R;
Based on the iterative analytic solution, the left and right parts of the equation are respectively replaced by matrix multiplication on the memristor array, thereby realizing the Bellman equation in hardware;
wherein the input voltage signal represents the reward value R(s) input to the system; the conductance value of each memristor in the memristor array represents a state transition probability P(s′|s), and the inference and mapping of the memristor array are carried out by a peripheral FPGA board circuit; the output current is converted into a voltage signal by a constant-resistance array and represents the output value, i.e., the strategy value of the input reward value.
3. The hardware implementation method of the Bellman equation based on strategy iteration according to claim 2, wherein:
the memristor array is formed by stacking single crossbar memristors in a two-dimensional row-column grid, one memristor is provided at each cross point of the array, and each memristor has a tunable conductance value;
when voltages are input to the memristor array row by row, each memristor in the same row produces the current of its cross point by multiplying the input voltage by its conductance; when the array receives multi-row input, each column outputs the sum of the currents of all memristors in that column according to Kirchhoff's current law, thereby achieving matrix multiply-add.
4. The hardware implementation method of the Bellman equation based on strategy iteration according to claim 1, wherein in step 2, the Bellman optimal equation circuit repeatedly recurses on the strategy value obtained by the Bellman expectation equation circuit, updates the value probability matrix using a greedy algorithm, and repeatedly solves for the strategy value until it converges to a definite value, i.e., the optimal value;
the circuit's input voltage signal is the strategy value matrix computed by the Bellman expectation equation circuit; the conductance values in the memristor array represent the value probability transfer matrix. After the matrix multiply-add operation of the memristor array, the corresponding voltage values are output and then fed back as inputs for repeated recursion, and the value probability matrix is updated by the algorithm to output new optimized values; following the fixed-point iteration method, the output values gradually stabilize after repeated iterations, and finally a winner-take-all circuit selects the maximum value, i.e., the optimal value.
CN202310055769.2A 2022-07-12 2023-01-18 Hardware implementation method of Bellman equation based on strategy iteration Pending CN115983358A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210817242 2022-07-12
CN2022108172424 2022-07-12

Publications (1)

Publication Number Publication Date
CN115983358A true CN115983358A (en) 2023-04-18

Family

ID=85960966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310055769.2A Pending CN115983358A (en) 2022-07-12 2023-01-18 Hardware implementation method of Bellman equation based on strategy iteration

Country Status (1)

Country Link
CN (1) CN115983358A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination