Background
Reinforcement learning algorithms have attracted wide attention from researchers in recent years because of their strong performance. Such an algorithm generates a problem-solving policy from the rewards and penalties returned by the task environment, and the policy, optimized over many rounds of iteration, can complete complex tasks in many fields without external guidance or supervision. Continuously optimized reinforcement learning algorithms can reach performance close to, or even exceeding, the human level in fields such as autonomous driving and competitive games. This capability depends on an effective and widely used mechanism, the eligibility trace. The eligibility trace records, with decay over time, the states visited by the agent during one round of training, and the magnitude of the trace guides how strongly the policy for each state is updated. This accelerates the formation of the optimal policy, reduces the cost of training, and improves the final training result.
On conventional computing platforms the eligibility trace is obtained by evaluating a large number of exponential decay functions. This requires many multiplications and frequent data transfer between the processor and the memory, so energy consumption is very high, which seriously limits the implementation of complex reinforcement learning algorithms. Phase change memory is a new type of nonvolatile memory that relies on the large difference in conductance between the crystalline and amorphous states of its phase change material to achieve high-speed, high-density data storage. The unstable amorphous material undergoes spontaneous structural relaxation toward a glassy state with lower conductance, so the conductance of a phase change memory decays over time, a phenomenon known as conductance drift. By making appropriate use of conductance drift, the decay of the eligibility trace can be realized automatically through in-memory computing, avoiding a large amount of data transfer and multiplication and thereby effectively reducing the overhead of large-scale reinforcement learning algorithms.
Disclosure of Invention
To solve the problem that computing eligibility traces in complex reinforcement learning algorithms consumes too much energy, the present invention provides an eligibility trace calculator based on the multi-level characteristic and the conductance drift characteristic of phase change memory. The calculator realizes the decay of the eligibility trace spontaneously through in-memory computing, greatly reducing the energy consumed in computing the trace. By exploiting the intrinsic conductance drift of the phase change memory, the invention performs the decay operation automatically without a complex arithmetic circuit, which effectively reduces hardware overhead. In addition, both the storage and the computation of the eligibility trace take place inside the phase change memory, which avoids frequent data movement and further reduces energy consumption. The present invention therefore has significant advantages in energy and hardware overhead over conventional eligibility trace implementations.
The eligibility trace calculator of the present invention consists of two parts (see FIG. 1). The first part is a programmable phase change memory array, comprising peripheral circuits for generating programming pulses and reading device conductance, together with phase change memory array cells connected in common. Each array cell consists of a phase change memory and a transistor: one end of the phase change memory is connected to the transistor, the other end is grounded, and the transistor switches the connection between the phase change memory and the peripheral circuit on and off. Each phase change memory stores its eligibility trace value in the form of conductance and performs the decay operation spontaneously. The second part is a result converter, comprising a comparator and a linear operator, which converts the conductance data read from the phase change memory array into eligibility trace data for reinforcement learning.
The eligibility trace calculator performs the decay calculation through the spontaneous conductance drift of the phase change memory, which follows the law $G(t) = G(t_0)\,(t/t_0)^{-\nu}$, where $G(t)$ is the conductance of the phase change memory at time $t$, $t_0$ is the time of the first conductance measurement, and $\nu$ is the conductance drift coefficient, in the range of about 0.01 to 0.1. The eligibility trace update in reinforcement learning comprises two steps. The first step sets the eligibility trace corresponding to the current state and action to 1: $E(s, a) = 1$, where $E$ denotes the eligibility trace matrix, $s$ the state index, and $a$ the action index; this is achieved by a programming operation on the device at the corresponding location in the phase change memory array. The second step decays all eligibility trace data. The conventional implementation is $E = \alpha E$, where $\alpha < 1$ is the decay factor, whereas in the present invention the eligibility trace data $G$, stored in the phase change memory in the form of conductance, decays spontaneously as $G(t) = G(t_0)\,(t/t_0)^{-\nu}$ without any external operation.
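As a minimal numerical sketch of the two decay rules above, the following Python compares the conventional multiplicative update with the power-law drift. The function names and the default values (`nu=0.05`, `alpha=0.9`, `t0=1.0`) are illustrative assumptions, not values prescribed by the invention; conductance and time are in arbitrary units.

```python
def drift_conductance(g_t0, t, t0=1.0, nu=0.05):
    """Power-law drift: G(t) = G(t0) * (t / t0) ** (-nu), used here for t >= t0 > 0."""
    if t0 <= 0 or t < t0:
        raise ValueError("the drift law is used here only for t >= t0 > 0")
    return g_t0 * (t / t0) ** (-nu)


def conventional_decay(e, alpha=0.9, steps=1):
    """Conventional update E <- alpha * E, applied `steps` times (one multiply per step)."""
    for _ in range(steps):
        e = alpha * e
    return e


if __name__ == "__main__":
    # Illustrative values only; nu is chosen inside the 0.01 to 0.1 range quoted above.
    for t in (1.0, 10.0, 100.0):
        print(t, drift_conductance(1.0, t), conventional_decay(1.0, steps=int(t) - 1))
```

In the device the drift costs no arithmetic at all; the function above only models what the memory cell does physically, whereas the conventional rule needs one multiplication per trace entry per step.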
Conductance data read from the phase change memory array must be mapped to the range 0 to 1 by the result converter before it is used in reinforcement learning. The conversion proceeds as follows. The conductance $G(s, a)$ is first fed to a comparator and compared with the upper and lower thresholds $G_U$ and $G_D$. If $G(s, a) > G_U$, the corresponding eligibility trace is $E(s, a) = 1$; if $G(s, a) < G_D$, then $E(s, a) = 0$; if $G_D < G(s, a) < G_U$, the data is sent to the linear operator, which computes $E(s, a) = k\,(G(s, a) - G_D)$, where $k = 1/(G_U - G_D)$ is the amplification factor. After passing through the result converter, the conductance data has been converted into eligibility trace data in the range 0 to 1 and can take part in the policy update of reinforcement learning: $Q = Q + \delta E$, where $Q$ is the policy table and $\delta$ is the update error.
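The policy-table update at the end of this paragraph can be sketched in a few lines of Python. The function name `update_policy_table` and the optional `learning_rate` argument are assumptions added for illustration; with the default of 1.0 the function applies exactly the update stated in the text, with the trace matrix scaling the update error per state-action pair.

```python
def update_policy_table(q, e, delta, learning_rate=1.0):
    """Apply Q[s][a] += learning_rate * delta * E[s][a] in place and return Q.

    q and e are lists of rows indexed as [state][action]; delta is the update error.
    learning_rate is a hypothetical extra knob; 1.0 reproduces Q = Q + delta * E.
    """
    for s, trace_row in enumerate(e):
        for a, trace in enumerate(trace_row):
            q[s][a] += learning_rate * delta * trace
    return q


if __name__ == "__main__":
    q_table = [[0.0, 0.0], [1.0, 1.0]]
    traces = [[1.0, 0.5], [0.0, 0.25]]
    print(update_policy_table(q_table, traces, delta=2.0))
```

Entries whose trace has decayed to 0 are left unchanged, so only recently visited state-action pairs receive credit for the current error.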
Preferably, the result converter comprises two analog comparators and a linear operator. The conductance data $G$ read from the phase change memory array is first fed into the first analog comparator and compared with the upper conductance limit $G_U$. If $G > G_U$, the corresponding eligibility trace is directly determined as $E = 1$; if $G < G_U$, $G$ is fed into the second analog comparator and compared with the lower conductance limit $G_D$. If $G < G_D$, the corresponding eligibility trace is directly determined as $E = 0$; if $G > G_D$, $G$ is sent to the linear operator for conversion: $E = k\,(G - b)$, where $b = G_D$ and $k = 1/(G_U - G_D)$. The conductance data read from the phase change memory array is thereby converted into eligibility trace data in the range $[0, 1]$.
The invention provides an eligibility trace calculator based on the conductance drift effect of phase change memory. First, it uses the multi-level characteristic of the phase change memory so that floating-point eligibility trace data can be stored in a memory cell in the form of conductance. Second, it uses the conductance drift effect to realize decay over time spontaneously, $G(t) = G(t_0)\,(t/t_0)^{-\nu}$, without any additional arithmetic circuit, which effectively reduces the hardware overhead of the operation. By adjusting the comparison thresholds in the result converter, the decay rate of the eligibility trace can be tuned flexibly, so that the calculator can be used in reinforcement learning tasks with different requirements. Because both storage and decay of the eligibility trace data are completed inside the phase change memory, frequent data movement is avoided, the energy consumed by the operation is effectively reduced, the memory wall of conventional computing architectures can be overcome, and the further development of reinforcement learning is promoted.
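A software model of the two-comparator result converter might look like the sketch below. The function name `result_converter` is hypothetical, and the handling of the exact equality cases ($G = G_U$ and $G = G_D$) is an assumption, since the paragraph above states only strict inequalities; sending them to the saturated outputs agrees with the value the linear formula would give at both ends.

```python
def result_converter(g, g_upper, g_lower):
    """Map a read conductance g to an eligibility trace value in [0, 1]."""
    if g_upper <= g_lower:
        raise ValueError("g_upper must be greater than g_lower")
    # First analog comparator: compare with the upper conductance limit G_U.
    if g >= g_upper:
        return 1.0
    # Second analog comparator: compare with the lower conductance limit G_D.
    if g <= g_lower:
        return 0.0
    # Linear operator: E = k * (G - b) with b = G_D and k = 1 / (G_U - G_D).
    k = 1.0 / (g_upper - g_lower)
    b = g_lower
    return k * (g - b)


if __name__ == "__main__":
    for g in (5.0, 3.0, 1.0):
        print(g, result_converter(g, g_upper=4.0, g_lower=2.0))
```

Only conductances strictly between the two limits reach the linear operator, so the multiplication is skipped for saturated cells.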
Detailed Description
To more clearly illustrate the objects, technical solutions and advantages of the present invention, the present invention will be described in further detail below with reference to the accompanying drawings. The description herein is intended to be illustrative of the invention and is not intended to be limiting.
The invention provides an eligibility trace calculator based on the conductance drift effect of phase change memory, which not only stores eligibility trace data at high density in the phase change memory but also performs the decay operation automatically by relying on the conductance drift effect. Compared with the conventional way of computing eligibility traces, it reduces the hardware overhead of the operation and avoids the high energy consumption caused by moving data back and forth.
FIG. 1 is a schematic diagram of the overall structure of the present invention. The eligibility trace calculator consists of two parts. The first part is a programmable phase change memory array, shown on the left side of FIG. 1, which stores the eligibility trace data and performs the automatic decay operation. Each cell in the array is formed by a phase change memory and a transistor: each phase change memory stores one trace value in the form of conductance, and the transistor controls the connection and disconnection between the phase change memory and the outside. One end of the phase change memory is connected to the transistor, and the other end is grounded. The eligibility trace data is updated according to the reinforcement learning algorithm: when $E(s, a) = 1$ is to be set, the control circuit at the periphery of the array turns on the transistors in the row containing the phase change memory $G(s, a)$ and applies a step-like programming current to the corresponding column, so that the programming current reaches the phase change memory $G(s, a)$ and slightly raises its conductance. Applying the signals as shown in FIG. 1 programs the phase change memory inside the dashed box without affecting the other phase change memories. In the subsequent decay operation, the trace data stored in the phase change memory in the form of conductance decays spontaneously over time due to the conductance drift effect, $G(t) = G(t_0)\,(t/t_0)^{-\nu}$, and no external operation is required.
The second part of the present invention is a result converter, shown on the right side of FIG. 1, which converts the conductance data read from the phase change memory array into eligibility trace data in the range 0 to 1. This part consists of two analog comparators and a linear operator. The conductance data $G$ read from the memory array is first fed into the first analog comparator and compared with the upper conductance limit $G_U$. If $G > G_U$, the conductance exceeds the upper limit and the corresponding eligibility trace is directly determined as $E = 1$; if $G < G_U$, $G$ is sent to the next comparator and compared with the lower conductance limit $G_D$. If $G < G_D$, the conductance is below the lower limit and the corresponding eligibility trace is directly determined as $E = 0$; if $G > G_D$, $G$ lies within the specified range and is sent to the linear operator for conversion: $E = k\,(G - b)$, where $b = G_D$ and $k = 1/(G_U - G_D)$; these parameter settings ensure that the calculated eligibility trace satisfies $0 < E < 1$. The result converter thus converts the conductance data stored in the phase change memory into eligibility trace data in the range $[0, 1]$, which can be used in subsequent steps of reinforcement learning.
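The behavior of the array described above can be modeled in software as follows. This is a simplified sketch under stated assumptions: the class name `PCMArray` and all parameters are hypothetical, row and column selection is reduced to indexing a single cell, programming is modeled as setting the cell to a fixed level `g_prog` and restarting its drift clock (the text says only that the programming current raises the conductance), and reads taken earlier than `t0` after programming return the programmed value.

```python
class PCMArray:
    """Toy model of a phase change memory array holding one conductance per (s, a)."""

    def __init__(self, n_states, n_actions, g_init=0.0, g_prog=1.0, t0=1.0, nu=0.05):
        self.g_prog = g_prog      # assumed conductance right after programming
        self.t0 = t0              # time of the first conductance measurement
        self.nu = nu              # conductance drift coefficient
        self.g0 = [[g_init] * n_actions for _ in range(n_states)]
        self.t_prog = [[0.0] * n_actions for _ in range(n_states)]

    def program(self, s, a, now):
        """Select cell (s, a) through its transistor and apply the programming current."""
        self.g0[s][a] = self.g_prog
        self.t_prog[s][a] = now

    def read(self, s, a, now):
        """Read the drifted conductance; no arithmetic is needed in the real device."""
        elapsed = max(now - self.t_prog[s][a], self.t0)
        return self.g0[s][a] * (elapsed / self.t0) ** (-self.nu)

    def read_all(self, now):
        """Read every cell, as done before the data is sent to the result converter."""
        return [[self.read(s, a, now) for a in range(len(row))]
                for s, row in enumerate(self.g0)]


if __name__ == "__main__":
    array = PCMArray(n_states=2, n_actions=2)
    array.program(0, 1, now=0.0)
    print(array.read_all(now=10.0))
```

Programming one cell leaves every other cell's stored value and drift clock untouched, mirroring the selective programming of the device inside the dashed box.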
FIG. 2 is a flow chart of an eligibility trace calculator based on a phase change memory according to the present invention, wherein the process of calculating the eligibility trace mainly comprises the following steps:
(1) According to the current state and action (s, a) in the reinforcement learning algorithm, select the corresponding device $G(s, a)$ in the phase change memory array, turn on the transistor of that cell, and apply the programming current I_program to the phase change memory so that its conductance is slightly raised.
(2) Read out all conductance data in the phase change memory array and feed it into the result converter.
(3) Select one value $G$ from the current conductance data and compare it with the upper conductance limit $G_U$. If $G < G_U$, go to step (4); otherwise go to step (6).
(4) Compare the conductance data $G$ with the lower conductance limit $G_D$. If $G > G_D$, go to step (5); otherwise go to step (7).
(5) Linearly transform the conductance data $G$: $E = k\,(G - b)$, where $b = G_D$ and $k = 1/(G_U - G_D)$. Output the eligibility trace $E$ and go to step (8).
(6) Output the eligibility trace $E = 1$ and go to step (8).
(7) Output the eligibility trace $E = 0$ and go to step (8).
(8) Determine whether all conductance data has been converted. If so, go to step (9); otherwise return to step (3).
(9) The calculation of the eligibility trace is complete.
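Steps (3) to (9) of the flow can be sketched as a loop over the conductances read out in step (2). The function name `compute_eligibility_traces` and the list-of-rows data layout are assumptions for illustration; steps (1) and (2) happen in the hardware and are represented here only by the input argument. The branch order follows the flow exactly, including its treatment of the equality cases.

```python
def compute_eligibility_traces(conductances, g_upper, g_lower):
    """Convert every read conductance into a trace value, following steps (3) to (9)."""
    if g_upper <= g_lower:
        raise ValueError("g_upper must be greater than g_lower")
    k = 1.0 / (g_upper - g_lower)
    b = g_lower
    traces = []
    for row in conductances:          # step (8): repeat until all data is converted
        out_row = []
        for g in row:                 # step (3): select one G, compare with G_U
            if g < g_upper:
                if g > g_lower:       # step (4): compare with G_D
                    e = k * (g - b)   # step (5): linear transform
                else:
                    e = 0.0           # step (7)
            else:
                e = 1.0               # step (6)
            out_row.append(e)
        traces.append(out_row)
    return traces                     # step (9): calculation complete


if __name__ == "__main__":
    read_out = [[5.0, 3.0], [1.0, 2.0]]   # hypothetical step (2) read-out
    print(compute_eligibility_traces(read_out, g_upper=4.0, g_lower=2.0))
```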
Because the conductance drift rate of a phase change memory is essentially fixed, every device decays with time according to $G(t) = G(t_0)\,(t/t_0)^{-\nu}$: the closer the time is to the programming time $t_0$, the faster the conductance decays, and the further from it, the slower. Different reinforcement learning tasks, however, require different eligibility trace decay rates. The present invention therefore allows the lower and upper conductance limits $G_D$ and $G_U$ in steps (3) to (5) to be adjusted, so that the final decay rate of the eligibility trace is tuned by mathematical operation and suits different reinforcement learning tasks without changing any other hardware.
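One way to see how the thresholds set the effective decay rate is to compute how long a freshly programmed cell takes to drift down to the lower limit, at which point its trace reaches 0. The sketch below solves $G(t_0)\,(t/t_0)^{-\nu} = G_D$ for $t$. The function name `trace_lifetime` and the example values are assumptions, and the cell is assumed to hold conductance `g_prog` at time `t0`.

```python
def trace_lifetime(g_prog, g_lower, t0=1.0, nu=0.05):
    """Time at which the drifting conductance reaches g_lower, i.e. when E first hits 0.

    Solves g_prog * (t / t0) ** (-nu) = g_lower, giving t = t0 * (g_prog / g_lower) ** (1 / nu).
    """
    if not (0 < g_lower <= g_prog):
        raise ValueError("expected 0 < g_lower <= g_prog")
    return t0 * (g_prog / g_lower) ** (1.0 / nu)


if __name__ == "__main__":
    # Illustrative thresholds: a higher lower limit shortens the life of the trace.
    for g_lower in (0.8, 0.9, 0.95):
        print(g_lower, trace_lifetime(1.0, g_lower))
```

Raising $G_D$ toward the programmed conductance shortens the time over which the trace stays nonzero, which corresponds to a faster effective decay, while the drift physics of the device is unchanged.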
FIG. 3 is a schematic diagram of how the result converter modulates the conductance drift effect. Panel (a) compares, before the result converter, the eligibility trace decay realized by the present invention with conventional exponential decay: the dashed curves show conductance drift (drift coefficients in the range typical of phase change memories, 0.01 to 0.03), and the solid curves show exponential decay (decay bases in the range 0.8 to 0.9 used in reinforcement learning). The comparison shows that the two operations produce clearly different decay: the decay realized by conductance drift is slower than exponential decay, so using it directly in reinforcement learning gives unsatisfactory results, and converting the drift result is indispensable. Panel (b) shows the decay produced by conductance drift after adjustment by the result converter. With appropriately chosen parameters $k$ and $b$, the decay rate produced by conductance drift reaches the level of exponential decay, and this similar decay behavior ensures that the eligibility trace generated by the present invention can be used in reinforcement learning.
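The qualitative comparison described for FIG. 3 can be explored numerically with the sketch below. It does not reproduce the figure's data: the drift coefficient `nu=0.03`, the thresholds `g_upper=1.0` and `g_lower=0.9`, the decay base `alpha=0.9`, and the choice of one time unit per step with $t_0 = 1$ are all assumptions picked for illustration.

```python
def raw_drift(step, nu=0.03):
    """Normalized conductance after `step` time steps, with G(t0) = 1 and t0 = 1."""
    return (1.0 + step) ** (-nu)


def converted_drift_trace(step, nu=0.03, g_upper=1.0, g_lower=0.9):
    """Trace after `step` time steps: drift followed by the result converter."""
    e = (raw_drift(step, nu) - g_lower) / (g_upper - g_lower)
    return min(1.0, max(0.0, e))   # comparators saturate the output to [0, 1]


def exponential_trace(step, alpha=0.9):
    """Conventional trace after `step` multiplications by alpha."""
    return alpha ** step


if __name__ == "__main__":
    print("step  raw_drift  converted  exponential")
    for step in (0, 1, 5, 10, 20, 40):
        print(step, round(raw_drift(step), 3),
              round(converted_drift_trace(step), 3),
              round(exponential_trace(step), 3))
```

Running the script shows the raw drift staying close to its initial value while the converted trace falls toward zero on a time scale comparable to the exponential rule; the curve shapes still differ, so the thresholds would need tuning per task.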
The invention provides a novel eligibility trace calculator for reinforcement learning that performs the decay operation automatically by relying on the conductance drift effect of phase change memory. In addition, by adjusting the parameters in the result converter, the decay of the eligibility trace can be set to a suitable rate for different reinforcement learning tasks. The invention can effectively overcome the limitation that the memory wall of conventional computer architectures imposes on complex reinforcement learning algorithms, and is of significance for promoting the further development of reinforcement learning and other areas of artificial intelligence.
The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. A person skilled in the art may modify the technical solution or substitute equivalents for it without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall be determined by the claims.