CN102723112B

CN102723112B - Q learning system based on memristor intersection array

Info

Publication number: CN102723112B
Application number: CN201210188573.2A
Authority: CN
Inventors: 王丽丹; 何朋飞; 段书凯; 钟宇平
Original assignee: Southwest University
Current assignee: Southwest University
Priority date: 2012-06-08
Filing date: 2012-06-08
Publication date: 2015-06-17
Anticipated expiration: 2032-06-08
Also published as: CN102723112A

Abstract

The invention discloses a Q-learning system based on a memristor crossbar (intersection) array. The system comprises the memristor crossbar array and is characterized in that it further comprises a read/write selector switch, a state selector switch, a column selector switch, a delay unit, and a state detection module. The read/write selector switch controls the read and write operations of the memristor crossbar array; the state selector switch selects the row line corresponding to the current environment state s_t detected by the state detection module; the column selector switch selects the column line corresponding to action a_t when a memristor value of the array, i.e. a Q value, is updated; the delay unit delays the voltage of the selected column line by one time step; and the state detection module detects the current environment state and saves the previous environment state. The invention successfully applies a new circuit element, the memristor, to reinforcement learning, thereby solving the problem that reinforcement learning requires a large amount of storage space and offering a new approach for future research on reinforcement learning.

Description

A Q-learning system based on a memristor crossbar array

Technical field

The present invention relates to a storage matrix and an intelligent learning algorithm.

Background art

Reinforcement learning is an advanced intelligent learning algorithm that has been widely applied in intelligent robotics in recent years and has become a research focus. In 1954, Minsky proposed the SNARCs reinforcement learning computation model. Later, Sutton proposed the AHC algorithm and the TD learning algorithm in his PhD dissertation. Subsequently, Watkins et al., building on the TD learning algorithm, proposed the Q-learning algorithm, now a classic among reinforcement learning algorithms and an important milestone in the development of reinforcement learning. Since it was proposed, many researchers have applied Q-learning to mobile robot navigation and to the scheduling of robot soccer systems and intelligent I/O. Reinforcement learning nevertheless has its own limitation: when the problem is complex, it requires a large amount of state-action storage space. In 1971, Chua, based on the completeness theory of circuits, proposed the fourth circuit element, the memristor (L. O. Chua. Memristor-the missing circuit element. IEEE Trans. Circuit Theory. 1971, 18(5): 507-519.).

In 2008, HP Labs successfully fabricated the first physical realization of the memristor, after which the memristor attracted wide attention. The memristor is nanoscale and nonlinear; its resistance changes with the input excitation, and this change is non-volatile, so the memristor is well suited to the design of large-scale memories. The memristor crossbar array is one kind of memristor memory; its structure is simple and it is convenient to design. Hu Xiaofang et al. used a memristor crossbar array to store images (Hu Xiaofang, Duan Shukai, Wang Lidan, et al. Memristor crossbar array and its application in image processing. Science in China Series F: Information Science. 2011, 41(4): 500-512.). Because the memristor is nanoscale, a memristor crossbar array can be made into a large-scale memory, which can solve the problem that reinforcement learning needs a large amount of state-action storage space when solving complex problems. Using a memristor crossbar array to implement Q-learning is therefore a good choice.

The physical model of the HP memristor is shown in Figure 1. The memristor consists of two parts, a doped region and an undoped region, where w and D denote the width of the doped region and the total width of the memristor, respectively. Its mathematical model is as follows:

M (t) = R_{ON} \frac{w (t)}{D} + R_{OFF} (1 - \frac{w (t)}{D})

where R_{OFF} and R_{ON} are the resistances of the memristor when w equals 0 and D, respectively.

\frac{dw (t)}{dt} = \frac{μ_{v} R_{ON}}{D} i (t)

Here, μ_{v} is the average ion mobility, in units of cm² s⁻¹ V⁻¹.

T_{w} = \frac{Φ_{D}}{V_{A} R_{OFF}^{2}} [{(R (w_{0}))}^{2} - {(R (w))}^{2}]

Wherein,

Φ_{D} = \frac{{(βD)}^{2}}{2 μ_{v} (β - 1)}

Here, T_{w} is the width of the voltage pulse applied across the memristor, V_{A} is the amplitude of the pulse, R(w_{0}) is the initial resistance of the memristor, R(w) is the resistance that the memristor reaches, and β = R_{OFF} / R_{ON}.

When R(w_{0}) is less than or equal to R(w), we obtain

R (w) = \sqrt{{(R (w_{0}))}^{2} - \frac{V_{A} T_{w} R_{OFF}^{2}}{Φ_{D}}}, R_{ON} \leq R (w) \leq R_{OFF}

Therefore, when T_{w} is fixed, the resistance of the memristor changes as V_{A} changes, and this change is non-volatile.
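The pulse relations above can be sketched numerically as follows. This is a minimal illustration and not part of the invention: the device parameters R_ON, R_OFF, D and MU_V are illustrative assumptions (the text gives no numeric values), SI units are used, and the function names are hypothetical.

```python
import math

# Illustrative device parameters (assumptions, not values from the patent).
R_ON = 100.0      # ohms, resistance when w = D
R_OFF = 16_000.0  # ohms, resistance when w = 0
D = 10e-9         # metres, total width of the memristor
MU_V = 1e-14      # m^2 s^-1 V^-1, average ion mobility (SI units)


def phi_d():
    """Phi_D = (beta * D)^2 / (2 * mu_v * (beta - 1)), with beta = R_OFF / R_ON."""
    beta = R_OFF / R_ON
    return (beta * D) ** 2 / (2.0 * MU_V * (beta - 1.0))


def pulse_width(r_start, r_target, v_a):
    """Pulse width T_w that moves the resistance from r_start to r_target."""
    return phi_d() / (v_a * R_OFF ** 2) * (r_start ** 2 - r_target ** 2)


def resistance_after_pulse(r_start, v_a, t_w):
    """Resistance R(w) after a pulse of amplitude v_a and width t_w."""
    squared = r_start ** 2 - v_a * t_w * R_OFF ** 2 / phi_d()
    r = math.sqrt(max(squared, 0.0))
    # Enforce the constraint R_ON <= R(w) <= R_OFF.
    return min(max(r, R_ON), R_OFF)
```

With T_w held fixed, calling resistance_after_pulse with a larger v_a yields a lower resistance, which is the dependence on V_A described above.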

The memristor memory circuits are shown in Figures 2 and 3: Figure 2 shows the circuit for writing data, and Figure 3 shows the circuit for reading data. When data is written, a positive voltage pulse is applied to the memristor and R(w) decreases, so the memristor memorizes the applied voltage pulse. When data is read, different memristor resistances yield different values of V_{out}; there is a definite correspondence between V_{out} and the resistance of the memristor, so V_{out} correctly reflects the resistance of the memristor, i.e. the value stored in the memristor.

The resistance of the memristor changes with the input excitation, and this change is non-volatile; the memristor therefore has very good storage characteristics. Moreover, the memristor is nanoscale, which makes it well suited to large-scale memories. The memristor crossbar array is one example of a memory built from memristors.

The structure of the memristor crossbar array is shown in Figure 4, and the circuit represented by each circular region is shown in Figure 5. In Figure 5, the read/write switch is the control switch for writing and reading data. When data is written to a memristor, the switch connects to the point on the left, and the write voltage V_{in} is applied to the corresponding row line. When the data of a memristor is read, the switch connects to the point on the right, the read voltage V_{in} is applied to the corresponding row line, and the corresponding column line outputs the voltage V_{out}.

Summary of the invention

The object of the present invention is to provide a Q-learning system based on a memristor crossbar array that implements the Q-learning algorithm.

To achieve this object, the following technical solution is adopted: a Q-learning system based on a memristor crossbar array, comprising a memristor crossbar array, characterized in that the system further comprises

Read/write selector switch: controls the read and write operations of the memristor crossbar array;

State selector switch: the state detection module detects the current environment state s_t, and the corresponding row line is selected through the state selector switch;

Column selector switch: when a Q value, i.e. a particular memristor in the memristor crossbar array, needs to be updated, the column selector switch selects the column line corresponding to action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environment state and saves the previous environment state. When an action needs to be selected according to the state, the state detection module detects the current environment state and supplies it to the state selector switch and the state control switch. After the action is performed, the state selector switch detects the environment state at that moment, the previous environment state is saved, and the environment state at that moment is supplied to the state selector switch and the state control switch. When a Q value is to be updated, the state detection module outputs the environment state of the previous moment and supplies it to the state selector switch, which selects the corresponding row line.

The present invention successfully applies a new circuit element, the memristor, to reinforcement learning, solving the problem that reinforcement learning requires a large amount of storage space and providing a new approach for future research on reinforcement learning.

Brief description of the drawings

Fig. 1 is the physical model structure of the HP memristor;

Fig. 2 is the circuit diagram of the memristor when writing data;

Fig. 3 is the circuit diagram of the memristor when reading data;

Fig. 4 is a structural schematic of the memristor crossbar array;

Fig. 5 is the circuit diagram of a single memristor in the memristor crossbar array;

Fig. 6 is a structural schematic of the present invention;

Fig. 7 is a structural schematic of the robot and the obstacles in the embodiment of the present invention;

Fig. 8 shows the simulation results of the present embodiment.

Detailed description of the embodiment

The present invention is further described below with reference to the drawings and a specific embodiment.

Q-learning is a classic reinforcement learning algorithm. Its simplest form is single-step Q-learning, whose Q-value update formula is

Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + α (r_{t+1} + γ \max_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t}))

where α is the learning rate and γ is the discount rate. r_{t+1} is the reward obtained from the environment for performing action a_t in state s_t. Q(s_t, a_t) is the state-action value function, i.e. the value obtained by performing action a_t in state s_t.
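As a software point of reference for the update formula, a minimal sketch follows. It assumes the Q values are held in a dictionary keyed by (state, action); the function name q_update and this table layout are hypothetical and are not part of the invention.

```python
def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one single-step Q-learning update in place and return the new value.

    q        -- dict mapping (state, action) to a Q value (hypothetical layout)
    actions  -- iterable of all actions available in next_state
    """
    # max over a of Q(s_{t+1}, a)
    best_next = max(q[(next_state, a)] for a in actions)
    # Bracketed term of the update formula.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]
```

For example, with α = 0.8 and γ = 0.98, a stored value of 0.5, a reward of 1 and a best next-state value of 1.0 give 0.5 + 0.8 × (1 + 0.98 − 0.5) = 1.684.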

The limitation of reinforcement learning is that it requires a large amount of storage space. The memristor, a new circuit element, is nanoscale and has storage characteristics, and a crossbar array built from memristors offers a large amount of storage space and parallel processing capability, which makes it well suited to solving this problem.

In the Q-learning algorithm, each time an action is performed, a reward is obtained from the environment; the maximum Q value over the state-action pairs of the current state, together with the reward obtained, is used to update the Q value of the pair formed by the previous state and the selected action. When a memristor crossbar array is used to implement Q-learning, the output voltage of each memristor represents the Q value of the corresponding state-action pair. According to the storage principle of the memristor, its resistance does not change after power-off; it is therefore only necessary to apply across the memristor the write voltage

V_{i} = α (r + γ \max_{a} V(s_{t+1}, a) - V(s_{t}, a_{t}))

in order to update the resistance of the memristor corresponding to s_t and a_t, thereby changing the output voltage V(s_t, a_t) of that memristor, i.e. the value Q(s_t, a_t).

The process by which the memristor crossbar array implements Q-learning is shown in Figure 6. In the memristor crossbar array, each row line corresponds to a state s and each column line corresponds to an action a. The specific implementation process is as follows:

(1) The read/write selector switch selects read enable; the state detection module on the robot detects the current environment state s_t, and the corresponding row line is selected through the state selector switch;

(2) The column selector switch selects all columns. Through the state control switch, the column lines are connected to the random selection module, which selects at random according to the magnitude of each column-line voltage, a column line with a larger voltage having a larger probability of being selected. One column line is finally selected at random; the action a_t to be performed is obtained from the selected column line, and the robot performs action a_t. Alternatively, in certain preset states, the column lines can be connected through the state control switch to the comparator module, which selects the column line with the maximum voltage; this column line is then connected to the delay unit through the connection selector switch. The ε-greedy policy of reinforcement learning can thus be realized by the state selector switch, the random selection module, the comparator, and the connection selection module.

(3) The selected column line is connected to the delay unit, which delays the voltage of the column line by one time step;

(4) The state detection module detects the current environment state, and the robot enters state s_{t+1}. The column lines are now connected to the comparator through the state control switch, and the comparator selects the column line with the maximum voltage. The connection selection module connects this column line to the Q-value update module, which combines this voltage, the output voltage of the delay unit, and the reward obtained from the environment according to formula (7), the write-voltage formula given above, to obtain the write voltage V_{i}.

(5) The read/write selector switch selects write enable, and the write voltage V_{i} is applied across the memristor for a time T_{w}.

(6) The above process is repeated until the set number of iterations is reached.
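Steps (1) to (6) can be sketched in software as follows. This is an idealized sketch under stated assumptions, not the circuit itself: the array is modeled as a list of rows of column voltages, the write in step (5) is modeled as adding V_i directly to the stored voltage, and the names comparator, run_steps and env_step (an environment callback returning the next state and the reward) are hypothetical.

```python
def comparator(voltages):
    """Comparator module: index of the column line with the maximum voltage."""
    return max(range(len(voltages)), key=lambda j: voltages[j])


def run_steps(v, state, env_step, steps, alpha, gamma, select=comparator):
    """Run the read/select/delay/update/write cycle and return the final state.

    v        -- v[s][a] is the output voltage standing for Q(s, a); updated in place
    env_step -- hypothetical callback: (state, action) -> (next_state, reward)
    select   -- column selection rule; the random selection module could be
                substituted here for the comparator
    """
    for _ in range(steps):                            # step (6): repeat
        row = v[state]                                # step (1): read row s_t
        action = select(row)                          # step (2): choose a column
        delayed = row[action]                         # step (3): delay unit holds V(s_t, a_t)
        next_state, reward = env_step(state, action)  # step (4): detect s_{t+1}
        v_max = v[next_state][comparator(v[next_state])]
        v_write = alpha * (reward + gamma * v_max - delayed)  # write voltage V_i
        v[state][action] += v_write                   # step (5): idealized write
        state = next_state
    return state
```

In the actual system the write in step (5) changes the memristor resistance through a pulse of width T_w; the direct addition used here is only a stand-in for that effect on the read-out voltage.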

The robot obstacle-avoidance experiment aims to have the robot walk without collision in an environment containing obstacles. This experiment uses Q-learning based on the memristor crossbar array to realize the robot's learning and, ultimately, collision-free walking. The experiment uses the MobotSim software.

In Figure 7, the circular region represents the robot. The robot has three sensors, and the numbers 0-2 correspond to the three sensors respectively. The maximum distance that each sensor can detect is 1.5 meters. The black regions represent obstacles.

In this experiment, the distance to an obstacle detected by each sensor is divided into 3 segments, as follows:

where dist0-dist2 respectively denote the distance to an obstacle detected by each sensor. Combining s0-s2 gives 27 cases, which are taken as the 27 states of the environment in which the robot is located; these 27 states are stored in a three-dimensional array state[s0, s1, s2]. On this experimental platform, the value returned by a sensor is -1 both when the robot collides with an obstacle and when the sensor detects no obstacle. The state in which the robot collides with an obstacle is therefore classified as state 0, i.e. the case in which s0-s2 are all 0.
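The state encoding can be sketched as follows. The segment boundaries are not reproduced in this text, so the thresholds NEAR and FAR below are hypothetical (equal thirds of the 1.5 m range), as are the assumption that a larger segment index means a more distant obstacle, the mapping of every -1 reading to segment 0, and the base-3 mapping from state[s0, s1, s2] to a row-line index.

```python
MAX_RANGE = 1.5  # metres, maximum sensing distance stated in the text

# Hypothetical segment boundaries (assumption: equal thirds of MAX_RANGE).
NEAR = MAX_RANGE / 3
FAR = 2 * MAX_RANGE / 3


def segment(dist):
    """Map one sensor reading to segment 0, 1 or 2.

    Assumption: a reading of -1 (collision or nothing detected on the
    platform) is mapped to segment 0, following the state-0 rule in the text.
    """
    if dist < 0 or dist < NEAR:
        return 0
    if dist < FAR:
        return 1
    return 2


def encode(dists):
    """Turn the three readings (dist0, dist1, dist2) into (s0, s1, s2)."""
    s0, s1, s2 = (segment(d) for d in dists)
    return (s0, s1, s2)


def row_index(state):
    """Hypothetical base-3 mapping of (s0, s1, s2) to one of 27 row lines."""
    s0, s1, s2 = state
    return 9 * s0 + 3 * s1 + s2
```

Any one-to-one mapping from the 27 states to the row lines would serve equally well; base 3 is chosen here only because it is compact.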

The reward function r is defined as:

In this experiment, the robot performs three kinds of actions: moving forward, turning left, and turning right. When the robot is in state[2,2,2], the action is chosen at random in proportion to the Q values; in all other states, the action with the maximum Q value is performed.
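The action-selection rule just described can be sketched as follows. The action labels, the function name choose_action, and the handling of non-positive Q values (clipped to zero weight, with a uniform choice if all weights are zero) are assumptions made for illustration; the text does not specify them.

```python
import random

ACTIONS = ("forward", "turn_left", "turn_right")  # hypothetical labels
EXPLORE_STATE = (2, 2, 2)  # the state in which selection is random


def choose_action(state, q_values, rng=random):
    """Pick an action for `state`; q_values holds its Q values in ACTIONS order."""
    if state == EXPLORE_STATE:
        # Random selection in proportion to the Q values.
        # Assumption: negative values get zero weight.
        weights = [max(q, 0.0) for q in q_values]
        if sum(weights) > 0.0:
            return rng.choices(ACTIONS, weights=weights, k=1)[0]
        return rng.choice(ACTIONS)
    # All other states: the action with the maximum Q value.
    best = max(range(len(ACTIONS)), key=lambda i: q_values[i])
    return ACTIONS[best]
```

Passing a seeded random.Random instance as rng makes the random branch reproducible between simulation runs.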

With α = 0.8 and γ = 0.98, the number of simulation runs is set to 500, with 2000 steps per run. The simulation results are shown in Figure 8.

Claims

1. A Q-learning system based on a memristor crossbar array, comprising a memristor crossbar array, characterized in that the system further comprises

Column selector switch: when a Q value, i.e. a particular memristor in the memristor crossbar array, needs to be updated, the column selector switch selects the column line corresponding to action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environment state and saves the previous environment state; when an action needs to be selected according to the state, the state detection module detects the current environment state and supplies it to the state selector switch and the state control switch; after the action is performed, the state selector switch detects the environment state at that moment, the previous environment state is saved, and the environment state at that moment is supplied to the state selector switch and the state control switch; when a Q value is to be updated, the state detection module outputs the environment state of the previous moment and supplies it to the state selector switch, which selects the corresponding row line; applying across the memristor the write voltage

V_{i} = α (r + γ \max_{a} V(s_{t+1}, a) - V(s_{t}, a_{t}))

updates the resistance of the memristor corresponding to s_t and a_t, thereby changing the output voltage V(s_t, a_t) of that memristor, i.e. the value Q(s_t, a_t); here the value of V(s_t, a_t) is equal to the value of Q(s_t, a_t);

where α is the learning rate, r is the reward function, and γ is the discount rate.