CN102723112A

CN102723112A - Q learning system based on memristor intersection array

Info

Publication number: CN102723112A
Application number: CN2012101885732A
Authority: CN
Inventors: 王丽丹; 何朋飞; 段书凯; 钟宇平
Original assignee: Southwest University
Current assignee: Southwest University
Priority date: 2012-06-08
Filing date: 2012-06-08
Publication date: 2012-10-10
Anticipated expiration: 2032-06-08
Also published as: CN102723112B

Abstract

The invention discloses a Q-learning system based on a memristor crossbar array. The system comprises the memristor crossbar array and is characterized in that it further comprises a read/write selection switch, a state selection switch, a column selection switch, a delay unit and a state detection module. The read/write selection switch controls the read/write operation of the memristor crossbar array; the state detection module detects the current environment state s_t, and the state selection switch selects the corresponding row line; the column selection switch selects the column line corresponding to action a_t when the value of a particular memristor in the memristor crossbar array, namely a Q value, is updated; the delay unit delays the voltage of the selected column line by one time step; and the state detection module detects the current environment state and saves the previous environment state. The system successfully applies a new circuit element, the memristor, to reinforcement learning, which solves the problem that reinforcement learning requires a large amount of storage space and offers a new approach for future research on reinforcement learning.

Description

A Q-learning system based on a memristor crossbar array

Technical field

The present invention relates to a storage matrix and an intelligent learning algorithm.

Background technology

Reinforcement learning is an advanced intelligent learning algorithm that has been widely applied in intelligent robotics in recent years and has become a research focus. In 1954, Minsky proposed the SNARCs reinforcement learning computation model. Sutton then proposed the AHC algorithm and the TD learning algorithm in his PhD dissertation. Later, Watkins et al. proposed the Q-learning algorithm, now a classic reinforcement learning algorithm, on the basis of the TD learning algorithm; Q-learning is an important milestone in the development of reinforcement learning. After it was proposed, many researchers applied Q-learning to mobile robot navigation, robot soccer systems, and intelligent I/O scheduling. However, reinforcement learning has its own limitation: when the problem is complex, it requires a large amount of state-action storage space. In 1971, Chua proposed the fourth circuit element, the memristor, based on the completeness theory of circuits (L. O. Chua. Memristor - the missing circuit element. IEEE Trans. Circuit Theory, 1971, 18(5): 507-519).

In 2008, HP Labs successfully fabricated the first physical memristor, and memristors have attracted wide attention ever since. A memristor is nanoscale and nonlinear; its resistance changes with the input stimulus, and this change is non-volatile, so memristors are well suited to designing large-scale memories. The memristor crossbar array is one kind of memristor memory; its structure is simple and it is easy to design. Hu Xiaofang et al. used a memristor crossbar array to store images (Hu Xiaofang, Duan Shukai, Wang Lidan, et al. Memristor crossbar array and its application in image processing. Science China Series F: Information Sciences, 2011, 41(4): 500-512). Because memristors are nanoscale, a memristor crossbar array can be built into a large-scale memory, which addresses the large state-action storage space that reinforcement learning needs for complex problems. Implementing Q-learning with a memristor crossbar array is therefore a good choice.

The physical model of the HP memristor is shown in Fig. 1. The memristor consists of two parts, a doped region and an undoped region, where w and D denote the width of the doped region and the total width of the memristor, respectively. Its mathematical model is as follows:

M (t) = R_{ON} \frac{w (t)}{D} + R_{OFF} (1 - \frac{w (t)}{D}) \qquad (1)

where R_{OFF} and R_{ON} denote the resistance of the memristor when w equals 0 and D, respectively.

\frac{dw (t)}{dt} = \frac{μ_{v} R_{ON}}{D} i (t) \qquad (2)

Here, μ_{v} denotes the average ion mobility, in units of cm² s⁻¹ V⁻¹.

T_{w} = \frac{Φ_{D}}{V_{A} R_{OFF}^{2}} [{(R (w_{0}))}^{2} - {(R (w))}^{2}] \qquad (3)

where

Φ_{D} = \frac{{(βD)}^{2}}{2 μ_{v} (β - 1)} \qquad (4)

Here, T_{w} is the width of the voltage pulse applied across the memristor, V_{A} is the pulse amplitude, R(w_{0}) is the initial resistance of the memristor, R(w) is the resistance the memristor reaches, and β = R_{OFF} / R_{ON}.

When R(w_{0}) is less than or equal to R(w), we obtain

R (w) = \sqrt{{(R (w_{0}))}^{2} - \frac{V_{A} T_{w} R_{OFF}^{2}}{Φ_{D}}}, \quad R_{ON} \leq R (w) \leq R_{OFF} \qquad (5)

Therefore, for a fixed T_{w}, the resistance of the memristor changes as V_{A} changes, and this change is non-volatile.
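The pulse relations above can be checked numerically. The following is a minimal sketch, not the patent's implementation: the device parameters (R_ON = 100 Ω, R_OFF = 16 kΩ, D = 10 nm, μ_v = 10⁻¹⁴ m² s⁻¹ V⁻¹) are assumed illustrative values that do not appear in the text, and the function names are hypothetical.

```python
import math

# Assumed illustrative device parameters (not given in the text), SI units.
R_ON = 100.0    # ohms, resistance when w = D
R_OFF = 16e3    # ohms, resistance when w = 0
D = 10e-9       # m, total width of the memristor
MU_V = 1e-14    # m^2 s^-1 V^-1, average ion mobility

BETA = R_OFF / R_ON
PHI_D = (BETA * D) ** 2 / (2 * MU_V * (BETA - 1))  # volt-seconds


def resistance_after_pulse(r0, v_a, t_w):
    """Resistance reached from r0 after a pulse of amplitude v_a and width t_w."""
    r_squared = r0 ** 2 - v_a * t_w * R_OFF ** 2 / PHI_D
    r = math.sqrt(max(r_squared, 0.0))
    # The resistance is bounded by R_ON and R_OFF.
    return min(max(r, R_ON), R_OFF)


def pulse_width(r0, r_target, v_a):
    """Pulse width needed to move the resistance from r0 to r_target."""
    return PHI_D / (v_a * R_OFF ** 2) * (r0 ** 2 - r_target ** 2)


r1 = resistance_after_pulse(R_OFF, v_a=1.0, t_w=0.1)
print(f"resistance after a 1 V, 0.1 s pulse: {r1:.1f} ohms")
```

With these assumed values a positive pulse lowers the resistance, and `pulse_width` inverts `resistance_after_pulse` as long as the result stays strictly between R_ON and R_OFF.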

The memristor memory circuits are shown in Figs. 2 and 3: Fig. 2 shows the circuit for writing data and Fig. 3 the circuit for reading data. When writing data, a positive voltage pulse is applied to the memristor and R(w) decreases, so the memristor remembers the applied voltage pulse. When reading data, different memristor resistances yield different values of V_{out}. V_{out} and the memristor resistance correspond to each other, so V_{out} correctly reflects the resistance of the memristor, that is, the value it stores.

The resistance of a memristor changes with the input stimulus, and this change is non-volatile, so the memristor has very good storage characteristics. Because it is also nanoscale, it is well suited to large-scale memories. The memristor crossbar array is one example of a memory built from memristors.

The structure of the memristor crossbar array is shown in Fig. 4, and the circuit represented by each circular region is shown in Fig. 5. In Fig. 5, the read/write switch is the control switch for writing and reading data. To write data to a memristor, the switch is connected to the left contact, and the write voltage V_{in} is applied to the corresponding row line. To read data from a memristor, the switch is connected to the right contact, the read voltage V_{in} is applied to the corresponding row line, and the corresponding column line outputs the voltage V_{out}.

Summary of the invention

The purpose of the present invention is to provide a Q-learning system, based on a memristor crossbar array, that implements the Q-learning algorithm.

To achieve this purpose, the following technical scheme is adopted: a Q-learning system based on a memristor crossbar array, comprising a memristor crossbar array, characterized in that the system further comprises

Read/write selection switch: controls the read/write operation of the memristor crossbar array;

State selection switch: the state detection module detects the current environment state s_t, and the state selection switch selects the corresponding row line;

Column selection switch: when a Q value, that is, the value of a particular memristor in the memristor crossbar array, needs to be updated, the column selection switch selects the column line corresponding to action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environment state and saves the previous environment state. When an action needs to be selected according to the state, the state detection module detects the current environment state and provides it to the state selection switch and the state control switch. After the action is executed, the state detection module detects the environment state at that moment, saves the previous environment state, and provides the new environment state to the state selection switch and the state control switch. When the Q value is updated, the state detection module outputs the environment state of the previous moment and provides it to the state selection switch, which selects the corresponding row line.

The present invention successfully applies a new circuit element, the memristor, to reinforcement learning, solves the problem that reinforcement learning requires a large amount of storage space, and offers a new approach for future research on reinforcement learning.

Description of drawings

Fig. 1 is a structural diagram of the physical model of the HP memristor;

Fig. 2 is the circuit diagram for writing data to the memristor;

Fig. 3 is the circuit diagram for reading data from the memristor;

Fig. 4 is a structural diagram of the memristor crossbar array;

Fig. 5 is the circuit diagram of a single memristor in the memristor crossbar array;

Fig. 6 is a structural diagram of the present invention;

Fig. 7 is a schematic diagram of the robot and obstacles in the embodiment of the invention;

Fig. 8 shows the simulation results of the embodiment.

Specific embodiment

The present invention is further described below with reference to the accompanying drawings and a specific embodiment.

Q-learning is a classic reinforcement learning algorithm. Its simplest form is one-step Q-learning, whose Q-value update formula is

Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + α (r_{t+1} + γ \max_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t})) \qquad (6)

where α is the learning rate and γ is the discount rate. r_{t+1} is the reward received from the environment for executing action a_t in state s_t. Q(s_t, a_t) is the state-action value function, that is, the value obtained by executing action a_t in state s_t.
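As an illustration of the one-step update, the following minimal sketch applies it to a Q table held as a list of lists, indexed by state and then by action. The table layout and the function name `q_update` are assumptions made for the example, not part of the described system.

```python
def q_update(q, s_t, a_t, reward, s_next, alpha, gamma):
    """Apply one-step Q-learning to q[s_t][a_t] in place and return the new value."""
    # Best value available from the next state, max over actions a of Q(s_{t+1}, a).
    best_next = max(q[s_next])
    q[s_t][a_t] += alpha * (reward + gamma * best_next - q[s_t][a_t])
    return q[s_t][a_t]


# Two states, two actions; state 1 already has learned values.
q_table = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(q_table, s_t=0, a_t=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(new_value)
```

Here the target is 1.0 + 0.9 × 2.0 = 2.8, and with α = 0.5 the entry moves halfway from 0 toward it.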

The limitation of reinforcement learning is its need for a large amount of storage space. The memristor, a new circuit element, is nanoscale and has storage capability, and a crossbar array built from memristors offers large storage capacity and parallel processing, which makes it well suited to solving this problem.

In the Q-learning algorithm, each executed action yields a reward from the environment; the maximum Q value of the current state and the obtained reward are then used to update the Q value of the previous state and the selected action. When Q-learning is implemented with a memristor crossbar array, the output voltage of each memristor represents the Q value of the corresponding state-action pair. By the storage principle of the memristor, its resistance does not change after power-off, so it suffices to apply the write voltage

V_{i} = α (r + γ \max_{a} V(s_{t+1}, a) - V(s_{t}, a_{t})) \qquad (7)

across the memristor to update the resistance of the memristor corresponding to s_t and a_t, thereby changing that memristor's output voltage V(s_t, a_t), that is, the value Q(s_t, a_t).

The process by which the memristor crossbar array implements Q-learning is shown in Fig. 6. In the memristor crossbar array, each row line corresponds to a state s and each column line corresponds to an action a. The procedure is as follows:

(1) The read/write selection switch selects read. The state detection module on the robot detects the current environment state s_t, and the state selection switch selects the corresponding row line;

(2) The column selection switch selects all columns. The state control switch connects the column lines to the random selection module, which chooses at random according to the voltage of each column line: the larger a column line's voltage, the more likely that column line is to be chosen. One column line is finally selected; it gives the action a_t to execute, and the robot executes a_t. Alternatively, in certain preset states, the state control switch connects the column lines to the comparator module, which selects the column line with the largest voltage, and the connection selection switch then connects this column line to the delay unit. Together, the state selection switch, the random selection module, the comparator and the connection selection module realize the ε-greedy strategy of reinforcement learning;

(3) The selected column line is connected to the delay unit, which delays the column line voltage by one time step;

(4) The state detection module detects the current environment state, and the robot enters state s_{t+1}. The state control switch now connects the column lines to the comparator, which selects the column line with the largest voltage, and the connection selection module connects this column line to the Q-value update module. The Q-value update module computes the write voltage V_i according to formula (7) from this voltage, the output voltage of the delay unit, and the reward obtained from the environment;

(5) The read/write selection switch selects write, and the write voltage V_i is applied across the memristor for a duration T_w;

(6) The above process is repeated until the set number of iterations is reached.
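The steps above can be sketched in software. This is a behavioral sketch under stated assumptions, not a circuit model: the crossbar is idealized as a matrix `v` of output voltages (row = state, column = action), a write is idealized as adding V_i directly to the stored voltage, negative voltages get zero selection weight, and `env_step` is a hypothetical callback standing in for the robot and its environment. All function names are illustrative.

```python
import random

ALPHA, GAMMA = 0.8, 0.98  # learning rate and discount rate used in the embodiment


def select_column(voltages, greedy, rng):
    """Step (2): comparator (largest voltage) or voltage-proportional random choice."""
    if greedy:
        return max(range(len(voltages)), key=lambda j: voltages[j])
    weights = [max(x, 0.0) for x in voltages]  # assumption: negatives get zero weight
    if sum(weights) <= 0:
        return rng.randrange(len(voltages))    # assumption: uniform if nothing to weigh
    return rng.choices(range(len(voltages)), weights=weights)[0]


def write_voltage(v_delayed, v_max_next, reward):
    """Step (4): write voltage from the delayed voltage, the new maximum and the reward."""
    return ALPHA * (reward + GAMMA * v_max_next - v_delayed)


def learning_step(v, s_t, env_step, greedy, rng):
    """One pass through steps (1) to (5); returns the next state."""
    a_t = select_column(v[s_t], greedy, rng)    # steps (1)-(2): read row s_t, pick a column
    v_delayed = v[s_t][a_t]                     # step (3): hold the voltage one time step
    s_next, reward = env_step(s_t, a_t)         # the robot acts and the new state is detected
    v_i = write_voltage(v_delayed, max(v[s_next]), reward)  # step (4)
    v[s_t][a_t] += v_i                          # step (5): idealized write
    return s_next


rng = random.Random(0)
v = [[0.0, 0.0], [0.0, 0.0]]
state = learning_step(v, 0, lambda s, a: (1, 1.0), greedy=True, rng=rng)
print(state, v[0])
```

Step (6) corresponds to calling `learning_step` in a loop for the set number of iterations.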

The robot obstacle-avoidance experiment has the robot walk without collision in an environment containing obstacles. The experiment uses Q-learning based on the memristor crossbar array for the robot's learning, ultimately achieving obstacle-free walking, and is run in the MobotSim software.

In Fig. 7, the circular region represents the robot, which carries three sensors, labeled 0-2; each sensor can detect up to a maximum distance of 1.5 meters. The black regions represent obstacles.

In this experiment, the distance from each sensor to an obstacle is divided into 3 segments, as follows:

Here, dist0-dist2 denote the distance to an obstacle detected by each sensor, respectively. Combining s0-s2 gives 27 cases, which are taken as the 27 states of the robot's environment and stored in a three-dimensional array state[s0, s1, s2]. On this experimental platform, a sensor returns -1 both when the robot collides with an obstacle and when the sensor detects no obstacle; the state in which the robot collides with an obstacle is therefore classified as state 0, that is, the case where s0-s2 are all 0.
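To make the 27-state encoding concrete, the sketch below enumerates the combinations of s0-s2 and maps each to a single index that could serve as a row-line number. The base-3 numbering is an assumption made for illustration; the text only specifies a three-dimensional array state[s0, s1, s2]. The segment indices are taken as given inputs, since the distance thresholds are not reproduced here.

```python
from itertools import product

SEGMENTS = (0, 1, 2)  # the three distance segments of each sensor


def state_index(s0, s1, s2):
    """Map segment indices (s0, s1, s2) to one index in 0..26 (assumed base-3 numbering)."""
    for s in (s0, s1, s2):
        if s not in SEGMENTS:
            raise ValueError(f"segment index out of range: {s}")
    return s0 * 9 + s1 * 3 + s2


ALL_STATES = list(product(SEGMENTS, repeat=3))  # every (s0, s1, s2) combination

print(len(ALL_STATES), state_index(0, 0, 0), state_index(2, 2, 2))
```

Under this numbering the collision case with s0-s2 all 0 is index 0, which matches its classification as state 0.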

The reward function r is defined as:

In this experiment, the robot executes three actions: move forward, turn left, and turn right. When the robot is in state [2, 2, 2], the action is chosen at random with probability proportional to the Q values; in all other states, the action with the largest Q value is executed.

Take α = 0.8 and γ = 0.98. The number of simulation runs is set to 500, with 2000 steps per run; the simulation results are shown in Fig. 8.

Claims

A Q-learning system based on a memristor crossbar array, comprising a memristor crossbar array, characterized in that the system further comprises

Read/write selection switch: controls the read/write operation of the memristor crossbar array;

State selection switch: the state detection module detects the current environment state s_t, and the state selection switch selects the corresponding row line;

Column selection switch: when a Q value, that is, the value of a particular memristor in the memristor crossbar array, needs to be updated, the column selection switch selects the column line corresponding to action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environment state and saves the previous environment state; when an action needs to be selected according to the state, the state detection module detects the current environment state and provides it to the state selection switch and the state control switch; after the action is executed, the state detection module detects the environment state at that moment, saves the previous environment state, and provides the new environment state to the state selection switch and the state control switch; when the Q value is updated, the state detection module outputs the environment state of the previous moment and provides it to the state selection switch, which selects the corresponding row line.