WO2022259434A1 - Computing device for reinforcement learning and reinforcement learning method - Google Patents

Computing device for reinforcement learning and reinforcement learning method

Info

Publication number
WO2022259434A1
Authority
WO
WIPO (PCT)
Prior art keywords
reinforcement learning
input
learning
information
output
Prior art date
Application number
PCT/JP2021/021969
Other languages
French (fr)
Japanese (ja)
Inventor
Mitsumasa Nakajima
Toshikazu Hashimoto
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2021/021969
Priority to JP2023526735A
Publication of WO2022259434A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to a reinforcement learning computing device equipped with an algorithm for reinforcement learning.
  • Deep neural networks have been reported to exhibit excellent learning performance in a wide range of fields such as game play, robotics, and Go, and many proposals have been made for highly generalized methods that can learn strategies for various tasks.
  • a state variable s_t that represents the state at time t
  • an action variable a_t that is an action that can be taken at that time
  • a reward R_t that is obtained as a result
  • the action value function Q(s_t, a_t) and the state value function V(s_t), which represent the value of the action and of the state
  • Reinforcement learning often handles time-series input information such as sensor data, so a recurrent neural network (hereinafter also referred to as "RNN") model such as an LSTM (long short-term memory) is often incorporated. Since the associated computational processing and learning costs are generally high, a reduction of these processing loads is desired.
  • reservoir computing (hereinafter also referred to as "RC"), a model proposed as a type of RNN
  • RC differs from general RNN in that the input layer and intermediate layer networks are fixed, and the only variable used for learning is the weighting coefficient of the output layer.
  • Such an RC method can greatly reduce the number of variables to be learned, and is therefore suitable for time-series learning that requires a large amount of data and high-speed processing.
  • RC generates most node-to-node connections by random mapping. When this random mapping is computed on a computer, it requires a large amount of computational resources as the number of nodes increases, which is one of the factors that increase the scale and cost of the computing system including the computer.
  • To solve this problem, research into and demonstration of a configuration called physical RC, in which the mapping is computed by a physical system, are in progress.
  • Physical RC enables efficient learning even on devices with limited computational resources because part of the computation can be outsourced to physical elements. Physical RC is described in Patent Document 2, for example.
  • Herbert Jaeger and Harald Haas (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, Vol. 304.
  • Gouhei Tanaka et al. (2019). Recent advances in physical reservoir computing: A review. Neural Networks, 115, 100-123.
  • The present disclosure aims to apply physical RC to reinforcement learning, which involves a large amount of computational processing, to effectively reduce computational resources, and to provide a reinforcement learning arithmetic device and a reinforcement learning method that perform highly efficient reinforcement learning.
  • To achieve the above object, a reinforcement learning arithmetic device according to one aspect of the present invention is an arithmetic device for reinforcement learning that learns a policy used when a target executes a task.
  • The device includes a computer system having an input layer to which input information about the state of the target is input, an intermediate layer that learns the policy of the target based on the input information input to the input layer, and an output layer that outputs output information related to the policy, and a behavior determination unit that outputs information about the behavior the target should take based on the output information output from the computer system; the computer system includes a physical medium that physically converts the input information and executes at least a part of the function of learning the policy of the target.
  • A computation method for reinforcement learning according to one aspect of the present invention is a computation method for reinforcement learning that learns a policy used when a target executes a task. For a computer system having an input layer to which information is input, an intermediate layer that performs computation based on the input information input to the input layer, and an output layer that outputs the result of the computation, the method includes a step of inputting input information about the state of the target to the input layer, a step in which the intermediate layer learns the policy of the target based on the input information, and a step of outputting information about the behavior of the target based on the information about the policy obtained by the learning. At least a part of the step of learning the policy of the target is performed by a physical medium that physically converts the input information.
  • According to the above aspects, it is possible to provide a reinforcement learning arithmetic device and a reinforcement learning method that apply physical RC to reinforcement learning, which involves a large amount of arithmetic processing, effectively reduce computational resources, and perform highly efficient reinforcement learning.
  • FIG. 1 is a diagram for explaining the form of a general RC model.
  • FIG. 2 is a diagram for explaining an RC model of a computing system having the computing device for reinforcement learning according to the first embodiment.
  • FIG. 3 is a schematic diagram of the configuration of the arithmetic unit used in the first embodiment.
  • FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit shown in FIG. 3.
  • FIGS. 5(a) and 5(b) are diagrams for explaining the arithmetic unit for reinforcement learning according to the second embodiment of the present invention.
  • FIG. 6 is a diagram for explaining the arithmetic unit according to the third embodiment of the present invention.
  • FIGS. 7(a) and 7(b) are graphs showing the state of learning in an example of the present invention.
  • FIG. 8(a) is a diagram for explaining the preprocessing unit of the example; FIG. 8(b) is a graph showing the state of learning in the example.
  • The first, second, and third embodiments of the present invention will be described below. Note that the first to third embodiments illustrate the technical idea, configuration, procedure, and the like of the present invention, and do not limit the specific configuration, conditions, parameters, and the like.
  • First, prior to describing the embodiments, reservoir computing (hereinafter also referred to as "RC") will be described.
  • FIG. 1 is a diagram for explaining the form of a general RC model.
  • The RC model 10 shown in FIG. 1 is composed of an input layer Ri in which an input signal, which is the input information, is coupled to each neuron, an intermediate layer Rr in which the neurons are coupled to each other, and an output layer Ro in which the signals of the neurons are summed and output. Coupling an input signal to each neuron is also referred to herein as "input".
  • Equation (1) is an equation for determining the input signal u(n) input to the input layer Ri.
  • Equation (2) is an equation for determining the output signal y(n) output from the output layer Ro when the input signal u(n) is input to the input layer Ri.
  • In Equations (1) and (2), N is the number of neurons
  • x_i(n) is the state of the i-th neuron at time step n
  • Ω_ij, m_i, η_i, and ω_i are coefficients representing, respectively, the mutual coupling between neurons, the coupling of the input signal to each neuron, the coupling of the feedback (FB) signal from the output signal to each neuron, and the coupling from each neuron to the output signal.
  • f(·) represents the nonlinear response of each neuron; tanh(·) and the like are frequently used.
  • A major difference between the RC network and a general recurrent neural network (hereinafter also referred to as "RNN") is that the networks of the input layer Ri and the intermediate layer Rr are fixed, and the only variables used for learning are the weighting coefficients of the output layer Ro. Since the RC method can greatly reduce the number of variables to be learned, it has a great advantage for time-series learning, which involves a large amount of data and requires high-speed processing. A minimal sketch of this update and readout in code is given below.
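  • The following is a minimal NumPy sketch of the update in Equations (1) and (2), assuming a tanh nonlinearity and treating all couplings as fixed random matrices; the sizes and names are illustrative and not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim_u, dim_y = 100, 4, 2                  # neurons, input dim, output dim (illustrative)

Omega = rng.standard_normal((N, N)) * 0.1    # fixed inter-neuron coupling
m     = rng.uniform(-1.0, 1.0, (N, dim_u))   # fixed input coupling
eta   = rng.uniform(-1.0, 1.0, (N, dim_y))   # fixed feedback coupling from the output
omega = np.zeros((dim_y, N))                 # output weights: the only trained variable

def step(x, u, y_prev):
    """One reservoir update (Eq. (1)) followed by the linear readout (Eq. (2))."""
    x_next = np.tanh(Omega @ x + m @ u + eta @ y_prev)
    y_next = omega @ x_next
    return x_next, y_next

x, y = np.zeros(N), np.zeros(dim_y)          # initial reservoir state and output
x, y = step(x, np.ones(dim_u), y)            # one time step with a dummy input
```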
  • FIG. 2 is a diagram for explaining the RC model 10 of the computing system having the computing device 112 for reinforcement learning according to the first embodiment.
  • the RC model 10 of the first embodiment has an agent 111 , a state measuring device 113 , an arithmetic device 112 and a behavior control device 114 .
  • the RC model 10 learns the policy when the agent 111 executes a certain task in the arithmetic device 112 .
  • The state of the agent 111 at time t is represented by s_t, and the agent 111 can take a plurality of actions a_t according to the state s_t.
  • The agent 111 obtains a reward R_t as a result of the action a_t. The state s_t is measured by the state measuring device 113. The action a_t is controlled by the action control device 114 so as to realize an action a_t in line with the policy.
  • the state s_t corresponds to sensor data (position, angle, acceleration, etc.) of each joint of the robot arm.
  • the action a_t corresponds to a control signal (amount of rotation, amount of displacement, etc.) for driving the robot arm.
  • the reward R_t is a virtual value obtained in software when the robot arm grasps an object or performs a desired action.
  • the state measuring device 113 corresponds to a device such as an ammeter or a voltmeter that acquires sensor data such as current or voltage
  • the behavior control device 114 corresponds to a driving device such as a voltage source or a current source.
  • the configuration shown in FIG. 2 can also be applied to learning games and shogi.
  • the state s_t is an image of a screen or board
  • the action a_t corresponds to a possible move.
  • the reward R_t corresponds to points or the like obtained by the action. Note that the method of the first embodiment can be applied regardless of the above task-dependent inputs and control amounts, and can be widely used for tasks other than those described above.
  • FIG. 3 is a schematic diagram of the configuration of the arithmetic unit 112 used in the first embodiment.
  • Computing unit 112 includes RC algorithm 211 and action decision algorithm 212 .
  • RC algorithm 211 has RC model 10 and performs reinforcement learning using an RC network.
  • the state s_t measured via the state measuring device 113 is regarded as the input signal u(t) at time t in Equation (1) and is input to the arithmetic device 112.
  • This signal undergoes the transformation described by Equation (2) by the RC algorithm 211 and the action decision algorithm 212, and is transformed into an intermediate state x(t) and a final output y(t).
  • the dimensions of the input and output of the RC model are set to be equal to the dimensions of the state s_t and the action a_t.
  • the first embodiment has an excellent effect of being able to learn with a small number of parameters compared to a normal deep learning model.
  • The RC algorithm 211 learns the variable ω of Equation (2) such that the output y(t) of the RC algorithm 211 becomes the action-value function Q(s_t, a_t), an index representing the value of the action a_t in the state s_t.
  • The action decision algorithm 212 decides the action a_t based on the output action-value function Q(s_t, a_t) and outputs it to the action control device 114 as information on the action a_t.
  • The behavior control device 114 controls the behavior of the agent 111 based on the action a_t.
  • the agent 111 corresponds to the target
  • the information about the state corresponds to the state s_t
  • the information about the policy corresponds to the action a_t.
  • The state s_t is input to the input layer Ri of the RC model 10, and in the intermediate layer Rr the agent's action a_t is learned based on the state s_t.
  • The action a_t obtained as a result of the learning is output to the action decision algorithm 212.
  • The action decision algorithm 212 determines the action a_t of the agent 111 from among the possible actions a_t and outputs it to the action control device 114.
  • The RC model 10 is a computational processing model simulating neurons, and the RC algorithm 211 is a program for realizing computational processing based on the RC model 10 and outputting the action-value function Q(s_t, a_t), which includes the reward R_t.
  • the behavior determination algorithm 212 is a program that determines the behavior of the agent 111 from the results of processing performed by the RC algorithm 211 .
  • Arithmetic device 112 is a concept that includes software including such a program and hardware for operating the program.
  • RC model 10 corresponds to a computer system.
  • The above configuration executes a computation method for reinforcement learning for learning the action a_t used when the agent 111 executes a task.
  • This computation method for reinforcement learning includes a step of inputting the state s_t to the input layer Ri of the RC model 10, a step of learning the action a_t of the agent 111 based on the state s_t, and a step of outputting information about the behavior of the agent 111 based on the action a_t obtained by the learning.
  • In the step of learning the action a_t of the agent 111, at least a part of the process of learning the policy for the action a_t is performed by the physical medium 115, which physically transforms the state s_t.
  • Learning is performed, for example, so as to minimize a TD (Temporal Difference) error L defined by Equation (3).
  • In Equation (3), γ is a discount rate, a hyperparameter indicating how much future rewards are discounted. Typically, a value smaller than and close to 1, such as 0.99, is used. Since the only learning variable is ω, the gradient is taken with respect to ω (Equation (4)), and ω is updated by Equation (5).
  • λ in Equation (5) is the learning rate.
  • Equation (5) is an update rule based on simple gradient descent, but various optimization algorithms used in the machine learning field, such as the Adam optimizer and stochastic gradient descent (SGD), can also be used.
  • In Equation (3), the TD error for one step is used, but the TD error over n steps may be used instead. In that case, the following Equation (6) is used as the cost function.
  • By setting n in Equation (6) to a value greater than 1, the stability and convergence of learning can be improved. Desirably, n is a value of 10 or less.
  • By considering the partial derivative of Equation (6) with respect to ω, the update rule can be derived in the same manner as Equations (5) and (6). A sketch of this TD-based update of ω in code is given below.
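  • A minimal sketch of that update in NumPy, assuming the readout y(t) = ω x(t) is a vector of Q-values over a discrete action set and using a semi-gradient step (the bootstrapped target is treated as a constant); the function name is illustrative and the factor 2 from the squared error is absorbed into the learning rate.

```python
import numpy as np

def td_update(omega, x_t, x_t1, a_t, R_t, gamma=0.99, lam=1e-3):
    """One-step TD update of the readout weights (sketch of Eqs. (3) to (5)).

    omega : (num_actions, N) readout matrix, the only learned variable
    x_t, x_t1 : reservoir states x(t) and x(t+1)
    a_t : index of the action taken at step t
    """
    q_t, q_t1 = omega @ x_t, omega @ x_t1           # Q(s_t, .) and Q(s_{t+1}, .)
    delta = R_t + gamma * np.max(q_t1) - q_t[a_t]   # TD error
    omega = omega.copy()
    omega[a_t] += lam * delta * x_t                 # gradient step on the chosen action's row
    return omega, delta
```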
  • the ⁇ -greedy method is used in the action decision algorithm.
  • the ⁇ -greedy method selects an action based on policy ⁇ with probability ⁇ , and selects the action with the highest action value function Q(s, a) calculated above with probability (1 ⁇ ).
  • the policy ⁇ is typically generated from a uniform probability distribution, and tries to uniformly choose all possible actions equally.
  • Arithmetic unit 112 requires a large amount of computational resources for the numerical computation described by equation (1), and therefore requires power and execution time as the scale of the network increases.
  • In the first embodiment, of the RC model shown in FIG. 3, the computation corresponding to Equation (1) is performed by the physical medium 115.
  • The physical medium 115 may, for example, receive the input information as light and convert physical parameters such as the amplitude, wavelength, and frequency of the input light. The physical medium 115 may also receive the input information as an electric signal and convert physical parameters such as its value and frequency. Furthermore, a liquid may be input as the input information and physical parameters such as its pressure and flow rate may be converted. Conversion is not limited to converting a numerical value without changing the type of parameter; it also includes changes to other parameters. Learning refers to accumulating or selecting information about actions so that an action a_t that is more likely to accomplish the task can be selected according to the state s_t.
  • FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit 112 of the first embodiment that performs arithmetic operations corresponding to equation (1) in the physical medium 115.
  • the arithmetic unit 112 shown in FIG. 4 includes signal converters 116 and 117 , a physical medium 115 , an electronic arithmetic unit 119 and a storage unit 118 .
  • the input information st from the state measuring device 113 is converted from digital information to analog physical signals via the signal conversion device 116 .
  • the converted physical signal is input to physical medium 115 .
  • the physical signal refers to, for example, current, voltage, light intensity, and the like.
  • The conversion into a physical signal is desirably performed using a signal scheme suited to the configuration of the physical medium 115. Signal propagation in the physical medium 115 is described by the physical laws governing each system, and it has been reported, for example in the literature cited above, that computation equivalent to Equation (1) can be performed in various physical systems.
  • the signal propagated through the physical medium 115 is measured by the signal conversion device 117 and converted back into digital information.
  • This measurement signal can be considered equivalent to x(t) in equation (1).
  • the measurement signal is held in a storage unit 118 such as a memory or hard disk.
  • the storage unit 118 also defines and stores programs and parameters for executing formulas (2) to (5) at the same time.
  • Information including the measurement signal, the programs, and the parameters is transferred to an electronic arithmetic unit 119 comprising a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like, and the electronic arithmetic unit 119 executes reinforcement learning by computing Equations (2) to (5).
  • the first embodiment has the excellent effect of being able to automatically calculate equation (1) during signal propagation.
  • With such a configuration, the first embodiment automatically executes the computation corresponding to Equation (1) in the physical medium 115 through the laws of physics, and can therefore mitigate or resolve the problem of increased computational resources. A sketch of this hybrid digital-physical signal flow is given below.
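  • A minimal sketch of the FIG. 4 signal flow, with the signal converters 116/117 and the physical medium 115 represented by placeholder callables (these stand-ins are assumptions for illustration); only the readout of Equation (2) is computed electronically.

```python
def reservoir_step_hybrid(s_t, dac, physical_medium, adc, omega):
    """Digital state -> analog drive -> physical propagation -> digital x(t) -> readout y(t)."""
    analog_in = dac(s_t)                       # signal conversion device 116: digital to physical signal
    analog_out = physical_medium(analog_in)    # propagation performs the Eq. (1) transformation
    x_t = adc(analog_out)                      # signal conversion device 117: measured back to digital
    y_t = omega @ x_t                          # Eq. (2) readout on the electronic arithmetic unit 119
    return x_t, y_t
```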
  • The physical medium 115 may be a known one; the first embodiment uses an optical circuit as the physical medium 115 to perform the computation.
  • Optical implementations of RC are described, for example, in Optica 5, 756-760 (hereinafter also referred to as "Reference 3"); in L. Larger et al. (2012), Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing, Optics Express 20, 3241 (hereinafter also referred to as "Reference 4"); and in M. Nakajima et al. (2021), Scalable reservoir computing on coherent linear photonic processor, Communications Physics 4, 20 (hereinafter also referred to as "Reference 5"). Since the first embodiment is capable of high-speed optical input/output exceeding 100 Gbit/s via optical communication devices, it can perform the computation at high speed.
  • In the first embodiment, the signal conversion device 116 is composed of a digital-to-analog converter (DAC) and an optical modulator; it converts the digital signal into optical intensity and optical phase and introduces it into the optical physical medium 115.
  • the signal conversion device 117 is composed of an optical receiver and an analog-to-digital converter (ADC).
  • the optical physical medium 115 may be, for example, a spatial optical system composed of a lens, a mirror, and a spatial modulator described in Reference 3, or may be configured using an optical fiber ring described in Reference 4. While such a configuration makes the device relatively large, as described in reference 5, it is also possible to integrate a small arithmetic circuit by using an optical integrated circuit.
  • Physical medium 115 is preferably a non-linear optical element to allow non-linear transformation.
  • the nonlinear element is realized, for example, by configuring the physical medium 115 with a nonlinear optical material (for example, LiNbO 3 or the like) or by using gain saturation of an optical amplifier.
  • It is preferable that the physical medium 115 be physically implemented such that the spectral radius (the maximum magnitude of the eigenvalues) of the matrix Ω in Equation (1) is in the range of 0.8 to 1.2. With such a configuration, the first embodiment can realize operation near the chaotic transition point and can improve the memory-holding performance of the RC.
  • When the physical medium 115 is an optical medium that does not contain a gain medium, operation near the chaotic transition corresponds to a configuration that reduces optical losses as much as possible. Note that if the physical medium 115 includes a gain medium such as an amplifier, the memory retention function can be improved by adjusting the gain. In a numerical model, the corresponding condition can be imposed by rescaling Ω, as sketched below.
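  • A minimal sketch of such rescaling in a numerical model, assuming a random Ω; the helper name and target value are illustrative.

```python
import numpy as np

def scale_to_spectral_radius(Omega, target=0.95):
    """Rescale a coupling matrix so that its spectral radius equals `target`."""
    radius = np.max(np.abs(np.linalg.eigvals(Omega)))
    return Omega * (target / radius)

# Operate the reservoir near, but below, the chaotic transition point.
rng = np.random.default_rng(0)
Omega = scale_to_spectral_radius(rng.standard_normal((100, 100)), target=0.95)
```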
  • the first embodiment is not limited to the physical medium 115 performing all of the functions equivalent to the calculation shown in Equation (1).
  • The first embodiment may use the physical medium 115 to perform only part of the computation of Equation (1).
  • part of the computation represented by Equation (1) may be performed by the electronic computation unit 119 .
  • part of the computation shown in Equation (1) may be performed in advance by the electronic computation unit 119 or the like.
  • Such pretreatment is described, for example, in Reference 4.
  • In this case, the processing of the second term of Equation (1) (input mask processing) is performed in advance, and the result is used as the input signal to the physical medium 115.
  • Such preprocessing facilitates the implementation of the physical medium 115 even when the dimensions of the state s_t and the intermediate state x(t) are large. A sketch of this masking step is given below.
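  • A minimal sketch of the input-mask preprocessing, assuming the mask is the fixed random input coupling m of Equation (1); the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim_s = 100, 4                             # reservoir and state dimensions (illustrative)
mask = rng.uniform(0.0, 1.0, (N, dim_s))      # fixed random input coupling m

def premask(s_t):
    """Compute the second term of Eq. (1), m u(t), electronically; the masked vector
    is then used as the drive signal for the physical medium."""
    return mask @ np.asarray(s_t)
```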
  • Another aspect of the physical medium 115 is, for example, a memristor such as ReRAM.
  • Such an aspect is a configuration that utilizes a physical phenomenon in which the resistance value changes according to the history of the current that has passed through the resistance cell.
  • DAC/ADC are used for the signal converters 116 and 117 . Since such a configuration is implemented by an analog electronic circuit, it has a high interface affinity and is advantageous for miniaturization.
  • Such a configuration is described, for example, in Yanan Zhong et al. (2021), Dynamic memristor-based reservoir computing for high-efficiency temporal signal processing, Nature Communications 12, 408.
  • Since a memristor can be formed by a semiconductor process, it is advantageous in that large-scale nodes can easily be configured. Furthermore, in addition to such configurations, magnetic circuits, fluids, soft materials, and the like can also be used as aspects of the physical medium 115. This point is summarized, for example, in Reference 1.
  • the first embodiment applies physical RC to reinforcement learning with a large amount of computational processing, and can effectively reduce computational resources. For this reason, it is possible to suppress an increase in the size and cost of a reinforcement learning computing device that executes reinforcement learning. Alternatively, when there is room in the computational resources due to the miniaturization of the computational device for reinforcement learning, it is possible to perform more advanced computational processing.
  • Such a first embodiment can provide a reinforcement learning computing device and a reinforcement learning method that perform highly efficient reinforcement learning.
  • FIGS. 5(a) and 5(b) are diagrams for explaining arithmetic units 512 and 513 for reinforcement learning according to the second embodiment.
  • FIG. 5A is a schematic diagram of the reinforcement learning arithmetic device 512 of the second embodiment.
  • FIG. 5(b) is a schematic diagram of an arithmetic device 513 that is a modification of the arithmetic device 512 shown in FIG. 5(a).
  • the computing device 512 differs from the computing device 112 of the first embodiment in that it includes an RC algorithm 311 having two RC models 10a and 10b and an action decision algorithm 312 that selects an action different from the action decision algorithm 212.
  • the signal flow in the arithmetic device 512 of the second embodiment shown in FIG. 5(a) is the same as that of the arithmetic device 112 of FIG. 2 described in the first embodiment.
  • the computing device 512 differs from the computing device 112 in the structure of the RC network and the learning method.
  • the state s t input from the state measuring device 113 is input to the two RC models 10a and 10b as the input signal u(t) in equation (1).
  • the dimensions of the inputs and outputs of the RC models 10a and 10b are set equal to the dimensions of the states s t and actions a t , respectively.
  • The only learning variables in the RC algorithm 311 are the output-layer weights ω_a and ω_c of the RC models 10a and 10b, respectively. This gives the RC algorithm 311 the advantage of being able to learn with a small number of parameters compared with ordinary deep learning models.
  • the RC models 10a, 10b each perform different processes called “Actors” and “Critics”.
  • The actor determines the possible course of action of the agent 111, and the critic estimates the value of the state by gathering information from the state s_t output from the state measuring device 113.
  • The RC algorithm 311 performs learning so that the output on the actor side becomes the policy function π(s_t, a_t) representing the policy, and the output on the critic side becomes the state value function V(s_t).
  • Unlike the first embodiment, the action determination algorithm 312 of the arithmetic device 512 selects the action of the agent 111 shown in FIG. 2 based on the policy π learned on the actor side. Action selection is performed, for example, so as to minimize the cost functions L_a and L_c on the actor side and the critic side, defined by the following Equations (7) and (8), respectively.
  • By considering the partial derivatives of Equations (7) and (8) with respect to ω_a and ω_c, the update rules can be derived in the same manner as Equations (5) and (6). In order to stabilize learning, a cost function that considers up to n steps ahead may be used for L_c, as in the following Equation (9).
  • By setting n to a value greater than 1 in Equation (9), the stability and convergence of learning can be improved. It is desirable to set n to a value of 10 or less. By considering the partial derivative of Equation (9) with respect to ω_c, the update rule can be derived in the same manner as Equations (5) and (6).
  • An arithmetic unit with such a configuration can be realized by connecting the two RC models 10a and 10b in parallel so that the action decision algorithm 312 decides the action of the agent 111 by combining the information on the policy and the information on the value. A sketch of such an actor-critic arrangement with two reservoir readouts is given below.
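  • A minimal sketch of the actor/critic readouts, assuming the actor output is turned into action probabilities by a softmax; the softmax is an assumption for illustration, since the text only states that the actor-side output represents the policy.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def actor_critic_outputs(x_actor, x_critic, omega_a, omega_c):
    """Actor readout -> action probabilities; critic readout -> state value V(s_t).

    x_actor, x_critic : reservoir states of the two RC models 10a and 10b for the same s_t
    omega_a : (num_actions, N_a) actor readout; omega_c : (N_c,) critic readout
    """
    policy = softmax(omega_a @ x_actor)        # probabilities over the possible actions
    value = float(omega_c @ x_critic)          # estimated state value V(s_t)
    return policy, value

def sample_action(policy, rng=np.random.default_rng()):
    return int(rng.choice(len(policy), p=policy))
```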
  • FIG. 6 is a diagram for explaining the arithmetic unit 612 of the third embodiment.
  • The arithmetic device 612 has the RC algorithm 211 and the action determination algorithm 212 shown in FIG. 3, together with a preprocessing unit 613.
  • The preprocessing unit 613 has a function of performing preprocessing using a random convolution layer and a pooling layer, and converts the state s_t output from the state measuring device 113 into a state s_t'.
  • The RC algorithm 211 performs the RC model computation based on the state s_t'.
  • the preprocessing unit 613 has two layers, a kernel filter that performs convolution and a kernel filter that performs pooling.
  • Convolution is a process of compressing the pixel values of a small region of an image into a smaller representation, treating that region as one feature, and thereby reducing the amount of image data.
  • Pooling is a process of thinning out images to reduce the amount of data.
  • the coefficients of the kernel filter in the random convolution layer are generated, for example, by a random number table in the interval [-1:1], and are not learned thereafter.
  • the RC algorithm 211 can process state s t' in the same way that the first and second embodiments process state s t .
  • Because the dimensions are compressed in advance by the convolutional layer, the amount of processing in the RC algorithm 211 can be reduced. A sketch of fixed random-convolution preprocessing is given below.
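  • A minimal single-channel sketch of fixed (untrained) random-convolution preprocessing followed by pooling, with kernel coefficients drawn from the interval [-1:1] as stated above; the image size, stride, and pooling size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.uniform(-1.0, 1.0, (8, 8))       # random convolution kernel, never trained

def conv2d_valid(img, k, stride):
    """Unpadded 2-D convolution (correlation) with the given stride."""
    kh, kw = k.shape
    H = (img.shape[0] - kh) // stride + 1
    W = (img.shape[1] - kw) // stride + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(img[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
    return out

def max_pool(img, size=2):
    """Thin out the image by taking the maximum over non-overlapping size x size blocks."""
    H, W = img.shape[0] // size, img.shape[1] // size
    return img[:H*size, :W*size].reshape(H, size, W, size).max(axis=(1, 3))

s_t = rng.random((84, 84))                                          # a raw input image (illustrative)
s_t_prime = max_pool(conv2d_valid(s_t, kernel, stride=4)).ravel()   # compressed state s_t'
```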
  • An optical medium is adopted as the physical medium of the RC model, and tanh() based on the saturation behavior of a semiconductor optical amplifier (SOA) is used as a nonlinear function.
  • The input coupling m is, for example, calculated in advance before being introduced into the physical medium 115, and is generated from uniform random numbers on the interval [0:1]. The physical medium 115 is assumed to be constructed on a fiber basis, and Ω, which corresponds to the mutual coupling within the medium, reflects the ring topology.
  • Such a medium is described, for example, in Francois Duport et al. (2012), All-optical reservoir computing, Optics Express, Vol. 20, No. 20, 22783-22795.
  • the return light gain ⁇ in the ring is approximately equal to the spectral radius of ⁇ .
  • FIGS. 7(a) and 7(b) are graphs showing the state of learning in this embodiment.
  • the horizontal axis indicates the number of trials, and the vertical axis indicates the number of steps in which the stick was not knocked over.
  • a solid line in FIG. 7A indicates learning using the RC model of this embodiment.
  • the dashed line shows the results of similar training using a 3-layer fully-connected neural network for comparison.
  • Even while the number of trials is still small, the number of steps for which the pole is kept from falling increases, and convergence is faster than when a fully connected neural network is used.
  • the only learning variable is ⁇ , and the other variables are processed by the physical medium 115, which will be described later, so learning is possible with a small number of parameters.
  • FIG. 7(b) shows the relationship between the return light gain ⁇ in the ring and the average number of trials required for convergence.
  • convergence is defined as five consecutive successes in holding the bar up to 200 steps without falling.
  • When α is 1.2 or more, the number of trials until convergence increases, whereas when α is between 0.6 and 1.0 the number of trials required for convergence is small and stable.
  • This reflects the tendency that when α exceeds 1.2 the behavior of the optical medium becomes chaotic, while the performance of the RC algorithm increases near the chaotic transition point of the optical medium. It is therefore desirable to set the spectral radius as large as possible within a range that does not reach the chaotic transition point.
  • the computer game Pong (manufactured by Atari) is used as a task, and the object is to compete against a computer and win.
  • the state st is preprocessed by the preprocessing unit described in the third embodiment and input to the RC model.
  • FIG. 8A is a diagram for explaining the preprocessing unit 613 used in this embodiment.
  • FIG. 8(b) is a graph showing the state of learning.
  • The preprocessing unit 613 has a plurality of intermediate layers 3130, 3131, 3133 formed by convolution filters and a pooling layer 3135. After the convolution filters, an arithmetic unit with the RC model 10 is provided.
  • an 8 ⁇ 8 kernel filter 3137 performs convolution processing on 84 ⁇ 84 ⁇ 4 input data while sliding by 4 ⁇ 4 pixels.
  • a 4 ⁇ 4 kernel filter 3132 performs convolution processing by sliding 20 ⁇ 20 ⁇ 32 data by 2 ⁇ 2 pixels.
  • a 3 ⁇ 3 kernel filter 3134 slides 9 ⁇ 9 ⁇ 64 data by 1 ⁇ 1 pixels and performs convolution processing.
  • a feature map is output each time convolution processing is performed.
  • the feature maps are pixel-reduced by a pooling layer 3135 and then combined to form a combined layer 3136 .
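  • The feature-map sizes implied by the kernel sizes and strides above follow from the usual unpadded-convolution formula; a quick check, assuming no padding (the channel counts are those stated above).

```python
def conv_out(size, kernel, stride):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

s = 84
s = conv_out(s, 8, 4)   # 84x84x4  -> 20x20 feature maps (32 channels)
s = conv_out(s, 4, 2)   # 20x20x32 ->  9x9  feature maps (64 channels)
s = conv_out(s, 3, 1)   #  9x9x64  ->  7x7  feature maps
print(s)                # 7
```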

Abstract

The purpose of the present invention is to provide a computing device for reinforcement learning and a reinforcement learning method that apply physical reservoir computing (RC) to reinforcement learning, which has a high computation processing load, effectively reduce computation resources, and execute highly efficient reinforcement learning. To this end, the computing device (112) for reinforcement learning according to the present invention comprises an RC algorithm having an input layer for inputting a subject state s_t, an intermediate layer for learning a subject action a_t on the basis of the input information input to the input layer, and an output layer for outputting output information relating to the action a_t, and an action determination algorithm for outputting information relating to the subject's action on the basis of the action a_t output from the RC algorithm; the RC algorithm includes a physical medium (115) that executes at least some of the functions of physically converting the state s_t and learning the subject's policy.

Description

Arithmetic device for reinforcement learning and reinforcement learning method

The present invention relates to a reinforcement learning computing device equipped with an algorithm for reinforcement learning.

In recent years, research and development of deep reinforcement learning using deep neural networks (DNN: Deep Neural Network) has progressed. Deep neural networks have been reported to exhibit excellent learning performance in a wide range of fields such as game play, robotics, and Go, and many highly generalizable methods that can learn policies for various tasks have been proposed. Generally, in deep reinforcement learning, a state variable s_t representing the state at time t, an action variable a_t representing an action that can be taken at that time, and a reward R_t obtained as a result are used; the action value function Q(s_t, a_t) and the state value function V(s_t), which represent the value of the action and of the state in a given state, are calculated as the output of a neural network, and the policy is determined. Reinforcement learning often handles time-series input information such as sensor data, so a recurrent neural network (hereinafter also referred to as "RNN") model such as an LSTM (long short-term memory) is often incorporated. Since the associated computational processing and learning costs are generally heavy, a reduction of these processing loads is desired.
In recent years, a model called reservoir computing (hereinafter also referred to as "RC") has been proposed as a type of RNN. RC consists of an input layer in which the input information is coupled to each neuron, an intermediate layer in which the neurons are coupled to each other, and an output layer that sums and outputs the signals of the neurons. Such a configuration is described, for example, in Patent Document 1.

RC differs from a general RNN in that the networks of the input layer and the intermediate layer are fixed and the only variables used for learning are the weighting coefficients of the output layer. Such an RC method can greatly reduce the number of variables to be learned and is therefore suitable for time-series learning, which involves a large amount of data and requires high-speed processing. RC generates most node-to-node connections by random mapping. When this random mapping is computed on a computer, it requires a large amount of computational resources as the number of nodes increases, which is one of the factors that increase the scale and cost of the computing system including the computer. To solve this problem, research into and demonstration of a configuration called physical RC, in which the mapping is computed by a physical system, are in progress. Physical RC enables efficient learning even on devices with limited computational resources because part of the computation can be outsourced to physical elements. Physical RC is described in Patent Document 2, for example.

The present disclosure aims to apply physical RC to reinforcement learning, which involves a large amount of computational processing, to effectively reduce computational resources, and to provide a reinforcement learning arithmetic device and a reinforcement learning method that perform highly efficient reinforcement learning.
To achieve the above object, a reinforcement learning arithmetic device according to one aspect of the present invention is an arithmetic device for reinforcement learning that learns a policy used when a target executes a task. The device includes a computer system having an input layer to which input information about the state of the target is input, an intermediate layer that learns the policy of the target based on the input information input to the input layer, and an output layer that outputs output information related to the policy; and a behavior determination unit that outputs information about the behavior the target should take based on the output information output from the computer system. The computer system includes a physical medium that physically converts the input information and executes at least a part of the function of learning the policy of the target.

A computation method for reinforcement learning according to one aspect of the present invention is a computation method for reinforcement learning that learns a policy used when a target executes a task. For a computer system having an input layer to which information is input, an intermediate layer that performs computation based on the input information input to the input layer, and an output layer that outputs the result of the computation, the method includes a step of inputting input information about the state of the target to the input layer, a step in which the intermediate layer learns the policy of the target based on the input information, and a step of outputting information about the behavior of the target based on the information about the policy obtained by the learning. At least a part of the step of learning the policy of the target is performed by a physical medium that physically converts the input information.

According to the above aspects, it is possible to provide a reinforcement learning arithmetic device and a reinforcement learning method that apply physical RC to reinforcement learning, which involves a large amount of arithmetic processing, effectively reduce computational resources, and perform highly efficient reinforcement learning.
FIG. 1 is a diagram for explaining the form of a general RC model. FIG. 2 is a diagram for explaining an RC model of a computing system having the computing device for reinforcement learning according to the first embodiment. FIG. 3 is a schematic diagram of the configuration of the arithmetic unit used in the first embodiment. FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit shown in FIG. 3. FIGS. 5(a) and 5(b) are diagrams for explaining the arithmetic unit for reinforcement learning according to the second embodiment of the present invention. FIG. 6 is a diagram for explaining the arithmetic unit according to the third embodiment of the present invention. FIGS. 7(a) and 7(b) are graphs showing the state of learning in an example of the present invention. FIG. 8(a) is a diagram for explaining the preprocessing unit of the example, and FIG. 8(b) is a graph showing the state of learning in the example.

The first, second, and third embodiments of the present invention will be described below. Note that the first to third embodiments illustrate the technical idea, configuration, procedure, and the like of the present invention, and do not limit the specific configuration, conditions, parameters, and the like. First, prior to describing the first to third embodiments, reservoir computing (hereinafter also referred to as "RC") will be described.
FIG. 1 is a diagram for explaining the form of a general RC model. The RC model 10 shown in FIG. 1 is composed of an input layer Ri in which an input signal, which is the input information, is coupled to each neuron, an intermediate layer Rr in which the neurons are coupled to each other, and an output layer Ro in which the signals of the neurons are summed and output. Coupling an input signal to each neuron is also referred to herein as "input". Equation (1) determines the input signal u(n) input to the input layer Ri, and Equation (2) determines the output signal y(n) output from the output layer Ro when the input signal u(n) is input to the input layer Ri.

[Equations (1) and (2)]

In Equations (1) and (2), N is the number of neurons, x_i(n) is the state of the i-th neuron at time step n, and Ω_ij, m_i, η_i, and ω_i are coefficients representing, respectively, the mutual coupling between neurons, the coupling of the input signal to each neuron, the coupling of the feedback (FB) signal from the output signal to each neuron, and the coupling from each neuron to the output signal. f(·) represents the nonlinear response of each neuron; tanh(·) and the like are frequently used. A major difference between the RC network and a general recurrent neural network (hereinafter also referred to as "RNN") is that the networks of the input layer Ri and the intermediate layer Rr are fixed and the only variables used for learning are the weighting coefficients of the output layer Ro. Since the RC method can greatly reduce the number of variables to be learned, it has a great advantage for time-series learning, which involves a large amount of data and requires high-speed processing.
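The patent's equations appear as images. Based on the coefficient definitions just given, a standard echo-state form consistent with those definitions would read as follows (a reconstruction under that assumption, not necessarily the patent's exact notation):

$$x_i(n+1) = f\left(\sum_{j=1}^{N} \Omega_{ij}\,x_j(n) + m_i\,u(n) + \eta_i\,y(n)\right) \quad (1)$$

$$y(n) = \sum_{i=1}^{N} \omega_i\,x_i(n) \quad (2)$$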
[First Embodiment]
(Reinforcement learning by RC (Q-learning))
FIG. 2 is a diagram for explaining the RC model 10 of the computing system having the computing device 112 for reinforcement learning according to the first embodiment. The RC model 10 of the first embodiment has an agent 111, a state measuring device 113, an arithmetic device 112, and a behavior control device 114. The RC model 10 learns, in the arithmetic device 112, the policy used when the agent 111 executes a certain task. The state of the agent 111 at time t is represented by s_t, and the agent 111 can take a plurality of actions a_t according to the state s_t. The agent 111 obtains a reward R_t as a result of the action a_t. The state s_t is measured by the state measuring device 113, and the action a_t is controlled by the behavior control device 114 so as to realize an action a_t in line with the policy.
As a specific example of the above configuration, a model for controlling a multi-joint robot arm can be considered. In the control of a robot arm, the state s_t corresponds to sensor data (position, angle, acceleration, etc.) of each joint of the robot arm. The action a_t corresponds to a control signal (amount of rotation, amount of displacement, etc.) for driving the robot arm. The reward R_t is a virtual value obtained in software when the robot arm grasps an object or performs a desired action. In a system that controls a robot arm, the state measuring device 113 corresponds to a device such as an ammeter or a voltmeter that acquires sensor data such as current or voltage, and the behavior control device 114 corresponds to a driving device such as a voltage source or a current source.

The configuration shown in FIG. 2 can also be applied to learning games and shogi. In such a learning system, the state s_t is an image of the screen or the board, and the action a_t corresponds to a possible move. The reward R_t corresponds to points or the like obtained by the action. Note that the method of the first embodiment is applicable regardless of such task-dependent inputs and control amounts, and can be widely used for tasks other than those described above.
The computing device 112 determines the behavior of the agent 111 based on the information of the state s_t, the action a_t, and the reward R_t. The method is described below. FIG. 3 is a schematic diagram of the configuration of the arithmetic unit 112 used in the first embodiment. The computing unit 112 includes an RC algorithm 211 and an action decision algorithm 212. The RC algorithm 211 has the RC model 10 and performs reinforcement learning using an RC network. In the first embodiment, the state s_t measured via the state measuring device 113 is regarded as the input signal u(t) at time t in Equation (1) and is input to the arithmetic device 112. This signal undergoes the transformation described by Equation (2) by the RC algorithm 211 and the action decision algorithm 212 and is transformed into an intermediate state x(t) and a final output y(t). Here, the dimensions of the input and output of the RC model are set to be equal to the dimensions of the state s_t and the action a_t.

The only learning variable is ω; the other variables m and Ω are generated according to the configuration of the physical medium 115 (FIG. 4) described later and are not learned. As a result, the first embodiment has the excellent effect of being able to learn with a small number of parameters compared with an ordinary deep learning model. In the first embodiment, the RC algorithm 211 learns the variable ω of Equation (2) such that the output y(t) of the RC algorithm 211 becomes the action-value function Q(s_t, a_t), an index representing the value of the action a_t in the state s_t. The action decision algorithm 212 decides the action a_t based on the output action-value function Q(s_t, a_t) and outputs it to the behavior control device 114 as information on the action a_t. The behavior control device 114 controls the behavior of the agent 111 based on the action a_t.
In the configuration described above, the agent 111 corresponds to the target, the information about the state corresponds to the state s_t, and the information about the policy corresponds to the action a_t. The state s_t is input to the input layer Ri of the RC model 10, and in the intermediate layer Rr the agent's action a_t is learned based on the state s_t. The action a_t obtained as a result of the learning is output to the action decision algorithm 212. The action decision algorithm 212 determines the action a_t of the agent 111 from among the possible actions a_t and outputs it to the behavior control device 114. The RC model 10 is a computational processing model simulating neurons, and the RC algorithm 211 is a program for realizing computational processing based on the RC model 10 and outputting the action-value function Q(s_t, a_t), which includes the reward R_t. The behavior determination algorithm 212 is a program that determines the behavior of the agent 111 from the results of processing performed by the RC algorithm 211. The arithmetic device 112 is a concept that includes software including such programs and hardware for operating the programs. In the first embodiment, the RC model 10 corresponds to the computer system.

The above configuration also executes a computation method for reinforcement learning for learning the action a_t used when the agent 111 executes a task. This computation method for reinforcement learning includes, for the RC model 10, a step of inputting the state s_t to the input layer Ri, a step of learning the action a_t of the agent 111 based on the state s_t, and a step of outputting information about the behavior of the agent 111 based on the action a_t obtained by the learning. In the step of learning the action a_t of the agent 111, at least a part of the process of learning the policy for the action a_t is performed by the physical medium 115, which physically transforms the state s_t.
Learning is performed, for example, so as to minimize a TD (Temporal Difference) error L defined by the following Equation (3).

[Equation (3)]

In Equation (3), γ is the discount rate, a hyperparameter indicating how much future rewards are discounted. Typically, a value smaller than and close to 1, such as 0.99, is used. Since the only learning variable is ω, the following Equation (4) holds.

[Equation (4)]
From this, ω is updated along its gradient by, for example, the following Equation (5).

[Equation (5)]

Here, λ in Equation (5) is the learning rate. Equation (5) is an update rule based on simple gradient descent, but various optimization algorithms used in the machine learning field, such as the Adam optimizer and stochastic gradient descent (SGD), can also be used. In Equation (3), the TD error for one step was used, but the TD error over n steps may be used instead; in that case, the following Equation (6) is used as the cost function.

[Equation (6)]
By setting n in Equation (6) to a value greater than 1, the stability and convergence of learning can be improved. Desirably, n is a value of 10 or less. By considering the partial derivative of Equation (6) with respect to ω, the update rule can be derived in the same manner as Equations (5) and (6).
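The patent's Equations (3) to (6) also appear as images. A common one-step Q-learning form consistent with the surrounding description (a squared TD error, ω as the only learned variable, λ as the learning rate, and an n-step variant for Equation (6)) would read as follows (a reconstruction under those assumptions, not necessarily the patent's exact expressions):

$$L = \bigl(R_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\bigr)^2 \quad (3)$$

$$\frac{\partial L}{\partial \omega} = -2\,\bigl(R_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\bigr)\,\frac{\partial Q(s_t, a_t)}{\partial \omega} \quad (4)$$

$$\omega \leftarrow \omega - \lambda\,\frac{\partial L}{\partial \omega} \quad (5)$$

$$L_n = \Bigl(\sum_{k=0}^{n-1} \gamma^{k} R_{t+k} + \gamma^{n} \max_{a} Q(s_{t+n}, a) - Q(s_t, a_t)\Bigr)^2 \quad (6)$$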
In general deep reinforcement learning, experience replay, in which the data series is sampled at random, is often used for sampling the training data; however, the first embodiment uses an RC model, which is a kind of RNN, so such a method cannot be used. To avoid this, it is possible to use a method of inputting the time-series data (episode) of one trial as it is, a sampling method of extracting data of a certain length from within an episode, or a method of specifying a certain time within an episode and reproducing the internal state with the preceding data. The method of inputting the time-series data (episode) of one trial as it is and the sampling method of extracting data of a certain length from within an episode are described, for example, in Matthew Hausknecht and Peter Stone (11 Jan 2017), Deep recurrent Q-learning for partially observable MDPs, The University of Texas at Austin (hereinafter also referred to as "Reference 1"). The method of specifying a certain time within an episode and reproducing the internal state with the preceding data is described, for example, in Steven Kapturowski et al. (2019), Recurrent experience replay in distributed reinforcement learning, published as a conference paper at ICLR (hereinafter also referred to as "Reference 2").
For action selection, the ε-greedy method, for example, is used. The ε-greedy method selects an action according to a policy π with probability ε, and with probability (1−ε) selects the action with the highest action value function Q(s, a) computed as described above. The policy π is typically generated from a uniform probability distribution so that all possible actions are selected with equal probability. The arithmetic unit 112 requires substantial computational resources for the numerical computation described by equation (1), and therefore demands more power and execution time as the network grows. In the first embodiment, of the RC model in FIG. 3, the computation corresponding to equation (1) is performed by the physical medium 115.
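A minimal sketch of the ε-greedy selection described above, assuming the Q-values have already been computed from the reservoir readout; the uniform policy π is realized here by a uniform random draw over the actions.

    import numpy as np

    def epsilon_greedy(q_values, epsilon, rng=None):
        """With probability epsilon, act according to the uniform policy pi;
        otherwise take the action with the highest action value Q(s, a)."""
        rng = rng or np.random.default_rng()
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))   # uniform policy pi over all actions
        return int(np.argmax(q_values))               # greedy action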
The physical medium 115 may, for example, receive input information as light and transform physical parameters of the input light such as its amplitude, wavelength, or frequency. The physical medium 115 may instead receive input information as an electrical signal and transform physical parameters such as its value or frequency. Alternatively, a liquid may be used as the input, with physical parameters such as pressure or flow rate being transformed. The transformation is not limited to converting a numerical value without changing the type of parameter; it also includes conversion into other parameters. Learning refers to accumulating or selecting information about actions so that an action a_t that is more likely to accomplish the task can be selected according to the state s_t.
FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit 112 of the first embodiment, in which the computation corresponding to equation (1) is performed by the physical medium 115. The arithmetic unit 112 shown in FIG. 4 includes signal conversion devices 116 and 117, the physical medium 115, an electronic arithmetic unit 119, and a storage unit 118. The input information s_t from the state measuring device 113 is converted from digital information into an analog physical signal via the signal conversion device 116. The converted physical signal is input to the physical medium 115. Here, a physical signal refers to, for example, a current, a voltage, or an optical intensity. The conversion into a physical signal is desirably performed with a signal scheme suited to the configuration of the physical medium 115. Signal propagation in the physical medium 115 is described by the physical laws governing each system, and it has been reported, for example in Non-Patent Literature 1 cited above, that computations equivalent to equation (1) can be carried out in a variety of physical systems.
The signal that has propagated through the physical medium 115 is measured by the signal conversion device 117 and converted back into digital information. This measured signal can be regarded as equivalent to x(t) in equation (1). The measured signal is held in the storage unit 118, which comprises a memory, a hard disk, or the like. The storage unit 118 also defines and stores the programs and parameters for executing equations (2) through (5). The information, including the measured signal, programs, and parameters, is transferred to the electronic arithmetic unit 119, which comprises a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like, and the electronic arithmetic unit 119 executes reinforcement learning by computing equations (2) through (5). With this configuration, the first embodiment has the advantage that equation (1) is computed automatically during signal propagation. Moreover, because the computation corresponding to equation (1) is carried out automatically in the physical medium 115 by the laws of physics, the first embodiment can reduce or resolve the problem of increasing computational resources.
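The division of labor described above can be sketched as follows, with the physical medium replaced by a numerical stand-in; in the actual device the step() call corresponds to the DAC/modulator (116), propagation through the medium 115, and the receiver/ADC (117), while the readout and learning of equations (2) to (5) remain on the electronic side. The state dimensions and couplings below are placeholders.

    import numpy as np

    class ReservoirStandIn:
        """Numerical stand-in for the physical medium 115.  In hardware, step()
        corresponds to: DAC/modulator (116) -> propagation in the medium ->
        receiver/ADC (117), which returns the measured state x(t)."""

        def __init__(self, n_nodes=64, n_inputs=4, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.uniform(-1, 1, (n_nodes, n_inputs))                 # fixed input coupling
            self.W = rng.normal(0, 1 / np.sqrt(n_nodes), (n_nodes, n_nodes))    # fixed internal coupling
            self.x = np.zeros(n_nodes)

        def step(self, u_t):
            self.x = np.tanh(self.W @ self.x + self.W_in @ u_t)
            return self.x

    # one interaction step: the medium produces x(t); the electronic unit (119)
    # evaluates the readout and performs the learning of equations (2)-(5).
    reservoir = ReservoirStandIn()
    omega = np.zeros((2, 64))                  # readout weights, the only trained variables
    s_t = np.array([0.0, 0.1, 0.0, -0.1])      # hypothetical 4-dimensional state
    x_t = reservoir.step(s_t)                  # the "equation (1)" side
    q_t = omega @ x_t                          # readout computed electronically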
Next, the physical medium 115 of the first embodiment will be described. The physical medium 115 may be a known one; the first embodiment uses a configuration in which an optical circuit serves as the physical medium 115 for the computation. Computation using such optical circuits is described, for example, in J. Bueno et al., "Reinforcement learning in a large-scale photonic recurrent neural network," Optica 5, 756-760 (2018) (hereinafter also referred to as "Reference 3"); L. Larger et al., "Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing," Optics Express 20, 3241 (2012) (hereinafter also referred to as "Reference 4"); and M. Nakajima et al., "Scalable reservoir computing on coherent linear photonic processor," Communications Physics 4, 20 (2021) (hereinafter also referred to as "Reference 5"). Because high-speed optical input and output exceeding 100 Gbit/s is possible via optical communication devices, the first embodiment can perform the computation at high speed.
To input and output light at high speed, the signal conversion device 116 comprises a digital-to-analog converter (DAC) and an optical modulator; it converts the digital signal into optical intensity or optical phase and introduces it into the optical physical medium 115. To convert the output light back into a digital signal, the signal conversion device 117 comprises an optical receiver and an analog-to-digital converter (ADC). The optical physical medium 115 may be, for example, a spatial optical system composed of lenses, mirrors, and spatial modulators as described in Reference 3, or a configuration using an optical fiber ring as described in Reference 4. While such configurations make the apparatus relatively large, it is also possible, as described in Reference 5, to integrate a compact arithmetic circuit by using a photonic integrated circuit.
The physical medium 115 is preferably a nonlinear optical element so that nonlinear transformation is possible. The nonlinear element is realized, for example, by forming the physical medium 115 from a nonlinear optical material (for example, LiNbO3) or by exploiting the gain saturation of an optical amplifier. In the first embodiment, the physical medium 115 is preferably implemented physically so that the spectral radius (the maximum absolute value of the eigenvalues) of the matrix Ω in equation (1) falls in the range of 0.8 to 1.2. With this configuration, the first embodiment can operate near the chaotic transition point and can enhance the memory-retention performance of the RC. When the physical medium 115 is an optical medium containing no gain medium, operation near the chaotic transition corresponds to operation achieved by a configuration that reduces optical loss as much as possible. When the physical medium 115 includes a gain medium such as an amplifier, the memory-retention capability can be improved by adjusting the gain.
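As a numerical sketch of the spectral-radius condition above: in simulation the coupling matrix can simply be rescaled to a target spectral radius, whereas in the physical implementation the same condition corresponds to adjusting loss or gain. The matrix size and distribution below are placeholders.

    import numpy as np

    def scale_to_spectral_radius(W, target):
        """Rescale a coupling matrix so that its spectral radius (the maximum
        absolute value of its eigenvalues) equals `target`, e.g. 0.8-1.2."""
        rho = np.max(np.abs(np.linalg.eigvals(W)))
        return W * (target / rho)

    rng = np.random.default_rng(0)
    Omega = scale_to_spectral_radius(rng.normal(size=(64, 64)), target=1.0)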
The first embodiment is not limited to performing all of the functions equivalent to the computation of equation (1) in the physical medium 115. For example, only part of the result given by equation (1) may be obtained by the physical medium 115, and part of the computation represented by equation (1) may be performed by the electronic arithmetic unit 119. More specifically, part of the computation in equation (1) may, for example, be carried out in advance by the electronic arithmetic unit 119 or the like. Such preprocessing is described, for example, in Reference 4. In the preprocessing described in Reference 4, the second term of equation (1) (the input mask processing) is computed in advance, and the result is used as the input signal to the physical medium 115. With such preprocessing, the physical medium 115 remains easy to implement even when the state s_t and the intermediate state x(t) have high dimensionality.
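The following sketch illustrates that kind of input-mask preprocessing under the assumption that the second term of equation (1) is a fixed input coupling applied to u(t); the mask values and dimensions are placeholders. Only this low-cost product is evaluated electronically, and its result is what gets converted into the drive signal for the physical medium 115.

    import numpy as np

    rng = np.random.default_rng(0)
    n_nodes, n_inputs = 64, 4
    input_mask = rng.uniform(0, 1, (n_nodes, n_inputs))   # fixed coupling, never trained

    def precompute_input_term(u_t):
        """Evaluate the input-coupling term of equation (1) electronically in
        advance; the result is then converted into the physical drive signal."""
        return input_mask @ u_t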
Another form of the physical medium 115 is, for example, a memristor such as ReRAM. This form exploits the physical phenomenon in which a resistance value changes according to the history of the current that has passed through the resistive cell. In this configuration, a DAC and an ADC are used for the signal conversion devices 116 and 117. Because such a configuration is implemented with analog electronic circuits, it has high interface affinity and is advantageous for miniaturization. Such a configuration is described, for example, in Yanan Zhong et al., "Dynamic memristor-based reservoir computing for high-efficiency temporal signal processing," Nature Communications 12, 408 (2021) (hereinafter referred to as "Reference 6"). Memristors are also advantageous in that they can be formed by semiconductor processes and therefore lend themselves to large numbers of nodes. In addition to these configurations, magnetic circuits, fluids, soft materials, and the like can also be used as forms of the physical medium 115. This point is summarized, for example, in Reference 1.
As described above, the first embodiment applies physical RC to reinforcement learning, which involves a large amount of computation, and can effectively reduce the required computational resources. This suppresses the increase in size and cost of an arithmetic unit for reinforcement learning. Alternatively, when the miniaturization of the arithmetic unit for reinforcement learning leaves computational resources to spare, more advanced processing can be executed. The first embodiment can thus provide an arithmetic unit and a method for reinforcement learning that execute reinforcement learning with high efficiency.
[Second embodiment]
(Reinforcement learning by RC (Actor-critic type))
Next, a second embodiment of the present invention will be described. The arithmetic unit for reinforcement learning of the second embodiment differs from the first embodiment, in which the policy π was generated with uniform probability and not learned (off-policy learning), in that the policy π itself is also learned (on-policy learning). The second embodiment, which performs on-policy learning, allows faster convergence to a solution than the first embodiment.
FIGS. 5(a) and 5(b) are diagrams for explaining the arithmetic units 512 and 513 for reinforcement learning of the second embodiment. FIG. 5(a) is a schematic diagram of the reinforcement learning arithmetic unit 512 of the second embodiment. FIG. 5(b) is a schematic diagram of an arithmetic unit 513 that is a modification of the arithmetic unit 512 shown in FIG. 5(a). The arithmetic unit 512 differs from the arithmetic unit 112 of the first embodiment in that it includes an RC algorithm 311 having two RC models 10a and 10b, and an action determination algorithm 312 that selects actions differently from the action determination algorithm 212.
The signal flow in the arithmetic unit 512 of the second embodiment shown in FIG. 5(a) is the same as that of the arithmetic unit 112 of FIG. 2 described in the first embodiment. However, the arithmetic unit 512 differs from the arithmetic unit 112 in the configuration of the RC network and in the learning method. In the configuration shown in FIG. 5(a), the state s_t input from the state measuring device 113 is fed as the input signal u(t) in equation (1) to both RC models 10a and 10b. The input and output dimensions of the RC models 10a and 10b are set equal to the dimensions of the state s_t and the action a_t, respectively. The only learning variables in the RC algorithm 311 are the output-layer weights ω_a and ω_c of the RC models 10a and 10b. The RC algorithm 311 therefore has the advantage of being able to learn with fewer parameters than an ordinary deep learning model. The RC models 10a and 10b execute different processes called the "actor" and the "critic," respectively.
Here, the actor determines the policy that the agent 111 can take, and the critic estimates the value of the state by gathering information from the state s_t output by the state measuring device 113. The RC algorithm 311 performs learning so that the actor-side output becomes the policy π(s_t, a_t) and the critic-side output becomes the state value function V(s_t).
Unlike the first embodiment, the action determination algorithm 312 of the arithmetic unit 512 selects the action of the agent 111 shown in FIG. 2 based on the policy π learned on the actor side. Learning is performed, for example, so as to minimize the actor-side and critic-side cost functions L_a and L_c defined by the following equations (7) and (8), respectively.
[Equations (7) and (8)]
By taking the partial derivatives of equations (7) and (8) with respect to ω_a and ω_c, update rules can be derived in the same manner as equation (5). To stabilize learning, a cost function that looks n steps ahead, as in the following equation (9), may be used for L_c.
[Equation (9)]
By setting n in equation (9) to a value greater than 1, the stability and convergence of learning can be improved. It is desirable to set n to a value of 10 or less, and by taking the partial derivative of equation (9) with respect to ω_c, an update rule can be derived in the same manner as equation (5). An arithmetic unit 512 of this configuration can be realized by connecting the two RC models 10a and 10b in parallel and having the action determination algorithm 312 decide the action of the agent 111 by combining the information on the policy with the information on the value.
In the above, different RC models 10a and 10b were used for the actor side and the critic side, but as in the RC algorithm 411 of FIG. 5(b), the input layer and the intermediate layer may be shared and only the output layers may differ, as in the sketch below. In this way, the computation result of the same physical medium 115 can be shared within the RC algorithm 411, so that, as in the first embodiment, the computation can be executed even by the arithmetic unit 513 having a single RC model 10.
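The following sketch shows how two readout heads can share one reservoir state, so that a single physical-medium computation serves both the actor and the critic. The softmax parameterization of π is an assumption for illustration; the description above only states that the actor outputs the policy and the critic the state value.

    import numpy as np

    n_nodes, n_actions = 64, 2
    rng = np.random.default_rng(0)
    omega_a = 0.01 * rng.standard_normal((n_actions, n_nodes))  # actor readout weights
    omega_c = 0.01 * rng.standard_normal(n_nodes)               # critic readout weights

    def actor_critic_heads(x_t):
        """Both heads read the same reservoir state x_t, so one physical-medium
        computation is shared.  The actor head parameterizes pi(a|s_t) via a
        softmax (an assumption); the critic head gives V(s_t)."""
        logits = omega_a @ x_t
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()
        value = float(omega_c @ x_t)
        return policy, value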
When implemented as hardware, the RC algorithms 311 and 411 shown in FIGS. 5(a) and 5(b) are both configured, like the arithmetic unit 112 shown in FIG. 4 of the first embodiment, to include the signal conversion devices 116 and 117, the physical medium 115, the storage unit 118, and the electronic arithmetic unit 119.
[Third embodiment]
(Addition of random convolutional layers)
Next, a third embodiment of the present disclosure will be described. FIG. 6 is a diagram for explaining the arithmetic unit 612 of the third embodiment. The arithmetic unit 612 has the RC algorithm 211 and the action determination algorithm 212 shown in FIG. 2 of the first embodiment, together with a preprocessing unit 613 provided upstream of the RC algorithm 211. The preprocessing unit 613 has the function of performing preprocessing with a random convolution layer and a pooling layer, converting the state s_t output from the state measuring device 113 into a state s_t'. The RC algorithm 211 executes the RC model computation based on the state s_t'.
The preprocessing unit 613 has two layers: a kernel filter that performs convolution and a kernel filter that performs pooling. Convolution is a process that takes the pixel values of a small region of an image and compresses that region into a smaller one as a single feature, reducing the amount of image data. Pooling is a process that thins out the image to reduce the amount of data. The coefficients of the kernel filters in the random convolution layer are generated, for example, from a table of random numbers in the interval [-1, 1] and are not trained afterward. The RC algorithm 211 of the third embodiment can therefore process the state s_t' in the same way that the first and second embodiments process the state s_t. When high-dimensional input information corresponding to a large number of pixels, such as from games or on-board cameras, must be handled, the third embodiment can compress the dimensionality in advance with the convolutional neural network and reduce the amount of processing in the RC algorithm 211.
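A minimal sketch of such a fixed (untrained) random convolution followed by pooling is shown below. The kernel size, channel count, and pooling size are placeholders; only the ideas taken from the description above are the U[-1, 1] random kernels, the absence of training, and the pooling that thins out the image before the compressed state is passed to the RC algorithm.

    import torch
    import torch.nn as nn

    class RandomConvPreprocessor(nn.Module):
        """Fixed random convolution + pooling used only to compress s_t into s_t'
        before the RC algorithm.  Kernels are drawn from U[-1, 1] and never trained."""

        def __init__(self, in_channels=4, out_channels=16):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=8, stride=4, bias=False)
            self.pool = nn.MaxPool2d(2)
            nn.init.uniform_(self.conv.weight, -1.0, 1.0)
            for p in self.parameters():
                p.requires_grad = False      # no learning in the preprocessor

        def forward(self, s_t):
            return self.pool(self.conv(s_t)).flatten(1)

    s_t = torch.rand(1, 4, 84, 84)               # hypothetical stacked-frame input
    s_t_prime = RandomConvPreprocessor()(s_t)    # compressed state fed to the RC algorithm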
(Verification by numerical calculation)
An example of reinforcement learning performed with the first to third embodiments described above will now be described. In this example, a task called the cart-pole problem, commonly used in reinforcement learning tests, is executed in computer simulation. The goal of this task is to control the position of a cart moving in a one-dimensional space so that a pole standing on the cart does not fall over. The state s_t consists of the velocity, position, angle, and angular velocity of the cart at each time step. The action a_t selects whether the pole is pushed to the left or to the right at that time. If the angle tilts by more than ±20.9° or the position shifts by more than ±2.4, the trial fails and a reward R_t of -200 is given; otherwise, a reward of +1 is obtained at each time step. To implement this task, this example applies the arithmetic unit of the second embodiment, which has the two RC models of the actor and the critic.
An optical medium is adopted as the physical medium of the RC model, and tanh(), based on the saturation behavior of a semiconductor optical amplifier (SOA), is used as the nonlinear function. The input coupling m is, for example, computed in advance before being introduced into the physical medium 115, and is generated from uniform random numbers on the interval [0, 1]. The physical medium 115 is assumed to be constructed on a fiber basis, and Ω, which corresponds to the mutual coupling within the medium, reflects a ring topology as in the following equation. Such a medium is described, for example, in Francois Duport et al., "All-optical reservoir computing," Optics Express, Vol. 20, No. 20, 22783-22795 (2012) (hereinafter referred to as "Reference 7").
[Equation: ring-topology coupling matrix Ω]
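A hedged numerical sketch of such a ring-coupled reservoir is shown below, assuming each node receives the previous node's output scaled by the loop gain α (the exact coupling matrix of the equation above is not reproduced); tanh stands in for the SOA saturation, and the state and input dimensions are placeholders. For this circular coupling the spectral radius is approximately α, consistent with the relationship described next.

    import numpy as np

    def ring_reservoir_step(x, u_t, input_mask, alpha):
        """One update of a ring-topology reservoir: each node receives the previous
        node's output scaled by the loop gain alpha, plus the masked input."""
        return np.tanh(alpha * np.roll(x, 1) + input_mask @ u_t)

    n_nodes = 64
    rng = np.random.default_rng(0)
    input_mask = rng.uniform(0, 1, (n_nodes, 4))   # input coupling m drawn from U[0, 1]
    x = np.zeros(n_nodes)
    x = ring_reservoir_step(x, np.array([0.0, 0.1, 0.0, -0.1]), input_mask, alpha=0.9)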
The gain α of the light returning around the ring is approximately equal to the spectral radius of Ω. Learning in the RC model 10 is performed by the method described in equations (7) and (9), with n = 5 and γ = 0.99 in equation (9) and the number of nodes of the RC model 10 set to 64.
FIGS. 7(a) and 7(b) are graphs showing the learning behavior in this example. In FIG. 7(a), the horizontal axis shows the number of trials and the vertical axis shows the number of consecutive steps for which the pole was kept from falling. The solid line in FIG. 7(a) shows learning with the RC model of this example. The dashed line shows, for comparison, the result of the same learning performed with a three-layer fully connected neural network. As is clear from FIG. 7(a), in this example the number of steps for which the pole is kept upright rises within fewer trials than for the dashed line, so convergence is faster than with the fully connected neural network. Furthermore, with the RC algorithm the only learning variable is ω and the other variables are processed by the physical medium 115, so learning is possible with few parameters.
FIG. 7(b) shows the relationship between the gain α of the light returning around the ring and the average number of trials required for convergence. Here, convergence is defined as succeeding five times in a row at keeping the pole upright for 200 steps. As is clear from FIG. 7(b), when α is 1.2 or more the number of trials to convergence increases, whereas for α between 0.6 and 1.0 the number of trials required for convergence is small and stable. This reflects the tendency that when α exceeds 1.2 the behavior of the optical medium becomes chaotic, and that the performance of the RC algorithm increases near the chaotic transition point of the optical medium. It is therefore desirable to set the spectral radius of the RC algorithm to as large a value as possible within the range that does not reach the chaotic transition point.
As another learning example in this example, the computer game Pong (Atari) is used as the task, with the goal of playing against the computer and winning. In the RC model, the state s_t is first processed by the preprocessing unit described in the third embodiment and then input to the RC model. FIG. 8(a) is a diagram for explaining the preprocessing unit 613 used in this example. FIG. 8(b) is a graph showing the learning behavior. The preprocessing unit 613 has a convolution filter with a plurality of intermediate layers 3130, 3131, and 3133, and a pooling layer 3135. The arithmetic unit having the RC model 10 is provided after the convolution filter. In the intermediate layer 3130, an 8×8 kernel filter 3137 performs convolution on the 84×84×4 input data while sliding by 4×4 pixels. In the intermediate layer 3131, a 4×4 kernel filter 3132 performs convolution on the 20×20×32 data while sliding by 2×2 pixels. In the intermediate layer 3133, a 3×3 kernel filter 3134 performs convolution on the 9×9×64 data while sliding by 1×1 pixel. A feature map is output each time convolution is performed. The feature maps are reduced in pixel count by the pooling layer 3135 and then combined to form the combined layer 3136; a shape check for this stack is sketched below.
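The sketch below only verifies the tensor shapes implied by the layer description above. The strides are inferred from the stated slide amounts, while the third layer's output channel count (64), the omission of bias, activation, and the subsequent pooling/combination are assumptions made for illustration.

    import torch
    import torch.nn as nn

    stack = nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4, bias=False),   # 84x84x4  -> 20x20x32
        nn.Conv2d(32, 64, kernel_size=4, stride=2, bias=False),  # 20x20x32 -> 9x9x64
        nn.Conv2d(64, 64, kernel_size=3, stride=1, bias=False),  # 9x9x64   -> 7x7x64
    )
    for p in stack.parameters():
        p.requires_grad = False                                   # kernels stay fixed

    print(stack(torch.rand(1, 4, 84, 84)).shape)                  # torch.Size([1, 64, 7, 7])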
Learning was performed by the method shown in equations (7) and (9), with n = 5 and γ = 0.99 in equation (9) and 4000 RC nodes. The horizontal axis of FIG. 8(b) shows the number of trials, and the vertical axis shows the score difference between the subject and the computer; a positive score difference indicates that the subject won. As is clear from FIG. 8(b), the score difference increases as the number of trials increases and thereafter converges stably. These results show that even a comparatively difficult task such as game play can be learned using the RC model of the present disclosure.
10, 10a, 10b: RC model
111: Agent
112, 512, 513, 612: Arithmetic unit
113: State measuring device
114: Action control device
115: Physical medium
116, 117: Signal conversion device
118: Storage unit
119: Electronic arithmetic unit
211, 311, 411: RC algorithm
212, 312, 412: Action determination algorithm
613: Preprocessing unit
3130, 3131, 3133: Intermediate layer
3132, 3134, 3137: Kernel filter
3135: Pooling layer
3136: Combined layer

Claims (7)

  1. An arithmetic unit for reinforcement learning that learns a policy by which a subject performs a task, comprising:
     a computer system having an input layer that receives input information about a state of the subject, an intermediate layer that learns the policy of the subject based on the input information input to the input layer, and an output layer that outputs output information about the policy; and
     an action determination unit that outputs information about an action of the subject based on the output information output from the computer system,
     wherein the computer system comprises a physical medium that physically transforms the input information and executes at least part of the function of learning the policy of the subject.
  2. The arithmetic unit for reinforcement learning according to claim 1, wherein the computer system comprises an actor unit that determines the policy of the subject based on the input information, and a critic unit that estimates a value of the state based on the input information.
  3. The arithmetic unit for reinforcement learning according to claim 2, comprising a first computer system functioning as the actor unit and a second computer system functioning as the critic unit.
  4. The arithmetic unit for reinforcement learning according to any one of claims 1 to 3, further comprising, upstream of the computer system, a preprocessing unit that compresses the input information.
  5. The arithmetic unit for reinforcement learning according to any one of claims 1 to 4, wherein the physical medium is a nonlinear optical element.
  6. The arithmetic unit for reinforcement learning according to any one of claims 1 to 5, wherein the computer system comprises an input conversion unit that converts the input information in accordance with the physical medium, an output conversion unit that converts the output information output from the physical medium in accordance with subsequent arithmetic processing, and a computation unit that executes arithmetic processing for reinforcement learning based on the output information converted by the output conversion unit, and wherein the information about the action of the subject is output based on a computation result of the computation unit.
  7. An arithmetic method for reinforcement learning that learns a policy by which a subject performs a task, using a computer system having an input layer that receives information, an intermediate layer that executes computation based on input information input to the input layer, and an output layer that outputs a result of the computation, the method comprising:
     inputting input information about a state of the subject into the input layer;
     learning, by the intermediate layer, the policy of the subject based on the input information; and
     outputting information about an action of the subject based on the information about the policy obtained by the learning,
     wherein at least part of the step of learning the policy of the subject is executed by a physical medium that physically transforms the input information.
PCT/JP2021/021969 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method WO2022259434A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/021969 WO2022259434A1 (en) 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method
JP2023526735A JPWO2022259434A1 (en) 2021-06-09 2021-06-09

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/021969 WO2022259434A1 (en) 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method

Publications (1)

Publication Number Publication Date
WO2022259434A1 true WO2022259434A1 (en) 2022-12-15

Family

ID=84425968

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/021969 WO2022259434A1 (en) 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method

Country Status (2)

Country Link
JP (1) JPWO2022259434A1 (en)
WO (1) WO2022259434A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018200391A (en) * 2017-05-26 2018-12-20 日本電信電話株式会社 Optical signal processing circuit

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018200391A (en) * 2017-05-26 2018-12-20 日本電信電話株式会社 Optical signal processing circuit

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KANNO, KAZUTAKA; UCHIDA, ATSUSHI: "N-2-5 Reinforcement Learning using Delay-Based Reservoir Computing", PROCEEDINGS OF THE IEICE ENGINEERING SCIENCES SOCIETY/NOLTA SOCIETY CONFERENCE, 30 November 2019 (2019-11-30), JP , pages 123, XP009541833, ISSN: 2189-700X *
KOBAYASHI, TAISUKE: "Multi-Objective Switchable Reinforcement Learning by using Reservoir Computing. The Japan Society of Mechanical Engineers", THE PROCEEDINGS OF 2017 JSME ANNUAL CONFERENCE ON ROBOTICS AND MECHATRONICS, vol. 2017, 2017, pages 1 - 4 *
KONISHI BUNGO; HIROSE AKIRA; NATSUAKI RYO: "Complex-Valued Reservoir Computing for Interferometric SAR Applications With Low Computational Cost and High Resolution", IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, IEEE, USA, vol. 14, 5 August 2021 (2021-08-05), USA, pages 7981 - 7993, XP011874235, ISSN: 1939-1404, DOI: 10.1109/JSTARS.2021.3102620 *
MATSUKI TOSHITAKA; INOUE SOUYA; SHIBATA KATSUNARI: "Q-learning with exploration driven by internal dynamics in chaotic neural network", 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), IEEE, 19 July 2020 (2020-07-19), pages 1 - 7, XP033831444, DOI: 10.1109/IJCNN48605.2020.9207114 *

Also Published As

Publication number Publication date
JPWO2022259434A1 (en) 2022-12-15

Similar Documents

Publication Publication Date Title
Xin et al. Application of deep reinforcement learning in mobile robot path planning
EP3992857A1 (en) Method and device for generating neural network model, and computer-readable storage medium
Cruz et al. Path planning of multi-agent systems in unknown environment with neural kernel smoothing and reinforcement learning
Rafiq et al. Neural network design for engineering applications
Chen et al. Self-learning exploration and mapping for mobile robots via deep reinforcement learning
CN112119409A (en) Neural network with relational memory
WO2020024172A1 (en) Collaborative type method and system of multistate continuous action space
CN107766292B (en) Neural network processing method and processing system
Ruan et al. A new multi-function global particle swarm optimization
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
Özalp et al. A review of deep reinforcement learning algorithms and comparative results on inverted pendulum system
Fan et al. Neighborhood centroid opposite-based learning Harris Hawks optimization for training neural networks
Liu et al. Neural network control system of cooperative robot based on genetic algorithms
Zhang et al. Evolutionary echo state network for long-term time series prediction: on the edge of chaos
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Acuto et al. Variational quantum soft actor-critic for robotic arm control
WO2022259434A1 (en) Computing device for reinforcement learning and reinforcement learning method
Vatankhah et al. Active leading through obstacles using ant-colony algorithm
Stanton et al. Heterogeneous complexification strategies robustly outperform homogeneous strategies for incremental evolution.
CN114378820B (en) Robot impedance learning method based on safety reinforcement learning
JP2022165395A (en) Method for optimizing neural network model and method for providing graphical user interface for neural network model
Mai-Phuong et al. Balancing a practical inverted pendulum model employing novel meta-heuristic optimization-based fuzzy logic controllers
Naderi et al. Learning physically based humanoid climbing movements
Kadokawa et al. Binarized P-network: deep reinforcement learning of robot control from raw images on FPGA
KR20220166716A (en) Demonstration-conditioned reinforcement learning for few-shot imitation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945110

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023526735

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE