WO2022259434A1 - Computing device for reinforcement learning and reinforcement learning method - Google Patents

Computing device for reinforcement learning and reinforcement learning method

Info

Publication number
WO2022259434A1
Authority
WO
WIPO (PCT)
Prior art keywords
reinforcement learning
input
learning
information
output
Prior art date
Application number
PCT/JP2021/021969
Other languages
French (fr)
Japanese (ja)
Inventor
Mitsumasa Nakajima
Toshikazu Hashimoto
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2021/021969
Priority to JP2023526735A
Publication of WO2022259434A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to a reinforcement learning computing device equipped with an algorithm for reinforcement learning.
  • Deep neural networks have been reported to exhibit excellent learning performance in a wide range of fields such as game play, robotics, and Go, and many proposals have been made for highly generalized methods that can learn strategies for various tasks.
  • a state variable s_t that represents the state at time t
  • an action variable a_t that is an action that can be taken at that time
  • a reward R_t that is obtained as a result
  • the action value function Q(s_t, a_t) and the state value function V(s_t), which represent the value of the action and of the state
  • Reinforcement learning often handles time-series input information such as sensor data, so a recurrent neural network (hereinafter also referred to as "RNN") model such as an LSTM (long short-term memory) is often incorporated. Since the associated computational processing and learning costs are generally high, a reduction of these processing loads is desired.
  • reservoir computing (hereinafter also referred to as "RC"), a model proposed as a type of RNN
  • RC differs from general RNN in that the input layer and intermediate layer networks are fixed, and the only variable used for learning is the weighting coefficient of the output layer.
  • Such an RC method can greatly reduce the number of variables to be learned, and is therefore suitable for time-series learning that requires a large amount of data and high-speed processing.
  • RC generates most node-to-node connections by random mapping. When this random mapping is computed on a computer, it requires a large amount of computational resources as the number of nodes increases, which is one of the factors that increase the scale and cost of the computing system including the computer.
  • To solve this problem, research into and demonstration of a configuration called physical RC, in which the mapping is computed by a physical system, are in progress.
  • Physical RC enables efficient learning even on devices with limited computational resources because part of the computation can be outsourced to physical elements. Physical RC is described in Patent Document 2, for example.
  • Herbert Jaeger and Harald Haas (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, Vol. 304.
  • Gouhei Tanaka et al. (2019). Recent advances in physical reservoir computing: A review. Neural Networks, 115, 100-123.
  • The present disclosure aims to apply physical RC to reinforcement learning, which involves a large amount of computational processing, to effectively reduce computational resources, and to provide a reinforcement learning arithmetic device and a reinforcement learning method that perform highly efficient reinforcement learning.
  • To achieve the above object, a reinforcement learning arithmetic device according to one aspect of the present invention is an arithmetic device for reinforcement learning that learns a policy used when a target executes a task.
  • The device includes a computer system having an input layer to which input information about the state of the target is input, an intermediate layer that learns the policy of the target based on the input information input to the input layer, and an output layer that outputs output information related to the policy, and a behavior determination unit that outputs information about the behavior the target should take based on the output information output from the computer system; the computer system includes a physical medium that physically converts the input information and executes at least a part of the function of learning the policy of the target.
  • A computation method for reinforcement learning according to one aspect of the present invention is a computation method for reinforcement learning that learns a policy used when a target executes a task. For a computer system having an input layer to which information is input, an intermediate layer that performs computation based on the input information input to the input layer, and an output layer that outputs the result of the computation, the method includes a step of inputting input information about the state of the target to the input layer, a step in which the intermediate layer learns the policy of the target based on the input information, and a step of outputting information about the behavior of the target based on the information about the policy obtained by the learning. At least a part of the step of learning the policy of the target is performed by a physical medium that physically converts the input information.
  • According to the above aspects, it is possible to provide a reinforcement learning arithmetic device and a reinforcement learning method that apply physical RC to reinforcement learning, which involves a large amount of arithmetic processing, effectively reduce computational resources, and perform highly efficient reinforcement learning.
  • FIG. 1 is a diagram for explaining the form of a general RC model.
  • FIG. 2 is a diagram for explaining an RC model of a computing system having the computing device for reinforcement learning according to the first embodiment.
  • FIG. 3 is a schematic diagram of the configuration of the arithmetic unit used in the first embodiment.
  • FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit shown in FIG. 3.
  • FIGS. 5(a) and 5(b) are diagrams for explaining the arithmetic unit for reinforcement learning according to the second embodiment of the present invention.
  • FIG. 6 is a diagram for explaining the arithmetic unit according to the third embodiment of the present invention.
  • FIGS. 7(a) and 7(b) are graphs showing the state of learning in an example of the present invention.
  • FIG. 8(a) is a diagram for explaining the preprocessing unit of the example; FIG. 8(b) is a graph showing the state of learning in the example.
  • The first, second, and third embodiments of the present invention will be described below. Note that the first to third embodiments illustrate the technical idea, configuration, procedure, and the like of the present invention, and do not limit the specific configuration, conditions, parameters, and the like.
  • First, prior to describing the embodiments, reservoir computing (hereinafter also referred to as "RC") will be described.
  • FIG. 1 is a diagram for explaining the form of a general RC model.
  • The RC model 10 shown in FIG. 1 is composed of an input layer Ri in which an input signal, which is the input information, is coupled to each neuron, an intermediate layer Rr in which the neurons are coupled to each other, and an output layer Ro in which the signals of the neurons are summed and output. Coupling an input signal to each neuron is also referred to herein as "input".
  • Equation (1) is an equation for determining the input signal u(n) input to the input layer Ri.
  • Equation (2) is an equation for determining the output signal y(n) output from the output layer Ro when the input signal u(n) is input to the input layer Ri.
  • In Equations (1) and (2), N is the number of neurons
  • x_i(n) is the state of the i-th neuron at time step n
  • Ω_ij, m_i, η_i, and ω_i are coefficients representing, respectively, the mutual coupling between neurons, the coupling of the input signal to each neuron, the coupling of the feedback (FB) signal from the output signal to each neuron, and the coupling from each neuron to the output signal.
  • f(·) represents the nonlinear response of each neuron; tanh(·) and the like are frequently used.
  • A major difference between the RC network and a general recurrent neural network (hereinafter also referred to as "RNN") is that the networks of the input layer Ri and the intermediate layer Rr are fixed, and the only variables used for learning are the weighting coefficients of the output layer Ro. Since the RC method can greatly reduce the number of variables to be learned, it has a great advantage for time-series learning, which involves a large amount of data and requires high-speed processing. A minimal sketch of this update and readout in code is given below.
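  • The following is a minimal NumPy sketch of the update in Equations (1) and (2), assuming a tanh nonlinearity and treating all couplings as fixed random matrices; the sizes and names are illustrative and not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim_u, dim_y = 100, 4, 2                  # neurons, input dim, output dim (illustrative)

Omega = rng.standard_normal((N, N)) * 0.1    # fixed inter-neuron coupling
m     = rng.uniform(-1.0, 1.0, (N, dim_u))   # fixed input coupling
eta   = rng.uniform(-1.0, 1.0, (N, dim_y))   # fixed feedback coupling from the output
omega = np.zeros((dim_y, N))                 # output weights: the only trained variable

def step(x, u, y_prev):
    """One reservoir update (Eq. (1)) followed by the linear readout (Eq. (2))."""
    x_next = np.tanh(Omega @ x + m @ u + eta @ y_prev)
    y_next = omega @ x_next
    return x_next, y_next

x, y = np.zeros(N), np.zeros(dim_y)          # initial reservoir state and output
x, y = step(x, np.ones(dim_u), y)            # one time step with a dummy input
```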
  • FIG. 2 is a diagram for explaining the RC model 10 of the computing system having the computing device 112 for reinforcement learning according to the first embodiment.
  • the RC model 10 of the first embodiment has an agent 111 , a state measuring device 113 , an arithmetic device 112 and a behavior control device 114 .
  • the RC model 10 learns the policy when the agent 111 executes a certain task in the arithmetic device 112 .
  • The state of the agent 111 at time t is represented by s_t, and the agent 111 can take a plurality of actions a_t according to the state s_t.
  • The agent 111 obtains a reward R_t as a result of the action a_t. The state s_t is measured by the state measuring device 113. The action a_t is controlled by the action control device 114 so as to realize an action a_t in line with the policy.
  • the state s_t corresponds to sensor data (position, angle, acceleration, etc.) of each joint of the robot arm.
  • the action a_t corresponds to a control signal (amount of rotation, amount of displacement, etc.) for driving the robot arm.
  • the reward R_t is a virtual value obtained in software when the robot arm grasps an object or performs a desired action.
  • the state measuring device 113 corresponds to a device such as an ammeter or a voltmeter that acquires sensor data such as current or voltage
  • the behavior control device 114 corresponds to a driving device such as a voltage source or a current source.
  • the configuration shown in FIG. 2 can also be applied to learning games and shogi.
  • the state s_t is an image of a screen or board
  • the action a_t corresponds to a possible move.
  • the reward R_t corresponds to points or the like obtained by the action. Note that the method of the first embodiment can be applied regardless of the above task-dependent inputs and control amounts, and can be widely used for tasks other than those described above.
  • FIG. 3 is a schematic diagram of the configuration of the arithmetic unit 112 used in the first embodiment.
  • Computing unit 112 includes RC algorithm 211 and action decision algorithm 212 .
  • RC algorithm 211 has RC model 10 and performs reinforcement learning using an RC network.
  • the state s_t measured via the state measuring device 113 is regarded as the input signal u(t) at time t in Equation (1) and is input to the arithmetic device 112.
  • This signal undergoes the transformation described by Equation (2) by the RC algorithm 211 and the action decision algorithm 212, and is transformed into an intermediate state x(t) and a final output y(t).
  • the dimensions of the input and output of the RC model are set to be equal to the dimensions of the state s_t and the action a_t.
  • the first embodiment has an excellent effect of being able to learn with a small number of parameters compared to a normal deep learning model.
  • The RC algorithm 211 learns the variable ω of Equation (2) such that the output y(t) of the RC algorithm 211 becomes the action-value function Q(s_t, a_t), an index representing the value of the action a_t in the state s_t.
  • The action decision algorithm 212 decides the action a_t based on the output action-value function Q(s_t, a_t) and outputs it to the action control device 114 as information on the action a_t.
  • The behavior control device 114 controls the behavior of the agent 111 based on the action a_t.
  • the agent 111 corresponds to the target
  • the information about the state corresponds to the state s_t
  • the information about the policy corresponds to the action a_t.
  • The state s_t is input to the input layer Ri of the RC model 10, and in the intermediate layer Rr the agent's action a_t is learned based on the state s_t.
  • The action a_t obtained as a result of the learning is output to the action decision algorithm 212.
  • The action decision algorithm 212 determines the action a_t of the agent 111 from among the possible actions a_t and outputs it to the action control device 114.
  • The RC model 10 is a computational processing model simulating neurons, and the RC algorithm 211 is a program for realizing computational processing based on the RC model 10 and outputting the action-value function Q(s_t, a_t), which includes the reward R_t.
  • the behavior determination algorithm 212 is a program that determines the behavior of the agent 111 from the results of processing performed by the RC algorithm 211 .
  • Arithmetic device 112 is a concept that includes software including such a program and hardware for operating the program.
  • RC model 10 corresponds to a computer system.
  • The above configuration executes a computation method for reinforcement learning for learning the action a_t used when the agent 111 executes a task.
  • This computation method for reinforcement learning includes a step of inputting the state s_t to the input layer Ri of the RC model 10, a step of learning the action a_t of the agent 111 based on the state s_t, and a step of outputting information about the behavior of the agent 111 based on the action a_t obtained by the learning.
  • In the step of learning the action a_t of the agent 111, at least a part of the process of learning the policy for the action a_t is performed by the physical medium 115, which physically transforms the state s_t.
  • Learning is performed, for example, so as to minimize a TD (Temporal Difference) error L defined by Equation (3).
  • In Equation (3), γ is a discount rate, a hyperparameter indicating how much future rewards are discounted. Typically, a value smaller than and close to 1, such as 0.99, is used. Since the only learning variable is ω, the gradient is taken with respect to ω (Equation (4)), and ω is updated by Equation (5).
  • λ in Equation (5) is the learning rate.
  • Equation (5) is an update rule based on simple gradient descent, but various optimization algorithms used in the machine learning field, such as the Adam optimizer and stochastic gradient descent (SGD), can also be used.
  • In Equation (3), the TD error for one step is used, but the TD error over n steps may be used instead. In that case, the following Equation (6) is used as the cost function.
  • By setting n in Equation (6) to a value greater than 1, the stability and convergence of learning can be improved. Desirably, n is a value of 10 or less.
  • By considering the partial derivative of Equation (6) with respect to ω, the update rule can be derived in the same manner as Equations (5) and (6). A sketch of this TD-based update of ω in code is given below.
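  • A minimal sketch of that update in NumPy, assuming the readout y(t) = ω x(t) is a vector of Q-values over a discrete action set and using a semi-gradient step (the bootstrapped target is treated as a constant); the function name is illustrative and the factor 2 from the squared error is absorbed into the learning rate.

```python
import numpy as np

def td_update(omega, x_t, x_t1, a_t, R_t, gamma=0.99, lam=1e-3):
    """One-step TD update of the readout weights (sketch of Eqs. (3) to (5)).

    omega : (num_actions, N) readout matrix, the only learned variable
    x_t, x_t1 : reservoir states x(t) and x(t+1)
    a_t : index of the action taken at step t
    """
    q_t, q_t1 = omega @ x_t, omega @ x_t1           # Q(s_t, .) and Q(s_{t+1}, .)
    delta = R_t + gamma * np.max(q_t1) - q_t[a_t]   # TD error
    omega = omega.copy()
    omega[a_t] += lam * delta * x_t                 # gradient step on the chosen action's row
    return omega, delta
```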
  • the ⁇ -greedy method is used in the action decision algorithm.
  • the ⁇ -greedy method selects an action based on policy ⁇ with probability ⁇ , and selects the action with the highest action value function Q(s, a) calculated above with probability (1 ⁇ ).
  • the policy ⁇ is typically generated from a uniform probability distribution, and tries to uniformly choose all possible actions equally.
  • Arithmetic unit 112 requires a large amount of computational resources for the numerical computation described by equation (1), and therefore requires power and execution time as the scale of the network increases.
  • In the first embodiment, of the RC model shown in FIG. 3, the computation corresponding to Equation (1) is performed by the physical medium 115.
  • The physical medium 115 may, for example, receive the input information as light and convert physical parameters such as the amplitude, wavelength, and frequency of the input light. The physical medium 115 may also receive the input information as an electric signal and convert physical parameters such as its value and frequency. Furthermore, a liquid may be input as the input information and physical parameters such as its pressure and flow rate may be converted. Conversion is not limited to converting a numerical value without changing the type of parameter; it also includes changes to other parameters. Learning refers to accumulating or selecting information about actions so that an action a_t that is more likely to accomplish the task can be selected according to the state s_t.
  • FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit 112 of the first embodiment that performs arithmetic operations corresponding to equation (1) in the physical medium 115.
  • the arithmetic unit 112 shown in FIG. 4 includes signal converters 116 and 117 , a physical medium 115 , an electronic arithmetic unit 119 and a storage unit 118 .
  • the input information st from the state measuring device 113 is converted from digital information to analog physical signals via the signal conversion device 116 .
  • the converted physical signal is input to physical medium 115 .
  • the physical signal refers to, for example, current, voltage, light intensity, and the like.
  • The conversion into a physical signal is desirably performed using a signal scheme suited to the configuration of the physical medium 115. Signal propagation in the physical medium 115 is described by the physical laws governing each system, and it has been reported, for example in the literature cited above, that computation equivalent to Equation (1) can be performed in various physical systems.
  • the signal propagated through the physical medium 115 is measured by the signal conversion device 117 and converted back into digital information.
  • This measurement signal can be considered equivalent to x(t) in equation (1).
  • the measurement signal is held in a storage unit 118 such as a memory or hard disk.
  • the storage unit 118 also defines and stores programs and parameters for executing formulas (2) to (5) at the same time.
  • Information including the measurement signal, the programs, and the parameters is transferred to an electronic arithmetic unit 119 comprising a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like, and the electronic arithmetic unit 119 executes reinforcement learning by computing Equations (2) to (5).
  • the first embodiment has the excellent effect of being able to automatically calculate equation (1) during signal propagation.
  • With such a configuration, the first embodiment automatically executes the computation corresponding to Equation (1) in the physical medium 115 through the laws of physics, and can therefore mitigate or resolve the problem of increased computational resources. A sketch of this hybrid digital-physical signal flow is given below.
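  • A minimal sketch of the FIG. 4 signal flow, with the signal converters 116/117 and the physical medium 115 represented by placeholder callables (these stand-ins are assumptions for illustration); only the readout of Equation (2) is computed electronically.

```python
def reservoir_step_hybrid(s_t, dac, physical_medium, adc, omega):
    """Digital state -> analog drive -> physical propagation -> digital x(t) -> readout y(t)."""
    analog_in = dac(s_t)                       # signal conversion device 116: digital to physical signal
    analog_out = physical_medium(analog_in)    # propagation performs the Eq. (1) transformation
    x_t = adc(analog_out)                      # signal conversion device 117: measured back to digital
    y_t = omega @ x_t                          # Eq. (2) readout on the electronic arithmetic unit 119
    return x_t, y_t
```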
  • The physical medium 115 may be a known one; the first embodiment uses an optical circuit as the physical medium 115 to perform the computation.
  • Optical implementations of RC are described, for example, in Optica 5, 756-760 (hereinafter also referred to as "Reference 3"); in L. Larger et al. (2012), Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing, Optics Express 20, 3241 (hereinafter also referred to as "Reference 4"); and in M. Nakajima et al. (2021), Scalable reservoir computing on coherent linear photonic processor, Communications Physics 4, 20 (hereinafter also referred to as "Reference 5"). Since the first embodiment is capable of high-speed optical input/output exceeding 100 Gbit/s via optical communication devices, it can perform the computation at high speed.
  • In the first embodiment, the signal conversion device 116 is composed of a digital-to-analog converter (DAC) and an optical modulator; it converts the digital signal into optical intensity and optical phase and introduces it into the optical physical medium 115.
  • the signal conversion device 117 is composed of an optical receiver and an analog-to-digital converter (ADC).
  • the optical physical medium 115 may be, for example, a spatial optical system composed of a lens, a mirror, and a spatial modulator described in Reference 3, or may be configured using an optical fiber ring described in Reference 4. While such a configuration makes the device relatively large, as described in reference 5, it is also possible to integrate a small arithmetic circuit by using an optical integrated circuit.
  • Physical medium 115 is preferably a non-linear optical element to allow non-linear transformation.
  • the nonlinear element is realized, for example, by configuring the physical medium 115 with a nonlinear optical material (for example, LiNbO 3 or the like) or by using gain saturation of an optical amplifier.
  • It is preferable that the physical medium 115 be physically implemented such that the spectral radius (the maximum magnitude of the eigenvalues) of the matrix Ω in Equation (1) is in the range of 0.8 to 1.2. With such a configuration, the first embodiment can realize operation near the chaotic transition point and can improve the memory-holding performance of the RC.
  • When the physical medium 115 is an optical medium that does not contain a gain medium, operation near the chaotic transition corresponds to a configuration that reduces optical losses as much as possible. Note that if the physical medium 115 includes a gain medium such as an amplifier, the memory retention function can be improved by adjusting the gain. In a numerical model, the corresponding condition can be imposed by rescaling Ω, as sketched below.
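  • A minimal sketch of such rescaling in a numerical model, assuming a random Ω; the helper name and target value are illustrative.

```python
import numpy as np

def scale_to_spectral_radius(Omega, target=0.95):
    """Rescale a coupling matrix so that its spectral radius equals `target`."""
    radius = np.max(np.abs(np.linalg.eigvals(Omega)))
    return Omega * (target / radius)

# Operate the reservoir near, but below, the chaotic transition point.
rng = np.random.default_rng(0)
Omega = scale_to_spectral_radius(rng.standard_normal((100, 100)), target=0.95)
```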
  • the first embodiment is not limited to the physical medium 115 performing all of the functions equivalent to the calculation shown in Equation (1).
  • The first embodiment may use the physical medium 115 to perform only part of the computation of Equation (1).
  • part of the computation represented by Equation (1) may be performed by the electronic computation unit 119 .
  • part of the computation shown in Equation (1) may be performed in advance by the electronic computation unit 119 or the like.
  • Such pretreatment is described, for example, in Reference 4.
  • In this case, the processing of the second term of Equation (1) (input mask processing) is performed in advance, and the result is used as the input signal to the physical medium 115.
  • Such preprocessing facilitates the implementation of the physical medium 115 even when the dimensions of the state s_t and the intermediate state x(t) are large. A sketch of this masking step is given below.
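  • A minimal sketch of the input-mask preprocessing, assuming the mask is the fixed random input coupling m of Equation (1); the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim_s = 100, 4                             # reservoir and state dimensions (illustrative)
mask = rng.uniform(0.0, 1.0, (N, dim_s))      # fixed random input coupling m

def premask(s_t):
    """Compute the second term of Eq. (1), m u(t), electronically; the masked vector
    is then used as the drive signal for the physical medium."""
    return mask @ np.asarray(s_t)
```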
  • Another aspect of the physical medium 115 is, for example, a memristor such as ReRAM.
  • Such an aspect is a configuration that utilizes a physical phenomenon in which the resistance value changes according to the history of the current that has passed through the resistance cell.
  • DAC/ADC are used for the signal converters 116 and 117 . Since such a configuration is implemented by an analog electronic circuit, it has a high interface affinity and is advantageous for miniaturization.
  • Such a configuration is described, for example, in Yanan Zhong et al. (2021), Dynamic memristor-based reservoir computing for high-efficiency temporal signal processing, Nature Communications 12, 408.
  • Since a memristor can be formed by a semiconductor process, it is advantageous in that large-scale nodes can easily be configured. Furthermore, in addition to such configurations, magnetic circuits, fluids, soft materials, and the like can also be used as aspects of the physical medium 115. This point is summarized, for example, in Reference 1.
  • the first embodiment applies physical RC to reinforcement learning with a large amount of computational processing, and can effectively reduce computational resources. For this reason, it is possible to suppress an increase in the size and cost of a reinforcement learning computing device that executes reinforcement learning. Alternatively, when there is room in the computational resources due to the miniaturization of the computational device for reinforcement learning, it is possible to perform more advanced computational processing.
  • Such a first embodiment can provide a reinforcement learning computing device and a reinforcement learning method that perform highly efficient reinforcement learning.
  • FIGS. 5(a) and 5(b) are diagrams for explaining arithmetic units 512 and 513 for reinforcement learning according to the second embodiment.
  • FIG. 5A is a schematic diagram of the reinforcement learning arithmetic device 512 of the second embodiment.
  • FIG. 5(b) is a schematic diagram of an arithmetic device 513 that is a modification of the arithmetic device 512 shown in FIG. 5(a).
  • the computing device 512 differs from the computing device 112 of the first embodiment in that it includes an RC algorithm 311 having two RC models 10a and 10b and an action decision algorithm 312 that selects an action different from the action decision algorithm 212.
  • the signal flow in the arithmetic device 512 of the second embodiment shown in FIG. 5(a) is the same as that of the arithmetic device 112 of FIG. 2 described in the first embodiment.
  • the computing device 512 differs from the computing device 112 in the structure of the RC network and the learning method.
  • the state s t input from the state measuring device 113 is input to the two RC models 10a and 10b as the input signal u(t) in equation (1).
  • the dimensions of the inputs and outputs of the RC models 10a and 10b are set equal to the dimensions of the states s t and actions a t , respectively.
  • The only learning variables in the RC algorithm 311 are the output-layer weights ω_a and ω_c of the RC models 10a and 10b, respectively. This gives the RC algorithm 311 the advantage of being able to learn with a small number of parameters compared with ordinary deep learning models.
  • the RC models 10a, 10b each perform different processes called “Actors” and “Critics”.
  • The actor determines the possible course of action of the agent 111, and the critic estimates the value of the state by gathering information from the state s_t output from the state measuring device 113.
  • The RC algorithm 311 performs learning so that the output on the actor side becomes the policy function π(s_t, a_t) representing the policy, and the output on the critic side becomes the state value function V(s_t).
  • Unlike the first embodiment, the action determination algorithm 312 of the arithmetic device 512 selects the action of the agent 111 shown in FIG. 2 based on the policy π learned on the actor side. Action selection is performed, for example, so as to minimize the cost functions L_a and L_c on the actor side and the critic side, defined by the following Equations (7) and (8), respectively.
  • By considering the partial derivatives of Equations (7) and (8) with respect to ω_a and ω_c, the update rules can be derived in the same manner as Equations (5) and (6). In order to stabilize learning, a cost function that considers up to n steps ahead may be used for L_c, as in the following Equation (9).
  • By setting n to a value greater than 1 in Equation (9), the stability and convergence of learning can be improved. It is desirable to set n to a value of 10 or less. By considering the partial derivative of Equation (9) with respect to ω_c, the update rule can be derived in the same manner as Equations (5) and (6).
  • An arithmetic unit with such a configuration can be realized by connecting the two RC models 10a and 10b in parallel so that the action decision algorithm 312 decides the action of the agent 111 by combining the information on the policy and the information on the value. A sketch of such an actor-critic arrangement with two reservoir readouts is given below.
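  • A minimal sketch of the actor/critic readouts, assuming the actor output is turned into action probabilities by a softmax; the softmax is an assumption for illustration, since the text only states that the actor-side output represents the policy.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def actor_critic_outputs(x_actor, x_critic, omega_a, omega_c):
    """Actor readout -> action probabilities; critic readout -> state value V(s_t).

    x_actor, x_critic : reservoir states of the two RC models 10a and 10b for the same s_t
    omega_a : (num_actions, N_a) actor readout; omega_c : (N_c,) critic readout
    """
    policy = softmax(omega_a @ x_actor)        # probabilities over the possible actions
    value = float(omega_c @ x_critic)          # estimated state value V(s_t)
    return policy, value

def sample_action(policy, rng=np.random.default_rng()):
    return int(rng.choice(len(policy), p=policy))
```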
  • FIG. 6 is a diagram for explaining the arithmetic unit 612 of the third embodiment.
  • The arithmetic device 612 has the RC algorithm 211 and the action determination algorithm 212 shown in FIG. 3, together with a preprocessing unit 613.
  • The preprocessing unit 613 has a function of performing preprocessing using a random convolution layer and a pooling layer, and converts the state s_t output from the state measuring device 113 into a state s_t'.
  • The RC algorithm 211 performs the RC model computation based on the state s_t'.
  • the preprocessing unit 613 has two layers, a kernel filter that performs convolution and a kernel filter that performs pooling.
  • Convolution is a process of compressing the pixel values of a small region of an image into a smaller representation, treating that region as one feature, and thereby reducing the amount of image data.
  • Pooling is a process of thinning out images to reduce the amount of data.
  • the coefficients of the kernel filter in the random convolution layer are generated, for example, by a random number table in the interval [-1:1], and are not learned thereafter.
  • the RC algorithm 211 can process state s t' in the same way that the first and second embodiments process state s t .
  • Because the dimensions are compressed in advance by the convolutional layer, the amount of processing in the RC algorithm 211 can be reduced. A sketch of fixed random-convolution preprocessing is given below.
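  • A minimal single-channel sketch of fixed (untrained) random-convolution preprocessing followed by pooling, with kernel coefficients drawn from the interval [-1:1] as stated above; the image size, stride, and pooling size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.uniform(-1.0, 1.0, (8, 8))       # random convolution kernel, never trained

def conv2d_valid(img, k, stride):
    """Unpadded 2-D convolution (correlation) with the given stride."""
    kh, kw = k.shape
    H = (img.shape[0] - kh) // stride + 1
    W = (img.shape[1] - kw) // stride + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(img[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
    return out

def max_pool(img, size=2):
    """Thin out the image by taking the maximum over non-overlapping size x size blocks."""
    H, W = img.shape[0] // size, img.shape[1] // size
    return img[:H*size, :W*size].reshape(H, size, W, size).max(axis=(1, 3))

s_t = rng.random((84, 84))                                          # a raw input image (illustrative)
s_t_prime = max_pool(conv2d_valid(s_t, kernel, stride=4)).ravel()   # compressed state s_t'
```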
  • An optical medium is adopted as the physical medium of the RC model, and tanh() based on the saturation behavior of a semiconductor optical amplifier (SOA) is used as a nonlinear function.
  • The input coupling m is, for example, calculated in advance before being introduced into the physical medium 115, and is generated from uniform random numbers on the interval [0:1]. The physical medium 115 is assumed to be constructed on a fiber basis, and Ω, which corresponds to the mutual coupling within the medium, reflects the ring topology.
  • Such a medium is described, for example, in Francois Duport et al. (2012), All-optical reservoir computing, Optics Express, Vol. 20, No. 20, 22783-22795.
  • the return light gain ⁇ in the ring is approximately equal to the spectral radius of ⁇ .
  • FIGS. 7(a) and 7(b) are graphs showing the state of learning in this embodiment.
  • the horizontal axis indicates the number of trials, and the vertical axis indicates the number of steps in which the stick was not knocked over.
  • a solid line in FIG. 7A indicates learning using the RC model of this embodiment.
  • the dashed line shows the results of similar training using a 3-layer fully-connected neural network for comparison.
  • Even while the number of trials is still small, the number of steps for which the pole is kept from falling increases, and convergence is faster than when a fully connected neural network is used.
  • the only learning variable is ⁇ , and the other variables are processed by the physical medium 115, which will be described later, so learning is possible with a small number of parameters.
  • FIG. 7(b) shows the relationship between the return light gain ⁇ in the ring and the average number of trials required for convergence.
  • convergence is defined as five consecutive successes in holding the bar up to 200 steps without falling.
  • When α is 1.2 or more, the number of trials until convergence increases, whereas when α is between 0.6 and 1.0 the number of trials required for convergence is small and stable.
  • This reflects the tendency that when α exceeds 1.2 the behavior of the optical medium becomes chaotic, while the performance of the RC algorithm increases near the chaotic transition point of the optical medium. It is therefore desirable to set the spectral radius as large as possible within a range that does not reach the chaotic transition point.
  • the computer game Pong (manufactured by Atari) is used as a task, and the object is to compete against a computer and win.
  • the state st is preprocessed by the preprocessing unit described in the third embodiment and input to the RC model.
  • FIG. 8A is a diagram for explaining the preprocessing unit 613 used in this embodiment.
  • FIG. 8(b) is a graph showing the state of learning.
  • The preprocessing unit 613 has a plurality of intermediate layers 3130, 3131, 3133 formed by convolution filters and a pooling layer 3135. After the convolution filters, an arithmetic unit with the RC model 10 is provided.
  • an 8 ⁇ 8 kernel filter 3137 performs convolution processing on 84 ⁇ 84 ⁇ 4 input data while sliding by 4 ⁇ 4 pixels.
  • a 4 ⁇ 4 kernel filter 3132 performs convolution processing by sliding 20 ⁇ 20 ⁇ 32 data by 2 ⁇ 2 pixels.
  • a 3 ⁇ 3 kernel filter 3134 slides 9 ⁇ 9 ⁇ 64 data by 1 ⁇ 1 pixels and performs convolution processing.
  • a feature map is output each time convolution processing is performed.
  • the feature maps are pixel-reduced by a pooling layer 3135 and then combined to form a combined layer 3136 .
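  • The feature-map sizes implied by the kernel sizes and strides above follow from the usual unpadded-convolution formula; a quick check, assuming no padding (the channel counts are those stated above).

```python
def conv_out(size, kernel, stride):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

s = 84
s = conv_out(s, 8, 4)   # 84x84x4  -> 20x20 feature maps (32 channels)
s = conv_out(s, 4, 2)   # 20x20x32 ->  9x9  feature maps (64 channels)
s = conv_out(s, 3, 1)   #  9x9x64  ->  7x7  feature maps
print(s)                # 7
```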

Abstract

The purpose of the present invention is to provide a computing device for reinforcement learning and a reinforcement learning method that apply physical reservoir computing (RC) to reinforcement learning, which has a high computation processing load, effectively reduce computation resources, and execute highly efficient reinforcement learning. To this end, the computing device (112) for reinforcement learning according to the present invention comprises an RC algorithm having an input layer for inputting a subject state s_t, an intermediate layer for learning a subject action a_t on the basis of the input information input to the input layer, and an output layer for outputting output information relating to the action a_t, and an action determination algorithm for outputting information relating to the subject's action on the basis of the action a_t output from the RC algorithm; the RC algorithm includes a physical medium (115) that executes at least some of the functions of physically converting the state s_t and learning the subject's policy.

Description

Arithmetic device for reinforcement learning and reinforcement learning method

The present invention relates to a reinforcement learning computing device equipped with an algorithm for reinforcement learning.

In recent years, research and development of deep reinforcement learning using deep neural networks (DNN: Deep Neural Network) has progressed. Deep neural networks have been reported to exhibit excellent learning performance in a wide range of fields such as game play, robotics, and Go, and many highly generalizable methods that can learn policies for various tasks have been proposed. Generally, in deep reinforcement learning, a state variable s_t representing the state at time t, an action variable a_t representing an action that can be taken at that time, and a reward R_t obtained as a result are used; the action value function Q(s_t, a_t) and the state value function V(s_t), which represent the value of the action and of the state in a given state, are calculated as the output of a neural network, and the policy is determined. Reinforcement learning often handles time-series input information such as sensor data, so a recurrent neural network (hereinafter also referred to as "RNN") model such as an LSTM (long short-term memory) is often incorporated. Since the associated computational processing and learning costs are generally heavy, a reduction of these processing loads is desired.
In recent years, a model called reservoir computing (hereinafter also referred to as "RC") has been proposed as a type of RNN. RC consists of an input layer in which the input information is coupled to each neuron, an intermediate layer in which the neurons are coupled to each other, and an output layer that sums and outputs the signals of the neurons. Such a configuration is described, for example, in Patent Document 1.

RC differs from a general RNN in that the networks of the input layer and the intermediate layer are fixed and the only variables used for learning are the weighting coefficients of the output layer. Such an RC method can greatly reduce the number of variables to be learned and is therefore suitable for time-series learning, which involves a large amount of data and requires high-speed processing. RC generates most node-to-node connections by random mapping. When this random mapping is computed on a computer, it requires a large amount of computational resources as the number of nodes increases, which is one of the factors that increase the scale and cost of the computing system including the computer. To solve this problem, research into and demonstration of a configuration called physical RC, in which the mapping is computed by a physical system, are in progress. Physical RC enables efficient learning even on devices with limited computational resources because part of the computation can be outsourced to physical elements. Physical RC is described in Patent Document 2, for example.

The present disclosure aims to apply physical RC to reinforcement learning, which involves a large amount of computational processing, to effectively reduce computational resources, and to provide a reinforcement learning arithmetic device and a reinforcement learning method that perform highly efficient reinforcement learning.
To achieve the above object, a reinforcement learning arithmetic device according to one aspect of the present invention is an arithmetic device for reinforcement learning that learns a policy used when a target executes a task. The device includes a computer system having an input layer to which input information about the state of the target is input, an intermediate layer that learns the policy of the target based on the input information input to the input layer, and an output layer that outputs output information related to the policy; and a behavior determination unit that outputs information about the behavior the target should take based on the output information output from the computer system. The computer system includes a physical medium that physically converts the input information and executes at least a part of the function of learning the policy of the target.

A computation method for reinforcement learning according to one aspect of the present invention is a computation method for reinforcement learning that learns a policy used when a target executes a task. For a computer system having an input layer to which information is input, an intermediate layer that performs computation based on the input information input to the input layer, and an output layer that outputs the result of the computation, the method includes a step of inputting input information about the state of the target to the input layer, a step in which the intermediate layer learns the policy of the target based on the input information, and a step of outputting information about the behavior of the target based on the information about the policy obtained by the learning. At least a part of the step of learning the policy of the target is performed by a physical medium that physically converts the input information.

According to the above aspects, it is possible to provide a reinforcement learning arithmetic device and a reinforcement learning method that apply physical RC to reinforcement learning, which involves a large amount of arithmetic processing, effectively reduce computational resources, and perform highly efficient reinforcement learning.
FIG. 1 is a diagram for explaining the form of a general RC model. FIG. 2 is a diagram for explaining an RC model of a computing system having the computing device for reinforcement learning according to the first embodiment. FIG. 3 is a schematic diagram of the configuration of the arithmetic unit used in the first embodiment. FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit shown in FIG. 3. FIGS. 5(a) and 5(b) are diagrams for explaining the arithmetic unit for reinforcement learning according to the second embodiment of the present invention. FIG. 6 is a diagram for explaining the arithmetic unit according to the third embodiment of the present invention. FIGS. 7(a) and 7(b) are graphs showing the state of learning in an example of the present invention. FIG. 8(a) is a diagram for explaining the preprocessing unit of the example, and FIG. 8(b) is a graph showing the state of learning in the example.

The first, second, and third embodiments of the present invention will be described below. Note that the first to third embodiments illustrate the technical idea, configuration, procedure, and the like of the present invention, and do not limit the specific configuration, conditions, parameters, and the like. First, prior to describing the first to third embodiments, reservoir computing (hereinafter also referred to as "RC") will be described.
FIG. 1 is a diagram for explaining the form of a general RC model. The RC model 10 shown in FIG. 1 is composed of an input layer Ri in which an input signal, which is the input information, is coupled to each neuron, an intermediate layer Rr in which the neurons are coupled to each other, and an output layer Ro in which the signals of the neurons are summed and output. Coupling an input signal to each neuron is also referred to herein as "input". Equation (1) determines the input signal u(n) input to the input layer Ri, and Equation (2) determines the output signal y(n) output from the output layer Ro when the input signal u(n) is input to the input layer Ri.

[Equations (1) and (2)]

In Equations (1) and (2), N is the number of neurons, x_i(n) is the state of the i-th neuron at time step n, and Ω_ij, m_i, η_i, and ω_i are coefficients representing, respectively, the mutual coupling between neurons, the coupling of the input signal to each neuron, the coupling of the feedback (FB) signal from the output signal to each neuron, and the coupling from each neuron to the output signal. f(·) represents the nonlinear response of each neuron; tanh(·) and the like are frequently used. A major difference between the RC network and a general recurrent neural network (hereinafter also referred to as "RNN") is that the networks of the input layer Ri and the intermediate layer Rr are fixed and the only variables used for learning are the weighting coefficients of the output layer Ro. Since the RC method can greatly reduce the number of variables to be learned, it has a great advantage for time-series learning, which involves a large amount of data and requires high-speed processing.
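The patent's equations appear as images. Based on the coefficient definitions just given, a standard echo-state form consistent with those definitions would read as follows (a reconstruction under that assumption, not necessarily the patent's exact notation):

$$x_i(n+1) = f\left(\sum_{j=1}^{N} \Omega_{ij}\,x_j(n) + m_i\,u(n) + \eta_i\,y(n)\right) \quad (1)$$

$$y(n) = \sum_{i=1}^{N} \omega_i\,x_i(n) \quad (2)$$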
[First Embodiment]
(Reinforcement learning by RC (Q-learning))
FIG. 2 is a diagram for explaining the RC model 10 of the computing system having the computing device 112 for reinforcement learning according to the first embodiment. The RC model 10 of the first embodiment has an agent 111, a state measuring device 113, an arithmetic device 112, and a behavior control device 114. The RC model 10 learns, in the arithmetic device 112, the policy used when the agent 111 executes a certain task. The state of the agent 111 at time t is represented by s_t, and the agent 111 can take a plurality of actions a_t according to the state s_t. The agent 111 obtains a reward R_t as a result of the action a_t. The state s_t is measured by the state measuring device 113, and the action a_t is controlled by the behavior control device 114 so as to realize an action a_t in line with the policy.
As a specific example of the above configuration, a model for controlling a multi-joint robot arm can be considered. In the control of a robot arm, the state s_t corresponds to sensor data (position, angle, acceleration, etc.) of each joint of the robot arm. The action a_t corresponds to a control signal (amount of rotation, amount of displacement, etc.) for driving the robot arm. The reward R_t is a virtual value obtained in software when the robot arm grasps an object or performs a desired action. In a system that controls a robot arm, the state measuring device 113 corresponds to a device such as an ammeter or a voltmeter that acquires sensor data such as current or voltage, and the behavior control device 114 corresponds to a driving device such as a voltage source or a current source.

The configuration shown in FIG. 2 can also be applied to learning games and shogi. In such a learning system, the state s_t is an image of the screen or the board, and the action a_t corresponds to a possible move. The reward R_t corresponds to points or the like obtained by the action. Note that the method of the first embodiment is applicable regardless of such task-dependent inputs and control amounts, and can be widely used for tasks other than those described above.
The computing device 112 determines the behavior of the agent 111 based on the information of the state s_t, the action a_t, and the reward R_t. The method is described below. FIG. 3 is a schematic diagram of the configuration of the arithmetic unit 112 used in the first embodiment. The computing unit 112 includes an RC algorithm 211 and an action decision algorithm 212. The RC algorithm 211 has the RC model 10 and performs reinforcement learning using an RC network. In the first embodiment, the state s_t measured via the state measuring device 113 is regarded as the input signal u(t) at time t in Equation (1) and is input to the arithmetic device 112. This signal undergoes the transformation described by Equation (2) by the RC algorithm 211 and the action decision algorithm 212 and is transformed into an intermediate state x(t) and a final output y(t). Here, the dimensions of the input and output of the RC model are set to be equal to the dimensions of the state s_t and the action a_t.

The only learning variable is ω; the other variables m and Ω are generated according to the configuration of the physical medium 115 (FIG. 4) described later and are not learned. As a result, the first embodiment has the excellent effect of being able to learn with a small number of parameters compared with an ordinary deep learning model. In the first embodiment, the RC algorithm 211 learns the variable ω of Equation (2) such that the output y(t) of the RC algorithm 211 becomes the action-value function Q(s_t, a_t), an index representing the value of the action a_t in the state s_t. The action decision algorithm 212 decides the action a_t based on the output action-value function Q(s_t, a_t) and outputs it to the behavior control device 114 as information on the action a_t. The behavior control device 114 controls the behavior of the agent 111 based on the action a_t.
In the configuration described above, the agent 111 corresponds to the target, the information about the state corresponds to the state s_t, and the information about the policy corresponds to the action a_t. The state s_t is input to the input layer Ri of the RC model 10, and in the intermediate layer Rr the agent's action a_t is learned based on the state s_t. The action a_t obtained as a result of the learning is output to the action decision algorithm 212. The action decision algorithm 212 determines the action a_t of the agent 111 from among the possible actions a_t and outputs it to the behavior control device 114. The RC model 10 is a computational processing model simulating neurons, and the RC algorithm 211 is a program for realizing computational processing based on the RC model 10 and outputting the action-value function Q(s_t, a_t), which includes the reward R_t. The behavior determination algorithm 212 is a program that determines the behavior of the agent 111 from the results of processing performed by the RC algorithm 211. The arithmetic device 112 is a concept that includes software including such programs and hardware for operating the programs. In the first embodiment, the RC model 10 corresponds to the computer system.

The above configuration also executes a computation method for reinforcement learning for learning the action a_t used when the agent 111 executes a task. This computation method for reinforcement learning includes, for the RC model 10, a step of inputting the state s_t to the input layer Ri, a step of learning the action a_t of the agent 111 based on the state s_t, and a step of outputting information about the behavior of the agent 111 based on the action a_t obtained by the learning. In the step of learning the action a_t of the agent 111, at least a part of the process of learning the policy for the action a_t is performed by the physical medium 115, which physically transforms the state s_t.
Learning is performed, for example, so as to minimize a TD (Temporal Difference) error L defined by the following Equation (3).

[Equation (3)]

In Equation (3), γ is the discount rate, a hyperparameter indicating how much future rewards are discounted. Typically, a value smaller than and close to 1, such as 0.99, is used. Since the only learning variable is ω, the following Equation (4) holds.

[Equation (4)]
From this, ω is updated along its gradient by, for example, the following Equation (5).

[Equation (5)]

Here, λ in Equation (5) is the learning rate. Equation (5) is an update rule based on simple gradient descent, but various optimization algorithms used in the machine learning field, such as the Adam optimizer and stochastic gradient descent (SGD), can also be used. In Equation (3), the TD error for one step was used, but the TD error over n steps may be used instead; in that case, the following Equation (6) is used as the cost function.

[Equation (6)]
By setting n in Equation (6) to a value greater than 1, the stability and convergence of learning can be improved. Desirably, n is a value of 10 or less. By considering the partial derivative of Equation (6) with respect to ω, the update rule can be derived in the same manner as Equations (5) and (6).
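The patent's Equations (3) to (6) also appear as images. A common one-step Q-learning form consistent with the surrounding description (a squared TD error, ω as the only learned variable, λ as the learning rate, and an n-step variant for Equation (6)) would read as follows (a reconstruction under those assumptions, not necessarily the patent's exact expressions):

$$L = \bigl(R_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\bigr)^2 \quad (3)$$

$$\frac{\partial L}{\partial \omega} = -2\,\bigl(R_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\bigr)\,\frac{\partial Q(s_t, a_t)}{\partial \omega} \quad (4)$$

$$\omega \leftarrow \omega - \lambda\,\frac{\partial L}{\partial \omega} \quad (5)$$

$$L_n = \Bigl(\sum_{k=0}^{n-1} \gamma^{k} R_{t+k} + \gamma^{n} \max_{a} Q(s_{t+n}, a) - Q(s_t, a_t)\Bigr)^2 \quad (6)$$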
In general deep reinforcement learning, experience replay, in which the data series is sampled at random, is often used for sampling the training data; however, the first embodiment uses an RC model, which is a kind of RNN, so such a method cannot be used. To avoid this, it is possible to use a method of inputting the time-series data (episode) of one trial as it is, a sampling method of extracting data of a certain length from within an episode, or a method of specifying a certain time within an episode and reproducing the internal state with the preceding data. The method of inputting the time-series data (episode) of one trial as it is and the sampling method of extracting data of a certain length from within an episode are described, for example, in Matthew Hausknecht and Peter Stone (11 Jan 2017), Deep recurrent Q-learning for partially observable MDPs, The University of Texas at Austin (hereinafter also referred to as "Reference 1"). The method of specifying a certain time within an episode and reproducing the internal state with the preceding data is described, for example, in Steven Kapturowski et al. (2019), Recurrent experience replay in distributed reinforcement learning, published as a conference paper at ICLR (hereinafter also referred to as "Reference 2").
For action selection, the ε-greedy method, for example, is used. The ε-greedy method selects an action according to a policy π with probability ε, and with probability (1−ε) selects the action with the highest action value function Q(s, a) computed as described above. The policy π is typically generated from a uniform probability distribution so that all possible actions are selected with equal probability. The arithmetic unit 112 requires substantial computational resources for the numerical computation described by equation (1), and therefore demands more power and execution time as the network grows. In the first embodiment, of the RC model in FIG. 3, the computation corresponding to equation (1) is performed by the physical medium 115.
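A minimal sketch of the ε-greedy selection described above, assuming the Q-values have already been computed from the reservoir readout; the uniform policy π is realized here by a uniform random draw over the actions.

    import numpy as np

    def epsilon_greedy(q_values, epsilon, rng=None):
        """With probability epsilon, act according to the uniform policy pi;
        otherwise take the action with the highest action value Q(s, a)."""
        rng = rng or np.random.default_rng()
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))   # uniform policy pi over all actions
        return int(np.argmax(q_values))               # greedy action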
The physical medium 115 may, for example, receive input information as light and transform physical parameters of the input light such as its amplitude, wavelength, or frequency. The physical medium 115 may instead receive input information as an electrical signal and transform physical parameters such as its value or frequency. Alternatively, a liquid may be used as the input, with physical parameters such as pressure or flow rate being transformed. The transformation is not limited to converting a numerical value without changing the type of parameter; it also includes conversion into other parameters. Learning refers to accumulating or selecting information about actions so that an action a_t that is more likely to accomplish the task can be selected according to the state s_t.
FIG. 4 is a diagram for explaining the hardware configuration of the arithmetic unit 112 of the first embodiment, in which the computation corresponding to equation (1) is performed by the physical medium 115. The arithmetic unit 112 shown in FIG. 4 includes signal conversion devices 116 and 117, the physical medium 115, an electronic arithmetic unit 119, and a storage unit 118. The input information s_t from the state measuring device 113 is converted from digital information into an analog physical signal via the signal conversion device 116. The converted physical signal is input to the physical medium 115. Here, a physical signal refers to, for example, a current, a voltage, or an optical intensity. The conversion into a physical signal is desirably performed with a signal scheme suited to the configuration of the physical medium 115. Signal propagation in the physical medium 115 is described by the physical laws governing each system, and it has been reported, for example in Non-Patent Literature 1 cited above, that computations equivalent to equation (1) can be carried out in a variety of physical systems.
The signal that has propagated through the physical medium 115 is measured by the signal conversion device 117 and converted back into digital information. This measured signal can be regarded as equivalent to x(t) in equation (1). The measured signal is held in the storage unit 118, which comprises a memory, a hard disk, or the like. The storage unit 118 also defines and stores the programs and parameters for executing equations (2) through (5). The information, including the measured signal, programs, and parameters, is transferred to the electronic arithmetic unit 119, which comprises a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like, and the electronic arithmetic unit 119 executes reinforcement learning by computing equations (2) through (5). With this configuration, the first embodiment has the advantage that equation (1) is computed automatically during signal propagation. Moreover, because the computation corresponding to equation (1) is carried out automatically in the physical medium 115 by the laws of physics, the first embodiment can reduce or resolve the problem of increasing computational resources.
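The division of labor described above can be sketched as follows, with the physical medium replaced by a numerical stand-in; in the actual device the step() call corresponds to the DAC/modulator (116), propagation through the medium 115, and the receiver/ADC (117), while the readout and learning of equations (2) to (5) remain on the electronic side. The state dimensions and couplings below are placeholders.

    import numpy as np

    class ReservoirStandIn:
        """Numerical stand-in for the physical medium 115.  In hardware, step()
        corresponds to: DAC/modulator (116) -> propagation in the medium ->
        receiver/ADC (117), which returns the measured state x(t)."""

        def __init__(self, n_nodes=64, n_inputs=4, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.uniform(-1, 1, (n_nodes, n_inputs))                 # fixed input coupling
            self.W = rng.normal(0, 1 / np.sqrt(n_nodes), (n_nodes, n_nodes))    # fixed internal coupling
            self.x = np.zeros(n_nodes)

        def step(self, u_t):
            self.x = np.tanh(self.W @ self.x + self.W_in @ u_t)
            return self.x

    # one interaction step: the medium produces x(t); the electronic unit (119)
    # evaluates the readout and performs the learning of equations (2)-(5).
    reservoir = ReservoirStandIn()
    omega = np.zeros((2, 64))                  # readout weights, the only trained variables
    s_t = np.array([0.0, 0.1, 0.0, -0.1])      # hypothetical 4-dimensional state
    x_t = reservoir.step(s_t)                  # the "equation (1)" side
    q_t = omega @ x_t                          # readout computed electronically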
Next, the physical medium 115 of the first embodiment will be described. The physical medium 115 may be a known one; the first embodiment uses a configuration in which an optical circuit serves as the physical medium 115 for the computation. Computation using such optical circuits is described, for example, in J. Bueno et al., "Reinforcement learning in a large-scale photonic recurrent neural network," Optica 5, 756-760 (2018) (hereinafter also referred to as "Reference 3"); L. Larger et al., "Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing," Optics Express 20, 3241 (2012) (hereinafter also referred to as "Reference 4"); and M. Nakajima et al., "Scalable reservoir computing on coherent linear photonic processor," Communications Physics 4, 20 (2021) (hereinafter also referred to as "Reference 5"). Because high-speed optical input and output exceeding 100 Gbit/s is possible via optical communication devices, the first embodiment can perform the computation at high speed.
To input and output light at high speed, the signal conversion device 116 comprises a digital-to-analog converter (DAC) and an optical modulator; it converts the digital signal into optical intensity or optical phase and introduces it into the optical physical medium 115. To convert the output light back into a digital signal, the signal conversion device 117 comprises an optical receiver and an analog-to-digital converter (ADC). The optical physical medium 115 may be, for example, a spatial optical system composed of lenses, mirrors, and spatial modulators as described in Reference 3, or a configuration using an optical fiber ring as described in Reference 4. While such configurations make the apparatus relatively large, it is also possible, as described in Reference 5, to integrate a compact arithmetic circuit by using a photonic integrated circuit.
The physical medium 115 is preferably a nonlinear optical element so that nonlinear transformation is possible. The nonlinear element is realized, for example, by forming the physical medium 115 from a nonlinear optical material (for example, LiNbO3) or by exploiting the gain saturation of an optical amplifier. In the first embodiment, the physical medium 115 is preferably implemented physically so that the spectral radius (the maximum absolute value of the eigenvalues) of the matrix Ω in equation (1) falls in the range of 0.8 to 1.2. With this configuration, the first embodiment can operate near the chaotic transition point and can enhance the memory-retention performance of the RC. When the physical medium 115 is an optical medium containing no gain medium, operation near the chaotic transition corresponds to operation achieved by a configuration that reduces optical loss as much as possible. When the physical medium 115 includes a gain medium such as an amplifier, the memory-retention capability can be improved by adjusting the gain.
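As a numerical sketch of the spectral-radius condition above: in simulation the coupling matrix can simply be rescaled to a target spectral radius, whereas in the physical implementation the same condition corresponds to adjusting loss or gain. The matrix size and distribution below are placeholders.

    import numpy as np

    def scale_to_spectral_radius(W, target):
        """Rescale a coupling matrix so that its spectral radius (the maximum
        absolute value of its eigenvalues) equals `target`, e.g. 0.8-1.2."""
        rho = np.max(np.abs(np.linalg.eigvals(W)))
        return W * (target / rho)

    rng = np.random.default_rng(0)
    Omega = scale_to_spectral_radius(rng.normal(size=(64, 64)), target=1.0)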
The first embodiment is not limited to performing all of the functions equivalent to the computation of equation (1) in the physical medium 115. For example, only part of the result given by equation (1) may be obtained by the physical medium 115, and part of the computation represented by equation (1) may be performed by the electronic arithmetic unit 119. More specifically, part of the computation in equation (1) may, for example, be carried out in advance by the electronic arithmetic unit 119 or the like. Such preprocessing is described, for example, in Reference 4. In the preprocessing described in Reference 4, the second term of equation (1) (the input mask processing) is computed in advance, and the result is used as the input signal to the physical medium 115. With such preprocessing, the physical medium 115 remains easy to implement even when the state s_t and the intermediate state x(t) have high dimensionality.
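The following sketch illustrates that kind of input-mask preprocessing under the assumption that the second term of equation (1) is a fixed input coupling applied to u(t); the mask values and dimensions are placeholders. Only this low-cost product is evaluated electronically, and its result is what gets converted into the drive signal for the physical medium 115.

    import numpy as np

    rng = np.random.default_rng(0)
    n_nodes, n_inputs = 64, 4
    input_mask = rng.uniform(0, 1, (n_nodes, n_inputs))   # fixed coupling, never trained

    def precompute_input_term(u_t):
        """Evaluate the input-coupling term of equation (1) electronically in
        advance; the result is then converted into the physical drive signal."""
        return input_mask @ u_t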
Another form of the physical medium 115 is, for example, a memristor such as ReRAM. This form exploits the physical phenomenon in which a resistance value changes according to the history of the current that has passed through the resistive cell. In this configuration, a DAC and an ADC are used for the signal conversion devices 116 and 117. Because such a configuration is implemented with analog electronic circuits, it has high interface affinity and is advantageous for miniaturization. Such a configuration is described, for example, in Yanan Zhong et al., "Dynamic memristor-based reservoir computing for high-efficiency temporal signal processing," Nature Communications 12, 408 (2021) (hereinafter referred to as "Reference 6"). Memristors are also advantageous in that they can be formed by semiconductor processes and therefore lend themselves to large numbers of nodes. In addition to these configurations, magnetic circuits, fluids, soft materials, and the like can also be used as forms of the physical medium 115. This point is summarized, for example, in Reference 1.
As described above, the first embodiment applies physical RC to reinforcement learning, which involves a large amount of computation, and can effectively reduce the required computational resources. This suppresses the increase in size and cost of an arithmetic unit for reinforcement learning. Alternatively, when the miniaturization of the arithmetic unit for reinforcement learning leaves computational resources to spare, more advanced processing can be executed. The first embodiment can thus provide an arithmetic unit and a method for reinforcement learning that execute reinforcement learning with high efficiency.
[Second embodiment]
(Reinforcement learning by RC (Actor-critic type))
Next, a second embodiment of the present invention will be described. The arithmetic unit for reinforcement learning of the second embodiment differs from the first embodiment, in which the policy π was generated with uniform probability and not learned (off-policy learning), in that the policy π itself is also learned (on-policy learning). The second embodiment, which performs on-policy learning, allows faster convergence to a solution than the first embodiment.
FIGS. 5(a) and 5(b) are diagrams for explaining the arithmetic units 512 and 513 for reinforcement learning of the second embodiment. FIG. 5(a) is a schematic diagram of the reinforcement learning arithmetic unit 512 of the second embodiment. FIG. 5(b) is a schematic diagram of an arithmetic unit 513 that is a modification of the arithmetic unit 512 shown in FIG. 5(a). The arithmetic unit 512 differs from the arithmetic unit 112 of the first embodiment in that it includes an RC algorithm 311 having two RC models 10a and 10b, and an action determination algorithm 312 that selects actions differently from the action determination algorithm 212.
The signal flow in the arithmetic unit 512 of the second embodiment shown in FIG. 5(a) is the same as that of the arithmetic unit 112 of FIG. 2 described in the first embodiment. However, the arithmetic unit 512 differs from the arithmetic unit 112 in the configuration of the RC network and in the learning method. In the configuration shown in FIG. 5(a), the state s_t input from the state measuring device 113 is fed as the input signal u(t) in equation (1) to both RC models 10a and 10b. The input and output dimensions of the RC models 10a and 10b are set equal to the dimensions of the state s_t and the action a_t, respectively. The only learning variables in the RC algorithm 311 are the output-layer weights ω_a and ω_c of the RC models 10a and 10b. The RC algorithm 311 therefore has the advantage of being able to learn with fewer parameters than an ordinary deep learning model. The RC models 10a and 10b execute different processes called the "actor" and the "critic," respectively.
Here, the actor determines the policy that the agent 111 can take, and the critic estimates the value of the state by gathering information from the state s_t output by the state measuring device 113. The RC algorithm 311 performs learning so that the actor-side output becomes the policy π(s_t, a_t) and the critic-side output becomes the state value function V(s_t).
Unlike the first embodiment, the action determination algorithm 312 of the arithmetic unit 512 selects the action of the agent 111 shown in FIG. 2 based on the policy π learned on the actor side. Learning is performed, for example, so as to minimize the actor-side and critic-side cost functions L_a and L_c defined by the following equations (7) and (8), respectively.
[Equations (7) and (8)]
By taking the partial derivatives of equations (7) and (8) with respect to ω_a and ω_c, update rules can be derived in the same manner as equation (5). To stabilize learning, a cost function that looks n steps ahead, as in the following equation (9), may be used for L_c.
[Equation (9)]
By setting n in equation (9) to a value greater than 1, the stability and convergence of learning can be improved. It is desirable to set n to a value of 10 or less, and by taking the partial derivative of equation (9) with respect to ω_c, an update rule can be derived in the same manner as equation (5). An arithmetic unit 512 of this configuration can be realized by connecting the two RC models 10a and 10b in parallel and having the action determination algorithm 312 decide the action of the agent 111 by combining the information on the policy with the information on the value.
In the above, different RC models 10a and 10b were used for the actor side and the critic side, but as in the RC algorithm 411 of FIG. 5(b), the input layer and the intermediate layer may be shared and only the output layers may differ, as in the sketch below. In this way, the computation result of the same physical medium 115 can be shared within the RC algorithm 411, so that, as in the first embodiment, the computation can be executed even by the arithmetic unit 513 having a single RC model 10.
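The following sketch shows how two readout heads can share one reservoir state, so that a single physical-medium computation serves both the actor and the critic. The softmax parameterization of π is an assumption for illustration; the description above only states that the actor outputs the policy and the critic the state value.

    import numpy as np

    n_nodes, n_actions = 64, 2
    rng = np.random.default_rng(0)
    omega_a = 0.01 * rng.standard_normal((n_actions, n_nodes))  # actor readout weights
    omega_c = 0.01 * rng.standard_normal(n_nodes)               # critic readout weights

    def actor_critic_heads(x_t):
        """Both heads read the same reservoir state x_t, so one physical-medium
        computation is shared.  The actor head parameterizes pi(a|s_t) via a
        softmax (an assumption); the critic head gives V(s_t)."""
        logits = omega_a @ x_t
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()
        value = float(omega_c @ x_t)
        return policy, value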
When implemented as hardware, the RC algorithms 311 and 411 shown in FIGS. 5(a) and 5(b) are both configured, like the arithmetic unit 112 shown in FIG. 4 of the first embodiment, to include the signal conversion devices 116 and 117, the physical medium 115, the storage unit 118, and the electronic arithmetic unit 119.
[Third embodiment]
(Addition of random convolutional layers)
Next, a third embodiment of the present disclosure will be described. FIG. 6 is a diagram for explaining the arithmetic unit 612 of the third embodiment. The arithmetic unit 612 has the RC algorithm 211 and the action determination algorithm 212 shown in FIG. 2 of the first embodiment, together with a preprocessing unit 613 provided upstream of the RC algorithm 211. The preprocessing unit 613 has the function of performing preprocessing with a random convolution layer and a pooling layer, converting the state s_t output from the state measuring device 113 into a state s_t'. The RC algorithm 211 executes the RC model computation based on the state s_t'.
The preprocessing unit 613 has two layers: a kernel filter that performs convolution and a kernel filter that performs pooling. Convolution is a process that takes the pixel values of a small region of an image and compresses that region into a smaller one as a single feature, reducing the amount of image data. Pooling is a process that thins out the image to reduce the amount of data. The coefficients of the kernel filters in the random convolution layer are generated, for example, from a table of random numbers in the interval [-1, 1] and are not trained afterward. The RC algorithm 211 of the third embodiment can therefore process the state s_t' in the same way that the first and second embodiments process the state s_t. When high-dimensional input information corresponding to a large number of pixels, such as from games or on-board cameras, must be handled, the third embodiment can compress the dimensionality in advance with the convolutional neural network and reduce the amount of processing in the RC algorithm 211.
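A minimal sketch of such a fixed (untrained) random convolution followed by pooling is shown below. The kernel size, channel count, and pooling size are placeholders; only the ideas taken from the description above are the U[-1, 1] random kernels, the absence of training, and the pooling that thins out the image before the compressed state is passed to the RC algorithm.

    import torch
    import torch.nn as nn

    class RandomConvPreprocessor(nn.Module):
        """Fixed random convolution + pooling used only to compress s_t into s_t'
        before the RC algorithm.  Kernels are drawn from U[-1, 1] and never trained."""

        def __init__(self, in_channels=4, out_channels=16):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=8, stride=4, bias=False)
            self.pool = nn.MaxPool2d(2)
            nn.init.uniform_(self.conv.weight, -1.0, 1.0)
            for p in self.parameters():
                p.requires_grad = False      # no learning in the preprocessor

        def forward(self, s_t):
            return self.pool(self.conv(s_t)).flatten(1)

    s_t = torch.rand(1, 4, 84, 84)               # hypothetical stacked-frame input
    s_t_prime = RandomConvPreprocessor()(s_t)    # compressed state fed to the RC algorithm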
(Verification by numerical calculation)
An example of reinforcement learning performed with the first to third embodiments described above will now be described. In this example, a task called the cart-pole problem, commonly used in reinforcement learning tests, is executed in computer simulation. The goal of this task is to control the position of a cart moving in a one-dimensional space so that a pole standing on the cart does not fall over. The state s_t consists of the velocity, position, angle, and angular velocity of the cart at each time step. The action a_t selects whether the pole is pushed to the left or to the right at that time. If the angle tilts by more than ±20.9° or the position shifts by more than ±2.4, the trial fails and a reward R_t of -200 is given; otherwise, a reward of +1 is obtained at each time step. To implement this task, this example applies the arithmetic unit of the second embodiment, which has the two RC models of the actor and the critic.
An optical medium is adopted as the physical medium of the RC model, and tanh(), based on the saturation behavior of a semiconductor optical amplifier (SOA), is used as the nonlinear function. The input coupling m is, for example, computed in advance before being introduced into the physical medium 115, and is generated from uniform random numbers on the interval [0, 1]. The physical medium 115 is assumed to be constructed on a fiber basis, and Ω, which corresponds to the mutual coupling within the medium, reflects a ring topology as in the following equation. Such a medium is described, for example, in Francois Duport et al., "All-optical reservoir computing," Optics Express, Vol. 20, No. 20, 22783-22795 (2012) (hereinafter referred to as "Reference 7").
[Equation: ring-topology coupling matrix Ω]
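A hedged numerical sketch of such a ring-coupled reservoir is shown below, assuming each node receives the previous node's output scaled by the loop gain α (the exact coupling matrix of the equation above is not reproduced); tanh stands in for the SOA saturation, and the state and input dimensions are placeholders. For this circular coupling the spectral radius is approximately α, consistent with the relationship described next.

    import numpy as np

    def ring_reservoir_step(x, u_t, input_mask, alpha):
        """One update of a ring-topology reservoir: each node receives the previous
        node's output scaled by the loop gain alpha, plus the masked input."""
        return np.tanh(alpha * np.roll(x, 1) + input_mask @ u_t)

    n_nodes = 64
    rng = np.random.default_rng(0)
    input_mask = rng.uniform(0, 1, (n_nodes, 4))   # input coupling m drawn from U[0, 1]
    x = np.zeros(n_nodes)
    x = ring_reservoir_step(x, np.array([0.0, 0.1, 0.0, -0.1]), input_mask, alpha=0.9)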
The gain α of the light returning around the ring is approximately equal to the spectral radius of Ω. Learning in the RC model 10 is performed by the method described in equations (7) and (9), with n = 5 and γ = 0.99 in equation (9) and the number of nodes of the RC model 10 set to 64.
FIGS. 7(a) and 7(b) are graphs showing the learning behavior in this example. In FIG. 7(a), the horizontal axis shows the number of trials and the vertical axis shows the number of consecutive steps for which the pole was kept from falling. The solid line in FIG. 7(a) shows learning with the RC model of this example. The dashed line shows, for comparison, the result of the same learning performed with a three-layer fully connected neural network. As is clear from FIG. 7(a), in this example the number of steps for which the pole is kept upright rises within fewer trials than for the dashed line, so convergence is faster than with the fully connected neural network. Furthermore, with the RC algorithm the only learning variable is ω and the other variables are processed by the physical medium 115, so learning is possible with few parameters.
FIG. 7(b) shows the relationship between the gain α of the light returning around the ring and the average number of trials required for convergence. Here, convergence is defined as succeeding five times in a row at keeping the pole upright for 200 steps. As is clear from FIG. 7(b), when α is 1.2 or more the number of trials to convergence increases, whereas for α between 0.6 and 1.0 the number of trials required for convergence is small and stable. This reflects the tendency that when α exceeds 1.2 the behavior of the optical medium becomes chaotic, and that the performance of the RC algorithm increases near the chaotic transition point of the optical medium. It is therefore desirable to set the spectral radius of the RC algorithm to as large a value as possible within the range that does not reach the chaotic transition point.
As another learning example in this example, the computer game Pong (Atari) is used as the task, with the goal of playing against the computer and winning. In the RC model, the state s_t is first processed by the preprocessing unit described in the third embodiment and then input to the RC model. FIG. 8(a) is a diagram for explaining the preprocessing unit 613 used in this example. FIG. 8(b) is a graph showing the learning behavior. The preprocessing unit 613 has a convolution filter with a plurality of intermediate layers 3130, 3131, and 3133, and a pooling layer 3135. The arithmetic unit having the RC model 10 is provided after the convolution filter. In the intermediate layer 3130, an 8×8 kernel filter 3137 performs convolution on the 84×84×4 input data while sliding by 4×4 pixels. In the intermediate layer 3131, a 4×4 kernel filter 3132 performs convolution on the 20×20×32 data while sliding by 2×2 pixels. In the intermediate layer 3133, a 3×3 kernel filter 3134 performs convolution on the 9×9×64 data while sliding by 1×1 pixel. A feature map is output each time convolution is performed. The feature maps are reduced in pixel count by the pooling layer 3135 and then combined to form the combined layer 3136; a shape check for this stack is sketched below.
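The sketch below only verifies the tensor shapes implied by the layer description above. The strides are inferred from the stated slide amounts, while the third layer's output channel count (64), the omission of bias, activation, and the subsequent pooling/combination are assumptions made for illustration.

    import torch
    import torch.nn as nn

    stack = nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4, bias=False),   # 84x84x4  -> 20x20x32
        nn.Conv2d(32, 64, kernel_size=4, stride=2, bias=False),  # 20x20x32 -> 9x9x64
        nn.Conv2d(64, 64, kernel_size=3, stride=1, bias=False),  # 9x9x64   -> 7x7x64
    )
    for p in stack.parameters():
        p.requires_grad = False                                   # kernels stay fixed

    print(stack(torch.rand(1, 4, 84, 84)).shape)                  # torch.Size([1, 64, 7, 7])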
Learning was performed by the method shown in equations (7) and (9), with n = 5 and γ = 0.99 in equation (9) and 4000 RC nodes. The horizontal axis of FIG. 8(b) shows the number of trials, and the vertical axis shows the score difference between the subject and the computer; a positive score difference indicates that the subject won. As is clear from FIG. 8(b), the score difference increases as the number of trials increases and thereafter converges stably. These results show that even a comparatively difficult task such as game play can be learned using the RC model of the present disclosure.
10, 10a, 10b: RC model
111: Agent
112, 512, 513, 612: Arithmetic unit
113: State measuring device
114: Action control device
115: Physical medium
116, 117: Signal conversion device
118: Storage unit
119: Electronic arithmetic unit
211, 311, 411: RC algorithm
212, 312, 412: Action determination algorithm
613: Preprocessing unit
3130, 3131, 3133: Intermediate layer
3132, 3134, 3137: Kernel filter
3135: Pooling layer
3136: Combined layer

Claims (7)

  1. An arithmetic unit for reinforcement learning that learns a policy by which a subject performs a task, comprising:
     a computer system having an input layer that receives input information about a state of the subject, an intermediate layer that learns the policy of the subject based on the input information input to the input layer, and an output layer that outputs output information about the policy; and
     an action determination unit that outputs information about an action of the subject based on the output information output from the computer system,
     wherein the computer system comprises a physical medium that physically transforms the input information and executes at least part of the function of learning the policy of the subject.
  2. The arithmetic unit for reinforcement learning according to claim 1, wherein the computer system comprises an actor unit that determines the policy of the subject based on the input information, and a critic unit that estimates a value of the state based on the input information.
  3. The arithmetic unit for reinforcement learning according to claim 2, comprising a first computer system functioning as the actor unit and a second computer system functioning as the critic unit.
  4. The arithmetic unit for reinforcement learning according to any one of claims 1 to 3, further comprising, upstream of the computer system, a preprocessing unit that compresses the input information.
  5. The arithmetic unit for reinforcement learning according to any one of claims 1 to 4, wherein the physical medium is a nonlinear optical element.
  6. The arithmetic unit for reinforcement learning according to any one of claims 1 to 5, wherein the computer system comprises an input conversion unit that converts the input information in accordance with the physical medium, an output conversion unit that converts the output information output from the physical medium in accordance with subsequent arithmetic processing, and a computation unit that executes arithmetic processing for reinforcement learning based on the output information converted by the output conversion unit, and wherein the information about the action of the subject is output based on a computation result of the computation unit.
  7. An arithmetic method for reinforcement learning that learns a policy by which a subject performs a task, using a computer system having an input layer that receives information, an intermediate layer that executes computation based on input information input to the input layer, and an output layer that outputs a result of the computation, the method comprising:
     inputting input information about a state of the subject into the input layer;
     learning, by the intermediate layer, the policy of the subject based on the input information; and
     outputting information about an action of the subject based on the information about the policy obtained by the learning,
     wherein at least part of the step of learning the policy of the subject is executed by a physical medium that physically transforms the input information.
PCT/JP2021/021969 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method WO2022259434A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/021969 WO2022259434A1 (en) 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method
JP2023526735A JPWO2022259434A1 (en) 2021-06-09 2021-06-09

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/021969 WO2022259434A1 (en) 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method

Publications (1)

Publication Number Publication Date
WO2022259434A1 true WO2022259434A1 (en) 2022-12-15

Family

ID=84425968

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/021969 WO2022259434A1 (en) 2021-06-09 2021-06-09 Computing device for reinforcement learning and reinforcement learning method

Country Status (2)

Country Link
JP (1) JPWO2022259434A1 (en)
WO (1) WO2022259434A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018200391A (en) * 2017-05-26 2018-12-20 日本電信電話株式会社 Optical signal processing circuit

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018200391A (en) * 2017-05-26 2018-12-20 日本電信電話株式会社 Optical signal processing circuit

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KANNO, KAZUTAKA; UCHIDA, ATSUSHI: "N-2-5 Reinforcement Learning using Delay-Based Reservoir Computing", PROCEEDINGS OF THE IEICE ENGINEERING SCIENCES SOCIETY/NOLTA SOCIETY CONFERENCE, 30 November 2019 (2019-11-30), JP , pages 123, XP009541833, ISSN: 2189-700X *
KOBAYASHI, TAISUKE: "Multi-Objective Switchable Reinforcement Learning by using Reservoir Computing. The Japan Society of Mechanical Engineers", THE PROCEEDINGS OF 2017 JSME ANNUAL CONFERENCE ON ROBOTICS AND MECHATRONICS, vol. 2017, 2017, pages 1 - 4 *
KONISHI BUNGO; HIROSE AKIRA; NATSUAKI RYO: "Complex-Valued Reservoir Computing for Interferometric SAR Applications With Low Computational Cost and High Resolution", IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, IEEE, USA, vol. 14, 5 August 2021 (2021-08-05), USA, pages 7981 - 7993, XP011874235, ISSN: 1939-1404, DOI: 10.1109/JSTARS.2021.3102620 *
MATSUKI TOSHITAKA; INOUE SOUYA; SHIBATA KATSUNARI: "Q-learning with exploration driven by internal dynamics in chaotic neural network", 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), IEEE, 19 July 2020 (2020-07-19), pages 1 - 7, XP033831444, DOI: 10.1109/IJCNN48605.2020.9207114 *

Also Published As

Publication number Publication date
JPWO2022259434A1 (en) 2022-12-15

Similar Documents

Publication Publication Date Title
Xin et al. Application of deep reinforcement learning in mobile robot path planning
EP3992857A1 (en) Method and device for generating neural network model, and computer-readable storage medium
Cruz et al. Path planning of multi-agent systems in unknown environment with neural kernel smoothing and reinforcement learning
Rafiq et al. Neural network design for engineering applications
Chen et al. Self-learning exploration and mapping for mobile robots via deep reinforcement learning
CN112119409A (en) Neural network with relational memory
WO2020024172A1 (en) Collaborative type method and system of multistate continuous action space
CN107766292B (en) Neural network processing method and processing system
Ruan et al. A new multi-function global particle swarm optimization
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
Özalp et al. A review of deep reinforcement learning algorithms and comparative results on inverted pendulum system
Fan et al. Neighborhood centroid opposite-based learning Harris Hawks optimization for training neural networks
Liu et al. Neural network control system of cooperative robot based on genetic algorithms
Zhang et al. Evolutionary echo state network for long-term time series prediction: on the edge of chaos
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Acuto et al. Variational quantum soft actor-critic for robotic arm control
WO2022259434A1 (en) Computing device for reinforcement learning and reinforcement learning method
Vatankhah et al. Active leading through obstacles using ant-colony algorithm
Stanton et al. Heterogeneous complexification strategies robustly outperform homogeneous strategies for incremental evolution.
CN114378820B (en) Robot impedance learning method based on safety reinforcement learning
JP2022165395A (en) Method for optimizing neural network model and method for providing graphical user interface for neural network model
Mai-Phuong et al. Balancing a practical inverted pendulum model employing novel meta-heuristic optimization-based fuzzy logic controllers
Naderi et al. Learning physically based humanoid climbing movements
Kadokawa et al. Binarized P-network: deep reinforcement learning of robot control from raw images on FPGA
KR20220166716A (en) Demonstration-conditioned reinforcement learning for few-shot imitation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945110

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023526735

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE