WO2017038104A1 - Installation device and installation method - Google Patents

Installation device and installation method

Info

Publication number
WO2017038104A1
Authority
WO
WIPO (PCT)
Prior art keywords
function
code
processing
source code
execution
Application number
PCT/JP2016/004028
Other languages
French (fr)
Japanese (ja)
Inventor
辰哉 加藤
遼介 奥田
誠也 得居
裕也 海野
将平 比戸
Original Assignee
株式会社Preferred Networks
Priority claimed from JP2015213294A
Application filed by 株式会社Preferred Networks
Publication of WO2017038104A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • the technology described in this specification relates to a mounting apparatus and a mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit.
  • An object is to provide a mounting apparatus and a mounting method for appropriately mounting an algorithm for executing machine learning, designed using a computer having abundant calculation resources, on an embedded system chip (embedded-system semiconductor integrated circuit) having fewer calculation resources than such a computer.
  • An apparatus according to one aspect is a mounting apparatus for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising: first execution means capable of executing, on a first arithmetic unit mounted on the semiconductor integrated circuit, first source code described in a first programming language; second execution means capable of executing, on a second arithmetic unit mounted on the semiconductor integrated circuit, second source code described in a second programming language different from the first programming language; and comparing means for comparing the result of execution by the first execution means of a first specific code included in the first source code with the result of execution by the second execution means of a second specific code that is included in the second source code and is obtained by rewriting the first specific code in the second programming language, and for outputting a comparison result, wherein the second execution means comprises bytecode generation means for generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and function call information at the time of backward processing.
  • According to the above, it is possible to provide a mounting apparatus that appropriately mounts an algorithm for executing machine learning, designed using a computer having abundant computing resources, on an embedded system chip (embedded semiconductor integrated circuit) having scarce computing resources compared to such a computer.
  • FIG. 1 is a schematic diagram conceptually showing a technique called “Define-and-Run” according to the prior art.
  • FIG. 2 is a schematic diagram conceptually showing a technique called “Define-by-Run” according to the embodiment of the present invention.
  • FIG. 3 is a schematic diagram illustrating an example of a network configuration of a neural network.
  • FIG. 4 is a schematic diagram showing another example of the network configuration of the neural network.
  • FIG. 5 is a schematic diagram showing still another example of the network configuration of the neural network.
  • FIG. 6 is a diagram illustrating a pseudo code for realizing the calculation executed in the forward process by the Linear.
  • FIG. 7 is a diagram illustrating a pseudo code for realizing a calculation executed during backward processing by Linear.
  • FIG. 8 is a diagram illustrating a pseudo code for realizing a calculation executed at the time of forward processing by the ReLU.
  • FIG. 9 is a diagram illustrating a pseudo code for realizing a calculation executed during backward processing by the ReLU.
  • FIG. 10 is a diagram illustrating a pseudo code for realizing a calculation executed during forward processing by Convolution 2D.
  • FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
  • FIG. 12 is a block diagram schematically illustrating an example of functions of the learning device according to an embodiment of the present invention.
  • FIG. 13 is a diagram illustrating an example of source code input to the learning device according to the embodiment of the present invention.
  • FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • FIG. 15 is a diagram illustrating an example of a source code described by Caffe according to the related art.
  • FIG. 16 is a diagram illustrating another example of the source code input to the learning device according to the embodiment of the present invention.
  • FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • FIG. 18 is a schematic diagram conceptually showing a network configuration of a neural network generated by a source code described by Caffe according to the prior art.
  • FIG. 19 is a diagram illustrating still another example of the source code input to the learning device according to the embodiment of the present invention.
  • FIG. 20 is a schematic diagram for explaining Step I of the mounting method according to the embodiment of the present invention.
  • FIG. 21 is a schematic diagram for explaining Step II of the mounting method according to the embodiment of the present invention.
  • FIG. 22 is a schematic diagram illustrating a case where an execution unit using Python and an execution unit using a chip communicate with each other.
  • FIG. 23 is a schematic diagram for explaining Step III of the mounting method according to the embodiment of the present invention.
  • FIG. 24 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (first method) according to an embodiment of the present invention.
  • FIG. 25 is a flowchart showing an example of a procedure used in the mounting method according to the embodiment of the present invention.
  • FIG. 26 is a schematic diagram showing an operation state of the embedded chip in the mounting method according to the embodiment of the present invention.
  • FIG. 27 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting technique (second technique) according to an embodiment of the present invention.
  • FIG. 28 is a schematic diagram conceptually showing functions included in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 29 is a schematic diagram illustrating a configuration example of a Native layer execution unit included in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 30 is a diagram illustrating a structure definition example of the multidimensional array module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 31 is a diagram showing the mutual compatibility and reference relationship of multidimensional array data in the multidimensional array module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 32 is a diagram for explaining a memory pool module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 33 is a diagram for explaining a structure definition example of the memory pool module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 34 is a diagram showing a coding example of pipelining in the mounting apparatus according to an embodiment of the present invention.
  • FIG. 35 is a diagram showing an internal state of the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 36 is a diagram showing an example of an execution flow of the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 37 is a diagram showing an example of an execution flow of the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 38 is a diagram showing an example of address setting in the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 39 is a diagram illustrating a specific example of the entire operation in which the Python layer and the Native layer cooperate in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 40 is a diagram illustrating an output of a plurality of network configurations in the bytecode generation unit in the mounting apparatus according to an embodiment of the present invention.
  • FIG. 41 is a diagram illustrating a code example in the byte code generation unit in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 42 is a diagram illustrating a configuration example of the Native I / F according to an embodiment of the present invention.
  • FIG. 43 is a diagram illustrating a configuration example for executing identification / learning by NN according to an embodiment of the present invention.
  • FIG. 44 is a diagram showing a configuration example for managing a multidimensional array according to an embodiment of the present invention.
  • FIG. 45 is a diagram illustrating a configuration example of a data expression conversion unit according to an embodiment of the present invention.
  • FIG. 46 is a diagram illustrating a configuration example of a communication unit according to an embodiment of the present invention.
  • FIG. 47 is a diagram illustrating a configuration example of a floating-point and fixed-point execution unit and a type conversion unit according to an embodiment of the present invention.
  • FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention.
  • FIG. 49 is a diagram illustrating a configuration example of an algorithm execution unit in which a plurality of NN algorithms according to an embodiment of the present invention are merged.
  • FIG. 50 is a diagram illustrating a configuration example of a multi-dimensional array data communication amount reduction unit according to an embodiment of the present invention.
  • FIG. 51 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • FIG. 52 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • FIG. 53 is a diagram illustrating a configuration example of a bytecode generation unit and a virtual machine according to an embodiment of the present invention.
  • FIG. 54 is a diagram illustrating a configuration example of a comparison unit according to an embodiment of the present invention.
  • FIG. 55 is a diagram illustrating a configuration example of a function synthesis unit according to an embodiment of the present invention.
  • Part 1 (Learning device according to the embodiment)
  • 1. Background and Outline: Machine learning algorithms, including deep learning, can often be formulated as a problem of minimizing the sum of loss functions defined for each model.
  • the loss function is an index represented by an error between the model output and the correct answer in a given learning data sample.
  • a series of processes from inputting data into the model until obtaining an output and comparing with the correct answer is called a calculation graph, and the result is defined as a loss function.
  • the loss function minimization problem can be solved by a general method called a gradient method as long as the gradient obtained by differentiating the loss function can be calculated.
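  • In generic notation (an assumption for illustration, not taken from the specification), the minimization problem and the gradient-method update described above can be written as follows.

```latex
% Minimize the sum of per-sample losses over the weights w:
\min_{w}\; L(w) = \sum_{i} \ell\bigl(f(x_i; w),\, t_i\bigr)

% Gradient-method update with learning rate \eta, usable whenever \nabla_w L can be computed:
w \leftarrow w - \eta\, \nabla_{w} L(w)
```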
  • Calculation libraries for this purpose have so far been based on a calculation procedure that the applicant calls "Define-and-Run".
  • This is an approach in which a calculation graph is first defined (Define), a gradient is derived by automatic differentiation, and then learning (Run) is performed using learning data.
  • Under the assumption that the calculation graph contains no complicated control syntax (if, for, etc.) and does not change over time, this approach has the merit that a series of gradient calculations can be compiled and prepared as a batch at the time of Define, and that memory management is unnecessary.
  • In the present embodiment, a new calculation procedure that the applicant calls "Define-by-Run" is proposed.
  • In this approach, the graph structure is dynamically extracted and stored at each learning step (Run), and the gradient is recalculated each time.
  • FIG. 1 is a schematic diagram conceptually showing the technique called "Define-and-Run" according to the prior art, and FIG. 2 is a schematic diagram conceptually showing the technique called "Define-by-Run" according to the embodiment of the present invention.
  • In "Define-and-Run", a mini-programming language processing system first takes only the model definition as input and outputs the calculation procedures of the forward (identification) processing and the backward (learning) processing, which are the entities of the calculation graph (Define step).
  • the forward / backward processing system inputs data and updates parameters (weights) according to the calculation procedure of forward (identification) processing and backward (learning) processing (Run step).
  • In "Define-by-Run", a general-purpose programming language processing system executes the forward (identification) processing while taking the model definition, the input data, and the parameters as input, and at the same time constructs the calculation procedure for the backward (learning) processing.
  • the model definition is defined in conformity with the grammar of a general-purpose programming language such as function call, four arithmetic operations, loop and branch.
  • the calculation procedure of the backward (learning) process can be dynamically changed independently of the execution of the forward (identification) process.
  • Backward processing system can be called at any timing.
  • the Backward processing system updates the parameters from the input data and the results of the Forward processing according to the Backward calculation procedure.
  • the processing performed in the neural network mainly includes forward processing, backward processing, and updating of weights.
  • the forward process is a process for processing and propagating information from the input layer to the output layer of the neural network.
  • Backward processing refers to performing two processes, error back propagation and weight gradient calculation, from the output layer to the input layer of the neural network.
  • The error back propagation is a process of propagating the error (δ) obtained from the output side layer to the input side layer.
  • The weight gradient calculation is a process of obtaining, for a layer having a weight, the weight gradient (∇W) from the error (δ) obtained from the output side layer and the output value of the input side layer.
  • The update of the weight is a process of updating, for a layer having a weight, the weight with an algorithm derived from the stochastic gradient descent method (SGD) using the weight gradient (∇W) obtained by the weight gradient calculation. This weight update is executed once for each unit of batch processing.
  • Each layer constituting a neural network is realized by, for example, one of the layer algorithms listed below:
    - Linear
    - ReLU
    - Dropout
    - Softmax Cross Entropy
    - Convolution 2D
    - Pooling (Average Pooling, Max Pooling, etc.)
  • Typical examples of the weight update algorithm include the following:
    - Momentum-SGD
    - Adam
  • FIG. 3 is a schematic diagram illustrating an example of a network configuration of a neural network.
  • FIG. 3 illustrates a neural network in which six intermediate layers (Linear, ReLU, Linear, ReLU, Dropout, and Linear) are arranged between an input layer and an output layer (Softmax).
  • In FIG. 3, a rightward arrow indicates forward processing, and a leftward arrow indicates backward processing.
  • Since the input layer has no weight to be updated, the backward processing ends at the layer adjacent to the input layer (in the example shown in FIG. 3, the Linear layer adjacent to the input layer).
  • FIG. 4 is a schematic diagram showing another example of the network configuration of the neural network.
  • FIG. 4 illustrates a neural network in which a plurality (three) of groups of intermediate layers (Convolution 2D, ReLU, and Convolution 2D) are arranged in parallel and are followed by a Linear intermediate layer and a Softmax output layer.
  • In FIG. 4, an upward arrow indicates forward processing, and a downward arrow indicates backward processing.
  • FIG. 5 is a schematic diagram showing still another example of the network configuration of the neural network.
  • FIG. 5 illustrates, as an example, a neural network having a loop (this may be referred to as “Recurrent Neural Network”).
  • In FIG. 5, the intermediate layer (here, "Linear") executes a calculation that takes as its inputs the previous output value of the intermediate layer itself and the current output value of the input layer.
  • As a method for realizing backward processing in such a neural network, a method called BPTT (Backpropagation Through Time) is known, in which the network is expanded in the time-axis direction in advance and converted into a network without a loop.
  • Layer algorithm calculation (Linear) Linear, which is one of the layer algorithms, executes a calculation that repeats the operation of taking the weighted average of all the nodes in the input side layer by the number of nodes in the intermediate layer.
  • FIG. 6 is a diagram showing pseudo code for realizing the calculation executed by Linear at the time of forward processing, and FIG. 7 is a diagram showing pseudo code for realizing the calculation executed by Linear at the time of backward processing.
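  • A minimal NumPy sketch of the Linear forward and backward calculations described above (this is not the pseudo code of FIGS. 6 and 7; the array shapes are assumptions):

```python
import numpy as np

def linear_forward(x, W, b):
    # x: (batch, n_in), W: (n_out, n_in), b: (n_out,)
    # For each node of the intermediate layer, take a weighted sum over
    # all nodes of the input side layer.
    return x.dot(W.T) + b

def linear_backward(x, W, gy):
    # gy: error propagated from the output side layer, shape (batch, n_out)
    gx = gy.dot(W)        # error back propagation toward the input side layer
    gW = gy.T.dot(x)      # weight gradient from the error and the input values
    gb = gy.sum(axis=0)   # bias gradient
    return gx, gW, gb
```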
  • Layer algorithm calculation (ReLU) ReLU which is one of layer algorithms, calculates Max (0, val) for each node in the input side layer. This algorithm is the most used technique in recent years in the processing (activation function) for adding nonlinearity to the calculation of the neural network.
  • FIG. 8 is a diagram showing pseudo code for realizing the calculation executed by ReLU at the time of forward processing, and FIG. 9 is a diagram showing pseudo code for realizing the calculation executed by ReLU at the time of backward processing.
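  • A minimal NumPy sketch of the ReLU forward and backward calculations described above (not the pseudo code of FIGS. 8 and 9):

```python
import numpy as np

def relu_forward(x):
    # Max(0, val) for each node of the input side layer
    return np.maximum(0.0, x)

def relu_backward(x, gy):
    # The error propagates only through nodes whose forward input was positive.
    return gy * (x > 0)
```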
  • Layer algorithm calculation (Dropout): Dropout, which is one of the layer algorithms, randomly selects a fixed ratio of nodes and executes a calculation that inactivates their output and back propagation. This algorithm is unnecessary when only identification is performed (that is, when learning is not performed).
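  • A minimal NumPy sketch of Dropout as described above (the inverted-dropout scaling by 1/(1-ratio) is an assumption, not stated in the description):

```python
import numpy as np

def dropout_forward(x, ratio=0.5):
    # Randomly select a fixed ratio of nodes and inactivate their output.
    mask = (np.random.rand(*x.shape) >= ratio) / (1.0 - ratio)
    return x * mask, mask

def dropout_backward(gy, mask):
    # Back propagation is also inactivated for the selected nodes.
    return gy * mask
```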
  • Layer algorithm calculation (Softmax Cross Entropy): Softmax Cross Entropy, which is one of the layer algorithms, transforms the values of the input side layer using the softmax function. This layer algorithm is generally used in the output layer. During backward processing, this layer algorithm calculates the error from the difference between the correct answer label (1 or 0) and the output value.
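  • A minimal NumPy sketch consistent with the description above (the exact formula of the specification is not reproduced here; this is the standard softmax / cross-entropy form and its well-known backward error):

```python
import numpy as np

def softmax_cross_entropy_forward(x, t):
    # x: (batch, n_classes) scores, t: (batch, n_classes) one-hot labels (1 or 0)
    e = np.exp(x - x.max(axis=1, keepdims=True))   # numerically stable softmax
    y = e / e.sum(axis=1, keepdims=True)
    loss = -np.sum(t * np.log(y + 1e-12)) / x.shape[0]
    return loss, y

def softmax_cross_entropy_backward(y, t):
    # Error = difference between the output value and the correct answer label.
    return (y - t) / t.shape[0]
```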
  • Layer algorithm calculation (Convolution 2D): Convolution 2D, which is one of the layer algorithms, convolves an image having a data structure of Channel * Width * Height. Both the input and the output of the layer have a data structure of Channel * Width * Height. With this algorithm, the image size can be reduced by stride processing. In this algorithm, padding is inserted into the image of the input side layer.
  • This algorithm has the same calculation structure as the Linear (repeating the inner product calculation of the input channel for the number of output channels) with respect to the channel direction.
  • FIG. 10 is a diagram illustrating a pseudo code for realizing a calculation executed during forward processing by Convolution 2D. Convolution 2D performs weight gradient calculation and error backpropagation in the same way as Linear during backward processing. The scale of each processing loop is the same as that during forward processing.
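  • As an illustration of the calculation structure described above (not the pseudo code of FIG. 10; the shapes and the stride/padding handling below are simplifying assumptions), a naive NumPy sketch of the Convolution 2D forward calculation is:

```python
import numpy as np

def conv2d_forward(x, W, stride=1, pad=0):
    # x: (c_in, h, w) input image, W: (c_out, c_in, kh, kw) filters
    c_in, h, w = x.shape
    c_out, _, kh, kw = W.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # padding on the input side
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (w + 2 * pad - kw) // stride + 1
    y = np.zeros((c_out, oh, ow))
    for oc in range(c_out):              # like Linear, repeat the inner product
        for i in range(oh):              # over the input channels for the number
            for j in range(ow):          # of output channels
                patch = xp[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
                y[oc, i, j] = np.sum(patch * W[oc])
    return y
```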
  • Layer algorithm calculation (Max Pooling): Max Pooling, which is one of the layer algorithms, reduces the image vertically and horizontally by taking the maximum value of the image on the input side layer. Note that the filter size over which the maximum value is taken and the stride width for image reduction may differ. The number of channels does not change.
  • Layer algorithm calculation (Average Pooling): Average Pooling, which is one of the layer algorithms, reduces the image vertically and horizontally by taking the average value of the image on the input side layer. Note that the filter size over which the average value is taken and the stride width for image reduction may differ. The number of channels does not change.
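  • A minimal NumPy sketch of the two pooling calculations described above (a square window without padding is an assumption):

```python
import numpy as np

def pool2d(x, ksize, stride, mode="max"):
    # x: (c, h, w); the number of channels does not change.
    c, h, w = x.shape
    oh = (h - ksize) // stride + 1
    ow = (w - ksize) // stride + 1
    y = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[:, i * stride:i * stride + ksize, j * stride:j * stride + ksize]
            if mode == "max":
                y[:, i, j] = window.max(axis=(1, 2))    # Max Pooling
            else:
                y[:, i, j] = window.mean(axis=(1, 2))   # Average Pooling
    return y
```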
  • Weight Update Algorithm There are various algorithms derived from the stochastic gradient descent method (SGD) as the weight update algorithm. In these algorithms, the calculation is independent for each weight element.
  • The formula of momentum-SGD mentioned above and the formula of Adam mentioned above are as follows.
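  • In generic notation (g denotes the weight gradient ∇W of one element, η the learning rate, μ the momentum coefficient; this notation is an assumption and not the specification's own), the standard formulations of these update rules are:

```latex
% Momentum-SGD (per weight element):
\Delta w_t = \mu\, \Delta w_{t-1} - \eta\, g_t, \qquad w_{t+1} = w_t + \Delta w_t

% Adam (per weight element):
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad
w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```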
  • FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
  • the learning device 10 includes a CPU 11, a main memory 12, an input I / F 13, an output I / F 14, a communication I / F 15, an external memory 16, and a user I / F 17. These components are electrically connected to each other via an internal bus 18. Note that the learning apparatus 10 can selectively include a GPU (not shown).
  • the CPU 11 loads various programs such as a program (a program used for creating source code) that supports an operating system and a programming language (for example, Python) from the external memory 16 into the main memory 12, and is included in the loaded program. Execute the instruction.
  • the main memory 12 is used for storing a program executed by the CPU 11, and is constituted by, for example, a DRAM.
  • the input I / F 13 has a function of capturing output data of a measuring device (not shown), and is connected to each component by an internal bus 18.
  • The various measurement data that are the output of the measurement device include information acquired by a sensor or the like, for example, temperature, humidity, position information, and image data, and may also be time-series data such as moving image data or a sequence of temperature values acquired at regular intervals.
  • the output I / F 14 receives data from each component through the internal bus 18 and outputs the data to an output device (not shown) outside the learning device.
  • the data output to the output device may be, for example, control information for driving a motor, control information for an information output device such as a buzzer, a control switch, an automobile accelerator or brake, or a liquid crystal display.
  • the communication I / F 15 is implemented as hardware, firmware, communication software such as a TCP / IP driver or a PPP driver, or a combination thereof, and communicates various information with a server device (not shown) via the communication network 20. It is configured to be possible.
  • the external memory 16 is composed of, for example, a magnetic disk drive, a flash memory, or the like, and stores various programs such as a program (a program used for creating source code) that supports an operating system and a programming language (for example, Python).
  • the learning device 10 is configured such that the CPU 11 (and optionally the GPU) executes machine learning by executing a predetermined program loaded from the external memory 16 to the main memory 12.
  • the learning device 10 that performs machine learning can be realized as a learning device that is modeled by a neural network by the CPU 11 (optionally in addition to the GPU) executing various programs.
  • the learning device 10 having the above configuration can be mounted on a corresponding individual (device). Further, the learning device 10 can be connected to a corresponding measurement device and a corresponding output device. These measuring devices and output devices may be mounted on a corresponding individual (device) or may be connected as separate devices using communication means.
  • the learning device 10 is an arbitrary information processing device capable of executing machine learning, such as a personal computer, a tablet, a mobile phone, a smartphone, a mobile information terminal, a touch pad, and an information processing server. Including but not limited to.
  • FIG. 12 is a block diagram schematically illustrating an example of functions of the learning device according to an embodiment of the present invention.
  • The learning device 10 is based on the technique called "Define-by-Run" described above. Specifically, the learning device 10 according to the embodiment dynamically generates, at the timing at which the forward processing of the neural network is executed by a general procedural language including branching, looping, and function calling, the network configuration information necessary for backward processing and weight update processing, and thereby provides a mechanism that can actually execute the backward processing and the weight update processing.
  • The learning device 10 mainly includes an acquisition unit 110, a storage unit 120, and an execution unit 130.
  • the acquisition unit 110 acquires a source code including a code defining a forward process of each layer constituting the neural network.
  • source code is created by a developer or user using a text editor using a predetermined programming language (for example, Python).
  • a predetermined programming language for example, Python.
  • the acquisition unit 110 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, the user I / F 17, and the like illustrated in FIG.
  • the storage unit 120 stores a correspondence relationship between each of a plurality of forward processes that can be defined in the source code and a backward process corresponding to the forward process.
  • a corresponding backward process is associated with a certain forward process included in a plurality of forward processes in a one-to-one relationship.
  • For example, a forward process corresponding to Linear and the backward process corresponding to that forward process are associated with each other.
  • The one-to-one correspondence between a forward process and a backward process means that, when backward processing is executed using the reference structure for backward processing, the processing corresponding to each forward process is executed in reverse order. For example, when forward processes are executed in the order A → B → C, the backward processes are executed in the order C → B → A. Such backward processing can be realized because forward processing and backward processing are implemented in pairs.
  • the storage unit 120 can store various information including the source code acquired by the acquisition unit 110 and various libraries used in a programming language corresponding to the source code. For example, the storage unit 120 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG. 11.
  • the execution unit 130 sequentially executes each code included in the source code acquired by the acquisition unit 110 (stored in the storage unit 120).
  • the execution unit 130 can calculate the output value of the forward process defined by the code based on the input value when each code is executed.
  • the execution unit 130 can generate a reference structure between objects in a layer corresponding to the code when the code is executed.
  • the execution unit 130 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG.
  • The learning device 10 realizes the above, using the acquisition unit 110, the storage unit 120, and the execution unit 130 described above, by means of three classes: Function, Variable, and Optimizer. Note that the names of these classes are given for convenience and are not limiting.
  • a class called Function is a class defined by pairing forward processing and backward processing.
  • the class called Function defines the specific layer algorithm exemplified in the above “2-6” to “2-12” as a subclass.
  • a class called Variable is a class that manages data input and output between functions.
  • This class of Variable has a role of concealing the difference between the GPU and the CPU, and has a method (unchain_backward, which will be described later) for terminating backward processing of a network including a loop within a finite range.
  • a class called Optimizer is a class for updating weights.
  • FIG. 13 is a diagram illustrating an example of source code input to the learning device according to the embodiment of the present invention.
  • the source code illustrated in FIG. 13 is intentionally simplified for the purpose of explaining the characteristics of the learning device according to the present embodiment. Further, the number of lines described at the left end in FIG. 13 is given for explaining this specific example, and is not included in the actual source code.
  • A case in which the source code is described in Python will be explained as an example, but the source code may be described in a programming language other than Python. Details of Python are disclosed at https://www.python.org/. The content disclosed at this URL is incorporated herein by reference in its entirety.
  • a developer or the like creates the source code illustrated in FIG. 13 using a text editor or the like.
  • the acquisition unit 110 (see FIG. 12) of the learning device 10 acquires the source code created in this way and stores it in the storage unit 120.
  • the execution unit 130 executes each code included in the source code stored in the storage unit 120 line by line.
  • Specifically, the execution unit 130 executes each code sequentially, one line at a time, from the first line to the last line, from top to bottom.
  • When a control syntax is included in the source code, the execution unit 130 executes each code in the order determined by that control syntax.
  • Lines 1 to 3 describe the registration of Functions including weights (parameters) in a FunctionSet.
  • Specifically, instances l1, l2, and l3 of the Linear class, which is a Function subclass defining a layer algorithm that performs an inner product, are registered in the FunctionSet as Functions including weights.
  • The Functions registered in the FunctionSet can be updated by the Optimizer.
  • FunctionSet is a mechanism for improving the readability of code by grouping the Functions updated by the Optimizer.
  • Lines 4 and 5 describe the initialization of the Optimizer.
  • the fourth line creates an instance of an Optimizer (class for updating weights) subclass that implements the algorithm called Adam.
  • Adam's processing is to update for each element of weight by the mathematical expression described in “2-13” above.
  • a list of functions including the weights already defined in the first to third lines is passed to the setup method of the instance of the optimizer subclass generated in the fourth line. By executing this setup method, the internal state of the Optimizer subclass for updating the weight included in the Function list passed to this method is initialized.
  • the sixth line describes loading of input data. That is, the sixth line illustrates the process of reading the input data x and t from a file or the like.
  • x holds data with a large amount of information such as images and sounds
  • t holds a label ID corresponding to x (data with a small amount of information for answer matching).
  • the 7th line describes holding input data by Variable object. That is, in the seventh line, a Variable class object that holds input data is generated.
  • The "Define-by-Run" function is realized by the mutual dependence of Variable objects and Function objects. Therefore, in order for arbitrary input data to participate in this mechanism, a procedure is required in which the input data is explicitly held by an instance of the Variable class.
  • Lines 8 to 11 describe execution of forward processing. Specifically, in the 8th to 11th lines, the Forward process is executed by the description of a general programming language.
  • the “Define-by-Run” function generates a reference structure for backward processing simultaneously with the execution of this definition.
  • the instances of the Function class and the Variable class it is possible to express the correspondence between arbitrary processing and data. This is obvious because the Variable class represents data and the Function class represents processing.
  • a data structure expressing the backward calculation procedure shown in FIG. 2 using this reference structure is defined as a reference structure for backward processing.
  • The reference structure for backward processing grows every time a basic calculation on a Variable object (an arithmetic operation or a power) or a Function call taking a Variable object as an argument or return value is executed. Therefore, a reference structure for backward processing can be generated even for a description of forward processing that includes branches, loops, or calls to functions other than Functions and Variables.
  • Each basic calculation for Variable objects also has a Function subclass associated with it.
  • Line 12 describes the execution of backward processing. Specifically, the 12th line executes the backward process by calling the backward method of the loss variable (an instance of the Variable class) obtained as a result of executing the forward process executed in the 8th to 11th lines. .
  • the backward process is automatically executed in the reverse order of the forward process by following the reference structure for the backward process generated when the forward process is executed.
  • Line 13 describes the weight update. Specifically, the 13th line updates the weights using the weight gradients calculated as a result of executing the backward processing in the 12th line.
  • the update method of the instance of the Optimizer subclass is called as in the 13th line, the weight is updated using the weight gradient. Since the update method call for the Optimizer subclass and the backward method call for the Variable class are separate functions, it is also possible to update the weight after partially executing the backward processing. This is effective when it is not desired to update the weight for a function that has already been learned.
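  • The FIG. 13 listing itself is not reproduced in this text; the following sketch is reconstructed from the line-by-line description above and written against the legacy Chainer v1-style API (FunctionSet, Variable, optimizers). The layer sizes and the random stand-in data are assumptions.

```python
import numpy as np
import chainer.functions as F
from chainer import FunctionSet, Variable, optimizers

# Lines 1-3: register the weighted Functions (Linear instances l1, l2, l3)
# in a FunctionSet so that they can be updated by the Optimizer.
model = FunctionSet(l1=F.Linear(784, 100),
                    l2=F.Linear(100, 100),
                    l3=F.Linear(100, 10))

# Lines 4-5: create an Optimizer subclass implementing Adam and initialize
# its internal state for the weights held by the FunctionSet.
optimizer = optimizers.Adam()
optimizer.setup(model)

# Line 6: load the input data x and labels t (random stand-ins here).
x_data = np.random.rand(32, 784).astype(np.float32)
t_data = np.random.randint(0, 10, 32).astype(np.int32)

# Line 7: hold the input data in Variable objects.
x, t = Variable(x_data), Variable(t_data)

# Lines 8-11: forward processing; the reference structure for backward
# processing grows as each call is executed.
h1 = F.relu(model.l1(x))
h2 = F.relu(model.l2(h1))
y = model.l3(h2)
loss = F.softmax_cross_entropy(y, t)

# Line 12: backward processing (follows the reference structure in reverse).
optimizer.zero_grads()   # not part of the described listing; required by the API
loss.backward()

# Line 13: weight update using the calculated weight gradients.
optimizer.update()
```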
  • x ' represents a Variable object that is a copy of x
  • l1' represents a copy of l1 (shallow copy)
  • y represents a value (Variable object) returned by the forward method of l1 '
  • splitter represents an instance of a class that manages a branch of the network.
  • A shallow copy is a method of copying an object that does not copy the data that the object internally references. By making a shallow copy, for example, duplication of the weight data of a Function instance can be avoided.
  • the meaning of the arrow indicates the direction of reference of the object.
  • the description A ⁇ B means that a member of B object includes a reference to the object of A.
  • the above reference structure becomes a reference structure for backward processing as follows after execution of “F.relu (”.
  • The case where a reference structure for backward processing is generated when the code described in the eighth line is executed has been described above.
  • A similar reference structure is generated when the other lines of forward processing are executed.
  • That is, every time forward processing is executed, a reference structure for backward processing is generated through natural function calls.
  • the backward processing can be executed starting from h1.
  • the flow of processing executed by the backward processing system when actually executing backward processing from h1 is shown below.
  • Follow the relu' referenced by the h1 instance and call the Backward processing of relu'.
  • the input at this time is an error value held by h1, and the output result is stored in an instance of y ′.
  • Correspondence between data input and output by such Function instances is defined in the Forward process / Backward process defined for each Function subclass.
  • The splitter copies the error value held by y' to y (the reason why the splitter is inserted is described in the next section).
  • the input at this time is an error value held by y, and the output result is stored in an instance of x ′.
  • a weight error is also calculated. The weight error is calculated from the forward output value stored in x ′ and the error value held by y. In the same manner, backward processing ends when x is reached, which is the end point of the reference structure for backward processing.
  • The result of adding together the error values transmitted by the instance of the splitter to x' and x'' is set as the error of x.
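  • The mechanism described above can be summarized with a minimal, self-contained sketch (illustrative only; the class and method names below are not Chainer's, and the splitter handling is omitted): each Function call records itself as the creator of its output Variable during forward processing, and backward processing follows those references in reverse.

```python
class Variable:
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.creator = None          # Function that produced this Variable

    def backward(self):
        funcs = [self.creator]
        while funcs:
            f = funcs.pop()
            if f is None:
                continue
            f.backward()             # propagate the error one step backward
            for v in f.inputs:
                if v.creator is not None:
                    funcs.append(v.creator)

class Mul:                            # a toy Function subclass (forward/backward pair)
    def __call__(self, a, b):
        self.inputs = (a, b)
        out = Variable(a.data * b.data)
        out.creator = self            # grow the reference structure for backward
        self.output = out
        return out

    def backward(self):
        a, b = self.inputs
        gy = self.output.grad if self.output.grad is not None else 1.0
        a.grad = (a.grad or 0.0) + gy * b.data
        b.grad = (b.grad or 0.0) + gy * a.data

x = Variable(2.0)
y = Variable(3.0)
z = Mul()(Mul()(x, y), y)            # forward execution builds the structure
z.backward()                         # traversal in reverse order fills x.grad, y.grad
```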
  • FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • a block drawn by a dotted line shows an instance of a variable
  • a block drawn by a solid line shows a function.
  • When the seventh line is executed by the execution unit 130, an instance 30 of the variable x and an instance of the variable t are generated.
  • Only the instance 30 of the variable x is shown in FIG. 14, but in reality an instance of the variable t is generated in the same manner.
  • the variable x instance actually holds data such as images and sounds.
  • When the eighth line is executed by the execution unit 130, a neural network is generated in which the function "l1" 31, the function "relu" 32, and the instance 33 of the variable h1 are grown sequentially after the instance 30 of the variable x. Note that at the time the eighth line has been executed, the execution result of the forward processing described in the eighth line is already held by the instance 33 of the variable h1. Further, when the eighth line is executed, the reference structure for backward processing up to that point is generated, as described above.
  • When the ninth line is executed by the execution unit 130, a neural network is generated in which the function "l2" 34, the function "relu" 35, and the instance 36 of the variable h2 are grown sequentially after the instance 33 of the variable h1. Note that at the time the ninth line has been executed, the execution result of the forward processing described in the ninth line is already held by the instance 36 of the variable h2. Further, when the ninth line is executed, the reference structure for backward processing up to that point is generated, as described above.
  • The backward processing is executed by the execution unit 130 executing the 12th line. Since the reference structure for backward processing has already been generated, the execution unit 130 executes the backward processing based on that reference structure, whereby the weight gradient of each intermediate layer (only the intermediate layers having weights) can be calculated.
  • the execution unit 130 executes the 13th line. Thereby, the weight of each intermediate layer (however, only the intermediate layer having the weight) is updated using the weight gradient calculated by executing the 12th row. That is, learning is executed.
  • As described above, with the learning device 10 according to the present embodiment, a developer or the like can construct a neural network by describing, line by line, which variable instance is given to which function and which variable instance holds the execution result. Thereby, the developer or the like can describe the forward processing intuitively in the source code.
  • Further, the developer or the like only needs to describe the forward processing in the source code (without being aware of the backward processing); by causing the learning device according to the present embodiment to execute that source code, the backward processing is executed automatically.
  • FIG. 15 is a diagram illustrating an example of a source code described by Caffe according to the related art.
  • the definition of the layer (corresponding to the function in the present embodiment) is described in the block surrounded by ⁇ described immediately after the term “layer”.
  • the descriptions “top” and “bottom” represent the dependency between layers. “Bottom” represents from which layer the input to the layer is obtained, and “top” represents to which layer the processing result in the layer is output.
  • In contrast, in the present embodiment, a branch may be added to the code of FIG. 13 so that the layer executing the forward processing is switched according to the value of the variable t or the data size of the variable x.
  • Also, a variable value can be given as input data instead of the constant "10" in the ninth line.
  • FIG. 16 is a diagram illustrating another example of the source code input to the learning device according to the embodiment of the present invention. It should be noted that the source code illustrated in FIG. 16 is intentionally simplified for the purpose of explaining the characteristics of the learning device according to the present embodiment. Also, the number of lines shown at the left end in FIG. 16 is given for explaining this specific example, and is not included in the actual source code.
  • Next, it will be explained that, in the learning device according to the present embodiment, a neural network can easily be constructed using a control syntax (here, a for statement).
  • The fourth line describes that the processing described in the fifth to tenth lines is repeated while the value of i goes from 0 to 1000.
  • the fifth and sixth lines are the same as the sixth and seventh lines in the source code shown in FIG.
  • the seventh line describes that y, which is the processing result of the function l1 and the function relu, is added again to the argument of l1.
  • the eighth to tenth lines are the same as the eleventh to thirteenth lines in the source code shown in FIG.
  • FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • a block drawn by a dotted line shows an instance of a variable
  • a block drawn by a solid line shows a function.
  • FIG. 17 shows only the configuration of a neural network generated only when the variable i is 0 to 2 for convenience of explanation.
  • Every time the loop is executed, the same configuration of variable instances and functions is added to the network: the instance 51 of the variable x and the instance 50 of the variable y are given to the function 52 that combines them, the function "l1" 53 and the function "relu" 54 follow in sequence, and the output value of the function "relu" 54 is held in the instance of the variable y.
  • In other words, a growing neural network can be constructed with a simple control syntax (here, a for statement). That is, the source code used in the learning apparatus according to the present embodiment has a high affinity with the control syntax of the programming language.
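  • A hedged sketch in the spirit of the FIG. 16 listing, again written against the legacy Chainer v1-style API: the exact form of the feedback (x + y below) is an assumption based on the description of FIG. 17, and the layer sizes and random stand-in data are also assumptions.

```python
import numpy as np
import chainer.functions as F
from chainer import FunctionSet, Variable, optimizers

model = FunctionSet(l1=F.Linear(100, 100))   # square so that y can be fed back
optimizer = optimizers.Adam()
optimizer.setup(model)

y = Variable(np.zeros((32, 100), dtype=np.float32))        # assumed initial value of y
for i in range(1000):                                      # line 4: loop over i
    x = Variable(np.random.rand(32, 100).astype(np.float32))     # stand-in for lines 5-6
    t = Variable(np.random.randint(0, 10, 32).astype(np.int32))
    y = F.relu(model.l1(x + y))                            # line 7: y is fed back into l1
    loss = F.softmax_cross_entropy(y, t)                   # lines 8-10 correspond to
    optimizer.zero_grads()                                 #   lines 11-13 of FIG. 13
    loss.backward()
    optimizer.update()
```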
  • FIG. 18 is a schematic diagram conceptually showing a network configuration of a neural network generated by a source code described by Caffe according to the prior art.
  • In Caffe according to the prior art, the configuration of the neural network cannot be defined using a control syntax.
  • Therefore, a basic configuration as shown in FIG. 18 is first defined.
  • Next, the developer or the like must specifically describe the processing of giving the initial value of the instance 75 of the variable y to the function 72, and of giving the instance 75 of the variable y at the previous time and the instance 71 of the variable x at the current time to the function 72 (the portion indicated by a thick line in FIG. 18).
  • The developer or the like must write such a special description for each layer.
  • In contrast, with the learning device according to the present embodiment, the source code can be described easily, using the control syntax of the programming language and without such special descriptions. Therefore, according to the learning device of the present embodiment, even a complicated or large-scale neural network can be constructed easily and efficiently.
  • the learning device may be able to execute a function that cuts off a reference structure for backward processing. Specifically, when the unchain_backward method of an instance of the Variable class is called, the reference structure for backward processing toward the input side from that instance is cut off. For example, it is assumed that the following reference structure for backward processing is generated by executing forward processing (detailed configuration such as splitter is omitted).
  • A (input layer) ← Convolution2D ← B ← Linear ← C ← Softmax ← D (output layer)
  • A, B, C, and D represent Variable class instances
  • Convolution2D, Linear, and Softmax represent Function class instances.
  • FIG. 19 is a diagram showing still another example of the source code input to the learning device according to the embodiment of the present invention. Note that the number of lines described at the left end in FIG. 19 is given for explaining this specific example, and is not included in the actual source code.
  • the ninth line describes that the processing of the tenth to twelfth lines is executed every time the loop after the fourth line is executed 10 times.
  • the eleventh line calls unchain_backward and discards the reference structure for backward processing starting from loss. Thereby, the calculation time of the entire loop process can be kept short.
  • unchain_backward can be used for the purpose of not updating the weight for a specific function.
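  • As an illustration of the usage described above, the following sketch in the legacy Chainer v1 style shows a truncated-backpropagation loop; only the role of the unchain_backward call (line 11 of FIG. 19) is taken from the description, while forward_one_step, the loss accumulation, and the contents of the other lines are assumptions.

```python
# forward_one_step is a hypothetical helper that runs one step of forward
# processing and returns its loss as a Variable.
accum_loss = None
for i in range(1000):
    loss_i = forward_one_step(model, x_data[i], t_data[i])
    accum_loss = loss_i if accum_loss is None else accum_loss + loss_i
    if (i + 1) % 10 == 0:                 # every 10 iterations (line 9 of FIG. 19)
        optimizer.zero_grads()
        accum_loss.backward()             # backward over the last 10 steps
        accum_loss.unchain_backward()     # discard the reference structure
        optimizer.update()                #   starting from the loss (line 11)
        accum_loss = None                 # assumed reset of the accumulator
```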
  • the learning device can specify the volatile attribute when an instance of the Variable class is initialized. When the volatile attribute is valid, a reference structure for backward processing is not generated for the forward processing for inputting the variable.
  • The processing for generating the reference structure for backward processing is executed whenever forward processing is executed, so when only identification is performed (that is, when learning is unnecessary), this generation wastes both execution speed and memory.
  • When the volatile attribute is valid, the generation of the reference structure for backward processing is stopped, and only efficient forward processing is executed.
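  • A minimal usage sketch of the volatile attribute (the keyword-argument form shown here follows the legacy Chainer v1 API and is an assumption; model is as in the earlier sketches):

```python
x = Variable(x_data, volatile=True)   # no backward reference structure is built
y = F.relu(model.l1(x))               # efficient forward-only execution for identification
```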
  • The technology disclosed in this specification can be realized by executing source code written in Python or an equivalent programming language, and may instead be realized by executing a module or library written in a language other than Python or its equivalents.
  • The processes and procedures described in this specification can be realized not only by what is explicitly described in the embodiments but also by software, hardware, or a combination thereof. Specifically, they are realized by implementing logic corresponding to the processes on a medium such as an integrated circuit, a volatile memory, a nonvolatile memory, a magnetic disk, or optical storage. Further, the processes and procedures described in this specification can be implemented as a computer program and executed by various computers.
  • Even if the processes and procedures described herein are described as being executed by a single device, piece of software, component, or module, they may be executed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. Further, even if data, tables, or databases described in this specification are described as being stored in a single memory, they may be stored in a distributed manner in a plurality of memories included in a single device or in a plurality of devices. Further, the software and hardware elements described herein may be realized by integrating them into fewer components or by decomposing them into more components.
  • Part 2 Algorithm mounting method for embedded chip.
  • Background: Deep learning achieves high performance, but it is an algorithm that requires a large amount of computation, a large amount of memory, and a large amount of learning samples.
  • The problem to be solved by the embodiments of the present invention is to break through the barriers, which remain mainly in the software environment, to adapting deep learning to embedded environments and to accelerate development, by developing a framework for designing deep learning algorithms that operate on an embedded chip while satisfying product-level requirements.
  • In the present embodiment, the learning device according to the embodiment described in Part 1, which is a GPU-based framework that provides high productivity in developing deep learning algorithms, is functionally extended for the embedded environment. In the following, the problems for adaptation to the embedded environment are described with a focus on the learning device according to the embodiment.
  • The goal is to reach, in as short a period as possible, a state in which a new neural network (NN) algorithm designed on a personal computer or the like having abundant calculation resources operates on an arbitrary embedded chip (embedded semiconductor integrated circuit) while satisfying product-level requirements. For that purpose, it is desirable that the developer who designs the algorithm and the developer who is deeply aware of the hardware can work as independently as possible. In this embodiment, a technical idea regarding an apparatus (framework) that assists this is proposed.
  • Step I: A state in which code used in the learning device according to the embodiment (code written in Python, as an example) runs on a PC (+ GPU). In this state, the design and verification of an algorithm using a neural network having a complicated configuration are realized with little code description. This corresponds to the "Define-by-Run" method described above.
  • Step II: A state in which a chip-optimized implementation and Python code are mixed. In this state, the operation check and performance verification on the chip of the algorithm designed with the learning device according to the embodiment are realized with almost the same Python code.
  • Step III: A state in which the algorithm designed with the learning device according to the embodiment operates using only the implementation optimized for the chip. In this state, the algorithm operates while satisfying the product-level specification requirements of the chip (in other words, real-time cooperative operation with other on-chip modules and control mechanisms is possible).
  • The mounting method according to the present embodiment proposes a framework with which, when a new algorithm is developed with the learning device according to the embodiment, development can be completed in a short period by avoiding re-correction, redesign, and re-learning as much as possible between Steps I to III described above.
  • FIG. 20 is a schematic diagram for explaining Step I of the mounting method according to the embodiment of the present invention.
  • The configuration shown in FIG. 20 is premised on the learning device according to the embodiment described in Part 1. That is, in this configuration, source code written in Python (as one aspect of a programming language) is executed on a GPU and a general-purpose computer using PyCUDA and numpy (BLAS) (as aspects of libraries). Note that "Chainer" shown in FIG. 20 is the name given by the applicant to the framework for describing the source code used in the learning apparatus according to the embodiment described in Part 1.
  • FIG. 21 is a schematic diagram for explaining Step II of the mounting method according to the embodiment of the present invention.
  • the front end of Chainer is executed on Python.
  • In Step II, by providing a Native I/F (for example, an interface for calling implementations, written in a low-level language such as C, that are equivalent to the main functions of Chainer), the execution optimized for the embedded chip can be driven with the same code.
  • FIG. 22 is a schematic diagram for explaining a case where an execution unit based on Python and an execution unit based on a chip communicate with each other. As shown in Fig. 22, it is possible to remove the dependency on Python from the configuration on the embedded chip by providing a communication function in the implementation of the Native I / F (the optimization implementation on the embedded chip is driven from the Chainer on the PC) ).
  • FIG. 23 is a schematic diagram for explaining Step III of the mounting method according to the embodiment of the present invention. As shown in FIG. 23, a function for outputting the network definition and the weights from Chainer as bytecode is added. In addition, a virtual machine that interprets the bytecode and executes the neural network processing (forward processing, backward processing, weight update) is provided, and the chip-optimized implementations of the Native I/F can be reused by this virtual machine.
  • FIG. 42 is a diagram illustrating a configuration example of the Native I / F according to an embodiment of the present invention.
  • The Native I/F is a configuration that provides, for each NN algorithm, an interface that does not depend on the type of computer.
  • a processing system using the NN algorithm instructs a specific computer to execute the algorithm via this interface.
  • The interface here is a means for defining the format of the input data, the format of the output data, and the method of processing that relates the input data to the output data. If the interface is the same, the same output result is obtained for the same input. An example is a function written in the C language and its function declaration.
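  • As an illustration only (the library name libnative_nn.so and the function linear_forward below are hypothetical and not part of the Native I/F defined in this specification), the following Python sketch shows how a processing system might call a C-style Forward implementation through such a fixed interface.

```python
import ctypes
import numpy as np

# Load a hypothetical chip-optimized implementation exposed behind a C-style
# interface.  The same declaration is used regardless of which computer
# (general-purpose CPU, GPU, accelerator) provides the implementation.
native = ctypes.CDLL("libnative_nn.so")            # hypothetical library
native.linear_forward.restype = ctypes.c_int
native.linear_forward.argtypes = [
    ctypes.POINTER(ctypes.c_float),                # input data
    ctypes.POINTER(ctypes.c_float),                # weights
    ctypes.POINTER(ctypes.c_float),                # output data
    ctypes.c_int, ctypes.c_int,                    # input / output sizes
]

x = np.zeros(784, dtype=np.float32)
w = np.zeros((100, 784), dtype=np.float32)
y = np.zeros(100, dtype=np.float32)
as_ptr = lambda a: a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
status = native.linear_forward(as_ptr(x), as_ptr(w), as_ptr(y), 784, 100)
```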
  • the processing system on the side using the NN algorithm is not particularly limited.
  • the computer means a device that executes a calculation.
  • a computer is a device that includes a computing core, a memory hierarchy, and hardware resources necessary to perform a calculation.
  • a general-purpose computer means a commonly used computer. It is a computer in which conventional algorithms including Linux (registered trademark) OS and Python easily operate.
  • the accelerator here means a device that executes specific calculations including the calculation of the NN algorithm at high speed.
  • the GPU here is a computer specialized in image processing, but also capable of performing general-purpose calculations.
  • the GPU also includes a form of the accelerator. Since there are software assets such as CUDA, the ease of implementing the NN algorithm is about halfway between a general-purpose computer and a general accelerator.
  • FIG. 43 is a diagram illustrating a configuration example for executing identification / learning by NN according to an embodiment of the present invention.
  • the Native I / F has at least a Forward processing unit. With this configuration, the Native I / F can perform identification processing using the NN algorithm. Further, the Native I / F includes at least a forward processing unit, a backward processing unit, an internal state initialization processing unit of a weight update algorithm, and a weight update processing unit. With this configuration, the Native I / F can execute identification processing and learning processing using the NN algorithm. A Forward processing unit and a Backward processing unit are included for each layer algorithm.
  • the weight update algorithm internal state initialization processing unit and the weight update processing unit are included for each weight update algorithm.
  • the Native I/F includes a Forward processing call interface and a Backward processing call interface for each layer algorithm, and an internal state initialization processing call interface and a weight update processing call interface for each weight update algorithm.
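  • The pattern described above (one Forward/Backward pair per layer algorithm, one init/update pair per weight update algorithm) could be declared as in the following sketch; the concrete names, argument lists, and the OptimizerState type are assumptions, while the chnr_op_* naming mirrors the Optimizer module examples given later in this document.

    typedef struct MDArray MDArray;
    typedef struct OptimizerState OptimizerState;   /* internal state of the update rule */

    /* layer algorithm "convolution2d": one Forward and one Backward entry point */
    void chnr_convolution2d_forward(MDArray *out, const MDArray *in, const MDArray *w);
    void chnr_convolution2d_backward(MDArray *grad_in, MDArray *grad_w,
                                     const MDArray *grad_out, const MDArray *in,
                                     const MDArray *w);

    /* weight update algorithm "sgd": internal state initialization and update */
    void chnr_op_init_state_sgd(OptimizerState *state, const MDArray *w);
    void chnr_op_update_one_sgd(MDArray *w, const MDArray *grad_w, OptimizerState *state);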
  • the implementation that is called through the Native I / F has a Native I / F call management unit. With this configuration, the implementation that can be called through the Native I / F can change the implementation that can optimally execute the operation of the Native I / F according to the difference in the parameters of the Native I / F.
  • when there is no implementation that can optimally execute the operation, the call management unit of the Native I/F returns an error to the caller. Therefore, the implementation called through the Native I/F can select and execute an implementation that can optimally execute the operation.
  • FIG. 44 is a diagram showing a configuration example for managing a multidimensional array according to an embodiment of the present invention.
  • the Native I / F further has a multidimensional array management unit.
  • the multidimensional array management unit can execute at least one operation selected from the group including generation and destruction of multidimensional arrays, acquisition of attributes (number of axes, number of elements per axis), acquisition of aggregation results (total, average, variance, and the like for each axis), and the four arithmetic operations performed element by element on multidimensional arrays.
  • FIG. 45 is a diagram illustrating a configuration example of a data expression conversion unit according to an embodiment of the present invention.
  • the Native I / F has a data expression conversion unit.
  • the data representation conversion unit can mutually convert between a data representation that depends on a specific computer (device-dependent data representation) and a data representation that does not depend on a specific computer (device-independent data representation).
  • Configuration 1-2-2 (Configuration 2 for sharing data; + when having an external storage medium) Furthermore, the processing system that calls the Native I / F has an external storage medium.
  • the external storage medium can store weight data converted into device-independent data.
  • FIG. 46 is a diagram illustrating a configuration example of a communication unit according to an embodiment of the present invention.
  • the implementation that is called through the Native I/F has a communication unit.
  • the communication unit can communicate the call information of the Native I / F to the called implementation.
  • the implementation on the side called through the Native I/F can execute optimal communication processing.
  • the physical distance of the computer, the presence / absence of memory sharing, or the difference in communication protocol can be hidden from any processing system using the NN algorithm.
  • examples of the Native I/F that are unrelated to the presence or absence of communication of call information include an interface for executing a layer algorithm, an interface for executing a weight update algorithm, and an interface for executing data representation conversion.
  • FIG. 47 is a diagram illustrating a configuration example of a floating-point and fixed-point execution unit and a type conversion unit according to an embodiment of the present invention.
  • the Native I / F includes a type conversion unit, an NN algorithm execution unit for floating point, and / or an NN algorithm execution unit for fixed point.
  • a computer B having only a type conversion unit
  • a computer A having only a floating-point NN algorithm execution unit
  • a computer C having only a fixed-point NN algorithm execution unit.
  • the floating point type data generated by the computer A is transferred to the computer B. Subsequently, the data transferred from the computer A to the computer B is converted into fixed-point data by the computer B. Then, the fixed-point type data converted by the computer B is transferred to the computer C. Then, the fixed point type data transferred from the computer B becomes the input data of the computer C, and the entire operation of the NN algorithm is executed. Such steps can also be performed in reverse order.
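  • The flow among the computers A, B, and C described above can be sketched in C as follows; chnr_float_to_fixed follows the data conversion module described later in this document, while transfer_to_c and run_nn_fixed_on_c are hypothetical helpers standing in for the transfer to computer C and the execution of the NN algorithm there.

    /* Sketch of the A -> B -> C pipeline (floating point -> type conversion
       -> fixed point).  Error handling and the reverse direction are omitted. */
    typedef struct MDArray MDArray;

    void chnr_float_to_fixed(MDArray *dst, MDArray *src, int Q);  /* Native I/F (see below) */
    void transfer_to_c(MDArray *fixed_data);                      /* hypothetical transfer  */
    void run_nn_fixed_on_c(MDArray *fixed_data);                  /* hypothetical execution */

    void run_via_type_conversion(MDArray *float_data_from_a, MDArray *fixed_buf, int q_shift)
    {
        /* computer B: convert the floating-point data received from computer A */
        chnr_float_to_fixed(fixed_buf, float_data_from_a, q_shift);
        /* computer C: receives the fixed-point data and runs the NN algorithm */
        transfer_to_c(fixed_buf);
        run_nn_fixed_on_c(fixed_buf);
    }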
  • FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention. Furthermore, the implementation that is called through the Native I / F has a memory pool module. The memory pool module can realize dynamic memory management.
  • FIG. 49 is a diagram illustrating a configuration example of an algorithm execution unit in which a plurality of NN algorithms according to an embodiment of the present invention are fused. Furthermore, the Native I/F has an algorithm execution unit that fuses a plurality of NN algorithms. This algorithm execution unit simultaneously executes the plural algorithms for frequently occurring combinations of NN algorithms.
  • FIG. 50 is a diagram illustrating a configuration example of a multi-dimensional array data communication amount reduction unit according to an embodiment of the present invention. Further, the implementation called through the Native I / F has a multidimensional array data compression / decompression unit. The multi-dimensional array data compression / decompression unit is provided in the communication unit.
  • FIG. 51 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • FIG. 53 is a diagram illustrating a configuration example of a bytecode generation unit and a virtual machine according to an embodiment of the present invention.
  • the Chainer execution unit has a byte code generation unit.
  • the bytecode generation unit receives the Backward calculation procedure and the weights as input and outputs them as bytecode. For example, the bytecode generator is provided in Chainer's Python layer.
  • the Native I / F has a virtual machine.
  • the virtual machine interprets the bytecode and executes NN algorithm processing.
  • the NN algorithm processing here is any one of forward processing, backward processing, and weight update, or a combination thereof.
  • FIG. 54 is a diagram illustrating a configuration example of a comparison unit according to an embodiment of the present invention. Further, the Chainer execution unit has a comparison unit. The comparison unit compares the input/output results of the existing execution unit and the Native layer execution unit that correspond to the same NN algorithm, or compares with each other the input/output results of Native layer execution units that call different implementations of the same Native I/F.
  • FIG. 55 is a diagram illustrating a configuration example of a function synthesis unit according to an embodiment of the present invention.
  • the Chainer execution unit has a function synthesis unit.
  • the function synthesis unit receives the Backward calculation procedure as input, and replaces a combination of Function class instances that can be handled by a "Native I/F that executes multiple algorithms simultaneously" with a single Function class instance corresponding to that "Native I/F that executes multiple algorithms simultaneously".
  • the above replacement is not performed if there is no “Native I / F that executes multiple algorithms simultaneously”.
  • the replacement here can be executed by partial match search when the Backward calculation procedure is regarded as a character string.
  • the function synthesis unit is provided in the Python layer of Chainer.
  • Configuration 4 (Configuration of optimization device that specializes forward processing execution)
  • Configuration 4-1 (Configuration 1 of optimization device specialized for forward processing execution; with weight optimization processing means) Further, the Chainer execution unit includes a weight optimization processing unit.
  • the weight optimization processing means executes a weight optimization process suitable for the Function class.
  • Configuration 4-2 (Optimization device configuration 2 specializing forward processing execution; with data memory area reuse means)
  • the Chainer execution unit and the Native I/F have means for reusing the data memory area.
  • the data memory area reuse means reuses a memory area for data input / output between layers.
  • the reuse means is provided in the Forward processing execution unit or in the virtual machine. For example, a flag indicating that only Forward processing is to be executed is provided among the arguments of the interface (defined by the Native I/F) that executes the Forward processing of the virtual machine. This processing is executed either when the volatile attribute is specified for the Variable that is input to a Chainer Function class instance, or when the flag indicating execution of only Forward processing is valid at the time the Forward processing of the virtual machine is executed.
  • Action 1 (Operation by the configuration of NativeIF)
  • the division of labor between developers who design and use NN algorithms and developers who are deeply aware of the hardware configuration of computers will be easier.
  • since the Native I/F guarantees that the interface for each NN algorithm to be executed is identical, a developer who designs and uses the algorithm can use various computers without changing the software that calls the Native I/F.
  • it becomes possible to select a computer based on more essential criteria such as its price and its strengths and weaknesses in a specific application.
  • by providing an optimized implementation for a computer that supports the Native I/F, a wide range of NN algorithm users can be expected to use that computer.
  • Action 1-1 (Effects of the configuration for executing identification and learning by NN) Developers who design and use the NN algorithm can realize the entire operation of the NN algorithm by calling the interfaces provided by the Native I/F from any processing system that uses the NN algorithm. In addition, a developer who designs and uses an NN algorithm can realize the entire operation of the NN algorithm using the implementation that is optimal for the computer being used, without being aware of the specific configuration of that computer.
  • Action 1-1-1 (Operation by Configuration 1 for executing identification / learning by NN; in the case of a configuration managing a multidimensional array (multidimensional array management unit))
  • a developer who designs and uses an NN algorithm can execute any combination of NN algorithms without performing unnecessary data conversion processing when executing the entire operation of the NN algorithm. At this time, by checking the aggregation result of the contents of the multidimensional array that is the processing result of an arbitrary NN algorithm, it is possible to confirm whether the NN algorithm performs the calculation as intended.
  • Action 1-2 (Operation by configuration for sharing data)
  • Action 1-2-1 (Operation by configuration 1 for sharing data; in the case of the data representation conversion unit) Information unique to each computer can be hidden.
  • Action 1-2-2 (Operation by configuration 2 for sharing data; + when having an external storage medium) After the weight data is converted into a device-independent data representation and stored in an external storage medium, identification processing can be executed on any computer using weights learned on a specific computer.
  • Action 1-2-3 (Operation by configuration 3 for sharing data; + when having a communication unit) Regardless of the hardware configuration of the computers, their physical distance, and the presence or absence of memory sharing, the data necessary to realize the entire operation of the NN algorithm can be exchanged. It is also possible, from a computer on which the processing system using the NN algorithm can operate, to call an NN algorithm implementation implemented on a computer on which that processing system cannot operate. Therefore, the entire operation of the NN algorithm can be realized using a plurality of computers connected to a computer network.
  • Action 2-1 (Operation by configuration 1 of the extended version of Native I / F; type conversion unit and NN algorithm execution unit for floating point and / or NN algorithm execution unit for fixed point)
  • the overall operation of the NN algorithm can be realized using a data type suitable for each computer.
  • the entire algorithm operation of the NN can be realized by using floating point arithmetic or fixed point arithmetic.
  • the computer A transfers the floating point type data generated by the floating point NN algorithm execution unit of the computer A to the computer B.
  • the computer B converts the floating-point type data transferred from the computer A into fixed-point type data by the type conversion unit, and then transfers the fixed-point type data to the computer C.
  • the computer C transfers the fixed-point type data generated by the fixed-point NN algorithm execution unit of the computer C to the computer B.
  • the computer B converts the fixed-point type data transferred from the computer C into floating-point type data by the type conversion unit, and then transfers the floating-point type data to the computer A.
  • Action 2-2 (Operation by configuration 2 of extended version of Native I / F; with memory pool module)
  • the entire operation can be realized in a lightweight manner, with fewer calls to costly memory management functions.
  • Action 2-3 (Operation by configuration 3 of the extended version of the Native I/F; when having an algorithm execution unit that fuses multiple NN algorithms) Unnecessary access to global memory can be avoided, and the overhead of function calls can be reduced. Therefore, frequently occurring combinations of NN algorithms can be executed at high speed.
  • Action 2-4 (Operation by configuration 4 of the extended version of Native I / F; multi-dimensional array data compression / decompression unit)
  • the amount of data communication of a multidimensional array can be reduced. Therefore, the operation speed can be improved.
  • Action 3 (Operation by Native I / F + Chainer execution unit)
  • Action 3-1 (Operation by configuration 1 of Native I / F + Chainer execution unit; with bytecode generation unit and virtual machine) Since the Chainer execution unit has a bytecode generation unit and the Native I / F has a virtual machine, the dependence on advanced libraries and programming languages can be reduced. Therefore, even in various computers including poor execution environments such as accelerators, the entire operation of the NN designed by Chainer can be executed while satisfying the product level requirements.
  • Action 3-2 (Operation by configuration 2 of Native I / F + Chainer execution unit; with comparison unit)
  • the comparison unit compares the input/output results of the existing execution unit and the Native layer execution unit that correspond to the same NN algorithm, or compares with each other the input/output results of Native layer execution units that call different implementations of the same Native I/F.
  • by having such a comparison unit, it is possible to compare the accuracy of the processing result of the floating-point NN algorithm execution unit with that of the fixed-point NN algorithm execution unit. It is also possible to compare the processing result of an execution unit that has been sufficiently tested to calculate the NN algorithm correctly with the processing result of a newly created Native layer implementation. It can therefore be verified that the newly created Native layer implementation calculates the NN algorithm correctly.
  • Action 3-3 (Operation by configuration 3 of Native I / F + Chainer execution unit; with function composition unit)
  • the function synthesis unit receives the Backward calculation procedure as input, and replaces a combination of Function class instances that can be handled by a "Native I/F that executes multiple algorithms simultaneously" with a Function class instance corresponding one-to-one to that Native I/F. If there is no "Native I/F that executes multiple algorithms simultaneously", the function synthesis unit does not perform the above replacement.
  • the Backward calculation procedure is automatically processed regardless of the presence or absence of “Native I / F that executes multiple algorithms simultaneously”.
  • the function synthesis unit can convert Convolution2D + BatchNormalization into a single Convolution2D by adjusting the weights and biases. The same applies to conversion from Linear + BatchNormalization to a single Linear.
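  • A sketch of the weight and bias adjustment behind the Linear + BatchNormalization fusion mentioned above is shown below; the flat row-major weight layout and the parameter names are assumptions introduced for illustration.

    /* Fold a BatchNormalization that follows a Linear layer (y = Wx + b) into
       W and b, producing a single equivalent Linear layer. */
    #include <math.h>

    void fold_batchnorm_into_linear(float *W, float *b,          /* W: out_dim x in_dim, row-major */
                                    const float *gamma, const float *beta,
                                    const float *mean, const float *var,
                                    float eps, int out_dim, int in_dim)
    {
        for (int i = 0; i < out_dim; ++i) {
            float s = gamma[i] / sqrtf(var[i] + eps);   /* per-output scale */
            for (int j = 0; j < in_dim; ++j)
                W[i * in_dim + j] *= s;                 /* scale each weight row */
            b[i] = s * (b[i] - mean[i]) + beta[i];      /* fold shift into bias  */
        }
    }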
  • Memory can be reduced by reducing the amount of weight information in forward processing or the amount of data memory of input data and executing forward processing. Further, the amount of calculation can be reduced by reducing the number of weight elements or executing the forward processing without calculating zero weight.
  • Action 4-1 (Operation by the optimization device 1 specializing forward processing execution; when there is a weight optimization processing means)
  • weight optimization processing can be executed for any instance of the Function class included in the learned network configuration.
  • since the weight optimization processing can be executed, the memory and the amount of calculation in the Forward processing can be reduced. As a result, the entire operation of the NN algorithm can be executed at high speed.
  • Action 4-2 (Operation by the optimizing device 2 specializing forward processing execution; in the case of having a means for reusing the data memory area)
  • by giving a flag indicating execution of only the Forward processing as an argument to the execution unit of the Forward processing (Chainer or the virtual machine), the memory used during Forward processing can be reduced. As a result, the entire operation of the NN algorithm can be executed at high speed.
  • FIG. 24 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (first method) according to an embodiment of the present invention.
  • a mounting apparatus according to an embodiment mainly includes an evaluation board (motherboard) 100 and an embedded chip (embedded semiconductor integrated circuit) 200 detachably mounted on the evaluation board 100.
  • the evaluation board 100 mainly includes a CPU 101, a main memory 102, a communication I / F 103, and an external memory 104. Each of these components is electrically connected via an internal bus 109.
  • the CPU 101 loads various programs such as an operating system from the external memory 104 into the main memory 102, and executes instructions included in the loaded programs.
  • the main memory 102 is used for storing a program executed by the CPU 101, and is constituted by, for example, a DRAM.
  • the communication I/F 103 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination thereof, and can communicate, via a communication network (not shown) including Ethernet (registered trademark) and the Internet, with a computer and an input/output device (not shown) operated by a developer or the like.
  • the communication I / F 103 can also communicate with a communication I / F 204 described later of the embedded chip 200.
  • the external memory 104 is configured by a flash memory, for example, and stores various programs such as an operating system.
  • the embedded chip 200 includes a CPU 201, an accelerator (auxiliary arithmetic unit) 202, a main memory 203, a communication I / F 204, and an external memory 205. These components are electrically connected via an internal bus 209.
  • the embedded chip can optionally include a GPU (not shown).
  • the CPU 201 loads the source code (for example, source code written in Python) received from the evaluation board 100 (communication I/F 103) via the communication I/F 204 into the main memory 203, and executes each code included in the loaded source code.
  • the accelerator 202 loads the source code (source code written in C language, assembler, or the like) received from the evaluation board 100 (communication I/F 103) via the communication I/F 204 into the main memory 203, and executes each code included in the loaded source code.
  • the main memory 203 is used for storing source code executed by the CPU 201 and the accelerator 202, and is configured by a DRAM, for example.
  • the communication I / F 204 communicates with the communication I / F 103 of the evaluation board 100 to transmit / receive various information.
  • the external memory 205 is configured by, for example, a flash memory and stores various data.
  • FIG. 25 is a flowchart showing an example of a procedure used in the mounting method according to the embodiment of the present invention.
  • first, in step 301, source code described in a first programming language (for example, Python) is executed on a personal computer or the like.
  • the developer confirms whether the source code operates on a personal computer or the like based on the execution result.
  • the personal computer or the like refers to a computer having abundant calculation resources, and includes, for example, the learning apparatus according to the embodiment described in the first part.
  • the state in which the source code operates in the personal computer or the like in step 301 is the same state as step I described in “4-1” above.
  • in step 302, using the evaluation board 100, the CPU 201 of the embedded chip 200 is caused to execute the source code written in Python or the like that was confirmed in step 301 to operate on a personal computer or the like.
  • the developer confirms whether or not the source code is operable by the CPU 201 based on the execution result.
  • Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104.
  • the source code described in Python or the like can be passed to the CPU 201 via the communication I / F 103 of the evaluation board 100 and the communication I / F 204 of the embedded chip 200. If it is determined that the source code is not operable by the CPU 201, the developer corrects the source code and repeats step 302. When it is confirmed that the source code is operable by the CPU 201, the developer or the like proceeds to the next step 303.
  • in step 303, the developer or the like rewrites at least a part of the source code that was confirmed in step 302 to be operable by the CPU 201 into a second programming language (for example, C language or assembler) so that it can be operated by the accelerator 202.
  • in step 304, the source code rewritten into C language or the like in step 303 is executed by the accelerator 202 of the embedded chip 200 using the evaluation board 100.
  • the developer confirms whether or not the rewritten source code is operable by the accelerator 202 based on the execution result. Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104.
  • the source code described in C language or the like can be passed to the accelerator 202 via the communication I / F 103 of the evaluation board 100 and the communication I / F 204 of the embedded chip 200. If it is determined that the source code is not operable by the accelerator 202, the developer corrects the source code and repeats step 304. When it is confirmed that the source code is operable by the accelerator 202, the developer or the like proceeds to the next step 305.
  • in step 305, in the evaluation board 100, the result of the CPU 201 executing the first specific code (the code to be verified) included in the source code described in Python or the like and the result of the accelerator 202 executing the second specific code, which is included in the source code described in C language or the like and is the first specific code rewritten from Python or the like into C language or the like, are compared (for example, using a module called a unit test executed by the embedded chip 200), and the comparison result is output.
  • the developer verifies whether the same output is obtained for the same input in both execution results.
  • Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104.
  • until this verification is completed, the developer or the like repeats steps 303 to 305 described above.
  • when this verification is completed, the developer or the like moves to the next step 306.
  • in step 306, the developer or the like tunes the source code written in C language or the like, which was verified in step 305, so that it operates at a higher speed on the accelerator 202.
  • in step 307, in the evaluation board 100, the result of the CPU 201 executing the source code described in Python or the like and the result of the accelerator 202 executing the source code described in C language or the like tuned in step 306 are compared (for example, using a module called a unit test executed by the embedded chip 200), and the comparison result is output. Based on the comparison result, the developer verifies whether the same output is obtained for the same input in both execution results. Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Until this verification is completed, the developer or the like repeats step 306 and step 307 described above. When this verification is completed, the developer or the like moves to the next step 308.
  • in the state where step 307 is completed, the embedded chip 200 is operated by two kinds of source code, one written in Python or the like and the other in C language or the like. This state will be described with reference to FIG. 26.
  • FIG. 26 is a schematic diagram showing an operation state of the embedded chip in the mounting method according to the embodiment of the present invention.
  • in the state of step 301 (which corresponds to step I), the function calling side (that is, the main body that calls functions) is described in Python or the like, and the called side (that is, the functions that are called) is also described in Python or the like.
  • in the state where step 307 is completed, the function calling side is still described in Python or the like, while the called side is a mixture of functions written in Python or the like and functions written in C language or the like. That is, in this state, the embedded chip 200 is operated by two kinds of source code, one in Python or the like and the other in C language or the like.
  • the final goal of the mounting method according to the present embodiment is, as shown at the right end of FIG. 26, a state in which both the calling side and the called side are described in C language or the like, that is, a state in which the embedded chip 200 is operated only by source code described in C language or the like.
  • in step 308, so that the embedded chip 200 operates only with source code written in C language or the like, the developer or the like rewrites into C language or the like all parts of the source code written in Python or the like that have not yet been rewritten.
  • when step 308 is completed, the dependency of the embedded chip 200 on Python is removed.
  • the source code described in C language or the like generated in this way is stored in the external memory 205 or the like of the embedded chip 200.
  • the embedded chip 200 can read the source code stored in the external memory 205 or the like and cause the accelerator 202 to execute the machine learning.
  • this state is the state targeted by the mounting method according to the embodiment, in which the problems described in "1" and "2" above have been solved.
  • FIG. 27 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (second method) according to an embodiment of the present invention.
  • the mounting device (FIG. 27) used in the second method is different from the mounting device (FIG. 24) used in the first method in that the embedded chip 200 does not include a CPU.
  • the operation performed by the CPU 201 of the embedded chip 200 in the first method is performed by a CPU provided in a computer (such as a personal computer) (not shown) provided outside.
  • the computer (such as a personal computer) may be, for example, the learning device such as the personal computer illustrated in FIG. 11 described in the first part.
  • the evaluation board 100 shown in FIG. 27 is communicably connected, for example via the communication I/F 103, to the computer (not shown) provided outside, so that it can cause the CPU provided in that computer to execute the source code written in Python and receive the execution result.
  • a module is a collection of procedures and data defined and implemented to achieve a specific purpose (a concept that is independent of the support of a specific programming language)
  • a class is a module defined and implemented with the support of an object-oriented language such as Python
  • the Native layer refers to the layer of the Native I / F and the implementation (software and hardware) called from it.
  • the Python layer is a software layer that is supposed to be executed on the Python language. Currently, Chainer is written in the Python language, but Chainer may be ported to another programming language in the future. The functions described here as belonging to the Python layer therefore do not necessarily mean specialization in the Python language. As a division of roles between the Python layer and the Native layer, the Python layer assumes a development environment with a high level of abstraction that is more suitable for algorithm design, and the Native layer assumes a development environment with a low level of abstraction that is more conscious of the hardware configuration.
  • FIG. 52 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • the execution part is a method of the Function / Optimizer class for actually calculating the neural network algorithm.
  • the existing execution unit is a general-purpose computer execution unit and / or a GPU execution unit.
  • the general-purpose computer execution unit calculates the NN algorithm using the general-purpose computer.
  • the GPU execution unit calculates the NN algorithm using the GPU.
  • the Native execution unit calculates the NN algorithm using the Native layer implementation. Since the Native layer is implemented for each type of computer, all computer types (general-purpose computers, GPUs, accelerators) can operate through Native I / F.
  • FIG. 28 is a schematic diagram conceptually showing functions of a mounting apparatus according to an embodiment of the present invention.
  • the mounting apparatus 400 mainly includes a drive unit 401, a Function class / Optimizer class 402, a general-purpose computer execution unit 403, a GPU execution unit 404, a Native layer execution unit 405, a multidimensional array 406 for general-purpose computers, a multidimensional array 407 for GPU, a multidimensional array 408 for Native, and a Variable class 409.
  • the drive unit 401 mainly includes an execution unit that instructs the Function class / Optimizer class 402 to execute an algorithm (function), and a comparison unit that compares the execution result of the algorithm (function) by the general-purpose computer execution unit 403 (or the GPU execution unit 404) with the execution result by the Native layer execution unit 405, for example using a module referred to as a unit test, and outputs the comparison result.
  • the Function class / Optimizer class 402 causes at least one of the general-purpose computer execution unit 403, the GPU execution unit 404, and the Native layer execution unit 405 to execute the algorithm (function) instructed from the drive unit 401.
  • the general-purpose computer execution unit 403 acquires, from the multidimensional array 406 for general-purpose computers, a multidimensional array corresponding to the algorithm (function) instructed by the Function class / Optimizer class 402, and executes the algorithm (function) using the CPU.
  • the execution result is returned to the drive unit 401 via the Function class / Optimizer class 402.
  • the GPU execution unit 404 acquires a multidimensional array corresponding to the algorithm (function) instructed from the Function class / Optimizer class 402 from the GPU multidimensional array 407, and executes the algorithm (function) using the GPU.
  • the execution result is returned to the drive unit 401 via the Function class / Optimizer class 402.
  • the Native layer execution unit 405 acquires, from the multidimensional array 408 for Native, a multidimensional array corresponding to the algorithm (function) instructed by the Function class / Optimizer class 402, and executes the algorithm (function) using an accelerator.
  • the execution result is returned to the drive unit 401 via the Function class / Optimizer class 402.
  • the Variable class 409 holds all the multidimensional arrays used by the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPU, and the multidimensional array 408 for Native, and supplies the corresponding multidimensional arrays to each of them.
  • the general-purpose computer execution unit 403 executes the algorithm (function) using the CPU 201 mounted on the embedded chip 200.
  • the GPU execution unit 404 executes the algorithm (function) using a GPU (not shown) mounted on the embedded chip 200.
  • the Native layer execution unit 405 executes the algorithm (function) mainly using the accelerator 202 mounted on the embedded chip 200.
  • the Function class / Optimizer class 402, the general-purpose computer execution unit 403, the GPU execution unit 404, the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPU, and the Variable class 409 are arranged in a computer (such as a personal computer) provided outside.
  • the implementation of the Native layer is still arranged in the embedded chip 200.
  • the general-purpose computer execution unit 403 executes an algorithm (function) using the CPU of the computer provided outside, and the GPU execution unit 404 uses the GPU of the computer provided outside. Execute algorithm (function).
  • FIG. 29 is a schematic diagram illustrating a configuration example of a Native layer execution unit included in the mounting apparatus according to an embodiment of the present invention.
  • the Native layer execution unit 405 mainly includes a NativeDevice class 501, a NativeArray class 502, a Function class / Optimizer class 503, and a byte code generation unit 504 in the Python layer.
  • the Function class / Optimizer class 503 shown in FIG. 29 and the Function class / Optimizer class 402 shown in FIG. 28 are the same component, and the NativeArray class 502 shown in FIG. 29 and the multidimensional array 408 for Native shown in FIG. 28 are the same component.
  • the Native layer execution unit 405 mainly includes, in the Native layer, a device management module 505, a data conversion module 506, a multidimensional array module 507, a Function module / Optimizer module 508, a virtual machine module 509, and a memory pool module 510.
  • the NativeDevice class 501 wraps the device management module of the Native layer in the Python layer and hides function calls to and data input/output with the Native layer.
  • the NativeArray class 502 wraps a multi-dimensional array in the Native layer with a Python layer.
  • the Function class wraps the Function module in the Native layer in the Python layer
  • the Optimizer class wraps the Optimizer module in the Native layer in the Python layer.
  • the Function class and the Optimizer class have been implemented in Chainer, and have a function of hiding the difference in execution between the general-purpose computer and the GPU. Execution in the Native layer can be hidden by extending this function.
  • the byte code generation unit generates a byte code. The details of the constituent elements illustrated in FIG. 29 will be described later.
  • deep learning is a developing technology under active research and development. It is therefore anticipated that, during the development period for an embedded chip, a new layer algorithm with better performance than conventional ones will be invented, and that there will be a demand to incorporate the new algorithm into the software or hardware implementation under development.
  • the goal is a state in which the configuration of the neural network including the new layer algorithm satisfies the product-level specifications in the embedded environment.
  • in step 2, the algorithm verified by implementation in step 1 is combined with the neural network modules that have already been optimized and implemented on the embedded chip, and the operation is verified. In step 3, based on the result, optimization specialized for the corresponding chip is applied.
  • when the neural network algorithm is operated on Python, the mounting apparatus according to the embodiment is configured so that an execution unit using the Python language that operates on a general-purpose computer, an execution unit using a GPU, and an execution unit using the optimized implementation for a specific chip can each be called for each layer, and so that, by way of the bytecode, the entire neural network algorithm can be operated using only the optimized implementation for the specific chip.
  • between steps 1 and 2 described in the previous paragraph, the algorithm implementation code created in step 1 can be reused in step 2, and differences in the operation results between steps 1 and 2 can be easily compared.
  • the optimized implementation created for step 2 can be reused in step 3, and conversely, corrections of defects in the optimized implementation found in step 3 can also be reused in step 2. As a result, a state in which the configuration of the neural network including the new layer algorithm satisfies the product-level specifications in the embedded environment can be realized with the minimum development cost.
  • “Overall operation” represents a processing unit in which forward processing alone or forward processing, backward processing, and weight update processing are repeatedly executed. This overall operation is envisaged as a typical neural network learning and identification embodiment.
  • the device management module 505 performs initialization and release processing of devices (software and hardware states that depend on the optimization implementation). The specific processing performed in the device management module 505 differs depending on the form of the device; a typical example is securing or releasing the memory pool described later.
  • the device need not be on the same chip or on the same board as the general-purpose computer that runs Chainer and Python. It is possible to implement an optimized implementation that communicates with a device on another board and initializes / releases it.
  • Example 1 Device * chnr_init_device (void) As a result, the device can be initialized.
  • Example 2 void chnr_release_device (Device * device) As a result, the device can be released.
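  • A minimal usage sketch of the two device management functions above is shown below; what happens between initialization and release, and the error handling, are assumptions that depend on the optimization implementation.

    #include <stddef.h>

    typedef struct Device Device;                 /* defined by the Native I/F implementation */
    Device *chnr_init_device(void);               /* (Example 1) above */
    void    chnr_release_device(Device *device);  /* (Example 2) above */

    void device_lifecycle_example(void)
    {
        Device *dev = chnr_init_device();   /* e.g. secures the memory pool */
        if (dev != NULL) {
            /* ... execute Forward/Backward/weight update processing here ... */
            chnr_release_device(dev);       /* e.g. releases the memory pool */
        }
    }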
  • Specific processing contents in each function include those exemplified in “2-6” to “2-12” in the first part.
  • the multidimensional array module 507 manages a multidimensional array that is input and output between the functions of the Native layer.
  • the multidimensional array module 507 can manage an arbitrary size and number of dimensions. Further, as will be described later, the multidimensional array module 507 has a mechanism for mutual conversion with Numpy (multidimensional array class of python layer on which Chainer depends) and a multidimensional array library for GPU. Furthermore, the multidimensional array module 507 can hold not only floating point types but also fixed point types. As a result, the calculation of the neural network can be easily realized even with hardware having no FPU (floating point arithmetic unit).
  • the multidimensional array module 507 has a function of mutual conversion with a floating-point multidimensional array.
  • FIG. 30 shows an example of the structure definition of the multidimensional array module of the mounting apparatus according to the embodiment of the present invention.
  • function definition examples are as follows.
  • (Example 1) MDArray chnr_create_md_array (dimensions [], numaxis, type)
  • a multidimensional array can be generated and initialized.
  • Example 2 void chnr_delete_md_array (MDArray * mdrray)
  • a multidimensional array can be deleted.
  • Example 3 void chnr_md_array_add (MDArray * dst, MDArray * a, MDArray * b)
  • the elements of the multidimensional array can be added.
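  • Since FIG. 30 itself is not reproduced in this text, the following is only an illustrative guess at what such a structure definition could contain, based on the attributes listed above (number of axes, number of elements per axis, data type, fixed or floating point); the field names and the fixed axis limit are assumptions.

    typedef enum { CHNR_TYPE_FLOAT32, CHNR_TYPE_FIXED_POINT } ChnrDataType;

    typedef struct MDArray {
        void        *data;           /* contiguous element buffer                    */
        int          numaxis;        /* number of axes (dimensions)                  */
        int          dimensions[8];  /* number of elements per axis (8 is arbitrary) */
        ChnrDataType type;           /* float32 or fixed point                       */
        int          q_shift;        /* fractional-bit count for the fixed-point case */
    } MDArray;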
  • generation and destruction (management) of the memory area that stores the entities of multidimensional arrays is realized in the Native layer.
  • as a hardware environment, an environment having a memory configuration that cannot be managed by the memory management mechanism (malloc/free) provided as standard by a Linux (registered trademark) OS or the like can be considered.
  • considering the division of roles in the software hierarchy, in which the Python layer is responsible for algorithm development and the Native layer for development that is strongly aware of the hardware, it is appropriate to implement in the Native layer the management mechanism that handles the characteristics of such hardware environments. This memory management mechanism can be reused when the virtual machine described later is used (that is, when the dependency on the Python layer is removed).
  • a class that wraps the multi-dimensional array in the Native layer is prepared in the Python layer, and the generation / release timing of the memory area is matched with the instance lifetime of this Python class. Such a mechanism is necessary in order to handle multidimensional arrays naturally in Python code.
  • the “Define-by-Run” function also depends on the Python memory management mechanism.
  • Fig. 31 shows the mutual compatibility and reference relationship of multidimensional array data.
  • the memory pool module 510 is a mechanism for reducing the number of calls to a costly memory management mechanism (costly in, for example, the number of processing cycles) by reusing memory areas that have been secured once.
  • An example of function definition is as follows. (Example 1) void chnr_momory_pool_init (MemoryPool * momory_pool) Thereby, the memory pool can be initialized. (Example 2) void chnr_momory_pool_release (MemoryPool * momory_pool) Thereby, the memory pool can be discarded.
  • Example 3 void * chnr_momory_pool_alloc_buffer (MemoryPool * momory_pool, int byte_size, void * old_addr) Thereby, a memory can be secured.
  • Example 4 void chnr_momory_pool_free_buffer (MemoryPool * momory_pool, void * addr) Thereby, the memory can be released.
  • Memory pool implementation example (1) An example of structure definition is shown in FIG.
  • the processing flow when securing memory is as follows. 1. Search the buffer_size array for an index whose released flag is 1 and whose previously secured size matches the size to be secured this time, and return the value of buffer_addr (the memory buffer address) for that index.
  • the released flag is managed, for example, by the sign bit of the buffer_size array element. By searching for the array element based on the combination of the size and the address obtained the previous time, changes of address can be reduced. 2. If no matching index is found, memory is actually allocated (by calling malloc or the like), its address and size are added to the arrays, and the address is returned.
  • Memory pool implementation example (2) As a process for releasing the memory, the address to be released is searched from the buffer_addr array, and if the address is found, the released flag is set to 1. As a process when releasing the memory pool, the memory is released (such as calling the free function) for the element whose address is set from the buffer_addr array.
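  • The allocation and release flow described above can be condensed into the following sketch; it differs from the structure of FIG. 32 in that the released flag is kept as a separate array rather than in the sign bit of buffer_size, and the fixed capacity and the simplified search (no previous-address matching) are assumptions.

    #include <stdlib.h>

    #define POOL_MAX 256

    typedef struct MemoryPool {
        void  *buffer_addr[POOL_MAX];
        size_t buffer_size[POOL_MAX];
        int    released[POOL_MAX];
        int    count;
    } MemoryPool;

    void *pool_alloc(MemoryPool *p, size_t byte_size)
    {
        /* 1. reuse a released buffer whose size matches */
        for (int i = 0; i < p->count; ++i) {
            if (p->released[i] && p->buffer_size[i] == byte_size) {
                p->released[i] = 0;
                return p->buffer_addr[i];
            }
        }
        /* 2. otherwise actually allocate and register the buffer */
        if (p->count == POOL_MAX) return NULL;
        void *addr = malloc(byte_size);
        if (addr != NULL) {
            p->buffer_addr[p->count] = addr;
            p->buffer_size[p->count] = byte_size;
            p->released[p->count] = 0;
            p->count++;
        }
        return addr;
    }

    void pool_free(MemoryPool *p, void *addr)
    {
        /* releasing only marks the buffer as reusable */
        for (int i = 0; i < p->count; ++i)
            if (p->buffer_addr[i] == addr) { p->released[i] = 1; return; }
    }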
  • the Optimizer module 508 is a function group that performs weight update for each layer of the neural network that has weights.
  • the Optimizer module 508 defines the following functions for each weight update algorithm.
  • (Example 1) chnr_op_init_state_xxxx Thereby, the internal state initialization process of the weight update algorithm can be implemented (floating-point version).
  • (Example 3) chnr_op_init_state_xxxx_fx Thereby, the internal state initialization process of the weight update algorithm can be implemented (fixed-point version).
  • (Example 4) chnr_op_update_one_xxxx_fx Thereby, the weight update process can be implemented (fixed-point version).
  • here, xxxx represents a name assigned to each weight update algorithm.
  • the weight update algorithm can include the algorithm described in “2-13” of the first part.
  • the data conversion module 506 is a function group that performs data format conversion.
  • An example of function definition is as follows.
  • (Example 1) chnr_float_to_fixed (MDArray * dst, MDArray * src, int Q)
  • the floating-point type can be converted to the fixed-point type.
  • (Example 2) chnr_fixed_to_float (MDArray * dst, MDArray * src) Thereby, it is possible to convert from the fixed-point type to the floating-point type.
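  • Conceptually, the element-wise conversion performed by chnr_float_to_fixed / chnr_fixed_to_float can be sketched as follows for a Q-format value with Q fractional bits; the int32 representation and the rounding choice are assumptions.

    #include <stdint.h>
    #include <math.h>

    static inline int32_t float_to_fixed(float x, int q)
    {
        return (int32_t)lrintf(x * (float)(1 << q));   /* scale and round */
    }

    static inline float fixed_to_float(int32_t x, int q)
    {
        return (float)x / (float)(1 << q);             /* undo the scaling */
    }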
  • in circuit design for large-scale parallel computation, designs that do not use an FPU (floating-point arithmetic unit) are often adopted, many of them aiming at reducing hardware resources (number of transistors and power consumption). To execute a numerical computation algorithm such as a neural network without using an FPU, a data type called the fixed-point type, which expresses a numerical value including information after the decimal point by using an integer arithmetic unit and a shift arithmetic unit, is often used.
  • the floating-point type is a data type suited to algorithm design in the sense that real values can be handled more intuitively, whereas the fixed-point type is a data type suited to effective utilization of hardware resources.
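  • As a small illustration of the "integer arithmetic unit and shift arithmetic unit" point above, multiplying two Q-format fixed-point values needs only an integer multiply followed by a right shift; the 64-bit intermediate used here to avoid overflow is an assumption.

    #include <stdint.h>

    static inline int32_t fixed_mul(int32_t a, int32_t b, int q)
    {
        int64_t wide = (int64_t)a * (int64_t)b;   /* integer multiply          */
        return (int32_t)(wide >> q);              /* shift restores the Q format */
    }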
  • Device-independent data representation refers to data representation that does not have information that depends on a particular computer.
  • a typical implementation of this data representation is a multi-dimensional array in C language format with continuous memory addresses. If you use a library such as numpy in the Python layer, you can easily handle such data representation, but it does not specify the library or host language.
  • a device-dependent data representation is a data representation suitable for an optimization implementation specialized for a specific computer. By preparing functions that mutually convert these two data representations, a hardware-aware optimized implementation and an algorithm-design-oriented implementation (for example, easy-to-read code written in Python with a structure close to the mathematical formulas) can cooperate to execute the entire operation.
  • Communication means: the implementation of the Native layer function group (Native I/F) described so far can be adapted, with the following changes, so that the entire operation is executed at high speed while communicating with a device on a separate chip or substrate: (1) RPC (remote procedure call), (2) instruction queue, (3) reduction of the data communication amount of multidimensional arrays, and (4) asynchronous execution of transfer and calculation. The following terms are defined to explain these change policies.
  • "Host device": the device used in the normal implementation (the device that runs Chainer code on Python). "Remote device": a device that requires communication processing because it is on a separate chip or substrate.
  • "RPC (remote procedure call)": when a function defined in the Native I/F is called, the requested processing (securing memory or executing a calculation) is not executed directly; instead, information indicating the processing request (function type and arguments) is generated and transmitted to the remote device, the remote device executes the processing based on that information, and the host device then receives the processing result.
  • "Instruction queue": communication of processing requests by RPC is not performed each time a function defined in the Native I/F is called; instead, the information indicating the processing requests is first accumulated in a queue (FIFO buffer) to improve the communication schedule. "Reduction of the data communication amount of multidimensional arrays": since multidimensional arrays have an enormous data size, reducing the amount of communication is an important issue for speeding up the entire operation. There are two major ways to reduce the amount of communication: (1) reduce the number of transfers of multidimensional arrays, and (2) reduce the data communication amount of each individual multidimensional array.
  • the “weight” may be transferred to the remote device at the initial stage of defining the network structure, and transferred to the host device at the end of learning.
  • the conversion functions between the device-independent data representation and the device-dependent data representation described for the data conversion module 506 are well suited to managing such transfer timing. Specifically, each function performs the following processing: when converting from the device-independent data representation to the device-dependent data representation, data is transferred from the host device to the remote device; when converting from the device-dependent data representation to the device-independent data representation, data is transferred from the remote device to the host device.
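  • The RPC and instruction queue policies above could be realized with request records accumulated in a FIFO buffer, as sketched below; the structure layout, request types, and queue capacity are assumptions introduced for illustration.

    #include <stdint.h>

    typedef enum { REQ_FORWARD, REQ_BACKWARD, REQ_UPDATE_WEIGHT } RequestType;

    typedef struct {
        RequestType type;       /* which Native I/F processing is requested   */
        int32_t     arg_ids[4]; /* ids of multidimensional arrays / weights   */
    } RpcRequest;

    typedef struct {
        RpcRequest requests[64];  /* FIFO buffer (instruction queue)          */
        int        count;
    } InstructionQueue;

    /* enqueue instead of executing immediately; the queue is flushed to the
       remote device at a convenient point in the communication schedule */
    int enqueue_request(InstructionQueue *q, RpcRequest r)
    {
        if (q->count >= 64) return -1;   /* caller should flush first */
        q->requests[q->count++] = r;
        return 0;
    }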
  • the virtual machine module 509 is a function group that realizes a function of interpreting byte codes and executing neural network learning / identification processing (forward processing, backward processing, and weight update).
  • Byte code is assumed to be generated by a Python layer byte code output device, which will be described later, but even byte code generated by other software can be interpreted and executed by the virtual machine module if the format is correct.
  • An example of function definition is as follows. (Example 1) void chnr_init_with_bytecode (VMState * state, char * byte_code) Thereby, it is possible to parse the bytecode and initialize the internal state of the virtual machine.
  • Example 2 void chnr_forward (VMState * state) Thereby, a forward process can be performed.
  • Example 3 void chnr_backward (VMState * state) Thereby, the backward process can be executed.
  • Example 4 void chnr_update_weight (VMState * state) Thereby, the weight update process can be executed.
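  • A minimal sketch of driving the virtual machine with the four functions above is shown below; VMState and the chnr_* prototypes are assumed to be those listed above, the iteration count and the way minibatch data is supplied are assumptions, and error handling is omitted.

    /* learning loop: initialize from bytecode, then repeat
       Forward -> Backward -> weight update */
    void vm_training_loop_example(VMState *state, char *byte_code, int iterations)
    {
        chnr_init_with_bytecode(state, byte_code);  /* parse bytecode, initialize internal state */
        for (int i = 0; i < iterations; ++i) {
            /* ... copy the next minibatch into the input arrays here ... */
            chnr_forward(state);         /* Forward processing  */
            chnr_backward(state);        /* Backward processing */
            chnr_update_weight(state);   /* weight update       */
        }
    }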
  • Byte code format example Store the following information in binary format.
  • Input / output data information ⁇ number of array dimensions / size, data type (float32, FixedPoint) ⁇ * Variable number
  • Weight information ⁇ number of array dimensions / size, data type (float32, FixedPoint), realized value ⁇ * weight Number
  • Function call information during backward processing ⁇ Function type, I / O data index, weight information index, function specific parameters for each function type ⁇ * Number of functions
  • Weight update type and parameters
  • furthermore, indices of the multidimensional arrays that serve as the input and output of the entire neural network processing may be added to the bytecode.
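  • The listed contents could be carried by record structures such as the following sketch; the field names, fixed array sizes, and ordering are assumptions, since only the kinds of information stored (input/output data information, weight information, Function call information for Backward processing, weight update type and parameters) are specified above.

    #include <stdint.h>

    typedef struct {
        int32_t numaxis;
        int32_t dimensions[8];
        int32_t data_type;        /* float32 or FixedPoint */
    } BcArrayInfo;                /* one entry per input/output variable */

    typedef struct {
        BcArrayInfo shape;
        /* realized weight values follow in the binary stream */
    } BcWeightInfo;               /* one entry per weight */

    typedef struct {
        int32_t function_type;
        int32_t io_data_index[4];
        int32_t weight_index[2];
        /* function-type-specific parameters follow */
    } BcFunctionCall;             /* one entry per Function, in Backward-processing order */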
  • by doing so, the user code that uses the virtual machine can appropriately associate, with the function calls, the multidimensional arrays that serve as the input and the output of the entire neural network processing. For example, this association can be performed by the following flow.
  • (Step 1) The user code obtains the input multidimensional array of the entire process by calling a function prepared in the configuration of the virtual machine.
  • Step 2 The user code copies the input data to the multidimensional array acquired in Step 1 above.
  • Step 3 The user code calls a function for executing the entire operation prepared in the configuration of the virtual machine.
  • Step 4 The user code obtains the output multidimensional array of the entire process by calling a function prepared in the configuration of the virtual machine (this multidimensional array is the result of the overall operation executed in Step 3 above). (The functions in step 3 and step 4 do not necessarily need to be separated and may be integrated functions). (Step 5) The user code acquires the contents of the output data from the multidimensional array acquired in Step 4.
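  • The five steps above might look as follows from the user code side; vm_get_input_array, vm_run_forward, and vm_get_output_array are hypothetical names for the "functions prepared in the configuration of the virtual machine", and the MDArray data field follows the illustrative structure sketched earlier.

    #include <string.h>

    /* hypothetical accessors provided by the virtual machine configuration */
    MDArray *vm_get_input_array(VMState *state);
    MDArray *vm_get_output_array(VMState *state);
    void     vm_run_forward(VMState *state);

    void vm_identification_example(VMState *state, const float *input, float *output,
                                   size_t in_bytes, size_t out_bytes)
    {
        MDArray *in = vm_get_input_array(state);       /* Step 1 */
        memcpy(in->data, input, in_bytes);             /* Step 2 */
        vm_run_forward(state);                         /* Step 3 */
        MDArray *out = vm_get_output_array(state);     /* Step 4 */
        memcpy(output, out->data, out_bytes);          /* Step 5 */
    }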
  • a configuration diagram of the internal state of the virtual machine is shown in the corresponding drawing.
  • Virtual machine module execution flow example (1) forward processing and backward processing
  • the virtual machine module executes processing such as pseudo code shown in FIG.
  • Virtual machine execution flow example (2) (Optimizer) The virtual machine module executes processing such as pseudo code shown in FIG.
  • Configuration 2 of the optimization device specializing in Forward processing execution; with data memory area reuse means (1): when the entire operation performs only identification processing without learning (weight update), only the Forward processing needs to be executed. In this case, the following data are unnecessary: (1) data input/output between layers that is not accessed by the currently executing Function, (2) the weight gradients, and (3) the internal state of the weight update algorithm. Accordingly, when initializing the internal state of the virtual machine, it is not necessary to secure the weight gradients or the internal state of the weight update algorithm. For data input/output between layers, the memory allocation amount can be reduced by, for example, the procedure described in the next paragraph.
  • Configuration 2 of optimization device specializing execution of forward processing; with data memory area reuse means (2)
  • example procedure when initializing the internal state of the virtual machine module: (1) for each Function, calculate the sum of the data sizes (memory sizes) of its inputs and outputs, and secure a memory area of the maximum of these sums.
  • (2) when a structure (MDArray) that handles a multidimensional array is initialized, its address is set so that the memory area secured in (1) is reused. By setting the addresses so that the left end and the right end of the memory area are used alternately as input and output for each layer, copying of array data is prevented. If a Function whose input and output span loop iterations is included, the output data carried over to the next iteration is excluded from this reuse, and a memory area is secured for it individually.
  • FIG. 38 shows an example of address setting for data input / output by Function.
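As an illustration of the two-ended reuse described above, the following Python sketch computes the size of the shared memory area as the largest per-Function input-plus-output footprint and assigns offsets so that inputs and outputs alternate between the left and right ends of the area for successive layers. It is a simplified model (each Function is reduced to its input and output byte sizes); a real implementation must also place each input at the exact address where the producing Function wrote its output, and must exclude data carried across loop iterations as noted above.

def assign_reuse_offsets(functions):
    # functions: list of dicts {"in_sizes": [...], "out_sizes": [...]} in execution order.
    # The shared region is as large as the largest per-Function input+output footprint.
    region_size = max(sum(f["in_sizes"]) + sum(f["out_sizes"]) for f in functions)
    plan = []
    inputs_on_left = True
    for f in functions:
        in_offsets, out_offsets = [], []
        left, right = 0, region_size
        for s in f["in_sizes"]:            # pack inputs from one end
            if inputs_on_left:
                in_offsets.append(left)
                left += s
            else:
                right -= s
                in_offsets.append(right)
        for s in f["out_sizes"]:           # pack outputs from the opposite end
            if inputs_on_left:
                right -= s
                out_offsets.append(right)
            else:
                out_offsets.append(left)
                left += s
        plan.append({"in": in_offsets, "out": out_offsets})
        inputs_on_left = not inputs_on_left    # the next layer reads where this one wrote
    return region_size, plan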
• Linking data input/output between the virtual machine and external code: a list of the data input/output between the Functions is created during the "initialization of the internal state of the virtual machine". The simplest way to link input and output is to access the elements of this list directly. If the variable name of the Python Variable instance is stored in the "input/output data information" at the time of bytecode generation, the input and output can instead be linked using this name.
• The NativeArray class 502 is a class that wraps a multi-dimensional array of the Native layer in the Python layer.
• The NativeArray class 502 is generated as an instance corresponding one-to-one to a multi-dimensional array of the Native layer.
• The NativeArray class 502 has, as a basic function of a Python object, lifetime management based on a reference count.
• The NativeArray class 502 has a function of requesting the release of the multi-dimensional array in the Native layer when its lifetime ends.
• The NativeArray class 502 holds a copy of the type information of the multi-dimensional array of the Native layer and has a function of transmitting it to other objects in the Python layer. Furthermore, the NativeArray class 502 has functions such as data copying and element-wise addition, and has a function of requesting their execution from the Native layer. In addition, the NativeArray class 502 has an interface compatible with the other multi-dimensional array libraries in the Python layer on which Chainer depends, such as Numpy and GPUArray.
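A minimal sketch of such a wrapper class is shown below. The low-level calls (native_alloc_ndarray, native_free_ndarray, native_copy, native_add) are hypothetical placeholders for the corresponding Native layer functions, and the real NativeArray class has a richer interface (Numpy/GPUArray compatibility and so on) that is not reproduced here.

class NativeArrayLike:
    """Python-layer wrapper for one Native-layer multi-dimensional array (1:1 correspondence)."""
    def __init__(self, shape, dtype="float32"):
        self.shape = tuple(shape)
        self.dtype = dtype                                    # copy of the Native-layer type information
        self._handle = native_alloc_ndarray(shape, dtype)     # hypothetical Native layer call
    def __del__(self):
        # CPython's reference counting ends the lifetime; ask the Native layer to release the array.
        native_free_ndarray(self._handle)
    def copy_from(self, other):
        native_copy(self._handle, other._handle)              # executed by the Native layer
    def add(self, other, out):
        native_add(self._handle, other._handle, out._handle)  # element-wise addition in the Native layer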
• The NativeDevice class 501 is a class that abstracts an optimized implementation or a reference implementation of the Native layer.
• The NativeDevice class 501 has a function of requesting the following processing from the Native layer in response to requests from other objects in the Python layer:
• (1) initialization and release of the device; (2) generation and copying of a multidimensional array (a NativeArray instance of the Python layer that wraps it is generated); (3) conversion between a device-independent data representation and a device-dependent data representation (conversion between floating point and fixed point can also be instructed); (4) execution of Function and Optimizer processing (the corresponding functions of the Native layer are called).
• Function class: The Function class 503 is a class that defines a forward process and a backward process as a pair.
• The Function class 503 is a class that exists in Chainer, to which functions for requesting forward processing and backward processing from the Native layer are added.
• An example of the method implementation is as follows. (Example 1) forward_native: requests forward processing from the Native layer. (Example 2) backward_native: requests backward processing from the Native layer.
• When these methods are called, (B) the Native layer function to be actually called is determined from the type of the input data (floating point or fixed point) and the Function, and is called (the Native layer function writes the processing result into the multidimensional array secured in (A)).
• (C) A NativeArray instance that wraps the multidimensional array secured in (A) above is generated. (4) The NativeArray instance generated in (C) above is returned as the return value of the Function. A sketch of a Function with these methods is given below.
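The following is a minimal sketch of what a Function subclass with such methods might look like; it is not Chainer's actual implementation. The device object and its methods (alloc_like, alloc_like_output, call_forward, call_backward) are hypothetical placeholders for the NativeDevice-style interface described above.

class LinearFunctionWithNative:
    """Sketch of a Function that can delegate forward/backward processing to the Native layer."""
    def __init__(self, device, weight):
        self.device = device      # NativeDevice-like object (hypothetical interface)
        self.weight = weight      # NativeArray-like weight array
    def forward_native(self, inputs):
        out = self.device.alloc_like_output(self, inputs)                 # secure the output array
        self.device.call_forward("linear", inputs, self.weight, out)      # choose and call the Native kernel
        return out                                                        # result wrapped as a NativeArray
    def backward_native(self, inputs, grad_outputs):
        gx = self.device.alloc_like(inputs)                               # error to propagate to the input side
        gw = self.device.alloc_like(self.weight)                          # weight gradient
        self.device.call_backward("linear", inputs, self.weight, grad_outputs, gx, gw)
        return gx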
• Optimizer class: The Optimizer class 503 is a class for updating the weights.
• The Optimizer class 503 is a class that exists in Chainer, to which functions for requesting state initialization and weight update processing from the Native layer are added.
• An example of the method implementation is as follows. (Example 1) init_state_native: requests the initialization processing of the internal state of the weight update algorithm from the Native layer. (Example 2) update_one_native: requests the weight update processing from the Native layer.
• The processing flow when these methods are called is the same as that already described above for the Function class.
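Analogously, a minimal sketch of an Optimizer-like class with these two methods, using the same hypothetical device interface as above; the actual Chainer Optimizer API is not reproduced here.

class MomentumSGDWithNative:
    """Sketch of an Optimizer-like class that delegates its work to the Native layer."""
    def __init__(self, device, lr=0.01, momentum=0.9):
        self.device = device          # NativeDevice-like object (hypothetical interface)
        self.lr = lr
        self.momentum = momentum
        self.states = {}
    def init_state_native(self, param):
        # Ask the Native layer to allocate and zero the internal state (velocity) for this weight.
        self.states[id(param)] = self.device.alloc_like(param)
    def update_one_native(self, param, grad):
        # Ask the Native layer to apply one momentum-SGD update to this weight.
        self.device.call_update("momentum_sgd", param, grad,
                                self.states[id(param)], self.lr, self.momentum)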
• The bytecode generation unit 504 is a mechanism for converting the network configuration of a neural network defined by "Define-by-Run" into bytecode (a data format that can be interpreted and executed) and outputting it.
• As the bytecode format, for example, the format described above for the "virtual machine module" can be considered.
• In addition, output to the following formats can also be considered.
• The neural network definition format of Caffe: the output can be executed with Caffe (Caffe is one of the representative frameworks for designing and executing neural networks).
• A programming language such as C or Java (registered trademark): software that executes the entire operation can be generated.
• A hardware description language such as HDL or Verilog: hardware that executes the entire operation can be synthesized.
  • a function definition example of the bytecode generation unit is as follows.
  • Function name write_network_difinition (output_node, path, format)
  • Function specifications The network configuration connected from output_node to the input side is output to the file specified by path in the format specified by format.
  • output_node can be specified in a list (can start from multiple nodes).
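A usage sketch of this function is shown below. The model and forward-computation lines and the format identifiers ("vm_bytecode", "caffe") are placeholders; the specification above fixes only the argument order (output_node, path, format).

# 'model', 'x', 't', 'loss_a', 'loss_b' are placeholders for a Chainer-style model and its inputs/outputs.
loss = model.forward(x, t)   # running forward builds the reference structure for backward processing

# Output the network traced back from 'loss' in two illustrative formats.
write_network_difinition(loss, "./net.bin", format="vm_bytecode")     # virtual machine bytecode
write_network_difinition(loss, "./net.prototxt", format="caffe")      # Caffe-style network definition

# output_node may also be a list, so output can start from multiple nodes.
write_network_difinition([loss_a, loss_b], "./multi.bin", format="vm_bytecode")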
• Example procedure for outputting bytecode from the reference structure for backward processing: As explained in Part 1 above, Chainer has a function for generating the reference structure for backward processing as the calculation of ordinary forward processing is described. Since forward processing can be computed by following the reference structure for backward processing in reverse order, if bytecode is generated from this reference structure, both forward processing and backward processing can be executed. This procedure is roughly divided into the following steps: (1) generation of element information for bytecode generation (generation of input/output data information, generation of weight information, generation of Function call information during backward processing); (2) conversion of the element information into bytecode.
• Element information generation procedure for creating bytecode: the "reference structure for backward processing" is traced from the output_node passed to write_network_difinition, and the following processing is executed.
• The information of the multidimensional array (size/number of dimensions, floating point or fixed point (Q value)) is added to the list of "input/output data information".
• If the current node is a Function, the following processing is performed.
• (I) The information of the multidimensional arrays of the weights (size/number of dimensions, floating point or fixed point (Q value), actual values of the weights) is added to the list of "weight information" in such a way that duplication is not allowed (because multiple Function instances can share the same weight).
• Element information creation procedure when multiple origin nodes are passed: (1) Create an (empty) list of "Function call information during backward processing". (2) Perform the following procedure for each origin node in output_node.
• (A) Create a list of "Function call information during backward processing" specific to that origin node.
• (B) Perform the registration procedure described in the previous paragraph on the list created in (A) above. At this time, the registration procedure is not executed for nodes already registered in the list created in (1), to avoid duplicate registration.
• (C) Concatenate the list created in (A) to the front of the list created in (1) above. A sketch of this procedure is given below.
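The following Python sketch illustrates the tracing and duplicate-avoidance described above on a simplified graph: each Variable-like node may have a creator Function, and each Function has inputs and optionally weights. The attribute names and the exact ordering/merging of the per-origin lists are simplifications, not the actual implementation.

def generate_element_info(output_nodes):
    io_info, weight_info, call_info = [], [], []   # (1) I/O data, (2) weights, (3) Function calls
    visited = set()

    def visit(var):
        if id(var) in visited:          # avoid duplicate registration across origin nodes
            return
        visited.add(id(var))
        io_info.append((var.shape, var.dtype))      # input/output data information
        func = getattr(var, "creator", None)
        if func is None:
            return
        for w in getattr(func, "weights", []):      # weight information, without duplication
            if all(w is not registered for registered, _ in weight_info):
                weight_info.append((w, (w.shape, w.dtype)))
        for x in func.inputs:                       # continue tracing toward the input side
            visit(x)
        call_info.append(func)                      # Function call information

    for node in output_nodes:                       # one or more origin nodes
        visit(node)
    return io_info, weight_info, call_info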
• Python layer: bytecode output unit (6) Conversion of element information into bytecode. The following information, created by the procedure of "element information generation for creating bytecode", is converted into the format specified by the format argument of write_network_difinition: (1) input/output data information, (2) weight information, (3) Function call information during backward processing. Examples of the format are those given above for the "bytecode generation unit".
• The write_network_difinition function described above for the "bytecode generation unit" is specified so as to write the network configuration directly to the file passed in the argument path. Instead of a path, an object for writing a plurality of network configurations to bytecode can also be passed.
• Here, the network configuration refers to the components (1), (2), and (3) described in "Python layer: bytecode output unit (6) Conversion of element information into bytecode".
• This "object for writing a plurality of network configurations to bytecode" shares the same "(2) weight information" among the "plurality of network configurations", thereby reducing the amount of weight information written to the bytecode.
• "(1) Input/output data information" and "(3) Function call information during backward processing" are generated independently by the above-described steps, even when part of their information is duplicated.
  • FIG. 6 A code example in the case of using this object is shown in FIG.
  • the reference structure for backward processing is output from nodeA, and the network structure is output.
  • the reference structure for backward processing is traced from nodeB, and the network structure is output. Output.
  • these two network configurations are output to one file (./bytecode.bin).
• Chainer has a function (the unchain_backward method) for truncating the reference structure for backward processing from a specific Variable instance toward the input layer side.
• By combining this unchain_backward method with the "output of a plurality of network configurations" described in the previous paragraph, it is possible to specify different Function call orders for the forward processing calculation and the backward processing calculation of the same network.
• By the call in #1, a network definition that executes all of the processes from A to D is output, whereas by the call in #2, a network definition that executes only the processes from B to D is output.
• For example, when the bytecode is executed by the virtual machine, forward processing can be performed using the network configuration output in #1, and backward processing can be performed using the network configuration output in #2. A sketch of this usage is given below.
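A sketch of this usage pattern in Chainer-style Python is shown below (the actual code example is the one referred to in FIG. 41). funcA..funcD, the writer object netdef_writer, and its flush method are illustrative placeholders; only write_network_difinition and unchain_backward are names taken from the description above.

# Forward pass: A -> B -> C -> D applied in order.
h1 = funcA(x)
h2 = funcB(h1)
h3 = funcC(h2)
y = funcD(h3)

# 1: output the full network A..D (e.g. used for forward processing by the virtual machine).
write_network_difinition(y, netdef_writer, format="vm_bytecode")

# Cut the backward reference structure on the input side of B.
h1.unchain_backward()

# 2: output only B..D (e.g. used for backward processing by the virtual machine).
write_network_difinition(y, netdef_writer, format="vm_bytecode")

# Both network configurations share one weight-information section in the written bytecode.
netdef_writer.flush("./bytecode.bin")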
• When Convolution2D and ReLU are executed as an integrated function,
• the processing results of Convolution2D can be used directly as the input of the ReLU processing on the cache memory or register file, before being written to the main memory.
• As a result, the frequency of data transfer is reduced, and the chance of executing the processing more efficiently (at higher speed) increases.
• If more Functions can be executed as one integrated function, such as Convolution2D → ReLU → Convolution2D → ReLU, the chance of further improving the processing efficiency increases. This is because the amount of access to the main memory can be further reduced by taking into account the size of the cache memory and the data dependencies within the combination of Functions.
• In addition, the amount of calculation can be reduced, because measures such as reducing the number of weight elements or not performing calculations on zero-valued weights can be taken.
• For example, a technique is known in which singular value decomposition is applied to the weight information (a matrix of the number of input nodes * the number of output nodes) and the elements corresponding to the smaller diagonal components are removed, thereby compressing the weight data size and reducing the amount of computation. (J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition.)
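As a concrete illustration of this singular-value-decomposition approach, the following numpy sketch keeps only the k largest singular values of a Linear layer's weight matrix and replaces the single layer by two smaller ones. This is the standard low-rank factorization described in the cited paper, not code from this specification.

import numpy as np

def compress_linear_weight(W, k):
    """W: (n_out, n_in) weight matrix. Returns (W1, W2) with W ~= W1 @ W2, rank k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # drop the smaller singular values
    W1 = U_k * s_k                                # (n_out, k)
    W2 = Vt_k                                     # (k, n_in)
    return W1, W2

# y = W @ x (n_out * n_in multiplications) becomes y = W1 @ (W2 @ x)
# (k * (n_in + n_out) multiplications), reducing both weight size and computation when k is small.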

Abstract

[Problem] To allow appropriately installing an algorithm which executes machine learning in an embedded chip which is formed from more meager computing resources than a computer which is used in the design of the algorithm. [Solution] Provided is an installation device, comprising: a first executing means which causes a first computational device which is mounted upon a semiconductor integrated circuit to execute first source code which is written in a first programming language; a second executing means which causes a second computational device which is mounted in the semiconductor integrated circuit to execute second source code which is written in a second programming language which is a lower-level language than the first programming language; and a comparison means which compares the result of the execution by the first executing means of first specific code which is included in the first source code with the result of the execution by the second executing means of second specific code (which is the first specific code rewritten in the second programming language) which is included in the second source code, and outputs the result of the comparison. The second executing means further comprises a byte code generating means which generates byte code from the first source code, said byte code storing I/O data information, weighting information, backward process function call information, etc.

Description

Mounting apparatus and mounting method
The technology described in this specification relates to a mounting apparatus and a mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit.
In recent years, machine learning using neural networks has been used in various fields.
When executing such machine learning, a developer or the like uses a predetermined programming language to create source code that defines the network structure and the like of a neural network, and by having a personal computer or the like execute the source code thus created, such a personal computer can be made to execute machine learning (Non-patent Document 1).
In recent years, a framework has been needed for appropriately mounting an algorithm for executing machine learning, designed using a computer having abundant computing resources (such as a personal computer), on an embedded chip (embedded semiconductor integrated circuit) having scarcer computing resources than such a computer.
Therefore, various embodiments of the present invention provide a mounting apparatus and a mounting method for appropriately mounting an algorithm for executing machine learning, designed using a computer having abundant computing resources, on an embedded chip (embedded semiconductor integrated circuit) having scarcer computing resources than such a computer.
An apparatus according to one aspect is a mounting apparatus for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising: first execution means capable of causing a first arithmetic unit mounted on the semiconductor integrated circuit to execute first source code described in a first programming language; second execution means capable of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from the first programming language; and comparison means for comparing a result of execution, by the first execution means, of first specific code included in the first source code with a result of execution, by the second execution means, of second specific code that is included in the second source code and is the first specific code rewritten in the second programming language, and for outputting a comparison result, wherein the second execution means includes bytecode generation means for generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and Function call information during backward processing.
According to various embodiments of the present invention, it is possible to provide a mounting apparatus that appropriately mounts an algorithm for executing machine learning, designed using a computer having abundant computing resources, on an embedded chip (embedded semiconductor integrated circuit) having scarcer computing resources than such a computer.
FIG. 1 is a schematic diagram conceptually showing the technique called "Define-and-Run" according to the prior art.
FIG. 2 is a schematic diagram conceptually showing the technique called "Define-by-Run" according to an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating an example of the network configuration of a neural network.
FIG. 4 is a schematic diagram showing another example of the network configuration of a neural network.
FIG. 5 is a schematic diagram showing still another example of the network configuration of a neural network.
FIG. 6 is a diagram illustrating pseudo code for realizing the calculation executed by Linear during forward processing.
FIG. 7 is a diagram illustrating pseudo code for realizing the calculation executed by Linear during backward processing.
FIG. 8 is a diagram illustrating pseudo code for realizing the calculation executed by ReLU during forward processing.
FIG. 9 is a diagram illustrating pseudo code for realizing the calculation executed by ReLU during backward processing.
FIG. 10 is a diagram illustrating pseudo code for realizing the calculation executed by Convolution 2D during forward processing.
FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
FIG. 12 is a block diagram schematically illustrating an example of the functions of a learning device according to an embodiment of the present invention.
FIG. 13 is a diagram illustrating an example of source code input to the learning device according to an embodiment of the present invention.
FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 13.
FIG. 15 is a diagram illustrating an example of source code described in Caffe according to the prior art.
FIG. 16 is a diagram illustrating another example of source code input to the learning device according to an embodiment of the present invention.
FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 16.
FIG. 18 is a schematic diagram conceptually showing the network configuration of a neural network generated by source code described in Caffe according to the prior art.
FIG. 19 is a diagram illustrating still another example of source code input to the learning device according to an embodiment of the present invention.
FIG. 20 is a schematic diagram for explaining Step I of the mounting method according to an embodiment of the present invention.
FIG. 21 is a schematic diagram for explaining Step II of the mounting method according to an embodiment of the present invention.
FIG. 22 is a schematic diagram illustrating a case where an execution unit using Python and an execution unit using a chip communicate with each other.
FIG. 23 is a schematic diagram for explaining Step III of the mounting method according to an embodiment of the present invention.
FIG. 24 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (first method) according to an embodiment of the present invention.
FIG. 25 is a flow diagram showing an example of a procedure used in the mounting method according to an embodiment of the present invention.
FIG. 26 is a schematic diagram showing an operation state of the embedded chip in the mounting method according to an embodiment of the present invention.
FIG. 27 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (second method) according to an embodiment of the present invention.
FIG. 28 is a schematic diagram conceptually showing the functions of a mounting apparatus according to an embodiment of the present invention.
FIG. 29 is a schematic diagram illustrating a configuration example of the Native layer execution unit included in the mounting apparatus according to an embodiment of the present invention.
FIG. 30 is a diagram illustrating a structure definition example of the multidimensional array module of the mounting apparatus according to an embodiment of the present invention.
FIG. 31 is a diagram showing the mutual compatibility and reference relationships of multidimensional array data in the multidimensional array module of the mounting apparatus according to an embodiment of the present invention.
FIG. 32 is a diagram for explaining the memory pool module of the mounting apparatus according to an embodiment of the present invention.
FIG. 33 is a diagram for explaining a structure definition example of the memory pool module of the mounting apparatus according to an embodiment of the present invention.
FIG. 34 is a diagram showing a coding example of pipelining in the mounting apparatus according to an embodiment of the present invention.
FIG. 35 is a diagram showing the internal state of the virtual machine module in the mounting apparatus according to an embodiment of the present invention.
FIG. 36 is a diagram showing an example of the execution flow of the virtual machine module in the mounting apparatus according to an embodiment of the present invention.
FIG. 37 is a diagram showing an example of the execution flow of the virtual machine module in the mounting apparatus according to an embodiment of the present invention.
FIG. 38 is a diagram showing an example of address setting in the virtual machine module of the mounting apparatus according to an embodiment of the present invention.
FIG. 39 is a diagram illustrating a specific example of the overall operation in which the Python layer and the Native layer cooperate in the mounting apparatus according to an embodiment of the present invention.
FIG. 40 is a diagram illustrating the output of a plurality of network configurations by the bytecode generation unit in the mounting apparatus according to an embodiment of the present invention.
FIG. 41 is a diagram illustrating a code example for the bytecode generation unit in the mounting apparatus according to an embodiment of the present invention.
FIG. 42 is a diagram illustrating a configuration example of the Native I/F according to an embodiment of the present invention.
FIG. 43 is a diagram illustrating a configuration example for executing identification/learning by an NN according to an embodiment of the present invention.
FIG. 44 is a diagram showing a configuration example for managing a multidimensional array according to an embodiment of the present invention.
FIG. 45 is a diagram illustrating a configuration example of a data representation conversion unit according to an embodiment of the present invention.
FIG. 46 is a diagram illustrating a configuration example of a communication unit according to an embodiment of the present invention.
FIG. 47 is a diagram illustrating a configuration example of floating-point and fixed-point execution units and a type conversion unit according to an embodiment of the present invention.
FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention.
FIG. 49 is a diagram illustrating a configuration example of an algorithm execution unit that fuses a plurality of NN algorithms according to an embodiment of the present invention.
FIG. 50 is a diagram illustrating a configuration example of a data communication amount reduction unit for multidimensional arrays according to an embodiment of the present invention.
FIG. 51 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
FIG. 52 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
FIG. 53 is a diagram illustrating a configuration example of a bytecode generation unit and a virtual machine according to an embodiment of the present invention.
FIG. 54 is a diagram illustrating a configuration example of a comparison unit according to an embodiment of the present invention.
FIG. 55 is a diagram illustrating a configuration example of a function synthesis unit according to an embodiment of the present invention.
Hereinafter, various embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, components common to the drawings are denoted by the same reference numerals.
First, Part 1 describes the information processing apparatus according to the embodiment (hereinafter described as a learning device, which is one example of the information processing apparatus), and Part 2 describes a method for mounting an algorithm implemented on the information processing apparatus according to the embodiment on an embedded chip (embedded semiconductor integrated circuit).
Part 1 (Learning device according to the embodiment)
1. Background and Outline Machine learning algorithms including deep learning can often be formulated as a minimization problem of the sum of loss functions defined for each model. The loss function is an index represented by an error between the model output and the correct answer in a given learning data sample. Here, a series of processes from inputting data into the model until obtaining an output and comparing with the correct answer is called a calculation graph, and the result is defined as a loss function. The loss function minimization problem can be solved by a general method called a gradient method as long as the gradient obtained by differentiating the loss function can be calculated.
When trying to implement this as a computer program, one method is to code both the loss function and the gradient by hand; however, computing the gradient of a complex model is generally difficult, it is often hard to obtain an explicit calculation formula, and the gradient often cannot be written directly as a program. The second method is therefore to use a calculation library such as Caffe (http://caffe.berkeleyvision.org/), Torch (http://torch.ch/), or Theano (http://deeplearning.net/software/theano/). The contents disclosed at these URLs are incorporated herein in their entirety by reference.
With these libraries, the gradient function can be obtained automatically simply by describing the loss function as a combination of prepared basic calculation elements (Primitives) in a dedicated mini-programming language. This is because the gradient of each basic calculation element is itself defined, so the gradient of the entire combination can be obtained by automatic differentiation. In other words, a neural network that can be expressed as a large-scale calculation graph, such as those used in deep learning, can also be trained by the gradient method using the automatically obtained gradient function, as long as the calculation of its loss function can be expressed explicitly in this mini-programming language.
Such calculation libraries have so far been based on a calculation procedure that the applicant calls "Define-and-Run". This is an approach in which the calculation graph is first defined (Define), the gradient is derived by automatic differentiation, and learning (Run) with the learning data then proceeds. With this approach, if the calculation graph has no complicated control syntax (if, for, and so on) and does not change over time, a series of gradient calculations can be compiled for speed and prepared as a single unit at Define time, and memory management becomes unnecessary, among other merits.
However, for calculation graphs with complicated control syntax, which have become more common with the development of deep learning research, or for models whose calculation graph changes dynamically even under meta conditions that do not depend on the data, there were problems such as the low expressiveness of the mini-programming language, the difficulty of debugging, and the deterioration of memory efficiency due to the inability to change the structure dynamically. For this reason, implementation and execution could be difficult depending on the complexity of the model and the scale of the data.
Therefore, the embodiment proposes a new calculation procedure that the applicant calls "Define-by-Run". Specifically, instead of having a fixed graph structure in advance as in "Define-and-Run", the embodiment adopts the approach of dynamically extracting and storing the graph structure in each learning run (Run), applying meta changes to it, and recalculating the gradient each time.
This eliminates the need for a mini-programming language that defines the graph in advance, which removes the design, implementation, and maintenance costs for developers, and the learning costs and debugging difficulties for users. In addition, because the control syntax of general-purpose programming languages (C, Java (registered trademark), Python) can be used freely, neural networks with more complicated graph structures can be implemented easily. Furthermore, by enabling meta change operations on the graph under certain kinds of conditioning, improved memory efficiency and flexible learning and application of models can be realized.
The conceptual difference between the technique called "Define-and-Run" according to the prior art described above and the technique called "Define-by-Run" according to the embodiment is also clear from a comparison of FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram conceptually showing the technique called "Define-and-Run" according to the prior art, and FIG. 2 is a schematic diagram conceptually showing the technique called "Define-by-Run" according to an embodiment of the present invention. In the Define-and-Run configuration shown in FIG. 1, the mini-programming language first takes only the model definition as input and outputs the calculation procedures of the forward (identification) processing and the backward (learning) processing, which are the substance of the calculation graph (Define step). In the next step, the Forward/Backward processing system inputs data and updates the parameters (weights) according to these calculation procedures (Run step).
In contrast, in the Define-by-Run configuration shown in FIG. 2, the processing system of a general-purpose programming language executes the forward (identification) processing while taking the model definition, input data, and parameters as input, and at the same time generates the calculation procedure of the backward (learning) processing. Here, the model definition is written directly in accordance with the grammar of the general-purpose programming language, including function calls, arithmetic operations, loops, and branches. The calculation procedure of the backward (learning) processing can also be changed dynamically, independently of the execution of the forward (identification) processing. The Backward processing system can be called at an arbitrary timing, and updates the parameters from the input data and the results of the Forward processing according to the Backward calculation procedure.
2. Background Art Related to Neural Networks 2-1. Basic Processing Flow of a Neural Network The processing performed in a neural network mainly includes forward processing, backward processing, and updating of the weights.
The forward process is a process for processing and propagating information from the input layer to the output layer of the neural network.
Backward processing refers to performing two kinds of processing, error back propagation and weight gradient calculation, from the output layer toward the input layer of the neural network. Error back propagation is processing that propagates the error (δ) obtained from the layer on the output side to the layer on the input side. Weight gradient calculation is processing that, for a layer having weights, obtains the weight gradient (∂W) from the error (δ) obtained from the layer on the output side and the output values of the layer on the input side.
Updating the weights is processing that, for a layer having weights, updates the weights with an algorithm derived from stochastic gradient descent (SGD) using the weight gradient (∂W) obtained by the weight gradient calculation. This weight update is executed once for each unit of batch processing.
2-2. Calculation Modules Frequently Appearing in Examples of Neural Networks Each layer constituting a neural network is realized by, for example, a layer algorithm listed below.
-Linear
-ReLu
-Dropout
-Softmax Cross Entropy
-Convolution 2D
-Pooling (Average Pooling, Max Pooling, etc.) etc.
Typical examples of the weight update algorithm include the following.
-Momentum-SGD
-Adam, etc.
2-3. Network configuration example of neural network (1)
FIG. 3 is a schematic diagram illustrating an example of a network configuration of a neural network. As an example, FIG. 3 illustrates a neural network in which six intermediate layers (Linear, ReLU, Linear, ReLU, Dropout, and Linear) are arranged between an input layer and an output layer (Softmax). On the page, a rightward arrow indicates forward processing, and a leftward arrow indicates backward processing.
Since the input layer has no weights to be updated, backward processing is performed up to the intermediate layer that is closest to the input layer and has weights (in the example shown in FIG. 3, the Linear layer arranged adjacent to the input layer).
2-4. Network configuration example of neural network (2)
FIG. 4 is a schematic diagram showing another example of the network configuration of a neural network. FIG. 4 illustrates, as an example, a neural network in which a plurality (three) of series-connected groups of intermediate layers (Convolution 2D, ReLU, Convolution 2D, ReLU) are arranged in parallel between the input layer and an intermediate layer (Linear) arranged adjacent to the output layer (Softmax). On the page, an upward arrow indicates forward processing, and a downward arrow indicates backward processing.
2-5. Network configuration example of neural network (3)
FIG. 5 is a schematic diagram showing still another example of the network configuration of a neural network. FIG. 5 illustrates, as an example, a neural network having a loop (sometimes referred to as a "Recurrent Neural Network"). In the figure, the flow of data in forward processing is indicated by arrows. The intermediate layer (here, Linear) executes a calculation that takes as its input the sum of the previous output value of this intermediate layer and the current output value of the input layer. As a method for realizing backward processing in such a neural network, a method (BPTT) is known in which the network is unrolled in the time-axis direction in advance and converted into a network without loops.
2-6. Layer algorithm calculation (Linear)
Linear, which is one of the layer algorithms, executes a calculation that repeats the operation of taking the weighted average of all the nodes in the input side layer by the number of nodes in the intermediate layer.
FIG. 6 is a diagram showing pseudo code for realizing the calculation executed by Linear during forward processing, and FIG. 7 is a diagram showing pseudo code for realizing the calculation executed by Linear during backward processing.
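The pseudo code itself is the content of FIGs. 6 and 7 and is not reproduced here; as a compact restatement of the same computations, a numpy version (single sample, no bias term, which the figures may handle differently) might look as follows.

import numpy as np

def linear_forward(x, W):
    # x: (n_in,) input-side layer values, W: (n_out, n_in) weights.
    # Each output node is the weighted sum of all input-side nodes.
    return W @ x

def linear_backward(x, W, gy):
    # gy: error (delta) coming from the output side, shape (n_out,).
    gW = np.outer(gy, x)   # weight gradient
    gx = W.T @ gy          # error propagated back to the input-side layer
    return gx, gW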
2-7. Layer algorithm calculation (ReLU)
ReLU, which is one of layer algorithms, calculates Max (0, val) for each node in the input side layer. This algorithm is the most used technique in recent years in the processing (activation function) for adding nonlinearity to the calculation of the neural network.
FIG. 8 is a diagram showing pseudo code for realizing the calculation executed by ReLU during forward processing, and FIG. 9 is a diagram showing pseudo code for realizing the calculation executed by ReLU during backward processing.
2-8. Layer algorithm calculation details (Dropout)
Dropout, which is one of the layer algorithms, randomly selects a fixed ratio of nodes and executes a calculation that inactivates output and back propagation. This algorithm is unnecessary when only identification is performed (that is, when learning is not performed).
2-9. Layer algorithm calculation (Softmax Cross Entropy)
Softmax Cross Entropy, one of the layer algorithms, corrects the value of the input side layer using the following formula.
Figure JPOXMLDOC01-appb-M000001
This layer algorithm is generally used in the output layer.
Also, this layer algorithm calculates an error from the difference between the correct answer label (1 or 0) and the output value during backward processing.
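The formula image referenced above is not reproduced here. The standard formulation of the softmax correction, the cross-entropy loss, and the backward error, which is consistent with the description that the error is the difference between the correct label (1 or 0) and the output value, is (with $x_i$ the input-side values, $y_i$ the corrected outputs, and $t_i$ the correct label):

y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \qquad L = -\sum_i t_i \log y_i, \qquad \frac{\partial L}{\partial x_i} = y_i - t_i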
2-10. Calculation contents of layer algorithm (Convolution 2D)
Convolution 2D, which is one of the layer algorithms, convolves an image having a data structure of Channel * Width * Height. Both the input layer and the output of the layer have a data structure of Channel * Width * Height. With this algorithm, the image size can be reduced by stride processing. In this algorithm, padding is inserted into an image on the input side layer. This algorithm has the same calculation structure as the Linear (repeating the inner product calculation of the input channel for the number of output channels) with respect to the channel direction.
FIG. 10 is a diagram illustrating a pseudo code for realizing a calculation executed during forward processing by Convolution 2D.
Convolution 2D performs weight gradient calculation and error backpropagation in the same way as Linear during backward processing. The scale of each processing loop is the same as that during forward processing.
2-11. Layer algorithm calculation (Max Pooling)
Max Pooling, which is one of the layer algorithms, reduces the image vertically and horizontally by taking the maximum value of the image on the input side layer. Note that the filter size taking the maximum value and the stride width for image reduction may be different. There is no change in the number of channels.
2-12. Layer algorithm calculation (Average Pooling)
Average Pooling, which is one of the layer algorithms, reduces the image in the vertical and horizontal directions by taking the average value of the image of the input side layer. Note that the filter size over which the average value is taken may differ from the stride width for image reduction. The number of channels does not change.
2-13. Weight Update Algorithm There are various algorithms derived from the stochastic gradient descent method (SGD) as the weight update algorithm. In these algorithms, the calculation is independent for each weight element.
The formula of momentum-SGD mentioned above is as follows.
Figure JPOXMLDOC01-appb-M000002
Also, Adam's formula mentioned above is as follows.
Figure JPOXMLDOC01-appb-M000003
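The two formula images referenced above are not reproduced here. The standard per-element forms of these two update rules, which the patent's equations are expected to follow (symbols and hyperparameter names may differ from the figures; ∂W denotes the weight gradient), are:

momentum-SGD:
\Delta w \leftarrow \mu \, \Delta w - \eta \, \partial W, \qquad w \leftarrow w + \Delta w

Adam:
m \leftarrow \beta_1 m + (1 - \beta_1)\, \partial W, \qquad v \leftarrow \beta_2 v + (1 - \beta_2)\, (\partial W)^2
\hat{m} = \frac{m}{1 - \beta_1^{t}}, \qquad \hat{v} = \frac{v}{1 - \beta_2^{t}}, \qquad w \leftarrow w - \alpha \, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}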
3. Hardware Configuration of the Learning Device According to the Embodiment Next, the hardware configuration of the learning device according to the embodiment of the present invention will be described. FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
As shown in FIG. 11, the learning device 10 includes a CPU 11, a main memory 12, an input I/F 13, an output I/F 14, a communication I/F 15, an external memory 16, and a user I/F 17, and these components are electrically connected to each other via an internal bus 18. The learning device 10 can optionally also include a GPU (not shown).
The CPU 11 loads the operating system and various programs, such as a program supporting a programming language (for example, Python) used for creating source code, from the external memory 16 into the main memory 12, and executes the instructions included in the loaded programs. The main memory 12 is used to store the programs executed by the CPU 11 and is constituted by, for example, a DRAM.
The input I/F 13 has a function of capturing the output data of a measuring device (not shown) and is connected to the other components via the internal bus 18. The various measurement data output by the measuring device include information acquired by sensors and the like, for example temperature, humidity, position information, and image data, and may also be time-series data such as moving image data or a sequence of temperatures acquired at fixed intervals. The output I/F 14 receives data from the other components through the internal bus 18 and outputs it to an output device (not shown) outside the learning device. The data output to the output device is assumed to be, for example, control information for driving a motor, or control information for an information output device such as a buzzer, a control switch, an automobile accelerator or brake, or a liquid crystal display.
The communication I/F 15 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination thereof, and is configured to be able to communicate various information with a server device (not shown) via the communication network 20.
The external memory 16 is composed of, for example, a magnetic disk drive or a flash memory, and stores the operating system and various programs, such as a program supporting a programming language (for example, Python) used for creating source code.
In the learning device 10 according to one embodiment having the above configuration, the CPU 11 (and optionally also the GPU) executes a predetermined program loaded from the external memory 16 into the main memory 12, whereby the learning device 10 can function as a learning device that performs machine learning. For example, the learning device 10 that performs machine learning can be realized as a learning device modeled by a neural network, by the CPU 11 (and optionally also the GPU) executing various programs.
The learning device 10 having the above configuration can be mounted on a corresponding individual (device). Further, the learning device 10 can be connected to a corresponding measurement device and a corresponding output device. These measuring devices and output devices may be mounted on the corresponding individual (device) or may be connected as separate devices using communication means.
In one embodiment, the learning device 10 is an arbitrary information processing device capable of executing machine learning, including, but not limited to, a personal computer, a tablet, a mobile phone, a smartphone, a mobile information terminal, a touch pad, and an information processing server.
4. Functional Blocks of the Learning Device According to the Embodiment Next, the functions of the learning device 10 having the above configuration will be briefly described. FIG. 12 is a block diagram schematically illustrating an example of the functions of the learning device according to an embodiment of the present invention.
The learning device 10 according to the embodiment is based on the technique called "Define-by-Run" described above. Specifically, the learning device 10 according to the embodiment has a mechanism in which, at the timing when the forward processing of the neural network is executed in a general procedural language including branches, loops, and function calls, the network configuration information necessary for backward processing and weight update processing is generated dynamically, so that the backward processing and the weight update processing can actually be executed.
In order to realize such "Define-by-Run", as illustrated in FIG. 12, the learning device 10 according to one embodiment mainly includes an acquisition unit 110, a storage unit 120, and an execution unit 130. The acquisition unit 110 acquires source code including code that defines the forward processing of each layer constituting the neural network. Specifically, such source code is created by a developer, a user, or the like with a text editor using a predetermined programming language (for example, Python), and the acquisition unit 110 acquires the source code thus created.
For example, the acquisition unit 110 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, the user I / F 17, and the like illustrated in FIG.
The storage unit 120 stores a correspondence relationship between each of a plurality of forward processes that can be defined in the source code and a backward process corresponding to the forward process. In the correspondence relationship stored in the storage unit 120, a corresponding backward process is associated with a certain forward process included in a plurality of forward processes in a one-to-one relationship. In other words, in the correspondence relationship stored in the storage unit 120, for example, for a layer called “Linear” (intermediate layer), a forward process corresponding to Linear and a backward process corresponding to this forward process are associated with each other. . (As described above, the one-to-one correspondence between the forward process and the backward process is a process corresponding to the forward process when the backward process is executed using the reference structure for the backward process. For example, when the forward process is executed in the order of A → B → C, the backward process is executed in the order of C → B → A. On the other hand, since both forward processing and backward processing are implemented in pairs, such backward processing can be realized.)
The storage unit 120 can store various information including the source code acquired by the acquisition unit 110 and various libraries used in a programming language corresponding to the source code.
For example, the storage unit 120 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG. 11.
 The execution unit 130 sequentially executes each piece of code included in the source code acquired by the acquisition unit 110 (and stored in the storage unit 120). At the time each piece of code is executed, the execution unit 130 can calculate the output value of the forward process defined by that code on the basis of the input value. In addition, at the time each piece of code is executed, the execution unit 130 can generate the reference structure between objects in the layer corresponding to that code.
 The execution unit 130 can be realized, for example, by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG. 11.
 Furthermore, in order to realize the above-described "Define-by-Run" technique, the learning device 10 according to an embodiment uses three classes, namely Function, Variable, and Optimizer, by means of the acquisition unit 110, the storage unit 120, and the execution unit 130 described above. Note that the names of these classes are given for convenience and are not limiting.
 First, the Function class is a class in which a forward process and a backward process are defined as a pair. The specific layer algorithms exemplified in "2-6" to "2-12" above are defined as subclasses of this Function class.
 Next, the Variable class is a class that manages the data input to and output from Functions. The Variable class has the role of hiding the difference between the GPU and the CPU, and it also has a method (unchain_backward, described later) for truncating, within a finite range, the backward processing of a network that contains a loop.
 Finally, the Optimizer class is a class that updates weights.
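 As a minimal illustration of the roles of these three kinds of classes (the class and method names below are simplified assumptions for illustration and do not reproduce the actual implementation of the learning device 10), a Function subclass defines a forward process and its backward process as a pair, and an Optimizer-like class updates weights from their gradients:
import numpy as np

class Function:
    # A forward process and the corresponding backward process are defined as a pair.
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, gy):
        raise NotImplementedError

class ReLU(Function):
    # Example of a layer algorithm defined as a Function subclass.
    def forward(self, x):
        return np.maximum(x, 0.0)
    def backward(self, x, gy):
        return gy * (x > 0.0)

class SGD:
    # Example of an Optimizer-like class: its role is to update a weight using its gradient.
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, w, gw):
        w -= self.lr * gw
 Simple stochastic gradient descent is used here only for brevity; the operation example described below uses an Optimizer subclass implementing Adam.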
 5. Operation Example 1
 Next, a specific example of the operations performed by the learning device 10 according to the embodiment having the above configuration will be described. FIG. 13 is a diagram illustrating an example of source code input to the learning device according to an embodiment of the present invention. Note that the source code illustrated in FIG. 13 has been intentionally simplified for the purpose of explaining the features of the learning device according to the present embodiment. The line numbers shown at the left edge of FIG. 13 are given only in order to explain this specific example and are not included in the actual source code.
 In the following, the present embodiment is described for the case where the source code is written in Python as an example, but the source code may be written in a programming language other than Python. Details of Python are disclosed at https://www.python.org/. The content disclosed at this URL is incorporated herein by reference in its entirety.
 First, a developer or the like creates the source code illustrated in FIG. 13 using a text editor or the like. The acquisition unit 110 (see FIG. 12) of the learning device 10 acquires the source code created in this way and stores it in the storage unit 120. Next, the execution unit 130 executes the code included in the source code stored in the storage unit 120 line by line. When the source code contains no control syntax such as if statements or for statements, as illustrated in FIG. 13, the execution unit 130 executes the code sequentially, one line at a time, from the first line to the last line, from top to bottom. Conversely, when the source code contains control syntax, the execution unit 130 executes the code in the order determined by that control syntax.
 The contents of the source code illustrated in FIG. 13 will now be described.
 Lines 1 to 3 describe the registration, by FunctionSet, of Functions that include parameters. Specifically, Functions that include weights (in this example, the instances l1, l2, and l3 of the Linear class, a Function subclass that defines a layer algorithm performing an inner product) are registered in an object of the FunctionSet class. The weights of Functions that include weights can be updated by the Optimizer. FunctionSet is a mechanism for improving the readability of the code by grouping together the Functions to be updated by the Optimizer.
 Lines 4 and 5 describe the initialization of the Optimizer. Line 4 creates an instance of an Optimizer subclass (a class for updating weights) that implements the algorithm called Adam. The processing of Adam executes, for each element of the weights, the update given by the mathematical expressions described in "2-13" above. Line 5 passes the list of Functions including the weights defined in lines 1 to 3 to the setup method of the Optimizer subclass instance created in line 4. Executing this setup method initializes the internal state of the Optimizer subclass used to update the weights included in the Function list passed to this method.
 Line 6 describes the loading of the input data. That is, line 6 illustrates the process of reading the input data x and t from a file or the like. In this example, x holds data with a large amount of information, such as images or sounds, and t holds the label ID corresponding to x (data with a small amount of information, used for answer checking).
 Line 7 describes holding the input data in Variable objects. That is, line 7 creates objects of the Variable class that hold the input data. The "Define-by-Run" functionality is realized by the mutual dependence of Variable objects and Function objects; since arbitrary input data does not itself have the mechanism for realizing the "Define-by-Run" functionality, a procedure for explicitly holding the input data in instances of the Variable class is required.
 Lines 8 to 11 describe the execution of the forward processing. Specifically, in lines 8 to 11, the forward processing is executed by means of descriptions in a general programming language. Through the "Define-by-Run" functionality, the reference structure for backward processing is generated at the same time as this definition is executed. Because instances of the Function class and instances of the Variable class refer to each other, the correspondence between arbitrary processing and data can be expressed; this is self-evident, since the Variable class represents data and the Function class represents processing. A data structure that expresses the backward calculation procedure shown in FIG. 2 by means of this reference relationship is defined as the reference structure for backward processing. The reference structure for backward processing grows each time a basic calculation on Variable objects (the four arithmetic operations or exponentiation) is performed and each time a Function taking Variable objects as arguments or return values is called. Therefore, a reference structure for backward processing can be generated even for a forward processing description that includes branches, loops, or function calls other than those to Function and Variable. A Function subclass is also associated with each basic calculation on Variable objects.
 Line 12 describes the execution of the backward processing. Specifically, line 12 executes the backward processing by calling the backward method of the loss variable (an instance of the Variable class) obtained as the execution result of the forward processing executed in lines 8 to 11. The backward processing is executed automatically, in the reverse order of the forward processing, by following the reference structure for backward processing generated when the forward processing was executed.
 Line 13 describes the weight update. Specifically, the weight gradients are calculated as a result of executing the backward processing in line 12. When the update method of the Optimizer subclass instance is called as in line 13, the weights are updated using these weight gradients. Since the call to the update method of the Optimizer subclass and the call to the backward method of the Variable class are separate functions, it is also possible to execute the weight update after executing the backward processing only partially. This is useful when it is not desired to update the weights of Functions that have already been trained.
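 FIG. 13 itself is not reproduced in this text. For illustration only, the following Python sketch shows code of the kind described in lines 1 to 13 above, written under the assumption of Chainer 1.x-style names (FunctionSet, F.Linear, F.relu, F.softmax_cross_entropy, optimizers.Adam); the layer sizes and the dummy input data are assumptions of the sketch and are not part of the original figure.
import numpy as np
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

model = FunctionSet(l1=F.Linear(784, 100),                 # lines 1-3: register Functions that include weights
                    l2=F.Linear(100, 100),
                    l3=F.Linear(100, 10))
optimizer = optimizers.Adam()                              # line 4: Optimizer subclass implementing Adam
optimizer.setup(model.collect_parameters())                # line 5: initialize the Optimizer for these weights
x_data = np.random.rand(64, 784).astype(np.float32)        # line 6: input data (dummy values here)
t_data = np.random.randint(0, 10, size=64).astype(np.int32)
x, t = Variable(x_data), Variable(t_data)                  # line 7: hold the input data in Variable objects
h1 = F.relu(model.l1(x))                                   # line 8: forward processing; the reference structure grows
h2 = F.relu(model.l2(h1))                                  # line 9
y = model.l3(h2)                                           # line 10
loss = F.softmax_cross_entropy(y, t)                       # line 11: difference from the true label t
loss.backward()                                            # line 12: backward processing along the reference structure
optimizer.update()                                         # line 13: weight update using the calculated gradients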
 Here, among the contents processed during the forward processing, attention is given in particular to the contents processed by the code described in line 8.
 Line 8 reads: h1 = F.relu(model.l1(x)).
 When "model.l1(x)" is executed, the following reference structure for backward processing is generated.
 x ← splitter ← x' ← l1' ← y
 In the above reference structure, x' represents a Variable object that is a copy of x, l1' represents a copy (shallow copy) of l1, y represents the value (a Variable object) returned by the forward method of l1', and splitter represents an instance of a class that manages branches in the network.
 The arrows indicate the direction of the object references. For example, the notation A ← B means that a member of object B contains a reference to object A.
 A shallow copy is a way of copying an object that does not copy the data the object refers to internally. Using a shallow copy makes it possible, for example, to avoid duplicating the weight data held by a Function instance.
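 The following is a general Python illustration of the shallow copy just mentioned (it is independent of the learning device itself): a shallow copy creates a new object but does not duplicate the data, such as weight data, that the object refers to internally.
import copy
import numpy as np

class Linear:
    def __init__(self, w):
        self.w = w                      # weight data held by reference

l1 = Linear(np.zeros((3, 3)))
l1_dash = copy.copy(l1)                 # shallow copy of l1
print(l1_dash is l1)                    # False: l1' is a separate instance
print(l1_dash.w is l1.w)                # True: the weight data itself is not duplicated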
 After "F.relu(" is executed, the above structure becomes the following reference structure for backward processing.
 x ← splitter ← x' ← l1' ← y ← splitter ← y' ← relu' ← h1
 The case where a reference structure for backward processing is generated when the code described in line 8 is executed has been described here, but it goes without saying that reference structures are likewise generated when the code described in lines 9 and 10 is executed.
 As described above, when the forward processing is executed, the reference structure for backward processing is generated by natural-looking function calls.
 At this point, the backward processing can be executed starting from h1. The flow of the processing executed by the backward processing system when the backward processing is actually executed from h1 is as follows.
 The relu' referenced by the h1 instance is followed, and the backward process of relu' is called. The input at this time is the error value held by h1, and the output result is stored in the y' instance. The correspondence of the data input to and output from such Function instances is defined in the forward process and backward process defined for each Function subclass. Next, the splitter is reached from relu' via y', and the splitter copies the error value held by y' into y (the reason why the splitter is inserted is described in the next section). Next, l1' is followed from y, and the backward process of l1' is executed. The input at this time is the error value held by y, and the output result is stored in the x' instance. The weight error is also calculated; it is calculated from the output value at forward time stored in x' and the error value held by y. Continuing in the same way until x, the end point of the reference structure for backward processing, is reached, the backward processing ends.
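 The flow described above can be illustrated with the following simplified, single-input Python sketch (the class structure and names are assumptions for illustration; the actual learning device also involves splitters, weights, and multi-input Functions): the backward processing follows the references recorded during the forward processing, in the reverse order.
import numpy as np

class Variable:
    # Holds data, the error value (grad), and a reference to the Function that created it.
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.creator = None
    def backward(self):
        func = self.creator
        grad = self.grad if self.grad is not None else np.ones_like(self.data)
        while func is not None:              # follow the reference structure toward the input side
            grad = func.backward(grad)       # call the backward process paired with the forward process
            func.input.grad = grad
            func = func.input.creator

class ReLU:
    def __call__(self, x):
        self.input = x                       # the Function refers to its input Variable
        y = Variable(np.maximum(x.data, 0.0))
        y.creator = self                     # the output Variable refers back to the Function
        return y
    def backward(self, gy):
        return gy * (self.input.data > 0.0)

x = Variable(np.array([-1.0, 2.0, 3.0]))
h1 = ReLU()(x)       # forward processing builds the chain x <- ReLU <- h1 as a side effect
h1.backward()        # traverses the chain in the reverse order of the forward processing
print(x.grad)        # [0. 1. 1.]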
 The reason why the splitter is inserted into the reference structure is explained here for completeness.
 If "model.l1(x)" is called once more immediately after the above reference structure has been created, the following reference structure is generated.
 x ← splitter ← x' ← l1' ← y
 x ← splitter ← x'' ← l1'' ← z (the splitter is the same instance in both lines)
 In the above reference structure, l1'' represents a copy (shallow copy) of l1 (an instance separate from l1'), x'' represents a Variable object that is a copy of x (an instance separate from x'), and z represents the value (a Variable object) returned by the forward method of l1''.
 When error values are propagated from the output-side layers during the backward processing, the splitter instance adds together the error values transmitted to x' and to x'' and sets the result as the error of x. By inserting the splitter in this way, errors can be propagated during the backward processing from all of the Functions that used x as an input during the forward processing.
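 As a general illustration of this error-adding role (simplified code with assumed names; not the actual implementation), a branch-managing object passes copies of the input forward and, during the backward processing, returns the sum of the errors arriving from the branches:
import numpy as np

class Splitter:
    # Manages a branch point of the network: copies the input for each branch during forward
    # processing, and adds the branch errors together during backward processing.
    def forward(self, x, n_branches):
        return [x.copy() for _ in range(n_branches)]
    def backward(self, branch_grads):
        return sum(branch_grads)          # error of x = sum of the errors of x', x'', ...

splitter = Splitter()
x = np.array([1.0, 2.0])
x_dash, x_ddash = splitter.forward(x, 2)                 # x is used as input by two Functions
grad_x = splitter.backward([np.array([0.1, 0.2]),        # error propagated to x'
                            np.array([0.3, 0.4])])       # error propagated to x''
print(grad_x)                                            # [0.4 0.6]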
 Next, the network configuration of the neural network generated when the source code illustrated in FIG. 13 is executed is described as a supplement. FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 13. In FIG. 14, blocks drawn with dotted lines represent instances of variables, and blocks drawn with solid lines represent functions.
 First, at the time line 7 is executed, an instance 30 of the variable x and an instance of the variable t are generated. For convenience of explanation, FIG. 14 shows only the instance 30 of the variable x, but in practice an instance of the variable t is generated in the same way. At the time line 7 is executed, the instance of the variable x actually holds data such as images or sounds.
 Next, at the time line 8 is executed by the execution unit 130, a neural network is generated in which the function "l1" 31, the function "relu" 32, and the instance 33 of the variable h1 have grown in this order after the instance 30 of the variable x. Note that, at the time line 8 is executed, the execution result of the forward processing described in line 8 is already held by the instance 33 of the variable h1. Also, at the time line 8 is executed, the reference structure for backward processing up to this point has been generated, as described above.
 Next, at the time line 9 is executed by the execution unit 130, a neural network is generated in which the function "l2" 34, the function "relu" 35, and the instance 36 of the variable h2 have grown in this order after the instance 33 of the variable h1. Note that, at the time line 9 is executed, the execution result of the forward processing described in line 9 is already held by the instance 36 of the variable h2. Also, at the time line 9 is executed, the reference structure for backward processing up to this point has been generated, as described above.
 Similarly, at the time line 10 is executed by the execution unit 130, a neural network is generated in which the function "l3" 37 and the instance 38 of the variable y have grown in this order after the instance 36 of the variable h2. Note that, at the time line 10 is executed, the execution result of the forward processing described in line 10 is already held by the instance 38 of the variable y. Also, at the time line 10 is executed, the reference structure for backward processing up to this point has been generated, as described above.
 Finally, at the time line 11 is executed by the execution unit 130, a neural network is generated in which the function "Softmax" 39 and the instance 40 of the variable loss have grown in this order after the instance 38 of the variable y. Note that, at the time line 11 is executed, the execution result of the forward processing described in line 11 is already held by the instance 40 of the variable loss. Also, at the time line 11 is executed, the reference structure for backward processing up to this point has been generated, as described above. At the time line 11 is executed, the forward processing described in the source code is complete. That is, at the time line 11 is executed, the difference between the identification result produced by the finally obtained neural network and the true identification result given by the variable t is held in the instance 40 of the variable loss. The backward processing of the next step is executed with this difference as its input.
 After the forward processing described in lines 8 to 11 is completed, the execution unit 130 next executes line 12, whereby the backward processing is executed. Since the reference structure for backward processing has already been generated, the execution unit 130 can calculate the weight gradient of each intermediate layer included in the neural network (but only of the intermediate layers that have weights) by executing the backward processing on the basis of this reference structure.
 Next, the execution unit 130 executes line 13. As a result, the weight of each intermediate layer (but only of the intermediate layers that have weights) is updated using the weight gradients calculated by executing line 12. That is, learning is executed.
 As described above, with the learning device according to the present embodiment, a developer or the like can construct a neural network, as far as the forward processing is concerned, in a style that describes, line by line, which instance of which variable is given to which function and in which instance of which variable the obtained execution result is held. This allows a developer or the like to describe the forward processing intuitively and easily in the source code. In addition, by describing the forward processing in the source code (without having to be aware of the backward processing) and having the learning device according to the present embodiment execute that source code, a developer or the like can have the learning device automatically execute the backward processing.
 6. Comparative Example 1
 Next, in order to show the advantages of the learning device according to the present embodiment, a case in which processing equivalent to that executed by the source code illustrated in FIG. 13 is described with Caffe according to the prior art will be described. FIG. 15 is a diagram illustrating an example of source code written with Caffe according to the prior art.
 As shown in FIG. 15, the definition of a layer (corresponding to a Function in the present embodiment) is described in the block enclosed in {} immediately after the term "layer". In this prior-art technique, the dependencies between layers must be stated explicitly in the code. For example, the descriptions "top" and "bottom" express the dependencies between layers: "bottom" indicates from which layer the input to a layer is obtained, and "top" indicates to which layer the processing result of the layer is output.
 In this prior-art technique, the network configuration must be defined statically before the learning and identification processing performed by the neural network. That is, the network configuration of the neural network must first be defined, and only then can learning and identification with that neural network be executed. It is therefore difficult to dynamically change the network configuration according to the nature of the data.
 In contrast, in the learning device according to the present embodiment, as described above with reference to FIG. 14, at the time each piece of code defining the configuration of the neural network is executed, the forward processing corresponding to that code is executed. That is, the definition of the configuration of the neural network and the execution of the forward processing based on that configuration are performed at the same timing. This makes it easy to dynamically change the network configuration according to the nature of the data. For example, a branch may be added to the code of FIG. 13 so that the layer used for the forward processing is switched according to the value of the variable t or the data size of the variable x. As another example, in the code of FIG. 19, a variable value may be given as input data in place of the constant "10" in line 9.
 Furthermore, with the prior-art technique, when creating source code, a developer or the like must describe the definition of the network configuration of the neural network so that both the forward processing and the backward processing can be executed appropriately. In contrast, with the learning device according to the present embodiment, as described above with reference to FIG. 14, a developer or the like simply describes the forward processing (the network configuration) without having to consider whether the backward processing can be executed appropriately, and then has the learning device execute the source code, whereupon the learning device automatically executes the backward processing. A developer or the like can therefore construct a neural network and have it perform identification and learning simply and efficiently.
 Moreover, with the prior-art technique, when creating source code, a developer or the like follows the procedure of first defining the neural network so that both the forward processing and the backward processing can be executed appropriately, and then substituting data (input data, training data, and the like) into the neural network thus defined. It is therefore difficult to describe the source code intuitively.
 In contrast, with the learning device according to the present embodiment, at the point of writing the forward processing (the network configuration) line by line, a developer or the like writes the source code in a style that describes, line by line, which instance of which variable is given to which function and in which instance of which variable the obtained execution result is held. This allows a developer or the like to write the source code intuitively.
 7. Operation Example 2
 Next, another specific example of the operations performed by the learning device 10 according to the embodiment having the above configuration will be described. FIG. 16 is a diagram illustrating another example of source code input to the learning device according to an embodiment of the present invention. Note that the source code illustrated in FIG. 16 has been intentionally simplified for the purpose of explaining the features of the learning device according to the present embodiment. The line numbers shown at the left edge of FIG. 16 are given only in order to explain this specific example and are not included in the actual source code.
 With reference to the source code shown in FIG. 16, it will be explained that, with the learning device according to the present embodiment, a neural network can also be constructed easily by using control syntax (here, a for statement).
 Lines 1 to 3 are the same as lines 1 to 5 of the source code shown in FIG. 13, and a detailed description is therefore omitted.
 Line 4 describes looping over the processing described in lines 5 to 10 while the value of i runs from 0 to 1000.
 Lines 5 and 6 are the same as lines 6 and 7 of the source code shown in FIG. 13, and a detailed description is therefore omitted.
 Line 7 describes adding y, the processing result of the function l1 and the function relu, back into the argument of l1.
 Lines 8 to 10 are the same as lines 11 to 13 of the source code shown in FIG. 13, and a detailed description is therefore omitted.
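 FIG. 16 itself is not reproduced in this text. For illustration only, the following Python sketch shows code of the kind described above, again assuming Chainer 1.x-style names; the layer size, the dummy input data, and the initialization of y before the loop are assumptions added so that the sketch is self-consistent.
import numpy as np
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

model = FunctionSet(l1=F.Linear(20, 20))                      # lines 1-3: Function registration and
optimizer = optimizers.Adam()                                 #            Optimizer initialization
optimizer.setup(model.collect_parameters())
y = Variable(np.zeros((1, 20), dtype=np.float32))             # assumed initial value of y
for i in range(0, 1000):                                      # line 4: ordinary for statement
    x = Variable(np.random.rand(1, 20).astype(np.float32))    # lines 5-6: input data held in Variables
    t = Variable(np.array([0], dtype=np.int32))
    y = F.relu(model.l1(x + y))                               # line 7: the previous result y is fed back into l1
    loss = F.softmax_cross_entropy(y, t)                      # line 8
    loss.backward()                                           # line 9: backward processing
    optimizer.update()                                        # line 10: weight update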
 FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 16. In FIG. 17, blocks drawn with dotted lines represent instances of variables, and blocks drawn with solid lines represent functions. For convenience of explanation, FIG. 17 shows only the configuration of the neural network generated for the cases where the variable i is 0 to 2.
 As is clear from FIG. 16 and FIG. 17, with the learning device according to the present embodiment, even a neural network that repeats the same configuration containing variable instances and functions multiple times (here, a configuration in which the function 52 that adds the instance 51 of the variable x and the instance 50 of the variable y is followed in turn by the function "l1" 53 and the function "relu" 54, with the output value of the function "relu" 54 held in an instance of the variable y) can be constructed easily by using simple control syntax (here, a for statement). In other words, the source code used in the learning device according to the present embodiment has a high affinity with the control syntax of the programming language.
 8. Comparative Example 2
 Next, in order to show the advantages of the learning device according to the present embodiment, a case in which processing equivalent to that executed by the source code illustrated in FIG. 16 is described with Caffe according to the prior art will be described. FIG. 18 is a schematic diagram conceptually showing the network configuration of a neural network generated by source code written with Caffe according to the prior art.
 When attempting to construct a neural network similar to the ones illustrated in FIG. 16 and FIG. 17 with Caffe according to the prior art, the configuration of the neural network cannot be defined using control syntax, so a developer or the like must first define a basic configuration as shown in FIG. 18. Next, the developer or the like must specially describe processing that gives the function 72 the initial value of the instance 75 of the variable y, and that gives the function 72 the instance 75 of the variable y at the previous time together with the instance 71 of the variable x at the current time (the portion drawn with thick arrows in FIG. 18). When constructing a neural network in which the above basic configuration is repeated many times, or when constructing a neural network having a multilayer structure, the developer or the like must write such a special description for every such repetition or for every layer of the multilayer structure.
 In contrast, with the learning device according to the present embodiment, as illustrated in FIG. 16 and FIG. 17, the source code can be written simply, using the control syntax of the programming language, without requiring any special description. Therefore, with the learning device according to the present embodiment, even a complex or large-scale neural network can be constructed simply and efficiently.
 9. Additional Function (1)
 The learning device according to an embodiment may be capable of executing a function that severs the reference structure for backward processing.
 Specifically, when the unchain_backward method of an instance of the Variable class is called, the reference structure for backward processing heading from that instance toward the input side is severed. For example, assume that the following reference structure for backward processing has been generated by executing forward processing (detailed components such as splitters are omitted).
 A (input layer) ← Convolution2D ← B ← Linear ← C ← Softmax ← D (output layer)
 Here, A, B, C, and D represent instances of the Variable class, and Convolution2D, Linear, and Softmax represent instances of the Function class.
 At this time, if B.unchain_backward() is called, the reference structure for backward processing is severed starting from B and changes as follows.
 B ← Linear ← C ← Softmax ← D (output layer)
 Consider the situation in which this unchain_backward method is applied to the source code shown in FIG. 16. In that source code, line 7 adds y, the processing result of the function l1 and the function relu, back into the argument of the function l1. Under the "Define-by-Run" mechanism, when the description "x + y" is executed, a copy of y is created, and the reference structure for backward processing generated by the forward processing executed so far is then linked to it. In this example, therefore, the reference structure for backward processing keeps growing every time the loop is repeated.
 The backward processing executed in line 9 is executed on this grown reference structure for backward processing. Since the processing of line 9 is contained in the loop, the computation time of the loop processing as a whole becomes proportional to the square of the loop size.
 FIG. 19 is a diagram illustrating still another example of source code input to the learning device according to an embodiment of the present invention. The line numbers shown at the left edge of FIG. 19 are given only in order to explain this specific example and are not included in the actual source code.
 By modifying the source code shown in FIG. 16 so that unchain_backward is called periodically in line 11, as shown in FIG. 19, the increase in computation time can be suppressed.
 Line 9 describes that the processing of lines 10 to 12 is executed every time the loop starting at line 4 has been executed 10 times.
 Line 11 calls unchain_backward and discards the reference structure for backward processing starting from loss. This keeps the computation time of the loop processing as a whole short.
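 FIG. 19 itself is likewise not reproduced here. The following sketch, which reuses the assumptions of the previous sketch, only illustrates the kind of periodic truncation described above; the accumulation of loss over ten iterations, the call to zero_grads, and the reset of loss are assumptions of the sketch rather than a reproduction of the figure.
import numpy as np
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

model = FunctionSet(l1=F.Linear(20, 20))
optimizer = optimizers.Adam()
optimizer.setup(model.collect_parameters())
y = Variable(np.zeros((1, 20), dtype=np.float32))
loss = Variable(np.zeros((), dtype=np.float32))
for i in range(0, 1000):
    x = Variable(np.random.rand(1, 20).astype(np.float32))
    t = Variable(np.array([0], dtype=np.int32))
    y = F.relu(model.l1(x + y))
    loss += F.softmax_cross_entropy(y, t)
    if (i + 1) % 10 == 0:              # line 9 of FIG. 19: every 10 iterations (the constant "10")
        optimizer.zero_grads()
        loss.backward()                # line 10: backward over the recent part of the structure only
        loss.unchain_backward()        # line 11: discard the reference structure starting from loss
        optimizer.update()             # line 12: weight update
        loss = Variable(np.zeros((), dtype=np.float32))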
 By using unchain_backward in this way, even when learning is performed on forward processing whose reference structure contains a loop, excessive growth of the reference structure for backward processing is suppressed, and the learning processing can be executed with a realistic amount of computation.
 Furthermore, in another embodiment, unchain_backward can also be used for the purpose of not updating the weights of a specific Function.
 10. Additional Function (2)
 In the learning device according to an embodiment, a volatile attribute can be specified when an instance of the Variable class is initialized. When the volatile attribute is enabled, the reference structure for backward processing is not generated for forward processing that takes that Variable as input.
 When only the forward processing is executed using learned weights (that is, when there is no need to execute the backward processing), executing the processing that generates the reference structure for backward processing during the execution of the forward processing would waste both execution speed and memory. In such a case, by specifying the volatile attribute when initializing the Variable class instance that holds the input data of the forward processing, the generation of the reference structure for backward processing can be stopped and the forward processing alone can be executed efficiently.
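 As a short illustration of this use of the volatile attribute (again assuming a Chainer 1.x-style API; the layer sizes and the dummy input data are illustrative), specifying volatile=True on the input Variable suppresses the generation of the reference structure, so that only the forward processing is executed:
import numpy as np
from chainer import FunctionSet, Variable
import chainer.functions as F

model = FunctionSet(l1=F.Linear(784, 100), l2=F.Linear(100, 100), l3=F.Linear(100, 10))
x_data = np.random.rand(64, 784).astype(np.float32)
x = Variable(x_data, volatile=True)   # no reference structure for backward processing is generated
h1 = F.relu(model.l1(x))              # forward processing only; execution speed and memory are saved
h2 = F.relu(model.l2(h1))
y = model.l3(h2)                      # y holds the identification result; y.backward() is not available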
 11. Additional Remarks
 The embodiment in which source code written in Python is input to the learning device has been described as the most preferable embodiment, but the technology disclosed in this specification is not limited to the case of using source code written in Python. That is, the technology disclosed in this specification is equally applicable to the case of using source code written in a programming language equivalent to Python (for example, R, Julia, Sparkz, MLib, and the like) that can realize at least one of the following: calculating, at the time the learning device executes each piece of code, the output value of the forward processing described in that code on the basis of the input value; generating a reference structure for backward processing each time the learning device executes the forward processing described in each piece of code (so that the backward processing can be executed on the basis of this reference structure); and defining the configuration of a neural network using control syntax.
 The technology disclosed in this specification can be realized by executing source code written in Python or a programming language equivalent thereto; it may instead be realized by executing a module or library written in Python or a programming language equivalent thereto.
 In this specification, the names used to identify variables, functions, methods, classes, subclasses, and the like do not limit the technology disclosed in this specification and may be arbitrary.
 The processes and procedures described in this specification can be realized not only by what is explicitly described in the embodiments but also by software, hardware, or a combination thereof. Specifically, the processes and procedures described in this specification are realized by implementing logic corresponding to those processes on a medium such as an integrated circuit, a volatile memory, a nonvolatile memory, a magnetic disk, or an optical storage. The processes and procedures described in this specification can also be implemented as a computer program and executed by various kinds of computers.
 Even where it is described that the processes and procedures described in this specification are executed by a single device, a single piece of software, a single component, or a single module, such processes or procedures can be executed by a plurality of devices, a plurality of pieces of software, a plurality of components, and/or a plurality of modules. Even where it is described that the data, tables, or databases described in this specification are stored in a single memory, such data, tables, or databases can be stored in a distributed manner in a plurality of memories provided in a single device or in a plurality of memories distributed over a plurality of devices. Furthermore, the software and hardware elements described in this specification can be realized by integrating them into fewer components or by decomposing them into more components.
 Part 2 (Method for Implementing the Algorithm on an Embedded Chip)
 1. Background
 Deep learning is an algorithm that delivers high performance but demands a large amount of computation, a large amount of memory, and a large number of learning samples. The spread of GPUs and clouds, which provide abundant computing resources at low cost, and of Web infrastructure that makes it possible to share learning samples, can be said to be the background that has supported the recent rise of deep learning.
 There is a wide variety of environments (libraries and frameworks) that support the development of deep learning algorithms. Many development environments provide functions for improving learning speed by using a GPU.
 In fields such as fully autonomous driving of automobiles and highly versatile robot control, advanced information processing capability is required to analyze, in real time, information obtained from various sensors such as cameras and LIDAR (laser distance measurement) and to control countless motors in order to solve the task at hand; the application of deep learning, whose performance sets it apart from conventional approaches, is therefore strongly anticipated.
 However, because of requirements such as safety, chip price, and power consumption, these fields depend on embedded environments whose computing resources are poor compared with GPUs and clouds, and the application of deep learning, which requires large computing resources, has therefore been delayed.
 The reasons why the application of deep learning to embedded environments has been delayed include not only the aspect that the demand of such algorithms for computing resources exceeds the performance of realistic and economical embedded environments, but also the aspect that implementations supporting deep learning, beginning with the software environment, are not yet fully available.
 Even in embedded environments, hardware performance is improving year by year, and deep learning algorithms also continue to be improved so as to relax their demands on computing resources, so the former factor is expected to be resolved gradually.
 The problem to be solved by the embodiments of the present invention is to break through the barriers to adapting deep learning to embedded environments, which remain mainly on the software environment side, and to accelerate development, by developing a framework for designing deep learning algorithms that operate on an embedded chip while satisfying product-level requirements.
 Extending, for embedded environments, the functions of the learning device according to the embodiment described in Part 1 above, which is a framework that, while GPU-based, provides high productivity in the development of deep learning algorithms, is considered the best means of solving this problem; therefore, from the next section onward, the problems of adapting to embedded environments are described with a focus on the learning device according to the embodiment.
 2. Problems of the Implementation Method According to the Embodiment
 Since the learning device according to the embodiment described in Part 1 above depends on advanced language functions and libraries, attempting to run an algorithm that operates on this learning device as it is on an embedded semiconductor chip may cause the following adverse effects.
 First, regarding security, as the scale of the libraries and the language grows, the degree to which an application depends on implementations that are effectively unknowable increases. Accordingly, the risk increases that defects contained in such implementations become, as they are, defects of the chip product.
 Next, regarding the footprint, the implementation of the libraries and the language itself puts pressure on the memory resources of the chip product.
 Furthermore, regarding overhead, the computing resources of the chip product cannot be fully utilized through libraries with highly abstract APIs. At least for the large-scale computations required by neural networks, low-level performance tuning specialized for the chip is essential.
 For the above reasons, simply running an algorithm that operates on the learning device according to the embodiment as it is on an embedded semiconductor chip is highly unlikely to satisfy product-level requirements.
 3. Concept of the Implementation Method According to the Embodiment
 The implementation method according to the embodiment realizes, in the shortest possible period, a state in which a new neural network (NN) algorithm designed on a personal computer or the like having abundant computing resources can operate on an arbitrary embedded chip (embedded semiconductor integrated circuit) while satisfying product-level requirements. To this end, it is desirable that the developers who design the algorithm and the developers who are deeply conscious of the hardware can proceed with their work as independently of each other as possible. The present embodiment proposes a technical idea concerning a device (framework) that assists this.
 4. Development Steps Assumed in Embedded Chip Development
 The following three steps are assumed as the steps followed when developing an embedded chip.
 Step I: A state in which the code used in the learning device according to the embodiment (as one example, code written in Python) runs on a PC (+GPU)
 This is a state in which the design and verification of an algorithm using a neural network having a complex configuration are realized with a small amount of code description. This is the concept of the "Define-by-Run" technique described above.
 Step II: A state in which an implementation optimized for the chip and the Python code coexist
 This is a state in which the operation check and performance verification, on the chip, of the algorithm designed with the learning device according to the embodiment are realized with almost no change to the Python code.
 Step III: A state in which the algorithm designed with the learning device according to the embodiment operates only with the implementation optimized for the chip
 This is a state in which the algorithm operates while satisfying the product-level specification requirements of the chip (that is, the algorithm can cooperate on the chip with other modules and control mechanisms with high real-time performance).
 The implementation method according to the present embodiment proposes a framework that makes it possible to proceed with development in a short period when a new algorithm is developed with the learning device according to the embodiment, by eliminating, as far as possible, the labor of re-correction, redesign, and relearning between steps I to III above.
 4-1. Step I
 FIG. 20 is a schematic diagram for explaining step I of the implementation method according to an embodiment of the present invention. The configuration shown in FIG. 20 is the configuration assumed by the learning device according to the embodiment described in Part 1 above. That is, in this configuration, source code written in Python, which is one form of programming language, uses PyCUDA, which is one form of library, and numpy (BLAS), which is another form of library, and these libraries drive the GPU and a general-purpose computer, respectively. Note that "Chainer" shown in FIG. 20 is the name given by the present applicant to the framework for writing the source code used in the learning device according to the embodiment described in Part 1 above.
4-2. Step II
FIG. 21 is a schematic diagram explaining step II of the mounting method according to the embodiment of the present invention. In the configuration shown in FIG. 21, the Chainer front end runs on Python. As shown in FIG. 21, this embodiment provides a Native I/F (an interface for calling implementations, written in a low-level language such as C, that are equivalent to Chainer's main functions), so that execution on the PC and execution optimized for the embedded chip can be driven by the same code.
FIG. 22 is a schematic diagram explaining the case where an execution unit based on Python and an execution unit based on the chip communicate with each other. As shown in FIG. 22, by providing a communication function in the Native I/F implementation, it is also possible to remove the dependency on Python from the configuration on the embedded chip (Chainer on the PC drives the optimized implementation on the embedded chip).
Implementation of the Native I/F
Reference code (assumed to be written in a low-level language such as C) is implemented for Chainer's Function and Optimizer. This reference code is implemented without depending on external libraries such as numpy. A memory pool mechanism suitable for dynamic network definition is also implemented. In addition, data conversion functions to and from numpy are created separately from Function/Optimizer. Furthermore, a floating-point version of the reference code for the above Function/Optimizer is created.
Furthermore, a fixed-point version of the reference code for the above Function/Optimizer is created, together with data conversion functions between floating point and fixed point, separate from Function/Optimizer. The reason for creating the fixed-point reference code is that quite a few chips do not have an FPU.
Code optimized for various chips is then implemented based on the above reference code.
4-3. Step III
FIG. 23 is a schematic diagram explaining step III of the mounting method according to the embodiment of the present invention. As shown in FIG. 23, a method for outputting the network definition and weights from Chainer as bytecode is added. In addition, a virtual machine is provided that interprets the bytecode and executes neural network processing (forward processing, backward processing, and weight update). The Native I/F implementations optimized for the chip can be reused.
Configuration 1
(Configuration of the Native I/F)
FIG. 42 is a diagram showing a configuration example of the Native I/F according to an embodiment of the present invention.
In this configuration, an interface independent of the type of computer is provided for each NN algorithm. A processing system that uses an NN algorithm instructs a specific computer to execute the algorithm via this interface.
The interface here is a means of defining the format of input data, the format of output data, and the correspondence between the processing applied to the input data format and the output data format. If the interface is the same, the same output is obtained for the same input. One example is a function written in the C language together with its function declaration.
The processing system that uses the NN algorithm is not particularly limited. Examples include existing frameworks for NN design (Chainer and others) as well as processing systems developed together with the algorithm itself.
A computer here means a device that executes computation: a device including computing cores, a memory hierarchy, and the hardware resources necessary to carry out the computation.
A general-purpose computer means a commonly used computer, on which conventional software such as the Linux (registered trademark) OS and Python runs easily.
An accelerator here means a device that executes specific computations, including NN algorithm computations, at high speed.
A GPU here is a computer specialized for image processing that also has the ability to execute general-purpose computation. A GPU is one form of accelerator. Because software assets such as CUDA exist, the ease of implementing NN algorithms on a GPU is roughly intermediate between a general-purpose computer and a typical accelerator.
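As an illustration only, the following Python sketch shows the general shape of such a computer-type-independent interface together with a call manager that dispatches to whichever implementation is registered for a given device; all names here (NativeCallManager, linear_forward_cpu, the "generic" device label) are hypothetical and are not the actual Native I/F.

import numpy as np

class NativeCallManager:
    """Hypothetical call manager: maps (algorithm, device) to an implementation."""
    def __init__(self):
        self._impls = {}

    def register(self, algorithm, device, impl):
        self._impls[(algorithm, device)] = impl

    def call(self, algorithm, device, *args):
        impl = self._impls.get((algorithm, device))
        if impl is None:
            # No implementation for this device: report an error to the caller.
            raise NotImplementedError(f"{algorithm} not available on {device}")
        return impl(*args)

# Example: the same "linear_forward" interface backed by a general-purpose
# computer implementation; an accelerator would register its own version.
def linear_forward_cpu(x, W, b):
    return x @ W.T + b

manager = NativeCallManager()
manager.register("linear_forward", "generic", linear_forward_cpu)
y = manager.call("linear_forward", "generic",
                 np.ones((2, 3), dtype=np.float32),
                 np.ones((4, 3), dtype=np.float32),
                 np.zeros(4, dtype=np.float32))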
Configuration 1-1
(Configuration for executing identification and learning by an NN)
FIG. 43 is a diagram showing a configuration example for executing identification and learning by an NN according to an embodiment of the present invention.
The Native I/F has at least a Forward processing unit. With this configuration, the Native I/F can execute identification processing using an NN algorithm.
Furthermore, the Native I/F has at least a Forward processing unit, a Backward processing unit, an internal-state initialization processing unit for the weight update algorithm, and a weight update processing unit. With this configuration, the Native I/F can execute identification processing and learning processing using an NN algorithm.
A Forward processing unit and a Backward processing unit are included for each layer algorithm. An internal-state initialization processing unit and a weight update processing unit are included for each weight update algorithm.
Furthermore, the Native I/F has, for each layer algorithm, a Forward processing call interface and a Backward processing call interface, and, for each weight update algorithm, an internal-state initialization processing interface and a weight update processing call interface.
Furthermore, the implementation called through the Native I/F has a Native I/F call management unit. With this configuration, the implementation called through the Native I/F can switch, depending on differences in the Native I/F parameters, to the implementation that can best execute the requested Native I/F operation. If no implementation capable of executing the requested operation exists, the Native I/F call management unit returns an error to the caller. Thus, the implementation called through the Native I/F can select and execute the implementation that performs the operation optimally.
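The following is a minimal Python sketch of the four kinds of call interfaces listed above, using a linear layer and momentum SGD purely as illustrative examples; the function names and signatures are assumptions, not the actual Native I/F.

import numpy as np

# Layer algorithm: Forward and Backward call interfaces (linear layer example).
def linear_forward(x, W, b):
    return x @ W.T + b

def linear_backward(x, W, gy):
    gx = gy @ W          # gradient w.r.t. the input
    gW = gy.T @ x        # gradient w.r.t. the weight
    gb = gy.sum(axis=0)  # gradient w.r.t. the bias
    return gx, gW, gb

# Weight update algorithm: internal-state initialization and update interfaces
# (plain SGD with momentum as an illustrative example).
def momentum_sgd_init_state(W):
    return {"v": np.zeros_like(W)}

def momentum_sgd_update(W, gW, state, lr=0.01, momentum=0.9):
    state["v"] = momentum * state["v"] - lr * gW
    W += state["v"]
    return W

# One training step expressed through these call interfaces.
x = np.random.randn(2, 3).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
y = linear_forward(x, W, b)
gx, gW, gb = linear_backward(x, W, np.ones_like(y))
state = momentum_sgd_init_state(W)
W = momentum_sgd_update(W, gW, state)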
Configuration 1-1-1
(Configuration 1 for executing identification and learning by an NN: a configuration that manages multidimensional arrays (multidimensional array management unit))
FIG. 44 is a diagram showing a configuration example for managing multidimensional arrays according to an embodiment of the present invention.
The Native I/F further has a multidimensional array management unit. The multidimensional array management unit can perform at least one operation selected from the group including generation and destruction of multidimensional arrays, acquisition of attributes (number of axes, number of elements per axis), acquisition of aggregate results (such as the sum, mean, and variance per axis), and elementwise arithmetic operations between multidimensional arrays.
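A minimal sketch, under assumed names and with numpy as the backing store, of the operations attributed to the multidimensional array management unit:

import numpy as np

class MDArrayManager:
    def create(self, shape, dtype=np.float32):
        return np.zeros(shape, dtype=dtype)

    def destroy(self, arr):
        del arr  # no-op here; a real accelerator implementation would release device memory

    def attributes(self, arr):
        # number of axes and number of elements per axis
        return {"ndim": arr.ndim, "shape": arr.shape}

    def aggregate(self, arr, axis=None):
        # per-axis sum, mean, and variance, usable to sanity-check results
        return {"sum": arr.sum(axis=axis),
                "mean": arr.mean(axis=axis),
                "var": arr.var(axis=axis)}

    def elementwise(self, op, a, b):
        ops = {"add": np.add, "sub": np.subtract,
               "mul": np.multiply, "div": np.divide}
        return ops[op](a, b)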
Configuration 1-2
(Configuration for sharing data)
Configuration 1-2-1
(Configuration 1 for sharing data: the data representation conversion unit)
FIG. 45 is a diagram showing a configuration example of the data representation conversion unit according to an embodiment of the present invention.
Furthermore, the Native I/F has a data representation conversion unit. The data representation conversion unit can convert between a data representation that depends on a specific computer (device-dependent data representation) and a data representation that does not depend on any specific computer (device-independent data representation).
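A minimal sketch of the idea, assuming the device-dependent representation can be treated as a raw byte buffer plus metadata and the device-independent representation as a plain numpy array; the actual representations are device-specific and are not defined here.

import numpy as np

def to_device_independent(dev_blob, shape, dtype=np.float32):
    """Device-dependent (bytes + metadata) -> device-independent (numpy array)."""
    return np.frombuffer(dev_blob, dtype=dtype).reshape(shape).copy()

def to_device_dependent(array):
    """Device-independent (numpy array) -> device-dependent (raw bytes + metadata)."""
    array = np.ascontiguousarray(array)
    return array.tobytes(), array.shape, array.dtype

# Round trip: weights produced on one computer can be handed to another.
w = np.random.randn(4, 3).astype(np.float32)
blob, shape, dtype = to_device_dependent(w)
w2 = to_device_independent(blob, shape, dtype)
assert np.array_equal(w, w2)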
Configuration 1-2-2
(Configuration 2 for sharing data: additionally having an external storage medium)
Furthermore, the processing system that calls the Native I/F has an external storage medium. The external storage medium can store weight data that has been converted into device-independent data.
Configuration 1-2-3
(Configuration 3 for sharing data: additionally having a communication unit)
FIG. 46 is a diagram showing a configuration example of the communication unit according to an embodiment of the present invention.
Furthermore, the implementation called through the Native I/F has a communication unit. The communication unit can communicate Native I/F call information to the implementation on the called side. When an arbitrary processing system that uses an NN algorithm calls a Native I/F whose definition does not depend on whether call information is communicated, the implementation called through that Native I/F can execute the optimal communication processing as needed. Through this step, the physical distance between computers, the presence or absence of shared memory, and differences in communication protocols can be hidden from any processing system that uses the NN algorithm.
Examples of Native I/Fs whose definitions do not depend on whether call information is communicated include the interface for executing a layer algorithm, the interface for executing a weight update algorithm, and the interface for executing data representation conversion.
Configuration 2
(Configuration of the extended version of the Native I/F)
Configuration 2-1
(Configuration 1 of the extended version of the Native I/F: having a type conversion unit, and a floating-point NN algorithm execution unit and/or a fixed-point NN algorithm execution unit)
FIG. 47 is a diagram showing a configuration example of floating-point and fixed-point execution units and a type conversion unit according to an embodiment of the present invention.
The Native I/F has a type conversion unit, and a floating-point NN algorithm execution unit and/or a fixed-point NN algorithm execution unit.
For example, consider computer B having only a type conversion unit, computer A having only a floating-point NN algorithm execution unit, and computer C having only a fixed-point NN algorithm execution unit. When computers A, B, and C are combined with the basic configuration of the Native I/F, the floating-point data generated by computer A is transferred to computer B. The data transferred from computer A to computer B is then converted into fixed-point data by computer B, and the converted fixed-point data is transferred to computer C. The fixed-point data transferred from computer B becomes the input data of computer C, and the overall operation of the NN algorithm is executed. These steps can also be executed in reverse order.
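A minimal sketch of the type conversion step, assuming a simple Q-format fixed-point encoding; the number of fractional bits is an arbitrary choice for illustration.

import numpy as np

FRAC_BITS = 8  # assumed number of fractional bits (Q8 format)

def float_to_fixed(x, frac_bits=FRAC_BITS):
    """Floating-point array -> fixed-point array stored as int32."""
    return np.round(x * (1 << frac_bits)).astype(np.int32)

def fixed_to_float(q, frac_bits=FRAC_BITS):
    """Fixed-point int32 array -> floating-point array."""
    return q.astype(np.float32) / (1 << frac_bits)

# Computer A produces floating-point data; "computer B" converts it; "computer C"
# would consume the fixed-point data with its fixed-point execution unit.
x_float = np.array([0.5, -1.25, 3.0], dtype=np.float32)
x_fixed = float_to_fixed(x_float)
assert np.allclose(fixed_to_float(x_fixed), x_float, atol=1.0 / (1 << FRAC_BITS))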
Configuration 2-2
(Configuration 2 of the extended version of the Native I/F: having a memory pool module)
FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention.
Furthermore, the implementation called through the Native I/F has a memory pool module. The memory pool module can realize dynamic memory management.
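A minimal sketch of a memory pool of the kind described, keyed by buffer size and data type so that released buffers are handed out again instead of being reallocated; the design details are assumptions for illustration.

import numpy as np
from collections import defaultdict

class MemoryPool:
    def __init__(self):
        self._free = defaultdict(list)  # (size, dtype) -> list of reusable buffers

    def acquire(self, shape, dtype=np.float32):
        dtype = np.dtype(dtype)
        key = (int(np.prod(shape)), dtype)
        if self._free[key]:
            buf = self._free[key].pop()       # reuse a previously released buffer
        else:
            buf = np.empty(key[0], dtype=dtype)  # allocate only on a pool miss
        return buf.reshape(shape)

    def release(self, buf):
        self._free[(buf.size, buf.dtype)].append(buf.ravel())

pool = MemoryPool()
a = pool.acquire((32, 32))
pool.release(a)
b = pool.acquire((32, 32))  # reuses the buffer released above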
Configuration 2-3
(Configuration 3 of the extended version of the Native I/F: having an algorithm execution unit that fuses a plurality of NN algorithms)
FIG. 49 is a diagram showing a configuration example of an algorithm execution unit that fuses a plurality of NN algorithms according to an embodiment of the present invention.
Furthermore, the Native I/F has an algorithm execution unit that fuses a plurality of NN algorithms. For frequently occurring combinations of NN algorithms, this execution unit executes the plurality of algorithms at once.
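As a toy illustration of such fusion, the sketch below assumes that Linear followed by ReLU is one of the frequent combinations and executes the pair in a single pass, so that the intermediate result never has to travel through global memory between two separate calls.

import numpy as np

def linear_relu_fused_forward(x, W, b):
    # single pass: the intermediate linear output stays local to this function
    return np.maximum(x @ W.T + b, 0.0)

# The fused call produces the same result as the two-step sequence.
x = np.random.randn(2, 3).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
h = x @ W.T + b
assert np.allclose(linear_relu_fused_forward(x, W, b), np.maximum(h, 0.0))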
Configuration 2-4
(Configuration 4 of the extended version of the Native I/F: having a multidimensional array data compression/decompression unit)
FIG. 50 is a diagram showing a configuration example of a unit for reducing the data communication volume of multidimensional arrays according to an embodiment of the present invention.
Furthermore, the implementation called through the Native I/F has a multidimensional array data compression/decompression unit. The multidimensional array data compression/decompression unit is provided in the communication unit.
Configuration 3
(Configuration of the Native I/F + Chainer execution unit)
FIG. 51 is a diagram showing an example of cooperation with existing execution units according to an embodiment of the present invention.
Configuration 3-1
(Configuration 1 of the Native I/F + Chainer execution unit: having a bytecode generation unit and a virtual machine)
FIG. 53 is a diagram showing a configuration example of the bytecode generation unit and the virtual machine according to an embodiment of the present invention.
Furthermore, the Chainer execution unit has a bytecode generation unit. The bytecode generation unit takes the Backward calculation procedure and the weights as input and outputs them as bytecode. For example, the bytecode generation unit is provided in Chainer's Python layer.
The Native I/F also has a virtual machine. The virtual machine interprets the bytecode and executes NN algorithm processing. The NN algorithm processing here is any one of, or a combination of, forward processing, backward processing, and weight update.
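A minimal sketch of the bytecode generation unit and virtual machine, with an invented bytecode layout (a serialized list of function-call records plus weights); the real bytecode format may additionally carry input/output data information and function call information for backward processing, as described elsewhere in this specification.

import pickle
import numpy as np

def generate_bytecode(call_sequence, weights):
    """Serialize the network definition (call records) and the weights."""
    return pickle.dumps({"calls": call_sequence, "weights": weights})

def virtual_machine_forward(bytecode, x):
    program = pickle.loads(bytecode)
    kernels = {
        "linear": lambda x, W, b: x @ W.T + b,
        "relu": lambda x: np.maximum(x, 0.0),
    }
    for call in program["calls"]:
        name = call["function"]
        if name == "linear":
            W, b = program["weights"][call["weight_key"]]
            x = kernels["linear"](x, W, b)
        else:
            x = kernels[name](x)
    return x

# A two-layer network expressed as bytecode and run on the virtual machine.
weights = {"l1": (np.random.randn(4, 3).astype(np.float32),
                  np.zeros(4, dtype=np.float32))}
code = generate_bytecode(
    [{"function": "linear", "weight_key": "l1"}, {"function": "relu"}], weights)
y = virtual_machine_forward(code, np.ones((2, 3), dtype=np.float32))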
Configuration 3-2
(Configuration 2 of the Native I/F + Chainer execution unit: having a comparison unit)
FIG. 54 is a diagram showing a configuration example of the comparison unit according to an embodiment of the present invention.
Furthermore, the Chainer execution unit has a comparison unit. The comparison unit compares the input/output results of an existing execution unit and a Native layer execution unit corresponding to the same NN algorithm, or compares the input/output results of Native layer execution units that call different implementations of the same Native I/F.
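A minimal sketch of what such a comparison might look like, with assumed names and an assumed tolerance; the "Native" implementation here is merely a stand-in for an optimized execution unit.

import numpy as np

def compare_executions(existing_impl, native_impl, inputs, atol=1e-5):
    y_existing = existing_impl(*inputs)
    y_native = native_impl(*inputs)
    max_diff = float(np.max(np.abs(y_existing - y_native)))
    return {"match": max_diff <= atol, "max_abs_diff": max_diff}

# Example: an existing (numpy) linear forward vs. a "Native" variant.
def linear_existing(x, W, b):
    return x @ W.T + b

def linear_native(x, W, b):
    return np.add(np.dot(x, W.T), b)  # stand-in for the optimized implementation

x = np.random.randn(2, 3).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
print(compare_executions(linear_existing, linear_native, (x, W, b)))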
Configuration 3-3
(Configuration 3 of the Native I/F + Chainer execution unit: having a function synthesis unit)
FIG. 55 is a diagram showing a configuration example of the function synthesis unit according to an embodiment of the present invention.
Furthermore, the Chainer execution unit has a function synthesis unit. The function synthesis unit takes the Backward calculation procedure as input and replaces combinations of Function class instances that a "Native I/F that executes a plurality of algorithms at once" can handle with the Function class instance corresponding to that Native I/F. However, if no "Native I/F that executes a plurality of algorithms at once" exists in the Native layer implementation for the computer that executes the Backward calculation procedure, this replacement is not performed.
The replacement here can be executed by a partial match search, regarding the Backward calculation procedure as a character string.
For example, the function synthesis unit is provided in Chainer's Python layer.
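A minimal sketch of the replacement by partial match, treating the calculation procedure as a sequence of function names; the fusion rules and fused interface names below are assumptions for illustration only.

FUSION_RULES = {
    ("Convolution2D", "ReLU"): "Convolution2DReLU",  # assumed fused interface
    ("Linear", "ReLU"): "LinearReLU",
}

def synthesize(procedure, fused_available):
    """procedure: list of function names; fused_available: set of fused names
    supported by the Native layer of the target computer."""
    result, i = [], 0
    while i < len(procedure):
        pair = tuple(procedure[i:i + 2])
        fused = FUSION_RULES.get(pair)
        if fused is not None and fused in fused_available:
            result.append(fused)   # replace the pair with the fused instance
            i += 2
        else:
            result.append(procedure[i])  # no fused Native I/F: leave as-is
            i += 1
    return result

print(synthesize(["Convolution2D", "ReLU", "Linear", "Softmax"],
                 fused_available={"Convolution2DReLU"}))
# -> ['Convolution2DReLU', 'Linear', 'Softmax']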
Configuration 4
(Configuration of an optimization device specialized for Forward processing execution)
Configuration 4-1
(Configuration 1 of the optimization device specialized for Forward processing execution: having a weight optimization processing means)
Furthermore, the Chainer execution unit has a weight optimization processing means. The weight optimization processing means executes weight optimization processing suited to the Function class.
Configuration 4-2
(Configuration 2 of the optimization device specialized for Forward processing execution: having a data memory area reuse means)
Furthermore, the Chainer execution unit and the Native I/F have a data memory area reuse means. The data memory area reuse means reuses the memory areas of data input and output between layers. The reuse means is provided in the Forward processing execution unit or in the virtual machine described above.
For example, a flag for identifying that only Forward processing is to be executed is provided as an argument of the interface (defined by the Native I/F) that executes the Forward processing of the virtual machine. The conditions for executing this processing are that the volatile attribute is specified on the Variable variable input to an instance of Chainer's Function class, or that the flag identifying Forward-only execution is enabled when the Forward processing of the virtual machine is executed.
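A minimal sketch of the flag semantics assumed here: when the forward-only flag is set, the buffer holding a layer's input is handed back for reuse as soon as the next output exists, since it will never be needed for Backward processing.

import numpy as np

def run_forward(layers, x, forward_only=False):
    released = []  # collected here to stand in for returning buffers to a memory pool
    for layer in layers:
        y = layer(x)
        if forward_only:
            released.append(x)  # x is no longer needed; its memory may be reused
        x = y
    return x

layers = [lambda v: np.maximum(v, 0.0), lambda v: v * 2.0]
out = run_forward(layers, np.random.randn(8, 8).astype(np.float32),
                  forward_only=True)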
Action 1
(Effects of the configuration of the Native I/F)
The division of labor between developers who design and use NN algorithms and developers who are deeply aware of the hardware configuration of computers becomes easier.
For example, for a developer who designs and uses algorithms, the Native I/F guarantees that the interface for each NN algorithm to be executed is identical, so processing can be executed on various computers without changing the software that calls the Native I/F.
Specifically, the risk that the software under development becomes dependent on a specific computer can be reduced. As a result, computers can be selected based on more essential criteria, such as the price of the computer and its strengths and weaknesses for a specific application.
For a developer who is deeply aware of the hardware configuration of a computer, providing an optimized implementation for a computer that supports the Native I/F allows a wide range of NN algorithm users to use the computer that the developer has built.
Action 1-1
(Effects of the configuration for executing identification and learning by an NN)
A developer who designs and uses NN algorithms can realize the overall operation of an NN algorithm by calling the interfaces provided by the Native I/F from any processing system that uses NN algorithms.
In addition, such a developer can realize the overall operation of an NN algorithm using the implementation best suited to the computer in use, without being aware of the specific configuration of that computer.
Action 1-1-1
(Effects of configuration 1 for executing identification and learning by an NN: the configuration that manages multidimensional arrays (multidimensional array management unit))
When executing the overall operation of an NN algorithm, a developer who designs and uses NN algorithms can combine and execute arbitrary NN algorithms without going through unnecessary data conversion processing.
At this time, by checking the aggregate results of the contents of the multidimensional arrays that are the processing results of an arbitrary NN algorithm, it is possible to confirm whether the NN algorithm is performing the calculation as intended.
Action 1-2
(Effects of the configuration for sharing data)
Action 1-2-1
(Effects of configuration 1 for sharing data: the data representation conversion unit)
By going through the device-independent data representation, the data necessary to realize the overall operation of an NN algorithm can be exchanged between computers with different hardware configurations. Information unique to each computer can be hidden.
Action 1-2-2
(Effects of configuration 2 for sharing data: additionally having an external storage medium)
By converting weight data into the device-independent data representation and then saving it on an external storage medium, identification processing can be executed on any computer using weights learned on a specific computer.
Action 1-2-3
(Effects of configuration 3 for sharing data: additionally having a communication unit)
Regardless of the hardware configuration of the computers, their physical distance, or the presence or absence of shared memory, the data necessary to realize the overall operation of an NN algorithm can be exchanged.
It is also possible, from a computer on which the processing system that uses the NN algorithm can operate, to call an NN algorithm implementation residing on a computer on which that processing system cannot operate.
Therefore, the overall operation of an NN algorithm can be realized using a plurality of computers connected to a computer network.
Action 2
(Effects of the configuration of the extended version of the Native I/F)
Action 2-1
(Effects of configuration 1 of the extended version of the Native I/F: having a type conversion unit, and a floating-point NN algorithm execution unit and/or a fixed-point NN algorithm execution unit)
In a hardware configuration in which computers without a floating-point unit (FPU) and computers with an FPU are mixed, the overall operation of an NN algorithm can be realized using the data type suited to each computer. The overall operation of an NN algorithm can be realized using floating-point arithmetic or fixed-point arithmetic.
Specifically, computer A transfers the floating-point data generated by its floating-point NN algorithm execution unit to computer B. Computer B then converts the floating-point data transferred from computer A into fixed-point data with its type conversion unit and transfers the fixed-point data to computer C.
Computer C transfers the fixed-point data generated by its fixed-point NN algorithm execution unit to computer B. Computer B then converts the fixed-point data transferred from computer C into floating-point data with its type conversion unit and transfers the floating-point data to computer A.
Action 2-2
(Effects of configuration 2 of the extended version of the Native I/F: having a memory pool module)
When a processing system that depends on a dynamic memory management mechanism calls Native I/Fs involving the generation and destruction of data and executes the overall operation of an NN algorithm, that operation can be realized with little overhead.
Action 2-3
(Effects of configuration 3 of the extended version of the Native I/F: having an algorithm execution unit that fuses a plurality of NN algorithms)
Unnecessary accesses to global memory can be avoided, and the overhead of function calls can be reduced. Therefore, frequently occurring combinations of NN algorithms can be executed at high speed.
Action 2-4
(Effects of configuration 4 of the extended version of the Native I/F: having a multidimensional array data compression/decompression unit)
When the overall operation of an NN algorithm is executed using a plurality of computers connected to a computer network, the data communication volume of multidimensional arrays can be reduced. Therefore, the operation speed can be improved.
Action 3
(Effects of the Native I/F + Chainer execution unit)
The overall operation of an NN can be defined and executed by combining NN algorithms that are supported by the Native I/F with NN algorithms that are not.
As soon as Native I/F support becomes available, the overall NN operation can be executed by substituting the corresponding Native I/F as appropriate. Therefore, existing software does not need to be modified.
Even when Native I/Fs are combined, the existing benefits of Define-by-Run are retained.
Action 3-1
(Effects of configuration 1 of the Native I/F + Chainer execution unit: having a bytecode generation unit and a virtual machine)
Because the Chainer execution unit has a bytecode generation unit and the Native I/F has a virtual machine, the dependence on advanced libraries and programming languages can be reduced. Therefore, even on various computers including poor execution environments such as accelerators, the overall operation of an NN designed with Chainer can be executed while satisfying product-level requirements.
Action 3-2
(Effects of configuration 2 of the Native I/F + Chainer execution unit: having a comparison unit)
The comparison unit compares the input/output results of an existing execution unit and a Native layer execution unit corresponding to the same NN algorithm, and compares the input/output results of Native layer execution units that call different implementations of the same Native I/F.
Having such a comparison unit makes it possible to compare the accuracy of the processing results of the floating-point NN algorithm execution unit with the accuracy of the processing results of the fixed-point NN algorithm execution unit. It also makes it possible to compare the processing results of an execution unit that has already been sufficiently tested to compute the NN algorithm correctly with the processing results of a newly created Native layer. It can therefore be guaranteed that the newly created Native layer implementation computes the NN algorithm correctly.
Action 3-3
(Effects of configuration 3 of the Native I/F + Chainer execution unit: having a function synthesis unit)
The function synthesis unit takes the Backward calculation procedure as input and replaces combinations of Function class instances that a "Native I/F that executes a plurality of algorithms at once" can handle with the Function class instance that corresponds one-to-one to that Native I/F. If no such "Native I/F that executes a plurality of algorithms at once" exists, the function synthesis unit does not perform this replacement.
By having this function synthesis unit in Chainer's Python layer, the Backward calculation procedure is processed automatically regardless of whether a "Native I/F that executes a plurality of algorithms at once" exists. Through this processing of the Backward calculation procedure, when such a Native I/F exists, it is called and the combination is replaced with the corresponding Function class instance. This makes it possible to always realize a fast overall operation of the NN algorithm.
Even for combinations of functions for which no "Native I/F that executes a plurality of algorithms at once" exists, the function synthesis unit can still be beneficial. Specific examples are the combination Convolution2D + BatchNormalization and the combination Linear + BatchNormalization when execution is limited to Forward processing.
BatchNormalization is a process that, for each element of its input multidimensional array, normalizes the variance across elements and removes the mean, based on long-term statistics gathered through NN training. When only Forward processing is performed rather than training, there is no need to update the variance and mean; with constants a and b, the process is nothing more than a per-element transformation of the form y = ax + b. Linear processing is a matrix product, and Convolution2D is a computation combining convolution and matrix products. Since these processes include a transformation of the form y = ax + b as illustrated above, adjusting the weights and biases of Linear or Convolution2D yields the same results as feeding the outputs of these Functions into BatchNormalization.
By adjusting the weights and biases in this way, the function synthesis unit can convert Convolution2D + BatchNormalization into a single Convolution2D. The conversion from Linear + BatchNormalization into a single Linear is analogous.
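A worked sketch of the weight and bias adjustment described above for the Linear + BatchNormalization case, using numpy; the helper name and the epsilon value are illustrative assumptions. With a = gamma / sqrt(var + eps) and b' = beta - a * mean, folding the BatchNormalization into the preceding Linear amounts to scaling each output row of W by a and adjusting the bias accordingly.

import numpy as np

def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    a = gamma / np.sqrt(var + eps)          # per-output-channel scale
    W_folded = W * a[:, np.newaxis]         # scale each output row of W
    b_folded = a * (b - mean) + beta        # adjusted bias
    return W_folded, b_folded

# Check: Linear followed by inference-time BatchNormalization equals the folded Linear alone.
rng = np.random.RandomState(0)
x = rng.randn(5, 3).astype(np.float32)
W = rng.randn(4, 3).astype(np.float32)
b = rng.randn(4).astype(np.float32)
gamma = rng.rand(4).astype(np.float32)
beta = rng.randn(4).astype(np.float32)
mean = rng.randn(4).astype(np.float32)
var = (rng.rand(4) + 0.1).astype(np.float32)

y_ref = (x @ W.T + b - mean) / np.sqrt(var + 1e-5) * gamma + beta
Wf, bf = fold_bn_into_linear(W, b, gamma, beta, mean, var)
assert np.allclose(x @ Wf.T + bf, y_ref, atol=1e-4)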
Action 4
(Effects of the configuration of the optimization device specialized for Forward processing execution)
Memory can be reduced by reducing the amount of weight information or the data memory amount of the input data and then executing Forward processing. The amount of computation can also be reduced by reducing the number of weight elements or by executing Forward processing without computing weights that are zero.
Action 4-1
(Effects of optimization device 1 specialized for Forward processing execution: having a weight optimization processing means)
By providing a weight optimization processing means specialized for Forward processing in Chainer's Function class, weight optimization processing can be executed for any Function class instance included in the configuration of a trained network. Being able to execute this weight optimization processing reduces the memory and the amount of computation required for Forward processing. As a result, the overall operation of the NN algorithm can be executed at high speed.
Action 4-2
(Effects of optimization device 2 specialized for Forward processing execution: having a data memory area reuse means)
By giving a flag indicating Forward-only execution as an argument to the Forward processing execution unit (Chainer or the virtual machine), the memory required for Forward processing can be reduced. As a result, the overall operation of the NN algorithm can be executed at high speed.
5. Specific Procedure of the Mounting Method According to the Embodiment
The mounting method according to the embodiment includes a first method and a second method.
5-1. First Method
FIG. 24 is a schematic diagram showing a configuration example of a mounting apparatus used in the mounting method (first method) according to an embodiment of the present invention. As shown in FIG. 24, the mounting apparatus according to one embodiment mainly includes an evaluation board (motherboard) 100 and an embedded chip (embedded semiconductor integrated circuit) 200 detachably mounted on the evaluation board 100.
The evaluation board 100 mainly includes a CPU 101, a main memory 102, a communication I/F 103, and an external memory 104. These components are electrically connected via an internal bus 109.
The CPU 101 loads various programs, such as an operating system, from the external memory 104 into the main memory 102 and executes the instructions included in the loaded programs. The main memory 102 is used to store the programs executed by the CPU 101 and is constituted by, for example, a DRAM.
The communication I/F 103 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination of these, and is configured to be able to communicate, via a communication network (not shown) including Ethernet (registered trademark) and the Internet, with a computer and input/output devices (not shown) operated by a developer or the like. The communication I/F 103 can also communicate with a communication I/F 204, described later, of the embedded chip 200. The external memory 104 is constituted by, for example, a flash memory and stores various programs such as an operating system.
Next, the embedded chip 200 includes a CPU 201, an accelerator (auxiliary arithmetic unit) 202, a main memory 203, a communication I/F 204, and an external memory 205. These components are electrically connected via an internal bus 209. The embedded chip can optionally include a GPU (not shown).
The CPU 201 loads source code received from (the communication I/F 103 of) the evaluation board 100 via the communication I/F 204 (for example, source code written in Python) into the main memory 203 and executes each piece of code included in the loaded source code.
The accelerator 202 loads source code received from (the communication I/F 103 of) the evaluation board 100 via the communication I/F 204 (for example, source code written in C, assembler, or the like) into the main memory 203 and executes each piece of code included in the loaded source code. The main memory 203 is used to store the source code executed by the CPU 201 and the accelerator 202 and is constituted by, for example, a DRAM.
The communication I/F 204 communicates with the communication I/F 103 of the evaluation board 100 to transmit and receive various information. The external memory 205 is constituted by, for example, a flash memory and stores various data.
FIG. 25 is a flowchart showing an example of the procedure used in the mounting method according to an embodiment of the present invention.
First, in step 301, source code written in a first programming language (for example, Python) is executed on a personal computer or the like. Based on the execution result, the developer or the like confirms whether the source code operates on the personal computer or the like. The personal computer or the like here refers to a computer with abundant computing resources and includes, for example, the learning device according to the embodiment described in Part 1 above. The state in which the source code operates on the personal computer or the like in step 301 is the same state as step I described in "4-1" above.
In step 302, the source code written in Python or the like, confirmed in step 301 to operate on the personal computer or the like, is executed by the CPU 201 of the embedded chip 200 using the evaluation board 100. Based on the execution result, the developer or the like confirms whether this source code is operable by the CPU 201. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Here, the source code written in Python or the like can be passed to the CPU 201 via the communication I/F 103 of the evaluation board 100 and the communication I/F 204 of the embedded chip 200. If it turns out that this source code is not operable by the CPU 201, the developer or the like corrects the source code and repeats step 302. When it is confirmed that this source code is operable by the CPU 201, the developer or the like proceeds to the next step 303.
In step 303, the developer or the like rewrites (at least part of) the source code confirmed in step 302 to be operable by the CPU 201 in a second programming language (for example, C or assembler) so that it can be run on the accelerator 202.
In step 304, the source code rewritten in C or the like in step 303 is executed by the accelerator 202 of the embedded chip 200 using the evaluation board 100. Based on the execution result, the developer or the like confirms whether the rewritten source code is operable by the accelerator 202. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Here, the source code written in C or the like can be passed to the accelerator 202 via the communication I/F 103 of the evaluation board 100 and the communication I/F 204 of the embedded chip 200. If it turns out that this source code is not operable by the accelerator 202, the developer or the like corrects the source code and repeats step 304. When it is confirmed that this source code is operable by the accelerator 202, the developer or the like proceeds to the next step 305.
In step 305, on the evaluation board 100, the result obtained when the CPU 201 executes the first specific code (the code to be verified) in the source code written in Python or the like is compared (for example, using a module referred to as a unit test executed by the embedded chip 200) with the result obtained when the accelerator 202 executes the second specific code in the source code written in C or the like, the second specific code being the first specific code rewritten from Python or the like into C or the like, and the comparison result is output. Based on the comparison result, the developer or the like verifies whether the same output is obtained for the same input in both execution results. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Until this verification is completed, the developer or the like repeats steps 303 to 305 described above. When this verification is completed, the developer or the like proceeds to the next step 306.
In step 306, the developer or the like tunes the source code written in C or the like so that it runs even faster on the accelerator 202.
In step 307, on the evaluation board 100, the result obtained when the CPU 201 executes the source code written in Python or the like is compared (for example, using a module referred to as a unit test executed by the embedded chip 200) with the result obtained when the accelerator 202 executes the source code written in C or the like tuned in step 306, and the comparison result is output. Based on the comparison result, the developer or the like verifies whether the same output is obtained for the same input in both execution results. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Until this verification is completed, the developer or the like repeats steps 306 and 307 described above. When this verification is completed, the developer or the like proceeds to the next step 308.
When step 307 is completed, the embedded chip 200 is in a state of operating on two bodies of source code, one in Python or the like and one in C or the like. This state will be described with reference to FIG. 26. FIG. 26 is a schematic diagram showing operating states of the embedded chip in the mounting method according to an embodiment of the present invention.
As shown in FIG. 26, in step 301 (which corresponds to step I), the calling side of a function (that is, the entity that calls the function) is written in Python or the like, and the called side (that is, the function being called) is also written in Python or the like.
Next, in steps 302 to 307, the calling side of a function is still written in Python or the like, while the called side is a mixture of code written in Python or the like and code written in C or the like. That is, when step 307 is completed, the embedded chip 200 is in a state of operating on two bodies of source code, one in Python or the like and one in C or the like.
What the mounting method according to the present embodiment ultimately aims for is, as shown at the right end of FIG. 26, a state in which both the calling side and the called side are written in C or the like, that is, a state in which the embedded chip 200 operates only on source code written in C or the like.
Therefore, returning to FIG. 25, in step 308, the developer or the like rewrites into C or the like all portions of the source code written in Python that have not yet been rewritten, so that the embedded chip 200 operates only on source code written in C or the like. In step 308, the embedded chip 200 is thus decoupled from Python. The source code written in C or the like generated in this way is stored in the external memory 205 or the like of the embedded chip 200. As a result, the embedded chip 200 can read the source code stored in the external memory 205 or the like, have the accelerator 202 execute it, and thereby execute machine learning. This state is the state targeted by the mounting method according to the embodiment, in which the problems described in "1" and "2" above have been solved.
5-2. Second Method
FIG. 27 is a schematic diagram showing a configuration example of a mounting apparatus used in the mounting method (second method) according to an embodiment of the present invention. The mounting apparatus used in the second method (FIG. 27) differs from the mounting apparatus used in the first method (FIG. 24) in that the embedded chip 200 does not include a CPU. In the second method, the operations performed by the CPU 201 of the embedded chip 200 in the first method are performed by a CPU provided in an external computer (not shown), such as a personal computer. For example, the computer (personal computer or the like) referred to here may be the learning device described in Part 1 above (such as the personal computer illustrated in FIG. 11).
In the mounting method performed by the second method, the mounting method described with reference to FIG. 25 is modified so that the operations executed by the CPU 201 of the embedded chip 200 in steps 302, 305, and 307 are performed by the CPU provided in the external computer (not shown). To realize this, the evaluation board 100 shown in FIG. 27 may be communicably connected to the external computer (not shown), for example via the communication I/F 103, so that it can cause the CPU of that computer to execute the source code written in Python and receive the execution result.
 6. Configuration of the Mounting Apparatus
 Next, the configuration required for the mounting apparatus 100 according to the above-described embodiment to realize the technique described in "5" above will be described.
 6-1. Definitions of Terms Used to Describe the Configuration of the Present Invention
 (Difference between a class and a module)
 A module is a collection of procedures and data defined and implemented to achieve a specific purpose (a concept independent of whether a specific programming language provides support for it).
 A class is a module defined and implemented with the support of an object-oriented language such as Python.
 (Python layer and Native layer)
 The Native layer refers to the layer consisting of the Native I/F and the implementations (software and hardware) called from it. The Python layer refers to the software layer that is assumed to be executed on the Python language. Chainer is currently written in the Python language, but Chainer may be ported to another programming language in the future; the functions described here as belonging to the Python layer are not necessarily specialized for the Python language. As for the division of roles between the two layers, the Python layer is assumed to be a development environment with a high level of abstraction better suited to algorithm design, while the Native layer is assumed to be a development environment with a low level of abstraction that is more concretely aware of the hardware configuration.
 (Correspondence between computers and execution units)
 FIG. 52 is a diagram illustrating an example of cooperation with existing execution units according to an embodiment of the present invention.
 An execution unit is a method of the Function/Optimizer class that actually computes the neural network algorithm.
 An existing execution unit is the general-purpose computer execution unit, the GPU execution unit, or both. The general-purpose computer execution unit computes the NN algorithm using a general-purpose computer. The GPU execution unit computes the NN algorithm using a GPU.
 The Native execution unit computes the NN algorithm using the implementation of the Native layer. Since the Native layer is implemented for each type of computer, all computer types (general-purpose computer, GPU, accelerator) can be operated through the Native I/F.
 6-2. Configuration of the Mounting Unit
 FIG. 28 is a schematic diagram conceptually showing the functions of the mounting apparatus according to an embodiment of the present invention. As shown in FIG. 28, the mounting unit 400 mainly includes a drive unit 401, a Function class/Optimizer class 402, a general-purpose computer execution unit 403, a GPU execution unit 404, a Native layer execution unit 405, a multidimensional array 406 for general-purpose computers, a multidimensional array 407 for GPUs, a multidimensional array 408 for Native, and a Variable class 409.
The drive unit 401 mainly includes an execution unit that instructs the Function class/Optimizer class 402 to execute a given algorithm (function), and a comparison unit that compares the result of executing that algorithm (function) by the general-purpose computer execution unit 403 (or by the GPU execution unit 404) with the result of executing it by the Native layer execution unit 405, for example using a module referred to as a unit test, and outputs the comparison result.
The Function class/Optimizer class 402 causes at least one of the general-purpose computer execution unit 403, the GPU execution unit 404 and the Native layer execution unit 405 to execute the algorithm (function) instructed by the drive unit 401.
The general-purpose computer execution unit 403 acquires the multidimensional array corresponding to the algorithm (function) instructed by the Function class/Optimizer class 402 from the multidimensional array 406 for general-purpose computers and executes that algorithm (function) using a CPU. The execution result is returned to the drive unit 401 via the Function class/Optimizer class 402.
The GPU execution unit 404 acquires the multidimensional array corresponding to the algorithm (function) instructed by the Function class/Optimizer class 402 from the multidimensional array 407 for GPUs and executes that algorithm (function) using a GPU. The execution result is returned to the drive unit 401 via the Function class/Optimizer class 402.
The Native layer execution unit 405 acquires the multidimensional array corresponding to the algorithm (function) instructed by the Function class/Optimizer class 402 from the multidimensional array 408 for Native and executes that algorithm (function) using an accelerator. The execution result is returned to the drive unit 401 via the Function class/Optimizer class 402.
The Variable class 409 holds all the multidimensional arrays used by the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPUs and the multidimensional array 408 for Native, and supplies the corresponding multidimensional arrays to each of them.
When the first technique described in "5-1" above is adopted as the mounting technique, all the components shown in FIG. 28 are arranged on the embedded chip 200 (see FIG. 24). In this case, the general-purpose computer execution unit 403 executes the algorithm (function) using the CPU 201 mounted on the embedded chip 200, the GPU execution unit 404 executes the algorithm (function) using a GPU (not shown) mounted on the embedded chip 200, and the Native layer execution unit 405 executes the algorithm (function) mainly using the accelerator 202 mounted on the embedded chip 200.
On the other hand, when the second technique described in "5-2" above is adopted as the mounting technique, among the components shown in FIG. 28, the Function class/Optimizer class 402, the general-purpose computer execution unit 403, the GPU execution unit 404, the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPUs and the Variable class 409 are arranged on an externally provided computer (a personal computer or the like). In this case, the implementation of the Native layer is still arranged on the embedded chip 200. Furthermore, in this case, the general-purpose computer execution unit 403 executes the algorithm (function) using the CPU of the externally provided computer, and the GPU execution unit 404 executes the algorithm (function) using the GPU of the externally provided computer.
 6-3. Configuration of the Native Layer Execution Unit
 Next, the configuration of the above-described Native layer execution unit 405 will be described. FIG. 29 is a schematic diagram illustrating a configuration example of the Native layer execution unit included in the mounting apparatus according to an embodiment of the present invention.
 As shown in FIG. 29, on the Python layer side the Native layer execution unit 405 mainly includes a NativeDevice class 501, a NativeArray class 502, a Function class/Optimizer class 503, and a bytecode generation unit 504. The Function class/Optimizer class 503 shown in FIG. 29 and the Function class/Optimizer class 402 shown in FIG. 28 are the same component, and the NativeArray class 502 shown in FIG. 29 and the multidimensional array 408 for Native shown in FIG. 28 are the same component.
 Furthermore, on the Native layer side the Native layer execution unit 405 mainly includes a device management module 505, a data conversion module 506, a multidimensional array module 507, a Function module/Optimizer module 508, a virtual machine module 509, and a memory pool module 510.
The NativeDevice class 501 wraps the device management module of the Native layer in the Python layer and hides function calls to, and data input/output with, the Native layer. The NativeArray class 502 wraps the multidimensional arrays of the Native layer in the Python layer. Of the Function class/Optimizer class 503, the Function class wraps the Function module of the Native layer in the Python layer, and the Optimizer class wraps the Optimizer module of the Native layer in the Python layer. The Function class and the Optimizer class are already implemented in Chainer and have a function of hiding the difference between execution on a general-purpose computer and execution on a GPU; by extending this function, execution in the Native layer can also be hidden.
 The bytecode generation unit 504 generates bytecode.
 Details of the components illustrated in FIG. 29 will be described later.
 7. Effects of the Mounting Apparatus According to the Embodiment
 Since deep learning is a developing technology in which research and development are active, it is expected that new layer algorithms with better performance than conventional ones will be invented during the development period for an embedded chip, and that a demand will arise to incorporate such new algorithms into the software or hardware implementation under development.
 In order to bring a neural network configuration that includes a new layer algorithm to a state in which it operates in the embedded environment while satisfying product-level specifications, the following development steps must be taken.
 1. Implement and verify the algorithm in an environment where abundant computational resources such as GPUs are available.
 2. Combine the algorithm implemented and verified in step 1 with the neural network modules whose optimized implementation on the embedded chip has already been completed, and verify the operation. Depending on the result, apply optimizations specialized for the chip in question to the algorithm implemented and verified in step 1.
 3. After the work of step 2 is completed, use only the neural network implementation optimized for the chip in question, combine it with the other modules (such as the control systems for sensors and motors), and verify with various test items whether the product-level specifications are satisfied.
When operating a neural network algorithm on Python, the mounting apparatus according to the embodiment has a configuration in which an execution unit using the Python language running on a general-purpose computer, an execution unit using a GPU, and an execution unit using an optimized implementation for a specific chip are selectively invoked for each layer, and a configuration in which, by way of the bytecode, the entire neural network algorithm can be operated using only the optimized implementation for that specific chip.
 Between steps 1 and 2 described in the preceding paragraph, the algorithm implementation code created in step 1 can be reused for step 2, and differences in the operation results between steps 1 and 2 can easily be compared and examined. Furthermore, between steps 2 and 3, the optimized implementation created for step 2 can be reused for step 3, and conversely, fixes for defects in the optimized implementation found in step 3 can also be reused for step 2. As a result, a state in which a neural network configuration including a new layer algorithm operates in the embedded environment while satisfying product-level specifications can be realized at the minimum development cost.
 Definition of Terms
 The following terms are defined for the detailed description of the embodiments of the present invention.
 "Overall operation" denotes a processing unit in which forward processing alone, or forward processing, backward processing and weight update processing, are executed repeatedly. This overall operation is what is assumed as a typical embodiment of neural network learning and identification.
 8. Native Layer
 Next, the configuration relating to the Native layer of the mounting apparatus according to the embodiment illustrated in FIG. 29 will be described.
 8-1. Device Management Module
 The device management module 505 performs initialization and release processing of a device (the software and hardware state on which the optimized implementation depends). The specific processing performed in the device management module 505 differs depending on the form of the device, but typical processing includes, for example, reserving and releasing the memory pool described later. The device does not need to be on the same chip or the same board as the general-purpose computer that executes Chainer and Python; an optimized implementation that communicates with, initializes and releases a device on a separate board is also possible.
 Definition examples of functions that initialize or release a device are shown below.
 (Example 1) Device* chnr_init_device(void)
 This initializes a device.
 (Example 2) void chnr_release_device(Device* device)
 This releases a device.
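 As a concrete illustration only, the following is a minimal, self-contained C sketch of how such an initialization/release pair might be stubbed out and used on a plain host. The contents of the Device structure and the processing inside the two functions are assumptions, since they depend entirely on the optimized implementation (memory pools, DMA handles, accelerator state and so on).

 #include <stdlib.h>
 #include <stdio.h>

 /* Assumed stand-in for the device state; the real contents depend on the
  * optimized implementation. Here only a pre-allocated work area is kept. */
 typedef struct Device {
     void *memory_pool;   /* placeholder for a reserved work area */
 } Device;

 /* chnr_init_device: allocate and initialize the device state. */
 Device *chnr_init_device(void) {
     Device *dev = (Device *)malloc(sizeof(Device));
     if (dev != NULL) {
         dev->memory_pool = malloc(1024 * 1024);  /* e.g. reserve a 1 MiB pool */
     }
     return dev;
 }

 /* chnr_release_device: release everything acquired in chnr_init_device. */
 void chnr_release_device(Device *device) {
     if (device != NULL) {
         free(device->memory_pool);
         free(device);
     }
 }

 int main(void) {
     Device *dev = chnr_init_device();
     if (dev == NULL) {
         fprintf(stderr, "device initialization failed\n");
         return 1;
     }
     /* ... forward/backward processing would run here ... */
     chnr_release_device(dev);
     return 0;
 }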
 8-2. Function Module
 The Function module 508 is a group of functions that perform the computation of each layer of the neural network, and defines the following functions for each type of layer.
 chnr_forward_xxxx(…)
 - implements the forward processing (floating-point version)
 chnr_backward_xxxx(…)
 - implements the backward processing (floating-point version)
 chnr_forward_xxxx_fx(…)
 - implements the forward processing (fixed-point version)
 chnr_backward_xxxx_fx(…)
 - implements the backward processing (fixed-point version)
 Here, xxxx represents a name assigned for each type of layer.
 The specific processing contents of each function include those exemplified in "2-6" to "2-12" of the first part above.
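 As an illustration of the naming convention only, the sketch below shows what a floating-point forward/backward pair for a ReLU-like layer could look like. The argument lists are hypothetical simplifications (flat float buffers rather than the multidimensional-array type of "8-3"), since the "(…)" argument lists above are intentionally left unspecified.

 #include <stddef.h>

 /* Hypothetical floating-point forward pass for a ReLU-style layer:
  * y[i] = max(x[i], 0). The real chnr_forward_xxxx functions operate on the
  * Native multidimensional array type; flat buffers are used here only to
  * keep the sketch self-contained. */
 void chnr_forward_relu(float *y, const float *x, size_t n) {
     for (size_t i = 0; i < n; ++i) {
         y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
     }
 }

 /* Hypothetical backward pass: gx[i] = gy[i] where x[i] > 0, else 0. */
 void chnr_backward_relu(float *gx, const float *gy, const float *x, size_t n) {
     for (size_t i = 0; i < n; ++i) {
         gx[i] = (x[i] > 0.0f) ? gy[i] : 0.0f;
     }
 }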
 8-3. Multidimensional Array (MDArray) Module
 The multidimensional array module 507 manages the multidimensional arrays that are input to and output from the Functions of the Native layer. The multidimensional array module 507 can manage arrays of arbitrary size and number of dimensions. In addition, as described later, it has a mechanism for mutual conversion with Numpy (the multidimensional array class of the Python layer on which Chainer depends) and with multidimensional array libraries for GPUs.
 Furthermore, the multidimensional array module 507 can hold not only floating-point types but also fixed-point types. This makes it easy to realize neural network computation even on hardware that has no FPU (floating-point unit). The multidimensional array module 507 also has functions for mutual conversion with floating-point multidimensional arrays.
 An implementation example of the multidimensional array module 507 will now be described.
 FIG. 30 shows an example of the structure definition of the multidimensional array module of the mounting apparatus according to an embodiment of the present invention.
 Function definition examples are as follows.
 (Example 1) MDArray chnr_create_md_array(dimensions[], numaxis, type)
 This generates and initializes a multidimensional array.
 (Example 2) void chnr_delete_md_array(MDArray* mdrray)
 This deletes a multidimensional array.
 (Example 3) void chnr_md_array_add(MDArray* dst, MDArray* a, MDArray* b)
 This adds the elements of two multidimensional arrays.
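 Since the structure definition of FIG. 30 is not reproduced in this excerpt, the following self-contained sketch uses an assumed, simplified MDArray (float-only, a buffer plus shape information) just to show how the three functions above fit together; the real structure additionally carries data-type (float/fixed-point) and device-dependent information, and the type argument of Example 1 is omitted here because the sketch is float-only.

 #include <stdlib.h>

 #define MD_MAX_AXIS 8

 /* Simplified stand-in for the MDArray structure of FIG. 30 (assumption):
  * only a float buffer, the per-axis sizes and the axis count are kept. */
 typedef struct {
     float *data;
     int    dimensions[MD_MAX_AXIS];
     int    numaxis;
     size_t numelem;
 } MDArray;

 /* Create and zero-initialize a multidimensional array. */
 MDArray chnr_create_md_array(const int dimensions[], int numaxis) {
     MDArray a;
     a.numaxis = numaxis;
     a.numelem = 1;
     for (int i = 0; i < numaxis; ++i) {
         a.dimensions[i] = dimensions[i];
         a.numelem *= (size_t)dimensions[i];
     }
     a.data = (float *)calloc(a.numelem, sizeof(float));
     return a;
 }

 /* Delete the array (release its buffer). */
 void chnr_delete_md_array(MDArray *mdarray) {
     free(mdarray->data);
     mdarray->data = NULL;
 }

 /* Element-wise addition: dst = a + b (shapes are assumed to match). */
 void chnr_md_array_add(MDArray *dst, MDArray *a, MDArray *b) {
     for (size_t i = 0; i < dst->numelem; ++i) {
         dst->data[i] = a->data[i] + b->data[i];
     }
 }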
 Next, the memory management of the multidimensional arrays of the multidimensional array module 507 will be described.
 Management (generation and destruction) of the memory areas that store the bodies of the multidimensional arrays is realized in the Native layer. In an embedded environment, one can imagine environments whose memory configuration cannot be managed by the memory management mechanism (malloc/free) provided as standard by an OS such as Linux (registered trademark). Considering the division of roles in the software hierarchy, in which the Python layer is for algorithm development and the Native layer is for development strongly aware of the hardware, it is appropriate to implement the management mechanism responsible for such characteristics of the hardware environment in the Native layer. This memory management mechanism can also be reused when the virtual machine described later is used (that is, when the dependency on the Python layer is removed).
 A class that wraps the multidimensional arrays of the Native layer is prepared in the Python layer, and the generation and release timing of the memory areas is made to coincide with the lifetime of instances of this Python class. Such a mechanism is necessary in order to handle multidimensional arrays naturally in Python code; the "Define-by-Run" function also depends on the Python memory management mechanism.
FIG. 31 shows the mutual compatibility and reference relationships of the multidimensional array data.
 8-4. Memory Pool Module
 The memory pool module 510 is a mechanism for reducing the number of calls to a memory management mechanism whose cost (for example, in processing cycles) is high, by reusing memory areas that have been reserved once.
 Function definition examples are as follows.
 (Example 1) void chnr_momory_pool_init(MemoryPool* momory_pool)
 This initializes the memory pool.
 (Example 2) void chnr_momory_pool_release(MemoryPool* momory_pool)
 This destroys the memory pool.
 (Example 3) void* chnr_momory_pool_alloc_buffer(MemoryPool* momory_pool, int byte_size, void* old_addr)
 This reserves memory.
 (Example 4) void chnr_momory_pool_free_buffer(MemoryPool* momory_pool, void* addr)
 This releases memory.
 Background for requiring a memory pool module in the Native layer (1)
 Chainer's "Define-by-Run" depends on the dynamic memory management mechanism of Python. An example (the forward processing of the Linear Function) is shown in FIG. 32. In this example, the statement Wx = x.dot(self.W.T) on the third line newly generates an instance of Wx (whose content is a multidimensional array). Wx is automatically destroyed by the Python memory management mechanism once it is no longer referenced from any variable.
 The size of the data output by a Function (Wx in the above example) can change dynamically depending on the input data size and parameters, and the reservation of its body (memory area) is also executed within the code flow of the forward or backward processing. To realize "Define-by-Run" (defining the network configuration at execution time), a mechanism is required that reserves memory of the required size at such required timing.
 Background for requiring a memory pool module in the Native layer (2)
 By preparing a class in the Python layer that wraps the multidimensional arrays of the Native layer, and by making the lifetime of the Native layer multidimensional arrays coincide with that of the Python layer, the implementation of the Native layer can be used while enjoying the flexibility of "Define-by-Run".
 However, since a Function is usually called at a high frequency, calling a costly memory management mechanism of the Native layer such as malloc or free on each call may cause a decrease in processing speed. For this reason, it becomes necessary to prepare a function for reusing memory areas that have been reserved once (a memory pool).
 Memory pool implementation example (1)
 A structure definition example is shown in FIG. 33.
 The processing flow when reserving memory is as follows.
 1. Search the buffer_size array for an index whose released flag is 1 and whose size at the time of the previous reservation matches the size to be reserved this time; if such an index is found, set the released flag to 0 and return the value of buffer_addr at the same index (the address of the memory buffer). Here, the released flag is managed, for example, by the sign bit of the buffer_size array element. By searching the array elements using the combination of the previously reserved size and the address, the amount of address churn can also be reduced.
 2. If no matching index is found, actually reserve memory (by calling malloc or the like), add its address and size to the arrays, and return the address.
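 The search-and-reuse flow above can be sketched in C as follows. Because the structure of FIG. 33 is not reproduced in this excerpt, the fixed-capacity parallel arrays used here, and the way the old_addr hint is handled, are assumptions that merely follow the described flow (released flag in the sign bit, reuse by matching size, fallback to malloc).

 #include <stdlib.h>

 #define POOL_CAPACITY 256

 /* Assumed stand-in for the structure of FIG. 33: parallel arrays of buffer
  * addresses and sizes. A negative size encodes the "released" flag via the
  * sign bit, as described above. */
 typedef struct {
     void *buffer_addr[POOL_CAPACITY];
     long  buffer_size[POOL_CAPACITY];  /* > 0: in use, < 0: released */
     int   num_buffers;
 } MemoryPool;

 /* Reserve a buffer of byte_size, preferring a released buffer of the same
  * size (and, when possible, the same address as old_addr to reduce address
  * churn); fall back to malloc only when no reusable buffer exists. */
 void *chnr_momory_pool_alloc_buffer(MemoryPool *pool, int byte_size, void *old_addr) {
     int candidate = -1;
     for (int i = 0; i < pool->num_buffers; ++i) {
         if (pool->buffer_size[i] == -(long)byte_size) {   /* released, size matches */
             if (pool->buffer_addr[i] == old_addr) {       /* best case: same address */
                 candidate = i;
                 break;
             }
             if (candidate < 0) candidate = i;             /* remember first match */
         }
     }
     if (candidate >= 0) {
         pool->buffer_size[candidate] = byte_size;         /* clear released flag */
         return pool->buffer_addr[candidate];
     }
     if (pool->num_buffers >= POOL_CAPACITY) return NULL;  /* pool table full */
     void *addr = malloc((size_t)byte_size);               /* actually reserve memory */
     if (addr != NULL) {
         pool->buffer_addr[pool->num_buffers] = addr;
         pool->buffer_size[pool->num_buffers] = byte_size;
         pool->num_buffers++;
     }
     return addr;
 }

 /* Mark a buffer as released (reusable) without calling free(). */
 void chnr_momory_pool_free_buffer(MemoryPool *pool, void *addr) {
     for (int i = 0; i < pool->num_buffers; ++i) {
         if (pool->buffer_addr[i] == addr) {
             if (pool->buffer_size[i] > 0) pool->buffer_size[i] = -pool->buffer_size[i];
             return;
         }
     }
 }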
 Memory pool implementation example (2)
 As the processing when releasing memory, the address to be released is searched for in the buffer_addr array, and if the address is found, its released flag is set to 1.
 As the processing when releasing the memory pool itself, the memory of the elements of the buffer_addr array for which an address is set is released (for example, by calling the free function).
 Effect of the memory pool implementation
 In most neural networks, the combination of memory sizes does not change from one learning iteration to the next but is fixed, so by using a memory pool implementation such as the one described above, calls to malloc and the like can be confined to the first iteration only.
 8-6. Optimizer Module
 The Optimizer module 508 is a group of functions that perform the weight update for each layer having weights in the neural network. The Optimizer module 508 defines the following functions for each weight update algorithm.
 (Example 1) chnr_op_init_state_xxxx(…)
 This implements the internal-state initialization processing of the weight update algorithm (floating-point version).
 (Example 2) chnr_op_update_one_xxxx(…)
 This implements the weight update processing (floating-point version).
 (Example 3) chnr_op_init_state_xxxx_fx(…)
 This implements the internal-state initialization processing of the weight update algorithm (fixed-point version).
 (Example 4) chnr_op_update_one_xxxx_fx(…)
 This implements the weight update processing (fixed-point version).
 Here, xxxx represents a name assigned for each weight update algorithm.
 The weight update algorithms can include those described in "2-13" of the first part above.
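 For illustration, a hypothetical chnr_op_update_one_sgd following the naming rule above might look like the sketch below. The argument list is an assumption (the "(…)" lists are not given), and flat float buffers are used instead of the multidimensional-array type so that the sketch stays self-contained.

 #include <stddef.h>

 /* Hypothetical plain-SGD weight update (floating-point version):
  * w[i] -= lr * gw[i]. Plain SGD keeps no internal state, so the matching
  * chnr_op_init_state_sgd would be empty; stateful algorithms such as
  * momentum SGD or Adam would initialize their state arrays there instead. */
 void chnr_op_update_one_sgd(float *w, const float *gw, size_t n, float lr) {
     for (size_t i = 0; i < n; ++i) {
         w[i] -= lr * gw[i];
     }
 }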
 8-7. Data Conversion (Converter) Module (1)
 The data conversion module 506 is a group of functions that convert data formats.
 Function definition examples are as follows.
 (Example 1) chnr_float_to_fixed(MDArray* dst, MDArray* src, int Q)
 This converts from the floating-point type to the fixed-point type.
 (Example 2) chnr_fixed_to_float(MDArray* dst, MDArray* src)
 This converts from the fixed-point type to the floating-point type.
 (Example 3) chnr_host_to_device(MDArray* dst, float* src_data, int src_dimensions[], int src_num_axis, int Q, int async, …)
 This converts from the device-independent data representation (described later) to the device-dependent data representation (described later).
 (Example 4) chnr_device_to_host(float* dst_data, int dst_dimensions[], int* dst_num_axis, MDArray* src, int async, …)
 This converts from the device-dependent data representation to the device-independent data representation.
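 The Q-format arithmetic behind Examples 1 and 2 can be illustrated with the following self-contained sketch over flat buffers. The real functions operate on the MDArray type, and the rounding and saturation policy shown here is an assumption; the function names are illustrative and are not the Native I/F names.

 #include <stdint.h>
 #include <stddef.h>
 #include <math.h>

 /* Convert floating-point values to Q-format fixed point (Q fractional bits):
  * fixed = round(value * 2^Q), saturated to the 32-bit range. */
 void float_to_fixed_q(int32_t *dst, const float *src, size_t n, int Q) {
     const double scale = ldexp(1.0, Q);               /* 2^Q */
     for (size_t i = 0; i < n; ++i) {
         double v = round((double)src[i] * scale);
         if (v > (double)INT32_MAX) v = (double)INT32_MAX;   /* saturate on overflow */
         if (v < (double)INT32_MIN) v = (double)INT32_MIN;
         dst[i] = (int32_t)v;
     }
 }

 /* Convert Q-format fixed point back to floating point: value = fixed / 2^Q. */
 void fixed_to_float_q(float *dst, const int32_t *src, size_t n, int Q) {
     const double scale = ldexp(1.0, -Q);              /* 2^-Q */
     for (size_t i = 0; i < n; ++i) {
         dst[i] = (float)((double)src[i] * scale);
     }
 }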
 Effect of the conversion between floating-point and fixed-point types
 Many semiconductor chips for embedded use aim to reduce the hardware resources required for a given amount of computation (the number of transistors and the power consumption) by omitting the FPU (floating-point unit), or at least by adopting a circuit design that does not use an FPU for large-scale parallel computation. When a numerical algorithm such as a neural network is executed without an FPU, a data type called the fixed-point type, which expresses numerical values including the fractional part using integer arithmetic units and shift units, is often used. The floating-point type is a data type suited to algorithm design in the sense that real values can be handled more intuitively, while the fixed-point type is a data type suited to the effective use of hardware resources. By providing conversion functions between such data types within the framework for designing and executing neural networks, the development of a neural network algorithm can proceed in stages, within a unified environment, from the mathematical side toward a hardware-conscious implementation, while checking the influence of the type conversion on a per-Function basis.
 The device-independent data representation is a data representation that has no information dependent on a specific computer. A typical implementation of this data representation is a C-style multidimensional array whose memory addresses are contiguous. If a library such as numpy is used in the Python layer, such a data representation can be handled easily, but this does not restrict the library or host language.
 The device-dependent data representation is a data representation suited to an optimized implementation specialized for a specific computer.
 By providing functions that convert between these two data representations, an optimized implementation that is strongly aware of the hardware and an implementation that is aware of the algorithm (for example, readable code written in Python whose structure is close to the mathematical formulas) can cooperate to execute the overall operation.
 Examples of requirements to be considered in the conversion to the device-dependent data representation
 (1) Memory configuration: is the data placed in shared memory or in a hardware-specific memory area?
 (2) Memory alignment: start address, start address of each dimension, padding
 (3) Byte order: little endian, big endian
 (4) Data type: fixed point (Q value)/floating point, byte width (32 bit, 16 bit, 8 bit, …)
 (5) Scheduling of data input/output in multi-core execution
 (6) Data communication: communication processing when the device having the optimized implementation is on a separate chip or a separate board
 8-8. Communication Means
 By applying the following modifications to the implementation of the Native layer function group (Native I/F) described so far, the overall operation can be executed at high speed while communicating with a device located on a separate chip or a separate board:
 (1) RPC (remote procedure call)
 (2) instruction queue
 (3) reduction of the data communication volume of multidimensional arrays
 (4) asynchronous processing of transfer and computation
 To explain these modification policies, the following terms are defined.
 "Host device": the device that executes the overall operation (in a normal implementation, the device on which the Chainer code is executed on Python).
 "Remote device": a device that requires communication processing, for example because it is on a separate chip or a separate board.
 RPC (remote procedure call)
 When a function defined in the Native I/F is called, its processing requirements (memory reservation and execution of operations) are not executed directly; instead, information representing the processing request (the type of function and its arguments) is generated and transmitted to the remote device, the remote device executes processing based on that instruction, and a mechanism is provided by which the host device then receives the processing result.
 Instruction queue
 The communication of processing requests by RPC is not executed every time a function defined in the Native I/F is called; instead, the information representing the processing requests is temporarily accumulated in a queue (FIFO buffer), which makes the communication schedule more efficient.
 Reduction of the data communication volume of multidimensional arrays
 Since multidimensional arrays have an enormous data size, reducing their communication volume is an important issue for improving the speed of the overall operation. There are broadly the following two measures for reducing the communication volume:
 (1) reduce the number of transfers of multidimensional arrays;
 (2) reduce the data communication volume of each individual multidimensional array.
 Method for reducing the number of transfers of multidimensional arrays
 The data input to and output from the intermediate layers (the layers other than the input and output layers of the neural network), and the weight gradients, need only exist on the remote device, so no communication between devices is necessary for them. As for the "weights", it suffices to transfer them to the remote device at the initial stage of defining the network structure and to transfer them back to the host device when learning ends.
 The conversion functions between the device-independent data representation and the device-dependent data representation described for the data conversion (Converter) module 506 are well suited for managing such transfer timing. Specifically, each function performs the following processing:
 when converting from the device-independent data representation to the device-dependent data representation, data is transferred from the host device to the remote device;
 when converting from the device-dependent data representation to the device-independent data representation, data is transferred from the remote device to the host device.
 Method for reducing the data communication volume of individual multidimensional arrays
 Various algorithms for data compression are known:
 (1) lossless compression (Huffman coding, run-length compression, etc.);
 (2) lossy compression (DCT, scalar quantization, vector quantization, etc.).
 By specifying the type and parameters of these compression algorithms in the arguments of the functions that request data communication (the conversion functions between the device-independent and device-dependent data representations are assumed), the communication volume can be reduced using the data compression means best suited to the nature of the data and the accuracy requirements.
 Asynchronous processing of transfer and computation
 Many embedded chips have a configuration in which data communication and arithmetic processing are executed asynchronously by separate hardware. If the functions that request data communication (the conversion functions between the device-independent and device-dependent data representations are assumed) are provided in a non-blocking, asynchronously executing form, coding that is aware of such a hardware configuration (a technique generally called pipelining) can speed up the algorithm as a whole.
 FIG. 34 shows pipelining coding in pseudocode.
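 Since FIG. 34 itself is not reproduced in this excerpt, the following is a minimal sketch of the double-buffered pipelining idea, assuming hypothetical non-blocking transfer primitives (the actual Native I/F names and signatures for asynchronous transfer are not specified here): while the device processes buffer A, the next input is transferred into buffer B, and the two roles are swapped each iteration.

 /* Hypothetical non-blocking primitives, assumed for this sketch only. */
 extern void transfer_to_device_async(void *device_buf, const void *host_buf, int bytes);
 extern void wait_transfer(void *device_buf);          /* block until the transfer into device_buf completes */
 extern void run_forward_on_device(void *device_buf);  /* run the forward processing on the transferred input */

 /* Double-buffered pipeline: overlap the transfer of input i+1 with the
  * computation on input i. */
 void pipelined_forward(const void *host_inputs[], int num_inputs,
                        void *device_buf[2], int bytes_per_input) {
     if (num_inputs <= 0) return;
     transfer_to_device_async(device_buf[0], host_inputs[0], bytes_per_input);
     for (int i = 0; i < num_inputs; ++i) {
         int cur = i & 1;
         int nxt = (i + 1) & 1;
         wait_transfer(device_buf[cur]);                /* input i is now on the device */
         if (i + 1 < num_inputs) {                      /* start transferring input i+1 */
             transfer_to_device_async(device_buf[nxt], host_inputs[i + 1], bytes_per_input);
         }
         run_forward_on_device(device_buf[cur]);        /* compute while the next transfer runs */
     }
 }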
 8-9. Virtual Machine Module
 The virtual machine module 509 is a group of functions that realize the function of interpreting bytecode and executing the learning/identification processing of a neural network (forward processing, backward processing and weight update). The bytecode is assumed to be generated by the bytecode output unit of the Python layer described later, but bytecode generated by other software can also be interpreted and executed by the virtual machine module as long as its format is correct.
 Function definition examples are as follows.
 (Example 1) void chnr_init_with_bytecode(VMState* state, char* byte_code)
 This parses the bytecode and initializes the internal state of the virtual machine.
 (Example 2) void chnr_forward(VMState* state)
 This executes the forward processing.
 (Example 3) void chnr_backward(VMState* state)
 This executes the backward processing.
 (Example 4) void chnr_update_weight(VMState* state)
 This executes the weight update processing.
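 A typical learning loop driven by these four functions might look like the following sketch. It assumes that a header of the virtual machine module defining VMState and declaring the functions above is available (such a header is not reproduced here), that byte_code has already been loaded into memory by the caller, and that the loop count is arbitrary.

 /* #include "chnr_vm.h"   -- hypothetical header providing VMState and the
  *                           chnr_* declarations of the virtual machine module */

 /* Sketch of a learning loop using the virtual machine module. */
 void train_with_bytecode(char *byte_code, int num_iterations) {
     VMState state;
     chnr_init_with_bytecode(&state, byte_code);   /* parse bytecode, build internal state */
     for (int i = 0; i < num_iterations; ++i) {
         /* (copy the next mini-batch into the input arrays of the VM here) */
         chnr_forward(&state);                     /* forward processing  */
         chnr_backward(&state);                    /* backward processing */
         chnr_update_weight(&state);               /* weight update       */
     }
 }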
 Bytecode format example
 The following information is stored in a binary format.
 (1) Input/output data information: {number of array dimensions and sizes, data type (float32, FixedPoint)} × number of Variables
 (2) Weight information: {number of array dimensions and sizes, data type (float32, FixedPoint), realized values} × number of weights
 (3) Function call information for backward processing: {type of Function, indices of the input/output data, indices of the weight information, parameters specific to each type of Function} × number of Functions
 (4) Type and parameters of the weight update
 Furthermore, the indices of the multidimensional arrays that serve as the input and output of the overall neural network processing may be added to the bytecode. By storing these indices in the bytecode, user code that uses the virtual machine can appropriately associate, with respect to function calls, the multidimensional arrays that serve as the input of the overall neural network processing with the multidimensional arrays that serve as its output. For example, this association can be performed by the following flow.
 (Step 1) The user code obtains the input multidimensional arrays of the overall processing by calling a function prepared in the configuration of the virtual machine.
 (Step 2) The user code copies the input data into the multidimensional arrays obtained in step 1.
 (Step 3) The user code calls a function, prepared in the configuration of the virtual machine, that executes the overall operation.
 (Step 4) The user code obtains the output multidimensional arrays of the overall processing by calling a function prepared in the configuration of the virtual machine (these multidimensional arrays hold the processing results of the overall operation executed in step 3; the functions of step 3 and step 4 do not necessarily have to be separate and may be a single function).
 (Step 5) The user code obtains the contents of the output data from the multidimensional arrays obtained in step 4.
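 Expressed as caller-side code, the five steps above could look like the sketch below. The accessor names (chnr_get_input_array, chnr_get_output_array, chnr_run_all) are hypothetical, since the description above deliberately leaves the names of the "functions prepared in the configuration of the virtual machine" open, and the sketch also assumes that the MDArray structure exposes its buffer through a data member.

 #include <string.h>

 /* Hypothetical accessors corresponding to steps 1, 3 and 4 above; the real
  * names and signatures would come from the virtual machine module's header. */
 extern MDArray *chnr_get_input_array(VMState *state, int index);
 extern MDArray *chnr_get_output_array(VMState *state, int index);
 extern void     chnr_run_all(VMState *state);      /* executes the overall operation */

 void identify_one_sample(VMState *state,
                          const float *input, size_t in_bytes,
                          float *output, size_t out_bytes) {
     MDArray *in  = chnr_get_input_array(state, 0);   /* step 1: obtain the input array  */
     memcpy(in->data, input, in_bytes);               /* step 2: copy the input data     */
     chnr_run_all(state);                             /* step 3: run the overall op      */
     MDArray *out = chnr_get_output_array(state, 0);  /* step 4: obtain the output array */
     memcpy(output, out->data, out_bytes);            /* step 5: read back the result    */
 }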
 Implementation example of the processing flow for initializing the internal state of the virtual machine
 (1) Interpret the "input/output data information" in the bytecode and generate the list of multidimensional arrays input to and output from the Functions.
 (2) Interpret the "weight information" in the bytecode and generate the list of weights and weight gradients (both multidimensional arrays).
 (3) Interpret the "Function call information for backward processing" in the bytecode and generate, for forward processing and for backward processing respectively, a list of structures (FunctionState) having the following information: the identification ID of the function that executes the forward/backward processing, the addresses of the input/output data, the addresses of the weights, the addresses of the weight gradients, and the parameters specific to each type of Function.
 (4) Interpret the "type and parameters of the weight update" in the bytecode and initialize the multidimensional arrays of the internal state of the weight update algorithm and a structure (OptimizerState) having the following information: the address of the function that executes the weight update, the addresses of the weights, the addresses of the weight gradients, the internal state of the weight update algorithm, and the parameters specific to each type of weight update.
FIG. 35 shows a configuration diagram of the internal state of the virtual machine.
 Execution flow example (1) of the virtual machine module (forward processing and backward processing)
 The virtual machine module executes processing such as the pseudocode shown in FIG. 36.
 Execution flow example (2) of the virtual machine (Optimizer)
 The virtual machine module executes processing such as the pseudocode shown in FIG. 37.
 Configuration 2 of the optimization device specialized for executing forward processing: the case with means for reusing the data memory area (1)
 When executing the overall operation, if only the identification processing is executed without learning (weight update), it suffices to execute only the forward processing. In this case, the following data become unnecessary:
 (1) among the data input and output between layers, those not accessed by the Function currently being executed;
 (2) the weight gradients;
 (3) the internal state of the weight update algorithm.
 When initializing the internal state of the virtual machine, the weight gradients and the internal state of the weight update algorithm therefore do not need to be reserved. For the data input and output between layers as well, the amount of memory to be reserved can be reduced, for example, by the procedure described in the next paragraph.
 Configuration 2 of the optimization device specialized for executing forward processing: the case with means for reusing the data memory area (2)
 Procedure example at the time of initializing the internal state of the virtual machine module:
 (1) Compute, for each Function, the sum of the sizes (memory sizes) of the data it inputs and outputs, and select the largest such sum.
 (2) When initializing the structures (MDArray) that handle the multidimensional arrays, set their addresses so that the memory area reserved in (1) is reused.
 By setting the addresses so that, for each layer, the left end and the right end of the memory area are used alternately as input and output, copying of the array data is avoided (a sketch of this address assignment is shown below). When a Function that performs input/output with a loop is included, the output data carried over to the next iteration are excluded from the reuse shown in this procedure, and a memory area is reserved for them individually.
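 The alternating ("ping-pong") address assignment described above can be sketched as follows: one shared region of the maximum required size is reserved, the network input is placed at one end, and each successive layer writes its output to the opposite end from the one it reads from. The concrete bookkeeping is an assumption that merely follows the described procedure.

 #include <stdint.h>
 #include <stddef.h>

 /* Assign output addresses for a forward-only (identification) execution.
  * region_size must be at least the maximum over all layers of
  * (input size + output size). The network input is assumed to be placed at
  * the left end of the region; layer 0 then writes its output at the right
  * end, layer 1 writes back at the left end, and so on, so each layer reads
  * from one end and writes to the other without any copying between layers.
  * out_addrs[i] receives the output address of layer i. */
 void assign_pingpong_addresses(uint8_t *region, size_t region_size,
                                const size_t out_sizes[], int num_layers,
                                uint8_t *out_addrs[]) {
     for (int i = 0; i < num_layers; ++i) {
         if (i % 2 == 0) {
             out_addrs[i] = region + region_size - out_sizes[i];  /* right end */
         } else {
             out_addrs[i] = region;                               /* left end  */
         }
     }
 }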
 Configuration 2 of the optimization device specialized for executing forward processing: the case with means for reusing the data memory area (3)
 FIG. 38 shows an example of the address settings for the data input and output by the Functions.
 Supplement regarding the bytecode format
 The explanation so far has shown an example in which the information stored in the "Function call information for backward processing" is simply executed in ascending or descending order. However, by storing a plurality of execution orders in the bytecode, or by storing repetition and branch instructions in the bytecode, it is also possible to execute more advanced processing, such as dynamically changing the configuration of the neural network according to the nature of the input data when the virtual machine is executed. The memory management mechanism described above for the memory pool module can be used to realize such a dynamic mechanism.
 Association of data input to and output from the virtual machine with external code
 A list of the data input to and output from the Functions is created in the "processing for initializing the internal state of the virtual machine"; the simplest method is for the external code that calls the functions of the virtual machine to access the elements of this list directly. If the variable names of the Python Variable instances are stored in the "input/output data information" when the bytecode is generated, the inputs and outputs can also be associated using these names.
 9. Python Layer
 Next, the configuration relating to the Python layer of the mounting apparatus according to the embodiment illustrated in FIG. 29 will be described.
 9-1. NativeArray Class
 The NativeArray class 502 is a class that wraps the multidimensional arrays of the Native layer in the Python layer. The NativeArray class 502 is generated as an instance corresponding one-to-one to a multidimensional array of the Native layer. The NativeArray class 502 has, as a basic function of a Python object, a lifetime management function based on reference counting. Furthermore, the NativeArray class 502 has a function of requesting the release of the corresponding multidimensional array of the Native layer when its lifetime ends.
 The NativeArray class 502 also holds a copy of the type information of the multidimensional array of the Native layer and has a function of conveying it to other objects in the Python layer.
 Furthermore, the NativeArray class 502 has functions such as data copying and element-wise addition of arrays, and has a function of requesting their execution from the Native layer.
 In addition, the NativeArray class 502 has a function of operating compatibly with the multidimensional array libraries of the Python layer on which Chainer depends, such as Numpy and GPUArray.
 9-2. NativeDevice Class
 The NativeDevice class 501 is a class that abstracts the optimized implementation and the reference implementation of the Native layer. The NativeDevice class 501 has a function of requesting the following processing from the Native layer in response to requests from other objects in the Python layer:
 (1) initialization and release of the device;
 (2) generation and copying of multidimensional arrays (generating a NativeArray instance of the Python layer that wraps them);
 (3) conversion between the device-independent data representation and the device-dependent data representation (conversion between floating point and fixed point can also be instructed);
 (4) execution of the processing of Functions and Optimizers (selectively calling the individual functions of the Native layer).
 9-3. Function Class
 The Function class 503 is a class defined as a pair of forward processing and backward processing. The Function class 503 already exists in Chainer, but a function for requesting the forward processing and the backward processing from the Native layer is added.
 Method implementation examples are as follows.
 (Example 1) forward_native(…)
 This requests the forward processing from the Native layer.
 (Example 2) backward_native(…)
 This requests the backward processing from the Native layer.
Processing flow assumed when forward_native or backward_native is called
(1) The output data size is computed from the input data size and the parameters given when the Function instance was initialized.
(2) The output data size obtained in (1), the input data instances (NativeArray), and the Function category (Linear, ReLU, ...) are passed to the NativeDevice instance, and a Native-layer function call is requested.
(3) The NativeDevice instance executes the following processing in response to this call:
(A) It requests the Native layer to generate a multi-dimensional array for the output data. This step is skipped for Functions that overwrite their input data.
(B) It determines which Native-layer function to actually call from the type of the input data (floating point or fixed point) and the Function category, and calls it (the Native-layer function writes its result into the multi-dimensional array allocated in (A)).
(C) It generates a NativeArray instance that wraps the multi-dimensional array allocated in (A).
(4) The NativeArray instance generated in (C) is returned as the return value of the Function. A hedged sketch of this flow follows below.
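The following is a minimal sketch of this dispatch, under the assumption that the Native layer is represented by ordinary Python callables. All names (native_linear_float, NativeDeviceSketch, LinearFunctionSketch, and so on) are illustrative, not identifiers from the original, and the NativeArray wrapping of step (C) is omitted to keep the sketch short.

```python
import numpy as np

# Stand-ins for Native-layer kernels, selected by (Function category, data type).
def native_linear_float(x, w, y):
    y[...] = x @ w

def native_linear_fixed(x, w, y):
    # Q8 fixed point: the product of two Q8 values is Q16, so shift back by 8.
    y[...] = ((x.astype(np.int32) @ w.astype(np.int32)) >> 8).astype(np.int16)

NATIVE_KERNELS = {
    ('Linear', np.float32): native_linear_float,
    ('Linear', np.int16):   native_linear_fixed,
}

class NativeDeviceSketch:
    def call(self, kind, inputs, out_shape):
        x, w = inputs
        y = np.empty(out_shape, dtype=x.dtype)         # step (3)(A): allocate the output array
        NATIVE_KERNELS[(kind, x.dtype.type)](x, w, y)  # step (3)(B): dispatch on type and category
        return y                                       # step (3)(C): would be wrapped in a NativeArray

class LinearFunctionSketch:
    def __init__(self, device, weight):
        self.device, self.weight = device, weight

    def forward_native(self, x):
        out_shape = (x.shape[0], self.weight.shape[1])  # step (1): compute the output size
        return self.device.call('Linear', (x, self.weight), out_shape)  # steps (2) and (4)

device = NativeDeviceSketch()
f = LinearFunctionSketch(device, np.random.randn(4, 3).astype(np.float32))
print(f.forward_native(np.random.randn(2, 4).astype(np.float32)).shape)   # (2, 3)
```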
9-4. Optimizer Class
The Optimizer class 503 updates the weights. The Optimizer class already exists in Chainer; here it is extended with the ability to delegate state initialization and weight update processing to the Native layer.
Example method implementations are as follows.
(Example 1) init_state_native(...)
This requests the Native layer to initialize the internal state of the weight update algorithm.
(Example 2) update_one_native(...)
This requests the Native layer to perform the weight update processing.
The processing flow when these methods are called is the same as the one described above for the Function class (a small sketch follows below).
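As a small illustration, the sketch below delegates a plain SGD update to a stand-in Native-layer kernel. The kernel name, class name and signatures are assumptions for explanation, not identifiers from the original.

```python
import numpy as np

# Stand-in for a Native-layer SGD kernel; the name and signature are illustrative.
def native_sgd_update(w, grad, lr):
    w -= lr * grad            # in-place update, as the Native layer would perform it

class SGDOptimizerSketch:
    """Sketch of an Optimizer that delegates the per-parameter update."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.state = {}

    def init_state_native(self, param_id, shape):
        # Plain SGD needs no internal state; momentum-style algorithms would
        # ask the Native layer to allocate their state arrays here.
        self.state[param_id] = None

    def update_one_native(self, w, grad):
        native_sgd_update(w, grad, self.lr)

opt = SGDOptimizerSketch(lr=0.1)
w = np.ones(3, dtype=np.float32)
opt.update_one_native(w, np.array([0.5, -0.5, 0.0], dtype=np.float32))
print(w)   # [0.95 1.05 1.  ]
```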
9-5. Specific Example of Overall Operation in Cooperation with the Native Layer
A specific example is illustrated in FIG. 39.
9-6. Bytecode Generation (Output) Unit
The bytecode generator 504 is a mechanism that converts the network configuration of a neural network defined by "Define-by-Run" into bytecode (a data format that can be interpreted and executed) and outputs it. One possible bytecode format is the one described above for the "virtual machine module". Besides that format, however, output to formats such as the following can also be considered.
(1) Caffe's neural network definition format: the output can be executed with Caffe (Caffe is one of the representative frameworks for designing and running neural networks).
(2) A programming language such as C or Java (registered trademark): software that executes the overall operation can be generated.
(3) A hardware description language (HDL) such as Verilog: hardware that executes the overall operation can be synthesized.
An example function definition for the bytecode generation unit is as follows.
Function name: write_network_difinition(output_node, path, format)
Function specification: outputs the network configuration connected from output_node toward the input side, in the format specified by format, to the file specified by path. output_node can be given as a list (so that several nodes can serve as starting points).
Example procedure for outputting bytecode from the reference structure for backward processing
As explained in Part 1 above, Chainer has a function to generate a reference structure for backward processing directly from the natural description of the forward computation. Since the forward processing can be computed by following this reference structure in reverse order, generating bytecode from the reference structure makes both forward processing and backward processing executable.
This procedure is roughly divided into the following steps:
(1) Generation of the element information needed to create the bytecode
  input/output data information
  weight information
  Function call information for backward processing
(2) Conversion of the element information into bytecode
Element information generation procedure for creating the bytecode
The "reference structure for backward processing" is traced starting from the output_node passed to write_network_difinition, and the following processing is executed (a runnable sketch follows below):
(1) If the current node is a Variable, the information of its multi-dimensional array (size, number of dimensions, floating point / fixed point (Q value)) is added to the "input/output data information" list.
(2) If the current node is a Function, the following processing is performed:
(i) The information of each weight's multi-dimensional array (size, number of dimensions, floating point / fixed point (Q value), actual weight values) is added to the "weight information" list, without allowing duplicates (because multiple Function instances can share the same weight).
(ii) The Function type, the indices of the input/output data, the indices of the weights, and the parameters specific to that Function type are added to the "Function call information for backward processing" list.
If multiple starting nodes are passed in output_node, the procedure described in the next paragraph prevents the same node from being registered twice.
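The following is a minimal, self-contained sketch of this traversal on a toy reference structure. The node classes, field names and the dictionary-based bookkeeping are assumptions for illustration; the input/output indices inside the call information and checks for revisiting shared Variables are omitted for brevity.

```python
class VariableNode:
    def __init__(self, shape, dtype, creator=None):
        self.shape, self.dtype, self.creator = shape, dtype, creator

class FunctionNode:
    def __init__(self, kind, inputs, weights, params=None):
        self.kind, self.inputs, self.weights = kind, inputs, weights
        self.params = params or {}

def collect_element_info(output_node):
    io_info, weight_info, call_info = [], [], []
    seen_weights = {}                     # weights can be shared between Functions

    def visit(var):
        # (1) a Variable node: record its multi-dimensional array information
        io_info.append((var.shape, var.dtype))
        func = var.creator
        if func is None:
            return
        # (2)(i) weight information, registered without duplicates
        w_idx = []
        for w in func.weights:
            if id(w) not in seen_weights:
                seen_weights[id(w)] = len(weight_info)
                weight_info.append(w)
            w_idx.append(seen_weights[id(w)])
        # (2)(ii) Function call information for backward processing
        call_info.append({'kind': func.kind, 'weights': w_idx, 'params': func.params})
        for x in func.inputs:
            visit(x)

    visit(output_node)
    return io_info, weight_info, call_info

# toy network: x -> Linear(W) -> h -> ReLU -> y
W = [[1.0, 0.0], [0.0, 1.0]]
x = VariableNode((1, 2), 'float32')
h = VariableNode((1, 2), 'float32', FunctionNode('Linear', [x], [W]))
y = VariableNode((1, 2), 'float32', FunctionNode('ReLU', [h], []))
print(collect_element_info(y))
```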
Element information creation procedure when multiple starting nodes are passed (a small sketch of the merge follows below):
(1) Create an (empty) list of "Function call information for backward processing".
(2) For each starting node in output_node, perform the following steps:
(A) Create a list of "Function call information for backward processing" specific to that starting node.
(B) Apply the registration procedure described in the previous paragraph to the list created in (A). Nodes that are already registered in the list created in (1) are skipped, which avoids duplicate registration.
(C) Link the list created in (A) to the front of the list created in (1).
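A small, self-contained sketch of this merge, using plain strings as stand-ins for nodes and a dictionary as a toy backward reference structure (all names are illustrative):

```python
def build_call_info(origin, graph, already_registered):
    # Walk the toy backward reference structure from `origin`; `graph` maps a
    # node to the node it was produced from (None for an input node).
    calls, node = [], origin
    while node is not None:
        if node not in already_registered:   # skip nodes registered by earlier origins
            calls.append(node)
            already_registered.add(node)
        node = graph.get(node)
    return calls

# D <- C <- B <- A, plus a second output E that also consumes B
graph = {'D': 'C', 'C': 'B', 'B': 'A', 'A': None, 'E': 'B'}

merged, seen = [], set()
for origin in ['D', 'E']:                    # two starting nodes passed in output_node
    per_origin = build_call_info(origin, graph, seen)   # steps (A) and (B)
    merged = per_origin + merged             # step (C): link in front of the merged list
print(merged)   # ['E', 'D', 'C', 'B', 'A'] -- B and A are not registered twice
```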
Python layer: bytecode output unit (6) Conversion of element information into bytecode
The following information, created by the procedure "Element information generation for creating the bytecode", is converted into the format specified by the format argument of write_network_difinition:
(1) input/output data information
(2) weight information
(3) Function call information for backward processing
Possible formats are those listed above for the "bytecode generation unit".
Output of multiple network configurations
The write_network_difinition function described above for the "bytecode generation unit" is specified to write the network configuration directly to the file passed in the path argument, but an object for writing multiple network configurations into bytecode can also be passed as this path argument. Here a network configuration means the components (1), (2) and (3) described in "Python layer: bytecode output unit (6) Conversion of element information into bytecode".
This "object for writing multiple network configurations into bytecode" shares identical "(2) weight information" among the multiple network configurations and thereby reduces the amount of weight information written to the bytecode. "(1) input/output data information" and "(3) Function call information for backward processing" are generated independently by the steps described above, even when parts of the information overlap. A code example using this object is shown in FIG. 40 (a hedged reconstruction follows below). In that example, line 6 traces the reference structure for backward processing from nodeA and outputs that network configuration, line 7 does the same from nodeB, and line 8 outputs these two network configurations to a single file (./bytecode.bin).
How to specify different Function call orders for forward processing and backward processing of the same network
As described in Part 1 above, Chainer has a function (the unchain_backward method) that cuts off the reference structure for backward processing from a specific Variable instance toward the input layer side. By combining this unchain_backward method with the "output of multiple network configurations" described in the previous paragraph, different Function call orders can be specified for the forward and backward computations of the same network.
In the code example shown in FIG. 41, the call at #1 outputs a network definition that executes all processing from A to D, whereas the call at #2 outputs a network definition in which only the processing from B to D is executed. When the virtual machine executes the bytecode, the configurations can then be used selectively, for example running forward processing on the network configuration output at #1 and backward processing on the one output at #2.
10. Configuration Common to the Native Layer and the Python Layer
10-1. Algorithm execution unit that fuses multiple NN algorithms
Certain combinations of Functions appear frequently in typical neural network configurations:
(Example 1) Linear→ReLU, Linear→ReLU→Linear→ReLU
(Example 2) Convolution2D→ReLU, Convolution2D→ReLU→Convolution2D→ReLU
By defining such a frequently occurring combination of Functions as a single Function, and implementing that combined computation in a specialized way in both the Python layer and the Native layer, the following benefits are obtained both for algorithm design and for hardware execution efficiency:
(1) The overhead of calling each Function (function calls and communication) is reduced.
(2) Higher execution efficiency is obtained from implementations that take into account data dependencies and parallelism across multiple Functions (for example, making effective use of cache memory and reducing the amount of data that must be accessed directly in main memory).
(3) Using more abstract Functions at algorithm design time makes complex network configurations easier to understand and define.
In general, the speed of arithmetic cores in recent computers has improved remarkably, while memory access performance has not become faster to the same degree. As a result, when the performance of the computer as a whole is considered, it is limited by memory access and sufficient computational performance cannot be achieved. To address this problem, small but especially fast memories called cache memories and register files are placed physically close to the arithmetic core, and as many computations as possible are performed on them, which bridges the speed gap between the two.
As noted above, the following combinations of Functions appear frequently in neural network configurations:
・Convolution2D→ReLU
・Convolution2D→ReLU→Convolution2D→ReLU
Convolution2D performs a large amount of computation relative to the size of the data it inputs and outputs, so there is a large opportunity to exploit mechanisms such as cache memory and bring out the performance of the computing core. ReLU, by contrast, performs little computation relative to the size of the data it inputs and outputs, so this opportunity is small.
When Convolution2D and ReLU are executed as separate Functions, all of the Convolution2D results must be written out to main memory and then transferred back to the vicinity of the arithmetic core to compute ReLU. The reason is that, when the Convolution2D processing finishes, it is not known whether ReLU will use the result immediately.
If Convolution2D and ReLU are instead executed as a single combined Function, the Convolution2D result can be used directly as the input of the ReLU processing in the cache memory or register file before it is written out to main memory. This reduces the frequency of data transfers to main memory and increases the chance of executing the processing more efficiently (faster).
If even more Functions can be executed as one combined Function, as in Convolution2D→ReLU→Convolution2D→ReLU, the opportunity to improve processing efficiency increases further, because the amount of access to main memory can be reduced still more aggressively by taking into account the cache memory size and the data dependencies within the combination of Functions. A hedged sketch of a fused Function follows below.
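The sketch below illustrates the idea with the simpler Linear→ReLU pair from Example 1 of section 10-1: one Function computes both steps so that the intermediate Linear output never crosses a Function boundary. The class and method names are illustrative assumptions, and the numpy implementation only models the fused structure, not the cache-level behavior of a Native-layer kernel.

```python
import numpy as np

class LinearReLU:
    """Fused Function: Linear followed by ReLU in a single forward call."""
    def __init__(self, weight, bias):
        self.W, self.b = weight, bias

    def forward(self, x):
        # The Linear result is consumed immediately by ReLU; no separate
        # Function boundary (and thus no extra round trip to main memory in a
        # Native-layer implementation) exists between the two steps.
        return np.maximum(x @ self.W + self.b, 0.0)

rng = np.random.default_rng(0)
f = LinearReLU(rng.standard_normal((4, 3)).astype(np.float32),
               np.zeros(3, dtype=np.float32))
y = f.forward(rng.standard_normal((2, 4)).astype(np.float32))
print(y.shape, bool((y >= 0).all()))   # (2, 3) True
```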
10-2. Configuration 1 of an optimization device specialized for forward processing execution: with weight optimization processing means
For some neural network layer algorithms, the amount of computation and memory usage can be reduced by optimizing the weights specifically for the case where only forward processing is executed and no backward processing is performed. Such optimization is possible for the following reason.
Training a neural network with stochastic gradient descent requires high numerical precision and a wide, flexible value range for the weight vectors, because small updates must be accumulated during training and the range over which the weight vectors will vary cannot be sufficiently anticipated in advance. If only forward processing is executed, however, this combination of precision and flexibility is not needed, and memory and computation can be reduced by cutting down the amount of weight information before running the forward processing. The computation can be reduced because, for example, the number of weight elements can be decreased or weights equal to zero need not be computed. For example, in Linear processing (the inner product between layers), a known technique is to apply singular value decomposition to the weight information (a matrix of size number-of-input-nodes × number-of-output-nodes) and delete the components whose singular values are small, which compresses the weight data size and reduces the computation size (a sketch follows below).
(J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013)
By adding to Chainer's Function class a method that performs this forward-specialized weight optimization, the computational resources needed when only forward processing is executed with already-trained weights can be reduced. Like the existing Forward and Backward methods of the Function class, this method hides differences between hardware implementations (general-purpose computer, GPU, Native) depending on the type of the multi-dimensional weight arrays held by the Function (it dispatches to the appropriate reinitialization implementation).
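The following is a minimal numpy sketch of the SVD-based compression mentioned above: the weight matrix W of a Linear layer is factored, the smallest singular values are dropped, and the forward pass x @ W is replaced by two thinner products. The function name and the chosen rank are illustrative assumptions.

```python
import numpy as np

def compress_linear(W, rank):
    """Truncated-SVD compression of a Linear weight matrix (forward-only use)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # shape (in, rank)
    B = Vt[:rank, :]                  # shape (rank, out)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128)).astype(np.float32)
A, B = compress_linear(W, rank=32)

x = rng.standard_normal((8, 256)).astype(np.float32)
y_full = x @ W                        # original forward pass
y_low = (x @ A) @ B                   # compressed forward pass
print(W.size, A.size + B.size)        # parameter count: 32768 vs 12288
print(float(np.abs(y_full - y_low).mean()))   # approximation error
```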
Supplementary notes on the software and hardware configuration
The embodiments have so far been described by attaching specific functions to functions and classes of the Python layer and the Native layer. However, this division of roles among software layers, classes and functions is only one example used to explain the functional configuration of the embodiments concretely; as the following examples show, the individual functions of the embodiments may also be implemented in classes, layers or hardware other than those described above.
(1) The processing described in "Configuration 2 of an optimization device specialized for forward processing execution: with data memory region reuse means" can be executed in advance by the bytecode output unit rather than by the virtual machine module.
(2) The Function described in "Function combining multiple Functions" can be implemented in specialized hardware (FPGA, ASIC) rather than as a software-level optimization.
Therefore, the configuration according to the embodiments of the present invention does not depend directly on the functions and classes of the Python layer and the Native layer, or on a software-based implementation.
DESCRIPTION OF SYMBOLS
10 Learning device
100 Evaluation board
110 Acquisition unit
120 Storage unit
130 Execution unit
200 Embedded system chip (embedded semiconductor integrated circuit)
401 Drive unit
402 Function class / Optimizer class
405 Native layer execution unit
408 Multi-dimensional array for Native
409 Variable class
504 Bytecode generation unit
505 Device management module
506 Data conversion module
507 Multi-dimensional array module
508 Function module / Optimizer module
509 Virtual machine module
510 Memory pool module

Claims (11)

1.  A mounting device for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     first execution means capable of causing a first arithmetic unit mounted on the semiconductor integrated circuit to execute first source code described in a first programming language;
     second execution means capable of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from the first programming language; and
     comparison means for comparing a result of the first execution means executing a first specific code included in the first source code with a result of the second execution means executing a second specific code included in the second source code, the second specific code being the first specific code rewritten in the second programming language, and outputting a comparison result,
     wherein the second execution means includes bytecode generation means for generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and Function call information for backward processing.
2.  A mounting device for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     second execution means capable of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from a first programming language,
     wherein the second execution means is executed by the second arithmetic unit using bytecode that includes at least one of input/output data information, weight information, and Function call information for backward processing.
3.  The mounting device according to claim 1 or 2, wherein the second execution means includes conversion means for converting data from fixed-point to floating-point format and/or from floating-point to fixed-point format, and executes functions using the data converted by the conversion means.
4.  The mounting device according to any one of claims 1 to 3, wherein the second execution means includes function definition means for defining, as one function, a plurality of functions each of which defines a layer algorithm, and calls and executes the function defined by the function definition means.
5.  The mounting device according to any one of claims 1 to 4, wherein, when backward processing is not executed, the second execution means executes weight optimization processing when executing a function that defines forward processing.
6.  The mounting device according to any one of claims 1 to 5, wherein, when backward processing is not executed, the second execution means makes at least one of weight gradients, the internal state of the weight update algorithm, and data input and output between layers unnecessary.
7.  The mounting device according to claim 1, wherein the bytecode generation means generates bytecode defining a plurality of network configurations.
8.  The mounting device according to claim 1, wherein the first arithmetic unit includes at least one of a CPU and a GPU, and the second arithmetic unit includes an auxiliary arithmetic unit.
9.  The mounting device according to claim 1, wherein the first source code described in the first programming language is described so as to cause a computer to function as:
     execution means for sequentially executing each code included in the first source code, the execution means being configured to, at the time each code is executed, calculate the output value of the forward processing defined by that code based on its input values and generate a reference structure between objects in the layer corresponding to that code.
10.  A mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     a first execution process of causing a first arithmetic unit mounted on the semiconductor integrated circuit to execute first source code described in a first programming language;
     a second execution process of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from the first programming language; and
     a comparison process of comparing a result of executing, in the first execution process, a first specific code included in the first source code with a result of executing, in the second execution process, a second specific code included in the second source code, the second specific code being the first specific code rewritten in the second programming language, and outputting a comparison result,
     wherein the second execution process includes a bytecode generation process of generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and Function call information for backward processing.
11.  A mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     a second execution process of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from a first programming language,
     wherein the second execution process is executed by the second arithmetic unit using bytecode that includes at least one of input/output data information, weight information, and Function call information for backward processing.

PCT/JP2016/004028 2015-09-03 2016-09-02 Installation device and installation method WO2017038104A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2015-174205 2015-09-03
JP2015174205 2015-09-03
JP2015-213294 2015-10-29
JP2015213294A JP2018173672A (en) 2015-09-03 2015-10-29 Mounting apparatus

Publications (1)

Publication Number Publication Date
WO2017038104A1 true WO2017038104A1 (en) 2017-03-09

Family

ID=58186916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/004028 WO2017038104A1 (en) 2015-09-03 2016-09-02 Installation device and installation method

Country Status (1)

Country Link
WO (1) WO2017038104A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689171A (en) * 1992-03-26 1994-03-29 Nec Ic Microcomput Syst Ltd Program check method
JPH0744515A (en) * 1993-07-29 1995-02-14 Matsushita Electric Ind Co Ltd Neural network circuit
WO2010047388A1 (en) * 2008-10-24 2010-04-29 独立行政法人情報通信研究機構 Calculation processing system, program creation method, and program creation program
JP2012208843A (en) * 2011-03-30 2012-10-25 Keihin Corp Development support device
US20130218821A1 (en) * 2011-09-21 2013-08-22 Botond Szatmary Round-trip engineering apparatus and methods for neural networks
JP2013106343A (en) * 2011-11-10 2013-05-30 Toyota Infotechnology Center Co Ltd Optimization of dynamic spectrum access

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11915146B2 (en) 2015-10-29 2024-02-27 Preferred Networks, Inc. Information processing device and information processing method
US11521070B2 (en) 2015-10-29 2022-12-06 Preferred Networks, Inc. Information processing device and information processing method
US11461637B2 (en) 2017-06-14 2022-10-04 International Business Machines Corporation Real-time resource usage reduction in artificial neural networks
US10558914B2 (en) 2017-06-14 2020-02-11 International Business Machines Corporation Real-time resource usage reduction in artificial neural networks
US10268951B2 (en) 2017-06-14 2019-04-23 International Business Machines Corporation Real-time resource usage reduction in artificial neural networks
CN111160543A (en) * 2017-12-14 2020-05-15 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN111160543B (en) * 2017-12-14 2023-08-29 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
US10885447B2 (en) 2018-01-29 2021-01-05 Panasonic Intellectual Property Corporation Of America Data processing method and data processing system
US11694099B2 (en) 2018-01-29 2023-07-04 Panasonic Intellectual Property Corporation Of America Data processing method and data processing system
EP3518151A1 (en) 2018-01-29 2019-07-31 Panasonic Intellectual Property Corporation of America Data processing method and data processing system
JP2020119213A (en) * 2019-01-23 2020-08-06 富士通株式会社 Arithmetic processing device, program, and method for controlling arithmetic processing device
JP7225831B2 (en) 2019-01-23 2023-02-21 富士通株式会社 Processing unit, program and control method for processing unit
EP3686733A1 (en) 2019-01-23 2020-07-29 Fujitsu Limited Calculation processing apparatus, program, and method of controlling the calculation processing apparatus


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16841137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16841137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP