WO2017038104A1 - Installation device and installation method - Google Patents

Installation device and installation method

Info

Publication number
WO2017038104A1
Authority
WO
WIPO (PCT)
Prior art keywords
function
code
processing
source code
execution
Application number
PCT/JP2016/004028
Other languages
French (fr)
Japanese (ja)
Inventor
辰哉 加藤
遼介 奥田
誠也 得居
裕也 海野
将平 比戸
Original Assignee
株式会社Preferred Networks
Priority claimed from JP2015213294A
Application filed by 株式会社Preferred Networks
Publication of WO2017038104A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • the technology described in this specification relates to a mounting apparatus and a mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit.
  • An object is to provide a mounting apparatus and a mounting method for appropriately mounting an algorithm for executing machine learning, designed using a computer having abundant calculation resources, on an embedded system chip (embedded-system semiconductor integrated circuit) having fewer calculation resources than such a computer.
  • An apparatus according to one aspect is a mounting apparatus for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising: first execution means capable of executing, on a first arithmetic unit mounted on the semiconductor integrated circuit, first source code described in a first programming language; second execution means capable of executing, on a second arithmetic unit mounted on the semiconductor integrated circuit, second source code described in a second programming language different from the first programming language; and comparing means for comparing the result of execution by the first execution means of a first specific code included in the first source code with the result of execution by the second execution means of a second specific code that is included in the second source code and is obtained by rewriting the first specific code in the second programming language, and for outputting a comparison result, wherein the second execution means comprises bytecode generation means for generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and function call information at the time of backward processing.
  • According to the above, it is possible to provide a mounting apparatus that appropriately mounts an algorithm for executing machine learning, designed using a computer having abundant computing resources, on an embedded system chip (embedded semiconductor integrated circuit) having scarce computing resources compared to such a computer.
  • FIG. 1 is a schematic diagram conceptually showing a technique called “Define-and-Run” according to the prior art.
  • FIG. 2 is a schematic diagram conceptually showing a technique called “Define-by-Run” according to the embodiment of the present invention.
  • FIG. 3 is a schematic diagram illustrating an example of a network configuration of a neural network.
  • FIG. 4 is a schematic diagram showing another example of the network configuration of the neural network.
  • FIG. 5 is a schematic diagram showing still another example of the network configuration of the neural network.
  • FIG. 6 is a diagram illustrating a pseudo code for realizing the calculation executed in the forward process by the Linear.
  • FIG. 7 is a diagram illustrating a pseudo code for realizing a calculation executed during backward processing by Linear.
  • FIG. 8 is a diagram illustrating a pseudo code for realizing a calculation executed at the time of forward processing by the ReLU.
  • FIG. 9 is a diagram illustrating a pseudo code for realizing a calculation executed during backward processing by the ReLU.
  • FIG. 10 is a diagram illustrating a pseudo code for realizing a calculation executed during forward processing by Convolution 2D.
  • FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
  • FIG. 12 is a block diagram schematically illustrating an example of functions of the learning device according to an embodiment of the present invention.
  • FIG. 13 is a diagram illustrating an example of source code input to the learning device according to the embodiment of the present invention.
  • FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • FIG. 15 is a diagram illustrating an example of a source code described by Caffe according to the related art.
  • FIG. 16 is a diagram illustrating another example of the source code input to the learning device according to the embodiment of the present invention.
  • FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • FIG. 18 is a schematic diagram conceptually showing a network configuration of a neural network generated by a source code described by Caffe according to the prior art.
  • FIG. 19 is a diagram illustrating still another example of the source code input to the learning device according to the embodiment of the present invention.
  • FIG. 20 is a schematic diagram for explaining Step I of the mounting method according to the embodiment of the present invention.
  • FIG. 21 is a schematic diagram for explaining Step II of the mounting method according to the embodiment of the present invention.
  • FIG. 22 is a schematic diagram illustrating a case where an execution unit using Python and an execution unit using a chip communicate with each other.
  • FIG. 23 is a schematic diagram for explaining Step III of the mounting method according to the embodiment of the present invention.
  • FIG. 24 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (first method) according to an embodiment of the present invention.
  • FIG. 25 is a flowchart showing an example of a procedure used in the mounting method according to the embodiment of the present invention.
  • FIG. 26 is a schematic diagram showing an operation state of the embedded chip in the mounting method according to the embodiment of the present invention.
  • FIG. 27 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting technique (second technique) according to an embodiment of the present invention.
  • FIG. 28 is a schematic diagram conceptually showing functions included in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 29 is a schematic diagram illustrating a configuration example of a Native layer execution unit included in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 30 is a diagram illustrating a structure definition example of the multidimensional array module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 31 is a diagram showing the mutual compatibility and reference relationship of multidimensional array data in the multidimensional array module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 32 is a diagram for explaining a memory pool module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 33 is a diagram for explaining a structure definition example of the memory pool module of the mounting apparatus according to the embodiment of the present invention.
  • FIG. 34 is a diagram showing a coding example of pipelining in the mounting apparatus according to an embodiment of the present invention.
  • FIG. 35 is a diagram showing an internal state of the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 36 is a diagram showing an example of an execution flow of the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 37 is a diagram showing an example of an execution flow of the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 38 is a diagram showing an example of address setting in the virtual machine module in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 39 is a diagram illustrating a specific example of the entire operation in which the Python layer and the Native layer cooperate in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 40 is a diagram illustrating an output of a plurality of network configurations in the bytecode generation unit in the mounting apparatus according to an embodiment of the present invention.
  • FIG. 41 is a diagram illustrating a code example in the byte code generation unit in the mounting apparatus according to the embodiment of the present invention.
  • FIG. 42 is a diagram illustrating a configuration example of the Native I / F according to an embodiment of the present invention.
  • FIG. 43 is a diagram illustrating a configuration example for executing identification / learning by NN according to an embodiment of the present invention.
  • FIG. 44 is a diagram showing a configuration example for managing a multidimensional array according to an embodiment of the present invention.
  • FIG. 45 is a diagram illustrating a configuration example of a data expression conversion unit according to an embodiment of the present invention.
  • FIG. 46 is a diagram illustrating a configuration example of a communication unit according to an embodiment of the present invention.
  • FIG. 47 is a diagram illustrating a configuration example of a floating-point and fixed-point execution unit and a type conversion unit according to an embodiment of the present invention.
  • FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention.
  • FIG. 49 is a diagram illustrating a configuration example of an algorithm execution unit in which a plurality of NN algorithms according to an embodiment of the present invention are merged.
  • FIG. 50 is a diagram illustrating a configuration example of a multi-dimensional array data communication amount reduction unit according to an embodiment of the present invention.
  • FIG. 51 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • FIG. 52 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • FIG. 53 is a diagram illustrating a configuration example of a bytecode generation unit and a virtual machine according to an embodiment of the present invention.
  • FIG. 54 is a diagram illustrating a configuration example of a comparison unit according to an embodiment of the present invention.
  • FIG. 55 is a diagram illustrating a configuration example of a function synthesis unit according to an embodiment of the present invention.
  • Part 1 (Learning device according to the embodiment)
  • 1. Background and Outline: Machine learning algorithms, including deep learning, can often be formulated as a problem of minimizing the sum of loss functions defined for each model.
  • the loss function is an index represented by an error between the model output and the correct answer in a given learning data sample.
  • a series of processes from inputting data into the model until obtaining an output and comparing with the correct answer is called a calculation graph, and the result is defined as a loss function.
  • the loss function minimization problem can be solved by a general method called a gradient method as long as the gradient obtained by differentiating the loss function can be calculated.
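  • In generic notation (an assumption for illustration, not taken from the specification), the minimization problem and the gradient-method update described above can be written as follows.

```latex
% Minimize the sum of per-sample losses over the weights w:
\min_{w}\; L(w) = \sum_{i} \ell\bigl(f(x_i; w),\, t_i\bigr)

% Gradient-method update with learning rate \eta, usable whenever \nabla_w L can be computed:
w \leftarrow w - \eta\, \nabla_{w} L(w)
```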
  • Calculation libraries for this purpose have so far been based on a calculation procedure that the applicant calls "Define-and-Run".
  • This is an approach in which a calculation graph is first defined (Define), a gradient is derived by automatic differentiation, and then learning (Run) is performed using learning data.
  • Under the assumption that the calculation graph contains no complicated control syntax (if, for, etc.) and does not change over time, this approach has the merit that a series of gradient calculations can be compiled and prepared as a batch at the time of Define, and that memory management is unnecessary.
  • In the present embodiment, a new calculation procedure that the applicant calls "Define-by-Run" is proposed.
  • In this approach, the graph structure is dynamically extracted and stored at each learning step (Run), and the gradient is recalculated each time.
  • FIG. 1 is a schematic diagram conceptually showing the technique called "Define-and-Run" according to the prior art, and FIG. 2 is a schematic diagram conceptually showing the technique called "Define-by-Run" according to the embodiment of the present invention.
  • In "Define-and-Run", a mini-programming language processing system first takes only the model definition as input and outputs the calculation procedures of the forward (identification) processing and the backward (learning) processing, which are the entities of the calculation graph (Define step).
  • the forward / backward processing system inputs data and updates parameters (weights) according to the calculation procedure of forward (identification) processing and backward (learning) processing (Run step).
  • In "Define-by-Run", a general-purpose programming language processing system executes the forward (identification) processing while taking the model definition, the input data, and the parameters as input, and at the same time constructs the calculation procedure for the backward (learning) processing.
  • the model definition is defined in conformity with the grammar of a general-purpose programming language such as function call, four arithmetic operations, loop and branch.
  • the calculation procedure of the backward (learning) process can be dynamically changed independently of the execution of the forward (identification) process.
  • Backward processing system can be called at any timing.
  • the Backward processing system updates the parameters from the input data and the results of the Forward processing according to the Backward calculation procedure.
  • the processing performed in the neural network mainly includes forward processing, backward processing, and updating of weights.
  • the forward process is a process for processing and propagating information from the input layer to the output layer of the neural network.
  • Backward processing refers to performing two processes, error back propagation and weight gradient calculation, from the output layer to the input layer of the neural network.
  • The error back propagation is a process of propagating the error (δ) obtained from the output side layer to the input side layer.
  • The weight gradient calculation is a process of obtaining, for a layer having a weight, the weight gradient (∇W) from the error (δ) obtained from the output side layer and the output value of the input side layer.
  • The update of the weight is a process of updating, for a layer having a weight, the weight with an algorithm derived from the stochastic gradient descent method (SGD) using the weight gradient (∇W) obtained by the weight gradient calculation. This weight update is executed once for each unit of batch processing.
  • Each layer constituting a neural network is realized by, for example, one of the layer algorithms listed below:
    - Linear
    - ReLU
    - Dropout
    - Softmax Cross Entropy
    - Convolution 2D
    - Pooling (Average Pooling, Max Pooling, etc.)
  • Typical examples of the weight update algorithm include the following:
    - Momentum-SGD
    - Adam
  • FIG. 3 is a schematic diagram illustrating an example of a network configuration of a neural network.
  • FIG. 3 illustrates a neural network in which six intermediate layers (Linear, ReLU, Linear, ReLU, Dropout, and Linear) are arranged between an input layer and an output layer (Softmax).
  • In FIG. 3, a rightward arrow indicates forward processing, and a leftward arrow indicates backward processing.
  • Since the input layer has no weight to be updated, the backward processing ends at the layer adjacent to the input layer (in the example shown in FIG. 3, the Linear layer adjacent to the input layer).
  • FIG. 4 is a schematic diagram showing another example of the network configuration of the neural network.
  • FIG. 4 illustrates a neural network in which a plurality (three) of groups of intermediate layers (Convolution 2D, ReLU, and Convolution 2D) are arranged in parallel and are followed by a Linear intermediate layer and a Softmax output layer.
  • In FIG. 4, an upward arrow indicates forward processing, and a downward arrow indicates backward processing.
  • FIG. 5 is a schematic diagram showing still another example of the network configuration of the neural network.
  • FIG. 5 illustrates, as an example, a neural network having a loop (this may be referred to as “Recurrent Neural Network”).
  • In FIG. 5, the intermediate layer (here, "Linear") executes a calculation that takes as its inputs the previous output value of the intermediate layer itself and the current output value of the input layer.
  • As a method for realizing backward processing in such a neural network, a method called BPTT (Backpropagation Through Time) is known, in which the network is expanded in the time-axis direction in advance and converted into a network without a loop.
  • Layer algorithm calculation (Linear) Linear, which is one of the layer algorithms, executes a calculation that repeats the operation of taking the weighted average of all the nodes in the input side layer by the number of nodes in the intermediate layer.
  • FIG. 6 is a diagram showing pseudo code for realizing the calculation executed by Linear at the time of forward processing, and FIG. 7 is a diagram showing pseudo code for realizing the calculation executed by Linear at the time of backward processing.
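  • A minimal NumPy sketch of the Linear forward and backward calculations described above (this is not the pseudo code of FIGS. 6 and 7; the array shapes are assumptions):

```python
import numpy as np

def linear_forward(x, W, b):
    # x: (batch, n_in), W: (n_out, n_in), b: (n_out,)
    # For each node of the intermediate layer, take a weighted sum over
    # all nodes of the input side layer.
    return x.dot(W.T) + b

def linear_backward(x, W, gy):
    # gy: error propagated from the output side layer, shape (batch, n_out)
    gx = gy.dot(W)        # error back propagation toward the input side layer
    gW = gy.T.dot(x)      # weight gradient from the error and the input values
    gb = gy.sum(axis=0)   # bias gradient
    return gx, gW, gb
```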
  • Layer algorithm calculation (ReLU) ReLU which is one of layer algorithms, calculates Max (0, val) for each node in the input side layer. This algorithm is the most used technique in recent years in the processing (activation function) for adding nonlinearity to the calculation of the neural network.
  • FIG. 8 is a diagram showing pseudo code for realizing the calculation executed by ReLU at the time of forward processing, and FIG. 9 is a diagram showing pseudo code for realizing the calculation executed by ReLU at the time of backward processing.
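  • A minimal NumPy sketch of the ReLU forward and backward calculations described above (not the pseudo code of FIGS. 8 and 9):

```python
import numpy as np

def relu_forward(x):
    # Max(0, val) for each node of the input side layer
    return np.maximum(0.0, x)

def relu_backward(x, gy):
    # The error propagates only through nodes whose forward input was positive.
    return gy * (x > 0)
```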
  • Layer algorithm calculation (Dropout): Dropout, which is one of the layer algorithms, randomly selects a fixed ratio of nodes and executes a calculation that inactivates their output and back propagation. This algorithm is unnecessary when only identification is performed (that is, when learning is not performed).
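  • A minimal NumPy sketch of Dropout as described above (the inverted-dropout scaling by 1/(1-ratio) is an assumption, not stated in the description):

```python
import numpy as np

def dropout_forward(x, ratio=0.5):
    # Randomly select a fixed ratio of nodes and inactivate their output.
    mask = (np.random.rand(*x.shape) >= ratio) / (1.0 - ratio)
    return x * mask, mask

def dropout_backward(gy, mask):
    # Back propagation is also inactivated for the selected nodes.
    return gy * mask
```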
  • Layer algorithm calculation (Softmax Cross Entropy): Softmax Cross Entropy, which is one of the layer algorithms, transforms the values of the input side layer using the softmax function. This layer algorithm is generally used in the output layer. During backward processing, this layer algorithm calculates the error from the difference between the correct answer label (1 or 0) and the output value.
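  • A minimal NumPy sketch consistent with the description above (the exact formula of the specification is not reproduced here; this is the standard softmax / cross-entropy form and its well-known backward error):

```python
import numpy as np

def softmax_cross_entropy_forward(x, t):
    # x: (batch, n_classes) scores, t: (batch, n_classes) one-hot labels (1 or 0)
    e = np.exp(x - x.max(axis=1, keepdims=True))   # numerically stable softmax
    y = e / e.sum(axis=1, keepdims=True)
    loss = -np.sum(t * np.log(y + 1e-12)) / x.shape[0]
    return loss, y

def softmax_cross_entropy_backward(y, t):
    # Error = difference between the output value and the correct answer label.
    return (y - t) / t.shape[0]
```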
  • Layer algorithm calculation (Convolution 2D): Convolution 2D, which is one of the layer algorithms, convolves an image having a data structure of Channel * Width * Height. Both the input and the output of the layer have a data structure of Channel * Width * Height. With this algorithm, the image size can be reduced by stride processing. In this algorithm, padding is inserted into the image of the input side layer.
  • This algorithm has the same calculation structure as the Linear (repeating the inner product calculation of the input channel for the number of output channels) with respect to the channel direction.
  • FIG. 10 is a diagram illustrating a pseudo code for realizing a calculation executed during forward processing by Convolution 2D. Convolution 2D performs weight gradient calculation and error backpropagation in the same way as Linear during backward processing. The scale of each processing loop is the same as that during forward processing.
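  • As an illustration of the calculation structure described above (not the pseudo code of FIG. 10; the shapes and the stride/padding handling below are simplifying assumptions), a naive NumPy sketch of the Convolution 2D forward calculation is:

```python
import numpy as np

def conv2d_forward(x, W, stride=1, pad=0):
    # x: (c_in, h, w) input image, W: (c_out, c_in, kh, kw) filters
    c_in, h, w = x.shape
    c_out, _, kh, kw = W.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # padding on the input side
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (w + 2 * pad - kw) // stride + 1
    y = np.zeros((c_out, oh, ow))
    for oc in range(c_out):              # like Linear, repeat the inner product
        for i in range(oh):              # over the input channels for the number
            for j in range(ow):          # of output channels
                patch = xp[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
                y[oc, i, j] = np.sum(patch * W[oc])
    return y
```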
  • Layer algorithm calculation (Max Pooling): Max Pooling, which is one of the layer algorithms, reduces the image vertically and horizontally by taking the maximum value of the image on the input side layer. Note that the filter size over which the maximum value is taken and the stride width for image reduction may differ. The number of channels does not change.
  • Layer algorithm calculation (Average Pooling): Average Pooling, which is one of the layer algorithms, reduces the image vertically and horizontally by taking the average value of the image on the input side layer. Note that the filter size over which the average value is taken and the stride width for image reduction may differ. The number of channels does not change.
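  • A minimal NumPy sketch of the two pooling calculations described above (a square window without padding is an assumption):

```python
import numpy as np

def pool2d(x, ksize, stride, mode="max"):
    # x: (c, h, w); the number of channels does not change.
    c, h, w = x.shape
    oh = (h - ksize) // stride + 1
    ow = (w - ksize) // stride + 1
    y = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[:, i * stride:i * stride + ksize, j * stride:j * stride + ksize]
            if mode == "max":
                y[:, i, j] = window.max(axis=(1, 2))    # Max Pooling
            else:
                y[:, i, j] = window.mean(axis=(1, 2))   # Average Pooling
    return y
```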
  • Weight Update Algorithm There are various algorithms derived from the stochastic gradient descent method (SGD) as the weight update algorithm. In these algorithms, the calculation is independent for each weight element.
  • The formula of momentum-SGD mentioned above and the formula of Adam mentioned above are as follows.
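  • In generic notation (g denotes the weight gradient ∇W of one element, η the learning rate, μ the momentum coefficient; this notation is an assumption and not the specification's own), the standard formulations of these update rules are:

```latex
% Momentum-SGD (per weight element):
\Delta w_t = \mu\, \Delta w_{t-1} - \eta\, g_t, \qquad w_{t+1} = w_t + \Delta w_t

% Adam (per weight element):
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad
w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```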
  • FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
  • the learning device 10 includes a CPU 11, a main memory 12, an input I / F 13, an output I / F 14, a communication I / F 15, an external memory 16, and a user I / F 17. These components are electrically connected to each other via an internal bus 18. Note that the learning apparatus 10 can selectively include a GPU (not shown).
  • the CPU 11 loads various programs such as a program (a program used for creating source code) that supports an operating system and a programming language (for example, Python) from the external memory 16 into the main memory 12, and is included in the loaded program. Execute the instruction.
  • the main memory 12 is used for storing a program executed by the CPU 11, and is constituted by, for example, a DRAM.
  • the input I / F 13 has a function of capturing output data of a measuring device (not shown), and is connected to each component by an internal bus 18.
  • The various measurement data that are the output of the measurement device include information acquired by a sensor or the like, for example, temperature, humidity, position information, and image data, and may also be time-series data such as moving image data or a sequence of temperature values acquired at regular intervals.
  • the output I / F 14 receives data from each component through the internal bus 18 and outputs the data to an output device (not shown) outside the learning device.
  • the data output to the output device may be, for example, control information for driving a motor, control information for an information output device such as a buzzer, a control switch, an automobile accelerator or brake, or a liquid crystal display.
  • the communication I / F 15 is implemented as hardware, firmware, communication software such as a TCP / IP driver or a PPP driver, or a combination thereof, and communicates various information with a server device (not shown) via the communication network 20. It is configured to be possible.
  • the external memory 16 is composed of, for example, a magnetic disk drive, a flash memory, or the like, and stores various programs such as a program (a program used for creating source code) that supports an operating system and a programming language (for example, Python).
  • the learning device 10 is configured such that the CPU 11 (and optionally the GPU) executes machine learning by executing a predetermined program loaded from the external memory 16 to the main memory 12.
  • the learning device 10 that performs machine learning can be realized as a learning device that is modeled by a neural network by the CPU 11 (optionally in addition to the GPU) executing various programs.
  • the learning device 10 having the above configuration can be mounted on a corresponding individual (device). Further, the learning device 10 can be connected to a corresponding measurement device and a corresponding output device. These measuring devices and output devices may be mounted on a corresponding individual (device) or may be connected as separate devices using communication means.
  • the learning device 10 is an arbitrary information processing device capable of executing machine learning, such as a personal computer, a tablet, a mobile phone, a smartphone, a mobile information terminal, a touch pad, and an information processing server. Including but not limited to.
  • FIG. 12 is a block diagram schematically illustrating an example of functions of the learning device according to an embodiment of the present invention.
  • The learning device 10 is based on the technique called "Define-by-Run" described above. Specifically, the learning device 10 according to the embodiment dynamically generates, at the timing at which the forward processing of the neural network is executed by a general procedural language including branching, looping, and function calling, the network configuration information necessary for backward processing and weight update processing, and thereby provides a mechanism that can actually execute the backward processing and the weight update processing.
  • The learning device 10 mainly includes an acquisition unit 110, a storage unit 120, and an execution unit 130.
  • the acquisition unit 110 acquires a source code including a code defining a forward process of each layer constituting the neural network.
  • source code is created by a developer or user using a text editor using a predetermined programming language (for example, Python).
  • a predetermined programming language for example, Python.
  • the acquisition unit 110 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, the user I / F 17, and the like illustrated in FIG.
  • the storage unit 120 stores a correspondence relationship between each of a plurality of forward processes that can be defined in the source code and a backward process corresponding to the forward process.
  • a corresponding backward process is associated with a certain forward process included in a plurality of forward processes in a one-to-one relationship.
  • For example, a forward process corresponding to Linear and the backward process corresponding to that forward process are associated with each other.
  • The one-to-one correspondence between a forward process and a backward process means that, when backward processing is executed using the reference structure for backward processing, the processing corresponding to each forward process is executed in reverse order. For example, when forward processes are executed in the order A → B → C, the backward processes are executed in the order C → B → A. Such backward processing can be realized because forward processing and backward processing are implemented in pairs.
  • the storage unit 120 can store various information including the source code acquired by the acquisition unit 110 and various libraries used in a programming language corresponding to the source code. For example, the storage unit 120 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG. 11.
  • the execution unit 130 sequentially executes each code included in the source code acquired by the acquisition unit 110 (stored in the storage unit 120).
  • the execution unit 130 can calculate the output value of the forward process defined by the code based on the input value when each code is executed.
  • the execution unit 130 can generate a reference structure between objects in a layer corresponding to the code when the code is executed.
  • the execution unit 130 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG.
  • The learning device 10 realizes the above, using the acquisition unit 110, the storage unit 120, and the execution unit 130 described above, by means of three classes: Function, Variable, and Optimizer. Note that the names of these classes are given for convenience and are not limiting.
  • a class called Function is a class defined by pairing forward processing and backward processing.
  • the class called Function defines the specific layer algorithm exemplified in the above “2-6” to “2-12” as a subclass.
  • a class called Variable is a class that manages data input and output between functions.
  • This class of Variable has a role of concealing the difference between the GPU and the CPU, and has a method (unchain_backward, which will be described later) for terminating backward processing of a network including a loop within a finite range.
  • a class called Optimizer is a class for updating weights.
  • FIG. 13 is a diagram illustrating an example of source code input to the learning device according to the embodiment of the present invention.
  • the source code illustrated in FIG. 13 is intentionally simplified for the purpose of explaining the characteristics of the learning device according to the present embodiment. Further, the number of lines described at the left end in FIG. 13 is given for explaining this specific example, and is not included in the actual source code.
  • A case in which the source code is described in Python will be explained as an example, but the source code may be described in a programming language other than Python. Details of Python are disclosed at https://www.python.org/. The content disclosed at this URL is incorporated herein by reference in its entirety.
  • a developer or the like creates the source code illustrated in FIG. 13 using a text editor or the like.
  • the acquisition unit 110 (see FIG. 12) of the learning device 10 acquires the source code created in this way and stores it in the storage unit 120.
  • the execution unit 130 executes each code included in the source code stored in the storage unit 120 line by line.
  • Specifically, the execution unit 130 executes each code sequentially, one line at a time, from the first line to the last line, from top to bottom.
  • When a control syntax is included in the source code, the execution unit 130 executes each code in the order determined by that control syntax.
  • Lines 1 to 3 describe the registration of Functions including weights (parameters) in a FunctionSet.
  • Specifically, instances l1, l2, and l3 of the Linear class, which is a Function subclass defining a layer algorithm that performs an inner product, are registered in the FunctionSet as Functions including weights.
  • The Functions registered in the FunctionSet can be updated by the Optimizer.
  • FunctionSet is a mechanism for improving the readability of code by grouping the Functions updated by the Optimizer.
  • Lines 4 and 5 describe the initialization of the Optimizer.
  • the fourth line creates an instance of an Optimizer (class for updating weights) subclass that implements the algorithm called Adam.
  • Adam's processing is to update for each element of weight by the mathematical expression described in “2-13” above.
  • a list of functions including the weights already defined in the first to third lines is passed to the setup method of the instance of the optimizer subclass generated in the fourth line. By executing this setup method, the internal state of the Optimizer subclass for updating the weight included in the Function list passed to this method is initialized.
  • the sixth line describes loading of input data. That is, the sixth line illustrates the process of reading the input data x and t from a file or the like.
  • x holds data with a large amount of information such as images and sounds
  • t holds a label ID corresponding to x (data with a small amount of information for answer matching).
  • the 7th line describes holding input data by Variable object. That is, in the seventh line, a Variable class object that holds input data is generated.
  • The "Define-by-Run" function is realized by the mutual dependence of Variable objects and Function objects. Therefore, in order for arbitrary input data to participate in this mechanism, a procedure is required in which the input data is explicitly held by an instance of the Variable class.
  • Lines 8 to 11 describe execution of forward processing. Specifically, in the 8th to 11th lines, the Forward process is executed by the description of a general programming language.
  • the “Define-by-Run” function generates a reference structure for backward processing simultaneously with the execution of this definition.
  • the instances of the Function class and the Variable class it is possible to express the correspondence between arbitrary processing and data. This is obvious because the Variable class represents data and the Function class represents processing.
  • a data structure expressing the backward calculation procedure shown in FIG. 2 using this reference structure is defined as a reference structure for backward processing.
  • The reference structure for backward processing grows every time a basic calculation on a Variable object (an arithmetic operation or a power) or a Function call taking a Variable object as an argument or return value is executed. Therefore, a reference structure for backward processing can be generated even for a description of forward processing that includes branches, loops, or calls to functions other than Functions and Variables.
  • Each basic calculation for Variable objects also has a Function subclass associated with it.
  • Line 12 describes the execution of backward processing. Specifically, the 12th line executes the backward process by calling the backward method of the loss variable (an instance of the Variable class) obtained as a result of executing the forward process executed in the 8th to 11th lines. .
  • the backward process is automatically executed in the reverse order of the forward process by following the reference structure for the backward process generated when the forward process is executed.
  • Line 13 describes the weight update. Specifically, the 13th line updates the weights using the weight gradients calculated as a result of executing the backward processing in the 12th line.
  • the update method of the instance of the Optimizer subclass is called as in the 13th line, the weight is updated using the weight gradient. Since the update method call for the Optimizer subclass and the backward method call for the Variable class are separate functions, it is also possible to update the weight after partially executing the backward processing. This is effective when it is not desired to update the weight for a function that has already been learned.
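  • The FIG. 13 listing itself is not reproduced in this text; the following sketch is reconstructed from the line-by-line description above and written against the legacy Chainer v1-style API (FunctionSet, Variable, optimizers). The layer sizes and the random stand-in data are assumptions.

```python
import numpy as np
import chainer.functions as F
from chainer import FunctionSet, Variable, optimizers

# Lines 1-3: register the weighted Functions (Linear instances l1, l2, l3)
# in a FunctionSet so that they can be updated by the Optimizer.
model = FunctionSet(l1=F.Linear(784, 100),
                    l2=F.Linear(100, 100),
                    l3=F.Linear(100, 10))

# Lines 4-5: create an Optimizer subclass implementing Adam and initialize
# its internal state for the weights held by the FunctionSet.
optimizer = optimizers.Adam()
optimizer.setup(model)

# Line 6: load the input data x and labels t (random stand-ins here).
x_data = np.random.rand(32, 784).astype(np.float32)
t_data = np.random.randint(0, 10, 32).astype(np.int32)

# Line 7: hold the input data in Variable objects.
x, t = Variable(x_data), Variable(t_data)

# Lines 8-11: forward processing; the reference structure for backward
# processing grows as each call is executed.
h1 = F.relu(model.l1(x))
h2 = F.relu(model.l2(h1))
y = model.l3(h2)
loss = F.softmax_cross_entropy(y, t)

# Line 12: backward processing (follows the reference structure in reverse).
optimizer.zero_grads()   # not part of the described listing; required by the API
loss.backward()

# Line 13: weight update using the calculated weight gradients.
optimizer.update()
```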
  • x ' represents a Variable object that is a copy of x
  • l1' represents a copy of l1 (shallow copy)
  • y represents a value (Variable object) returned by the forward method of l1 '
  • splitter represents an instance of a class that manages a branch of the network.
  • A shallow copy is a method of copying an object that does not copy the data that the object internally references. By making a shallow copy, for example, duplication of the weight data of a Function instance can be avoided.
  • the meaning of the arrow indicates the direction of reference of the object.
  • the description A ⁇ B means that a member of B object includes a reference to the object of A.
  • the above reference structure becomes a reference structure for backward processing as follows after execution of “F.relu (”.
  • The case where a reference structure for backward processing is generated when the code described in the eighth line is executed has been described above.
  • A similar reference structure is generated when the other lines of forward processing are executed.
  • That is, every time forward processing is executed, a reference structure for backward processing is generated through natural function calls.
  • the backward processing can be executed starting from h1.
  • the flow of processing executed by the backward processing system when actually executing backward processing from h1 is shown below.
  • Follow the relu' referenced by the h1 instance and call the Backward processing of relu'.
  • the input at this time is an error value held by h1, and the output result is stored in an instance of y ′.
  • Correspondence between data input and output by such Function instances is defined in the Forward process / Backward process defined for each Function subclass.
  • The splitter copies the error value held by y' to y (the reason why the splitter is inserted is described in the next section).
  • the input at this time is an error value held by y, and the output result is stored in an instance of x ′.
  • a weight error is also calculated. The weight error is calculated from the forward output value stored in x ′ and the error value held by y. In the same manner, backward processing ends when x is reached, which is the end point of the reference structure for backward processing.
  • The result of adding together the error values transmitted by the instance of the splitter to x' and x'' is set as the error of x.
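  • The mechanism described above can be summarized with a minimal, self-contained sketch (illustrative only; the class and method names below are not Chainer's, and the splitter handling is omitted): each Function call records itself as the creator of its output Variable during forward processing, and backward processing follows those references in reverse.

```python
class Variable:
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.creator = None          # Function that produced this Variable

    def backward(self):
        funcs = [self.creator]
        while funcs:
            f = funcs.pop()
            if f is None:
                continue
            f.backward()             # propagate the error one step backward
            for v in f.inputs:
                if v.creator is not None:
                    funcs.append(v.creator)

class Mul:                            # a toy Function subclass (forward/backward pair)
    def __call__(self, a, b):
        self.inputs = (a, b)
        out = Variable(a.data * b.data)
        out.creator = self            # grow the reference structure for backward
        self.output = out
        return out

    def backward(self):
        a, b = self.inputs
        gy = self.output.grad if self.output.grad is not None else 1.0
        a.grad = (a.grad or 0.0) + gy * b.data
        b.grad = (b.grad or 0.0) + gy * a.data

x = Variable(2.0)
y = Variable(3.0)
z = Mul()(Mul()(x, y), y)            # forward execution builds the structure
z.backward()                         # traversal in reverse order fills x.grad, y.grad
```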
  • FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • a block drawn by a dotted line shows an instance of a variable
  • a block drawn by a solid line shows a function.
  • When the seventh line is executed by the execution unit 130, an instance 30 of the variable x and an instance of the variable t are generated.
  • Only the instance 30 of the variable x is shown in FIG. 14, but in reality an instance of the variable t is generated in the same manner.
  • the variable x instance actually holds data such as images and sounds.
  • When the eighth line is executed by the execution unit 130, a neural network is generated in which the function "l1" 31, the function "relu" 32, and the instance 33 of the variable h1 are grown sequentially after the instance 30 of the variable x. Note that at the time the eighth line has been executed, the execution result of the forward processing described in the eighth line is already held by the instance 33 of the variable h1. Further, when the eighth line is executed, the reference structure for backward processing up to that point is generated, as described above.
  • When the ninth line is executed by the execution unit 130, a neural network is generated in which the function "l2" 34, the function "relu" 35, and the instance 36 of the variable h2 are grown sequentially after the instance 33 of the variable h1. Note that at the time the ninth line has been executed, the execution result of the forward processing described in the ninth line is already held by the instance 36 of the variable h2. Further, when the ninth line is executed, the reference structure for backward processing up to that point is generated, as described above.
  • The backward processing is executed by the execution unit 130 executing the 12th line. Since the reference structure for backward processing has already been generated, the execution unit 130 executes the backward processing based on that reference structure, whereby the weight gradient of each intermediate layer (only the intermediate layers having weights) can be calculated.
  • the execution unit 130 executes the 13th line. Thereby, the weight of each intermediate layer (however, only the intermediate layer having the weight) is updated using the weight gradient calculated by executing the 12th row. That is, learning is executed.
  • As described above, with the learning device 10 according to the present embodiment, a developer or the like can construct a neural network by describing, line by line, which variable instance is given to which function and which variable instance holds the execution result. Thereby, the developer or the like can describe the forward processing intuitively in the source code.
  • Further, the developer or the like only needs to describe the forward processing in the source code (without being aware of the backward processing); by causing the learning device according to the present embodiment to execute that source code, the backward processing is executed automatically.
  • FIG. 15 is a diagram illustrating an example of a source code described by Caffe according to the related art.
  • the definition of the layer (corresponding to the function in the present embodiment) is described in the block surrounded by ⁇ described immediately after the term “layer”.
  • the descriptions “top” and “bottom” represent the dependency between layers. “Bottom” represents from which layer the input to the layer is obtained, and “top” represents to which layer the processing result in the layer is output.
  • In contrast, in the present embodiment, a branch may be added to the code of FIG. 13 so that the layer executing the forward processing is switched according to the value of the variable t or the data size of the variable x.
  • Also, a variable value can be given as input data instead of the constant "10" in the ninth line.
  • FIG. 16 is a diagram illustrating another example of the source code input to the learning device according to the embodiment of the present invention. It should be noted that the source code illustrated in FIG. 16 is intentionally simplified for the purpose of explaining the characteristics of the learning device according to the present embodiment. Also, the number of lines shown at the left end in FIG. 16 is given for explaining this specific example, and is not included in the actual source code.
  • Next, it will be explained that, in the learning device according to the present embodiment, a neural network can easily be constructed using a control syntax (here, a for statement).
  • The fourth line describes that the processing described in the fifth to tenth lines is repeated while the value of i goes from 0 to 1000.
  • the fifth and sixth lines are the same as the sixth and seventh lines in the source code shown in FIG.
  • the seventh line describes that y, which is the processing result of the function l1 and the function relu, is added again to the argument of l1.
  • the eighth to tenth lines are the same as the eleventh to thirteenth lines in the source code shown in FIG.
  • FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG.
  • a block drawn by a dotted line shows an instance of a variable
  • a block drawn by a solid line shows a function.
  • FIG. 17 shows only the configuration of a neural network generated only when the variable i is 0 to 2 for convenience of explanation.
  • Every time the loop is executed, the same configuration of variable instances and functions is added to the network: the instance 51 of the variable x and the instance 50 of the variable y are given to the function 52 that combines them, the function "l1" 53 and the function "relu" 54 follow in sequence, and the output value of the function "relu" 54 is held in the instance of the variable y.
  • In other words, a growing neural network can be constructed with a simple control syntax (here, a for statement). That is, the source code used in the learning apparatus according to the present embodiment has a high affinity with the control syntax of the programming language.
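  • A hedged sketch in the spirit of the FIG. 16 listing, again written against the legacy Chainer v1-style API: the exact form of the feedback (x + y below) is an assumption based on the description of FIG. 17, and the layer sizes and random stand-in data are also assumptions.

```python
import numpy as np
import chainer.functions as F
from chainer import FunctionSet, Variable, optimizers

model = FunctionSet(l1=F.Linear(100, 100))   # square so that y can be fed back
optimizer = optimizers.Adam()
optimizer.setup(model)

y = Variable(np.zeros((32, 100), dtype=np.float32))        # assumed initial value of y
for i in range(1000):                                      # line 4: loop over i
    x = Variable(np.random.rand(32, 100).astype(np.float32))     # stand-in for lines 5-6
    t = Variable(np.random.randint(0, 10, 32).astype(np.int32))
    y = F.relu(model.l1(x + y))                            # line 7: y is fed back into l1
    loss = F.softmax_cross_entropy(y, t)                   # lines 8-10 correspond to
    optimizer.zero_grads()                                 #   lines 11-13 of FIG. 13
    loss.backward()
    optimizer.update()
```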
  • FIG. 18 is a schematic diagram conceptually showing a network configuration of a neural network generated by a source code described by Caffe according to the prior art.
  • In Caffe according to the prior art, the configuration of the neural network cannot be defined using a control syntax.
  • Therefore, a basic configuration as shown in FIG. 18 is first defined.
  • Next, the developer or the like must specifically describe the processing of giving the initial value of the instance 75 of the variable y to the function 72, and of giving the instance 75 of the variable y at the previous time and the instance 71 of the variable x at the current time to the function 72 (the portion indicated by a thick line in FIG. 18).
  • The developer or the like must write such a special description for each layer.
  • In contrast, with the learning device according to the present embodiment, the source code can be described easily, using the control syntax of the programming language and without such special descriptions. Therefore, according to the learning device of the present embodiment, even a complicated or large-scale neural network can be constructed easily and efficiently.
  • the learning device may be able to execute a function that cuts off a reference structure for backward processing. Specifically, when the unchain_backward method of an instance of the Variable class is called, the reference structure for backward processing toward the input side from that instance is cut off. For example, it is assumed that the following reference structure for backward processing is generated by executing forward processing (detailed configuration such as splitter is omitted).
  • A (input layer) ← Convolution2D ← B ← Linear ← C ← Softmax ← D (output layer)
  • A, B, C, and D represent Variable class instances
  • Convolution2D, Linear, and Softmax represent Function class instances.
  • FIG. 19 is a diagram showing still another example of the source code input to the learning device according to the embodiment of the present invention. Note that the number of lines described at the left end in FIG. 19 is given for explaining this specific example, and is not included in the actual source code.
  • the ninth line describes that the processing of the tenth to twelfth lines is executed every time the loop after the fourth line is executed 10 times.
  • the eleventh line calls unchain_backward and discards the reference structure for backward processing starting from loss. Thereby, the calculation time of the entire loop process can be kept short.
  • unchain_backward can be used for the purpose of not updating the weight for a specific function.
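  • As an illustration of the usage described above, the following sketch in the legacy Chainer v1 style shows a truncated-backpropagation loop; only the role of the unchain_backward call (line 11 of FIG. 19) is taken from the description, while forward_one_step, the loss accumulation, and the contents of the other lines are assumptions.

```python
# forward_one_step is a hypothetical helper that runs one step of forward
# processing and returns its loss as a Variable.
accum_loss = None
for i in range(1000):
    loss_i = forward_one_step(model, x_data[i], t_data[i])
    accum_loss = loss_i if accum_loss is None else accum_loss + loss_i
    if (i + 1) % 10 == 0:                 # every 10 iterations (line 9 of FIG. 19)
        optimizer.zero_grads()
        accum_loss.backward()             # backward over the last 10 steps
        accum_loss.unchain_backward()     # discard the reference structure
        optimizer.update()                #   starting from the loss (line 11)
        accum_loss = None                 # assumed reset of the accumulator
```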
  • the learning device can specify the volatile attribute when an instance of the Variable class is initialized. When the volatile attribute is valid, a reference structure for backward processing is not generated for the forward processing for inputting the variable.
  • The processing for generating the reference structure for backward processing is executed whenever forward processing is executed, so when only identification is performed (that is, when learning is unnecessary), this generation wastes both execution speed and memory.
  • When the volatile attribute is valid, the generation of the reference structure for backward processing is stopped, and only efficient forward processing is executed.
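  • A minimal usage sketch of the volatile attribute (the keyword-argument form shown here follows the legacy Chainer v1 API and is an assumption; model is as in the earlier sketches):

```python
x = Variable(x_data, volatile=True)   # no backward reference structure is built
y = F.relu(model.l1(x))               # efficient forward-only execution for identification
```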
  • The technology disclosed in this specification can be realized by executing source code written in Python or an equivalent programming language, and may instead be realized by executing a module or library written in a language other than Python or its equivalents.
  • The processes and procedures described in this specification can be realized not only by what is explicitly described in the embodiments but also by software, hardware, or a combination thereof. Specifically, they are realized by implementing logic corresponding to the processes on a medium such as an integrated circuit, a volatile memory, a nonvolatile memory, a magnetic disk, or optical storage. Further, the processes and procedures described in this specification can be implemented as a computer program and executed by various computers.
  • Even if the processes and procedures described herein are described as being executed by a single device, piece of software, component, or module, they may be executed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. Further, even if data, tables, or databases described in this specification are described as being stored in a single memory, they may be stored in a distributed manner in a plurality of memories included in a single device or in a plurality of devices. Further, the software and hardware elements described herein may be realized by integrating them into fewer components or by decomposing them into more components.
  • Part 2 Algorithm mounting method for embedded chip.
  • Background: Deep learning achieves high performance, but it is an algorithm that requires a large amount of computation, a large amount of memory, and a large amount of learning samples.
  • The problem to be solved by the embodiments of the present invention is to break through the barriers, which remain mainly in the software environment, to adapting deep learning to embedded environments and to accelerate development, by developing a framework for designing deep learning algorithms that operate on an embedded chip while satisfying product-level requirements.
  • In the present embodiment, the learning device according to the embodiment described in Part 1, which is a GPU-based framework that provides high productivity in developing deep learning algorithms, is functionally extended for the embedded environment. In the following, the problems for adaptation to the embedded environment are described with a focus on the learning device according to the embodiment.
  • The goal is to reach, in as short a period as possible, a state in which a new neural network (NN) algorithm designed on a personal computer or the like having abundant calculation resources operates on an arbitrary embedded chip (embedded semiconductor integrated circuit) while satisfying product-level requirements. For that purpose, it is desirable that the developer who designs the algorithm and the developer who is deeply aware of the hardware can work as independently as possible. In this embodiment, a technical idea regarding an apparatus (framework) that assists this is proposed.
  • Step I: A state in which code used in the learning device according to the embodiment (code written in Python, as an example) runs on a PC (+ GPU). In this state, the design and verification of an algorithm using a neural network having a complicated configuration are realized with little code description. This corresponds to the "Define-by-Run" method described above.
  • Step II: A state in which a chip-optimized implementation and Python code are mixed. In this state, the operation check and performance verification on the chip of the algorithm designed with the learning device according to the embodiment are realized with almost the same Python code.
  • Step III: A state in which the algorithm designed with the learning device according to the embodiment operates using only the implementation optimized for the chip. In this state, the algorithm operates while satisfying the product-level specification requirements of the chip (in other words, real-time cooperative operation with other on-chip modules and control mechanisms is possible).
  • The mounting method according to the present embodiment proposes a framework with which, when a new algorithm is developed with the learning device according to the embodiment, development can be completed in a short period by avoiding re-correction, redesign, and re-learning as much as possible between Steps I to III described above.
  • FIG. 20 is a schematic diagram for explaining Step I of the mounting method according to the embodiment of the present invention.
  • The configuration shown in FIG. 20 is premised on the learning device according to the embodiment described in Part 1. That is, in this configuration, source code written in Python (as one aspect of a programming language) is executed on a GPU and a general-purpose computer using PyCUDA and numpy (BLAS) (as aspects of libraries). Note that "Chainer" shown in FIG. 20 is the name given by the applicant to the framework for describing the source code used in the learning apparatus according to the embodiment described in Part 1.
  • FIG. 21 is a schematic diagram for explaining Step II of the mounting method according to the embodiment of the present invention.
  • the front end of Chainer is executed on Python.
  • In Step II, by providing a Native I/F (for example, an interface for calling implementations, written in a low-level language such as C, that are equivalent to the main functions of Chainer), the execution optimized for the embedded chip can be driven with the same code.
  • FIG. 22 is a schematic diagram for explaining a case where an execution unit based on Python and an execution unit based on a chip communicate with each other. As shown in Fig. 22, it is possible to remove the dependency on Python from the configuration on the embedded chip by providing a communication function in the implementation of the Native I / F (the optimization implementation on the embedded chip is driven from the Chainer on the PC) ).
  • FIG. 23 is a schematic diagram for explaining Step III of the mounting method according to the embodiment of the present invention. As shown in FIG. 23, a function for outputting the network definition and the weights from Chainer as bytecode is added. In addition, a virtual machine that interprets the bytecode and executes the neural network processing (forward processing, backward processing, weight update) is provided, and the chip-optimized implementations of the Native I/F can be reused by this virtual machine.
  • FIG. 42 is a diagram illustrating a configuration example of the Native I / F according to an embodiment of the present invention.
  • The Native I/F is a configuration that provides, for each NN algorithm, an interface that does not depend on the type of computer.
  • a processing system using the NN algorithm instructs a specific computer to execute the algorithm via this interface.
  • The interface here is a means for defining the format of the input data, the format of the output data, and the method of processing that relates the input data to the output data. If the interface is the same, the same output result is obtained for the same input. An example is a function written in the C language and its function declaration.
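  • As an illustration only (the library name libnative_nn.so and the function linear_forward below are hypothetical and not part of the Native I/F defined in this specification), the following Python sketch shows how a processing system might call a C-style Forward implementation through such a fixed interface.

```python
import ctypes
import numpy as np

# Load a hypothetical chip-optimized implementation exposed behind a C-style
# interface.  The same declaration is used regardless of which computer
# (general-purpose CPU, GPU, accelerator) provides the implementation.
native = ctypes.CDLL("libnative_nn.so")            # hypothetical library
native.linear_forward.restype = ctypes.c_int
native.linear_forward.argtypes = [
    ctypes.POINTER(ctypes.c_float),                # input data
    ctypes.POINTER(ctypes.c_float),                # weights
    ctypes.POINTER(ctypes.c_float),                # output data
    ctypes.c_int, ctypes.c_int,                    # input / output sizes
]

x = np.zeros(784, dtype=np.float32)
w = np.zeros((100, 784), dtype=np.float32)
y = np.zeros(100, dtype=np.float32)
as_ptr = lambda a: a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
status = native.linear_forward(as_ptr(x), as_ptr(w), as_ptr(y), 784, 100)
```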
  • the processing system on the side using the NN algorithm is not particularly limited.
  • the computer means a device that executes a calculation.
  • a computer is a device that includes a computing core, a memory hierarchy, and hardware resources necessary to perform a calculation.
  • a general-purpose computer means a commonly used computer. It is a computer in which conventional algorithms including Linux (registered trademark) OS and Python easily operate.
  • the accelerator here means a device that executes specific calculations including the calculation of the NN algorithm at high speed.
  • the GPU here is a computer specialized in image processing, but also capable of performing general-purpose calculations.
  • the GPU also includes a form of the accelerator. Since there are software assets such as CUDA, the ease of implementing the NN algorithm is about halfway between a general-purpose computer and a general accelerator.
  • FIG. 43 is a diagram illustrating a configuration example for executing identification / learning by NN according to an embodiment of the present invention.
  • the Native I / F has at least a Forward processing unit. With this configuration, the Native I / F can perform identification processing using the NN algorithm. Further, the Native I / F includes at least a forward processing unit, a backward processing unit, an internal state initialization processing unit of a weight update algorithm, and a weight update processing unit. With this configuration, the Native I / F can execute identification processing and learning processing using the NN algorithm. A Forward processing unit and a Backward processing unit are included for each layer algorithm.
  • the weight update algorithm internal state initialization processing unit and the weight update processing unit are included for each weight update algorithm.
  • the Native I/F includes a Forward processing call interface and a Backward processing call interface for each layer algorithm, and an internal state initialization processing call interface and a weight update processing call interface for each weight update algorithm.
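  • The pattern described above (one Forward/Backward pair per layer algorithm, one init/update pair per weight update algorithm) could be declared as in the following sketch; the concrete names, argument lists, and the OptimizerState type are assumptions, while the chnr_op_* naming mirrors the Optimizer module examples given later in this document.

    typedef struct MDArray MDArray;
    typedef struct OptimizerState OptimizerState;   /* internal state of the update rule */

    /* layer algorithm "convolution2d": one Forward and one Backward entry point */
    void chnr_convolution2d_forward(MDArray *out, const MDArray *in, const MDArray *w);
    void chnr_convolution2d_backward(MDArray *grad_in, MDArray *grad_w,
                                     const MDArray *grad_out, const MDArray *in,
                                     const MDArray *w);

    /* weight update algorithm "sgd": internal state initialization and update */
    void chnr_op_init_state_sgd(OptimizerState *state, const MDArray *w);
    void chnr_op_update_one_sgd(MDArray *w, const MDArray *grad_w, OptimizerState *state);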
  • the implementation that is called through the Native I / F has a Native I / F call management unit. With this configuration, the implementation that can be called through the Native I / F can change the implementation that can optimally execute the operation of the Native I / F according to the difference in the parameters of the Native I / F.
  • when there is no implementation that can optimally execute the operation, the call management unit of the Native I/F returns an error to the caller. Therefore, the implementation called through the Native I/F can select and execute an implementation that can optimally execute the operation.
  • FIG. 44 is a diagram showing a configuration example for managing a multidimensional array according to an embodiment of the present invention.
  • the Native I / F further has a multidimensional array management unit.
  • the multidimensional array management unit can execute at least one operation selected from the group including generation and destruction of multidimensional arrays, acquisition of attributes (number of axes, number of elements per axis), acquisition of aggregation results (total, average, variance, and the like for each axis), and the four arithmetic operations performed element by element on multidimensional arrays.
  • FIG. 45 is a diagram illustrating a configuration example of a data expression conversion unit according to an embodiment of the present invention.
  • the Native I / F has a data expression conversion unit.
  • the data representation conversion unit can mutually convert between a data representation that depends on a specific computer (device-dependent data representation) and a data representation that does not depend on a specific computer (device-independent data representation).
  • Configuration 1-2-2 (Configuration 2 for sharing data; + when having an external storage medium) Furthermore, the processing system that calls the Native I / F has an external storage medium.
  • the external storage medium can store weight data converted into device-independent data.
  • FIG. 46 is a diagram illustrating a configuration example of a communication unit according to an embodiment of the present invention.
  • the implementation that is called through the Native I/F has a communication unit.
  • the communication unit can communicate the call information of the Native I / F to the called implementation.
  • the implementation on the side called through the Native I/F can execute optimal communication processing.
  • the physical distance of the computer, the presence / absence of memory sharing, or the difference in communication protocol can be hidden from any processing system using the NN algorithm.
  • examples of the Native I/F that are unrelated to the presence or absence of communication of call information include an interface for executing a layer algorithm, an interface for executing a weight update algorithm, and an interface for executing data representation conversion.
  • FIG. 47 is a diagram illustrating a configuration example of a floating-point and fixed-point execution unit and a type conversion unit according to an embodiment of the present invention.
  • the Native I / F includes a type conversion unit, an NN algorithm execution unit for floating point, and / or an NN algorithm execution unit for fixed point.
  • a computer B having only a type conversion unit
  • a computer A having only a floating-point NN algorithm execution unit
  • a computer C having only a fixed-point NN algorithm execution unit.
  • the floating point type data generated by the computer A is transferred to the computer B. Subsequently, the data transferred from the computer A to the computer B is converted into fixed-point data by the computer B. Then, the fixed-point type data converted by the computer B is transferred to the computer C. Then, the fixed point type data transferred from the computer B becomes the input data of the computer C, and the entire operation of the NN algorithm is executed. Such steps can also be performed in reverse order.
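  • The flow among the computers A, B, and C described above can be sketched in C as follows; chnr_float_to_fixed follows the data conversion module described later in this document, while transfer_to_c and run_nn_fixed_on_c are hypothetical helpers standing in for the transfer to computer C and the execution of the NN algorithm there.

    /* Sketch of the A -> B -> C pipeline (floating point -> type conversion
       -> fixed point).  Error handling and the reverse direction are omitted. */
    typedef struct MDArray MDArray;

    void chnr_float_to_fixed(MDArray *dst, MDArray *src, int Q);  /* Native I/F (see below) */
    void transfer_to_c(MDArray *fixed_data);                      /* hypothetical transfer  */
    void run_nn_fixed_on_c(MDArray *fixed_data);                  /* hypothetical execution */

    void run_via_type_conversion(MDArray *float_data_from_a, MDArray *fixed_buf, int q_shift)
    {
        /* computer B: convert the floating-point data received from computer A */
        chnr_float_to_fixed(fixed_buf, float_data_from_a, q_shift);
        /* computer C: receives the fixed-point data and runs the NN algorithm */
        transfer_to_c(fixed_buf);
        run_nn_fixed_on_c(fixed_buf);
    }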
  • FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention. Furthermore, the implementation that is called through the Native I / F has a memory pool module. The memory pool module can realize dynamic memory management.
  • FIG. 49 is a diagram illustrating a configuration example of an algorithm execution unit in which a plurality of NN algorithms according to an embodiment of the present invention are fused. Furthermore, the Native I/F has an algorithm execution unit that fuses a plurality of NN algorithms. This algorithm execution unit simultaneously executes the plural algorithms for frequently occurring combinations of NN algorithms.
  • FIG. 50 is a diagram illustrating a configuration example of a multi-dimensional array data communication amount reduction unit according to an embodiment of the present invention. Further, the implementation called through the Native I / F has a multidimensional array data compression / decompression unit. The multi-dimensional array data compression / decompression unit is provided in the communication unit.
  • FIG. 51 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • FIG. 53 is a diagram illustrating a configuration example of a bytecode generation unit and a virtual machine according to an embodiment of the present invention.
  • the Chainer execution unit has a byte code generation unit.
  • the bytecode generation unit receives the Backward calculation procedure and the weights as input and outputs them as bytecode. For example, the bytecode generator is provided in Chainer's Python layer.
  • the Native I / F has a virtual machine.
  • the virtual machine interprets the bytecode and executes NN algorithm processing.
  • the NN algorithm processing here is any one of forward processing, backward processing, and weight update, or a combination thereof.
  • FIG. 54 is a diagram illustrating a configuration example of a comparison unit according to an embodiment of the present invention. Further, the Chainer execution unit has a comparison unit. The comparison unit compares the input/output results of the existing execution unit and the Native layer execution unit that correspond to the same NN algorithm, or compares with each other the input/output results of Native layer execution units that call different implementations of the same Native I/F.
  • FIG. 55 is a diagram illustrating a configuration example of a function synthesis unit according to an embodiment of the present invention.
  • the Chainer execution unit has a function synthesis unit.
  • the function synthesis unit receives the Backward calculation procedure as input, and replaces a combination of Function class instances that can be handled by a "Native I/F that executes multiple algorithms simultaneously" with a single Function class instance corresponding to that "Native I/F that executes multiple algorithms simultaneously".
  • the above replacement is not performed if there is no “Native I / F that executes multiple algorithms simultaneously”.
  • the replacement here can be executed by partial match search when the Backward calculation procedure is regarded as a character string.
  • the function synthesis unit is provided in the Python layer of Chainer.
  • Configuration 4 (Configuration of optimization device that specializes forward processing execution)
  • Configuration 4-1 (Configuration 1 of optimization device specialized for forward processing execution; with weight optimization processing means) Further, the Chainer execution unit includes a weight optimization processing unit.
  • the weight optimization processing means executes a weight optimization process suitable for the Function class.
  • Configuration 4-2 (Optimization device configuration 2 specializing forward processing execution; with data memory area reuse means)
  • the Chainer execution unit and the Native I/F have means for reusing the data memory area.
  • the data memory area reuse means reuses a memory area for data input / output between layers.
  • the reuse means is provided in the Forward processing execution unit or in the virtual machine. For example, a flag indicating that only Forward processing is to be executed is provided among the arguments of the interface (defined by the Native I/F) that executes the Forward processing of the virtual machine. This processing is executed either when the volatile attribute is specified for the Variable that is input to a Chainer Function class instance, or when the flag indicating execution of only Forward processing is valid at the time the Forward processing of the virtual machine is executed.
  • Action 1 (Operation by the configuration of NativeIF)
  • the division of labor between developers who design and use NN algorithms and developers who are deeply aware of the hardware configuration of computers will be easier.
  • since the Native I/F guarantees that the interface for each NN algorithm to be executed is identical, a developer who designs and uses the algorithm can use various computers without changing the software that calls the Native I/F.
  • it becomes possible to select a computer based on more essential criteria such as its price and its strengths and weaknesses in a specific application.
  • by providing an optimized implementation for a computer that supports the Native I/F, a wide range of NN algorithm users can be expected to use that computer.
  • Action 1-1 (Effects of the configuration for executing identification and learning by NN) Developers who design and use the NN algorithm can realize the entire operation of the NN algorithm by calling the interfaces provided by the Native I/F from any processing system that uses the NN algorithm. In addition, a developer who designs and uses an NN algorithm can realize the entire operation of the NN algorithm using the implementation that is optimal for the computer being used, without being aware of the specific configuration of that computer.
  • Action 1-1-1 (Operation by Configuration 1 for executing identification / learning by NN; in the case of a configuration managing a multidimensional array (multidimensional array management unit))
  • a developer who designs and uses an NN algorithm can execute any combination of NN algorithms without performing unnecessary data conversion processing when executing the entire operation of the NN algorithm. At this time, by checking the aggregation result of the contents of the multidimensional array that is the processing result of an arbitrary NN algorithm, it is possible to confirm whether the NN algorithm performs the calculation as intended.
  • Action 1-2 (Operation by configuration for sharing data)
  • Action 1-2-1 (Operation by configuration 1 for sharing data; in the case of the data representation conversion unit) Information unique to each computer can be hidden.
  • Action 1-2-2 (Operation by configuration 2 for sharing data; + when having an external storage medium) After the weight data is converted into a device-independent data representation and stored in an external storage medium, identification processing can be executed on any computer using weights learned on a specific computer.
  • Action 1-2-3 (Operation by configuration 3 for sharing data; + when having a communication unit) Regardless of the hardware configuration of the computers, their physical distance, and the presence or absence of memory sharing, the data necessary to realize the entire operation of the NN algorithm can be exchanged. It is also possible, from a computer on which the processing system using the NN algorithm can operate, to call an NN algorithm implementation implemented on a computer on which that processing system cannot operate. Therefore, the entire operation of the NN algorithm can be realized using a plurality of computers connected to a computer network.
  • Action 2-1 (Operation by configuration 1 of the extended version of Native I / F; type conversion unit and NN algorithm execution unit for floating point and / or NN algorithm execution unit for fixed point)
  • the overall operation of the NN algorithm can be realized using a data type suitable for each computer.
  • the entire algorithm operation of the NN can be realized by using floating point arithmetic or fixed point arithmetic.
  • the computer A transfers the floating point type data generated by the floating point NN algorithm execution unit of the computer A to the computer B.
  • the computer B converts the floating-point type data transferred from the computer A into fixed-point type data by the type conversion unit, and then transfers the fixed-point type data to the computer C.
  • the computer C transfers the fixed-point type data generated by the fixed-point NN algorithm execution unit of the computer C to the computer B.
  • the computer B converts the fixed-point type data transferred from the computer C into floating-point type data by the type conversion unit, and then transfers the floating-point type data to the computer A.
  • Action 2-2 (Operation by configuration 2 of extended version of Native I / F; with memory pool module)
  • the entire operation can be realized in a lightweight manner, with fewer calls to costly memory management functions.
  • Action 2-3 (Operation by configuration 3 of the extended version of the Native I/F; when having an algorithm execution unit that fuses multiple NN algorithms) Unnecessary access to global memory can be avoided, and the overhead of function calls can be reduced. Therefore, frequently occurring combinations of NN algorithms can be executed at high speed.
  • Action 2-4 (Operation by configuration 4 of the extended version of Native I / F; multi-dimensional array data compression / decompression unit)
  • the amount of data communication of a multidimensional array can be reduced. Therefore, the operation speed can be improved.
  • Action 3 (Operation by Native I / F + Chainer execution unit)
  • Action 3-1 (Operation by configuration 1 of Native I / F + Chainer execution unit; with bytecode generation unit and virtual machine) Since the Chainer execution unit has a bytecode generation unit and the Native I / F has a virtual machine, the dependence on advanced libraries and programming languages can be reduced. Therefore, even in various computers including poor execution environments such as accelerators, the entire operation of the NN designed by Chainer can be executed while satisfying the product level requirements.
  • Action 3-2 (Operation by configuration 2 of Native I / F + Chainer execution unit; with comparison unit)
  • the comparison unit compares the input/output results of the existing execution unit and the Native layer execution unit that correspond to the same NN algorithm, or compares with each other the input/output results of Native layer execution units that call different implementations of the same Native I/F.
  • by having such a comparison unit, it is possible to compare the accuracy of the processing result of the floating-point NN algorithm execution unit with that of the fixed-point NN algorithm execution unit. It is also possible to compare the processing result of an execution unit that has been sufficiently tested to calculate the NN algorithm correctly with the processing result of a newly created Native layer implementation. It can therefore be verified that the newly created Native layer implementation calculates the NN algorithm correctly.
  • Action 3-3 (Operation by configuration 3 of Native I / F + Chainer execution unit; with function composition unit)
  • the function synthesis unit receives the Backward calculation procedure as input, and replaces a combination of Function class instances that can be handled by a "Native I/F that executes multiple algorithms simultaneously" with a Function class instance corresponding one-to-one to that Native I/F. If there is no "Native I/F that executes multiple algorithms simultaneously", the function synthesis unit does not perform the above replacement.
  • the Backward calculation procedure is automatically processed regardless of the presence or absence of “Native I / F that executes multiple algorithms simultaneously”.
  • the function synthesis unit can convert Convolution2D + BatchNormalization into a single Convolution2D by adjusting the weights and biases. The same applies to conversion from Linear + BatchNormalization to a single Linear.
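  • A sketch of the weight and bias adjustment behind the Linear + BatchNormalization fusion mentioned above is shown below; the flat row-major weight layout and the parameter names are assumptions introduced for illustration.

    /* Fold a BatchNormalization that follows a Linear layer (y = Wx + b) into
       W and b, producing a single equivalent Linear layer. */
    #include <math.h>

    void fold_batchnorm_into_linear(float *W, float *b,          /* W: out_dim x in_dim, row-major */
                                    const float *gamma, const float *beta,
                                    const float *mean, const float *var,
                                    float eps, int out_dim, int in_dim)
    {
        for (int i = 0; i < out_dim; ++i) {
            float s = gamma[i] / sqrtf(var[i] + eps);   /* per-output scale */
            for (int j = 0; j < in_dim; ++j)
                W[i * in_dim + j] *= s;                 /* scale each weight row */
            b[i] = s * (b[i] - mean[i]) + beta[i];      /* fold shift into bias  */
        }
    }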
  • Memory can be reduced by reducing the amount of weight information in forward processing or the amount of data memory of input data and executing forward processing. Further, the amount of calculation can be reduced by reducing the number of weight elements or executing the forward processing without calculating zero weight.
  • Action 4-1 (Operation by the optimization device 1 specializing forward processing execution; when there is a weight optimization processing means)
  • weight optimization processing can be executed for any instance of the Function class included in the learned network configuration.
  • since the weight optimization processing can be executed, the memory and the amount of calculation in the Forward processing can be reduced. As a result, the entire operation of the NN algorithm can be executed at high speed.
  • Action 4-2 (Operation by the optimizing device 2 specializing forward processing execution; in the case of having a means for reusing the data memory area)
  • by giving a flag indicating execution of only the Forward processing as an argument to the execution unit of the Forward processing (Chainer or the virtual machine), the memory used during Forward processing can be reduced. As a result, the entire operation of the NN algorithm can be executed at high speed.
  • FIG. 24 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (first method) according to an embodiment of the present invention.
  • a mounting apparatus according to an embodiment mainly includes an evaluation board (motherboard) 100 and an embedded chip (embedded semiconductor integrated circuit) 200 detachably mounted on the evaluation board 100.
  • the evaluation board 100 mainly includes a CPU 101, a main memory 102, a communication I / F 103, and an external memory 104. Each of these components is electrically connected via an internal bus 109.
  • the CPU 101 loads various programs such as an operating system from the external memory 104 into the main memory 102, and executes instructions included in the loaded programs.
  • the main memory 102 is used for storing a program executed by the CPU 101, and is constituted by, for example, a DRAM.
  • the communication I/F 103 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination thereof, and can communicate, via a communication network (not shown) including Ethernet (registered trademark) and the Internet, with a computer and an input/output device (not shown) operated by a developer or the like.
  • the communication I / F 103 can also communicate with a communication I / F 204 described later of the embedded chip 200.
  • the external memory 104 is configured by a flash memory, for example, and stores various programs such as an operating system.
  • the embedded chip 200 includes a CPU 201, an accelerator (auxiliary arithmetic unit) 202, a main memory 203, a communication I / F 204, and an external memory 205. These components are electrically connected via an internal bus 209.
  • the embedded chip can optionally include a GPU (not shown).
  • the CPU 201 loads the source code (for example, source code written in Python) received from the evaluation board 100 (communication I/F 103) via the communication I/F 204 into the main memory 203, and executes each code included in the loaded source code.
  • the accelerator 202 loads the source code (source code written in C language, assembler, or the like) received from the evaluation board 100 (communication I/F 103) via the communication I/F 204 into the main memory 203, and executes each code included in the loaded source code.
  • the main memory 203 is used for storing source code executed by the CPU 201 and the accelerator 202, and is configured by a DRAM, for example.
  • the communication I / F 204 communicates with the communication I / F 103 of the evaluation board 100 to transmit / receive various information.
  • the external memory 205 is configured by, for example, a flash memory and stores various data.
  • FIG. 25 is a flowchart showing an example of a procedure used in the mounting method according to the embodiment of the present invention.
  • first, in step 301, source code described in a first programming language (for example, Python) is executed on a personal computer or the like.
  • the developer confirms whether the source code operates on a personal computer or the like based on the execution result.
  • the personal computer or the like refers to a computer having abundant calculation resources, and includes, for example, the learning apparatus according to the embodiment described in the first part.
  • the state in which the source code operates in the personal computer or the like in step 301 is the same state as step I described in “4-1” above.
  • in step 302, using the evaluation board 100, the CPU 201 of the embedded chip 200 is caused to execute the source code written in Python or the like that was confirmed in step 301 to operate on a personal computer or the like.
  • the developer confirms whether or not the source code is operable by the CPU 201 based on the execution result.
  • Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104.
  • the source code described in Python or the like can be passed to the CPU 201 via the communication I / F 103 of the evaluation board 100 and the communication I / F 204 of the embedded chip 200. If it is determined that the source code is not operable by the CPU 201, the developer corrects the source code and repeats step 302. When it is confirmed that the source code is operable by the CPU 201, the developer or the like proceeds to the next step 303.
  • in step 303, the developer or the like rewrites at least a part of the source code that was confirmed in step 302 to be operable by the CPU 201 into a second programming language (for example, C language or assembler) so that it can be operated by the accelerator 202.
  • in step 304, the source code rewritten into C language or the like in step 303 is executed by the accelerator 202 of the embedded chip 200 using the evaluation board 100.
  • the developer confirms whether or not the rewritten source code is operable by the accelerator 202 based on the execution result. Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104.
  • the source code described in C language or the like can be passed to the accelerator 202 via the communication I / F 103 of the evaluation board 100 and the communication I / F 204 of the embedded chip 200. If it is determined that the source code is not operable by the accelerator 202, the developer corrects the source code and repeats step 304. When it is confirmed that the source code is operable by the accelerator 202, the developer or the like proceeds to the next step 305.
  • in step 305, in the evaluation board 100, the result of the CPU 201 executing the first specific code (the code to be verified) included in the source code described in Python or the like and the result of the accelerator 202 executing the second specific code, which is included in the source code described in C language or the like and is the first specific code rewritten from Python or the like into C language or the like, are compared (for example, using a module called a unit test executed by the embedded chip 200), and the comparison result is output.
  • the developer verifies whether the same output is obtained for the same input in both execution results.
  • Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104.
  • until this verification is completed, the developer or the like repeats steps 303 to 305 described above.
  • when this verification is completed, the developer or the like moves to the next step 306.
  • in step 306, the developer or the like tunes the source code written in C language or the like, which was verified in step 305, so that it operates at a higher speed on the accelerator 202.
  • in step 307, in the evaluation board 100, the result of the CPU 201 executing the source code described in Python or the like and the result of the accelerator 202 executing the source code described in C language or the like tuned in step 306 are compared (for example, using a module called a unit test executed by the embedded chip 200), and the comparison result is output. Based on the comparison result, the developer verifies whether the same output is obtained for the same input in both execution results. Such an operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Until this verification is completed, the developer or the like repeats step 306 and step 307 described above. When this verification is completed, the developer or the like moves to the next step 308.
  • in the state where step 307 is completed, the embedded chip 200 is operated by two kinds of source code, one written in Python or the like and the other in C language or the like. This state will be described with reference to FIG. 26.
  • FIG. 26 is a schematic diagram showing an operation state of the embedded chip in the mounting method according to the embodiment of the present invention.
  • in the state of step 301 (which corresponds to step I), the function calling side (that is, the main body that calls functions) is described in Python or the like, and the called side (that is, the functions that are called) is also described in Python or the like.
  • in the state where step 307 is completed, the function calling side is still described in Python or the like, while the called side is a mixture of functions written in Python or the like and functions written in C language or the like. That is, in this state, the embedded chip 200 is operated by two kinds of source code, one in Python or the like and the other in C language or the like.
  • the final goal of the mounting method according to the present embodiment is, as shown at the right end of FIG. 26, a state in which both the calling side and the called side are described in C language or the like, that is, a state in which the embedded chip 200 is operated only by source code described in C language or the like.
  • in step 308, so that the embedded chip 200 operates only with source code written in C language or the like, the developer or the like rewrites into C language or the like all parts of the source code written in Python or the like that have not yet been rewritten.
  • when step 308 is completed, the dependency of the embedded chip 200 on Python is removed.
  • the source code described in C language or the like generated in this way is stored in the external memory 205 or the like of the embedded chip 200.
  • the embedded chip 200 can read the source code stored in the external memory 205 or the like and cause the accelerator 202 to execute the machine learning.
  • this state is the state targeted by the mounting method according to the embodiment, in which the problems described in "1" and "2" above have been solved.
  • FIG. 27 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (second method) according to an embodiment of the present invention.
  • the mounting device (FIG. 27) used in the second method is different from the mounting device (FIG. 24) used in the first method in that the embedded chip 200 does not include a CPU.
  • the operation performed by the CPU 201 of the embedded chip 200 in the first method is performed by a CPU provided in a computer (such as a personal computer) (not shown) provided outside.
  • the computer (such as a personal computer) may be, for example, the learning device such as the personal computer illustrated in FIG. 11 described in the first part.
  • the evaluation board 100 shown in FIG. 27 is communicably connected, for example via the communication I/F 103, to the computer (not shown) provided outside, so that it can cause the CPU provided in that computer to execute the source code written in Python and receive the execution result.
  • a module is a collection of procedures and data defined and implemented to achieve a specific purpose (a concept that is independent of the support of a specific programming language)
  • a class is a module defined and implemented with the support of an object-oriented language such as Python
  • the Native layer refers to the layer of the Native I / F and the implementation (software and hardware) called from it.
  • the Python layer is a software layer that is supposed to be executed on the Python language. Currently, Chainer is written in the Python language, but Chainer may be ported to another programming language in the future. The functions described here as belonging to the Python layer therefore do not necessarily mean specialization in the Python language. As a division of roles between the Python layer and the Native layer, the Python layer assumes a development environment with a high level of abstraction that is more suitable for algorithm design, and the Native layer assumes a development environment with a low level of abstraction that is more conscious of the hardware configuration.
  • FIG. 52 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
  • the execution part is a method of the Function / Optimizer class for actually calculating the neural network algorithm.
  • the existing execution unit is a general-purpose computer execution unit and / or a GPU execution unit.
  • the general-purpose computer execution unit calculates the NN algorithm using the general-purpose computer.
  • the GPU execution unit calculates the NN algorithm using the GPU.
  • the Native execution unit calculates the NN algorithm using the Native layer implementation. Since the Native layer is implemented for each type of computer, all computer types (general-purpose computers, GPUs, accelerators) can operate through Native I / F.
  • FIG. 28 is a schematic diagram conceptually showing functions of a mounting apparatus according to an embodiment of the present invention.
  • the mounting apparatus 400 mainly includes a drive unit 401, a Function class / Optimizer class 402, a general-purpose computer execution unit 403, a GPU execution unit 404, a Native layer execution unit 405, a multidimensional array 406 for general-purpose computers, a multidimensional array 407 for GPU, a multidimensional array 408 for Native, and a Variable class 409.
  • the drive unit 401 mainly includes an execution unit that instructs the Function class / Optimizer class 402 to execute an algorithm (function), and a comparison unit that compares the execution result of the algorithm (function) by the general-purpose computer execution unit 403 (or the GPU execution unit 404) with the execution result by the Native layer execution unit 405, for example using a module referred to as a unit test, and outputs the comparison result.
  • the Function class / Optimizer class 402 causes at least one of the general-purpose computer execution unit 403, the GPU execution unit 404, and the Native layer execution unit 405 to execute the algorithm (function) instructed from the drive unit 401.
  • the general-purpose computer execution unit 403 acquires, from the multidimensional array 406 for general-purpose computers, a multidimensional array corresponding to the algorithm (function) instructed by the Function class / Optimizer class 402, and executes the algorithm (function) using the CPU.
  • the execution result is returned to the drive unit 401 via the Function class / Optimizer class 402.
  • the GPU execution unit 404 acquires a multidimensional array corresponding to the algorithm (function) instructed from the Function class / Optimizer class 402 from the GPU multidimensional array 407, and executes the algorithm (function) using the GPU.
  • the execution result is returned to the drive unit 401 via the Function class / Optimizer class 402.
  • the Native layer execution unit 405 acquires, from the multidimensional array 408 for Native, a multidimensional array corresponding to the algorithm (function) instructed by the Function class / Optimizer class 402, and executes the algorithm (function) using an accelerator.
  • the execution result is returned to the drive unit 401 via the Function class / Optimizer class 402.
  • the Variable class 409 holds all the multidimensional arrays used by the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPU, and the multidimensional array 408 for Native, and supplies the corresponding multidimensional arrays to each of them.
  • the general-purpose computer execution unit 403 executes the algorithm (function) using the CPU 201 mounted on the embedded chip 200.
  • the GPU execution unit 404 executes the algorithm (function) using a GPU (not shown) mounted on the embedded chip 200.
  • the Native layer execution unit 405 executes the algorithm (function) mainly using the accelerator 202 mounted on the embedded chip 200.
  • the Function class / Optimizer class 402, the general-purpose computer execution unit 403, the GPU execution unit 404, the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPU, and the Variable class 409 are arranged in a computer (such as a personal computer) provided outside.
  • the implementation of the Native layer is still arranged in the embedded chip 200.
  • the general-purpose computer execution unit 403 executes an algorithm (function) using the CPU of the computer provided outside, and the GPU execution unit 404 uses the GPU of the computer provided outside. Execute algorithm (function).
  • FIG. 29 is a schematic diagram illustrating a configuration example of a Native layer execution unit included in the mounting apparatus according to an embodiment of the present invention.
  • the Native layer execution unit 405 mainly includes a NativeDevice class 501, a NativeArray class 502, a Function class / Optimizer class 503, and a byte code generation unit 504 in the Python layer.
  • the Function class / Optimizer class 503 shown in FIG. 29 and the Function class / Optimizer class 402 shown in FIG. 28 are the same component, and the NativeArray class 502 shown in FIG. 29 and the multidimensional array 408 for Native shown in FIG. 28 are the same component.
  • the Native layer execution unit 405 mainly includes, in the Native layer, a device management module 505, a data conversion module 506, a multidimensional array module 507, a Function module / Optimizer module 508, a virtual machine module 509, and a memory pool module 510.
  • the NativeDevice class 501 wraps the device management module of the Native layer in the Python layer and hides function calls to and data input/output with the Native layer.
  • the NativeArray class 502 wraps a multi-dimensional array in the Native layer with a Python layer.
  • the Function class wraps the Function module in the Native layer in the Python layer
  • the Optimizer class wraps the Optimizer module in the Native layer in the Python layer.
  • the Function class and the Optimizer class have been implemented in Chainer, and have a function of hiding the difference in execution between the general-purpose computer and the GPU. Execution in the Native layer can be hidden by extending this function.
  • the byte code generation unit generates a byte code. The details of the constituent elements illustrated in FIG. 29 will be described later.
  • deep learning is a developing technology under active research and development. It is therefore anticipated that, during the development period for an embedded chip, a new layer algorithm with better performance than conventional ones will be invented, and that there will be a demand to incorporate the new algorithm into the software or hardware implementation under development.
  • the goal is a state in which the configuration of the neural network including the new layer algorithm satisfies the product-level specifications in the embedded environment.
  • in step 2, the algorithm verified by implementation in step 1 is combined with the neural network modules that have already been optimized and implemented on the embedded chip, and the operation is verified. In step 3, based on the result, optimization specialized for the corresponding chip is applied.
  • when the neural network algorithm is operated on Python, the mounting apparatus according to the embodiment is configured so that an execution unit using the Python language that operates on a general-purpose computer, an execution unit using a GPU, and an execution unit using the optimized implementation for a specific chip can each be called for each layer, and so that, by way of the bytecode, the entire neural network algorithm can be operated using only the optimized implementation for the specific chip.
  • between steps 1 and 2 described in the previous paragraph, the algorithm implementation code created in step 1 can be reused in step 2, and differences in the operation results between steps 1 and 2 can be easily compared.
  • the optimized implementation created for step 2 can be reused in step 3, and conversely, corrections of defects in the optimized implementation found in step 3 can also be reused in step 2. As a result, a state in which the configuration of the neural network including the new layer algorithm satisfies the product-level specifications in the embedded environment can be realized with the minimum development cost.
  • “Overall operation” represents a processing unit in which forward processing alone or forward processing, backward processing, and weight update processing are repeatedly executed. This overall operation is envisaged as a typical neural network learning and identification embodiment.
  • the device management module 505 performs initialization and release processing of devices (software and hardware states that depend on the optimization implementation). The specific processing performed in the device management module 505 differs depending on the form of the device; a typical example is securing or releasing the memory pool described later.
  • the device need not be on the same chip or on the same board as the general-purpose computer that runs Chainer and Python. It is possible to implement an optimized implementation that communicates with a device on another board and initializes / releases it.
  • Example 1 Device * chnr_init_device (void) As a result, the device can be initialized.
  • Example 2 void chnr_release_device (Device * device) As a result, the device can be released.
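  • A minimal usage sketch of the two device management functions above is shown below; what happens between initialization and release, and the error handling, are assumptions that depend on the optimization implementation.

    #include <stddef.h>

    typedef struct Device Device;                 /* defined by the Native I/F implementation */
    Device *chnr_init_device(void);               /* (Example 1) above */
    void    chnr_release_device(Device *device);  /* (Example 2) above */

    void device_lifecycle_example(void)
    {
        Device *dev = chnr_init_device();   /* e.g. secures the memory pool */
        if (dev != NULL) {
            /* ... execute Forward/Backward/weight update processing here ... */
            chnr_release_device(dev);       /* e.g. releases the memory pool */
        }
    }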
  • Specific processing contents in each function include those exemplified in “2-6” to “2-12” in the first part.
  • the multidimensional array module 507 manages a multidimensional array that is input and output between the functions of the Native layer.
  • the multidimensional array module 507 can manage an arbitrary size and number of dimensions. Further, as will be described later, the multidimensional array module 507 has a mechanism for mutual conversion with Numpy (multidimensional array class of python layer on which Chainer depends) and a multidimensional array library for GPU. Furthermore, the multidimensional array module 507 can hold not only floating point types but also fixed point types. As a result, the calculation of the neural network can be easily realized even with hardware having no FPU (floating point arithmetic unit).
  • the multidimensional array module 507 has a function of mutual conversion with a floating-point multidimensional array.
  • FIG. 30 shows an example of the structure definition of the multidimensional array module of the mounting apparatus according to the embodiment of the present invention.
  • function definition examples are as follows.
  • (Example 1) MDArray chnr_create_md_array (dimensions [], numaxis, type)
  • a multidimensional array can be generated and initialized.
  • Example 2 void chnr_delete_md_array (MDArray * mdrray)
  • a multidimensional array can be deleted.
  • Example 3 void chnr_md_array_add (MDArray * dst, MDArray * a, MDArray * b)
  • the elements of the multidimensional array can be added.
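  • Since FIG. 30 itself is not reproduced in this text, the following is only an illustrative guess at what such a structure definition could contain, based on the attributes listed above (number of axes, number of elements per axis, data type, fixed or floating point); the field names and the fixed axis limit are assumptions.

    typedef enum { CHNR_TYPE_FLOAT32, CHNR_TYPE_FIXED_POINT } ChnrDataType;

    typedef struct MDArray {
        void        *data;           /* contiguous element buffer                    */
        int          numaxis;        /* number of axes (dimensions)                  */
        int          dimensions[8];  /* number of elements per axis (8 is arbitrary) */
        ChnrDataType type;           /* float32 or fixed point                       */
        int          q_shift;        /* fractional-bit count for the fixed-point case */
    } MDArray;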
  • generation and destruction (management) of the memory area that stores the entities of multidimensional arrays is realized in the Native layer.
  • as a hardware environment, an environment having a memory configuration that cannot be managed by the memory management mechanism (malloc/free) provided as standard by a Linux (registered trademark) OS or the like can be considered.
  • considering the division of roles in the software hierarchy, in which the Python layer is responsible for algorithm development and the Native layer for development that is strongly aware of the hardware, it is appropriate to implement in the Native layer the management mechanism that handles the characteristics of such hardware environments. This memory management mechanism can be reused when the virtual machine described later is used (that is, when the dependency on the Python layer is removed).
  • a class that wraps the multi-dimensional array in the Native layer is prepared in the Python layer, and the generation / release timing of the memory area is matched with the instance lifetime of this Python class. Such a mechanism is necessary in order to handle multidimensional arrays naturally in Python code.
  • the “Define-by-Run” function also depends on the Python memory management mechanism.
  • Fig. 31 shows the mutual compatibility and reference relationship of multidimensional array data.
  • the memory pool module 510 is a mechanism for reducing the number of calls to a costly memory management mechanism (costly in, for example, the number of processing cycles) by reusing memory areas that have been secured once.
  • An example of function definition is as follows. (Example 1) void chnr_momory_pool_init (MemoryPool * momory_pool) Thereby, the memory pool can be initialized. (Example 2) void chnr_momory_pool_release (MemoryPool * momory_pool) Thereby, the memory pool can be discarded.
  • Example 3 void * chnr_momory_pool_alloc_buffer (MemoryPool * momory_pool, int byte_size, void * old_addr) Thereby, a memory can be secured.
  • Example 4 void chnr_momory_pool_free_buffer (MemoryPool * momory_pool, void * addr) Thereby, the memory can be released.
  • Memory pool implementation example (1) An example of structure definition is shown in FIG.
  • the processing flow when securing memory is as follows. 1. Search the buffer_size array for an index whose released flag is 1 and whose previously secured size matches the size to be secured this time, and return the value of buffer_addr (the memory buffer address) for that index.
  • the released flag is managed, for example, by the sign bit of the buffer_size array element. By searching for the array element based on the combination of the size and the address obtained the previous time, changes of address can be reduced. 2. If no matching index is found, memory is actually allocated (by calling malloc or the like), its address and size are added to the arrays, and the address is returned.
  • Memory pool implementation example (2) As a process for releasing the memory, the address to be released is searched from the buffer_addr array, and if the address is found, the released flag is set to 1. As a process when releasing the memory pool, the memory is released (such as calling the free function) for the element whose address is set from the buffer_addr array.
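  • The allocation and release flow described above can be condensed into the following sketch; it differs from the structure of FIG. 32 in that the released flag is kept as a separate array rather than in the sign bit of buffer_size, and the fixed capacity and the simplified search (no previous-address matching) are assumptions.

    #include <stdlib.h>

    #define POOL_MAX 256

    typedef struct MemoryPool {
        void  *buffer_addr[POOL_MAX];
        size_t buffer_size[POOL_MAX];
        int    released[POOL_MAX];
        int    count;
    } MemoryPool;

    void *pool_alloc(MemoryPool *p, size_t byte_size)
    {
        /* 1. reuse a released buffer whose size matches */
        for (int i = 0; i < p->count; ++i) {
            if (p->released[i] && p->buffer_size[i] == byte_size) {
                p->released[i] = 0;
                return p->buffer_addr[i];
            }
        }
        /* 2. otherwise actually allocate and register the buffer */
        if (p->count == POOL_MAX) return NULL;
        void *addr = malloc(byte_size);
        if (addr != NULL) {
            p->buffer_addr[p->count] = addr;
            p->buffer_size[p->count] = byte_size;
            p->released[p->count] = 0;
            p->count++;
        }
        return addr;
    }

    void pool_free(MemoryPool *p, void *addr)
    {
        /* releasing only marks the buffer as reusable */
        for (int i = 0; i < p->count; ++i)
            if (p->buffer_addr[i] == addr) { p->released[i] = 1; return; }
    }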
  • the Optimizer module 508 is a function group that performs weight update for each layer of the neural network that has weights.
  • the Optimizer module 508 defines the following functions for each weight update algorithm.
  • (Example 1) chnr_op_init_state_xxxx Thereby, the internal state initialization process of the weight update algorithm can be implemented (floating-point version).
  • (Example 3) chnr_op_init_state_xxxx_fx Thereby, the internal state initialization process of the weight update algorithm can be implemented (fixed-point version).
  • (Example 4) chnr_op_update_one_xxxx_fx Thereby, the weight update process can be implemented (fixed-point version).
  • here, xxxx represents a name assigned to each weight update algorithm.
  • the weight update algorithm can include the algorithm described in “2-13” of the first part.
  • the data conversion module 506 is a function group that performs data format conversion.
  • An example of function definition is as follows.
  • (Example 1) chnr_float_to_fixed (MDArray * dst, MDArray * src, int Q)
  • the floating-point type can be converted to the fixed-point type.
  • (Example 2) chnr_fixed_to_float (MDArray * dst, MDArray * src) Thereby, it is possible to convert from the fixed-point type to the floating-point type.
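  • Conceptually, the element-wise conversion performed by chnr_float_to_fixed / chnr_fixed_to_float can be sketched as follows for a Q-format value with Q fractional bits; the int32 representation and the rounding choice are assumptions.

    #include <stdint.h>
    #include <math.h>

    static inline int32_t float_to_fixed(float x, int q)
    {
        return (int32_t)lrintf(x * (float)(1 << q));   /* scale and round */
    }

    static inline float fixed_to_float(int32_t x, int q)
    {
        return (float)x / (float)(1 << q);             /* undo the scaling */
    }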
  • in circuit design for large-scale parallel computation, designs that do not use an FPU (floating-point arithmetic unit) are often adopted, many of them aiming at reducing hardware resources (number of transistors and power consumption). To execute a numerical computation algorithm such as a neural network without using an FPU, a data type called the fixed-point type, which expresses a numerical value including information after the decimal point by using an integer arithmetic unit and a shift arithmetic unit, is often used.
  • the floating-point type is a data type suited to algorithm design in the sense that real values can be handled more intuitively, whereas the fixed-point type is a data type suited to effective utilization of hardware resources.
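  • As a small illustration of the "integer arithmetic unit and shift arithmetic unit" point above, multiplying two Q-format fixed-point values needs only an integer multiply followed by a right shift; the 64-bit intermediate used here to avoid overflow is an assumption.

    #include <stdint.h>

    static inline int32_t fixed_mul(int32_t a, int32_t b, int q)
    {
        int64_t wide = (int64_t)a * (int64_t)b;   /* integer multiply          */
        return (int32_t)(wide >> q);              /* shift restores the Q format */
    }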
  • Device-independent data representation refers to data representation that does not have information that depends on a particular computer.
  • a typical implementation of this data representation is a multi-dimensional array in C language format with continuous memory addresses. If you use a library such as numpy in the Python layer, you can easily handle such data representation, but it does not specify the library or host language.
  • a device-dependent data representation is a data representation suitable for an optimization implementation specialized for a specific computer. By preparing functions that mutually convert these two data representations, a hardware-aware optimized implementation and an algorithm-design-oriented implementation (for example, easy-to-read code written in Python with a structure close to the mathematical formulas) can cooperate to execute the entire operation.
  • Communication means: the implementation of the Native layer function group (Native I/F) described so far can be adapted, with the following changes, so that the entire operation is executed at high speed while communicating with a device on a separate chip or substrate: (1) RPC (remote procedure call), (2) instruction queue, (3) reduction of the data communication amount of multidimensional arrays, and (4) asynchronous execution of transfer and calculation. The following terms are defined to explain these change policies.
  • "Host device": the device used in the normal implementation (the device that runs Chainer code on Python). "Remote device": a device that requires communication processing because it is on a separate chip or substrate.
  • "RPC (remote procedure call)": when a function defined in the Native I/F is called, the requested processing (securing memory or executing a calculation) is not executed directly; instead, information indicating the processing request (function type and arguments) is generated and transmitted to the remote device, the remote device executes the processing based on that information, and the host device then receives the processing result.
  • "Instruction queue": communication of processing requests by RPC is not performed each time a function defined in the Native I/F is called; instead, the information indicating the processing requests is first accumulated in a queue (FIFO buffer) to improve the communication schedule. "Reduction of the data communication amount of multidimensional arrays": since multidimensional arrays have an enormous data size, reducing the amount of communication is an important issue for speeding up the entire operation. There are two major ways to reduce the amount of communication: (1) reduce the number of transfers of multidimensional arrays, and (2) reduce the data communication amount of each individual multidimensional array.
  • the “weight” may be transferred to the remote device at the initial stage of defining the network structure, and transferred to the host device at the end of learning.
  • the conversion functions between the device-independent data representation and the device-dependent data representation described for the data conversion module 506 are well suited to managing such transfer timing. Specifically, each function performs the following processing: when converting from the device-independent data representation to the device-dependent data representation, data is transferred from the host device to the remote device; when converting from the device-dependent data representation to the device-independent data representation, data is transferred from the remote device to the host device.
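  • The RPC and instruction queue policies above could be realized with request records accumulated in a FIFO buffer, as sketched below; the structure layout, request types, and queue capacity are assumptions introduced for illustration.

    #include <stdint.h>

    typedef enum { REQ_FORWARD, REQ_BACKWARD, REQ_UPDATE_WEIGHT } RequestType;

    typedef struct {
        RequestType type;       /* which Native I/F processing is requested   */
        int32_t     arg_ids[4]; /* ids of multidimensional arrays / weights   */
    } RpcRequest;

    typedef struct {
        RpcRequest requests[64];  /* FIFO buffer (instruction queue)          */
        int        count;
    } InstructionQueue;

    /* enqueue instead of executing immediately; the queue is flushed to the
       remote device at a convenient point in the communication schedule */
    int enqueue_request(InstructionQueue *q, RpcRequest r)
    {
        if (q->count >= 64) return -1;   /* caller should flush first */
        q->requests[q->count++] = r;
        return 0;
    }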
  • the virtual machine module 509 is a function group that realizes a function of interpreting byte codes and executing neural network learning / identification processing (forward processing, backward processing, and weight update).
  • Byte code is assumed to be generated by a Python layer byte code output device, which will be described later, but even byte code generated by other software can be interpreted and executed by the virtual machine module if the format is correct.
  • An example of function definition is as follows. (Example 1) void chnr_init_with_bytecode (VMState * state, char * byte_code) Thereby, it is possible to parse the bytecode and initialize the internal state of the virtual machine.
  • Example 2 void chnr_forward (VMState * state) Thereby, a forward process can be performed.
  • Example 3 void chnr_backward (VMState * state) Thereby, the backward process can be executed.
  • Example 4 void chnr_update_weight (VMState * state) Thereby, the weight update process can be executed.
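  • A minimal sketch of driving the virtual machine with the four functions above is shown below; VMState and the chnr_* prototypes are assumed to be those listed above, the iteration count and the way minibatch data is supplied are assumptions, and error handling is omitted.

    /* learning loop: initialize from bytecode, then repeat
       Forward -> Backward -> weight update */
    void vm_training_loop_example(VMState *state, char *byte_code, int iterations)
    {
        chnr_init_with_bytecode(state, byte_code);  /* parse bytecode, initialize internal state */
        for (int i = 0; i < iterations; ++i) {
            /* ... copy the next minibatch into the input arrays here ... */
            chnr_forward(state);         /* Forward processing  */
            chnr_backward(state);        /* Backward processing */
            chnr_update_weight(state);   /* weight update       */
        }
    }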
  • Byte code format example Store the following information in binary format.
  • Input / output data information ⁇ number of array dimensions / size, data type (float32, FixedPoint) ⁇ * Variable number
  • Weight information ⁇ number of array dimensions / size, data type (float32, FixedPoint), realized value ⁇ * weight Number
  • Function call information during backward processing ⁇ Function type, I / O data index, weight information index, function specific parameters for each function type ⁇ * Number of functions
  • Weight update type and parameters
  • furthermore, indices of the multidimensional arrays that serve as the input and output of the entire neural network processing may be added to the bytecode.
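  • The listed contents could be carried by record structures such as the following sketch; the field names, fixed array sizes, and ordering are assumptions, since only the kinds of information stored (input/output data information, weight information, Function call information for Backward processing, weight update type and parameters) are specified above.

    #include <stdint.h>

    typedef struct {
        int32_t numaxis;
        int32_t dimensions[8];
        int32_t data_type;        /* float32 or FixedPoint */
    } BcArrayInfo;                /* one entry per input/output variable */

    typedef struct {
        BcArrayInfo shape;
        /* realized weight values follow in the binary stream */
    } BcWeightInfo;               /* one entry per weight */

    typedef struct {
        int32_t function_type;
        int32_t io_data_index[4];
        int32_t weight_index[2];
        /* function-type-specific parameters follow */
    } BcFunctionCall;             /* one entry per Function, in Backward-processing order */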
  • by doing so, the user code that uses the virtual machine can appropriately associate, with the function calls, the multidimensional arrays that serve as the input and the output of the entire neural network processing. For example, this association can be performed by the following flow.
  • (Step 1) The user code obtains the input multidimensional array of the entire process by calling a function prepared in the configuration of the virtual machine.
  • Step 2 The user code copies the input data to the multidimensional array acquired in Step 1 above.
  • Step 3 The user code calls a function for executing the entire operation prepared in the configuration of the virtual machine.
  • Step 4 The user code obtains the output multidimensional array of the entire process by calling a function prepared in the configuration of the virtual machine (this multidimensional array is the result of the overall operation executed in Step 3 above). (The functions in step 3 and step 4 do not necessarily need to be separated and may be integrated functions). (Step 5) The user code acquires the contents of the output data from the multidimensional array acquired in Step 4.
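  • The five steps above might look as follows from the user code side; vm_get_input_array, vm_run_forward, and vm_get_output_array are hypothetical names for the "functions prepared in the configuration of the virtual machine", and the MDArray data field follows the illustrative structure sketched earlier.

    #include <string.h>

    /* hypothetical accessors provided by the virtual machine configuration */
    MDArray *vm_get_input_array(VMState *state);
    MDArray *vm_get_output_array(VMState *state);
    void     vm_run_forward(VMState *state);

    void vm_identification_example(VMState *state, const float *input, float *output,
                                   size_t in_bytes, size_t out_bytes)
    {
        MDArray *in = vm_get_input_array(state);       /* Step 1 */
        memcpy(in->data, input, in_bytes);             /* Step 2 */
        vm_run_forward(state);                         /* Step 3 */
        MDArray *out = vm_get_output_array(state);     /* Step 4 */
        memcpy(output, out->data, out_bytes);          /* Step 5 */
    }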
  • a configuration diagram of the internal state of the virtual machine is shown in the corresponding drawing.
  • Virtual machine module execution flow example (1) forward processing and backward processing
  • the virtual machine module executes processing such as pseudo code shown in FIG.
  • Virtual machine execution flow example (2) (Optimizer) The virtual machine module executes processing such as pseudo code shown in FIG.
  • Configuration 2 of the optimization device specializing in Forward processing execution; with data memory area reuse means (1): when the entire operation performs only identification processing without learning (weight update), only the Forward processing needs to be executed. In this case, the following data are unnecessary: (1) data input/output between layers that is not accessed by the currently executing Function, (2) the weight gradients, and (3) the internal state of the weight update algorithm. Accordingly, when initializing the internal state of the virtual machine, it is not necessary to secure the weight gradients or the internal state of the weight update algorithm. For data input/output between layers, the memory allocation amount can be reduced by, for example, the procedure described in the next paragraph.
  • Configuration 2 of optimization device specializing execution of forward processing; with data memory area reuse means (2)
  • example procedure when initializing the internal state of the virtual machine module: (1) for each Function, calculate the sum of the data sizes (memory sizes) of its inputs and outputs, and secure a memory area of the maximum of these sums.
  • (2) when a structure (MDArray) that handles a multidimensional array is initialized, its address is set so that the memory area secured in (1) is reused. By setting the addresses so that the left end and the right end of the memory area are used alternately as input and output for each layer, copying of array data is prevented. If a Function whose input and output span loop iterations is included, the output data carried over to the next iteration is excluded from this reuse, and a memory area is secured for it individually.
  • FIG. 38 shows an example of address setting for data input / output by Function.
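As an illustration of the two-ended reuse described above, the following Python sketch computes the size of the shared memory area as the largest per-Function input-plus-output footprint and assigns offsets so that inputs and outputs alternate between the left and right ends of the area for successive layers. It is a simplified model (each Function is reduced to its input and output byte sizes); a real implementation must also place each input at the exact address where the producing Function wrote its output, and must exclude data carried across loop iterations as noted above.

def assign_reuse_offsets(functions):
    # functions: list of dicts {"in_sizes": [...], "out_sizes": [...]} in execution order.
    # The shared region is as large as the largest per-Function input+output footprint.
    region_size = max(sum(f["in_sizes"]) + sum(f["out_sizes"]) for f in functions)
    plan = []
    inputs_on_left = True
    for f in functions:
        in_offsets, out_offsets = [], []
        left, right = 0, region_size
        for s in f["in_sizes"]:            # pack inputs from one end
            if inputs_on_left:
                in_offsets.append(left)
                left += s
            else:
                right -= s
                in_offsets.append(right)
        for s in f["out_sizes"]:           # pack outputs from the opposite end
            if inputs_on_left:
                right -= s
                out_offsets.append(right)
            else:
                out_offsets.append(left)
                left += s
        plan.append({"in": in_offsets, "out": out_offsets})
        inputs_on_left = not inputs_on_left    # the next layer reads where this one wrote
    return region_size, plan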
• Linking data input/output between the virtual machine and external code: a list of the data input/output between the Functions is created during the "initialization of the internal state of the virtual machine". The simplest way to link input and output is to access the elements of this list directly. If the variable name of the Python Variable instance is stored in the "input/output data information" at the time of bytecode generation, the input and output can instead be linked using this name.
• The NativeArray class 502 is a class that wraps a multi-dimensional array of the Native layer in the Python layer.
• The NativeArray class 502 is generated as an instance corresponding one-to-one to a multi-dimensional array of the Native layer.
• The NativeArray class 502 has, as a basic function of a Python object, lifetime management based on a reference count.
• The NativeArray class 502 has a function of requesting the release of the multi-dimensional array in the Native layer when its lifetime ends.
• The NativeArray class 502 holds a copy of the type information of the multi-dimensional array of the Native layer and has a function of transmitting it to other objects in the Python layer. Furthermore, the NativeArray class 502 has functions such as data copying and element-wise addition, and has a function of requesting their execution from the Native layer. In addition, the NativeArray class 502 has an interface compatible with the other multi-dimensional array libraries in the Python layer on which Chainer depends, such as Numpy and GPUArray.
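A minimal sketch of such a wrapper class is shown below. The low-level calls (native_alloc_ndarray, native_free_ndarray, native_copy, native_add) are hypothetical placeholders for the corresponding Native layer functions, and the real NativeArray class has a richer interface (Numpy/GPUArray compatibility and so on) that is not reproduced here.

class NativeArrayLike:
    """Python-layer wrapper for one Native-layer multi-dimensional array (1:1 correspondence)."""
    def __init__(self, shape, dtype="float32"):
        self.shape = tuple(shape)
        self.dtype = dtype                                    # copy of the Native-layer type information
        self._handle = native_alloc_ndarray(shape, dtype)     # hypothetical Native layer call
    def __del__(self):
        # CPython's reference counting ends the lifetime; ask the Native layer to release the array.
        native_free_ndarray(self._handle)
    def copy_from(self, other):
        native_copy(self._handle, other._handle)              # executed by the Native layer
    def add(self, other, out):
        native_add(self._handle, other._handle, out._handle)  # element-wise addition in the Native layer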
• The NativeDevice class 501 is a class that abstracts an optimized implementation or a reference implementation of the Native layer.
• The NativeDevice class 501 has a function of requesting the following processing from the Native layer in response to requests from other objects in the Python layer:
• (1) initialization and release of the device; (2) generation and copying of a multidimensional array (a NativeArray instance of the Python layer that wraps it is generated); (3) conversion between a device-independent data representation and a device-dependent data representation (conversion between floating point and fixed point can also be instructed); (4) execution of Function and Optimizer processing (the corresponding functions of the Native layer are called).
• Function class: The Function class 503 is a class that defines a forward process and a backward process as a pair.
• The Function class 503 is a class that exists in Chainer, to which functions for requesting forward processing and backward processing from the Native layer are added.
• An example of the method implementation is as follows. (Example 1) forward_native: requests forward processing from the Native layer. (Example 2) backward_native: requests backward processing from the Native layer.
• When these methods are called, (B) the Native layer function to be actually called is determined from the type of the input data (floating point or fixed point) and the Function, and is called (the Native layer function writes the processing result into the multidimensional array secured in (A)).
• (C) A NativeArray instance that wraps the multidimensional array secured in (A) above is generated. (4) The NativeArray instance generated in (C) above is returned as the return value of the Function. A sketch of a Function with these methods is given below.
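The following is a minimal sketch of what a Function subclass with such methods might look like; it is not Chainer's actual implementation. The device object and its methods (alloc_like, alloc_like_output, call_forward, call_backward) are hypothetical placeholders for the NativeDevice-style interface described above.

class LinearFunctionWithNative:
    """Sketch of a Function that can delegate forward/backward processing to the Native layer."""
    def __init__(self, device, weight):
        self.device = device      # NativeDevice-like object (hypothetical interface)
        self.weight = weight      # NativeArray-like weight array
    def forward_native(self, inputs):
        out = self.device.alloc_like_output(self, inputs)                 # secure the output array
        self.device.call_forward("linear", inputs, self.weight, out)      # choose and call the Native kernel
        return out                                                        # result wrapped as a NativeArray
    def backward_native(self, inputs, grad_outputs):
        gx = self.device.alloc_like(inputs)                               # error to propagate to the input side
        gw = self.device.alloc_like(self.weight)                          # weight gradient
        self.device.call_backward("linear", inputs, self.weight, grad_outputs, gx, gw)
        return gx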
• Optimizer class: The Optimizer class 503 is a class for updating the weights.
• The Optimizer class 503 is a class that exists in Chainer, to which functions for requesting state initialization and weight update processing from the Native layer are added.
• An example of the method implementation is as follows. (Example 1) init_state_native: requests the initialization processing of the internal state of the weight update algorithm from the Native layer. (Example 2) update_one_native: requests the weight update processing from the Native layer.
• The processing flow when these methods are called is the same as that already described above for the Function class.
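Analogously, a minimal sketch of an Optimizer-like class with these two methods, using the same hypothetical device interface as above; the actual Chainer Optimizer API is not reproduced here.

class MomentumSGDWithNative:
    """Sketch of an Optimizer-like class that delegates its work to the Native layer."""
    def __init__(self, device, lr=0.01, momentum=0.9):
        self.device = device          # NativeDevice-like object (hypothetical interface)
        self.lr = lr
        self.momentum = momentum
        self.states = {}
    def init_state_native(self, param):
        # Ask the Native layer to allocate and zero the internal state (velocity) for this weight.
        self.states[id(param)] = self.device.alloc_like(param)
    def update_one_native(self, param, grad):
        # Ask the Native layer to apply one momentum-SGD update to this weight.
        self.device.call_update("momentum_sgd", param, grad,
                                self.states[id(param)], self.lr, self.momentum)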
• The bytecode generation unit 504 is a mechanism for converting the network configuration of a neural network defined by "Define-by-Run" into bytecode (a data format that can be interpreted and executed) and outputting it.
• As the bytecode format, for example, the format described above for the "virtual machine module" can be considered.
• In addition, output to the following formats can also be considered.
• The neural network definition format of Caffe: the output can be executed with Caffe (Caffe is one of the representative frameworks for designing and executing neural networks).
• A programming language such as C or Java (registered trademark): software that executes the entire operation can be generated.
• A hardware description language such as HDL or Verilog: hardware that executes the entire operation can be synthesized.
  • a function definition example of the bytecode generation unit is as follows.
  • Function name write_network_difinition (output_node, path, format)
  • Function specifications The network configuration connected from output_node to the input side is output to the file specified by path in the format specified by format.
  • output_node can be specified in a list (can start from multiple nodes).
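A usage sketch of this function is shown below. The model and forward-computation lines and the format identifiers ("vm_bytecode", "caffe") are placeholders; the specification above fixes only the argument order (output_node, path, format).

# 'model', 'x', 't', 'loss_a', 'loss_b' are placeholders for a Chainer-style model and its inputs/outputs.
loss = model.forward(x, t)   # running forward builds the reference structure for backward processing

# Output the network traced back from 'loss' in two illustrative formats.
write_network_difinition(loss, "./net.bin", format="vm_bytecode")     # virtual machine bytecode
write_network_difinition(loss, "./net.prototxt", format="caffe")      # Caffe-style network definition

# output_node may also be a list, so output can start from multiple nodes.
write_network_difinition([loss_a, loss_b], "./multi.bin", format="vm_bytecode")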
• Example procedure for outputting bytecode from the reference structure for backward processing: As explained in Part 1 above, Chainer has a function for generating the reference structure for backward processing as the calculation of ordinary forward processing is described. Since forward processing can be computed by following the reference structure for backward processing in reverse order, if bytecode is generated from this reference structure, both forward processing and backward processing can be executed. This procedure is roughly divided into the following steps: (1) generation of element information for bytecode generation (generation of input/output data information, generation of weight information, generation of Function call information during backward processing); (2) conversion of the element information into bytecode.
• Element information generation procedure for creating bytecode: the "reference structure for backward processing" is traced from the output_node passed to write_network_difinition, and the following processing is executed.
• The information of the multidimensional array (size/number of dimensions, floating point or fixed point (Q value)) is added to the list of "input/output data information".
• If the current node is a Function, the following processing is performed.
• (I) The information of the multidimensional arrays of the weights (size/number of dimensions, floating point or fixed point (Q value), actual values of the weights) is added to the list of "weight information" in such a way that duplication is not allowed (because multiple Function instances can share the same weight).
• Element information creation procedure when multiple origin nodes are passed: (1) Create an (empty) list of "Function call information during backward processing". (2) Perform the following procedure for each origin node in output_node.
• (A) Create a list of "Function call information during backward processing" specific to that origin node.
• (B) Perform the registration procedure described in the previous paragraph on the list created in (A) above. At this time, the registration procedure is not executed for nodes already registered in the list created in (1), to avoid duplicate registration.
• (C) Concatenate the list created in (A) to the front of the list created in (1) above. A sketch of this procedure is given below.
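The following Python sketch illustrates the tracing and duplicate-avoidance described above on a simplified graph: each Variable-like node may have a creator Function, and each Function has inputs and optionally weights. The attribute names and the exact ordering/merging of the per-origin lists are simplifications, not the actual implementation.

def generate_element_info(output_nodes):
    io_info, weight_info, call_info = [], [], []   # (1) I/O data, (2) weights, (3) Function calls
    visited = set()

    def visit(var):
        if id(var) in visited:          # avoid duplicate registration across origin nodes
            return
        visited.add(id(var))
        io_info.append((var.shape, var.dtype))      # input/output data information
        func = getattr(var, "creator", None)
        if func is None:
            return
        for w in getattr(func, "weights", []):      # weight information, without duplication
            if all(w is not registered for registered, _ in weight_info):
                weight_info.append((w, (w.shape, w.dtype)))
        for x in func.inputs:                       # continue tracing toward the input side
            visit(x)
        call_info.append(func)                      # Function call information

    for node in output_nodes:                       # one or more origin nodes
        visit(node)
    return io_info, weight_info, call_info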
• Python layer: bytecode output unit (6) Conversion of element information into bytecode. The following information, created by the procedure of "element information generation for creating bytecode", is converted into the format specified by the format argument of write_network_difinition: (1) input/output data information, (2) weight information, (3) Function call information during backward processing. Examples of the format are those given above for the "bytecode generation unit".
• The write_network_difinition function described above for the "bytecode generation unit" is specified so as to write the network configuration directly to the file passed in the argument path. Instead of a path, an object for writing a plurality of network configurations to bytecode can also be passed.
• Here, the network configuration refers to the components (1), (2), and (3) described in "Python layer: bytecode output unit (6) Conversion of element information into bytecode".
• This "object for writing a plurality of network configurations to bytecode" shares the same "(2) weight information" among the "plurality of network configurations", thereby reducing the amount of weight information written to the bytecode.
• "(1) Input/output data information" and "(3) Function call information during backward processing" are generated independently by the above-described steps, even when part of their information is duplicated.
  • FIG. 6 A code example in the case of using this object is shown in FIG.
  • the reference structure for backward processing is output from nodeA, and the network structure is output.
  • the reference structure for backward processing is traced from nodeB, and the network structure is output. Output.
  • these two network configurations are output to one file (./bytecode.bin).
• Chainer has a function (the unchain_backward method) for truncating the reference structure for backward processing from a specific Variable instance toward the input layer side.
• By combining this unchain_backward method with the "output of a plurality of network configurations" described in the previous paragraph, it is possible to specify different Function call orders for the forward processing calculation and the backward processing calculation of the same network.
• By the call in #1, a network definition that executes all of the processes from A to D is output, whereas by the call in #2, a network definition that executes only the processes from B to D is output.
• For example, when the bytecode is executed by the virtual machine, forward processing can be performed using the network configuration output in #1, and backward processing can be performed using the network configuration output in #2. A sketch of this usage is given below.
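A sketch of this usage pattern in Chainer-style Python is shown below (the actual code example is the one referred to in FIG. 41). funcA..funcD, the writer object netdef_writer, and its flush method are illustrative placeholders; only write_network_difinition and unchain_backward are names taken from the description above.

# Forward pass: A -> B -> C -> D applied in order.
h1 = funcA(x)
h2 = funcB(h1)
h3 = funcC(h2)
y = funcD(h3)

# 1: output the full network A..D (e.g. used for forward processing by the virtual machine).
write_network_difinition(y, netdef_writer, format="vm_bytecode")

# Cut the backward reference structure on the input side of B.
h1.unchain_backward()

# 2: output only B..D (e.g. used for backward processing by the virtual machine).
write_network_difinition(y, netdef_writer, format="vm_bytecode")

# Both network configurations share one weight-information section in the written bytecode.
netdef_writer.flush("./bytecode.bin")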
• When Convolution2D and ReLU are executed as an integrated function,
• the processing results of Convolution2D can be used directly as the input of the ReLU processing on the cache memory or register file, before being written to the main memory.
• As a result, the frequency of data transfer is reduced, and the chance of executing the processing more efficiently (at higher speed) increases.
• If more Functions can be executed as one integrated function, such as Convolution2D → ReLU → Convolution2D → ReLU, the chance of further improving the processing efficiency increases. This is because the amount of access to the main memory can be further reduced by taking into account the size of the cache memory and the data dependencies within the combination of Functions.
• In addition, the amount of calculation can be reduced, because measures such as reducing the number of weight elements or not performing calculations on zero-valued weights can be taken.
• For example, a technique is known in which singular value decomposition is applied to the weight information (a matrix of the number of input nodes * the number of output nodes) and the elements corresponding to the smaller diagonal components are removed, thereby compressing the weight data size and reducing the amount of computation. (J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition.)
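As a concrete illustration of this singular-value-decomposition approach, the following numpy sketch keeps only the k largest singular values of a Linear layer's weight matrix and replaces the single layer by two smaller ones. This is the standard low-rank factorization described in the cited paper, not code from this specification.

import numpy as np

def compress_linear_weight(W, k):
    """W: (n_out, n_in) weight matrix. Returns (W1, W2) with W ~= W1 @ W2, rank k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # drop the smaller singular values
    W1 = U_k * s_k                                # (n_out, k)
    W2 = Vt_k                                     # (k, n_in)
    return W1, W2

# y = W @ x (n_out * n_in multiplications) becomes y = W1 @ (W2 @ x)
# (k * (n_in + n_out) multiplications), reducing both weight size and computation when k is small.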

Abstract

[Problem] To allow appropriately installing an algorithm which executes machine learning in an embedded chip which is formed from more meager computing resources than a computer which is used in the design of the algorithm. [Solution] Provided is an installation device, comprising: a first executing means which causes a first computational device which is mounted upon a semiconductor integrated circuit to execute first source code which is written in a first programming language; a second executing means which causes a second computational device which is mounted in the semiconductor integrated circuit to execute second source code which is written in a second programming language which is a lower-level language than the first programming language; and a comparison means which compares the result of the execution by the first executing means of first specific code which is included in the first source code with the result of the execution by the second executing means of second specific code (which is the first specific code rewritten in the second programming language) which is included in the second source code, and outputs the result of the comparison. The second executing means further comprises a byte code generating means which generates byte code from the first source code, said byte code storing I/O data information, weighting information, backward process function call information, etc.

Description

Mounting apparatus and mounting method
The technology described in this specification relates to a mounting apparatus and a mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit.
In recent years, machine learning using neural networks has been used in various fields.
When executing such machine learning, a developer or the like uses a predetermined programming language to create source code that defines the network structure and the like of a neural network, and by having a personal computer or the like execute the source code thus created, such a personal computer can be made to execute machine learning (Non-patent Document 1).
In recent years, a framework has been needed for appropriately mounting an algorithm for executing machine learning, designed using a computer having abundant computing resources (such as a personal computer), on an embedded chip (embedded semiconductor integrated circuit) having scarcer computing resources than such a computer.
Therefore, various embodiments of the present invention provide a mounting apparatus and a mounting method for appropriately mounting an algorithm for executing machine learning, designed using a computer having abundant computing resources, on an embedded chip (embedded semiconductor integrated circuit) having scarcer computing resources than such a computer.
An apparatus according to one aspect is a mounting apparatus for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising: first execution means capable of causing a first arithmetic unit mounted on the semiconductor integrated circuit to execute first source code described in a first programming language; second execution means capable of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from the first programming language; and comparison means for comparing a result of execution, by the first execution means, of first specific code included in the first source code with a result of execution, by the second execution means, of second specific code that is included in the second source code and is the first specific code rewritten in the second programming language, and for outputting a comparison result, wherein the second execution means includes bytecode generation means for generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and Function call information during backward processing.
According to various embodiments of the present invention, it is possible to provide a mounting apparatus that appropriately mounts an algorithm for executing machine learning, designed using a computer having abundant computing resources, on an embedded chip (embedded semiconductor integrated circuit) having scarcer computing resources than such a computer.
FIG. 1 is a schematic diagram conceptually showing the technique called "Define-and-Run" according to the prior art.
FIG. 2 is a schematic diagram conceptually showing the technique called "Define-by-Run" according to an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating an example of the network configuration of a neural network.
FIG. 4 is a schematic diagram showing another example of the network configuration of a neural network.
FIG. 5 is a schematic diagram showing still another example of the network configuration of a neural network.
FIG. 6 is a diagram illustrating pseudo code for realizing the calculation executed by Linear during forward processing.
FIG. 7 is a diagram illustrating pseudo code for realizing the calculation executed by Linear during backward processing.
FIG. 8 is a diagram illustrating pseudo code for realizing the calculation executed by ReLU during forward processing.
FIG. 9 is a diagram illustrating pseudo code for realizing the calculation executed by ReLU during backward processing.
FIG. 10 is a diagram illustrating pseudo code for realizing the calculation executed by Convolution 2D during forward processing.
FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
FIG. 12 is a block diagram schematically illustrating an example of the functions of a learning device according to an embodiment of the present invention.
FIG. 13 is a diagram illustrating an example of source code input to the learning device according to an embodiment of the present invention.
FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 13.
FIG. 15 is a diagram illustrating an example of source code described in Caffe according to the prior art.
FIG. 16 is a diagram illustrating another example of source code input to the learning device according to an embodiment of the present invention.
FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 16.
FIG. 18 is a schematic diagram conceptually showing the network configuration of a neural network generated by source code described in Caffe according to the prior art.
FIG. 19 is a diagram illustrating still another example of source code input to the learning device according to an embodiment of the present invention.
FIG. 20 is a schematic diagram for explaining Step I of the mounting method according to an embodiment of the present invention.
FIG. 21 is a schematic diagram for explaining Step II of the mounting method according to an embodiment of the present invention.
FIG. 22 is a schematic diagram illustrating a case where an execution unit using Python and an execution unit using a chip communicate with each other.
FIG. 23 is a schematic diagram for explaining Step III of the mounting method according to an embodiment of the present invention.
FIG. 24 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (first method) according to an embodiment of the present invention.
FIG. 25 is a flow diagram showing an example of a procedure used in the mounting method according to an embodiment of the present invention.
FIG. 26 is a schematic diagram showing an operation state of the embedded chip in the mounting method according to an embodiment of the present invention.
FIG. 27 is a schematic diagram illustrating a configuration example of a mounting apparatus used in a mounting method (second method) according to an embodiment of the present invention.
FIG. 28 is a schematic diagram conceptually showing the functions of a mounting apparatus according to an embodiment of the present invention.
FIG. 29 is a schematic diagram illustrating a configuration example of the Native layer execution unit included in the mounting apparatus according to an embodiment of the present invention.
FIG. 30 is a diagram illustrating a structure definition example of the multidimensional array module of the mounting apparatus according to an embodiment of the present invention.
FIG. 31 is a diagram showing the mutual compatibility and reference relationships of multidimensional array data in the multidimensional array module of the mounting apparatus according to an embodiment of the present invention.
FIG. 32 is a diagram for explaining the memory pool module of the mounting apparatus according to an embodiment of the present invention.
FIG. 33 is a diagram for explaining a structure definition example of the memory pool module of the mounting apparatus according to an embodiment of the present invention.
FIG. 34 is a diagram showing a coding example of pipelining in the mounting apparatus according to an embodiment of the present invention.
FIG. 35 is a diagram showing the internal state of the virtual machine module in the mounting apparatus according to an embodiment of the present invention.
FIG. 36 is a diagram showing an example of the execution flow of the virtual machine module in the mounting apparatus according to an embodiment of the present invention.
FIG. 37 is a diagram showing an example of the execution flow of the virtual machine module in the mounting apparatus according to an embodiment of the present invention.
FIG. 38 is a diagram showing an example of address setting in the virtual machine module of the mounting apparatus according to an embodiment of the present invention.
FIG. 39 is a diagram illustrating a specific example of the overall operation in which the Python layer and the Native layer cooperate in the mounting apparatus according to an embodiment of the present invention.
FIG. 40 is a diagram illustrating the output of a plurality of network configurations by the bytecode generation unit in the mounting apparatus according to an embodiment of the present invention.
FIG. 41 is a diagram illustrating a code example for the bytecode generation unit in the mounting apparatus according to an embodiment of the present invention.
FIG. 42 is a diagram illustrating a configuration example of the Native I/F according to an embodiment of the present invention.
FIG. 43 is a diagram illustrating a configuration example for executing identification/learning by an NN according to an embodiment of the present invention.
FIG. 44 is a diagram showing a configuration example for managing a multidimensional array according to an embodiment of the present invention.
FIG. 45 is a diagram illustrating a configuration example of a data representation conversion unit according to an embodiment of the present invention.
FIG. 46 is a diagram illustrating a configuration example of a communication unit according to an embodiment of the present invention.
FIG. 47 is a diagram illustrating a configuration example of floating-point and fixed-point execution units and a type conversion unit according to an embodiment of the present invention.
FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention.
FIG. 49 is a diagram illustrating a configuration example of an algorithm execution unit that fuses a plurality of NN algorithms according to an embodiment of the present invention.
FIG. 50 is a diagram illustrating a configuration example of a data communication amount reduction unit for multidimensional arrays according to an embodiment of the present invention.
FIG. 51 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
FIG. 52 is a diagram illustrating an example of cooperation with an existing execution unit according to an embodiment of the present invention.
FIG. 53 is a diagram illustrating a configuration example of a bytecode generation unit and a virtual machine according to an embodiment of the present invention.
FIG. 54 is a diagram illustrating a configuration example of a comparison unit according to an embodiment of the present invention.
FIG. 55 is a diagram illustrating a configuration example of a function synthesis unit according to an embodiment of the present invention.
Hereinafter, various embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, components common to the drawings are denoted by the same reference numerals.
First, Part 1 describes the information processing apparatus according to the embodiment (hereinafter described as a learning device, which is one example of the information processing apparatus), and Part 2 describes a method for mounting an algorithm implemented on the information processing apparatus according to the embodiment on an embedded chip (embedded semiconductor integrated circuit).
Part 1 (Learning device according to the embodiment)
1. Background and Outline Machine learning algorithms including deep learning can often be formulated as a minimization problem of the sum of loss functions defined for each model. The loss function is an index represented by an error between the model output and the correct answer in a given learning data sample. Here, a series of processes from inputting data into the model until obtaining an output and comparing with the correct answer is called a calculation graph, and the result is defined as a loss function. The loss function minimization problem can be solved by a general method called a gradient method as long as the gradient obtained by differentiating the loss function can be calculated.
When trying to implement this as a computer program, one method is to code both the loss function and the gradient by hand; however, computing the gradient of a complex model is generally difficult, it is often hard to obtain an explicit calculation formula, and the gradient often cannot be written directly as a program. The second method is therefore to use a calculation library such as Caffe (http://caffe.berkeleyvision.org/), Torch (http://torch.ch/), or Theano (http://deeplearning.net/software/theano/). The contents disclosed at these URLs are incorporated herein in their entirety by reference.
With these libraries, the gradient function can be obtained automatically simply by describing the loss function as a combination of prepared basic calculation elements (Primitives) in a dedicated mini-programming language. This is because the gradient of each basic calculation element is itself defined, so the gradient of the entire combination can be obtained by automatic differentiation. In other words, a neural network that can be expressed as a large-scale calculation graph, such as those used in deep learning, can also be trained by the gradient method using the automatically obtained gradient function, as long as the calculation of its loss function can be expressed explicitly in this mini-programming language.
Such calculation libraries have so far been based on a calculation procedure that the applicant calls "Define-and-Run". This is an approach in which the calculation graph is first defined (Define), the gradient is derived by automatic differentiation, and learning (Run) with the learning data then proceeds. With this approach, if the calculation graph has no complicated control syntax (if, for, and so on) and does not change over time, a series of gradient calculations can be compiled for speed and prepared as a single unit at Define time, and memory management becomes unnecessary, among other merits.
However, for calculation graphs with complicated control syntax, which have become more common with the development of deep learning research, or for models whose calculation graph changes dynamically even under meta conditions that do not depend on the data, there were problems such as the low expressiveness of the mini-programming language, the difficulty of debugging, and the deterioration of memory efficiency due to the inability to change the structure dynamically. For this reason, implementation and execution could be difficult depending on the complexity of the model and the scale of the data.
Therefore, the embodiment proposes a new calculation procedure that the applicant calls "Define-by-Run". Specifically, instead of having a fixed graph structure in advance as in "Define-and-Run", the embodiment adopts the approach of dynamically extracting and storing the graph structure in each learning run (Run), applying meta changes to it, and recalculating the gradient each time.
This eliminates the need for a mini-programming language that defines the graph in advance, which removes the design, implementation, and maintenance costs for developers, and the learning costs and debugging difficulties for users. In addition, because the control syntax of general-purpose programming languages (C, Java (registered trademark), Python) can be used freely, neural networks with more complicated graph structures can be implemented easily. Furthermore, by enabling meta change operations on the graph under certain kinds of conditioning, improved memory efficiency and flexible learning and application of models can be realized.
The conceptual difference between the technique called "Define-and-Run" according to the prior art described above and the technique called "Define-by-Run" according to the embodiment is also clear from a comparison of FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram conceptually showing the technique called "Define-and-Run" according to the prior art, and FIG. 2 is a schematic diagram conceptually showing the technique called "Define-by-Run" according to an embodiment of the present invention. In the Define-and-Run configuration shown in FIG. 1, the mini-programming language first takes only the model definition as input and outputs the calculation procedures of the forward (identification) processing and the backward (learning) processing, which are the substance of the calculation graph (Define step). In the next step, the Forward/Backward processing system inputs data and updates the parameters (weights) according to these calculation procedures (Run step).
In contrast, in the Define-by-Run configuration shown in FIG. 2, the processing system of a general-purpose programming language executes the forward (identification) processing while taking the model definition, input data, and parameters as input, and at the same time generates the calculation procedure of the backward (learning) processing. Here, the model definition is written directly in accordance with the grammar of the general-purpose programming language, including function calls, arithmetic operations, loops, and branches. The calculation procedure of the backward (learning) processing can also be changed dynamically, independently of the execution of the forward (identification) processing. The Backward processing system can be called at an arbitrary timing, and updates the parameters from the input data and the results of the Forward processing according to the Backward calculation procedure.
2. Background Art Related to Neural Networks 2-1. Basic Processing Flow of a Neural Network The processing performed in a neural network mainly includes forward processing, backward processing, and updating of the weights.
The forward process is a process for processing and propagating information from the input layer to the output layer of the neural network.
Backward processing refers to performing two kinds of processing, error back propagation and weight gradient calculation, from the output layer toward the input layer of the neural network. Error back propagation is processing that propagates the error (δ) obtained from the layer on the output side to the layer on the input side. Weight gradient calculation is processing that, for a layer having weights, obtains the weight gradient (∂W) from the error (δ) obtained from the layer on the output side and the output values of the layer on the input side.
Updating the weights is processing that, for a layer having weights, updates the weights with an algorithm derived from stochastic gradient descent (SGD) using the weight gradient (∂W) obtained by the weight gradient calculation. This weight update is executed once for each unit of batch processing.
2-2. Calculation Modules Frequently Appearing in Examples of Neural Networks Each layer constituting a neural network is realized by, for example, a layer algorithm listed below.
-Linear
-ReLu
-Dropout
-Softmax Cross Entropy
-Convolution 2D
-Pooling (Average Pooling, Max Pooling, etc.) etc.
Typical examples of the weight update algorithm include the following.
-Momentum-SGD
-Adam, etc.
2-3. Network configuration example of neural network (1)
FIG. 3 is a schematic diagram illustrating an example of a network configuration of a neural network. As an example, FIG. 3 illustrates a neural network in which six intermediate layers (Linear, ReLU, Linear, ReLU, Dropout, and Linear) are arranged between an input layer and an output layer (Softmax). On the page, a rightward arrow indicates forward processing, and a leftward arrow indicates backward processing.
Since the input layer has no weights to be updated, backward processing is performed up to the intermediate layer that is closest to the input layer and has weights (in the example shown in FIG. 3, the Linear layer arranged adjacent to the input layer).
2-4. Network configuration example of neural network (2)
FIG. 4 is a schematic diagram showing another example of the network configuration of a neural network. FIG. 4 illustrates, as an example, a neural network in which a plurality (three) of series-connected groups of intermediate layers (Convolution 2D, ReLU, Convolution 2D, ReLU) are arranged in parallel between the input layer and an intermediate layer (Linear) arranged adjacent to the output layer (Softmax). On the page, an upward arrow indicates forward processing, and a downward arrow indicates backward processing.
2-5. Network configuration example of neural network (3)
FIG. 5 is a schematic diagram showing still another example of the network configuration of a neural network. FIG. 5 illustrates, as an example, a neural network having a loop (sometimes referred to as a "Recurrent Neural Network"). In the figure, the flow of data in forward processing is indicated by arrows. The intermediate layer (here, Linear) executes a calculation that takes as its input the sum of the previous output value of this intermediate layer and the current output value of the input layer. As a method for realizing backward processing in such a neural network, a method (BPTT) is known in which the network is unrolled in the time-axis direction in advance and converted into a network without loops.
2-6. Layer algorithm calculation (Linear)
Linear, which is one of the layer algorithms, executes a calculation that repeats the operation of taking the weighted average of all the nodes in the input side layer by the number of nodes in the intermediate layer.
FIG. 6 is a diagram showing pseudo code for realizing the calculation executed by Linear during forward processing, and FIG. 7 is a diagram showing pseudo code for realizing the calculation executed by Linear during backward processing.
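The pseudo code itself is the content of FIGs. 6 and 7 and is not reproduced here; as a compact restatement of the same computations, a numpy version (single sample, no bias term, which the figures may handle differently) might look as follows.

import numpy as np

def linear_forward(x, W):
    # x: (n_in,) input-side layer values, W: (n_out, n_in) weights.
    # Each output node is the weighted sum of all input-side nodes.
    return W @ x

def linear_backward(x, W, gy):
    # gy: error (delta) coming from the output side, shape (n_out,).
    gW = np.outer(gy, x)   # weight gradient
    gx = W.T @ gy          # error propagated back to the input-side layer
    return gx, gW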
2-7. Layer algorithm calculation (ReLU)
ReLU, which is one of layer algorithms, calculates Max (0, val) for each node in the input side layer. This algorithm is the most used technique in recent years in the processing (activation function) for adding nonlinearity to the calculation of the neural network.
FIG. 8 is a diagram showing pseudo code for realizing the calculation executed by ReLU during forward processing, and FIG. 9 is a diagram showing pseudo code for realizing the calculation executed by ReLU during backward processing.
2-8. Layer algorithm calculation details (Dropout)
Dropout, which is one of the layer algorithms, randomly selects a fixed ratio of nodes and executes a calculation that inactivates output and back propagation. This algorithm is unnecessary when only identification is performed (that is, when learning is not performed).
2-9. Layer algorithm calculation (Softmax Cross Entropy)
Softmax Cross Entropy, one of the layer algorithms, corrects the value of the input side layer using the following formula.
Figure JPOXMLDOC01-appb-M000001
This layer algorithm is generally used in the output layer.
Also, this layer algorithm calculates an error from the difference between the correct answer label (1 or 0) and the output value during backward processing.
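The formula image referenced above is not reproduced here. The standard formulation of the softmax correction, the cross-entropy loss, and the backward error, which is consistent with the description that the error is the difference between the correct label (1 or 0) and the output value, is (with $x_i$ the input-side values, $y_i$ the corrected outputs, and $t_i$ the correct label):

y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \qquad L = -\sum_i t_i \log y_i, \qquad \frac{\partial L}{\partial x_i} = y_i - t_i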
2-10. Calculation contents of layer algorithm (Convolution 2D)
Convolution 2D, which is one of the layer algorithms, convolves an image having a data structure of Channel * Width * Height. Both the input layer and the output of the layer have a data structure of Channel * Width * Height. With this algorithm, the image size can be reduced by stride processing. In this algorithm, padding is inserted into an image on the input side layer. This algorithm has the same calculation structure as the Linear (repeating the inner product calculation of the input channel for the number of output channels) with respect to the channel direction.
FIG. 10 is a diagram illustrating a pseudo code for realizing a calculation executed during forward processing by Convolution 2D.
Convolution 2D performs weight gradient calculation and error backpropagation in the same way as Linear during backward processing. The scale of each processing loop is the same as that during forward processing.
2-11. Layer algorithm calculation (Max Pooling)
Max Pooling, which is one of the layer algorithms, reduces the image vertically and horizontally by taking the maximum value of the image on the input side layer. Note that the filter size taking the maximum value and the stride width for image reduction may be different. There is no change in the number of channels.
2-12. Layer algorithm calculation (Average Pooling)
Average Pooling, which is one of the layer algorithms, reduces the image in the vertical and horizontal directions by taking the average value of the image of the input side layer. Note that the filter size over which the average value is taken may differ from the stride width for image reduction. The number of channels does not change.
2-13. Weight Update Algorithm There are various algorithms derived from the stochastic gradient descent method (SGD) as the weight update algorithm. In these algorithms, the calculation is independent for each weight element.
The formula of momentum-SGD mentioned above is as follows.
Figure JPOXMLDOC01-appb-M000002
Also, Adam's formula mentioned above is as follows.
Figure JPOXMLDOC01-appb-M000003
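The two formula images referenced above are not reproduced here. The standard per-element forms of these two update rules, which the patent's equations are expected to follow (symbols and hyperparameter names may differ from the figures; ∂W denotes the weight gradient), are:

momentum-SGD:
\Delta w \leftarrow \mu \, \Delta w - \eta \, \partial W, \qquad w \leftarrow w + \Delta w

Adam:
m \leftarrow \beta_1 m + (1 - \beta_1)\, \partial W, \qquad v \leftarrow \beta_2 v + (1 - \beta_2)\, (\partial W)^2
\hat{m} = \frac{m}{1 - \beta_1^{t}}, \qquad \hat{v} = \frac{v}{1 - \beta_2^{t}}, \qquad w \leftarrow w - \alpha \, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}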
3. Hardware Configuration of the Learning Device According to the Embodiment Next, the hardware configuration of the learning device according to the embodiment of the present invention will be described. FIG. 11 is a schematic diagram illustrating a hardware configuration example of a learning device according to an embodiment of the present invention.
As shown in FIG. 11, the learning device 10 includes a CPU 11, a main memory 12, an input I/F 13, an output I/F 14, a communication I/F 15, an external memory 16, and a user I/F 17, and these components are electrically connected to each other via an internal bus 18. The learning device 10 can optionally also include a GPU (not shown).
The CPU 11 loads the operating system and various programs, such as a program supporting a programming language (for example, Python) used for creating source code, from the external memory 16 into the main memory 12, and executes the instructions included in the loaded programs. The main memory 12 is used to store the programs executed by the CPU 11 and is constituted by, for example, a DRAM.
The input I/F 13 has a function of capturing the output data of a measuring device (not shown) and is connected to the other components via the internal bus 18. The various measurement data output by the measuring device include information acquired by sensors and the like, for example temperature, humidity, position information, and image data, and may also be time-series data such as moving image data or a sequence of temperatures acquired at fixed intervals. The output I/F 14 receives data from the other components through the internal bus 18 and outputs it to an output device (not shown) outside the learning device. The data output to the output device is assumed to be, for example, control information for driving a motor, or control information for an information output device such as a buzzer, a control switch, an automobile accelerator or brake, or a liquid crystal display.
The communication I/F 15 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination thereof, and is configured to be able to communicate various information with a server device (not shown) via the communication network 20.
The external memory 16 is composed of, for example, a magnetic disk drive or a flash memory, and stores the operating system and various programs, such as a program supporting a programming language (for example, Python) used for creating source code.
In the learning device 10 according to one embodiment having the above configuration, the CPU 11 (and optionally also the GPU) executes a predetermined program loaded from the external memory 16 into the main memory 12, whereby the learning device 10 can function as a learning device that performs machine learning. For example, the learning device 10 that performs machine learning can be realized as a learning device modeled by a neural network, by the CPU 11 (and optionally also the GPU) executing various programs.
The learning device 10 having the above configuration can be mounted on a corresponding individual (device). Further, the learning device 10 can be connected to a corresponding measurement device and a corresponding output device. These measuring devices and output devices may be mounted on the corresponding individual (device) or may be connected as separate devices using communication means.
In one embodiment, the learning device 10 is an arbitrary information processing device capable of executing machine learning, including, but not limited to, a personal computer, a tablet, a mobile phone, a smartphone, a mobile information terminal, a touch pad, and an information processing server.
4. Functional Blocks of the Learning Device According to the Embodiment Next, the functions of the learning device 10 having the above configuration will be briefly described. FIG. 12 is a block diagram schematically illustrating an example of the functions of the learning device according to an embodiment of the present invention.
The learning device 10 according to the embodiment is based on the technique called "Define-by-Run" described above. Specifically, the learning device 10 according to the embodiment has a mechanism in which, at the timing when the forward processing of the neural network is executed in a general procedural language including branches, loops, and function calls, the network configuration information necessary for backward processing and weight update processing is generated dynamically, so that the backward processing and the weight update processing can actually be executed.
In order to realize such "Define-by-Run", as illustrated in FIG. 12, the learning device 10 according to one embodiment mainly includes an acquisition unit 110, a storage unit 120, and an execution unit 130. The acquisition unit 110 acquires source code including code that defines the forward processing of each layer constituting the neural network. Specifically, such source code is created by a developer, a user, or the like with a text editor using a predetermined programming language (for example, Python), and the acquisition unit 110 acquires the source code thus created.
For example, the acquisition unit 110 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, the user I / F 17, and the like illustrated in FIG.
The storage unit 120 stores a correspondence relationship between each of a plurality of forward processes that can be defined in the source code and a backward process corresponding to the forward process. In the correspondence relationship stored in the storage unit 120, a corresponding backward process is associated with a certain forward process included in a plurality of forward processes in a one-to-one relationship. In other words, in the correspondence relationship stored in the storage unit 120, for example, for a layer called “Linear” (intermediate layer), a forward process corresponding to Linear and a backward process corresponding to this forward process are associated with each other. . (As described above, the one-to-one correspondence between the forward process and the backward process is a process corresponding to the forward process when the backward process is executed using the reference structure for the backward process. For example, when the forward process is executed in the order of A → B → C, the backward process is executed in the order of C → B → A. On the other hand, since both forward processing and backward processing are implemented in pairs, such backward processing can be realized.)
The storage unit 120 can store various information including the source code acquired by the acquisition unit 110 and various libraries used in a programming language corresponding to the source code.
For example, the storage unit 120 can be realized by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG. 11.
 The execution unit 130 sequentially executes each piece of code included in the source code acquired by the acquisition unit 110 (and stored in the storage unit 120). At the time each piece of code is executed, the execution unit 130 can calculate the output value of the forward process defined by that code on the basis of the input value. In addition, at the time each piece of code is executed, the execution unit 130 can generate the reference structure between objects in the layer corresponding to that code.
 The execution unit 130 can be realized, for example, by the cooperation of the CPU 11, the main memory 12, the external memory 16, and the like illustrated in FIG. 11.
 Furthermore, in order to realize the above-described "Define-by-Run" technique, the learning device 10 according to an embodiment uses three classes, namely Function, Variable, and Optimizer, by means of the acquisition unit 110, the storage unit 120, and the execution unit 130 described above. Note that the names of these classes are given for convenience and are not limiting.
 First, the Function class is a class in which a forward process and a backward process are defined as a pair. The specific layer algorithms exemplified in "2-6" to "2-12" above are defined as subclasses of this Function class.
 Next, the Variable class is a class that manages the data input to and output from Functions. The Variable class has the role of hiding the difference between the GPU and the CPU, and it also has a method (unchain_backward, described later) for truncating, within a finite range, the backward processing of a network that contains a loop.
 Finally, the Optimizer class is a class that updates weights.
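 As a minimal illustration of the roles of these three kinds of classes (the class and method names below are simplified assumptions for illustration and do not reproduce the actual implementation of the learning device 10), a Function subclass defines a forward process and its backward process as a pair, and an Optimizer-like class updates weights from their gradients:
import numpy as np

class Function:
    # A forward process and the corresponding backward process are defined as a pair.
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, gy):
        raise NotImplementedError

class ReLU(Function):
    # Example of a layer algorithm defined as a Function subclass.
    def forward(self, x):
        return np.maximum(x, 0.0)
    def backward(self, x, gy):
        return gy * (x > 0.0)

class SGD:
    # Example of an Optimizer-like class: its role is to update a weight using its gradient.
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, w, gw):
        w -= self.lr * gw
 Simple stochastic gradient descent is used here only for brevity; the operation example described below uses an Optimizer subclass implementing Adam.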
 5. Operation Example 1
 Next, a specific example of the operations performed by the learning device 10 according to the embodiment having the above configuration will be described. FIG. 13 is a diagram illustrating an example of source code input to the learning device according to an embodiment of the present invention. Note that the source code illustrated in FIG. 13 has been intentionally simplified for the purpose of explaining the features of the learning device according to the present embodiment. The line numbers shown at the left edge of FIG. 13 are given only in order to explain this specific example and are not included in the actual source code.
 In the following, the present embodiment is described for the case where the source code is written in Python as an example, but the source code may be written in a programming language other than Python. Details of Python are disclosed at https://www.python.org/. The content disclosed at this URL is incorporated herein by reference in its entirety.
 First, a developer or the like creates the source code illustrated in FIG. 13 using a text editor or the like. The acquisition unit 110 (see FIG. 12) of the learning device 10 acquires the source code created in this way and stores it in the storage unit 120. Next, the execution unit 130 executes the code included in the source code stored in the storage unit 120 line by line. When the source code contains no control syntax such as if statements or for statements, as illustrated in FIG. 13, the execution unit 130 executes the code sequentially, one line at a time, from the first line to the last line, from top to bottom. Conversely, when the source code contains control syntax, the execution unit 130 executes the code in the order determined by that control syntax.
 The contents of the source code illustrated in FIG. 13 will now be described.
 Lines 1 to 3 describe the registration, by FunctionSet, of Functions that include parameters. Specifically, Functions that include weights (in this example, the instances l1, l2, and l3 of the Linear class, a Function subclass that defines a layer algorithm performing an inner product) are registered in an object of the FunctionSet class. The weights of Functions that include weights can be updated by the Optimizer. FunctionSet is a mechanism for improving the readability of the code by grouping together the Functions to be updated by the Optimizer.
 Lines 4 and 5 describe the initialization of the Optimizer. Line 4 creates an instance of an Optimizer subclass (a class for updating weights) that implements the algorithm called Adam. The processing of Adam executes, for each element of the weights, the update given by the mathematical expressions described in "2-13" above. Line 5 passes the list of Functions including the weights defined in lines 1 to 3 to the setup method of the Optimizer subclass instance created in line 4. Executing this setup method initializes the internal state of the Optimizer subclass used to update the weights included in the Function list passed to this method.
 Line 6 describes the loading of the input data. That is, line 6 illustrates the process of reading the input data x and t from a file or the like. In this example, x holds data with a large amount of information, such as images or sounds, and t holds the label ID corresponding to x (data with a small amount of information, used for answer checking).
 Line 7 describes holding the input data in Variable objects. That is, line 7 creates objects of the Variable class that hold the input data. The "Define-by-Run" functionality is realized by the mutual dependence of Variable objects and Function objects; since arbitrary input data does not itself have the mechanism for realizing the "Define-by-Run" functionality, a procedure for explicitly holding the input data in instances of the Variable class is required.
 Lines 8 to 11 describe the execution of the forward processing. Specifically, in lines 8 to 11, the forward processing is executed by means of descriptions in a general programming language. Through the "Define-by-Run" functionality, the reference structure for backward processing is generated at the same time as this definition is executed. Because instances of the Function class and instances of the Variable class refer to each other, the correspondence between arbitrary processing and data can be expressed; this is self-evident, since the Variable class represents data and the Function class represents processing. A data structure that expresses the backward calculation procedure shown in FIG. 2 by means of this reference relationship is defined as the reference structure for backward processing. The reference structure for backward processing grows each time a basic calculation on Variable objects (the four arithmetic operations or exponentiation) is performed and each time a Function taking Variable objects as arguments or return values is called. Therefore, a reference structure for backward processing can be generated even for a forward processing description that includes branches, loops, or function calls other than those to Function and Variable. A Function subclass is also associated with each basic calculation on Variable objects.
 Line 12 describes the execution of the backward processing. Specifically, line 12 executes the backward processing by calling the backward method of the loss variable (an instance of the Variable class) obtained as the execution result of the forward processing executed in lines 8 to 11. The backward processing is executed automatically, in the reverse order of the forward processing, by following the reference structure for backward processing generated when the forward processing was executed.
 Line 13 describes the weight update. Specifically, the weight gradients are calculated as a result of executing the backward processing in line 12. When the update method of the Optimizer subclass instance is called as in line 13, the weights are updated using these weight gradients. Since the call to the update method of the Optimizer subclass and the call to the backward method of the Variable class are separate functions, it is also possible to execute the weight update after executing the backward processing only partially. This is useful when it is not desired to update the weights of Functions that have already been trained.
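 FIG. 13 itself is not reproduced in this text. For illustration only, the following Python sketch shows code of the kind described in lines 1 to 13 above, written under the assumption of Chainer 1.x-style names (FunctionSet, F.Linear, F.relu, F.softmax_cross_entropy, optimizers.Adam); the layer sizes and the dummy input data are assumptions of the sketch and are not part of the original figure.
import numpy as np
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

model = FunctionSet(l1=F.Linear(784, 100),                 # lines 1-3: register Functions that include weights
                    l2=F.Linear(100, 100),
                    l3=F.Linear(100, 10))
optimizer = optimizers.Adam()                              # line 4: Optimizer subclass implementing Adam
optimizer.setup(model.collect_parameters())                # line 5: initialize the Optimizer for these weights
x_data = np.random.rand(64, 784).astype(np.float32)        # line 6: input data (dummy values here)
t_data = np.random.randint(0, 10, size=64).astype(np.int32)
x, t = Variable(x_data), Variable(t_data)                  # line 7: hold the input data in Variable objects
h1 = F.relu(model.l1(x))                                   # line 8: forward processing; the reference structure grows
h2 = F.relu(model.l2(h1))                                  # line 9
y = model.l3(h2)                                           # line 10
loss = F.softmax_cross_entropy(y, t)                       # line 11: difference from the true label t
loss.backward()                                            # line 12: backward processing along the reference structure
optimizer.update()                                         # line 13: weight update using the calculated gradients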
 Here, among the contents processed during the forward processing, attention is given in particular to the contents processed by the code described in line 8.
 Line 8 reads: h1 = F.relu(model.l1(x)).
 When "model.l1(x)" is executed, the following reference structure for backward processing is generated.
 x ← splitter ← x' ← l1' ← y
 In the above reference structure, x' represents a Variable object that is a copy of x, l1' represents a copy (shallow copy) of l1, y represents the value (a Variable object) returned by the forward method of l1', and splitter represents an instance of a class that manages branches in the network.
 The arrows indicate the direction of the object references. For example, the notation A ← B means that a member of object B contains a reference to object A.
 A shallow copy is a way of copying an object that does not copy the data the object refers to internally. Using a shallow copy makes it possible, for example, to avoid duplicating the weight data held by a Function instance.
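 The following is a general Python illustration of the shallow copy just mentioned (it is independent of the learning device itself): a shallow copy creates a new object but does not duplicate the data, such as weight data, that the object refers to internally.
import copy
import numpy as np

class Linear:
    def __init__(self, w):
        self.w = w                      # weight data held by reference

l1 = Linear(np.zeros((3, 3)))
l1_dash = copy.copy(l1)                 # shallow copy of l1
print(l1_dash is l1)                    # False: l1' is a separate instance
print(l1_dash.w is l1.w)                # True: the weight data itself is not duplicated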
 After "F.relu(" is executed, the above structure becomes the following reference structure for backward processing.
 x ← splitter ← x' ← l1' ← y ← splitter ← y' ← relu' ← h1
 The case where a reference structure for backward processing is generated when the code described in line 8 is executed has been described here, but it goes without saying that reference structures are likewise generated when the code described in lines 9 and 10 is executed.
 As described above, when the forward processing is executed, the reference structure for backward processing is generated by natural-looking function calls.
 At this point, the backward processing can be executed starting from h1. The flow of the processing executed by the backward processing system when the backward processing is actually executed from h1 is as follows.
 The relu' referenced by the h1 instance is followed, and the backward process of relu' is called. The input at this time is the error value held by h1, and the output result is stored in the y' instance. The correspondence of the data input to and output from such Function instances is defined in the forward process and backward process defined for each Function subclass. Next, the splitter is reached from relu' via y', and the splitter copies the error value held by y' into y (the reason why the splitter is inserted is described in the next section). Next, l1' is followed from y, and the backward process of l1' is executed. The input at this time is the error value held by y, and the output result is stored in the x' instance. The weight error is also calculated; it is calculated from the output value at forward time stored in x' and the error value held by y. Continuing in the same way until x, the end point of the reference structure for backward processing, is reached, the backward processing ends.
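 The flow described above can be illustrated with the following simplified, single-input Python sketch (the class structure and names are assumptions for illustration; the actual learning device also involves splitters, weights, and multi-input Functions): the backward processing follows the references recorded during the forward processing, in the reverse order.
import numpy as np

class Variable:
    # Holds data, the error value (grad), and a reference to the Function that created it.
    def __init__(self, data):
        self.data = data
        self.grad = None
        self.creator = None
    def backward(self):
        func = self.creator
        grad = self.grad if self.grad is not None else np.ones_like(self.data)
        while func is not None:              # follow the reference structure toward the input side
            grad = func.backward(grad)       # call the backward process paired with the forward process
            func.input.grad = grad
            func = func.input.creator

class ReLU:
    def __call__(self, x):
        self.input = x                       # the Function refers to its input Variable
        y = Variable(np.maximum(x.data, 0.0))
        y.creator = self                     # the output Variable refers back to the Function
        return y
    def backward(self, gy):
        return gy * (self.input.data > 0.0)

x = Variable(np.array([-1.0, 2.0, 3.0]))
h1 = ReLU()(x)       # forward processing builds the chain x <- ReLU <- h1 as a side effect
h1.backward()        # traverses the chain in the reverse order of the forward processing
print(x.grad)        # [0. 1. 1.]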
 The reason why the splitter is inserted into the reference structure is explained here for completeness.
 If "model.l1(x)" is called once more immediately after the above reference structure has been created, the following reference structure is generated.
 x ← splitter ← x' ← l1' ← y
 x ← splitter ← x'' ← l1'' ← z (the splitter is the same instance in both lines)
 In the above reference structure, l1'' represents a copy (shallow copy) of l1 (an instance separate from l1'), x'' represents a Variable object that is a copy of x (an instance separate from x'), and z represents the value (a Variable object) returned by the forward method of l1''.
 When error values are propagated from the output-side layers during the backward processing, the splitter instance adds together the error values transmitted to x' and to x'' and sets the result as the error of x. By inserting the splitter in this way, errors can be propagated during the backward processing from all of the Functions that used x as an input during the forward processing.
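 As a general illustration of this error-adding role (simplified code with assumed names; not the actual implementation), a branch-managing object passes copies of the input forward and, during the backward processing, returns the sum of the errors arriving from the branches:
import numpy as np

class Splitter:
    # Manages a branch point of the network: copies the input for each branch during forward
    # processing, and adds the branch errors together during backward processing.
    def forward(self, x, n_branches):
        return [x.copy() for _ in range(n_branches)]
    def backward(self, branch_grads):
        return sum(branch_grads)          # error of x = sum of the errors of x', x'', ...

splitter = Splitter()
x = np.array([1.0, 2.0])
x_dash, x_ddash = splitter.forward(x, 2)                 # x is used as input by two Functions
grad_x = splitter.backward([np.array([0.1, 0.2]),        # error propagated to x'
                            np.array([0.3, 0.4])])       # error propagated to x''
print(grad_x)                                            # [0.4 0.6]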
 Next, the network configuration of the neural network generated when the source code illustrated in FIG. 13 is executed is described as a supplement. FIG. 14 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 13. In FIG. 14, blocks drawn with dotted lines represent instances of variables, and blocks drawn with solid lines represent functions.
 First, at the time line 7 is executed, an instance 30 of the variable x and an instance of the variable t are generated. For convenience of explanation, FIG. 14 shows only the instance 30 of the variable x, but in practice an instance of the variable t is generated in the same way. At the time line 7 is executed, the instance of the variable x actually holds data such as images or sounds.
 Next, at the time line 8 is executed by the execution unit 130, a neural network is generated in which the function "l1" 31, the function "relu" 32, and the instance 33 of the variable h1 have grown in this order after the instance 30 of the variable x. Note that, at the time line 8 is executed, the execution result of the forward processing described in line 8 is already held by the instance 33 of the variable h1. Also, at the time line 8 is executed, the reference structure for backward processing up to this point has been generated, as described above.
 Next, at the time line 9 is executed by the execution unit 130, a neural network is generated in which the function "l2" 34, the function "relu" 35, and the instance 36 of the variable h2 have grown in this order after the instance 33 of the variable h1. Note that, at the time line 9 is executed, the execution result of the forward processing described in line 9 is already held by the instance 36 of the variable h2. Also, at the time line 9 is executed, the reference structure for backward processing up to this point has been generated, as described above.
 Similarly, at the time line 10 is executed by the execution unit 130, a neural network is generated in which the function "l3" 37 and the instance 38 of the variable y have grown in this order after the instance 36 of the variable h2. Note that, at the time line 10 is executed, the execution result of the forward processing described in line 10 is already held by the instance 38 of the variable y. Also, at the time line 10 is executed, the reference structure for backward processing up to this point has been generated, as described above.
 Finally, at the time line 11 is executed by the execution unit 130, a neural network is generated in which the function "Softmax" 39 and the instance 40 of the variable loss have grown in this order after the instance 38 of the variable y. Note that, at the time line 11 is executed, the execution result of the forward processing described in line 11 is already held by the instance 40 of the variable loss. Also, at the time line 11 is executed, the reference structure for backward processing up to this point has been generated, as described above. At the time line 11 is executed, the forward processing described in the source code is complete. That is, at the time line 11 is executed, the difference between the identification result produced by the finally obtained neural network and the true identification result given by the variable t is held in the instance 40 of the variable loss. The backward processing of the next step is executed with this difference as its input.
 After the forward processing described in lines 8 to 11 is completed, the execution unit 130 next executes line 12, whereby the backward processing is executed. Since the reference structure for backward processing has already been generated, the execution unit 130 can calculate the weight gradient of each intermediate layer included in the neural network (but only of the intermediate layers that have weights) by executing the backward processing on the basis of this reference structure.
 Next, the execution unit 130 executes line 13. As a result, the weight of each intermediate layer (but only of the intermediate layers that have weights) is updated using the weight gradients calculated by executing line 12. That is, learning is executed.
 As described above, with the learning device according to the present embodiment, a developer or the like can construct a neural network, as far as the forward processing is concerned, in a style that describes, line by line, which instance of which variable is given to which function and in which instance of which variable the obtained execution result is held. This allows a developer or the like to describe the forward processing intuitively and easily in the source code. In addition, by describing the forward processing in the source code (without having to be aware of the backward processing) and having the learning device according to the present embodiment execute that source code, a developer or the like can have the learning device automatically execute the backward processing.
 6. Comparative Example 1
 Next, in order to show the advantages of the learning device according to the present embodiment, a case in which processing equivalent to that executed by the source code illustrated in FIG. 13 is described with Caffe according to the prior art will be described. FIG. 15 is a diagram illustrating an example of source code written with Caffe according to the prior art.
 As shown in FIG. 15, the definition of a layer (corresponding to a Function in the present embodiment) is described in the block enclosed in {} immediately after the term "layer". In this prior-art technique, the dependencies between layers must be stated explicitly in the code. For example, the descriptions "top" and "bottom" express the dependencies between layers: "bottom" indicates from which layer the input to a layer is obtained, and "top" indicates to which layer the processing result of the layer is output.
 In this prior-art technique, the network configuration must be defined statically before the learning and identification processing performed by the neural network. That is, the network configuration of the neural network must first be defined, and only then can learning and identification with that neural network be executed. It is therefore difficult to dynamically change the network configuration according to the nature of the data.
 In contrast, in the learning device according to the present embodiment, as described above with reference to FIG. 14, at the time each piece of code defining the configuration of the neural network is executed, the forward processing corresponding to that code is executed. That is, the definition of the configuration of the neural network and the execution of the forward processing based on that configuration are performed at the same timing. This makes it easy to dynamically change the network configuration according to the nature of the data. For example, a branch may be added to the code of FIG. 13 so that the layer used for the forward processing is switched according to the value of the variable t or the data size of the variable x. As another example, in the code of FIG. 19, a variable value may be given as input data in place of the constant "10" in line 9.
 Furthermore, with the prior-art technique, when creating source code, a developer or the like must describe the definition of the network configuration of the neural network so that both the forward processing and the backward processing can be executed appropriately. In contrast, with the learning device according to the present embodiment, as described above with reference to FIG. 14, a developer or the like simply describes the forward processing (the network configuration) without having to consider whether the backward processing can be executed appropriately, and then has the learning device execute the source code, whereupon the learning device automatically executes the backward processing. A developer or the like can therefore construct a neural network and have it perform identification and learning simply and efficiently.
 Moreover, with the prior-art technique, when creating source code, a developer or the like follows the procedure of first defining the neural network so that both the forward processing and the backward processing can be executed appropriately, and then substituting data (input data, training data, and the like) into the neural network thus defined. It is therefore difficult to describe the source code intuitively.
 In contrast, with the learning device according to the present embodiment, at the point of writing the forward processing (the network configuration) line by line, a developer or the like writes the source code in a style that describes, line by line, which instance of which variable is given to which function and in which instance of which variable the obtained execution result is held. This allows a developer or the like to write the source code intuitively.
 7. Operation Example 2
 Next, another specific example of the operations performed by the learning device 10 according to the embodiment having the above configuration will be described. FIG. 16 is a diagram illustrating another example of source code input to the learning device according to an embodiment of the present invention. Note that the source code illustrated in FIG. 16 has been intentionally simplified for the purpose of explaining the features of the learning device according to the present embodiment. The line numbers shown at the left edge of FIG. 16 are given only in order to explain this specific example and are not included in the actual source code.
 With reference to the source code shown in FIG. 16, it will be explained that, with the learning device according to the present embodiment, a neural network can also be constructed easily by using control syntax (here, a for statement).
 Lines 1 to 3 are the same as lines 1 to 5 of the source code shown in FIG. 13, and a detailed description is therefore omitted.
 Line 4 describes looping over the processing described in lines 5 to 10 while the value of i runs from 0 to 1000.
 Lines 5 and 6 are the same as lines 6 and 7 of the source code shown in FIG. 13, and a detailed description is therefore omitted.
 Line 7 describes adding y, the processing result of the function l1 and the function relu, back into the argument of l1.
 Lines 8 to 10 are the same as lines 11 to 13 of the source code shown in FIG. 13, and a detailed description is therefore omitted.
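 FIG. 16 itself is not reproduced in this text. For illustration only, the following Python sketch shows code of the kind described above, again assuming Chainer 1.x-style names; the layer size, the dummy input data, and the initialization of y before the loop are assumptions added so that the sketch is self-consistent.
import numpy as np
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

model = FunctionSet(l1=F.Linear(20, 20))                      # lines 1-3: Function registration and
optimizer = optimizers.Adam()                                 #            Optimizer initialization
optimizer.setup(model.collect_parameters())
y = Variable(np.zeros((1, 20), dtype=np.float32))             # assumed initial value of y
for i in range(0, 1000):                                      # line 4: ordinary for statement
    x = Variable(np.random.rand(1, 20).astype(np.float32))    # lines 5-6: input data held in Variables
    t = Variable(np.array([0], dtype=np.int32))
    y = F.relu(model.l1(x + y))                               # line 7: the previous result y is fed back into l1
    loss = F.softmax_cross_entropy(y, t)                      # line 8
    loss.backward()                                           # line 9: backward processing
    optimizer.update()                                        # line 10: weight update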
 FIG. 17 is a schematic diagram conceptually showing the network configuration of the neural network generated by the source code shown in FIG. 16. In FIG. 17, blocks drawn with dotted lines represent instances of variables, and blocks drawn with solid lines represent functions. For convenience of explanation, FIG. 17 shows only the configuration of the neural network generated for the cases where the variable i is 0 to 2.
 As is clear from FIG. 16 and FIG. 17, with the learning device according to the present embodiment, even a neural network that repeats the same configuration containing variable instances and functions multiple times (here, a configuration in which the function 52 that adds the instance 51 of the variable x and the instance 50 of the variable y is followed in turn by the function "l1" 53 and the function "relu" 54, with the output value of the function "relu" 54 held in an instance of the variable y) can be constructed easily by using simple control syntax (here, a for statement). In other words, the source code used in the learning device according to the present embodiment has a high affinity with the control syntax of the programming language.
 8. Comparative Example 2
 Next, in order to show the advantages of the learning device according to the present embodiment, a case in which processing equivalent to that executed by the source code illustrated in FIG. 16 is described with Caffe according to the prior art will be described. FIG. 18 is a schematic diagram conceptually showing the network configuration of a neural network generated by source code written with Caffe according to the prior art.
 When attempting to construct a neural network similar to the ones illustrated in FIG. 16 and FIG. 17 with Caffe according to the prior art, the configuration of the neural network cannot be defined using control syntax, so a developer or the like must first define a basic configuration as shown in FIG. 18. Next, the developer or the like must specially describe processing that gives the function 72 the initial value of the instance 75 of the variable y, and that gives the function 72 the instance 75 of the variable y at the previous time together with the instance 71 of the variable x at the current time (the portion drawn with thick arrows in FIG. 18). When constructing a neural network in which the above basic configuration is repeated many times, or when constructing a neural network having a multilayer structure, the developer or the like must write such a special description for every such repetition or for every layer of the multilayer structure.
 In contrast, with the learning device according to the present embodiment, as illustrated in FIG. 16 and FIG. 17, the source code can be written simply, using the control syntax of the programming language, without requiring any special description. Therefore, with the learning device according to the present embodiment, even a complex or large-scale neural network can be constructed simply and efficiently.
 9. Additional Function (1)
 The learning device according to an embodiment may be capable of executing a function that severs the reference structure for backward processing.
 Specifically, when the unchain_backward method of an instance of the Variable class is called, the reference structure for backward processing heading from that instance toward the input side is severed. For example, assume that the following reference structure for backward processing has been generated by executing forward processing (detailed components such as splitters are omitted).
 A (input layer) ← Convolution2D ← B ← Linear ← C ← Softmax ← D (output layer)
 Here, A, B, C, and D represent instances of the Variable class, and Convolution2D, Linear, and Softmax represent instances of the Function class.
 At this time, if B.unchain_backward() is called, the reference structure for backward processing is severed starting from B and changes as follows.
 B ← Linear ← C ← Softmax ← D (output layer)
 Consider the situation in which this unchain_backward method is applied to the source code shown in FIG. 16. In that source code, line 7 adds y, the processing result of the function l1 and the function relu, back into the argument of the function l1. Under the "Define-by-Run" mechanism, when the description "x + y" is executed, a copy of y is created, and the reference structure for backward processing generated by the forward processing executed so far is then linked to it. In this example, therefore, the reference structure for backward processing keeps growing every time the loop is repeated.
 The backward processing executed in line 9 is executed on this grown reference structure for backward processing. Since the processing of line 9 is contained in the loop, the computation time of the loop processing as a whole becomes proportional to the square of the loop size.
 FIG. 19 is a diagram illustrating still another example of source code input to the learning device according to an embodiment of the present invention. The line numbers shown at the left edge of FIG. 19 are given only in order to explain this specific example and are not included in the actual source code.
 By modifying the source code shown in FIG. 16 so that unchain_backward is called periodically in line 11, as shown in FIG. 19, the increase in computation time can be suppressed.
 Line 9 describes that the processing of lines 10 to 12 is executed every time the loop starting at line 4 has been executed 10 times.
 Line 11 calls unchain_backward and discards the reference structure for backward processing starting from loss. This keeps the computation time of the loop processing as a whole short.
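 FIG. 19 itself is likewise not reproduced here. The following sketch, which reuses the assumptions of the previous sketch, only illustrates the kind of periodic truncation described above; the accumulation of loss over ten iterations, the call to zero_grads, and the reset of loss are assumptions of the sketch rather than a reproduction of the figure.
import numpy as np
from chainer import FunctionSet, Variable, optimizers
import chainer.functions as F

model = FunctionSet(l1=F.Linear(20, 20))
optimizer = optimizers.Adam()
optimizer.setup(model.collect_parameters())
y = Variable(np.zeros((1, 20), dtype=np.float32))
loss = Variable(np.zeros((), dtype=np.float32))
for i in range(0, 1000):
    x = Variable(np.random.rand(1, 20).astype(np.float32))
    t = Variable(np.array([0], dtype=np.int32))
    y = F.relu(model.l1(x + y))
    loss += F.softmax_cross_entropy(y, t)
    if (i + 1) % 10 == 0:              # line 9 of FIG. 19: every 10 iterations (the constant "10")
        optimizer.zero_grads()
        loss.backward()                # line 10: backward over the recent part of the structure only
        loss.unchain_backward()        # line 11: discard the reference structure starting from loss
        optimizer.update()             # line 12: weight update
        loss = Variable(np.zeros((), dtype=np.float32))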
 By using unchain_backward in this way, even when learning is performed on forward processing whose reference structure contains a loop, excessive growth of the reference structure for backward processing is suppressed, and the learning processing can be executed with a realistic amount of computation.
 Furthermore, in another embodiment, unchain_backward can also be used for the purpose of not updating the weights of a specific Function.
 10. Additional Function (2)
 In the learning device according to an embodiment, a volatile attribute can be specified when an instance of the Variable class is initialized. When the volatile attribute is enabled, the reference structure for backward processing is not generated for forward processing that takes that Variable as input.
 When only the forward processing is executed using learned weights (that is, when there is no need to execute the backward processing), executing the processing that generates the reference structure for backward processing during the execution of the forward processing would waste both execution speed and memory. In such a case, by specifying the volatile attribute when initializing the Variable class instance that holds the input data of the forward processing, the generation of the reference structure for backward processing can be stopped and the forward processing alone can be executed efficiently.
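 As a short illustration of this use of the volatile attribute (again assuming a Chainer 1.x-style API; the layer sizes and the dummy input data are illustrative), specifying volatile=True on the input Variable suppresses the generation of the reference structure, so that only the forward processing is executed:
import numpy as np
from chainer import FunctionSet, Variable
import chainer.functions as F

model = FunctionSet(l1=F.Linear(784, 100), l2=F.Linear(100, 100), l3=F.Linear(100, 10))
x_data = np.random.rand(64, 784).astype(np.float32)
x = Variable(x_data, volatile=True)   # no reference structure for backward processing is generated
h1 = F.relu(model.l1(x))              # forward processing only; execution speed and memory are saved
h2 = F.relu(model.l2(h1))
y = model.l3(h2)                      # y holds the identification result; y.backward() is not available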
 11. Additional Remarks
 The embodiment in which source code written in Python is input to the learning device has been described as the most preferable embodiment, but the technology disclosed in this specification is not limited to the case of using source code written in Python. That is, the technology disclosed in this specification is equally applicable to the case of using source code written in a programming language equivalent to Python (for example, R, Julia, Sparkz, MLib, and the like) that can realize at least one of the following: calculating, at the time the learning device executes each piece of code, the output value of the forward processing described in that code on the basis of the input value; generating a reference structure for backward processing each time the learning device executes the forward processing described in each piece of code (so that the backward processing can be executed on the basis of this reference structure); and defining the configuration of a neural network using control syntax.
 The technology disclosed in this specification can be realized by executing source code written in Python or a programming language equivalent thereto; it may instead be realized by executing a module or library written in Python or a programming language equivalent thereto.
 In this specification, the names used to identify variables, functions, methods, classes, subclasses, and the like do not limit the technology disclosed in this specification and may be arbitrary.
 The processes and procedures described in this specification can be realized not only by what is explicitly described in the embodiments but also by software, hardware, or a combination thereof. Specifically, the processes and procedures described in this specification are realized by implementing logic corresponding to those processes on a medium such as an integrated circuit, a volatile memory, a nonvolatile memory, a magnetic disk, or an optical storage. The processes and procedures described in this specification can also be implemented as a computer program and executed by various kinds of computers.
 Even where it is described that the processes and procedures described in this specification are executed by a single device, a single piece of software, a single component, or a single module, such processes or procedures can be executed by a plurality of devices, a plurality of pieces of software, a plurality of components, and/or a plurality of modules. Even where it is described that the data, tables, or databases described in this specification are stored in a single memory, such data, tables, or databases can be stored in a distributed manner in a plurality of memories provided in a single device or in a plurality of memories distributed over a plurality of devices. Furthermore, the software and hardware elements described in this specification can be realized by integrating them into fewer components or by decomposing them into more components.
 Part 2 (Method for Implementing the Algorithm on an Embedded Chip)
 1. Background
 Deep learning is an algorithm that delivers high performance but demands a large amount of computation, a large amount of memory, and a large number of learning samples. The spread of GPUs and clouds, which provide abundant computing resources at low cost, and of Web infrastructure that makes it possible to share learning samples, can be said to be the background that has supported the recent rise of deep learning.
 There is a wide variety of environments (libraries and frameworks) that support the development of deep learning algorithms. Many development environments provide functions for improving learning speed by using a GPU.
 In fields such as fully autonomous driving of automobiles and highly versatile robot control, advanced information processing capability is required to analyze, in real time, information obtained from various sensors such as cameras and LIDAR (laser distance measurement) and to control countless motors in order to solve the task at hand; the application of deep learning, whose performance sets it apart from conventional approaches, is therefore strongly anticipated.
 However, because of requirements such as safety, chip price, and power consumption, these fields depend on embedded environments whose computing resources are poor compared with GPUs and clouds, and the application of deep learning, which requires large computing resources, has therefore been delayed.
 The reasons why the application of deep learning to embedded environments has been delayed include not only the aspect that the demand of such algorithms for computing resources exceeds the performance of realistic and economical embedded environments, but also the aspect that implementations supporting deep learning, beginning with the software environment, are not yet fully available.
 Even in embedded environments, hardware performance is improving year by year, and deep learning algorithms also continue to be improved so as to relax their demands on computing resources, so the former factor is expected to be resolved gradually.
 The problem to be solved by the embodiments of the present invention is to break through the barriers to adapting deep learning to embedded environments, which remain mainly on the software environment side, and to accelerate development, by developing a framework for designing deep learning algorithms that operate on an embedded chip while satisfying product-level requirements.
 Extending, for embedded environments, the functions of the learning device according to the embodiment described in Part 1 above, which is a framework that, while GPU-based, provides high productivity in the development of deep learning algorithms, is considered the best means of solving this problem; therefore, from the next section onward, the problems of adapting to embedded environments are described with a focus on the learning device according to the embodiment.
 2. Problems of the Implementation Method According to the Embodiment
 Since the learning device according to the embodiment described in Part 1 above depends on advanced language functions and libraries, attempting to run an algorithm that operates on this learning device as it is on an embedded semiconductor chip may cause the following adverse effects.
 First, regarding security, as the scale of the libraries and the language grows, the degree to which an application depends on implementations that are effectively unknowable increases. Accordingly, the risk increases that defects contained in such implementations become, as they are, defects of the chip product.
 Next, regarding the footprint, the implementation of the libraries and the language itself puts pressure on the memory resources of the chip product.
 Furthermore, regarding overhead, the computing resources of the chip product cannot be fully utilized through libraries with highly abstract APIs. At least for the large-scale computations required by neural networks, low-level performance tuning specialized for the chip is essential.
 For the above reasons, simply running an algorithm that operates on the learning device according to the embodiment as it is on an embedded semiconductor chip is highly unlikely to satisfy product-level requirements.
 3. Concept of the Implementation Method According to the Embodiment
 The implementation method according to the embodiment realizes, in the shortest possible period, a state in which a new neural network (NN) algorithm designed on a personal computer or the like having abundant computing resources can operate on an arbitrary embedded chip (embedded semiconductor integrated circuit) while satisfying product-level requirements. To this end, it is desirable that the developers who design the algorithm and the developers who are deeply conscious of the hardware can proceed with their work as independently of each other as possible. The present embodiment proposes a technical idea concerning a device (framework) that assists this.
 4. Development Steps Assumed in Embedded Chip Development
 The following three steps are assumed as the steps followed when developing an embedded chip.
 Step I: A state in which the code used in the learning device according to the embodiment (as one example, code written in Python) runs on a PC (+GPU)
 This is a state in which the design and verification of an algorithm using a neural network having a complex configuration are realized with a small amount of code description. This is the concept of the "Define-by-Run" technique described above.
 Step II: A state in which an implementation optimized for the chip and the Python code coexist
 This is a state in which the operation check and performance verification, on the chip, of the algorithm designed with the learning device according to the embodiment are realized with almost no change to the Python code.
 Step III: A state in which the algorithm designed with the learning device according to the embodiment operates only with the implementation optimized for the chip
 This is a state in which the algorithm operates while satisfying the product-level specification requirements of the chip (that is, the algorithm can cooperate on the chip with other modules and control mechanisms with high real-time performance).
 The implementation method according to the present embodiment proposes a framework that makes it possible to proceed with development in a short period when a new algorithm is developed with the learning device according to the embodiment, by eliminating, as far as possible, the labor of re-correction, redesign, and relearning between steps I to III above.
 4-1. Step I
 FIG. 20 is a schematic diagram for explaining step I of the implementation method according to an embodiment of the present invention. The configuration shown in FIG. 20 is the configuration assumed by the learning device according to the embodiment described in Part 1 above. That is, in this configuration, source code written in Python, which is one form of programming language, uses PyCUDA, which is one form of library, and numpy (BLAS), which is another form of library, and these libraries drive the GPU and a general-purpose computer, respectively. Note that "Chainer" shown in FIG. 20 is the name given by the present applicant to the framework for writing the source code used in the learning device according to the embodiment described in Part 1 above.
4-2. Step II
FIG. 21 is a schematic diagram explaining step II of the mounting method according to the embodiment of the present invention. In the configuration shown in FIG. 21, the Chainer front end runs on Python. As shown in FIG. 21, this embodiment provides a Native I/F (an interface for calling implementations, written in a low-level language such as C, that are equivalent to Chainer's main functions), so that execution on the PC and execution optimized for the embedded chip can be driven by the same code.
FIG. 22 is a schematic diagram explaining the case where an execution unit based on Python and an execution unit based on the chip communicate with each other. As shown in FIG. 22, by providing a communication function in the Native I/F implementation, it is also possible to remove the dependency on Python from the configuration on the embedded chip (Chainer on the PC drives the optimized implementation on the embedded chip).
Implementation of the Native I/F
Reference code (assumed to be written in a low-level language such as C) is implemented for Chainer's Function and Optimizer. This reference code is implemented without depending on external libraries such as numpy. A memory pool mechanism suitable for dynamic network definition is also implemented. In addition, data conversion functions to and from numpy are created separately from Function/Optimizer. Furthermore, a floating-point version of the reference code for the above Function/Optimizer is created.
Furthermore, a fixed-point version of the reference code for the above Function/Optimizer is created, together with data conversion functions between floating point and fixed point, separate from Function/Optimizer. The reason for creating the fixed-point reference code is that quite a few chips do not have an FPU.
Code optimized for various chips is then implemented based on the above reference code.
4-3. Step III
FIG. 23 is a schematic diagram explaining step III of the mounting method according to the embodiment of the present invention. As shown in FIG. 23, a method for outputting the network definition and weights from Chainer as bytecode is added. In addition, a virtual machine is provided that interprets the bytecode and executes neural network processing (forward processing, backward processing, and weight update). The Native I/F implementations optimized for the chip can be reused.
Configuration 1
(Configuration of the Native I/F)
FIG. 42 is a diagram showing a configuration example of the Native I/F according to an embodiment of the present invention.
In this configuration, an interface independent of the type of computer is provided for each NN algorithm. A processing system that uses an NN algorithm instructs a specific computer to execute the algorithm via this interface.
The interface here is a means of defining the format of input data, the format of output data, and the correspondence between the processing applied to the input data format and the output data format. If the interface is the same, the same output is obtained for the same input. One example is a function written in the C language together with its function declaration.
The processing system that uses the NN algorithm is not particularly limited. Examples include existing frameworks for NN design (Chainer and others) as well as processing systems developed together with the algorithm itself.
A computer here means a device that executes computation: a device including computing cores, a memory hierarchy, and the hardware resources necessary to carry out the computation.
A general-purpose computer means a commonly used computer, on which conventional software such as the Linux (registered trademark) OS and Python runs easily.
An accelerator here means a device that executes specific computations, including NN algorithm computations, at high speed.
A GPU here is a computer specialized for image processing that also has the ability to execute general-purpose computation. A GPU is one form of accelerator. Because software assets such as CUDA exist, the ease of implementing NN algorithms on a GPU is roughly intermediate between a general-purpose computer and a typical accelerator.
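As an illustration only, the following Python sketch shows the general shape of such a computer-type-independent interface together with a call manager that dispatches to whichever implementation is registered for a given device; all names here (NativeCallManager, linear_forward_cpu, the "generic" device label) are hypothetical and are not the actual Native I/F.

import numpy as np

class NativeCallManager:
    """Hypothetical call manager: maps (algorithm, device) to an implementation."""
    def __init__(self):
        self._impls = {}

    def register(self, algorithm, device, impl):
        self._impls[(algorithm, device)] = impl

    def call(self, algorithm, device, *args):
        impl = self._impls.get((algorithm, device))
        if impl is None:
            # No implementation for this device: report an error to the caller.
            raise NotImplementedError(f"{algorithm} not available on {device}")
        return impl(*args)

# Example: the same "linear_forward" interface backed by a general-purpose
# computer implementation; an accelerator would register its own version.
def linear_forward_cpu(x, W, b):
    return x @ W.T + b

manager = NativeCallManager()
manager.register("linear_forward", "generic", linear_forward_cpu)
y = manager.call("linear_forward", "generic",
                 np.ones((2, 3), dtype=np.float32),
                 np.ones((4, 3), dtype=np.float32),
                 np.zeros(4, dtype=np.float32))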
Configuration 1-1
(Configuration for executing identification and learning by an NN)
FIG. 43 is a diagram showing a configuration example for executing identification and learning by an NN according to an embodiment of the present invention.
The Native I/F has at least a Forward processing unit. With this configuration, the Native I/F can execute identification processing using an NN algorithm.
Furthermore, the Native I/F has at least a Forward processing unit, a Backward processing unit, an internal-state initialization processing unit for the weight update algorithm, and a weight update processing unit. With this configuration, the Native I/F can execute identification processing and learning processing using an NN algorithm.
A Forward processing unit and a Backward processing unit are included for each layer algorithm. An internal-state initialization processing unit and a weight update processing unit are included for each weight update algorithm.
Furthermore, the Native I/F has, for each layer algorithm, a Forward processing call interface and a Backward processing call interface, and, for each weight update algorithm, an internal-state initialization processing interface and a weight update processing call interface.
Furthermore, the implementation called through the Native I/F has a Native I/F call management unit. With this configuration, the implementation called through the Native I/F can switch, depending on differences in the Native I/F parameters, to the implementation that can best execute the requested Native I/F operation. If no implementation capable of executing the requested operation exists, the Native I/F call management unit returns an error to the caller. Thus, the implementation called through the Native I/F can select and execute the implementation that performs the operation optimally.
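The following is a minimal Python sketch of the four kinds of call interfaces listed above, using a linear layer and momentum SGD purely as illustrative examples; the function names and signatures are assumptions, not the actual Native I/F.

import numpy as np

# Layer algorithm: Forward and Backward call interfaces (linear layer example).
def linear_forward(x, W, b):
    return x @ W.T + b

def linear_backward(x, W, gy):
    gx = gy @ W          # gradient w.r.t. the input
    gW = gy.T @ x        # gradient w.r.t. the weight
    gb = gy.sum(axis=0)  # gradient w.r.t. the bias
    return gx, gW, gb

# Weight update algorithm: internal-state initialization and update interfaces
# (plain SGD with momentum as an illustrative example).
def momentum_sgd_init_state(W):
    return {"v": np.zeros_like(W)}

def momentum_sgd_update(W, gW, state, lr=0.01, momentum=0.9):
    state["v"] = momentum * state["v"] - lr * gW
    W += state["v"]
    return W

# One training step expressed through these call interfaces.
x = np.random.randn(2, 3).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
y = linear_forward(x, W, b)
gx, gW, gb = linear_backward(x, W, np.ones_like(y))
state = momentum_sgd_init_state(W)
W = momentum_sgd_update(W, gW, state)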
Configuration 1-1-1
(Configuration 1 for executing identification and learning by an NN: a configuration that manages multidimensional arrays (multidimensional array management unit))
FIG. 44 is a diagram showing a configuration example for managing multidimensional arrays according to an embodiment of the present invention.
The Native I/F further has a multidimensional array management unit. The multidimensional array management unit can perform at least one operation selected from the group including generation and destruction of multidimensional arrays, acquisition of attributes (number of axes, number of elements per axis), acquisition of aggregate results (such as the sum, mean, and variance per axis), and elementwise arithmetic operations between multidimensional arrays.
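A minimal sketch, under assumed names and with numpy as the backing store, of the operations attributed to the multidimensional array management unit:

import numpy as np

class MDArrayManager:
    def create(self, shape, dtype=np.float32):
        return np.zeros(shape, dtype=dtype)

    def destroy(self, arr):
        del arr  # no-op here; a real accelerator implementation would release device memory

    def attributes(self, arr):
        # number of axes and number of elements per axis
        return {"ndim": arr.ndim, "shape": arr.shape}

    def aggregate(self, arr, axis=None):
        # per-axis sum, mean, and variance, usable to sanity-check results
        return {"sum": arr.sum(axis=axis),
                "mean": arr.mean(axis=axis),
                "var": arr.var(axis=axis)}

    def elementwise(self, op, a, b):
        ops = {"add": np.add, "sub": np.subtract,
               "mul": np.multiply, "div": np.divide}
        return ops[op](a, b)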
Configuration 1-2
(Configuration for sharing data)
Configuration 1-2-1
(Configuration 1 for sharing data: the data representation conversion unit)
FIG. 45 is a diagram showing a configuration example of the data representation conversion unit according to an embodiment of the present invention.
Furthermore, the Native I/F has a data representation conversion unit. The data representation conversion unit can convert between a data representation that depends on a specific computer (device-dependent data representation) and a data representation that does not depend on any specific computer (device-independent data representation).
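A minimal sketch of the idea, assuming the device-dependent representation can be treated as a raw byte buffer plus metadata and the device-independent representation as a plain numpy array; the actual representations are device-specific and are not defined here.

import numpy as np

def to_device_independent(dev_blob, shape, dtype=np.float32):
    """Device-dependent (bytes + metadata) -> device-independent (numpy array)."""
    return np.frombuffer(dev_blob, dtype=dtype).reshape(shape).copy()

def to_device_dependent(array):
    """Device-independent (numpy array) -> device-dependent (raw bytes + metadata)."""
    array = np.ascontiguousarray(array)
    return array.tobytes(), array.shape, array.dtype

# Round trip: weights produced on one computer can be handed to another.
w = np.random.randn(4, 3).astype(np.float32)
blob, shape, dtype = to_device_dependent(w)
w2 = to_device_independent(blob, shape, dtype)
assert np.array_equal(w, w2)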
Configuration 1-2-2
(Configuration 2 for sharing data: additionally having an external storage medium)
Furthermore, the processing system that calls the Native I/F has an external storage medium. The external storage medium can store weight data that has been converted into device-independent data.
Configuration 1-2-3
(Configuration 3 for sharing data: additionally having a communication unit)
FIG. 46 is a diagram showing a configuration example of the communication unit according to an embodiment of the present invention.
Furthermore, the implementation called through the Native I/F has a communication unit. The communication unit can communicate Native I/F call information to the implementation on the called side. When an arbitrary processing system that uses an NN algorithm calls a Native I/F whose definition does not depend on whether call information is communicated, the implementation called through that Native I/F can execute the optimal communication processing as needed. Through this step, the physical distance between computers, the presence or absence of shared memory, and differences in communication protocols can be hidden from any processing system that uses the NN algorithm.
Examples of Native I/Fs whose definitions do not depend on whether call information is communicated include the interface for executing a layer algorithm, the interface for executing a weight update algorithm, and the interface for executing data representation conversion.
Configuration 2
(Configuration of the extended version of the Native I/F)
Configuration 2-1
(Configuration 1 of the extended version of the Native I/F: having a type conversion unit, and a floating-point NN algorithm execution unit and/or a fixed-point NN algorithm execution unit)
FIG. 47 is a diagram showing a configuration example of floating-point and fixed-point execution units and a type conversion unit according to an embodiment of the present invention.
The Native I/F has a type conversion unit, and a floating-point NN algorithm execution unit and/or a fixed-point NN algorithm execution unit.
For example, consider computer B having only a type conversion unit, computer A having only a floating-point NN algorithm execution unit, and computer C having only a fixed-point NN algorithm execution unit. When computers A, B, and C are combined with the basic configuration of the Native I/F, the floating-point data generated by computer A is transferred to computer B. The data transferred from computer A to computer B is then converted into fixed-point data by computer B, and the converted fixed-point data is transferred to computer C. The fixed-point data transferred from computer B becomes the input data of computer C, and the overall operation of the NN algorithm is executed. These steps can also be executed in reverse order.
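A minimal sketch of the type conversion step, assuming a simple Q-format fixed-point encoding; the number of fractional bits is an arbitrary choice for illustration.

import numpy as np

FRAC_BITS = 8  # assumed number of fractional bits (Q8 format)

def float_to_fixed(x, frac_bits=FRAC_BITS):
    """Floating-point array -> fixed-point array stored as int32."""
    return np.round(x * (1 << frac_bits)).astype(np.int32)

def fixed_to_float(q, frac_bits=FRAC_BITS):
    """Fixed-point int32 array -> floating-point array."""
    return q.astype(np.float32) / (1 << frac_bits)

# Computer A produces floating-point data; "computer B" converts it; "computer C"
# would consume the fixed-point data with its fixed-point execution unit.
x_float = np.array([0.5, -1.25, 3.0], dtype=np.float32)
x_fixed = float_to_fixed(x_float)
assert np.allclose(fixed_to_float(x_fixed), x_float, atol=1.0 / (1 << FRAC_BITS))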
Configuration 2-2
(Configuration 2 of the extended version of the Native I/F: having a memory pool module)
FIG. 48 is a diagram showing a configuration example of a memory pool according to an embodiment of the present invention.
Furthermore, the implementation called through the Native I/F has a memory pool module. The memory pool module can realize dynamic memory management.
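A minimal sketch of a memory pool of the kind described, keyed by buffer size and data type so that released buffers are handed out again instead of being reallocated; the design details are assumptions for illustration.

import numpy as np
from collections import defaultdict

class MemoryPool:
    def __init__(self):
        self._free = defaultdict(list)  # (size, dtype) -> list of reusable buffers

    def acquire(self, shape, dtype=np.float32):
        dtype = np.dtype(dtype)
        key = (int(np.prod(shape)), dtype)
        if self._free[key]:
            buf = self._free[key].pop()       # reuse a previously released buffer
        else:
            buf = np.empty(key[0], dtype=dtype)  # allocate only on a pool miss
        return buf.reshape(shape)

    def release(self, buf):
        self._free[(buf.size, buf.dtype)].append(buf.ravel())

pool = MemoryPool()
a = pool.acquire((32, 32))
pool.release(a)
b = pool.acquire((32, 32))  # reuses the buffer released above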
Configuration 2-3
(Configuration 3 of the extended version of the Native I/F: having an algorithm execution unit that fuses a plurality of NN algorithms)
FIG. 49 is a diagram showing a configuration example of an algorithm execution unit that fuses a plurality of NN algorithms according to an embodiment of the present invention.
Furthermore, the Native I/F has an algorithm execution unit that fuses a plurality of NN algorithms. For frequently occurring combinations of NN algorithms, this execution unit executes the plurality of algorithms at once.
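As a toy illustration of such fusion, the sketch below assumes that Linear followed by ReLU is one of the frequent combinations and executes the pair in a single pass, so that the intermediate result never has to travel through global memory between two separate calls.

import numpy as np

def linear_relu_fused_forward(x, W, b):
    # single pass: the intermediate linear output stays local to this function
    return np.maximum(x @ W.T + b, 0.0)

# The fused call produces the same result as the two-step sequence.
x = np.random.randn(2, 3).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
h = x @ W.T + b
assert np.allclose(linear_relu_fused_forward(x, W, b), np.maximum(h, 0.0))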
Configuration 2-4
(Configuration 4 of the extended version of the Native I/F: having a multidimensional array data compression/decompression unit)
FIG. 50 is a diagram showing a configuration example of a unit for reducing the data communication volume of multidimensional arrays according to an embodiment of the present invention.
Furthermore, the implementation called through the Native I/F has a multidimensional array data compression/decompression unit. The multidimensional array data compression/decompression unit is provided in the communication unit.
Configuration 3
(Configuration of the Native I/F + Chainer execution unit)
FIG. 51 is a diagram showing an example of cooperation with existing execution units according to an embodiment of the present invention.
Configuration 3-1
(Configuration 1 of the Native I/F + Chainer execution unit: having a bytecode generation unit and a virtual machine)
FIG. 53 is a diagram showing a configuration example of the bytecode generation unit and the virtual machine according to an embodiment of the present invention.
Furthermore, the Chainer execution unit has a bytecode generation unit. The bytecode generation unit takes the Backward calculation procedure and the weights as input and outputs them as bytecode. For example, the bytecode generation unit is provided in Chainer's Python layer.
The Native I/F also has a virtual machine. The virtual machine interprets the bytecode and executes NN algorithm processing. The NN algorithm processing here is any one of, or a combination of, forward processing, backward processing, and weight update.
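A minimal sketch of the bytecode generation unit and virtual machine, with an invented bytecode layout (a serialized list of function-call records plus weights); the real bytecode format may additionally carry input/output data information and function call information for backward processing, as described elsewhere in this specification.

import pickle
import numpy as np

def generate_bytecode(call_sequence, weights):
    """Serialize the network definition (call records) and the weights."""
    return pickle.dumps({"calls": call_sequence, "weights": weights})

def virtual_machine_forward(bytecode, x):
    program = pickle.loads(bytecode)
    kernels = {
        "linear": lambda x, W, b: x @ W.T + b,
        "relu": lambda x: np.maximum(x, 0.0),
    }
    for call in program["calls"]:
        name = call["function"]
        if name == "linear":
            W, b = program["weights"][call["weight_key"]]
            x = kernels["linear"](x, W, b)
        else:
            x = kernels[name](x)
    return x

# A two-layer network expressed as bytecode and run on the virtual machine.
weights = {"l1": (np.random.randn(4, 3).astype(np.float32),
                  np.zeros(4, dtype=np.float32))}
code = generate_bytecode(
    [{"function": "linear", "weight_key": "l1"}, {"function": "relu"}], weights)
y = virtual_machine_forward(code, np.ones((2, 3), dtype=np.float32))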
Configuration 3-2
(Configuration 2 of the Native I/F + Chainer execution unit: having a comparison unit)
FIG. 54 is a diagram showing a configuration example of the comparison unit according to an embodiment of the present invention.
Furthermore, the Chainer execution unit has a comparison unit. The comparison unit compares the input/output results of an existing execution unit and a Native layer execution unit corresponding to the same NN algorithm, or compares the input/output results of Native layer execution units that call different implementations of the same Native I/F.
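A minimal sketch of what such a comparison might look like, with assumed names and an assumed tolerance; the "Native" implementation here is merely a stand-in for an optimized execution unit.

import numpy as np

def compare_executions(existing_impl, native_impl, inputs, atol=1e-5):
    y_existing = existing_impl(*inputs)
    y_native = native_impl(*inputs)
    max_diff = float(np.max(np.abs(y_existing - y_native)))
    return {"match": max_diff <= atol, "max_abs_diff": max_diff}

# Example: an existing (numpy) linear forward vs. a "Native" variant.
def linear_existing(x, W, b):
    return x @ W.T + b

def linear_native(x, W, b):
    return np.add(np.dot(x, W.T), b)  # stand-in for the optimized implementation

x = np.random.randn(2, 3).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
print(compare_executions(linear_existing, linear_native, (x, W, b)))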
Configuration 3-3
(Configuration 3 of the Native I/F + Chainer execution unit: having a function synthesis unit)
FIG. 55 is a diagram showing a configuration example of the function synthesis unit according to an embodiment of the present invention.
Furthermore, the Chainer execution unit has a function synthesis unit. The function synthesis unit takes the Backward calculation procedure as input and replaces combinations of Function class instances that a "Native I/F that executes a plurality of algorithms at once" can handle with the Function class instance corresponding to that Native I/F. However, if no "Native I/F that executes a plurality of algorithms at once" exists in the Native layer implementation for the computer that executes the Backward calculation procedure, this replacement is not performed.
The replacement here can be executed by a partial match search, regarding the Backward calculation procedure as a character string.
For example, the function synthesis unit is provided in Chainer's Python layer.
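A minimal sketch of the replacement by partial match, treating the calculation procedure as a sequence of function names; the fusion rules and fused interface names below are assumptions for illustration only.

FUSION_RULES = {
    ("Convolution2D", "ReLU"): "Convolution2DReLU",  # assumed fused interface
    ("Linear", "ReLU"): "LinearReLU",
}

def synthesize(procedure, fused_available):
    """procedure: list of function names; fused_available: set of fused names
    supported by the Native layer of the target computer."""
    result, i = [], 0
    while i < len(procedure):
        pair = tuple(procedure[i:i + 2])
        fused = FUSION_RULES.get(pair)
        if fused is not None and fused in fused_available:
            result.append(fused)   # replace the pair with the fused instance
            i += 2
        else:
            result.append(procedure[i])  # no fused Native I/F: leave as-is
            i += 1
    return result

print(synthesize(["Convolution2D", "ReLU", "Linear", "Softmax"],
                 fused_available={"Convolution2DReLU"}))
# -> ['Convolution2DReLU', 'Linear', 'Softmax']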
Configuration 4
(Configuration of an optimization device specialized for Forward processing execution)
Configuration 4-1
(Configuration 1 of the optimization device specialized for Forward processing execution: having a weight optimization processing means)
Furthermore, the Chainer execution unit has a weight optimization processing means. The weight optimization processing means executes weight optimization processing suited to the Function class.
Configuration 4-2
(Configuration 2 of the optimization device specialized for Forward processing execution: having a data memory area reuse means)
Furthermore, the Chainer execution unit and the Native I/F have a data memory area reuse means. The data memory area reuse means reuses the memory areas of data input and output between layers. The reuse means is provided in the Forward processing execution unit or in the virtual machine described above.
For example, a flag for identifying that only Forward processing is to be executed is provided as an argument of the interface (defined by the Native I/F) that executes the Forward processing of the virtual machine. The conditions for executing this processing are that the volatile attribute is specified on the Variable variable input to an instance of Chainer's Function class, or that the flag identifying Forward-only execution is enabled when the Forward processing of the virtual machine is executed.
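A minimal sketch of the flag semantics assumed here: when the forward-only flag is set, the buffer holding a layer's input is handed back for reuse as soon as the next output exists, since it will never be needed for Backward processing.

import numpy as np

def run_forward(layers, x, forward_only=False):
    released = []  # collected here to stand in for returning buffers to a memory pool
    for layer in layers:
        y = layer(x)
        if forward_only:
            released.append(x)  # x is no longer needed; its memory may be reused
        x = y
    return x

layers = [lambda v: np.maximum(v, 0.0), lambda v: v * 2.0]
out = run_forward(layers, np.random.randn(8, 8).astype(np.float32),
                  forward_only=True)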
Action 1
(Effects of the configuration of the Native I/F)
The division of labor between developers who design and use NN algorithms and developers who are deeply aware of the hardware configuration of computers becomes easier.
For example, for a developer who designs and uses algorithms, the Native I/F guarantees that the interface for each NN algorithm to be executed is identical, so processing can be executed on various computers without changing the software that calls the Native I/F.
Specifically, the risk that the software under development becomes dependent on a specific computer can be reduced. As a result, computers can be selected based on more essential criteria, such as the price of the computer and its strengths and weaknesses for a specific application.
For a developer who is deeply aware of the hardware configuration of a computer, providing an optimized implementation for a computer that supports the Native I/F allows a wide range of NN algorithm users to use the computer that the developer has built.
Action 1-1
(Effects of the configuration for executing identification and learning by an NN)
A developer who designs and uses NN algorithms can realize the overall operation of an NN algorithm by calling the interfaces provided by the Native I/F from any processing system that uses NN algorithms.
In addition, such a developer can realize the overall operation of an NN algorithm using the implementation best suited to the computer in use, without being aware of the specific configuration of that computer.
Action 1-1-1
(Effects of configuration 1 for executing identification and learning by an NN: the configuration that manages multidimensional arrays (multidimensional array management unit))
When executing the overall operation of an NN algorithm, a developer who designs and uses NN algorithms can combine and execute arbitrary NN algorithms without going through unnecessary data conversion processing.
At this time, by checking the aggregate results of the contents of the multidimensional arrays that are the processing results of an arbitrary NN algorithm, it is possible to confirm whether the NN algorithm is performing the calculation as intended.
Action 1-2
(Effects of the configuration for sharing data)
Action 1-2-1
(Effects of configuration 1 for sharing data: the data representation conversion unit)
By going through the device-independent data representation, the data necessary to realize the overall operation of an NN algorithm can be exchanged between computers with different hardware configurations. Information unique to each computer can be hidden.
Action 1-2-2
(Effects of configuration 2 for sharing data: additionally having an external storage medium)
By converting weight data into the device-independent data representation and then saving it on an external storage medium, identification processing can be executed on any computer using weights learned on a specific computer.
Action 1-2-3
(Effects of configuration 3 for sharing data: additionally having a communication unit)
Regardless of the hardware configuration of the computers, their physical distance, or the presence or absence of shared memory, the data necessary to realize the overall operation of an NN algorithm can be exchanged.
It is also possible, from a computer on which the processing system that uses the NN algorithm can operate, to call an NN algorithm implementation residing on a computer on which that processing system cannot operate.
Therefore, the overall operation of an NN algorithm can be realized using a plurality of computers connected to a computer network.
Action 2
(Effects of the configuration of the extended version of the Native I/F)
Action 2-1
(Effects of configuration 1 of the extended version of the Native I/F: having a type conversion unit, and a floating-point NN algorithm execution unit and/or a fixed-point NN algorithm execution unit)
In a hardware configuration in which computers without a floating-point unit (FPU) and computers with an FPU are mixed, the overall operation of an NN algorithm can be realized using the data type suited to each computer. The overall operation of an NN algorithm can be realized using floating-point arithmetic or fixed-point arithmetic.
Specifically, computer A transfers the floating-point data generated by its floating-point NN algorithm execution unit to computer B. Computer B then converts the floating-point data transferred from computer A into fixed-point data with its type conversion unit and transfers the fixed-point data to computer C.
Computer C transfers the fixed-point data generated by its fixed-point NN algorithm execution unit to computer B. Computer B then converts the fixed-point data transferred from computer C into floating-point data with its type conversion unit and transfers the floating-point data to computer A.
Action 2-2
(Effects of configuration 2 of the extended version of the Native I/F: having a memory pool module)
When a processing system that depends on a dynamic memory management mechanism calls Native I/Fs involving the generation and destruction of data and executes the overall operation of an NN algorithm, that operation can be realized with little overhead.
Action 2-3
(Effects of configuration 3 of the extended version of the Native I/F: having an algorithm execution unit that fuses a plurality of NN algorithms)
Unnecessary accesses to global memory can be avoided, and the overhead of function calls can be reduced. Therefore, frequently occurring combinations of NN algorithms can be executed at high speed.
Action 2-4
(Effects of configuration 4 of the extended version of the Native I/F: having a multidimensional array data compression/decompression unit)
When the overall operation of an NN algorithm is executed using a plurality of computers connected to a computer network, the data communication volume of multidimensional arrays can be reduced. Therefore, the operation speed can be improved.
Action 3
(Effects of the Native I/F + Chainer execution unit)
The overall operation of an NN can be defined and executed by combining NN algorithms that are supported by the Native I/F with NN algorithms that are not.
As soon as Native I/F support becomes available, the overall NN operation can be executed by substituting the corresponding Native I/F as appropriate. Therefore, existing software does not need to be modified.
Even when Native I/Fs are combined, the existing benefits of Define-by-Run are retained.
Action 3-1
(Effects of configuration 1 of the Native I/F + Chainer execution unit: having a bytecode generation unit and a virtual machine)
Because the Chainer execution unit has a bytecode generation unit and the Native I/F has a virtual machine, the dependence on advanced libraries and programming languages can be reduced. Therefore, even on various computers including poor execution environments such as accelerators, the overall operation of an NN designed with Chainer can be executed while satisfying product-level requirements.
Action 3-2
(Effects of configuration 2 of the Native I/F + Chainer execution unit: having a comparison unit)
The comparison unit compares the input/output results of an existing execution unit and a Native layer execution unit corresponding to the same NN algorithm, and compares the input/output results of Native layer execution units that call different implementations of the same Native I/F.
Having such a comparison unit makes it possible to compare the accuracy of the processing results of the floating-point NN algorithm execution unit with the accuracy of the processing results of the fixed-point NN algorithm execution unit. It also makes it possible to compare the processing results of an execution unit that has already been sufficiently tested to compute the NN algorithm correctly with the processing results of a newly created Native layer. It can therefore be guaranteed that the newly created Native layer implementation computes the NN algorithm correctly.
Action 3-3
(Effects of configuration 3 of the Native I/F + Chainer execution unit: having a function synthesis unit)
The function synthesis unit takes the Backward calculation procedure as input and replaces combinations of Function class instances that a "Native I/F that executes a plurality of algorithms at once" can handle with the Function class instance that corresponds one-to-one to that Native I/F. If no such "Native I/F that executes a plurality of algorithms at once" exists, the function synthesis unit does not perform this replacement.
By having this function synthesis unit in Chainer's Python layer, the Backward calculation procedure is processed automatically regardless of whether a "Native I/F that executes a plurality of algorithms at once" exists. Through this processing of the Backward calculation procedure, when such a Native I/F exists, it is called and the combination is replaced with the corresponding Function class instance. This makes it possible to always realize a fast overall operation of the NN algorithm.
Even for combinations of functions for which no "Native I/F that executes a plurality of algorithms at once" exists, the function synthesis unit can still be beneficial. Specific examples are the combination Convolution2D + BatchNormalization and the combination Linear + BatchNormalization when execution is limited to Forward processing.
BatchNormalization is a process that, for each element of its input multidimensional array, normalizes the variance across elements and removes the mean, based on long-term statistics gathered through NN training. When only Forward processing is performed rather than training, there is no need to update the variance and mean; with constants a and b, the process is nothing more than a per-element transformation of the form y = ax + b. Linear processing is a matrix product, and Convolution2D is a computation combining convolution and matrix products. Since these processes include a transformation of the form y = ax + b as illustrated above, adjusting the weights and biases of Linear or Convolution2D yields the same results as feeding the outputs of these Functions into BatchNormalization.
By adjusting the weights and biases in this way, the function synthesis unit can convert Convolution2D + BatchNormalization into a single Convolution2D. The conversion from Linear + BatchNormalization into a single Linear is analogous.
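A worked sketch of the weight and bias adjustment described above for the Linear + BatchNormalization case, using numpy; the helper name and the epsilon value are illustrative assumptions. With a = gamma / sqrt(var + eps) and b' = beta - a * mean, folding the BatchNormalization into the preceding Linear amounts to scaling each output row of W by a and adjusting the bias accordingly.

import numpy as np

def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    a = gamma / np.sqrt(var + eps)          # per-output-channel scale
    W_folded = W * a[:, np.newaxis]         # scale each output row of W
    b_folded = a * (b - mean) + beta        # adjusted bias
    return W_folded, b_folded

# Check: Linear followed by inference-time BatchNormalization equals the folded Linear alone.
rng = np.random.RandomState(0)
x = rng.randn(5, 3).astype(np.float32)
W = rng.randn(4, 3).astype(np.float32)
b = rng.randn(4).astype(np.float32)
gamma = rng.rand(4).astype(np.float32)
beta = rng.randn(4).astype(np.float32)
mean = rng.randn(4).astype(np.float32)
var = (rng.rand(4) + 0.1).astype(np.float32)

y_ref = (x @ W.T + b - mean) / np.sqrt(var + 1e-5) * gamma + beta
Wf, bf = fold_bn_into_linear(W, b, gamma, beta, mean, var)
assert np.allclose(x @ Wf.T + bf, y_ref, atol=1e-4)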
Action 4
(Effects of the configuration of the optimization device specialized for Forward processing execution)
Memory can be reduced by reducing the amount of weight information or the data memory amount of the input data and then executing Forward processing. The amount of computation can also be reduced by reducing the number of weight elements or by executing Forward processing without computing weights that are zero.
Action 4-1
(Effects of optimization device 1 specialized for Forward processing execution: having a weight optimization processing means)
By providing a weight optimization processing means specialized for Forward processing in Chainer's Function class, weight optimization processing can be executed for any Function class instance included in the configuration of a trained network. Being able to execute this weight optimization processing reduces the memory and the amount of computation required for Forward processing. As a result, the overall operation of the NN algorithm can be executed at high speed.
Action 4-2
(Effects of optimization device 2 specialized for Forward processing execution: having a data memory area reuse means)
By giving a flag indicating Forward-only execution as an argument to the Forward processing execution unit (Chainer or the virtual machine), the memory required for Forward processing can be reduced. As a result, the overall operation of the NN algorithm can be executed at high speed.
5. Specific Procedure of the Mounting Method According to the Embodiment
The mounting method according to the embodiment includes a first method and a second method.
5-1. First Method
FIG. 24 is a schematic diagram showing a configuration example of a mounting apparatus used in the mounting method (first method) according to an embodiment of the present invention. As shown in FIG. 24, the mounting apparatus according to one embodiment mainly includes an evaluation board (motherboard) 100 and an embedded chip (embedded semiconductor integrated circuit) 200 detachably mounted on the evaluation board 100.
The evaluation board 100 mainly includes a CPU 101, a main memory 102, a communication I/F 103, and an external memory 104. These components are electrically connected via an internal bus 109.
The CPU 101 loads various programs, such as an operating system, from the external memory 104 into the main memory 102 and executes the instructions included in the loaded programs. The main memory 102 is used to store the programs executed by the CPU 101 and is constituted by, for example, a DRAM.
The communication I/F 103 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination of these, and is configured to be able to communicate, via a communication network (not shown) including Ethernet (registered trademark) and the Internet, with a computer and input/output devices (not shown) operated by a developer or the like. The communication I/F 103 can also communicate with a communication I/F 204, described later, of the embedded chip 200. The external memory 104 is constituted by, for example, a flash memory and stores various programs such as an operating system.
Next, the embedded chip 200 includes a CPU 201, an accelerator (auxiliary arithmetic unit) 202, a main memory 203, a communication I/F 204, and an external memory 205. These components are electrically connected via an internal bus 209. The embedded chip can optionally include a GPU (not shown).
The CPU 201 loads source code received from (the communication I/F 103 of) the evaluation board 100 via the communication I/F 204 (for example, source code written in Python) into the main memory 203 and executes each piece of code included in the loaded source code.
The accelerator 202 loads source code received from (the communication I/F 103 of) the evaluation board 100 via the communication I/F 204 (for example, source code written in C, assembler, or the like) into the main memory 203 and executes each piece of code included in the loaded source code. The main memory 203 is used to store the source code executed by the CPU 201 and the accelerator 202 and is constituted by, for example, a DRAM.
The communication I/F 204 communicates with the communication I/F 103 of the evaluation board 100 to transmit and receive various information. The external memory 205 is constituted by, for example, a flash memory and stores various data.
FIG. 25 is a flowchart showing an example of the procedure used in the mounting method according to an embodiment of the present invention.
First, in step 301, source code written in a first programming language (for example, Python) is executed on a personal computer or the like. Based on the execution result, the developer or the like confirms whether the source code operates on the personal computer or the like. The personal computer or the like here refers to a computer with abundant computing resources and includes, for example, the learning device according to the embodiment described in Part 1 above. The state in which the source code operates on the personal computer or the like in step 301 is the same state as step I described in "4-1" above.
In step 302, the source code written in Python or the like, confirmed in step 301 to operate on the personal computer or the like, is executed by the CPU 201 of the embedded chip 200 using the evaluation board 100. Based on the execution result, the developer or the like confirms whether this source code is operable by the CPU 201. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Here, the source code written in Python or the like can be passed to the CPU 201 via the communication I/F 103 of the evaluation board 100 and the communication I/F 204 of the embedded chip 200. If it turns out that this source code is not operable by the CPU 201, the developer or the like corrects the source code and repeats step 302. When it is confirmed that this source code is operable by the CPU 201, the developer or the like proceeds to the next step 303.
In step 303, the developer or the like rewrites (at least part of) the source code confirmed in step 302 to be operable by the CPU 201 in a second programming language (for example, C or assembler) so that it can be run on the accelerator 202.
In step 304, the source code rewritten in C or the like in step 303 is executed by the accelerator 202 of the embedded chip 200 using the evaluation board 100. Based on the execution result, the developer or the like confirms whether the rewritten source code is operable by the accelerator 202. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Here, the source code written in C or the like can be passed to the accelerator 202 via the communication I/F 103 of the evaluation board 100 and the communication I/F 204 of the embedded chip 200. If it turns out that this source code is not operable by the accelerator 202, the developer or the like corrects the source code and repeats step 304. When it is confirmed that this source code is operable by the accelerator 202, the developer or the like proceeds to the next step 305.
In step 305, on the evaluation board 100, the result obtained when the CPU 201 executes the first specific code (the code to be verified) in the source code written in Python or the like is compared (for example, using a module referred to as a unit test executed by the embedded chip 200) with the result obtained when the accelerator 202 executes the second specific code in the source code written in C or the like, the second specific code being the first specific code rewritten from Python or the like into C or the like, and the comparison result is output. Based on the comparison result, the developer or the like verifies whether the same output is obtained for the same input in both execution results. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Until this verification is completed, the developer or the like repeats steps 303 to 305 described above. When this verification is completed, the developer or the like proceeds to the next step 306.
In step 306, the developer or the like tunes the source code written in C or the like so that it runs even faster on the accelerator 202.
In step 307, on the evaluation board 100, the result obtained when the CPU 201 executes the source code written in Python or the like is compared (for example, using a module referred to as a unit test executed by the embedded chip 200) with the result obtained when the accelerator 202 executes the source code written in C or the like tuned in step 306, and the comparison result is output. Based on the comparison result, the developer or the like verifies whether the same output is obtained for the same input in both execution results. This operation can be realized by the CPU 101 of the evaluation board 100 loading and executing a predetermined program stored in the external memory 104. Until this verification is completed, the developer or the like repeats steps 306 and 307 described above. When this verification is completed, the developer or the like proceeds to the next step 308.
When step 307 is completed, the embedded chip 200 is in a state of operating on two bodies of source code, one in Python or the like and one in C or the like. This state will be described with reference to FIG. 26. FIG. 26 is a schematic diagram showing operating states of the embedded chip in the mounting method according to an embodiment of the present invention.
As shown in FIG. 26, in step 301 (which corresponds to step I), the calling side of a function (that is, the entity that calls the function) is written in Python or the like, and the called side (that is, the function being called) is also written in Python or the like.
Next, in steps 302 to 307, the calling side of a function is still written in Python or the like, while the called side is a mixture of code written in Python or the like and code written in C or the like. That is, when step 307 is completed, the embedded chip 200 is in a state of operating on two bodies of source code, one in Python or the like and one in C or the like.
What the mounting method according to the present embodiment ultimately aims for is, as shown at the right end of FIG. 26, a state in which both the calling side and the called side are written in C or the like, that is, a state in which the embedded chip 200 operates only on source code written in C or the like.
Therefore, returning to FIG. 25, in step 308, the developer or the like rewrites into C or the like all portions of the source code written in Python that have not yet been rewritten, so that the embedded chip 200 operates only on source code written in C or the like. In step 308, the embedded chip 200 is thus decoupled from Python. The source code written in C or the like generated in this way is stored in the external memory 205 or the like of the embedded chip 200. As a result, the embedded chip 200 can read the source code stored in the external memory 205 or the like, have the accelerator 202 execute it, and thereby execute machine learning. This state is the state targeted by the mounting method according to the embodiment, in which the problems described in "1" and "2" above have been solved.
5-2. Second Method
FIG. 27 is a schematic diagram showing a configuration example of a mounting apparatus used in the mounting method (second method) according to an embodiment of the present invention. The mounting apparatus used in the second method (FIG. 27) differs from the mounting apparatus used in the first method (FIG. 24) in that the embedded chip 200 does not include a CPU. In the second method, the operations performed by the CPU 201 of the embedded chip 200 in the first method are performed by a CPU provided in an external computer (not shown), such as a personal computer. For example, the computer (personal computer or the like) referred to here may be the learning device described in Part 1 above (such as the personal computer illustrated in FIG. 11).
In the mounting method performed by the second method, the mounting method described with reference to FIG. 25 is modified so that the operations executed by the CPU 201 of the embedded chip 200 in steps 302, 305, and 307 are performed by the CPU provided in the external computer (not shown). To realize this, the evaluation board 100 shown in FIG. 27 may be communicably connected to the external computer (not shown), for example via the communication I/F 103, so that it can cause the CPU of that computer to execute the source code written in Python and receive the execution result.
 6. Configuration of the Mounting Apparatus
 Next, the configuration required for the mounting apparatus 100 according to the above-described embodiment to realize the technique described in "5" above will be described.
 6-1. Definitions of Terms Used to Describe the Configuration of the Present Invention
 (Difference between a class and a module)
 A module is a collection of procedures and data defined and implemented to achieve a specific purpose (a concept independent of whether a specific programming language provides support for it).
 A class is a module defined and implemented with the support of an object-oriented language such as Python.
 (Python layer and Native layer)
 The Native layer refers to the layer consisting of the Native I/F and the implementations (software and hardware) called from it. The Python layer refers to the software layer that is assumed to be executed on the Python language. Chainer is currently written in the Python language, but Chainer may be ported to another programming language in the future; the functions described here as belonging to the Python layer are not necessarily specialized for the Python language. As for the division of roles between the two layers, the Python layer is assumed to be a development environment with a high level of abstraction better suited to algorithm design, while the Native layer is assumed to be a development environment with a low level of abstraction that is more concretely aware of the hardware configuration.
 (Correspondence between computers and execution units)
 FIG. 52 is a diagram illustrating an example of cooperation with existing execution units according to an embodiment of the present invention.
 An execution unit is a method of the Function/Optimizer class that actually computes the neural network algorithm.
 An existing execution unit is the general-purpose computer execution unit, the GPU execution unit, or both. The general-purpose computer execution unit computes the NN algorithm using a general-purpose computer. The GPU execution unit computes the NN algorithm using a GPU.
 The Native execution unit computes the NN algorithm using the implementation of the Native layer. Since the Native layer is implemented for each type of computer, all computer types (general-purpose computer, GPU, accelerator) can be operated through the Native I/F.
 6-2. Configuration of the Mounting Unit
 FIG. 28 is a schematic diagram conceptually showing the functions of the mounting apparatus according to an embodiment of the present invention. As shown in FIG. 28, the mounting unit 400 mainly includes a drive unit 401, a Function class/Optimizer class 402, a general-purpose computer execution unit 403, a GPU execution unit 404, a Native layer execution unit 405, a multidimensional array 406 for general-purpose computers, a multidimensional array 407 for GPUs, a multidimensional array 408 for Native, and a Variable class 409.
The drive unit 401 mainly includes an execution unit that instructs the Function class/Optimizer class 402 to execute a given algorithm (function), and a comparison unit that compares the result of executing that algorithm (function) by the general-purpose computer execution unit 403 (or by the GPU execution unit 404) with the result of executing it by the Native layer execution unit 405, for example using a module referred to as a unit test, and outputs the comparison result.
The Function class/Optimizer class 402 causes at least one of the general-purpose computer execution unit 403, the GPU execution unit 404 and the Native layer execution unit 405 to execute the algorithm (function) instructed by the drive unit 401.
The general-purpose computer execution unit 403 acquires the multidimensional array corresponding to the algorithm (function) instructed by the Function class/Optimizer class 402 from the multidimensional array 406 for general-purpose computers and executes that algorithm (function) using a CPU. The execution result is returned to the drive unit 401 via the Function class/Optimizer class 402.
The GPU execution unit 404 acquires the multidimensional array corresponding to the algorithm (function) instructed by the Function class/Optimizer class 402 from the multidimensional array 407 for GPUs and executes that algorithm (function) using a GPU. The execution result is returned to the drive unit 401 via the Function class/Optimizer class 402.
The Native layer execution unit 405 acquires the multidimensional array corresponding to the algorithm (function) instructed by the Function class/Optimizer class 402 from the multidimensional array 408 for Native and executes that algorithm (function) using an accelerator. The execution result is returned to the drive unit 401 via the Function class/Optimizer class 402.
The Variable class 409 holds all the multidimensional arrays used by the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPUs and the multidimensional array 408 for Native, and supplies the corresponding multidimensional arrays to each of them.
When the first technique described in "5-1" above is adopted as the mounting technique, all the components shown in FIG. 28 are arranged on the embedded chip 200 (see FIG. 24). In this case, the general-purpose computer execution unit 403 executes the algorithm (function) using the CPU 201 mounted on the embedded chip 200, the GPU execution unit 404 executes the algorithm (function) using a GPU (not shown) mounted on the embedded chip 200, and the Native layer execution unit 405 executes the algorithm (function) mainly using the accelerator 202 mounted on the embedded chip 200.
On the other hand, when the second technique described in "5-2" above is adopted as the mounting technique, among the components shown in FIG. 28, the Function class/Optimizer class 402, the general-purpose computer execution unit 403, the GPU execution unit 404, the multidimensional array 406 for general-purpose computers, the multidimensional array 407 for GPUs and the Variable class 409 are arranged on an externally provided computer (a personal computer or the like). In this case, the implementation of the Native layer is still arranged on the embedded chip 200. Furthermore, in this case, the general-purpose computer execution unit 403 executes the algorithm (function) using the CPU of the externally provided computer, and the GPU execution unit 404 executes the algorithm (function) using the GPU of the externally provided computer.
 6-3. Configuration of the Native Layer Execution Unit
 Next, the configuration of the above-described Native layer execution unit 405 will be described. FIG. 29 is a schematic diagram illustrating a configuration example of the Native layer execution unit included in the mounting apparatus according to an embodiment of the present invention.
 As shown in FIG. 29, on the Python layer side the Native layer execution unit 405 mainly includes a NativeDevice class 501, a NativeArray class 502, a Function class/Optimizer class 503, and a bytecode generation unit 504. The Function class/Optimizer class 503 shown in FIG. 29 and the Function class/Optimizer class 402 shown in FIG. 28 are the same component, and the NativeArray class 502 shown in FIG. 29 and the multidimensional array 408 for Native shown in FIG. 28 are the same component.
 Furthermore, on the Native layer side the Native layer execution unit 405 mainly includes a device management module 505, a data conversion module 506, a multidimensional array module 507, a Function module/Optimizer module 508, a virtual machine module 509, and a memory pool module 510.
The NativeDevice class 501 wraps the device management module of the Native layer in the Python layer and hides function calls to, and data input/output with, the Native layer. The NativeArray class 502 wraps the multidimensional arrays of the Native layer in the Python layer. Of the Function class/Optimizer class 503, the Function class wraps the Function module of the Native layer in the Python layer, and the Optimizer class wraps the Optimizer module of the Native layer in the Python layer. The Function class and the Optimizer class are already implemented in Chainer and have a function of hiding the difference between execution on a general-purpose computer and execution on a GPU; by extending this function, execution in the Native layer can also be hidden.
 The bytecode generation unit 504 generates bytecode.
 Details of the components illustrated in FIG. 29 will be described later.
 7. Effects of the Mounting Apparatus According to the Embodiment
 Since deep learning is a developing technology in which research and development are active, it is expected that new layer algorithms with better performance than conventional ones will be invented during the development period for an embedded chip, and that a demand will arise to incorporate such new algorithms into the software or hardware implementation under development.
 In order to bring a neural network configuration that includes a new layer algorithm to a state in which it operates in the embedded environment while satisfying product-level specifications, the following development steps must be taken.
 1. Implement and verify the algorithm in an environment where abundant computational resources such as GPUs are available.
 2. Combine the algorithm implemented and verified in step 1 with the neural network modules whose optimized implementation on the embedded chip has already been completed, and verify the operation. Depending on the result, apply optimizations specialized for the chip in question to the algorithm implemented and verified in step 1.
 3. After the work of step 2 is completed, use only the neural network implementation optimized for the chip in question, combine it with the other modules (such as the control systems for sensors and motors), and verify with various test items whether the product-level specifications are satisfied.
When operating a neural network algorithm on Python, the mounting apparatus according to the embodiment has a configuration in which an execution unit using the Python language running on a general-purpose computer, an execution unit using a GPU, and an execution unit using an optimized implementation for a specific chip are selectively invoked for each layer, and a configuration in which, by way of the bytecode, the entire neural network algorithm can be operated using only the optimized implementation for that specific chip.
 Between steps 1 and 2 described in the preceding paragraph, the algorithm implementation code created in step 1 can be reused for step 2, and differences in the operation results between steps 1 and 2 can easily be compared and examined. Furthermore, between steps 2 and 3, the optimized implementation created for step 2 can be reused for step 3, and conversely, fixes for defects in the optimized implementation found in step 3 can also be reused for step 2. As a result, a state in which a neural network configuration including a new layer algorithm operates in the embedded environment while satisfying product-level specifications can be realized at the minimum development cost.
 Definition of Terms
 The following terms are defined for the detailed description of the embodiments of the present invention.
 "Overall operation" denotes a processing unit in which forward processing alone, or forward processing, backward processing and weight update processing, are executed repeatedly. This overall operation is what is assumed as a typical embodiment of neural network learning and identification.
 8. Native Layer
 Next, the configuration relating to the Native layer of the mounting apparatus according to the embodiment illustrated in FIG. 29 will be described.
 8-1. Device Management Module
 The device management module 505 performs initialization and release processing of a device (the software and hardware state on which the optimized implementation depends). The specific processing performed in the device management module 505 differs depending on the form of the device, but typical processing includes, for example, reserving and releasing the memory pool described later. The device does not need to be on the same chip or the same board as the general-purpose computer that executes Chainer and Python; an optimized implementation that communicates with, initializes and releases a device on a separate board is also possible.
 Definition examples of functions that initialize or release a device are shown below.
 (Example 1) Device* chnr_init_device(void)
 This initializes a device.
 (Example 2) void chnr_release_device(Device* device)
 This releases a device.
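 As a concrete illustration only, the following is a minimal, self-contained C sketch of how such an initialization/release pair might be stubbed out and used on a plain host. The contents of the Device structure and the processing inside the two functions are assumptions, since they depend entirely on the optimized implementation (memory pools, DMA handles, accelerator state and so on).

 #include <stdlib.h>
 #include <stdio.h>

 /* Assumed stand-in for the device state; the real contents depend on the
  * optimized implementation. Here only a pre-allocated work area is kept. */
 typedef struct Device {
     void *memory_pool;   /* placeholder for a reserved work area */
 } Device;

 /* chnr_init_device: allocate and initialize the device state. */
 Device *chnr_init_device(void) {
     Device *dev = (Device *)malloc(sizeof(Device));
     if (dev != NULL) {
         dev->memory_pool = malloc(1024 * 1024);  /* e.g. reserve a 1 MiB pool */
     }
     return dev;
 }

 /* chnr_release_device: release everything acquired in chnr_init_device. */
 void chnr_release_device(Device *device) {
     if (device != NULL) {
         free(device->memory_pool);
         free(device);
     }
 }

 int main(void) {
     Device *dev = chnr_init_device();
     if (dev == NULL) {
         fprintf(stderr, "device initialization failed\n");
         return 1;
     }
     /* ... forward/backward processing would run here ... */
     chnr_release_device(dev);
     return 0;
 }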
 8-2. Function Module
 The Function module 508 is a group of functions that perform the computation of each layer of the neural network, and defines the following functions for each type of layer.
 chnr_forward_xxxx(…)
 - implements the forward processing (floating-point version)
 chnr_backward_xxxx(…)
 - implements the backward processing (floating-point version)
 chnr_forward_xxxx_fx(…)
 - implements the forward processing (fixed-point version)
 chnr_backward_xxxx_fx(…)
 - implements the backward processing (fixed-point version)
 Here, xxxx represents a name assigned for each type of layer.
 The specific processing contents of each function include those exemplified in "2-6" to "2-12" of the first part above.
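 As an illustration of the naming convention only, the sketch below shows what a floating-point forward/backward pair for a ReLU-like layer could look like. The argument lists are hypothetical simplifications (flat float buffers rather than the multidimensional-array type of "8-3"), since the "(…)" argument lists above are intentionally left unspecified.

 #include <stddef.h>

 /* Hypothetical floating-point forward pass for a ReLU-style layer:
  * y[i] = max(x[i], 0). The real chnr_forward_xxxx functions operate on the
  * Native multidimensional array type; flat buffers are used here only to
  * keep the sketch self-contained. */
 void chnr_forward_relu(float *y, const float *x, size_t n) {
     for (size_t i = 0; i < n; ++i) {
         y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
     }
 }

 /* Hypothetical backward pass: gx[i] = gy[i] where x[i] > 0, else 0. */
 void chnr_backward_relu(float *gx, const float *gy, const float *x, size_t n) {
     for (size_t i = 0; i < n; ++i) {
         gx[i] = (x[i] > 0.0f) ? gy[i] : 0.0f;
     }
 }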
 8-3. Multidimensional Array (MDArray) Module
 The multidimensional array module 507 manages the multidimensional arrays that are input to and output from the Functions of the Native layer. The multidimensional array module 507 can manage arrays of arbitrary size and number of dimensions. In addition, as described later, it has a mechanism for mutual conversion with Numpy (the multidimensional array class of the Python layer on which Chainer depends) and with multidimensional array libraries for GPUs.
 Furthermore, the multidimensional array module 507 can hold not only floating-point types but also fixed-point types. This makes it easy to realize neural network computation even on hardware that has no FPU (floating-point unit). The multidimensional array module 507 also has functions for mutual conversion with floating-point multidimensional arrays.
 An implementation example of the multidimensional array module 507 will now be described.
 FIG. 30 shows an example of the structure definition of the multidimensional array module of the mounting apparatus according to an embodiment of the present invention.
 Function definition examples are as follows.
 (Example 1) MDArray chnr_create_md_array(dimensions[], numaxis, type)
 This generates and initializes a multidimensional array.
 (Example 2) void chnr_delete_md_array(MDArray* mdrray)
 This deletes a multidimensional array.
 (Example 3) void chnr_md_array_add(MDArray* dst, MDArray* a, MDArray* b)
 This adds the elements of two multidimensional arrays.
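 Since the structure definition of FIG. 30 is not reproduced in this excerpt, the following self-contained sketch uses an assumed, simplified MDArray (float-only, a buffer plus shape information) just to show how the three functions above fit together; the real structure additionally carries data-type (float/fixed-point) and device-dependent information, and the type argument of Example 1 is omitted here because the sketch is float-only.

 #include <stdlib.h>

 #define MD_MAX_AXIS 8

 /* Simplified stand-in for the MDArray structure of FIG. 30 (assumption):
  * only a float buffer, the per-axis sizes and the axis count are kept. */
 typedef struct {
     float *data;
     int    dimensions[MD_MAX_AXIS];
     int    numaxis;
     size_t numelem;
 } MDArray;

 /* Create and zero-initialize a multidimensional array. */
 MDArray chnr_create_md_array(const int dimensions[], int numaxis) {
     MDArray a;
     a.numaxis = numaxis;
     a.numelem = 1;
     for (int i = 0; i < numaxis; ++i) {
         a.dimensions[i] = dimensions[i];
         a.numelem *= (size_t)dimensions[i];
     }
     a.data = (float *)calloc(a.numelem, sizeof(float));
     return a;
 }

 /* Delete the array (release its buffer). */
 void chnr_delete_md_array(MDArray *mdarray) {
     free(mdarray->data);
     mdarray->data = NULL;
 }

 /* Element-wise addition: dst = a + b (shapes are assumed to match). */
 void chnr_md_array_add(MDArray *dst, MDArray *a, MDArray *b) {
     for (size_t i = 0; i < dst->numelem; ++i) {
         dst->data[i] = a->data[i] + b->data[i];
     }
 }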
 Next, the memory management of the multidimensional arrays of the multidimensional array module 507 will be described.
 Management (generation and destruction) of the memory areas that store the bodies of the multidimensional arrays is realized in the Native layer. In an embedded environment, one can imagine environments whose memory configuration cannot be managed by the memory management mechanism (malloc/free) provided as standard by an OS such as Linux (registered trademark). Considering the division of roles in the software hierarchy, in which the Python layer is for algorithm development and the Native layer is for development strongly aware of the hardware, it is appropriate to implement the management mechanism responsible for such characteristics of the hardware environment in the Native layer. This memory management mechanism can also be reused when the virtual machine described later is used (that is, when the dependency on the Python layer is removed).
 A class that wraps the multidimensional arrays of the Native layer is prepared in the Python layer, and the generation and release timing of the memory areas is made to coincide with the lifetime of instances of this Python class. Such a mechanism is necessary in order to handle multidimensional arrays naturally in Python code; the "Define-by-Run" function also depends on the Python memory management mechanism.
FIG. 31 shows the mutual compatibility and reference relationships of the multidimensional array data.
 8-4. Memory Pool Module
 The memory pool module 510 is a mechanism for reducing the number of calls to a memory management mechanism whose cost (for example, in processing cycles) is high, by reusing memory areas that have been reserved once.
 Function definition examples are as follows.
 (Example 1) void chnr_momory_pool_init(MemoryPool* momory_pool)
 This initializes the memory pool.
 (Example 2) void chnr_momory_pool_release(MemoryPool* momory_pool)
 This destroys the memory pool.
 (Example 3) void* chnr_momory_pool_alloc_buffer(MemoryPool* momory_pool, int byte_size, void* old_addr)
 This reserves memory.
 (Example 4) void chnr_momory_pool_free_buffer(MemoryPool* momory_pool, void* addr)
 This releases memory.
 Background for requiring a memory pool module in the Native layer (1)
 Chainer's "Define-by-Run" depends on the dynamic memory management mechanism of Python. An example (the forward processing of the Linear Function) is shown in FIG. 32. In this example, the statement Wx = x.dot(self.W.T) on the third line newly generates an instance of Wx (whose content is a multidimensional array). Wx is automatically destroyed by the Python memory management mechanism once it is no longer referenced from any variable.
 The size of the data output by a Function (Wx in the above example) can change dynamically depending on the input data size and parameters, and the reservation of its body (memory area) is also executed within the code flow of the forward or backward processing. To realize "Define-by-Run" (defining the network configuration at execution time), a mechanism is required that reserves memory of the required size at such required timing.
 Background for requiring a memory pool module in the Native layer (2)
 By preparing a class in the Python layer that wraps the multidimensional arrays of the Native layer, and by making the lifetime of the Native layer multidimensional arrays coincide with that of the Python layer, the implementation of the Native layer can be used while enjoying the flexibility of "Define-by-Run".
 However, since a Function is usually called at a high frequency, calling a costly memory management mechanism of the Native layer such as malloc or free on each call may cause a decrease in processing speed. For this reason, it becomes necessary to prepare a function for reusing memory areas that have been reserved once (a memory pool).
 Memory pool implementation example (1)
 A structure definition example is shown in FIG. 33.
 The processing flow when reserving memory is as follows.
 1. Search the buffer_size array for an index whose released flag is 1 and whose size at the time of the previous reservation matches the size to be reserved this time; if such an index is found, set the released flag to 0 and return the value of buffer_addr at the same index (the address of the memory buffer). Here, the released flag is managed, for example, by the sign bit of the buffer_size array element. By searching the array elements using the combination of the previously reserved size and the address, the amount of address churn can also be reduced.
 2. If no matching index is found, actually reserve memory (by calling malloc or the like), add its address and size to the arrays, and return the address.
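 The search-and-reuse flow above can be sketched in C as follows. Because the structure of FIG. 33 is not reproduced in this excerpt, the fixed-capacity parallel arrays used here, and the way the old_addr hint is handled, are assumptions that merely follow the described flow (released flag in the sign bit, reuse by matching size, fallback to malloc).

 #include <stdlib.h>

 #define POOL_CAPACITY 256

 /* Assumed stand-in for the structure of FIG. 33: parallel arrays of buffer
  * addresses and sizes. A negative size encodes the "released" flag via the
  * sign bit, as described above. */
 typedef struct {
     void *buffer_addr[POOL_CAPACITY];
     long  buffer_size[POOL_CAPACITY];  /* > 0: in use, < 0: released */
     int   num_buffers;
 } MemoryPool;

 /* Reserve a buffer of byte_size, preferring a released buffer of the same
  * size (and, when possible, the same address as old_addr to reduce address
  * churn); fall back to malloc only when no reusable buffer exists. */
 void *chnr_momory_pool_alloc_buffer(MemoryPool *pool, int byte_size, void *old_addr) {
     int candidate = -1;
     for (int i = 0; i < pool->num_buffers; ++i) {
         if (pool->buffer_size[i] == -(long)byte_size) {   /* released, size matches */
             if (pool->buffer_addr[i] == old_addr) {       /* best case: same address */
                 candidate = i;
                 break;
             }
             if (candidate < 0) candidate = i;             /* remember first match */
         }
     }
     if (candidate >= 0) {
         pool->buffer_size[candidate] = byte_size;         /* clear released flag */
         return pool->buffer_addr[candidate];
     }
     if (pool->num_buffers >= POOL_CAPACITY) return NULL;  /* pool table full */
     void *addr = malloc((size_t)byte_size);               /* actually reserve memory */
     if (addr != NULL) {
         pool->buffer_addr[pool->num_buffers] = addr;
         pool->buffer_size[pool->num_buffers] = byte_size;
         pool->num_buffers++;
     }
     return addr;
 }

 /* Mark a buffer as released (reusable) without calling free(). */
 void chnr_momory_pool_free_buffer(MemoryPool *pool, void *addr) {
     for (int i = 0; i < pool->num_buffers; ++i) {
         if (pool->buffer_addr[i] == addr) {
             if (pool->buffer_size[i] > 0) pool->buffer_size[i] = -pool->buffer_size[i];
             return;
         }
     }
 }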
 Memory pool implementation example (2)
 As the processing when releasing memory, the address to be released is searched for in the buffer_addr array, and if the address is found, its released flag is set to 1.
 As the processing when releasing the memory pool itself, the memory of the elements of the buffer_addr array for which an address is set is released (for example, by calling the free function).
 Effect of the memory pool implementation
 In most neural networks, the combination of memory sizes does not change from one learning iteration to the next but is fixed, so by using a memory pool implementation such as the one described above, calls to malloc and the like can be confined to the first iteration only.
 8-6. Optimizer Module
 The Optimizer module 508 is a group of functions that perform the weight update for each layer having weights in the neural network. The Optimizer module 508 defines the following functions for each weight update algorithm.
 (Example 1) chnr_op_init_state_xxxx(…)
 This implements the internal-state initialization processing of the weight update algorithm (floating-point version).
 (Example 2) chnr_op_update_one_xxxx(…)
 This implements the weight update processing (floating-point version).
 (Example 3) chnr_op_init_state_xxxx_fx(…)
 This implements the internal-state initialization processing of the weight update algorithm (fixed-point version).
 (Example 4) chnr_op_update_one_xxxx_fx(…)
 This implements the weight update processing (fixed-point version).
 Here, xxxx represents a name assigned for each weight update algorithm.
 The weight update algorithms can include those described in "2-13" of the first part above.
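 For illustration, a hypothetical chnr_op_update_one_sgd following the naming rule above might look like the sketch below. The argument list is an assumption (the "(…)" lists are not given), and flat float buffers are used instead of the multidimensional-array type so that the sketch stays self-contained.

 #include <stddef.h>

 /* Hypothetical plain-SGD weight update (floating-point version):
  * w[i] -= lr * gw[i]. Plain SGD keeps no internal state, so the matching
  * chnr_op_init_state_sgd would be empty; stateful algorithms such as
  * momentum SGD or Adam would initialize their state arrays there instead. */
 void chnr_op_update_one_sgd(float *w, const float *gw, size_t n, float lr) {
     for (size_t i = 0; i < n; ++i) {
         w[i] -= lr * gw[i];
     }
 }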
 8-7. Data Conversion (Converter) Module (1)
 The data conversion module 506 is a group of functions that convert data formats.
 Function definition examples are as follows.
 (Example 1) chnr_float_to_fixed(MDArray* dst, MDArray* src, int Q)
 This converts from the floating-point type to the fixed-point type.
 (Example 2) chnr_fixed_to_float(MDArray* dst, MDArray* src)
 This converts from the fixed-point type to the floating-point type.
 (Example 3) chnr_host_to_device(MDArray* dst, float* src_data, int src_dimensions[], int src_num_axis, int Q, int async, …)
 This converts from the device-independent data representation (described later) to the device-dependent data representation (described later).
 (Example 4) chnr_device_to_host(float* dst_data, int dst_dimensions[], int* dst_num_axis, MDArray* src, int async, …)
 This converts from the device-dependent data representation to the device-independent data representation.
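 The Q-format arithmetic behind Examples 1 and 2 can be illustrated with the following self-contained sketch over flat buffers. The real functions operate on the MDArray type, and the rounding and saturation policy shown here is an assumption; the function names are illustrative and are not the Native I/F names.

 #include <stdint.h>
 #include <stddef.h>
 #include <math.h>

 /* Convert floating-point values to Q-format fixed point (Q fractional bits):
  * fixed = round(value * 2^Q), saturated to the 32-bit range. */
 void float_to_fixed_q(int32_t *dst, const float *src, size_t n, int Q) {
     const double scale = ldexp(1.0, Q);               /* 2^Q */
     for (size_t i = 0; i < n; ++i) {
         double v = round((double)src[i] * scale);
         if (v > (double)INT32_MAX) v = (double)INT32_MAX;   /* saturate on overflow */
         if (v < (double)INT32_MIN) v = (double)INT32_MIN;
         dst[i] = (int32_t)v;
     }
 }

 /* Convert Q-format fixed point back to floating point: value = fixed / 2^Q. */
 void fixed_to_float_q(float *dst, const int32_t *src, size_t n, int Q) {
     const double scale = ldexp(1.0, -Q);              /* 2^-Q */
     for (size_t i = 0; i < n; ++i) {
         dst[i] = (float)((double)src[i] * scale);
     }
 }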
 Effect of the conversion between floating-point and fixed-point types
 Many semiconductor chips for embedded use aim to reduce the hardware resources required for a given amount of computation (the number of transistors and the power consumption) by omitting the FPU (floating-point unit), or at least by adopting a circuit design that does not use an FPU for large-scale parallel computation. When a numerical algorithm such as a neural network is executed without an FPU, a data type called the fixed-point type, which expresses numerical values including the fractional part using integer arithmetic units and shift units, is often used. The floating-point type is a data type suited to algorithm design in the sense that real values can be handled more intuitively, while the fixed-point type is a data type suited to the effective use of hardware resources. By providing conversion functions between such data types within the framework for designing and executing neural networks, the development of a neural network algorithm can proceed in stages, within a unified environment, from the mathematical side toward a hardware-conscious implementation, while checking the influence of the type conversion on a per-Function basis.
 The device-independent data representation is a data representation that has no information dependent on a specific computer. A typical implementation of this data representation is a C-style multidimensional array whose memory addresses are contiguous. If a library such as numpy is used in the Python layer, such a data representation can be handled easily, but this does not restrict the library or host language.
 The device-dependent data representation is a data representation suited to an optimized implementation specialized for a specific computer.
 By providing functions that convert between these two data representations, an optimized implementation that is strongly aware of the hardware and an implementation that is aware of the algorithm (for example, readable code written in Python whose structure is close to the mathematical formulas) can cooperate to execute the overall operation.
 Examples of requirements to be considered in the conversion to the device-dependent data representation
 (1) Memory configuration: is the data placed in shared memory or in a hardware-specific memory area?
 (2) Memory alignment: start address, start address of each dimension, padding
 (3) Byte order: little endian, big endian
 (4) Data type: fixed point (Q value)/floating point, byte width (32 bit, 16 bit, 8 bit, …)
 (5) Scheduling of data input/output in multi-core execution
 (6) Data communication: communication processing when the device having the optimized implementation is on a separate chip or a separate board
 8-8. Communication Means
 By applying the following modifications to the implementation of the Native layer function group (Native I/F) described so far, the overall operation can be executed at high speed while communicating with a device located on a separate chip or a separate board:
 (1) RPC (remote procedure call)
 (2) instruction queue
 (3) reduction of the data communication volume of multidimensional arrays
 (4) asynchronous processing of transfer and computation
 To explain these modification policies, the following terms are defined.
 "Host device": the device that executes the overall operation (in a normal implementation, the device on which the Chainer code is executed on Python).
 "Remote device": a device that requires communication processing, for example because it is on a separate chip or a separate board.
 RPC (remote procedure call)
 When a function defined in the Native I/F is called, its processing requirements (memory reservation and execution of operations) are not executed directly; instead, information representing the processing request (the type of function and its arguments) is generated and transmitted to the remote device, the remote device executes processing based on that instruction, and a mechanism is provided by which the host device then receives the processing result.
 Instruction queue
 The communication of processing requests by RPC is not executed every time a function defined in the Native I/F is called; instead, the information representing the processing requests is temporarily accumulated in a queue (FIFO buffer), which makes the communication schedule more efficient.
 Reduction of the data communication volume of multidimensional arrays
 Since multidimensional arrays have an enormous data size, reducing their communication volume is an important issue for improving the speed of the overall operation. There are broadly the following two measures for reducing the communication volume:
 (1) reduce the number of transfers of multidimensional arrays;
 (2) reduce the data communication volume of each individual multidimensional array.
 Method for reducing the number of transfers of multidimensional arrays
 The data input to and output from the intermediate layers (the layers other than the input and output layers of the neural network), and the weight gradients, need only exist on the remote device, so no communication between devices is necessary for them. As for the "weights", it suffices to transfer them to the remote device at the initial stage of defining the network structure and to transfer them back to the host device when learning ends.
 The conversion functions between the device-independent data representation and the device-dependent data representation described for the data conversion (Converter) module 506 are well suited for managing such transfer timing. Specifically, each function performs the following processing:
 when converting from the device-independent data representation to the device-dependent data representation, data is transferred from the host device to the remote device;
 when converting from the device-dependent data representation to the device-independent data representation, data is transferred from the remote device to the host device.
 Method for reducing the data communication volume of individual multidimensional arrays
 Various algorithms for data compression are known:
 (1) lossless compression (Huffman coding, run-length compression, etc.);
 (2) lossy compression (DCT, scalar quantization, vector quantization, etc.).
 By specifying the type and parameters of these compression algorithms in the arguments of the functions that request data communication (the conversion functions between the device-independent and device-dependent data representations are assumed), the communication volume can be reduced using the data compression means best suited to the nature of the data and the accuracy requirements.
 Asynchronous processing of transfer and computation
 Many embedded chips have a configuration in which data communication and arithmetic processing are executed asynchronously by separate hardware. If the functions that request data communication (the conversion functions between the device-independent and device-dependent data representations are assumed) are provided in a non-blocking, asynchronously executing form, coding that is aware of such a hardware configuration (a technique generally called pipelining) can speed up the algorithm as a whole.
 FIG. 34 shows pipelining coding in pseudocode.
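 Since FIG. 34 itself is not reproduced in this excerpt, the following is a minimal sketch of the double-buffered pipelining idea, assuming hypothetical non-blocking transfer primitives (the actual Native I/F names and signatures for asynchronous transfer are not specified here): while the device processes buffer A, the next input is transferred into buffer B, and the two roles are swapped each iteration.

 /* Hypothetical non-blocking primitives, assumed for this sketch only. */
 extern void transfer_to_device_async(void *device_buf, const void *host_buf, int bytes);
 extern void wait_transfer(void *device_buf);          /* block until the transfer into device_buf completes */
 extern void run_forward_on_device(void *device_buf);  /* run the forward processing on the transferred input */

 /* Double-buffered pipeline: overlap the transfer of input i+1 with the
  * computation on input i. */
 void pipelined_forward(const void *host_inputs[], int num_inputs,
                        void *device_buf[2], int bytes_per_input) {
     if (num_inputs <= 0) return;
     transfer_to_device_async(device_buf[0], host_inputs[0], bytes_per_input);
     for (int i = 0; i < num_inputs; ++i) {
         int cur = i & 1;
         int nxt = (i + 1) & 1;
         wait_transfer(device_buf[cur]);                /* input i is now on the device */
         if (i + 1 < num_inputs) {                      /* start transferring input i+1 */
             transfer_to_device_async(device_buf[nxt], host_inputs[i + 1], bytes_per_input);
         }
         run_forward_on_device(device_buf[cur]);        /* compute while the next transfer runs */
     }
 }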
 8-9. Virtual Machine Module
 The virtual machine module 509 is a group of functions that realize the function of interpreting bytecode and executing the learning/identification processing of a neural network (forward processing, backward processing and weight update). The bytecode is assumed to be generated by the bytecode output unit of the Python layer described later, but bytecode generated by other software can also be interpreted and executed by the virtual machine module as long as its format is correct.
 Function definition examples are as follows.
 (Example 1) void chnr_init_with_bytecode(VMState* state, char* byte_code)
 This parses the bytecode and initializes the internal state of the virtual machine.
 (Example 2) void chnr_forward(VMState* state)
 This executes the forward processing.
 (Example 3) void chnr_backward(VMState* state)
 This executes the backward processing.
 (Example 4) void chnr_update_weight(VMState* state)
 This executes the weight update processing.
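 A typical learning loop driven by these four functions might look like the following sketch. It assumes that a header of the virtual machine module defining VMState and declaring the functions above is available (such a header is not reproduced here), that byte_code has already been loaded into memory by the caller, and that the loop count is arbitrary.

 /* #include "chnr_vm.h"   -- hypothetical header providing VMState and the
  *                           chnr_* declarations of the virtual machine module */

 /* Sketch of a learning loop using the virtual machine module. */
 void train_with_bytecode(char *byte_code, int num_iterations) {
     VMState state;
     chnr_init_with_bytecode(&state, byte_code);   /* parse bytecode, build internal state */
     for (int i = 0; i < num_iterations; ++i) {
         /* (copy the next mini-batch into the input arrays of the VM here) */
         chnr_forward(&state);                     /* forward processing  */
         chnr_backward(&state);                    /* backward processing */
         chnr_update_weight(&state);               /* weight update       */
     }
 }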
 Bytecode format example
 The following information is stored in a binary format.
 (1) Input/output data information: {number of array dimensions and sizes, data type (float32, FixedPoint)} × number of Variables
 (2) Weight information: {number of array dimensions and sizes, data type (float32, FixedPoint), realized values} × number of weights
 (3) Function call information for backward processing: {type of Function, indices of the input/output data, indices of the weight information, parameters specific to each type of Function} × number of Functions
 (4) Type and parameters of the weight update
 Furthermore, the indices of the multidimensional arrays that serve as the input and output of the overall neural network processing may be added to the bytecode. By storing these indices in the bytecode, user code that uses the virtual machine can appropriately associate, with respect to function calls, the multidimensional arrays that serve as the input of the overall neural network processing with the multidimensional arrays that serve as its output. For example, this association can be performed by the following flow.
 (Step 1) The user code obtains the input multidimensional arrays of the overall processing by calling a function prepared in the configuration of the virtual machine.
 (Step 2) The user code copies the input data into the multidimensional arrays obtained in step 1.
 (Step 3) The user code calls a function, prepared in the configuration of the virtual machine, that executes the overall operation.
 (Step 4) The user code obtains the output multidimensional arrays of the overall processing by calling a function prepared in the configuration of the virtual machine (these multidimensional arrays hold the processing results of the overall operation executed in step 3; the functions of step 3 and step 4 do not necessarily have to be separate and may be a single function).
 (Step 5) The user code obtains the contents of the output data from the multidimensional arrays obtained in step 4.
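 Expressed as caller-side code, the five steps above could look like the sketch below. The accessor names (chnr_get_input_array, chnr_get_output_array, chnr_run_all) are hypothetical, since the description above deliberately leaves the names of the "functions prepared in the configuration of the virtual machine" open, and the sketch also assumes that the MDArray structure exposes its buffer through a data member.

 #include <string.h>

 /* Hypothetical accessors corresponding to steps 1, 3 and 4 above; the real
  * names and signatures would come from the virtual machine module's header. */
 extern MDArray *chnr_get_input_array(VMState *state, int index);
 extern MDArray *chnr_get_output_array(VMState *state, int index);
 extern void     chnr_run_all(VMState *state);      /* executes the overall operation */

 void identify_one_sample(VMState *state,
                          const float *input, size_t in_bytes,
                          float *output, size_t out_bytes) {
     MDArray *in  = chnr_get_input_array(state, 0);   /* step 1: obtain the input array  */
     memcpy(in->data, input, in_bytes);               /* step 2: copy the input data     */
     chnr_run_all(state);                             /* step 3: run the overall op      */
     MDArray *out = chnr_get_output_array(state, 0);  /* step 4: obtain the output array */
     memcpy(output, out->data, out_bytes);            /* step 5: read back the result    */
 }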
 Implementation example of the processing flow for initializing the internal state of the virtual machine
 (1) Interpret the "input/output data information" in the bytecode and generate the list of multidimensional arrays input to and output from the Functions.
 (2) Interpret the "weight information" in the bytecode and generate the list of weights and weight gradients (both multidimensional arrays).
 (3) Interpret the "Function call information for backward processing" in the bytecode and generate, for forward processing and for backward processing respectively, a list of structures (FunctionState) having the following information: the identification ID of the function that executes the forward/backward processing, the addresses of the input/output data, the addresses of the weights, the addresses of the weight gradients, and the parameters specific to each type of Function.
 (4) Interpret the "type and parameters of the weight update" in the bytecode and initialize the multidimensional arrays of the internal state of the weight update algorithm and a structure (OptimizerState) having the following information: the address of the function that executes the weight update, the addresses of the weights, the addresses of the weight gradients, the internal state of the weight update algorithm, and the parameters specific to each type of weight update.
FIG. 35 shows a configuration diagram of the internal state of the virtual machine.
 Execution flow example (1) of the virtual machine module (forward processing and backward processing)
 The virtual machine module executes processing such as the pseudocode shown in FIG. 36.
 Execution flow example (2) of the virtual machine (Optimizer)
 The virtual machine module executes processing such as the pseudocode shown in FIG. 37.
 Configuration 2 of the optimization device specialized for executing forward processing: the case with means for reusing the data memory area (1)
 When executing the overall operation, if only the identification processing is executed without learning (weight update), it suffices to execute only the forward processing. In this case, the following data become unnecessary:
 (1) among the data input and output between layers, those not accessed by the Function currently being executed;
 (2) the weight gradients;
 (3) the internal state of the weight update algorithm.
 When initializing the internal state of the virtual machine, the weight gradients and the internal state of the weight update algorithm therefore do not need to be reserved. For the data input and output between layers as well, the amount of memory to be reserved can be reduced, for example, by the procedure described in the next paragraph.
 Configuration 2 of the optimization device specialized for executing forward processing: the case with means for reusing the data memory area (2)
 Procedure example at the time of initializing the internal state of the virtual machine module:
 (1) Compute, for each Function, the sum of the sizes (memory sizes) of the data it inputs and outputs, and select the largest such sum.
 (2) When initializing the structures (MDArray) that handle the multidimensional arrays, set their addresses so that the memory area reserved in (1) is reused.
 By setting the addresses so that, for each layer, the left end and the right end of the memory area are used alternately as input and output, copying of the array data is avoided (a sketch of this address assignment is shown below). When a Function that performs input/output with a loop is included, the output data carried over to the next iteration are excluded from the reuse shown in this procedure, and a memory area is reserved for them individually.
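 The alternating ("ping-pong") address assignment described above can be sketched as follows: one shared region of the maximum required size is reserved, the network input is placed at one end, and each successive layer writes its output to the opposite end from the one it reads from. The concrete bookkeeping is an assumption that merely follows the described procedure.

 #include <stdint.h>
 #include <stddef.h>

 /* Assign output addresses for a forward-only (identification) execution.
  * region_size must be at least the maximum over all layers of
  * (input size + output size). The network input is assumed to be placed at
  * the left end of the region; layer 0 then writes its output at the right
  * end, layer 1 writes back at the left end, and so on, so each layer reads
  * from one end and writes to the other without any copying between layers.
  * out_addrs[i] receives the output address of layer i. */
 void assign_pingpong_addresses(uint8_t *region, size_t region_size,
                                const size_t out_sizes[], int num_layers,
                                uint8_t *out_addrs[]) {
     for (int i = 0; i < num_layers; ++i) {
         if (i % 2 == 0) {
             out_addrs[i] = region + region_size - out_sizes[i];  /* right end */
         } else {
             out_addrs[i] = region;                               /* left end  */
         }
     }
 }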
 Configuration 2 of the optimization device specialized for executing forward processing: the case with means for reusing the data memory area (3)
 FIG. 38 shows an example of the address settings for the data input and output by the Functions.
 Supplement regarding the bytecode format
 The explanation so far has shown an example in which the information stored in the "Function call information for backward processing" is simply executed in ascending or descending order. However, by storing a plurality of execution orders in the bytecode, or by storing repetition and branch instructions in the bytecode, it is also possible to execute more advanced processing, such as dynamically changing the configuration of the neural network according to the nature of the input data when the virtual machine is executed. The memory management mechanism described above for the memory pool module can be used to realize such a dynamic mechanism.
 Association of data input to and output from the virtual machine with external code
 A list of the data input to and output from the Functions is created in the "processing for initializing the internal state of the virtual machine"; the simplest method is for the external code that calls the functions of the virtual machine to access the elements of this list directly. If the variable names of the Python Variable instances are stored in the "input/output data information" when the bytecode is generated, the inputs and outputs can also be associated using these names.
 9. Python Layer
 Next, the configuration relating to the Python layer of the mounting apparatus according to the embodiment illustrated in FIG. 29 will be described.
 9-1. NativeArray Class
 The NativeArray class 502 is a class that wraps the multidimensional arrays of the Native layer in the Python layer. The NativeArray class 502 is generated as an instance corresponding one-to-one to a multidimensional array of the Native layer. The NativeArray class 502 has, as a basic function of a Python object, a lifetime management function based on reference counting. Furthermore, the NativeArray class 502 has a function of requesting the release of the corresponding multidimensional array of the Native layer when its lifetime ends.
 The NativeArray class 502 also holds a copy of the type information of the multidimensional array of the Native layer and has a function of conveying it to other objects in the Python layer.
 Furthermore, the NativeArray class 502 has functions such as data copying and element-wise addition of arrays, and has a function of requesting their execution from the Native layer.
 In addition, the NativeArray class 502 has a function of operating compatibly with the multidimensional array libraries of the Python layer on which Chainer depends, such as Numpy and GPUArray.
 9-2. NativeDevice Class
 The NativeDevice class 501 is a class that abstracts the optimized implementation and the reference implementation of the Native layer. The NativeDevice class 501 has a function of requesting the following processing from the Native layer in response to requests from other objects in the Python layer:
 (1) initialization and release of the device;
 (2) generation and copying of multidimensional arrays (generating a NativeArray instance of the Python layer that wraps them);
 (3) conversion between the device-independent data representation and the device-dependent data representation (conversion between floating point and fixed point can also be instructed);
 (4) execution of the processing of Functions and Optimizers (selectively calling the individual functions of the Native layer).
 9-3. Function Class
 The Function class 503 is a class defined as a pair of forward processing and backward processing. The Function class 503 already exists in Chainer, but a function for requesting the forward processing and the backward processing from the Native layer is added.
 Method implementation examples are as follows.
 (Example 1) forward_native(…)
 This requests the forward processing from the Native layer.
 (Example 2) backward_native(…)
 This requests the backward processing from the Native layer.
Processing flow assumed when forward_native or backward_native is called
(1) The output data size is computed from the input data size and the parameters given when the Function instance was initialized.
(2) The output data size obtained in (1), the input data instances (NativeArray), and the Function category (Linear, ReLU, ...) are passed to the NativeDevice instance, and a Native-layer function call is requested.
(3) The NativeDevice instance executes the following processing in response to this call:
(A) It requests the Native layer to generate a multi-dimensional array for the output data. This step is skipped for Functions that overwrite their input data.
(B) It determines which Native-layer function to actually call from the type of the input data (floating point or fixed point) and the Function category, and calls it (the Native-layer function writes its result into the multi-dimensional array allocated in (A)).
(C) It generates a NativeArray instance that wraps the multi-dimensional array allocated in (A).
(4) The NativeArray instance generated in (C) is returned as the return value of the Function. A hedged sketch of this flow follows below.
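The following is a minimal sketch of this dispatch, under the assumption that the Native layer is represented by ordinary Python callables. All names (native_linear_float, NativeDeviceSketch, LinearFunctionSketch, and so on) are illustrative, not identifiers from the original, and the NativeArray wrapping of step (C) is omitted to keep the sketch short.

```python
import numpy as np

# Stand-ins for Native-layer kernels, selected by (Function category, data type).
def native_linear_float(x, w, y):
    y[...] = x @ w

def native_linear_fixed(x, w, y):
    # Q8 fixed point: the product of two Q8 values is Q16, so shift back by 8.
    y[...] = ((x.astype(np.int32) @ w.astype(np.int32)) >> 8).astype(np.int16)

NATIVE_KERNELS = {
    ('Linear', np.float32): native_linear_float,
    ('Linear', np.int16):   native_linear_fixed,
}

class NativeDeviceSketch:
    def call(self, kind, inputs, out_shape):
        x, w = inputs
        y = np.empty(out_shape, dtype=x.dtype)         # step (3)(A): allocate the output array
        NATIVE_KERNELS[(kind, x.dtype.type)](x, w, y)  # step (3)(B): dispatch on type and category
        return y                                       # step (3)(C): would be wrapped in a NativeArray

class LinearFunctionSketch:
    def __init__(self, device, weight):
        self.device, self.weight = device, weight

    def forward_native(self, x):
        out_shape = (x.shape[0], self.weight.shape[1])  # step (1): compute the output size
        return self.device.call('Linear', (x, self.weight), out_shape)  # steps (2) and (4)

device = NativeDeviceSketch()
f = LinearFunctionSketch(device, np.random.randn(4, 3).astype(np.float32))
print(f.forward_native(np.random.randn(2, 4).astype(np.float32)).shape)   # (2, 3)
```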
9-4. Optimizer Class
The Optimizer class 503 updates the weights. The Optimizer class already exists in Chainer; here it is extended with the ability to delegate state initialization and weight update processing to the Native layer.
Example method implementations are as follows.
(Example 1) init_state_native(...)
This requests the Native layer to initialize the internal state of the weight update algorithm.
(Example 2) update_one_native(...)
This requests the Native layer to perform the weight update processing.
The processing flow when these methods are called is the same as the one described above for the Function class (a small sketch follows below).
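As a small illustration, the sketch below delegates a plain SGD update to a stand-in Native-layer kernel. The kernel name, class name and signatures are assumptions for explanation, not identifiers from the original.

```python
import numpy as np

# Stand-in for a Native-layer SGD kernel; the name and signature are illustrative.
def native_sgd_update(w, grad, lr):
    w -= lr * grad            # in-place update, as the Native layer would perform it

class SGDOptimizerSketch:
    """Sketch of an Optimizer that delegates the per-parameter update."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.state = {}

    def init_state_native(self, param_id, shape):
        # Plain SGD needs no internal state; momentum-style algorithms would
        # ask the Native layer to allocate their state arrays here.
        self.state[param_id] = None

    def update_one_native(self, w, grad):
        native_sgd_update(w, grad, self.lr)

opt = SGDOptimizerSketch(lr=0.1)
w = np.ones(3, dtype=np.float32)
opt.update_one_native(w, np.array([0.5, -0.5, 0.0], dtype=np.float32))
print(w)   # [0.95 1.05 1.  ]
```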
9-5. Specific Example of Overall Operation in Cooperation with the Native Layer
A specific example is illustrated in FIG. 39.
9-6. Bytecode Generation (Output) Unit
The bytecode generator 504 is a mechanism that converts the network configuration of a neural network defined by "Define-by-Run" into bytecode (a data format that can be interpreted and executed) and outputs it. One possible bytecode format is the one described above for the "virtual machine module". Besides that format, however, output to formats such as the following can also be considered.
(1) Caffe's neural network definition format: the output can be executed with Caffe (Caffe is one of the representative frameworks for designing and running neural networks).
(2) A programming language such as C or Java (registered trademark): software that executes the overall operation can be generated.
(3) A hardware description language (HDL) such as Verilog: hardware that executes the overall operation can be synthesized.
An example function definition for the bytecode generation unit is as follows.
Function name: write_network_difinition(output_node, path, format)
Function specification: outputs the network configuration connected from output_node toward the input side, in the format specified by format, to the file specified by path. output_node can be given as a list (so that several nodes can serve as starting points).
Example procedure for outputting bytecode from the reference structure for backward processing
As explained in Part 1 above, Chainer has a function to generate a reference structure for backward processing directly from the natural description of the forward computation. Since the forward processing can be computed by following this reference structure in reverse order, generating bytecode from the reference structure makes both forward processing and backward processing executable.
This procedure is roughly divided into the following steps:
(1) Generation of the element information needed to create the bytecode
  input/output data information
  weight information
  Function call information for backward processing
(2) Conversion of the element information into bytecode
Element information generation procedure for creating the bytecode
The "reference structure for backward processing" is traced starting from the output_node passed to write_network_difinition, and the following processing is executed (a runnable sketch follows below):
(1) If the current node is a Variable, the information of its multi-dimensional array (size, number of dimensions, floating point / fixed point (Q value)) is added to the "input/output data information" list.
(2) If the current node is a Function, the following processing is performed:
(i) The information of each weight's multi-dimensional array (size, number of dimensions, floating point / fixed point (Q value), actual weight values) is added to the "weight information" list, without allowing duplicates (because multiple Function instances can share the same weight).
(ii) The Function type, the indices of the input/output data, the indices of the weights, and the parameters specific to that Function type are added to the "Function call information for backward processing" list.
If multiple starting nodes are passed in output_node, the procedure described in the next paragraph prevents the same node from being registered twice.
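The following is a minimal, self-contained sketch of this traversal on a toy reference structure. The node classes, field names and the dictionary-based bookkeeping are assumptions for illustration; the input/output indices inside the call information and checks for revisiting shared Variables are omitted for brevity.

```python
class VariableNode:
    def __init__(self, shape, dtype, creator=None):
        self.shape, self.dtype, self.creator = shape, dtype, creator

class FunctionNode:
    def __init__(self, kind, inputs, weights, params=None):
        self.kind, self.inputs, self.weights = kind, inputs, weights
        self.params = params or {}

def collect_element_info(output_node):
    io_info, weight_info, call_info = [], [], []
    seen_weights = {}                     # weights can be shared between Functions

    def visit(var):
        # (1) a Variable node: record its multi-dimensional array information
        io_info.append((var.shape, var.dtype))
        func = var.creator
        if func is None:
            return
        # (2)(i) weight information, registered without duplicates
        w_idx = []
        for w in func.weights:
            if id(w) not in seen_weights:
                seen_weights[id(w)] = len(weight_info)
                weight_info.append(w)
            w_idx.append(seen_weights[id(w)])
        # (2)(ii) Function call information for backward processing
        call_info.append({'kind': func.kind, 'weights': w_idx, 'params': func.params})
        for x in func.inputs:
            visit(x)

    visit(output_node)
    return io_info, weight_info, call_info

# toy network: x -> Linear(W) -> h -> ReLU -> y
W = [[1.0, 0.0], [0.0, 1.0]]
x = VariableNode((1, 2), 'float32')
h = VariableNode((1, 2), 'float32', FunctionNode('Linear', [x], [W]))
y = VariableNode((1, 2), 'float32', FunctionNode('ReLU', [h], []))
print(collect_element_info(y))
```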
Element information creation procedure when multiple starting nodes are passed (a small sketch of the merge follows below):
(1) Create an (empty) list of "Function call information for backward processing".
(2) For each starting node in output_node, perform the following steps:
(A) Create a list of "Function call information for backward processing" specific to that starting node.
(B) Apply the registration procedure described in the previous paragraph to the list created in (A). Nodes that are already registered in the list created in (1) are skipped, which avoids duplicate registration.
(C) Link the list created in (A) to the front of the list created in (1).
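A small, self-contained sketch of this merge, using plain strings as stand-ins for nodes and a dictionary as a toy backward reference structure (all names are illustrative):

```python
def build_call_info(origin, graph, already_registered):
    # Walk the toy backward reference structure from `origin`; `graph` maps a
    # node to the node it was produced from (None for an input node).
    calls, node = [], origin
    while node is not None:
        if node not in already_registered:   # skip nodes registered by earlier origins
            calls.append(node)
            already_registered.add(node)
        node = graph.get(node)
    return calls

# D <- C <- B <- A, plus a second output E that also consumes B
graph = {'D': 'C', 'C': 'B', 'B': 'A', 'A': None, 'E': 'B'}

merged, seen = [], set()
for origin in ['D', 'E']:                    # two starting nodes passed in output_node
    per_origin = build_call_info(origin, graph, seen)   # steps (A) and (B)
    merged = per_origin + merged             # step (C): link in front of the merged list
print(merged)   # ['E', 'D', 'C', 'B', 'A'] -- B and A are not registered twice
```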
Python layer: bytecode output unit (6) Conversion of element information into bytecode
The following information, created by the procedure "Element information generation for creating the bytecode", is converted into the format specified by the format argument of write_network_difinition:
(1) input/output data information
(2) weight information
(3) Function call information for backward processing
Possible formats are those listed above for the "bytecode generation unit".
Output of multiple network configurations
The write_network_difinition function described above for the "bytecode generation unit" is specified to write the network configuration directly to the file passed in the path argument, but an object for writing multiple network configurations into bytecode can also be passed as this path argument. Here a network configuration means the components (1), (2) and (3) described in "Python layer: bytecode output unit (6) Conversion of element information into bytecode".
This "object for writing multiple network configurations into bytecode" shares identical "(2) weight information" among the multiple network configurations and thereby reduces the amount of weight information written to the bytecode. "(1) input/output data information" and "(3) Function call information for backward processing" are generated independently by the steps described above, even when parts of the information overlap. A code example using this object is shown in FIG. 40 (a hedged reconstruction follows below). In that example, line 6 traces the reference structure for backward processing from nodeA and outputs that network configuration, line 7 does the same from nodeB, and line 8 outputs these two network configurations to a single file (./bytecode.bin).
How to specify different Function call orders for forward processing and backward processing of the same network
As described in Part 1 above, Chainer has a function (the unchain_backward method) that cuts off the reference structure for backward processing from a specific Variable instance toward the input layer side. By combining this unchain_backward method with the "output of multiple network configurations" described in the previous paragraph, different Function call orders can be specified for the forward and backward computations of the same network.
In the code example shown in FIG. 41, the call at #1 outputs a network definition that executes all processing from A to D, whereas the call at #2 outputs a network definition in which only the processing from B to D is executed. When the virtual machine executes the bytecode, the configurations can then be used selectively, for example running forward processing on the network configuration output at #1 and backward processing on the one output at #2.
10. Configuration Common to the Native Layer and the Python Layer
10-1. Algorithm execution unit that fuses multiple NN algorithms
Certain combinations of Functions appear frequently in typical neural network configurations:
(Example 1) Linear→ReLU, Linear→ReLU→Linear→ReLU
(Example 2) Convolution2D→ReLU, Convolution2D→ReLU→Convolution2D→ReLU
By defining such a frequently occurring combination of Functions as a single Function, and implementing that combined computation in a specialized way in both the Python layer and the Native layer, the following benefits are obtained both for algorithm design and for hardware execution efficiency:
(1) The overhead of calling each Function (function calls and communication) is reduced.
(2) Higher execution efficiency is obtained from implementations that take into account data dependencies and parallelism across multiple Functions (for example, making effective use of cache memory and reducing the amount of data that must be accessed directly in main memory).
(3) Using more abstract Functions at algorithm design time makes complex network configurations easier to understand and define.
In general, the speed of arithmetic cores in recent computers has improved remarkably, while memory access performance has not become faster to the same degree. As a result, when the performance of the computer as a whole is considered, it is limited by memory access and sufficient computational performance cannot be achieved. To address this problem, small but especially fast memories called cache memories and register files are placed physically close to the arithmetic core, and as many computations as possible are performed on them, which bridges the speed gap between the two.
As noted above, the following combinations of Functions appear frequently in neural network configurations:
・Convolution2D→ReLU
・Convolution2D→ReLU→Convolution2D→ReLU
Convolution2D performs a large amount of computation relative to the size of the data it inputs and outputs, so there is a large opportunity to exploit mechanisms such as cache memory and bring out the performance of the computing core. ReLU, by contrast, performs little computation relative to the size of the data it inputs and outputs, so this opportunity is small.
When Convolution2D and ReLU are executed as separate Functions, all of the Convolution2D results must be written out to main memory and then transferred back to the vicinity of the arithmetic core to compute ReLU. The reason is that, when the Convolution2D processing finishes, it is not known whether ReLU will use the result immediately.
If Convolution2D and ReLU are instead executed as a single combined Function, the Convolution2D result can be used directly as the input of the ReLU processing in the cache memory or register file before it is written out to main memory. This reduces the frequency of data transfers to main memory and increases the chance of executing the processing more efficiently (faster).
If even more Functions can be executed as one combined Function, as in Convolution2D→ReLU→Convolution2D→ReLU, the opportunity to improve processing efficiency increases further, because the amount of access to main memory can be reduced still more aggressively by taking into account the cache memory size and the data dependencies within the combination of Functions. A hedged sketch of a fused Function follows below.
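The sketch below illustrates the idea with the simpler Linear→ReLU pair from Example 1 of section 10-1: one Function computes both steps so that the intermediate Linear output never crosses a Function boundary. The class and method names are illustrative assumptions, and the numpy implementation only models the fused structure, not the cache-level behavior of a Native-layer kernel.

```python
import numpy as np

class LinearReLU:
    """Fused Function: Linear followed by ReLU in a single forward call."""
    def __init__(self, weight, bias):
        self.W, self.b = weight, bias

    def forward(self, x):
        # The Linear result is consumed immediately by ReLU; no separate
        # Function boundary (and thus no extra round trip to main memory in a
        # Native-layer implementation) exists between the two steps.
        return np.maximum(x @ self.W + self.b, 0.0)

rng = np.random.default_rng(0)
f = LinearReLU(rng.standard_normal((4, 3)).astype(np.float32),
               np.zeros(3, dtype=np.float32))
y = f.forward(rng.standard_normal((2, 4)).astype(np.float32))
print(y.shape, bool((y >= 0).all()))   # (2, 3) True
```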
10-2. Configuration 1 of an optimization device specialized for forward processing execution: with weight optimization processing means
For some neural network layer algorithms, the amount of computation and memory usage can be reduced by optimizing the weights specifically for the case where only forward processing is executed and no backward processing is performed. Such optimization is possible for the following reason.
Training a neural network with stochastic gradient descent requires high numerical precision and a wide, flexible value range for the weight vectors, because small updates must be accumulated during training and the range over which the weight vectors will vary cannot be sufficiently anticipated in advance. If only forward processing is executed, however, this combination of precision and flexibility is not needed, and memory and computation can be reduced by cutting down the amount of weight information before running the forward processing. The computation can be reduced because, for example, the number of weight elements can be decreased or weights equal to zero need not be computed. For example, in Linear processing (the inner product between layers), a known technique is to apply singular value decomposition to the weight information (a matrix of size number-of-input-nodes × number-of-output-nodes) and delete the components whose singular values are small, which compresses the weight data size and reduces the computation size (a sketch follows below).
(J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013)
By adding to Chainer's Function class a method that performs this forward-specialized weight optimization, the computational resources needed when only forward processing is executed with already-trained weights can be reduced. Like the existing Forward and Backward methods of the Function class, this method hides differences between hardware implementations (general-purpose computer, GPU, Native) depending on the type of the multi-dimensional weight arrays held by the Function (it dispatches to the appropriate reinitialization implementation).
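The following is a minimal numpy sketch of the SVD-based compression mentioned above: the weight matrix W of a Linear layer is factored, the smallest singular values are dropped, and the forward pass x @ W is replaced by two thinner products. The function name and the chosen rank are illustrative assumptions.

```python
import numpy as np

def compress_linear(W, rank):
    """Truncated-SVD compression of a Linear weight matrix (forward-only use)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # shape (in, rank)
    B = Vt[:rank, :]                  # shape (rank, out)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128)).astype(np.float32)
A, B = compress_linear(W, rank=32)

x = rng.standard_normal((8, 256)).astype(np.float32)
y_full = x @ W                        # original forward pass
y_low = (x @ A) @ B                   # compressed forward pass
print(W.size, A.size + B.size)        # parameter count: 32768 vs 12288
print(float(np.abs(y_full - y_low).mean()))   # approximation error
```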
Supplementary notes on the software and hardware configuration
The embodiments have so far been described by attaching specific functions to functions and classes of the Python layer and the Native layer. However, this division of roles among software layers, classes and functions is only one example used to explain the functional configuration of the embodiments concretely; as the following examples show, the individual functions of the embodiments may also be implemented in classes, layers or hardware other than those described above.
(1) The processing described in "Configuration 2 of an optimization device specialized for forward processing execution: with data memory region reuse means" can be executed in advance by the bytecode output unit rather than by the virtual machine module.
(2) The Function described in "Function combining multiple Functions" can be implemented in specialized hardware (FPGA, ASIC) rather than as a software-level optimization.
Therefore, the configuration according to the embodiments of the present invention does not depend directly on the functions and classes of the Python layer and the Native layer, or on a software-based implementation.
DESCRIPTION OF SYMBOLS
10 Learning device
100 Evaluation board
110 Acquisition unit
120 Storage unit
130 Execution unit
200 Embedded system chip (embedded semiconductor integrated circuit)
401 Drive unit
402 Function class / Optimizer class
405 Native layer execution unit
408 Multi-dimensional array for Native
409 Variable class
504 Bytecode generation unit
505 Device management module
506 Data conversion module
507 Multi-dimensional array module
508 Function module / Optimizer module
509 Virtual machine module
510 Memory pool module

Claims (11)

1.  A mounting device for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     first execution means capable of causing a first arithmetic unit mounted on the semiconductor integrated circuit to execute first source code described in a first programming language;
     second execution means capable of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from the first programming language; and
     comparison means for comparing a result of the first execution means executing a first specific code included in the first source code with a result of the second execution means executing a second specific code included in the second source code, the second specific code being the first specific code rewritten in the second programming language, and outputting a comparison result,
     wherein the second execution means includes bytecode generation means for generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and Function call information for backward processing.
2.  A mounting device for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     second execution means capable of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from a first programming language,
     wherein the second execution means is executed by the second arithmetic unit using bytecode that includes at least one of input/output data information, weight information, and Function call information for backward processing.
3.  The mounting device according to claim 1 or 2, wherein the second execution means includes conversion means for converting data from fixed-point to floating-point format and/or from floating-point to fixed-point format, and executes functions using the data converted by the conversion means.
4.  The mounting device according to any one of claims 1 to 3, wherein the second execution means includes function definition means for defining, as one function, a plurality of functions each of which defines a layer algorithm, and calls and executes the function defined by the function definition means.
5.  The mounting device according to any one of claims 1 to 4, wherein, when backward processing is not executed, the second execution means executes weight optimization processing when executing a function that defines forward processing.
6.  The mounting device according to any one of claims 1 to 5, wherein, when backward processing is not executed, the second execution means makes at least one of weight gradients, the internal state of the weight update algorithm, and data input and output between layers unnecessary.
7.  The mounting device according to claim 1, wherein the bytecode generation means generates bytecode defining a plurality of network configurations.
8.  The mounting device according to claim 1, wherein the first arithmetic unit includes at least one of a CPU and a GPU, and the second arithmetic unit includes an auxiliary arithmetic unit.
9.  The mounting device according to claim 1, wherein the first source code described in the first programming language is described so as to cause a computer to function as:
     execution means for sequentially executing each code included in the first source code, the execution means being configured to, at the time each code is executed, calculate the output value of the forward processing defined by that code based on its input values and generate a reference structure between objects in the layer corresponding to that code.
10.  A mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     a first execution process of causing a first arithmetic unit mounted on the semiconductor integrated circuit to execute first source code described in a first programming language;
     a second execution process of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from the first programming language; and
     a comparison process of comparing a result of executing, in the first execution process, a first specific code included in the first source code with a result of executing, in the second execution process, a second specific code included in the second source code, the second specific code being the first specific code rewritten in the second programming language, and outputting a comparison result,
     wherein the second execution process includes a bytecode generation process of generating, from the first source code described in the first programming language, bytecode that is described in an arbitrary data format and includes at least one of input/output data information, weight information, and Function call information for backward processing.
11.  A mounting method for mounting a computer program for executing machine learning on a semiconductor integrated circuit, comprising:
     a second execution process of causing a second arithmetic unit mounted on the semiconductor integrated circuit to execute second source code described in a second programming language different from a first programming language,
     wherein the second execution process is executed by the second arithmetic unit using bytecode that includes at least one of input/output data information, weight information, and Function call information for backward processing.

PCT/JP2016/004028 2015-09-03 2016-09-02 Installation device and installation method WO2017038104A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2015-174205 2015-09-03
JP2015174205 2015-09-03
JP2015-213294 2015-10-29
JP2015213294A JP2018173672A (en) 2015-09-03 2015-10-29 Mounting apparatus

Publications (1)

Publication Number Publication Date
WO2017038104A1 true WO2017038104A1 (en) 2017-03-09

Family

ID=58186916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/004028 WO2017038104A1 (en) 2015-09-03 2016-09-02 Installation device and installation method

Country Status (1)

Country Link
WO (1) WO2017038104A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689171A (en) * 1992-03-26 1994-03-29 Nec Ic Microcomput Syst Ltd Program check method
JPH0744515A (en) * 1993-07-29 1995-02-14 Matsushita Electric Ind Co Ltd Neural network circuit
WO2010047388A1 (en) * 2008-10-24 2010-04-29 独立行政法人情報通信研究機構 Calculation processing system, program creation method, and program creation program
JP2012208843A (en) * 2011-03-30 2012-10-25 Keihin Corp Development support device
US20130218821A1 (en) * 2011-09-21 2013-08-22 Botond Szatmary Round-trip engineering apparatus and methods for neural networks
JP2013106343A (en) * 2011-11-10 2013-05-30 Toyota Infotechnology Center Co Ltd Optimization of dynamic spectrum access

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11915146B2 (en) 2015-10-29 2024-02-27 Preferred Networks, Inc. Information processing device and information processing method
US11521070B2 (en) 2015-10-29 2022-12-06 Preferred Networks, Inc. Information processing device and information processing method
US11461637B2 (en) 2017-06-14 2022-10-04 International Business Machines Corporation Real-time resource usage reduction in artificial neural networks
US10558914B2 (en) 2017-06-14 2020-02-11 International Business Machines Corporation Real-time resource usage reduction in artificial neural networks
US10268951B2 (en) 2017-06-14 2019-04-23 International Business Machines Corporation Real-time resource usage reduction in artificial neural networks
CN111160543A (en) * 2017-12-14 2020-05-15 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN111160543B (en) * 2017-12-14 2023-08-29 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
US10885447B2 (en) 2018-01-29 2021-01-05 Panasonic Intellectual Property Corporation Of America Data processing method and data processing system
US11694099B2 (en) 2018-01-29 2023-07-04 Panasonic Intellectual Property Corporation Of America Data processing method and data processing system
EP3518151A1 (en) 2018-01-29 2019-07-31 Panasonic Intellectual Property Corporation of America Data processing method and data processing system
JP2020119213A (en) * 2019-01-23 2020-08-06 富士通株式会社 Arithmetic processing device, program, and method for controlling arithmetic processing device
JP7225831B2 (en) 2019-01-23 2023-02-21 富士通株式会社 Processing unit, program and control method for processing unit
EP3686733A1 (en) 2019-01-23 2020-07-29 Fujitsu Limited Calculation processing apparatus, program, and method of controlling the calculation processing apparatus


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16841137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16841137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP