WO2020225916A1 - Information processing device - Google Patents

Information processing device

Info

Publication number
WO2020225916A1
WO2020225916A1 (application PCT/JP2019/018604, JP2019018604W)
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
output
input
layer
circuit
Prior art date
Application number
PCT/JP2019/018604
Other languages
English (en)
Japanese (ja)
Inventor
松本 渉
樹生 蓮井
Original Assignee
株式会社アラヤ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社アラヤ (Araya Inc.)
Priority to PCT/JP2019/018604
Publication of WO2020225916A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present invention relates to an information processing device that performs neural network operations used for artificial intelligence, and particularly to a technique for reducing the amount of computation when performing those operations.
  • Deep neural networks with deep layer structures (hereinafter "DNN"), convolutional neural networks (hereinafter "CNN"), recurrent neural networks (hereinafter "RNN"), and Long Short Term Memory networks (hereinafter "LSTM") have particularly high recognition and prediction performance among neural networks.
  • Non-Patent Document 1 describes a technique for reducing the number of multiplications by dividing the convolution operation, which normally processes the three dimensions of height, width, and channel collectively, into an operation in the height/width direction and an operation in the channel direction.
  • Non-Patent Document 1: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", https://arxiv.org/abs/1704.04861
  • NNs such as DNN, CNN, RNN, and LSTM, which are often used to realize conventional artificial intelligence functions, require a large amount of computation, so either a large-scale server must be used for computing resources or an additional unit such as a graphics processing unit (hereinafter "GPU") must be mounted. For this reason, introducing intelligent equipment or mounting such functions on devices becomes expensive and consumes a large amount of power. Further, if the hardware configuration differs for each function, the mass-production effect cannot be obtained and the equipment cost becomes high.
  • The present invention has been made in view of the above circumstances. By reducing the amount of calculation of NNs such as DNN, CNN, RNN, and LSTM, computing resources can be significantly reduced, enabling miniaturization and lower power consumption. One purpose is to provide artificial intelligence functions and services that can be installed in general-purpose devices; another is to reduce equipment cost by sharing a large part of the hardware configuration.
  • The information processing device of one aspect of the present invention is an information processing device provided with a calculation processing unit that realizes an artificial intelligence function by performing a neural network calculation on input data. It comprises an arithmetic circuit that performs the product-sum calculation of the input vector and the weight matrix in each layer of the network, an input image temporary storage unit that supplies input data to the arithmetic circuit, and an output data temporary storage unit that stores the output of the arithmetic circuit and copies the output data back to the input data temporary storage unit for the calculation of the next layer. The arithmetic circuit is sized for the maximum input channel count of the input image data, the maximum output channel count of the output data, and the maximum kernel size, so that the same arithmetic circuit can be used for every layer, and the components of the input data to be calculated can be freely selected by changing the address.
  • Since the computing resources for realizing the artificial intelligence function can be significantly reduced, the space, price, and power consumption occupied by the computer can also be reduced. For example, when an artificial intelligence function is installed in a device, neural network calculations can be performed using a low-priced CPU, a general-purpose FPGA (field-programmable gate array), or an LSI, achieving compactness, low price, low power consumption, and high speed.
  • Brief description of the drawings (excerpt): a figure showing a configuration example of an FPGA in which a plurality of layers having different compression ratios are implemented in a common circuit; figures showing application examples of the circuit (device) of each embodiment of this invention, applied to a vehicle, to a smartphone, and to a drone; and a figure showing an example of the hardware configuration of the circuit (device) of each embodiment of this invention.
  • A CNN (Convolutional Neural Network) is usually composed of multiple convolution layers. Each convolution layer performs a plurality of convolution operations on images of a plurality of input channels and outputs the result; the output is used as the input of the next layer. A nonlinear function may also be applied after each convolution operation.
  • FIG. 1 is a diagram showing the conventional processing configuration of an arithmetic circuit that performs the convolution operation of one convolution layer in a CNN when the network is not compressed.
  • the network described in the present specification is a network in an NN (neural network), and when it is referred to as network compression, it means a reduction in the number of operations.
  • The convolution operation in a CNN usually refers to the operation of taking a tensor (hereinafter "kernel") composed of three dimensions (height K_h, width K_w, and number of channels C_in), multiplying it, index by index, with a pixel-value tensor of the same size cut out from the input image, and then taking the sum. The number of channels C_in here is, for example, 3 when each pixel is composed of three RGB values. Repeating this for every pixel location completes the convolution process of one layer. The convolution operation by one kernel yields the output image of one channel; by preparing multiple kernels, output images of multiple channels are obtained, as sketched below.
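  • As an illustration of this product-sum, the operation can be written as y(c_out, i, j) = Σ_{c,u,v} K(c_out, c, u, v) · X(c, i+u, j+v). The following is a minimal NumPy sketch of one such layer (illustrative code, not the claimed circuit; the names and the absence of padding and stride are simplifying assumptions):

```python
import numpy as np

def conv2d(x, kernels):
    """Direct convolution. x: (C_in, H, W); kernels: (C_out, C_in, K_h, K_w).
    Returns (C_out, H-K_h+1, W-K_w+1); padding and stride are omitted for brevity."""
    c_out, c_in, kh, kw = kernels.shape
    _, h, w = x.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for co in range(c_out):                      # one kernel -> one output channel
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = x[:, i:i + kh, j:j + kw]             # same-size cutout
                out[co, i, j] = np.sum(kernels[co] * patch)  # index-wise product, then sum
    return out

x = np.random.randn(3, 5, 5)      # C_in = 3 (e.g., RGB), 5x5 image
k = np.random.randn(8, 3, 3, 3)   # 8 kernels -> C_out = 8
print(conv2d(x, k).shape)         # (8, 3, 3)
```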
  • the input image 101 is an image having a height H, a width W, and a number of channels C_in. Each pixel of the image holds the value (pixel value) of the pixel.
  • the output image 102 is an image having a height H, a width W, and the number of channels C_out. C_in and C_out may be the same or different.
  • The arithmetic circuit 103 takes as input the pixel values of all channels at a certain pixel location in the input image 101, and outputs the pixel values of all channels at the corresponding pixel location in the output image 102. The arithmetic circuit 103 is composed of a plurality of multipliers (arithmetic units) 104-1 to 104-4, 105-1 to 105-4, ..., 109-1 to 109-4, arranged in parallel.
  • Here, the term arithmetic unit is used as a concept that also includes the adders, subtractors, and the like that sum and output the multiplication results of a plurality of multipliers.
  • In each of the multipliers 104-1 to 104-4, 105-1 to 105-4, ..., 109-1 to 109-4, a coefficient to be multiplied with the input value is set in advance, and each multiplier outputs the input pixel value multiplied by this coefficient.
  • The multiplication results are grouped by output channel: a multiplier group 103a corresponding to output channel 1, a multiplier group 103b corresponding to output channel 2, ..., and a multiplier group 103n corresponding to output channel C_out are prepared, and the total within each group is calculated and output to the corresponding output channel.
  • The image 110 shown on the lower side of FIG. 1 illustrates how all pixel values of the output image are produced by repeating the above convolution operation while scanning the location of the pixel of interest over the entire image: first the pixel value at the upper-left corner of the image is output for all output channels, then the pixel value at the position shifted one pixel to the right, and so on.
  • In the present embodiment, the processing configuration of FIG. 1 is replaced with the compressed processing configuration shown in FIG. 2, that is, a processing configuration in which the number of operations is reduced.
  • FIG. 3 shows the rule for deciding the locations of unnecessary multipliers when compressing the network with the configuration shown in FIG. 2. The required and unnecessary multiplier locations are chosen so that each required location contributes a product-sum over a different combination of the input data among the plurality of groups. The arithmetic circuit 301 shown on the upper side of FIG. 3 shows a case where product-sums with overlapping combinations exist for the input data. The necessary locations of each arithmetic unit in each group are indicated by setting the multiplication coefficient to 1 where a multiplier is required and to 0 where it is not; setting the coefficient to 0 means the corresponding multiplier is unnecessary.
  • the four input data corresponding to each of the input channels 1 to 4 are designated as x_1, x_2, x_3, and x_4.
  • the value of the output channel 1 is x_1 + x_2
  • the value of the output channel 2 is x_3 + x_4
  • the value of the output channel 3 is x_3 + x_4.
  • In the arithmetic circuit 301, the values of output channel 2 and output channel 3 overlap, so they are always the same and one arithmetic unit performs a calculation that is not needed. In such a case, the equation of output channel 2 and the equation of output channel 3 are linearly dependent, not linearly independent. As a result, in the arithmetic circuit 301, information that should be transmitted to the output channels is lost, and the accuracy of the CNN output may deteriorate.
  • the arithmetic circuit 302 shown on the lower side of FIG. 3 shows a linearly independent case.
  • the value of the output channel 1 is x_1 + x_2 corresponding to the input channel 1 and the input channel 2.
  • the value of the output channel 2 is x_2 + x_3 corresponding to the input channel 2 and the input channel 3.
  • the value of the output channel 3 is x_3 + x_4 corresponding to the input channel 3 and the input channel 4.
  • In this case, the equations of the output channels are linearly independent, no information to be transmitted to the output channels is lost, and deterioration of the accuracy of the CNN output can be prevented, as the following check illustrates.
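  • The linear-independence condition can be checked mechanically by writing each output channel's selection of input channels as a 0/1 row and computing the matrix rank. A small sketch (the two matrices mirror arithmetic circuits 301 and 302; NumPy is an assumed tool here, not part of the circuits):

```python
import numpy as np

# Rows = output channels, columns = input channels x_1..x_4;
# 1 marks a retained multiplier, 0 marks an unnecessary one.
overlapping = np.array([[1, 1, 0, 0],   # channel 1 = x_1 + x_2
                        [0, 0, 1, 1],   # channel 2 = x_3 + x_4
                        [0, 0, 1, 1]])  # channel 3 = x_3 + x_4 (duplicate of channel 2)

independent = np.array([[1, 1, 0, 0],   # channel 1 = x_1 + x_2
                        [0, 1, 1, 0],   # channel 2 = x_2 + x_3
                        [0, 0, 1, 1]])  # channel 3 = x_3 + x_4

print(np.linalg.matrix_rank(overlapping))  # 2: rows linearly dependent, information lost
print(np.linalg.matrix_rank(independent))  # 3: rows linearly independent
```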
  • Since the multipliers 204-1 and 204-2, 205-1 and 205-2, and 206-1 and 206-2 corresponding to the output channels of the arithmetic circuit 302 shown in FIG. 3 all have the same configuration, preparing one set of multipliers for a single output channel (here, two multipliers) allows the same multipliers to be reused for the operation of every output channel, which reduces the computing resources for realizing the artificial intelligence function.
  • FIG. 4 is a diagram showing an example different from that of FIG. 1 regarding a conventional processing configuration of an arithmetic circuit that performs a convolution operation when the network is not compressed in a certain convolution layer.
  • the input image 401 is an image having a width W, a height H, and a number of channels C_in. Each pixel of the image holds the value (pixel value) of the pixel.
  • the output image 402 is an image having a width W, a height H, and a number of channels C_out. The number of channels C_in of the input image 401 and the number of channels C_out of the output image 402 may be the same or different.
  • The arithmetic circuit 403 takes as input the pixel values of all channels at the 3×3 pixel positions surrounding one pixel of interest in the input image 401, and outputs the pixel values of all channels at the one corresponding pixel location in the output image 402. That is, the arithmetic circuit 403 has an arithmetic unit group 404a corresponding to output channel 1, an arithmetic unit group 404b corresponding to output channel 2, ..., and an arithmetic unit group 404n corresponding to output channel C_out, each including a plurality of multipliers; for example, the arithmetic unit group 404a includes multipliers 405-1 to 405-n.
  • In the present embodiment, the processing configuration shown in FIG. 4 is replaced with the compressed processing configuration shown in FIG. 5, that is, a processing configuration in which the number of operations is reduced.
  • the input image 501 is supplied to the arithmetic circuit 503 of the compressed processing configuration to obtain the output of each channel, and the output image 502 is obtained.
  • the arithmetic circuit 503 includes a multiplier group 504a corresponding to the output channel 1, a multiplier group 504b corresponding to the output channel 2, ..., And a multiplier group 504n corresponding to the output channel Cout.
  • The multiplier group 504a has multipliers 505-1, 505-2, ..., 505-m (where m is smaller than the number n of multipliers 405).
  • FIG. 6 shows an example of the detailed configuration of the arithmetic circuit 503 of FIG.
  • the example shown in FIG. 6 shows a method of reusing a compressed arithmetic circuit when the input channel C_in differs depending on each convolutional layer of the CNN.
  • In other words, a part of the arithmetic circuit used in the layer with the larger number of input channels is reused to perform the arithmetic of the layer with the smaller number of input channels.
  • FIG. 6 shows a case where there are two types of layers, one is a layer having four input channels and the other is a layer having three input channels.
  • For the layer with four input channels shown on the upper side of FIG. 6, the multiplier 505-1 corresponding to input channel 1 and the multiplier 505-2 corresponding to input channel 2 perform the calculation, and their outputs are summed to obtain the value of output channel 1. Likewise, the outputs of the multipliers 506-1 and 506-2 corresponding to input channels 2 and 3 are summed to obtain the value of output channel 2, and the outputs of the multipliers 507-1 and 507-2 corresponding to input channels 3 and 4 are summed to obtain the value of output channel 3.
  • For the layer with three input channels shown on the lower side of FIG. 6, the calculations corresponding to output channel 1 and output channel 2 are the same as in the four-input-channel case shown on the upper side. The calculation corresponding to output channel 3, however, is performed only by the multiplier 507-1 corresponding to input channel 3, and the value of output channel 3 is obtained from its output alone.
  • In this case, the multiplier 507-2, which was required for the layer with four input channels, is not used.
  • The locations of the required multipliers are determined so that each group is linearly independent both when all the multipliers included in the multiplier groups are used and when only the top three multipliers are used. In this way, the multipliers 505-1 and 505-2 of the group corresponding to output channel 1 and the multipliers 506-1 and 506-2 of the group corresponding to output channel 2 are shared between the layers, and the computing resources can be reduced accordingly.
  • FIG. 7 is a configuration example in which the compressed arithmetic circuit is reused when the number of output channels C_out differs between the convolution layers of the CNN.
  • the upper side of FIG. 7 is a convolutional layer having 4 input channels and 3 output channels.
  • the convolutional layer shown on the upper side of FIG. 7 is the same as the convolutional layer shown on the upper side of FIG.
  • the lower side of FIG. 7 is a case of a convolutional layer having 4 input channels and 2 output channels.
  • For the layer with two output channels, the multiplier 505-1 corresponding to input channel 1 and the multiplier 505-2 corresponding to input channel 2 perform the calculation and their outputs are summed to obtain the value of output channel 1; likewise, the outputs of the multipliers 506-1 and 506-2 corresponding to input channels 2 and 3 are summed to obtain the value of output channel 2.
  • In this case, the calculation by the multipliers 507-1 and 507-2 shown on the upper side of FIG. 7 becomes unnecessary: when the layer has two output channels, only the top two of the three multiplier groups prepared for the three-output-channel layer are used. Thus the calculation for a layer with fewer output channels can also be performed with the same arithmetic circuit as a layer with more output channels. As for the locations of required and unnecessary multipliers, if the groups are decided to be linearly independent when all multiplier groups are used, they remain linearly independent when only some of the multiplier groups are used, as sketched below.
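  • One way to picture this reuse is a single weight array sized for the maximum numbers of input and output channels, from which a smaller layer simply leaves some multipliers idle (coefficient 0). A hedged NumPy sketch with hypothetical sizes; it simplifies the sparse wiring of FIGS. 6 and 7 to dense weights with zeros:

```python
import numpy as np

MAX_IN, MAX_OUT = 4, 3
w_max = np.random.randn(MAX_OUT, MAX_IN)   # one multiplier per (output, input) pair

def layer(x, c_in, c_out):
    """Reuse the max-size multiplier array for a smaller layer: multipliers
    outside the first c_out x c_in block keep coefficient 0 (unused)."""
    w = np.zeros((MAX_OUT, MAX_IN))
    w[:c_out, :c_in] = w_max[:c_out, :c_in]  # active multipliers only
    x_full = np.zeros(MAX_IN)
    x_full[:c_in] = x
    return (w @ x_full)[:c_out]

y3 = layer(np.random.randn(4), c_in=4, c_out=3)  # full layer (FIG. 6, upper side)
y2 = layer(np.random.randn(3), c_in=3, c_out=2)  # fewer input and output channels
```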
  • FIG. 8 shows a configuration in which the compressed arithmetic circuit is reused when the kernel size differs depending on the layer.
  • As the multiplier group corresponding to output channel 1, multipliers 601-1 to 601-5 corresponding to the first to fifth pixels of input channel 1 and multiplier 601-6 corresponding to the second pixel of input channel 2 are provided, and their outputs are summed to obtain the output corresponding to output channel 1. As the multiplier group corresponding to output channel 2, multipliers 602-1 to 602-5 corresponding to the second to sixth pixels of input channel 1 and multipliers 602-6 and 602-7 corresponding to the first and second pixels of input channel 2 are provided, and their outputs are summed to obtain the output corresponding to output channel 2. Further, as the multiplier group corresponding to output channel 3, multipliers 603-1 to 603-5 corresponding to the third to seventh pixels of input channel 1 are provided, and their outputs are summed to obtain the output corresponding to output channel 3.
  • When the kernel size is smaller, the same circuit is reused with most multipliers idle: as the multiplier group corresponding to output channel 1, only the multiplier 601-1 corresponding to the first pixel of input channel 1 is used, and its output becomes the output of output channel 1; as the multiplier group corresponding to output channel 2, only the multiplier 602-6 corresponding to the first pixel of input channel 2 is used, and its output becomes the output of output channel 2. The other multipliers (shown by dashed lines) are left non-operating when the circuit is reused. Here too, each group is determined so as to be linearly independent.
  • FIG. 9 shows an example of the processing procedure of the convolutional neural network described in the examples of the embodiments so far.
  • a convolutional network usually consists of multiple layers.
  • a network having a four-layer structure of convolutional layers 1, 2, 3, and 4, in which the number of channels of an input image is 3 and the number of channels of an output image is 128 will be described as an example.
  • the convolution layers 1, 2, 3, and 4 have kernel sizes of 3 ⁇ 3, 3 ⁇ 3, 3 ⁇ 3, and 1 ⁇ 1, respectively.
  • When an input image is given as an input to the convolutional neural network (step S1), the convolution layer 1 first performs an operation using it as input and outputs the result (step S2).
  • the convolutional layer 1 takes a 3-channel image as an input and outputs a 32-channel image.
  • the convolutional layer 2 performs an operation using the 32-channel image obtained by the convolutional layer 1 as an input, and outputs the result (step S3). In this way, the number of output channels of the convolutional layer 1 and the number of input channels of the convolutional layer 2 match. In the convolutional layer 2, the number of output channels is 64.
  • the convolutional layer 3 performs an operation by taking the image of 64 channels obtained by the convolutional layer 2 as an input, and outputs the result (step S4).
  • the number of output channels is 128.
  • The convolution layer 4 performs a calculation using the 128-channel image obtained by the convolution layer 3 as input, and outputs the result (step S5).
  • the number of output channels is 128.
  • the result output by the convolutional layer 4 becomes the output of this convolutional neural network (step S6).
  • In this network, the maximum number of input channels is 128, the maximum number of output channels is 128, and the maximum kernel size is 3×3. Therefore, in the example of the present embodiment, one calculation circuit corresponding to the maximum value of each parameter is created so that the calculation can be performed in all the convolution layers; then, as described with reference to FIGS. 6 and 7, a part of the multipliers is left unused when calculating each convolution layer. The sketch below illustrates this layer plan.
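  • The layer plan just described can be rendered schematically as follows (a sketch under the stated sizes; run_network, shared_circuit, and the stand-in engine are hypothetical names, not the patent's circuit):

```python
MAX_IN, MAX_OUT, MAX_K = 128, 128, 3   # maxima over all four convolution layers

LAYERS = [   # (input channels, output channels, kernel size)
    (3,   32, 3),    # convolution layer 1
    (32,  64, 3),    # convolution layer 2
    (64, 128, 3),    # convolution layer 3
    (128, 128, 1),   # convolution layer 4
]

def run_network(image, shared_circuit):
    """shared_circuit is one conv engine sized for (MAX_IN, MAX_OUT, MAX_K);
    each layer uses only c_in x c_out x k x k of its multipliers."""
    x = image
    for c_in, c_out, k in LAYERS:
        assert c_in <= MAX_IN and c_out <= MAX_OUT and k <= MAX_K
        x = shared_circuit(x, c_in, c_out, k)
    return x

# Stand-in engine just to make the sketch executable:
dummy = lambda x, c_in, c_out, k: f"feature map ({c_out} ch, {k}x{k} kernel)"
print(run_network("input image (3 ch)", dummy))
```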
  • FIG. 10 shows the configuration of an information processing device that executes the convolutional neural network described in the examples of the embodiments so far.
  • the information processing device includes a storage unit 701, an input image temporary storage unit 702, a convolution calculation circuit 703, and an output image temporary storage unit 704.
  • the storage unit 701 stores an image input to the convolutional neural network and an image output by the convolutional neural network.
  • The input image temporary storage unit 702 receives the image from the storage unit 701 and transmits it to the convolution calculation circuit 703. In the middle of the network calculation, it instead stores a copy of the contents of the output image temporary storage unit 704 and transmits that to the convolution calculation circuit 703.
  • the convolution calculation circuit 703 takes the data received from the input image temporary storage unit 702 as an input, calculates one convolution layer, and stores the result in the output image temporary storage unit 704.
  • the output image temporary storage unit 704 stores the calculation result of the convolution calculation circuit 703.
  • the stored image is copied to the input image temporary storage unit 702.
  • the image stored in the input image temporary storage unit 702 is transmitted to the storage unit 701.
  • The information processing device shown in FIG. 10 can be configured, for example, as a computer device composed of a CPU (Central Processing Unit) and its peripheral circuits (ROM, RAM, various interfaces, etc.), or with a general-purpose FPGA or an LSI.
  • The convolution calculation circuit 703 constitutes a circuit corresponding to the maximum values of the parameters of the convolution layers included in the network shown in FIG. 9. The input image of the network is stored in the storage unit 701.
  • In step S1, the input image stored in the storage unit 701 is copied to the input image temporary storage unit 702.
  • In step S2, using this input image as input, the convolution calculation circuit 703 performs, with a part of its circuit, the convolution calculation corresponding to convolution layer 1 and stores the result in the output image temporary storage unit 704.
  • Next, the output image temporary storage unit 704 copies the image stored in step S2 to the input image temporary storage unit 702. Using this copied image as input, the convolution calculation circuit 703 performs, with a part of its circuit, the convolution calculation corresponding to convolution layer 2, and stores the result in the output image temporary storage unit 704.
  • In steps S4 and S5, the same processing is performed for convolution layer 3 and convolution layer 4, and the result is stored in the output image temporary storage unit 704. In step S6, the stored image is copied to the storage unit 701, completing the neural network calculation.
  • According to the configuration of the information processing device of FIG. 10, the computing resources for realizing the artificial intelligence function can be significantly reduced, so the space, price, and power consumption occupied by the computer can be reduced. The data flow is sketched below.
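  • The flow among the storage unit 701, the input image temporary storage unit 702, the convolution calculation circuit 703, and the output image temporary storage unit 704 amounts to a simple copy/compute loop; a schematic Python sketch (not the hardware itself; buffers are assumed to support .copy()):

```python
def run(storage_in, conv_circuit, layers):
    input_buf = storage_in.copy()        # storage 701 -> input temp 702 (step S1)
    output_buf = input_buf
    for layer in layers:                 # steps S2 to S5
        output_buf = conv_circuit(input_buf, layer)  # circuit 703 -> output temp 704
        input_buf = output_buf.copy()    # 704 copied back to 702 for the next layer
    return output_buf.copy()             # final image back to storage 701 (step S6)
```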
  • the configuration of the information processing apparatus shown in FIG. 10 is not limited to this, and a plurality of convolution calculation circuits 703 may be connected in series.
  • For example, the convolution processing corresponding to convolution layer 1 can be performed by the first convolution calculation circuit while the convolution processing corresponding to convolution layer 2 is performed by the second convolution calculation circuit.
  • the first convolution calculation circuit inputs an input image from the input image temporary storage unit 702, performs a predetermined convolution calculation process, and outputs the result to the second convolution calculation circuit.
  • the second convolution calculation circuit performs a predetermined convolution calculation process using the output of the first convolution calculation circuit as an input, and stores the result in the output image temporary storage unit 704.
  • Subsequently, the convolution calculation corresponding to convolution layer 3 is performed by the first convolution calculation circuit and that corresponding to convolution layer 4 by the second convolution calculation circuit, whereby the operation of the neural network is completed.
  • Next, the operation of a DNN (deep neural network) is described. The structure of the DNN is defined based on the figure, and the input signal is an N-dimensional vector.
  • Here, (·)^T indicates the transpose of a matrix or vector.
  • The DNN performs pre-training by unsupervised learning using stacked autoencoders (self-encoders) before the supervised learning for identification. The autoencoder aims to acquire the main information of the high-dimensional input signal and convert it into low-dimensional feature data; learning minimizes the difference between the restored data and the input data, and is carried out from the lower layers to the upper layers using gradient descent, error backpropagation, or the like.
  • Using the weight matrix W^(l) for the network layer, the signal of the next layer is generated as x^(l+1) = f(W^(l) x^(l)), and a restore vector x̂^(l) is generated from x^(l+1) with a decoding weight matrix (denoted here V^(l)) as x̂^(l) = V^(l) x^(l+1). When learning the autoencoder, the weight matrices W^(l) and V^(l) are derived by solving the optimization problem of minimizing the difference ||x^(l) - x̂^(l)||^2.
  • Let the length of the vector x^(l) be J^(l). Since J^(l+1) < J^(l), the autoencoder reduces the dimension of the data. That is, learning can be regarded as the problem of restoring the original signal x^(l) from the dimensionally compressed signal x^(l+1) using W^(l); conversely, it suffices that the weight matrix W^(l) has the property that the original signal x^(l) can be restored from the dimensionally compressed signal x^(l+1), as in the sketch below.
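  • For reference, one autoencoder pre-training step under the formulation above (x^(l+1) = f(W^(l) x^(l)), restore x̂^(l), minimize the squared difference) can be sketched as follows; the sizes echo the 784-to-500 example used later, and all names are illustrative:

```python
import numpy as np

def f(a):                                # activation (sigmoid as a common choice)
    return 1.0 / (1.0 + np.exp(-a))

J_l, J_next = 784, 500                   # J(l+1) < J(l): dimensional compression
W = np.random.randn(J_next, J_l) * 0.01  # encoder weight W(l)
V = np.random.randn(J_l, J_next) * 0.01  # decoder (restore) weight

x = np.random.randn(J_l)                 # input signal x(l)
h = f(W @ x)                             # x(l+1): compressed intermediate signal
x_hat = V @ h                            # restore vector generated from x(l+1)
loss = np.sum((x - x_hat) ** 2)          # minimized by gradient descent / backprop
```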
  • To give the weight matrix this restoring property, methods other than randomly selecting the components of the matrix can satisfy the required randomness. A construction method focusing on this point is shown below.
  • Dimensional compression of the input yields the signal x^(2) of the intermediate nodes. FIG. 14 shows how the vector x^(2) of the intermediate nodes is obtained by the matrix calculation of the weight matrix W^(1) and the input signal vector x^(1).
  • FIGS. 14 and 15 show the network compression method.
  • Conventionally, for each layer, M × N products between the M × N components of the weight matrix and the elements of the input vector of length N are required to obtain the output vector of length M, and this number of products is the factor that increases the amount of calculation.
  • Here, the operation A ∘ B, where A is a matrix and B is a vector, multiplies the components of the i-th column of the matrix A by the i-th element of the vector B.
  • Then the sum of the resulting matrix and matrices permuted according to a specific rule or randomly is executed; here, permutation means exchanging the locations of two arbitrary elements of a matrix with each other an arbitrary number of times.
  • As a result, a matrix X^(2) of M' × N' = 10 × 50 is output, as shown at the right end of FIG. 14, and the vector x^(2) of length 500 is generated from this 10 × 50 matrix X^(2). In this way, the same calculation as one using the 500 × 784 weight matrix W^(1) can be executed, outputting the 500-dimensional signal of the intermediate nodes from the 784-dimensional input signal.
  • By using the sum of matrices based on such permuted combinations, characteristics close to those of a random matrix can be realized, so the difference in recognition and prediction performance between the conventional method and the method of the present invention can be kept slight.
  • As a concrete example, consider the case where the input signal vector x^(1) has length 9 and the output vector x^(2) has length 6, so that originally a 6 × 9 weight matrix W^(1) is required. The weights are set in the range w_i,j ∈ [-1, 1].
  • When the variance of the weight distribution is large, the weights often take values of -1 or 1, which causes a vanishing-gradient problem in which learning does not converge.
  • The method adopted in the present invention does not take the product-sum of each row of the weight matrix W^(l) with all the elements of the vector x^(l); it takes the product-sum of only some elements, and avoids identical equations by creating rules for the combinations so that no two equations match.
  • Specifically, a weight matrix (denoted here W'^(l)) whose number of rows is compressed according to the compression rate γ is created, and the matrix W'^(l) ∘ x^(l) is divided in the column direction into equal-width submatrices, one per reciprocal 1/γ of the compression rate, as in equation (1). Then the sum of these submatrices, some of them permuted according to a specific rule or randomly, is executed as in equation (2); in the following, the superscript (1) of the matrix components and vector elements is omitted. In the 6 × 9 example, the second row of the second submatrix is cyclically shifted to the left by one column, and the second row of the third submatrix is cyclically shifted to the left by two columns, as in the sketch below.
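  • The following sketch makes the 6 × 9 example concrete. It is one reading of equations (1) and (2), with the shift amounts and read-out order assumed for illustration: the 2 × 9 compressed weight matrix is multiplied column-wise with x (the ∘ operation), divided into three 2 × 3 submatrices, the second row of the k-th submatrix is cyclically shifted left by k columns, the submatrices are summed, and the resulting 2 × 3 matrix is read out as the length-6 output (the 784-to-500 example forms its 10 × 50 matrix X^(2) in the same way):

```python
import numpy as np

N = 9                              # input vector length
Wc = np.random.randn(2, N)         # compressed weight matrix (2x9 instead of 6x9)
x = np.random.randn(N)

P = Wc * x                         # "o" operation: column j scaled by element x_j -> 2x9
blocks = np.split(P, 3, axis=1)    # division into three 2x3 submatrices (eq. (1))

S = np.zeros_like(blocks[0])
for k, Bk in enumerate(blocks):    # eq. (2): sum of the (shifted) submatrices
    Bk = Bk.copy()
    Bk[1] = np.roll(Bk[1], -k)     # second row of the k-th block shifted left by k
    S += Bk

y = S.flatten()                    # 2x3 result read out as the length-6 output x(2)
print(y)                           # each entry is a product-sum of 3 distinct w*x terms
```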
  • Although DNNs have been described, the present invention can be applied to various information processing devices that perform arithmetic processing with a network structure such as a DNN (Deep Neural Network) or RNN (Recurrent Neural Network).
  • The network compression method described with reference to FIGS. 11 to 16 is an example, and other network compression methods may be applied to the configuration of the information processing device described in the first embodiment. It is sufficient that the network compression method described in the first embodiment is applied to at least one layer; network compression need not be applied to some of the other layers.
  • FIG. 17 shows the basic network configuration of a CNN. Since CNNs are generally used for recognizing objects in images and the like, the following description is given with object identification in images in mind.
  • The input data size of the l-th layer is M^(l) × M^(l), and the total number of input channels is CN^(l). The size of the output data of the l-th layer equals the input data size M^(l+1) × M^(l+1) of the (l+1)-th layer, and the total number of output channels of the l-th layer equals the total number of input channels CN^(l+1) of the (l+1)-th layer. The area to be convolved is called a kernel or filter, and the size of this filter area is H^(l) × H^(l). Let F^(l),C(l),C(l+1) denote the matrix of the filter corresponding to input channel C(l) of layer l and output channel C(l+1) of layer l+1, and let X^(l),C(l) denote the input data corresponding to each input channel C(l).
  • The matrix X^(l),C(l) is converted into a vector x^(l),C(l) of length M^(l) × M^(l). The portion of X^(l),C(l) converted into a vector of length M^(l+1) × M^(l+1) according to the size of the output data is denoted x_r^(l),C(l), where r indicates the vector that is the target of the r-th convolution calculation.
  • The convolution calculation areas of the matrix X^(l),C(l) are converted into vectors in the order of the convolution calculations, and for each channel C(l) a matrix xb^(l),C(l) of size (H^(l) × H^(l)) × (M^(l+1) × M^(l+1)) is generated. These matrices are concatenated in the row direction over the CN^(l) channels to generate xb^(l). However, the element x_i,j is set to 0 when i > M^(l) or j > M^(l) (zero padding at the boundary).
  • Then xb^(l+1) is calculated as xb^(l+1) = FB^(l) · xb^(l), where FB^(l) is the filter matrix described next, and each row of xb^(l+1) can be regarded as x^(l+1),C(l+1). In the example of FIG. 18, xb^(l+1) is calculated from FB^(l) and xb^(l) in this way; this is the im2col formulation sketched below.
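  • The construction of xb is the familiar im2col transformation: every convolution window becomes one column, channels are stacked in the row direction, and the whole layer reduces to a single matrix product FB^(l) · xb^(l). A sketch (stride 1 assumed, and the boundary zero padding omitted for brevity):

```python
import numpy as np

def im2col(x, k):
    """x: (CN, M, M) input; k: kernel size H.
    Returns (CN*k*k, M_out*M_out) with one convolution window per column."""
    cn, m, _ = x.shape
    m_out = m - k + 1
    cols = np.empty((cn * k * k, m_out * m_out))
    r = 0
    for i in range(m_out):
        for j in range(m_out):
            cols[:, r] = x[:, i:i + k, j:j + k].ravel()  # r-th window as a vector
            r += 1
    return cols

x = np.random.randn(3, 5, 5)         # CN(l) = 3 channels of 5x5 data
FB = np.random.randn(8, 3 * 3 * 3)   # filter matrix: CN(l+1) = 8 rows
xb = im2col(x, 3)                    # (27, 9)
out = FB @ xb                        # each row is one output channel's 3x3 map
print(out.shape)                     # (8, 9)
```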
  • FIG. 19 shows FB^(l) at that time; hereinafter the superscript (l) indicating the layer is omitted for simplification. The network compression method of the present embodiment is applied to the CNN by compressing FB^(l); the compressed filter matrix is denoted here FB'^(l).
  • Conventionally, the calculation shown in FIG. 21(a) is required: the number of products is CN^(l+1) × CN^(l) × H^(l) × H^(l) × M^(l+1) × M^(l+1), and this number of products is what increases the amount of calculation. In the present embodiment, the original matrix of size CN^(l+1) × (CN^(l) × H^(l) × H^(l)) is compressed at the compression rate γ to a matrix of size (CN^(l+1) × γ) × (CN^(l) × H^(l) × H^(l)).
  • As an example, the originally 6 × 9 filter matrix FB^(1), applied to an input vector of length 9, is compressed in the row direction to a 2 × 9 matrix FB'^(1) at the compression rate γ. The components of the matrix are written w_i,j and the elements of the vector x_j; the same division into submatrices and matrix addition as in the DNN case are then applied.
  • This submatrix part may also be permuted in the same manner.
  • FIG. 22 compares the flowchart for executing the arithmetic processing of the present embodiment with that of a conventional CNN. Steps S41 to S51 on the left side of FIG. 22 are the flowchart of conventional CNN execution, and steps S61 to S73 on the right side are the flowchart of the CNN execution of the present embodiment.
  • the processing of the present embodiment can be applied to the conventional matrix calculation portion.
  • Max Pooling in the flowchart of FIG. 22 is a function that extracts only the maximum value within each region of the filter output.
  • Note that the conventional method may be used for the part of the matrix calculation immediately before the recognition calculation (Softmax).
  • an input signal such as image data generally treats a combination of signals for each pixel as a vector.
  • In the conventional flow, the input vector x^(1) is converted into the matrix xb^(1) according to the convolution rule and preprocessing for normalization and quantization is performed (step S41); the matrix calculation with the filter matrix is performed (step S42); the activation function f is executed (step S43); and processing such as MAX pooling is performed (step S44).
  • In the flow of the present embodiment, the input signal is preprocessed in the same manner as in the conventional example (step S61). Then, using the compressed filter matrix FB' already explained in the matrix calculation part, the product FB' ∘ xb is computed (step S62), and the division into submatrices and their summation are executed (step S63). Further, the activation function f is executed (step S64), and processing such as MAX pooling is performed (step S65).
  • The calculation of the present embodiment may be applied to only a part of all the layers; for example, to only at least one layer in a network composed of a plurality of layers. Even if the weight matrix is compressed in this way, the characteristics are almost unchanged, so the amount of calculation can be reduced.
  • the filter matrix is also a representation of a part of the network structure, and the compression of the filter matrix can be regarded as network compression.
  • a larger matrix FB (l) will be used for explanation.
  • FIG. 23 shows a compressed 12 × 576 matrix in which a 4 × (16 × 12) submatrix is placed at the upper left, a 4 × (16 × 13) submatrix in the middle, and a 4 × (16 × 13) submatrix at the lower right; the submatrices have no overlapping rows, but there are overlapping columns.
  • FIG. 24 shows an example of a general CNN arithmetic circuit: a circuit that realizes the operation of an uncompressed network with 3×3 input data (for example, an image) of 2 channels, 3×3 output data of 1 channel, and a kernel size of 2×2.
  • The CNN arithmetic circuit is composed of an FPGA, and the first row of the block RAM (hereinafter "BRAM") in the FPGA holds the first row of the input data such as an image. The write destination of the input data in the BRAM is specified by the write address setting circuit.
  • The first rows of each BRAM and the 2×2 kernel for input channel 1 rearranged into a 1×4 row are multiplied component by component, and the sum of all the products is taken. Similarly, after the product-sum for input channel 2 is computed, the product-sum results of input channel 1 and input channel 2 with the kernel are added together and output as one component of the output channel.
  • Here, t = 1 and the like indicate the processing time; the value of t increases by 1 every clock.
  • The output data is written to a 3×3 output data buffer (a BRAM may also be used for this buffer) at a write address specified by the write address setting circuit.
  • FIG. 25 shows the processing circuit when the kernel size is compressed from 2×2 to 1×1, i.e., to 1/4.
  • The circuit configuration is the same as in FIG. 24; the kernel components are compressed from 8 to 2 (1/4 of the configuration of FIG. 24), and only the data to be product-summed with the kernel components is written from the input data by designating addresses in the BRAM.
  • FIG. 26 shows an example in which the kernel size is changed from 2×2 to 1×2 for channel 1 and to 2×1 for channel 2, a compression to 1/2.
  • The circuit configuration is again the same as in FIG. 24; the kernel components are compressed from 8 to 4 (1/2 of the configuration of FIG. 24), and only the data to be product-summed with the kernel components is written from the input data by designating addresses in the BRAM, as the sketch below illustrates.
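  • The address-designation idea of FIGS. 25 and 26 (writing into the BRAM only the input elements that will be product-summed with the surviving kernel components) reduces, in software terms, to a gather followed by a dot product. A sketch with hypothetical addresses and values:

```python
import numpy as np

x = np.random.randn(2, 3, 3)           # 2 input channels of 3x3 data

# Compressed kernel: only 4 of the original 8 components survive (1/2 rate, FIG. 26).
kernel_vals = np.array([0.5, -0.3, 0.8, 0.1])

# Write-address table: which (channel, row, column) elements are written to the
# BRAM and summed with the kernel components; freely changeable per layer.
addr = [(0, 0, 0), (0, 0, 1),          # 1x2 kernel on input channel 1
        (1, 0, 0), (1, 1, 0)]          # 2x1 kernel on input channel 2

gathered = np.array([x[c, i, j] for (c, i, j) in addr])  # address-designated gather
out = float(np.dot(kernel_vals, gathered))               # product-sum: one output component
```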
  • FIG. 27 shows the general circuit configuration when a plurality of layers have different compression ratios.
  • The system (information processing device) 1000 includes a main CPU 1011, a DDR (external memory) 1012, and an external memory 1013, and executes various arithmetic processes such as image recognition and voice recognition under the control of the main CPU 1011.
  • the system 1000 incorporates the FPGA block 1100 as an information processing device that is an arithmetic circuit that executes deep learning.
  • the FPGA block 1100 includes an internal control CPU 1110, a PCIe (endpoint) 1120, a memory control unit 1130, and a CNN arithmetic circuit 1300.
  • The memory control unit 1130 accesses the external memory 1013 from the FPGA block 1100.
  • the internal control CPU 1110 controls the execution of deep learning in the deep learning processing circuit 1300.
  • The deep learning processing circuit 1300 includes, for example, a data transfer control unit 1311, a convolution layer 1312, a pooling layer 1313, and an activation function holding unit 1314 as an arithmetic circuit that executes deep learning without compression, and further includes a data transfer control unit 1321, a convolution layer 1322, a pooling layer 1323, and an activation function holding unit 1324 as an arithmetic circuit that executes deep learning at a certain compression rate (for example, 1/4). When there are further layers with other compression rates, yet another circuit block is required for each rate; in this way, when the compression rates differ among the layers, it is common to prepare a plurality of arithmetic circuits with different compression rates.
  • FIG. 28 shows the configuration of the system (information processing apparatus) 1000 when the circuit of the present embodiment is applied.
  • the system 1000 shown in FIG. 28 only one set of a data transfer control unit 1311, a convolution layer 1312, a pooling layer 1313, and an activation function holding unit 1314 is provided as the deep learning processing circuit 1300'.
  • This one set of arithmetic circuits executes all the operations of the uncompressed layer and of the plurality of layers with different compression ratios. Therefore, the circuit that performs the CNN calculation in the FPGA block 1100 can be standardized and constructed with a common FPGA. Differences in the weights and network structure of each layer are realized by changing the weights and the write address setting for each layer, for example under program control by the main CPU 1011, so that one FPGA is shared among various kinds of deep learning, as sketched below.
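  • In software terms, this standardization is one fixed compute engine plus per-layer configuration (weights, compression rate, write addresses) downloaded by the main CPU. A schematic sketch with invented names and placeholder methods:

```python
class SharedConvEngine:
    """Stand-in for the common FPGA circuit; only the configuration changes."""
    def load_weights(self, path):        self.w = path      # placeholder load
    def set_write_addresses(self, tbl):  self.addr = tbl    # per-layer address table
    def convolve(self, x, rate):         return x           # placeholder compute

LAYER_CONFIGS = [   # swapped in by the main CPU; the engine itself never changes
    {"weights": "layer1.bin", "compression": 1.0,  "addr_table": "full"},
    {"weights": "layer2.bin", "compression": 0.25, "addr_table": "sparse_a"},
    {"weights": "layer3.bin", "compression": 0.5,  "addr_table": "sparse_b"},
]

def run_on_shared_fpga(engine, x, configs):
    for cfg in configs:
        engine.load_weights(cfg["weights"])            # change weights per layer
        engine.set_write_addresses(cfg["addr_table"])  # change write address setting
        x = engine.convolve(x, rate=cfg["compression"])
    return x
```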
  • Furthermore, the AI control device (AI control component) equipped with the main CPU that controls this FPGA can be configured on a common circuit board and adapted only by changing the program that runs on the main CPU. Therefore, making a plurality of AI control devices common increases the mass-production effect and reduces product cost.
  • In FIG. 29, AI control devices 11, 12, and 13 are mounted on a vehicle (automobile) 10, and each performs recognition such as obstacle detection or person detection. Since the FPGA block 1100 mounted on each AI control device 11, 12, 13 has the same configuration shown in FIG. 28, the hardware can be standardized and the product cost reduced.
  • In the example described here, each AI control device 11, 12, 13 is individually provided with an FPGA block 1100; however, when a plurality of devices perform deep-learning recognition tasks that do not overlap in time, it is also possible to perform the different recognitions in time division using one AI control device.
  • FIG. 30 shows an example applied to a smartphone 20: the circuit of the present embodiment is used as the FPGA block 1100 built into the AI functional device 21, and the various kinds of deep learning required by the applications executed on the smartphone 20 can all be executed by the FPGA block 1100 of this single AI functional device 21. Although a smartphone 20 is used as the example in FIG. 30, the same configuration can be applied to various information processing devices that perform deep learning, such as tablet terminals and smartwatches.
  • FIG. 31 shows an example applied to a drone (flying object) 30. The circuit of the present embodiment can be applied as the FPGA block 1100 built into the AI functional device 31 of the drone 30, for detecting persons for security, detecting specific objects, detecting approach, and the like; in this case too, one AI functional device 31 can execute a plurality of functions.
  • As described above, in both DNNs and CNNs the amount of calculation can be significantly reduced according to the compression rate γ and the division into submatrices, and since almost the same accuracy can be achieved even after compression, implementation using a low-cost, low-power CPU or general-purpose FPGA becomes possible.
  • In determining x^(l+1) from the weight matrix W^(l) and the vector x^(l), instead of taking the product-sum of each row of W^(l) with all the elements of x^(l), the product-sum of only some elements is taken, and identical equations are avoided by creating combination rules under which no two equations match. Any calculation that can generate such non-matching combinations can be used; the method is not limited to the one described above.
  • FIG. 32 shows an example of the hardware configuration of a computer device which is an information processing device that executes arithmetic processing of each embodiment of the present invention.
  • the computer device C shown in FIG. 32 includes a CPU (Central Processing Unit) C1, a ROM (Read Only Memory) C2, and a RAM (Random Access Memory) C3 connected to the bus C8, respectively.
  • the computer device C includes a non-volatile storage C4, a network interface C5, an input device C6, and a display device C7.
  • an FPGA (field-programmable gate array) C9 may be provided if necessary.
  • The CPU C1 reads the program code of the software realizing each function of the information processing device of this example from the ROM C2 and executes it; variables, parameters, and the like generated during the arithmetic processing are temporarily written to the RAM C3. For example, when the CPU C1 reads out and executes the program stored in the ROM C2, the DNN and CNN arithmetic processing already described is executed. It is also possible to implement part or all of the DNN and CNN processing in the FPGA C9; using an FPGA reduces power consumption and enables high-speed calculation.
  • For the non-volatile storage C4, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive) is used.
  • In the non-volatile storage C4, an OS (Operating System), various parameters, a program for executing the DNN or CNN, and the like are recorded.
  • Various data can be input / output to the network interface C5 via a LAN (Local Area Network) to which terminals are connected, a dedicated line, or the like.
  • the network interface C5 receives an input signal for performing a DNN or CNN calculation.
  • the calculation result of DNN or CNN is transmitted from the network interface C5 to an external terminal device.
  • the input device C6 is composed of a keyboard or the like.
  • a calculation result or the like is displayed on the display device C7.
  • Further, the present invention can be applied to all systems that perform dimensional compression or compressed sensing of input data by machine learning or matrix calculation, including artificial intelligence that has a network structure in a part thereof, such as general neural networks and recurrent neural networks (RNN).
  • 1100 ... FPGA block, 1110 ... CPU for internal control, 1120 ... PCIe (endpoint), 1130 ... memory control unit, 1300, 1300' ... deep learning processing circuit, 1311 ... data transfer control unit, 1312 ... convolution layer, 1313 ... pooling layer, 1314 ... activation function holding unit, 1321 ... data transfer control unit, 1322 ... convolution layer, 1323 ... pooling layer, 1324 ... activation function holding unit, C ... computer device, C1 ... CPU, C2 ... ROM, C3 ... RAM, C4 ... non-volatile storage, C5 ... network interface, C6 ... input device, C7 ... display device, C8 ... bus, C9 ... FPGA

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention applies to the calculation of a network that interconnects nodes in a neural network. According to the present invention, a neural network weight matrix is designed to have a reduced number of rows or columns relative to the number of rows or columns determined by the input data or the output data. Furthermore, the input data vector is multiplied by the weight components of the reduced number of rows or columns; the matrix resulting from the multiplication is divided into partial matrices each comprising a certain number of rows or columns; and matrix addition is performed individually on the divided partial matrices.
PCT/JP2019/018604 2019-05-09 2019-05-09 Dispositif de traitement d'informations WO2020225916A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/018604 WO2020225916A1 (fr) 2019-05-09 2019-05-09 Dispositif de traitement d'informations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/018604 WO2020225916A1 (fr) 2019-05-09 2019-05-09 Dispositif de traitement d'informations

Publications (1)

Publication Number Publication Date
WO2020225916A1 (fr) 2020-11-12

Family

ID=73050755

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/018604 WO2020225916A1 (fr) 2019-05-09 2019-05-09 Dispositif de traitement d'informations

Country Status (1)

Country Link
WO (1) WO2020225916A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05233581A (ja) * 1992-02-18 1993-09-10 Hitachi Ltd ニューラルネットワーク応用システム
JPH09319724A (ja) * 1996-05-31 1997-12-12 Noritz Corp 推論装置
US20180101763A1 (en) * 2016-10-06 2018-04-12 Imagination Technologies Limited Buffer Addressing for a Convolutional Neural Network
JP2018156451A (ja) * 2017-03-17 2018-10-04 株式会社東芝 ネットワーク学習装置、ネットワーク学習システム、ネットワーク学習方法およびプログラム
WO2019082859A1 (fr) * 2017-10-23 2019-05-02 日本電気株式会社 Dispositif d'inférence, procédé d'exécution de calcul de convolution et programme

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
OSAWA, K. ET AL.: "Accelerating Matrix Multiplication in Deep Learning by Using Low-Rank Approximation", PROCEEDINGS OF 2017 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 21 July 2017 (2017-07-21), pages 186 - 192, XP033153217, ISBN: 978-1-5386-3250-5, DOI: 10.1109/HPCS.2017.37 *
OSAWA, KAZUKI ET AL.: "Accelerating Convolutional Neural Networks using Low-Rank Approximation", PROCEEDINGS OF THE CONFERENCE ON COMPUTATIONAL ENGINEERING AND SCIENCE, vol. 22, 31 May 2017 (2017-05-31), pages 1 - 5, ISSN: 1342-145X *
TANOMOTO, MASAKAZU ET AL.: "A CGRA-based Approach for Accelerating Convolutional Neural Networks", PROCEEDINGS OF 2015 IEEE 9TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS- ON-CHIP, 25 September 2015 (2015-09-25), pages 73 - 80, XP032809497, ISBN: 978-1-4799-8670-5, DOI: 10.1109/MCSoC.2015.41 *
TANOMOTO, MASAKAZU ET AL.: "Convolutional Neural Network Processing on An Accelerator based on Manymemory Network", IEICE TECHNICAL REPORT., vol. 114, 19 November 2014 (2014-11-19), pages 57 - 62, ISSN: 0913-5685 *

Similar Documents

Publication Publication Date Title
CN107622302B (zh) 用于卷积神经网络的超像素方法
WO2020044527A1 (fr) Dispositif de traitement d'informations
KR102545128B1 (ko) 뉴럴 네트워크를 수반한 클라이언트 장치 및 그것을 포함하는 시스템
JP7359395B2 (ja) 画像処理装置及びその画像処理方法、並びに画像処理システム及びそのトレーニング方法
US11328184B2 (en) Image classification and conversion method and device, image processor and training method therefor, and medium
JP2020068027A (ja) アンサンブル学習ベースの画像分類システム
CN113298716A (zh) 基于卷积神经网络的图像超分辨率重建方法
JP6528349B1 (ja) 情報処理装置及び情報処理方法
WO2020225916A1 (fr) Dispositif de traitement d'informations
CN111028301A (zh) 一种基于加权l1范数的卷积稀疏编码方法
CN111340089A (zh) 图像特征学习方法、模型、装置和计算机存储介质
CN112634136B (zh) 一种基于图像特征快速拼接的图像超分辨率方法及其系统
CN112488187B (zh) 一种基于核二维岭回归子空间聚类的图像处理方法
CN114580618A (zh) 一种反卷积处理方法、装置、电子设备及介质
Chang et al. Deep learning acceleration design based on low rank approximation
US12094084B2 (en) Multi-channel feature map fusion
JP6741159B1 (ja) 推論装置及び推論方法
CN110324620B (zh) 帧内预测方法、装置、电子设备及机器可读存储介质
Atsalakis et al. A window-based color quantization technique and its embedded implementation
Keleş et al. PAON: A New Neuron Model using Pad\'e Approximants
CN115618945A (zh) 面向边缘设备知识蒸馏方法及装置
KR20230157220A (ko) 영상 처리 장치 및 그 동작 방법
CN118379374A (zh) 基于光谱记忆增强网络的高光谱图像压缩成像方法及装置
CN118397476A (zh) 一种遥感图像目标检测模型的改进方法
CN116868205A (zh) 经层式分析的神经网络剪枝方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19928246

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19928246

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP