WO2021261252A1 - Computation circuit, computation method, program, and computation circuit design method - Google Patents

Computation circuit, computation method, program, and computation circuit design method

Info

Publication number
WO2021261252A1
Authority
WO
WIPO (PCT)
Prior art keywords
arithmetic
processing
processing steps
assigned
units
Prior art date
Application number
PCT/JP2021/021922
Other languages
French (fr)
Japanese (ja)
Inventor
和茂 橋本
正志 森
Original Assignee
三菱電機株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社
Publication of WO2021261252A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • The present disclosure relates to an arithmetic circuit, an arithmetic method, a program for executing the arithmetic method, and a method of designing the arithmetic circuit.
  • For processing whose processing time increases with the amount of input data, it is common practice to shorten the processing time through parallel processing using multiple arithmetic units.
  • In parallel processing, the original processing is divided into partial processes that use the same algorithm, and each partial process is computed independently by its corresponding arithmetic unit. Integrating the results of the partial processes with a relatively small amount of calculation then yields a result that is the same, or approximately the same, as the result of the original processing.
  • Parallel processing is suitably used for signal processing, certain processing of artificial intelligence (for example, calculation of the average value of a plurality of random variables according to the same distribution), and the like.
  • One of the problems with parallel processing is that the amount of data input to each arithmetic unit varies. If the amount of input data varies from one arithmetic unit to another, there is a disadvantage that the total processing time is determined by the processing time of the arithmetic unit having the largest amount of input data.
  • Patent Document 1 discloses a method for solving this problem. Specifically, in multiplying a coefficient matrix by an input vector, the multiplication of each non-zero element of the coefficient matrix by the corresponding element of the input vector is treated as one processing unit, and the multiplications are assigned to the arithmetic units so that the number of processing units handled by each arithmetic unit is leveled.
  • Although the calculation method described in International Publication No. 2019/053835 (Patent Document 1) is effective in shortening the processing time of the multiplication, determining which processing to assign to each arithmetic unit takes time and effort. This is because the number of unit processes to be assigned to each arithmetic unit is not determined until the search for non-zero elements has been completed for all input data. As a result, the circuit scale for the preprocessing that allocates the arithmetic processing to the arithmetic units increases, and so does the total processing time.
  • The present disclosure has been made in consideration of the above problem. One of its purposes is to provide an arithmetic circuit, an arithmetic method, and a program for executing the arithmetic method that can shorten the processing time of the entire arithmetic processing by suppressing, with a relatively simple method, the variation in processing time among arithmetic units.
  • The arithmetic circuit of one embodiment performs arithmetic processing on input data.
  • The arithmetic processing on the input data includes m processing steps that can be processed in parallel with each other.
  • The arithmetic processing amount of each of the m processing steps is different from the arithmetic processing amount of the other processing steps.
  • The arithmetic circuit includes a control processor and n arithmetic units that execute the m processing steps, where n is an integer of 2 or more and smaller than m.
  • The control processor randomly assigns each of the m processing steps to any one of the n arithmetic units based on a random number.
  • FIG. 1 is a block diagram showing an example of an image recognition system including the arithmetic circuit according to Embodiment 1.
  • FIG. 2 is a block diagram showing an example of the configuration of the signal input unit.
  • FIG. 3 is a flowchart showing an outline of arithmetic processing using the arithmetic circuit.
  • FIG. 4 is a flowchart showing details of arithmetic processing using the arithmetic circuit.
  • FIG. 5 is a flowchart showing in detail an example of the process in step S110 of FIG. 4.
  • FIG. 6 is a diagram conceptually showing the convolution operation in the convolution layer of a convolutional neural network.
  • FIG. 7 is a diagram showing, in table format, an example of the processing time of each processing step of the convolution operation shown in FIG. 6.
  • FIG. 8 is a diagram showing, in table format, the processing time of each arithmetic unit when each processing step of the convolution operation shown in FIG. 6 is regularly assigned to the arithmetic units.
  • FIG. 9 is a diagram showing, in table format, an example in which each processing step of the convolution operation shown in FIG. 6 is assigned to the arithmetic units by the procedure shown in FIG. 5.
  • FIG. 10 is a diagram showing, in table format, the processing time of each arithmetic unit in the allocation example shown in FIG. 9.
  • FIG. 11 is a diagram for explaining the effect of Embodiment 1.
  • FIG. 12 is a flowchart showing another method of realizing the process in step S110 of FIG. 4.
  • FIG. 1 is a block diagram showing an example of an image recognition system including an arithmetic circuit according to the first embodiment.
  • The image recognition system 30 has a system configuration that assumes execution of an image recognition application in, for example, a surveillance camera for person detection or an in-vehicle camera for object detection.
  • The image recognition system 30 has a function of detecting a specific object from image data according to a CNN model. Specifically, the image recognition system 30 performs a two-dimensional convolution operation, that is, a product-sum operation of the form Ax + b, where A is a matrix and x is a vector.
  • The image recognition system 30 includes a signal input unit 31, a CPU (Central Processing Unit) 32, a memory 33, a DMAC (Direct Memory Access Controller) 34, a parallel processing calculation unit 35, a reader / writer 36, and a network interface 37. These components are interconnected via the bus interconnect 38.
  • the signal input unit 31 generates image data by converting light incident through an optical system (not shown) into an electric signal.
  • the image data is arithmetically processed by the arithmetic circuit 40.
  • a configuration example of the signal input unit 31 will be described later with reference to FIG.
  • the CPU 32 functions as a control processor that controls the entire image recognition system 30.
  • the CPU 32 also accesses the dedicated memory 41 inside the parallel processing calculation unit 35.
  • the memory 33 stores instructions and control data executed by the CPU 32.
  • The memory 33 includes volatile memory such as a DRAM (Dynamic Random Access Memory) and an SRAM, and electrically rewritable non-volatile memory such as a flash memory.
  • the DMAC 34 controls direct data transfer between the signal input unit 31, the memory 33, and the dedicated memory 41 of the parallel processing calculation unit 35, without going through the CPU 32.
  • Parallel processing calculation unit 35 performs two-dimensional convolution calculation processing.
  • the parallel processing calculation unit 35 includes a dedicated memory 41, an input data control unit 43, and n arithmetic units 44_1 to 44_n as its internal configuration.
  • the arithmetic units 44_1 to 44_n are referred to as the arithmetic unit 44 when they are generically referred to or when they indicate unspecified ones.
  • the processing capacity of each of the n arithmetic units 44 is the same.
  • the processing capacity of each of the n arithmetic units 44 is substantially the same to the extent that the difference in processing time of each arithmetic unit does not matter.
  • the n arithmetic units 44 can be programmed in parallel with each other.
  • The number n of the arithmetic units 44 is determined depending on the number m of processing steps that can be processed in parallel in the processing program for the input data 42. Here, 2 ≤ n < m holds.
  • the arithmetic unit 44 can be configured by a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), a multi-core processor, or the like.
  • the dedicated memory 41 stores the input data 42 processed by the n arithmetic units 44 and the arithmetic results of each arithmetic unit 44.
  • the input data control unit 43 assigns each of the above processing steps to each arithmetic unit 44.
  • the arithmetic unit 44 executes arithmetic processing for the assigned processing step.
  • the arithmetic circuit 40 is configured by the CPU 32, the memory 33, and the parallel processing calculation unit 35.
  • the CPU 32 and the memory 33 may be provided inside the parallel processing calculation unit 35.
  • The reader / writer 36 writes data or a program to a storage medium and reads out the data or program stored in the storage medium.
  • The storage medium stores data or programs in a non-transitory manner, by magnetic or optical methods or by using semiconductor memory.
  • As the storage medium, a CD (Compact Disc), a DVD (Digital Versatile Disc), a Blu-ray disc, a hard disk, a flash memory, or the like can be used.
  • the network interface 37 is connected to an external device via the network.
  • the program executed by the CPU 32 and the program executed by the arithmetic unit 44 may be provided via the network via the network interface 37, or may be provided by the storage medium via the reader / writer 36.
  • FIG. 2 is a block diagram showing an example of the configuration of the signal input unit of FIG.
  • The block diagram of FIG. 2 shows a configuration example in which a CMOS (Complementary Metal Oxide Semiconductor) device is used as the photoelectric element, but a CCD (Charge Coupled Device) or another type of optical sensor may be used instead.
  • The signal input unit 31 includes an optical system (not shown) for condensing light, a sensor array 51, a column ADC (Analog-to-Digital Converter) 52, a vertical scanning circuit 53, a horizontal scanning circuit 54, an output amplifier 55, a frame buffer 56, and a signal processing circuit 57.
  • the sensor array 51 includes a plurality of photoelectric elements arranged in a matrix.
  • the light input to the signal input unit 31 is focused on the sensor array 51 by the optical system.
  • the vertical scanning circuit 53 is connected to a plurality of control signal lines extending in the row direction of the sensor array 51, and drives a readout circuit of each photoelectric element via each control signal line.
  • the horizontal scanning circuit 54 is connected to a plurality of output signal lines extending in the column direction of the sensor array 51, and reads an optical signal from each photoelectric element via each output signal line.
  • the column ADC 52 converts an optical signal read from each photoelectric element into a digital signal.
  • the output amplifier 55 amplifies the converted digital signal.
  • the frame buffer 56 temporarily stores the amplified digital signal frame by frame.
  • the signal processing circuit 57 removes noise and the like contained in the digital signal and executes various image corrections.
  • FIG. 3 is a flowchart showing an outline of arithmetic processing using the arithmetic circuit of FIG.
  • the process executed by the arithmetic circuit 40 is divided into a preprocessing S100 that is executed only once at the beginning and an arithmetic process S200 that is repeatedly executed a plurality of times according to the input data.
  • the preprocessing S100 is executed by the CPU 32 according to the program.
  • the preprocessing S100 may be executed by another general-purpose CPU.
  • the arithmetic processing S200 is mainly executed by the parallel processing calculation unit 35 according to the program, and the overall control of the arithmetic processing S200 is executed by, for example, the CPU 32.
  • FIG. 4 is a flowchart showing details of arithmetic processing using the arithmetic circuit of FIG.
  • the arithmetic processing for the input data 42 executed by the arithmetic circuit 40 includes m processing steps capable of parallel processing.
  • the number of processing steps m is larger than the number n of the arithmetic units 44.
  • In step S110 of the preprocessing S100, the CPU 32 allocates each of the m processing steps to any one of the n arithmetic units 44 based on a random number. Therefore, each arithmetic unit 44 is assigned at least one of the m processing steps.
  • In step S210 of the arithmetic processing S200, each arithmetic unit 44 of the parallel processing calculation unit 35 executes its assigned processing steps. If an arithmetic unit 44 has not yet executed all the processing steps assigned to it (NO in step S220), step S210 is repeated. When every arithmetic unit 44 has executed all of its assigned processing steps, the arithmetic processing ends.
  • FIG. 5 is a flowchart showing in detail an example of the process in step S110 of FIG.
  • In the flowchart of FIG. 5, the identification number of a processing step is i (1 ≤ i ≤ m, where i is an integer), and the first through m-th processing steps are handled in order.
  • In step S300 of FIG. 5, the CPU 32 initializes the identification number i of the processing step to 1.
  • In the next step S310, the CPU 32 generates an integer uniform random number in the range of 1 to n, corresponding to the identification numbers of the n arithmetic units.
  • Let r(i) be the generated integer uniform random number; 1 ≤ r(i) ≤ n holds.
  • A known pseudo-random number generation algorithm may be used to generate the uniform random numbers, for example the linear congruential method or the multiply-with-carry method.
  • When the processing capacities of the n arithmetic units are the same or substantially the same, it is desirable to generate uniform random numbers, in which every value appears with the same probability.
  • When the processing capacities of the arithmetic units differ, random numbers that follow a non-uniform distribution need to be generated instead.
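  • As an illustration of the linear congruential method mentioned above, the following is a minimal sketch, not the implementation used in the disclosure. The class name `Lcg`, the method name `next_unit_id`, and the specific constants (a widely published 32-bit parameter set) are assumptions chosen for the example.

```python
class Lcg:
    """Linear congruential generator: state <- (A * state + C) mod M."""

    A = 1664525        # multiplier (assumed, commonly published 32-bit constant)
    C = 1013904223     # increment (assumed)
    M = 2 ** 32        # modulus

    def __init__(self, seed):
        self.state = seed % self.M

    def next_unit_id(self, n):
        """Return an approximately uniform integer in the range 1..n."""
        self.state = (self.A * self.state + self.C) % self.M
        # Scale the full state so that the high-order bits decide the result;
        # the low-order bits of an LCG are less random.
        return (self.state * n) // self.M + 1


rng = Lcg(seed=12345)
ids = [rng.next_unit_id(4) for _ in range(1000)]
```

  • Each call returns one arithmetic unit identification number, so the generator can stand in for the random number source of step S310.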
  • Next, the CPU 32 assigns the i-th processing step to the r(i)-th arithmetic unit 44_r(i) using the generated integer uniform random number r(i).
  • The CPU 32 then increments the identification number i of the processing step by 1.
  • By repeating the above steps until i exceeds m, each of the m processing steps is assigned to one of the n arithmetic units 44.
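  • The procedure of FIG. 5 can be sketched as follows. This is a hedged illustration only: the function name `assign_uniform` is hypothetical, and Python's standard `random` module stands in for whatever pseudo-random generator the CPU 32 actually uses.

```python
import random


def assign_uniform(m, n, rng):
    """Return a list whose (i-1)-th entry is the unit (1..n) for processing step i."""
    assignment = []
    i = 1                              # S300: start with the first processing step
    while i <= m:                      # repeat until every step has been handled
        r_i = rng.randint(1, n)        # S310: uniform integer in 1..n (inclusive)
        assignment.append(r_i)         # assign step i to arithmetic unit r(i)
        i += 1                         # move on to the next processing step
    return assignment


# Example with the sizes used in the text: 9 processing steps, 4 arithmetic units.
assignment = assign_uniform(m=9, n=4, rng=random.Random(0))
```

  • No information about the input data is needed to build the assignment, which is why the preprocessing stays cheap.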
  • FIG. 6 is a diagram conceptually showing the convolution operation in the convolution layer of the convolutional neural network.
  • the output data 62 is generated by the convolution operation between the input data 60 and the kernel 61. Further, after the bias is added to each element of the output data 62, an activation function is applied to each element of the output data 62.
  • In the convolution operation, the kernel 61 is slid over the input data 60 at regular intervals; at each position, the elements of the kernel 61 are multiplied by the corresponding elements of the input data 60, and the sum of the products is obtained.
  • In this example, the interval for sliding the kernel (that is, the stride) is 1.
  • Specifically, the operations of the following equations (1) to (9) are executed.
  • Here, the bias addition and the activation function operation are omitted.
  • $y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_6 w_4 + x_7 w_5 + x_8 w_6 + x_{11} w_7 + x_{12} w_8 + x_{13} w_9 \quad (1)$
  • $y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_7 w_4 + x_8 w_5 + x_9 w_6 + x_{12} w_7 + x_{13} w_8 + x_{14} w_9 \quad (2)$
  • $y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_8 w_4 + x_9 w_5 + x_{10} w_6 + x_{13} w_7 + x_{14} w_8 + x_{15} w_9 \quad (3)$
  • $y_4 = x_6 w_1 + x_7 w_2 + x_8 w_3 + x_{11} w_4 + x_{12} w_5 + x_{13} w_6 + x_{16} w_7 + x_{17} w_8 + x_{18} w_9 \quad (4)$
  • $y_5 = x_7 w_1 + x_8 w_2 + x_9 w_3 + x_{12} w_4 + x_{13} w_5 + x_{14} w_6 + x_{17} w_7 + x_{18} w_8 + x_{19} w_9 \quad (5)$
  • $y_6 = x_8 w_1 + x_9 w_2 + x_{10} w_3 + x_{13} w_4 + x_{14} w_5 + x_{15} w_6 + x_{18} w_7 + x_{19} w_8 + x_{20} w_9 \quad (6)$
  • When the input data 60 contains zero elements, the multiplication with the corresponding element of the kernel 61 need not be executed for those elements. For example, if $x_4$, $x_5$, $x_9$, $x_{13}$ to $x_{18}$, $x_{21}$, $x_{22}$, and $x_{25}$ of the input data 60 are 0, the arithmetic processing amount varies from one processing step to another.
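  • To make the variation concrete, the sketch below counts the multiplications that remain in each processing step once zero elements are skipped. It assumes, from the indices used in the equations, that the input data is a 5 x 5 array numbered row by row and that the kernel is 3 x 3 with stride 1; the function name `nonzero_multiplications` is hypothetical.

```python
# 1-based indices of the input elements that the text sets to 0.
ZERO_INDICES = {4, 5, 9, 13, 14, 15, 16, 17, 18, 21, 22, 25}


def nonzero_multiplications(width=5, kernel=3, zeros=ZERO_INDICES):
    """Count, per output element, the multiplications with a non-zero input."""
    out = width - kernel + 1               # 3 output positions per axis (stride 1)
    counts = []
    for row in range(out):
        for col in range(out):
            count = 0
            for kr in range(kernel):
                for kc in range(kernel):
                    # Row-major, 1-based index of the input element under the kernel.
                    index = (row + kr) * width + (col + kc) + 1
                    if index not in zeros:
                        count += 1
            counts.append(count)
    return counts


counts = nonzero_multiplications()
print(counts)  # [8, 5, 3, 5, 4, 4, 3, 4, 4]
```

  • Under these assumptions, the first processing step needs eight multiplications while the third and seventh need only three, which is the kind of imbalance the allocation methods address.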
  • FIG. 7 is a diagram showing an example of the processing time of each processing step of the convolution operation shown in FIG. 6 in a table format. Since the processing performance of each arithmetic unit is assumed to be the same in the first embodiment, the arithmetic processing amount of each processing step is proportional to the processing time.
  • the processing time of each processing step is represented as shown in FIG.
  • FIG. 8 is a diagram showing the processing time of each arithmetic unit in a table format when each processing step of the convolution operation shown in FIG. 6 is regularly assigned to each arithmetic unit.
  • In this example, processing steps 1 to 4 are assigned to arithmetic units 1 to 4 in order, processing steps 5 to 8 are again assigned to arithmetic units 1 to 4 in order, and the remaining processing step 9 is assigned to arithmetic unit 1.
  • When the processing steps 1 to 9 are regularly assigned to the arithmetic units 44_1 to 44_4 in this way, the processing time of each arithmetic unit 44 varies as shown in FIG. 8. Therefore, the total processing time is determined by the processing time of the arithmetic unit 44_1, which has the longest processing time.
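  • The regular assignment can be sketched as follows. The per-step costs are an assumption: they are the numbers of non-zero multiplications implied by the zero pattern given for the input data 60 (for a 5 x 5 input and a 3 x 3 kernel), used here as a stand-in for processing time, and are not values read from FIG. 7. The function name `regular_assignment` is hypothetical.

```python
# Assumed cost of processing steps 1..9 (non-zero multiplications per step).
STEP_COST = [8, 5, 3, 5, 4, 4, 3, 4, 4]


def regular_assignment(costs, n):
    """Assign step i to unit ((i - 1) mod n) + 1 and return each unit's total cost."""
    loads = [0] * n
    for index, cost in enumerate(costs):   # index 0 corresponds to step 1
        loads[index % n] += cost
    return loads


loads = regular_assignment(STEP_COST, n=4)
print(loads)       # [16, 9, 6, 9]
print(max(loads))  # 16: arithmetic unit 1 determines the total processing time
```

  • With these assumed costs, arithmetic unit 1 carries steps 1, 5, and 9 and finishes last, in line with the observation that the arithmetic unit 44_1 determines the total processing time.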
  • FIG. 9 is a diagram showing an example in which each processing step of the convolution operation shown in FIG. 6 is assigned to each arithmetic unit by the procedure shown in FIG. 5 in a table format.
  • the CPU 32 randomly assigns each of the processing steps 1 to 9 to each arithmetic unit 44 based on a uniform random number.
  • FIG. 10 is a diagram showing, in table format, the processing time of each arithmetic unit in the allocation example shown in FIG. 9. Compared with the regular assignment of processing steps to the arithmetic units 44 shown in FIG. 8, the variation in processing time among the arithmetic units is reduced in the case of the first embodiment shown in FIG. 10.
  • In a small example such as this one, however, the variation may instead increase depending on which random numbers are generated.
  • In practice, both the number of processing steps and the number of arithmetic units 44 are much larger than in the above example. Therefore, by randomly assigning each processing step to one of the arithmetic units 44 based on a uniform random number, variation in the processing time among the arithmetic units 44 can be suppressed.
  • FIG. 11 is a diagram for explaining the effect of the first embodiment.
  • Let the processing times of arithmetic units 1 to n be $N_1, \ldots, N_n$, respectively.
  • FIG. 11(A) shows a case where the processing time of each arithmetic unit varies as a result of regularly allocating the processing steps to the arithmetic units 1 to n. As shown in FIG. 11(A), the overall processing time of the arithmetic circuit 40 is determined by the processing time $N_1$ of arithmetic unit 1, which has the longest processing time.
  • FIG. 11(B) shows a case where each processing step is randomly assigned to the arithmetic units 1 to n according to a uniform random number.
  • In this case, the processing time of the entire arithmetic circuit 40 can be shortened compared with the case of FIG. 11(A).
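  • The effect can be illustrated with a small simulation. The workload below is deliberately contrived and is an assumption of this sketch: every n-th processing step is made nine times as expensive as the others, so a regular assignment sends all expensive steps to the same arithmetic unit. The function name `makespan` is hypothetical.

```python
import random


def makespan(costs, unit_of_step, n):
    """Total cost of the busiest unit, i.e. the overall processing time."""
    loads = [0] * n
    for cost, unit in zip(costs, unit_of_step):
        loads[unit] += cost
    return max(loads)


n, m = 4, 4000
# Assumed workload: steps 0, 4, 8, ... cost 9, all other steps cost 1.
costs = [9 if i % n == 0 else 1 for i in range(m)]

regular = [i % n for i in range(m)]                # regular (cyclic) assignment
rng = random.Random(0)
randomized = [rng.randrange(n) for _ in range(m)]  # uniform random assignment

regular_time = makespan(costs, regular, n)         # 9000: one unit gets every costly step
random_time = makespan(costs, randomized, n)       # close to the ideal 12000 / 4 = 3000
```

  • The regular assignment is only this bad when the cost pattern happens to align with the assignment period; the point of the random assignment is that it does not depend on such structure in the input data.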
  • Next, the arithmetic circuit 40 of the present embodiment is compared with the arithmetic circuit of the above-mentioned International Publication No. 2019/053835 (Patent Document 1).
  • In the arithmetic circuit 40 of the present embodiment, each processing step is uniformly and randomly assigned to one of the arithmetic units. Therefore, it is not necessary to estimate the arithmetic processing amount of each processing step in advance based on a search for non-zero elements. As a result, the time required for preprocessing can be shortened compared with the arithmetic circuit of Patent Document 1, which in turn shortens the entire processing time including preprocessing.
  • Embodiment 2 [Outline of Embodiment 2]
  • The arithmetic circuit 40 of the first embodiment randomly assigns each processing step for the input data to one of the arithmetic units based on a uniform random number.
  • In that case, the number of processing steps assigned to each arithmetic unit varies depending on the random numbers generated, so the processing time of each arithmetic unit may also vary.
  • In the second embodiment, the CPU 32 therefore randomly assigns each processing step to one of the arithmetic units while satisfying the condition that the numbers of processing steps assigned to the arithmetic units are substantially equal, that is, that they differ by at most one.
  • FIG. 12 is a flowchart showing another realization method of the process in step S110 of FIG. Similar to the case of FIGS. 4 and 5 of the first embodiment, the parallel processing calculation unit 35 includes n arithmetic units 44_1 to 44_n. Further, the arithmetic processing executed by the arithmetic circuit 40 includes m (m> n) processing steps capable of parallel processing. The m processing steps are assigned identification numbers from the first to the mth.
  • In the flowchart of FIG. 12, the count of random number generation processes is i (1 ≤ i ≤ m, where i is an integer), and the identification number of the arithmetic unit is j (1 ≤ j ≤ n, where j is an integer).
  • The first through m-th random number generation processes are executed in order.
  • In step S400 of FIG. 12, the CPU 32 initializes both the count i of random number generation processes and the identification number j of the arithmetic unit to 1.
  • In the next step S410, the CPU 32 generates, within the range of 1 to m, an integer uniform random number r(i) that differs from every random number already generated.
  • Hereinafter, an integer uniform random number is simply referred to as an integer random number.
  • In the next step S420, the CPU 32 assigns the r(i)-th processing step to the j-th arithmetic unit 44_j using the random number r(i) generated the i-th time.
  • In the next step S430, the CPU 32 increments both the count i of random number generation processes and the identification number j of the arithmetic unit by 1.
  • If the identification number j of the arithmetic unit now exceeds n (YES in step S440), the CPU 32 resets j to 1 (step S450).
  • Steps S410, S420, and S430 are repeated until the count i of random number generation processes exceeds m (until YES is obtained in step S460).
  • As a result, the first through m-th processing steps are assigned almost evenly to the first arithmetic unit 44_1 through the n-th arithmetic unit 44_n, that is, so that the numbers of processing steps per arithmetic unit differ by at most one.
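  • The procedure of FIG. 12 can be sketched as follows. This is an illustration under assumptions: the function name `assign_balanced` is hypothetical, and drawing from a shrinking list of unused step numbers is one way (swapped in here) to obtain a uniform random number that differs from those already generated.

```python
import random


def assign_balanced(m, n, rng):
    """Return {unit id: [processing step ids]} with per-unit counts differing by <= 1."""
    unused = list(range(1, m + 1))         # step numbers not yet drawn
    units = {j: [] for j in range(1, n + 1)}
    i, j = 1, 1                            # S400: initialize the count i and unit id j
    while i <= m:                          # S460: stop once i exceeds m
        # S410: draw an integer in 1..m that has not been generated before.
        r_i = unused.pop(rng.randrange(len(unused)))
        units[j].append(r_i)               # S420: step r(i) goes to unit j
        i += 1                             # S430: advance the count ...
        j += 1                             # ... and the unit id
        if j > n:                          # S440: unit id ran past the last unit
            j = 1                          # S450: wrap around to the first unit
    return units


units = assign_balanced(m=9, n=4, rng=random.Random(0))
sizes = [len(steps) for steps in units.values()]   # [3, 2, 2, 2]
```

  • The arithmetic units are visited in a fixed cyclic order, so the counts are balanced by construction, while the choice of which processing step lands on which unit remains random.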
  • FIG. 13 is a diagram showing an example in which each processing step of the convolution operation shown in FIG. 6 is assigned to each arithmetic unit by the procedure shown in FIG. 12 in a table format. As shown in FIG. 13, each processing step is assigned to the arithmetic unit based on a non-overlapping integer random number from 1 to 9.
  • Specifically, the fifth processing step is assigned to the first arithmetic unit 44_1 using the integer random number 5 generated the first time. Subsequently, using the integer random numbers 1, 6, and 4 generated the second to fourth times, the first, sixth, and fourth processing steps are assigned to the second arithmetic unit 44_2 through the fourth arithmetic unit 44_4, respectively.
  • The assignment continues in the same way, and finally the third processing step is assigned to the first arithmetic unit 44_1 using the integer random number 3 generated the ninth time.
  • FIG. 14 is a diagram showing the processing time of each arithmetic unit in a table format in the processing step allocation example shown in FIG.
  • The processing steps can thus be assigned so that the number of processing steps assigned to each of the first to fourth arithmetic units 44_1 to 44_4 is almost equal, that is, two or three. As a result, variation in the processing time among the arithmetic units can be suppressed.
  • FIG. 15 is a diagram conceptually showing the difference between the processing step allocation method shown in FIG. 9 and the processing step allocation method shown in FIG. 13. FIG. 15(A) conceptually shows the allocation method of the first embodiment shown in FIG. 9, and FIG. 15(B) conceptually shows the allocation method of the second embodiment shown in FIG. 13.
  • In the first embodiment, the processing steps are selected in order from the first processing step to the ninth processing step, and an arithmetic unit is assigned to the selected processing step.
  • The arithmetic unit to be assigned is randomly selected using a uniform random number.
  • In the second embodiment, by contrast, the arithmetic units are selected cyclically in order from the first arithmetic unit to the fourth arithmetic unit, and a processing step is assigned to the selected arithmetic unit. The processing step assigned to each arithmetic unit is randomly selected using integer uniform random numbers generated without repetition in the range of 1 to m.
  • In other words, the arithmetic units are selected periodically in a fixed order,
  • while the processing steps assigned to them are selected randomly using uniform random numbers generated so as not to overlap.
  • As a result, the m processing steps can be allocated to the n arithmetic units almost evenly, that is, so that the numbers of processing steps per arithmetic unit differ by at most one, without losing randomness.
  • Embodiment 3 Outline of Embodiment 3
  • In the first and second embodiments, the processing performance of each arithmetic unit 44 constituting the parallel processing calculation unit 35 is substantially the same.
  • In the third embodiment, a case where processing performance differs among the arithmetic units 44 will be described.
  • In this case, the identification number of an arithmetic unit is randomly generated at a frequency proportional to the processing performance of each of the n arithmetic units, and the arithmetic unit corresponding to the generated identification number is assigned to each processing step.
  • As a result, the processing times of the arithmetic units can be made almost equal.
  • To generate random numbers that follow such a non-uniform distribution, the inverse function method or the von Neumann rejection method can be used, for example.
  • Any other known method may also be used.
  • In the inverse function method, a distribution function is assumed whose domain is the identification numbers 1 to n of the arithmetic units 44_1 to 44_n and whose values are the processing performance of each arithmetic unit, and the cumulative distribution function of this distribution function is denoted F. A new random number generation function is then obtained by applying the inverse function $F^{-1}$ of this cumulative distribution function to the uniform random number generation function.
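  • A minimal sketch of the inverse function method for this discrete case is shown below. The function name `make_unit_sampler` and the performance values are hypothetical; the cumulative sums play the role of F, and the binary search plays the role of applying $F^{-1}$ to a uniform random number.

```python
import bisect
import itertools
import random


def make_unit_sampler(performance, rng):
    """Return a function that draws unit ids 1..n with frequency proportional to performance."""
    cumulative = list(itertools.accumulate(performance))   # unnormalized F
    total = cumulative[-1]
    n = len(performance)

    def sample():
        u = rng.random() * total                   # uniform random number in [0, total)
        index = bisect.bisect_right(cumulative, u)  # first unit whose F value exceeds u
        return min(index, n - 1) + 1               # clamp guards against rounding at the top edge

    return sample


# Hypothetical relative performance of arithmetic units 1..4.
rng = random.Random(0)
sample_unit = make_unit_sampler([1, 1, 2, 5], rng)
draws = [sample_unit() for _ in range(10000)]
```

  • With these assumed weights, arithmetic unit 4 is expected to be drawn about five times as often as arithmetic unit 1.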
  • FIG. 16 is a flowchart showing still another method of realizing the process in step S110 of FIG. 4. As in the case of FIGS. 4 and 5 of the first embodiment, the parallel processing calculation unit 35 includes n arithmetic units 44_1 to 44_n, and the arithmetic processing executed by the arithmetic circuit 40 includes m (m > n) processing steps capable of parallel processing. In the flowchart of FIG. 16, the identification number of a processing step is i (1 ≤ i ≤ m, where i is an integer), and the first through m-th processing steps are handled in order.
  • In step S500 of FIG. 16, the CPU 32 initializes the identification number i of the processing step to 1.
  • In the next step S510, the CPU 32 generates an integer random number in the range of 1 to n, corresponding to the identification numbers of the n arithmetic units, at a frequency proportional to the processing performance of the arithmetic units 44_1 to 44_n.
  • For example, the above-mentioned inverse function method is used to generate such an integer random number. Let the generated integer random number be r(i); 1 ≤ r(i) ≤ n holds.
  • In the next step S520, the CPU 32 assigns the i-th processing step to the r(i)-th arithmetic unit 44_r(i) using the generated integer random number r(i).
  • In the next step S530, the CPU 32 increments the identification number i of the processing step by 1.
  • Steps S510 to S530 are repeated until the identification number i of the processing step exceeds m (until YES is obtained in step S540).
  • As a result, each of the m processing steps is assigned to one of the n arithmetic units 44.
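  • The procedure of FIG. 16 can be sketched as follows. As assumptions of this sketch, the function name `assign_weighted` and the performance values are hypothetical, and Python's `random.Random.choices` with a `weights` argument is swapped in for the inverse function method as the source of performance-proportional random numbers.

```python
import random


def assign_weighted(m, performance, rng):
    """Return a list whose (i-1)-th entry is the unit (1..n) for processing step i."""
    unit_ids = list(range(1, len(performance) + 1))
    assignment = []
    i = 1                                                  # S500
    while i <= m:                                          # S540: stop once i exceeds m
        # S510: unit id drawn with frequency proportional to its performance.
        r_i = rng.choices(unit_ids, weights=performance, k=1)[0]
        assignment.append(r_i)                             # S520: step i goes to unit r(i)
        i += 1                                             # S530
    return assignment


# Hypothetical relative performance of arithmetic units 1..4.
performance = [1, 1, 2, 5]
assignment = assign_weighted(m=9000, performance=performance, rng=random.Random(0))
counts = [assignment.count(unit) for unit in range(1, 5)]
```

  • Dividing each entry of `counts` by the corresponding performance value gives a rough estimate of per-unit processing time, which should come out approximately equal when the processing steps have similar cost.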
  • FIG. 17 is a diagram conceptually showing the method of allocating the processing steps shown in FIG.
  • In this example, five processing steps 2, 3, 5, 7, and 9 are assigned to the arithmetic unit 4, which has the highest processing performance, and two processing steps 1 and 6 are assigned to the arithmetic unit 3, which has the next highest processing performance.
  • Processing steps 4 and 8 are assigned to the arithmetic units 1 and 2, respectively, which have low processing performance.
  • As described above, in the third embodiment the CPU 32 randomly generates the identification number of an arithmetic unit at a frequency proportional to the processing performance of each of the n arithmetic units, and assigns the arithmetic unit corresponding to the generated identification number to each processing step. As a result, more processing steps are assigned to arithmetic units with higher processing performance, which suppresses variation in the processing time among the arithmetic units.
  • Embodiment 4. In the fourth embodiment, a design method for the arithmetic circuit of the third embodiment is described. Specifically, a design method is presented that can optimize both the processing speed and the circuit area of the entire arithmetic circuit, taking into account the differences in processing speed and circuit area among the arithmetic units.
  • Suppose that, if the arithmetic circuit were configured using all n arithmetic units, which differ from one another in processing performance and circuit area, the area of the entire arithmetic circuit would exceed the allowable range. In this case, the arithmetic units to be incorporated in the arithmetic circuit must be selected so that the area of the entire arithmetic circuit falls within the allowable range. If arithmetic units are simply excluded in descending order of circuit area, the processing speed may fail to meet the specifications. Therefore, both the processing speed and the circuit area of the entire arithmetic circuit need to be optimized.
  • A specific description is given below with reference to the drawings.
  • FIG. 18 is a flowchart showing the design procedure of the arithmetic circuit.
  • the design procedure of FIG. 18 is executed, for example, by the CPU of the design support device.
  • In step S600 of FIG. 18, the CPU selects, based on the circuit area of each of the arithmetic units 1 to n, at least one arithmetic unit that is not to be included in the arithmetic circuit. The selection is made so that the area of the entire arithmetic circuit falls within the allowable range.
  • The CPU sets the selected arithmetic unit or units as prohibited from allocation.
  • the arithmetic processing executed by the arithmetic circuit includes m (m> n) processing steps capable of parallel processing.
  • The identification number of the processing step is i (where 1 ≤ i ≤ m and i is an integer), and the first processing step to the m-th processing step are selected in order.
  • The CPU initializes the identification number i of the processing step to 1.
  • In the next step S620, the CPU generates integer random numbers in the range of 1 to n, as identification numbers of the n arithmetic units, at a frequency proportional to the processing performance of each of the arithmetic units 1 to n.
  • To generate such integer random numbers, for example, the above-mentioned inverse function method or the von Neumann rejection method is used. Let the generated integer random number be r(i); 1 ≤ r(i) ≤ n holds.
  • In step S630, the CPU determines whether allocation to the r(i)-th arithmetic unit is prohibited. If allocation is prohibited (YES in step S630), the CPU returns the process to step S620. If allocation is permitted (NO in step S630), the CPU advances the process to step S640.
  • In step S640, the CPU allocates the i-th processing step to the r(i)-th arithmetic unit using the generated integer random number r(i).
  • In step S650, the CPU increments the identification number i of the processing step by 1.
  • Steps S620 to S650 are repeated until the identification number i of the processing step exceeds m (until YES is obtained in step S660).
  • As a result, each of the m processing steps is assigned to one of the arithmetic units other than those for which allocation is prohibited.
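The loop of steps S620 to S660, including the redraw when a prohibited unit is generated, might look like the following Python sketch. The function name `allocate_with_prohibited`, the example performance values, and the use of `random.Random.choices` for the weighted draw are assumptions for illustration; the text itself names the inverse function method or the von Neumann rejection method for generating the random numbers.

```python
import random


def allocate_with_prohibited(m, performance, prohibited, rng):
    """Assign steps 1..m to units 1..n, never using a prohibited unit."""
    n = len(performance)
    units = list(range(1, n + 1))
    # Without at least one usable unit the redraw loop would never end.
    if all(k in prohibited or performance[k - 1] <= 0 for k in units):
        raise ValueError("no arithmetic unit is available for allocation")
    allocation = {}
    i = 1                                    # initialize the step number
    while i <= m:                            # S660: stop once i exceeds m
        # S620: draw a unit number with frequency proportional to performance
        r = rng.choices(units, weights=performance, k=1)[0]
        if r in prohibited:                  # S630: YES, so draw again
            continue
        allocation[i] = r                    # S640: step i goes to unit r(i)
        i += 1                               # S650
    return allocation


if __name__ == "__main__":
    # Hypothetical setup: four units, unit 2 excluded because of its area.
    result = allocate_with_prohibited(9, [1, 1, 2, 5], {2}, random.Random(0))
    print(result)
```

Redrawing until a permitted unit appears leaves the relative frequencies of the remaining units unchanged, so the allocation among them stays proportional to their performance.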
  • The CPU calculates the time required for the arithmetic processing, by simulation or the like, based on the allocation result of the above processing steps.
  • In step S680, when the CPU is to set a different arithmetic unit as prohibited from allocation (YES in step S680), the process returns to step S600 and each of the above steps is repeated.
  • The CPU may allocate processing steps and calculate the processing time for every combination of arithmetic units for which the area of the entire arithmetic circuit fits within the allowable area.
  • The CPU selects the combination of arithmetic units that gives the shortest processing time as the arithmetic units to be incorporated in the arithmetic circuit. This makes it possible to optimize both the processing speed and the circuit area of the entire arithmetic circuit.
  • The above arithmetic circuit design method can be summarized in the following procedures (i) to (iv).
  • Procedures (i) to (iv) are realized, for example, by causing a computer serving as a design support device to execute a program.
  • (i) The computer determines a combination of a plurality of arithmetic units, which differ from one another in processing performance and circuit area, so that the total circuit area is equal to or less than a predetermined upper limit value (step S600).
  • (ii) For each of the m processing steps, the computer randomly generates one of the identification numbers of the arithmetic units constituting the combination, at a frequency proportional to the processing performance of each arithmetic unit constituting the combination. The computer then assigns each processing step to the arithmetic unit corresponding to the generated identification number (steps S610 to S660).
  • (iii) The computer estimates the processing time of the m processing steps based on the result of allocating the m processing steps to the arithmetic units constituting the combination (step S670).
  • (iv) The computer determines a plurality of combinations of arithmetic units by executing procedure (i) a plurality of times, and estimates the processing time of the m processing steps for each of the combinations by executing procedures (ii) and (iii) for each of them (YES in step S680).
  • The computer then determines the combination of arithmetic units having the shortest processing time as the arithmetic units to be used in the arithmetic circuit (step S690).
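The design flow of steps S600 to S690 can be sketched as an exhaustive search over unit combinations. Everything concrete in this Python example is an assumption for illustration: the names `estimate_time` and `choose_units`, the cost model (step time equals work divided by unit performance, units run fully in parallel), and the sample numbers. The source only says that the processing time is obtained "by simulation or the like". The draw is restricted to the units in the combination, which gives the same distribution as drawing over all units and redrawing on a prohibited one.

```python
import itertools
import random


def estimate_time(allocation, work, performance):
    """S670: units run in parallel, so the most heavily loaded unit sets the time."""
    load = {}
    for step, unit in allocation.items():
        load[unit] = load.get(unit, 0.0) + work[step - 1] / performance[unit - 1]
    return max(load.values())


def choose_units(work, performance, area, max_area, rng):
    """Return (time, combination, allocation) for the fastest feasible combination."""
    n = len(performance)
    best = None
    for size in range(1, n + 1):
        for combo in itertools.combinations(range(1, n + 1), size):
            # S600: keep only combinations whose total area fits the limit
            if sum(area[k - 1] for k in combo) > max_area:
                continue
            # S610 to S660: performance-proportional random allocation
            weights = [performance[k - 1] for k in combo]
            picks = rng.choices(combo, weights=weights, k=len(work))
            allocation = {i + 1: r for i, r in enumerate(picks)}
            t = estimate_time(allocation, work, performance)   # S670
            if best is None or t < best[0]:                    # S690: keep the shortest
                best = (t, combo, allocation)
    return best                                                # None if nothing fits


if __name__ == "__main__":
    work = [3, 1, 4, 1, 5, 9, 2, 6, 5]      # hypothetical work per processing step
    performance = [1, 1, 2, 5]               # hypothetical speed of units 1..4
    area = [1, 4, 1, 3]                      # hypothetical area of units 1..4
    print(choose_units(work, performance, area, 5, random.Random(0)))
```

Because each combination is evaluated with a single random allocation, the estimate is noisy; averaging several allocations per combination would be one way to stabilize the comparison.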
  • The arithmetic unit 4 has the highest processing performance, the arithmetic unit 3 has the second highest, and the arithmetic units 1 and 2 have low processing performance.
  • The arithmetic unit 2 has the largest circuit area, the arithmetic unit 4 has the second largest, and the arithmetic units 1 and 3 have small circuit areas.
  • No processing step is assigned to the arithmetic unit 2, which has the largest circuit area.
  • Processing steps are assigned to the other arithmetic units 1, 3, and 4 according to their processing speed. Specifically, five processing steps 2, 3, 5, 7, and 9 are assigned to the arithmetic unit 4 having the highest processing performance, and two processing steps 1 and 6 are assigned to the arithmetic unit 3 having the next highest processing performance. Processing steps 4 and 8 are assigned to the arithmetic unit 1 having low processing performance.
  • The processing time of the entire arithmetic processing is calculated based on the arithmetic processing amount of each processing step and the processing speed of the arithmetic units 1, 3, and 4.
  • The combination of arithmetic units incorporated in the arithmetic circuit is determined so that the area of the entire arithmetic circuit is within the allowable range and the processing time is the shortest.
  • Embodiment 5. In the fifth embodiment, the random number generation method described in the second embodiment is applied to the optimization of the circuit layout of logic cells.
  • LSI Large Scale Integration
  • a logic cell composed of an arithmetic unit is randomly assigned to a plurality of circuit areas in a semiconductor chip.
  • the number of logical cells assigned to each circuit area varies depending on the generation of random numbers.
  • the circuit area may vary from circuit area to circuit area.
  • The allocation is therefore made so as to satisfy the condition that the number of logic cells assigned to each circuit area is almost equal, that is, that the numbers of logic cells in the circuit areas differ by at most one.
  • The circuit areas are selected cyclically in a fixed order.
  • The logic cells assigned to each circuit area are selected randomly, using uniform random numbers generated so as not to repeat.
  • In this way, n logic cells can be allocated almost evenly to m circuit areas, that is, with the numbers of logic cells in the circuit areas differing by at most one, without losing randomness.
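The combination of cyclic area selection and non-repeating random cell selection can be illustrated with a short Python sketch. The function name `assign_cells` and the use of a shuffle to produce non-repeating uniform random picks are assumptions for this example; the source does not specify how the non-overlapping random numbers are generated.

```python
import random


def assign_cells(n_cells, m_areas, rng):
    """Distribute logic cells 1..n_cells over circuit areas 1..m_areas.

    Areas are visited cyclically in a fixed order; the cell placed at each
    visit is a uniformly random cell that has not been placed yet.
    """
    cells = list(range(1, n_cells + 1))
    rng.shuffle(cells)                       # random order with no repeats
    assignment = {a: [] for a in range(1, m_areas + 1)}
    for idx, cell in enumerate(cells):
        area = idx % m_areas + 1             # cyclic selection: 1, 2, ..., m, 1, ...
        assignment[area].append(cell)
    return assignment


if __name__ == "__main__":
    layout = assign_cells(10, 4, random.Random(0))
    for area, cells in layout.items():
        print(f"circuit area {area}: logic cells {cells}")
```

Since the areas are filled round-robin, the cell counts of any two areas differ by at most one, while which cells land in which area remains random.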

Abstract

A computation circuit (40) performs computation processing on input data (42). The computation processing on the input data includes m processing steps which can be processed in parallel to each other. The computation processing amount of each of the m processing steps differs from the computation processing amount of at least one of the remaining processing steps. The computation circuit (40) comprises: n computing units (44) that execute the m processing steps, where n is an integer greater than 1 and less than m; and a control processor (32). The control processor (32) randomly allocates each of the m processing steps to one of the n computing units (44) on the basis of a random number.

Description

Arithmetic circuit, arithmetic method, program, and arithmetic circuit design method
The present disclosure relates to an arithmetic circuit, an arithmetic method, a program for executing the arithmetic method, and a method of designing the arithmetic circuit.
For processing whose processing time grows with the amount of input data, it is common practice to shorten the processing time by parallel processing using a plurality of arithmetic units. In parallel processing, the original processing is divided into partial processes of the same algorithm, and each partial process is computed independently by the corresponding arithmetic unit. The results of the partial processes are then combined with a relatively small amount of computation, yielding a result that is the same as, or approximately the same as, that of the original processing. Parallel processing is well suited to signal processing, certain kinds of artificial intelligence processing (for example, calculating the average of a plurality of random variables that follow the same distribution), and the like.
One problem with parallel processing is that the amount of data input to each arithmetic unit varies. When the amount of input data varies among the arithmetic units, the overall processing time is determined by the processing time of the arithmetic unit with the largest amount of input data.
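This bottleneck can be written compactly. The symbols below are introduced only for illustration and do not appear in the original: $S_k$ is the set of partial processes assigned to arithmetic unit $k$, $t_i$ is the time needed for partial process $i$ (assuming identical arithmetic units), and $n$ is the number of arithmetic units.

```latex
T_k = \sum_{i \in S_k} t_i ,
\qquad
T_{\mathrm{total}} = \max_{1 \le k \le n} T_k \;\ge\; \frac{1}{n} \sum_{i} t_i
```

The bound on the right is reached only when every $T_k$ is equal, which is why uneven loads lengthen the overall processing time.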
International Publication No. 2019/053835 (Patent Document 1) discloses one method for solving the above problem. Specifically, according to this document, in the multiplication of a coefficient matrix and an input vector, the multiplication of each non-zero element of the coefficient matrix by the corresponding element of the input vector is taken as a processing unit, and these multiplications are assigned to the arithmetic units so that the number of processing units in each arithmetic unit is leveled.
International Publication No. 2019/053835
Although the calculation method described in International Publication No. 2019/053835 (Patent Document 1) is effective in shortening the processing time of the multiplication, determining the processing to be assigned to each arithmetic unit takes time and effort. This is because the number of processing units to be assigned to each arithmetic unit is not fixed until the search for non-zero elements has been completed for all input data. As a result, the circuit scale for the preprocessing that allocates arithmetic processing to the arithmetic units increases, and the overall processing time also increases.
The present disclosure has been made in view of the above problem, and one of its objects is to provide an arithmetic circuit, an arithmetic method, and a program for executing the arithmetic method that can shorten the processing time of the entire arithmetic processing by suppressing, by a relatively simple method, the variation in processing time among arithmetic units.
The arithmetic circuit of one embodiment performs arithmetic processing on input data. The arithmetic processing on the input data includes m processing steps that can be processed in parallel with one another. The arithmetic processing amount of each of the m processing steps differs from that of at least one of the other processing steps. The arithmetic circuit includes n arithmetic units that execute the m processing steps, where n is an integer of 2 or more and smaller than m, and a control processor. The control processor randomly assigns each of the m processing steps to one of the n arithmetic units based on a random number.
According to the above embodiment, randomly assigning each of the m processing steps to one of the n arithmetic units based on a random number suppresses the variation in processing time among the arithmetic units and shortens the processing time of the entire arithmetic processing.
FIG. 1 is a block diagram showing an example of an image recognition system including the arithmetic circuit according to Embodiment 1.
FIG. 2 is a block diagram showing an example of the configuration of the signal input unit of FIG. 1.
FIG. 3 is a flowchart showing an outline of arithmetic processing using the arithmetic circuit of FIG. 1.
FIG. 4 is a flowchart showing details of arithmetic processing using the arithmetic circuit of FIG. 1.
FIG. 5 is a flowchart showing in detail an example of the processing in step S110 of FIG. 4.
FIG. 6 is a diagram conceptually showing the convolution operation in a convolution layer of a convolutional neural network.
FIG. 7 is a diagram showing, in tabular form, an example of the processing time of each processing step of the convolution operation shown in FIG. 6.
FIG. 8 is a diagram showing, in tabular form, the processing time of each arithmetic unit when the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units in a regular manner.
FIG. 9 is a diagram showing, in tabular form, an example in which the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units by the procedure shown in FIG. 5.
FIG. 10 is a diagram showing, in tabular form, the processing time of each arithmetic unit in the allocation example shown in FIG. 9.
FIG. 11 is a diagram for explaining the effect of Embodiment 1.
FIG. 12 is a flowchart showing another way of realizing the processing in step S110 of FIG. 4.
FIG. 13 is a diagram showing, in tabular form, an example in which the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units by the procedure shown in FIG. 12.
FIG. 14 is a diagram showing, in tabular form, the processing time of each arithmetic unit in the allocation example shown in FIG. 13.
FIG. 15 is a diagram conceptually showing the difference between the allocation method shown in FIG. 9 and the allocation method shown in FIG. 13.
FIG. 16 is a flowchart showing yet another way of realizing the processing in step S110 of FIG. 4.
FIG. 17 is a diagram conceptually showing the method of allocating the processing steps shown in FIG. 16.
FIG. 18 is a flowchart showing the design procedure of the arithmetic circuit.
FIG. 19 is a diagram conceptually showing a specific example of the arithmetic circuit design method shown in FIG. 18.
Hereinafter, each embodiment is described in detail with reference to the drawings. The description uses the convolution operation in a CNN (Convolutional Neural Network) model as an example, but the arithmetic circuit and arithmetic method of the present disclosure are not limited to convolution operations. In the following description, identical or corresponding parts are given the same reference numerals, and their description may not be repeated.
Embodiment 1.
[Overall configuration of image recognition system]
FIG. 1 is a block diagram showing an example of an image recognition system including an arithmetic circuit according to the first embodiment.
The image recognition system 30 is a system configuration intended for running an image recognition application in, for example, a surveillance camera for person detection or an in-vehicle camera for object detection. The image recognition system 30 has a function of detecting a specific object from image data according to a CNN model. Specifically, the image recognition system 30 performs two-dimensional convolution operations, that is, product-sum operations such as the product of a matrix A and a vector x (Ax + b).
As shown in FIG. 1, the image recognition system 30 includes a signal input unit 31, a CPU (Central Processing Unit) 32, a memory 33, a DMAC (Direct Memory Access Controller) 34, a parallel processing calculation unit 35, a reader/writer 36, and a network interface 37. These components are connected to one another via a bus interconnect 38.
The signal input unit 31 generates image data by converting light incident through an optical system (not shown) into an electric signal. The image data is processed by the arithmetic circuit 40. A configuration example of the signal input unit 31 is described later with reference to FIG. 2.
The CPU 32 functions as a control processor that controls the entire image recognition system 30. The CPU 32 also accesses the dedicated memory 41 inside the parallel processing calculation unit 35.
The memory 33 stores instructions executed by the CPU 32 and control data. The memory 33 includes volatile memory such as DRAM (Dynamic Random Access Memory) and SRAM, and electrically rewritable non-volatile memory such as flash memory.
The DMAC 34 controls direct data transfer, without going through the CPU 32, among the signal input unit 31, the memory 33, and the dedicated memory 41 of the parallel processing calculation unit 35.
The parallel processing calculation unit 35 performs two-dimensional convolution processing. Its internal configuration includes a dedicated memory 41, an input data control unit 43, and n arithmetic units 44_1 to 44_n. The arithmetic units 44_1 to 44_n are referred to as the arithmetic unit 44 when they are referred to collectively or when an unspecified one is meant. In the first embodiment, the n arithmetic units 44 are assumed to have the same processing capacity, or processing capacities that are substantially the same to the extent that differences in processing time among the arithmetic units do not matter.
The n arithmetic units 44 can be programmed in parallel with one another. The number n of arithmetic units 44 is determined depending on the number m of processing steps that can be processed in parallel in the processing program for the input data 42; 2 ≤ n < m holds. The arithmetic units 44 can be implemented by a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), a multi-core processor, or the like.
The dedicated memory 41 stores the input data 42 processed by the n arithmetic units 44 and the result of each arithmetic unit 44. The input data control unit 43 assigns the above processing steps to the arithmetic units 44. Each arithmetic unit 44 executes the arithmetic processing for the processing steps assigned to it.
The CPU 32, the memory 33, and the parallel processing calculation unit 35 together constitute the arithmetic circuit 40. The CPU 32 and the memory 33 may be provided inside the parallel processing calculation unit 35.
The reader/writer 36 writes data or programs to a storage medium and reads data or programs stored in the storage medium. The storage medium stores data or programs non-transitorily, magnetically, optically, or by using semiconductor memory. A CD (Compact Disc), a DVD (Digital Versatile Disc), a Blu-ray disc, a hard disk, a flash memory, or the like can be used as the storage medium.
The network interface 37 is connected to external devices via a network. The program executed by the CPU 32 and the program executed by the arithmetic units 44 may be provided over the network via the network interface 37, or may be provided on a storage medium via the reader/writer 36.
FIG. 2 is a block diagram showing an example of the configuration of the signal input unit of FIG. 1. The block diagram of FIG. 2 shows a configuration example in which CMOS (Complementary Metal Oxide Semiconductor) devices are used as the photoelectric elements, but CCDs (Charge Coupled Devices) or other types of optical sensors may be used instead.
Referring to FIG. 2, the signal input unit 31 includes an optical system (not shown) that condenses light, a sensor array 51, a column ADC (Analog-to-Digital Converter) 52, a vertical scanning circuit 53, a horizontal scanning circuit 54, an output amplifier 55, a frame buffer 56, and a signal processing circuit 57.
The sensor array 51 includes a plurality of photoelectric elements arranged in a matrix. Light entering the signal input unit 31 is focused on the sensor array 51 by the optical system.
The vertical scanning circuit 53 is connected to a plurality of control signal lines extending across the sensor array 51 in the row direction, and drives the readout circuit of each photoelectric element via the control signal lines. The horizontal scanning circuit 54 is connected to a plurality of output signal lines extending across the sensor array 51 in the column direction, and reads an optical signal from each photoelectric element via the output signal lines.
The column ADC 52 converts the optical signal read from each photoelectric element into a digital signal. The output amplifier 55 amplifies the converted digital signal. The frame buffer 56 temporarily stores the amplified digital signal frame by frame. The signal processing circuit 57 removes noise and the like from the digital signal and performs various image corrections.
[Arithmetic processing]
Next, the operation of the arithmetic circuit 40 of FIG. 1 is described. FIG. 3 is a flowchart showing an outline of arithmetic processing using the arithmetic circuit of FIG. 1.
Referring to FIG. 3, the processing executed by the arithmetic circuit 40 is divided into preprocessing S100, which is performed only once at the beginning, and arithmetic processing S200, which is executed repeatedly according to the input data.
As an example, the preprocessing S100 is executed by the CPU 32 according to a program. The preprocessing S100 may instead be executed by another general-purpose CPU. The arithmetic processing S200 is executed mainly by the parallel processing calculation unit 35 according to a program, and the overall control of the arithmetic processing S200 is performed by, for example, the CPU 32.
FIG. 4 is a flowchart showing details of arithmetic processing using the arithmetic circuit of FIG. 1. In the following description, the arithmetic processing on the input data 42 executed by the arithmetic circuit 40 is assumed to include m processing steps that can be processed in parallel. The number of processing steps m is larger than the number n of arithmetic units 44.
Referring to FIG. 4, in step S110 of the preprocessing S100, the CPU 32 assigns each of the m processing steps to one of the n arithmetic units 44 based on a random number. Therefore, each arithmetic unit 44 is assigned at least one of the m processing steps.
In step S210 of the arithmetic processing S200, each arithmetic unit 44 of the parallel processing calculation unit 35 executes a processing step assigned to it. If an arithmetic unit 44 has not yet completed all of its assigned processing steps (NO in step S220), step S210 is repeated. When each arithmetic unit 44 has executed all of its assigned processing steps, the arithmetic processing ends.
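The execution phase (steps S210 and S220) can be mimicked in software as below. This is only a sketch under stated assumptions: Python threads stand in for the hardware arithmetic units 44, and the names `run_unit`, `run_parallel`, and `step_fn` are hypothetical helpers introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_unit(steps, step_fn):
    """S210/S220: one unit executes its assigned steps until none remain."""
    return {i: step_fn(i) for i in steps}


def run_parallel(allocation, n, step_fn):
    """Execute all steps, grouped by the unit chosen in preprocessing S100."""
    per_unit = {k: [] for k in range(1, n + 1)}
    for step, unit in sorted(allocation.items()):
        per_unit[unit].append(step)
    results = {}
    with ThreadPoolExecutor(max_workers=n) as pool:   # one worker per unit
        worker = lambda steps: run_unit(steps, step_fn)
        for partial in pool.map(worker, per_unit.values()):
            results.update(partial)                   # gather per-unit results
    return results


if __name__ == "__main__":
    # Hypothetical allocation of five steps to four units, squaring as the "work".
    allocation = {1: 2, 2: 1, 3: 2, 4: 4, 5: 3}
    print(run_parallel(allocation, 4, lambda i: i * i))
```

The call returns only after every worker has finished, so the total elapsed time tracks the most heavily loaded unit, which is the quantity the allocation step tries to keep small.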
FIG. 5 is a flowchart showing in detail an example of the process in step S110 of FIG. 4. In the flowchart of FIG. 5, the identification number of a processing step is denoted by i (where 1 ≤ i ≤ m and i is an integer), and the first through m-th processing steps are handled in order.
In step S300 of FIG. 5, the CPU 32 initializes the identification number i of the processing step to 1.
In the next step S310, the CPU 32 generates a uniformly distributed random integer in the range from 1 to n as the identification number of one of the n arithmetic units. Let r(i) denote the generated random integer, so that 1 ≤ r(i) ≤ n. A known pseudo-random number generation algorithm, such as the linear congruential method or the multiply-with-carry method, may be used to generate the uniform random number. In the first embodiment, the n arithmetic units are assumed to have the same or substantially the same processing capability, so it is desirable to generate uniform random numbers in which every value appears with equal probability. As described in the third embodiment, when the processing capabilities of the arithmetic units differ, random numbers with a non-uniform distribution need to be generated.
In the next step S320, the CPU 32 uses the generated random integer r(i) to assign the i-th processing step to the r(i)-th arithmetic unit 44_r(i).
In the next step S330, the CPU 32 increments the identification number i of the processing step by 1.
Steps S310 to S330 are repeated until the identification number i of the processing step exceeds m (that is, until the determination in step S340 is YES). As a result, each of the m processing steps is assigned to one of the n arithmetic units 44.
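By way of illustration only, the procedure of FIG. 5 (steps S300 to S340) may be sketched in Python as follows. The function name `assign_uniform`, the dictionary return type, and the use of Python's `random` module as the pseudo-random generator are assumptions made for this sketch; the description itself names the linear congruential and multiply-with-carry methods only as examples.

```python
import random


def assign_uniform(m, n, rng=None):
    """Assign processing steps 1..m to arithmetic units 1..n uniformly at random.

    Returns a dict mapping each step number i to the unit number r(i).
    """
    if rng is None:
        rng = random.Random()
    assignment = {}
    i = 1                          # S300: initialize the step number
    while i <= m:                  # S340: stop once i exceeds m
        r = rng.randint(1, n)      # S310: uniform integer with 1 <= r <= n
        assignment[i] = r          # S320: step i goes to unit r(i)
        i += 1                     # S330: advance to the next step
    return assignment


if __name__ == "__main__":
    # Nine steps and four units, as in the convolution example.
    print(assign_uniform(m=9, n=4, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing assignments.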
[Specific example of arithmetic processing]
The operation of the arithmetic circuit 40 of FIG. 1 is further described below, taking as an example the convolution operation in a convolution layer of a convolutional neural network. The parallel processing calculation unit 35 is assumed to include four arithmetic units 44_1 to 44_4 (n = 4).
FIG. 6 is a diagram conceptually showing the convolution operation in a convolution layer of a convolutional neural network. As shown in FIG. 6, output data 62 is generated by a convolution operation between input data 60 and a kernel 61. A bias is then added to each element of the output data 62, after which an activation function is applied to each element of the output data 62.
In the convolution operation, the kernel 61 is slid over the input data 60 at a fixed interval; at each position, the elements of the kernel 61 are multiplied by the corresponding elements of the input data 60, and the products are summed. Let the interval at which the kernel is slid (that is, the stride) be 1. In this case, the convolution operation executes the operations of equations (1) to (9) below. For simplicity, the bias addition and the activation function are omitted.
$$
\begin{aligned}
y_1 &= x_1 w_1 + x_2 w_2 + x_3 w_3 + x_6 w_4 + x_7 w_5 + x_8 w_6 + x_{11} w_7 + x_{12} w_8 + x_{13} w_9 && (1)\\
y_2 &= x_2 w_1 + x_3 w_2 + x_4 w_3 + x_7 w_4 + x_8 w_5 + x_9 w_6 + x_{12} w_7 + x_{13} w_8 + x_{14} w_9 && (2)\\
y_3 &= x_3 w_1 + x_4 w_2 + x_5 w_3 + x_8 w_4 + x_9 w_5 + x_{10} w_6 + x_{13} w_7 + x_{14} w_8 + x_{15} w_9 && (3)\\
y_4 &= x_6 w_1 + x_7 w_2 + x_8 w_3 + x_{11} w_4 + x_{12} w_5 + x_{13} w_6 + x_{16} w_7 + x_{17} w_8 + x_{18} w_9 && (4)\\
y_5 &= x_7 w_1 + x_8 w_2 + x_9 w_3 + x_{12} w_4 + x_{13} w_5 + x_{14} w_6 + x_{17} w_7 + x_{18} w_8 + x_{19} w_9 && (5)\\
y_6 &= x_8 w_1 + x_9 w_2 + x_{10} w_3 + x_{13} w_4 + x_{14} w_5 + x_{15} w_6 + x_{18} w_7 + x_{19} w_8 + x_{20} w_9 && (6)\\
y_7 &= x_{11} w_1 + x_{12} w_2 + x_{13} w_3 + x_{16} w_4 + x_{17} w_5 + x_{18} w_6 + x_{21} w_7 + x_{22} w_8 + x_{23} w_9 && (7)\\
y_8 &= x_{12} w_1 + x_{13} w_2 + x_{14} w_3 + x_{17} w_4 + x_{18} w_5 + x_{19} w_6 + x_{22} w_7 + x_{23} w_8 + x_{24} w_9 && (8)\\
y_9 &= x_{13} w_1 + x_{14} w_2 + x_{15} w_3 + x_{18} w_4 + x_{19} w_5 + x_{20} w_6 + x_{23} w_7 + x_{24} w_8 + x_{25} w_9 && (9)
\end{aligned}
$$
Each of the operations of equations (1) to (9) corresponds to a processing step that can be executed in parallel with the others. Hereinafter, the processing steps represented by equations (1) to (9) are given identification numbers 1 to 9, respectively.
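The nine sums of equations (1) to (9) can be reproduced with a short sliding-window loop. The sketch below assumes a 5×5 input stored row by row as x_1 to x_25 and a 3×3 kernel stored row by row as w_1 to w_9 with a stride of 1, which is the layout the index pattern of the equations implies. The function name `conv2d_valid` and the sample values are illustrative only.

```python
def conv2d_valid(x, w, stride=1):
    """Convolve a 2-D input x with a 2-D kernel w (no padding).

    Returns the outputs flattened row by row, so element k-1 is y_k.
    """
    rows, cols = len(x), len(x[0])
    k_rows, k_cols = len(w), len(w[0])
    y = []
    for top in range(0, rows - k_rows + 1, stride):
        for left in range(0, cols - k_cols + 1, stride):
            acc = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    acc += x[top + a][left + b] * w[a][b]
            y.append(acc)   # one processing step produces one output element
    return y


if __name__ == "__main__":
    # x_1..x_25 laid out row by row; the values 1..25 are arbitrary sample data.
    x = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
    w = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # keeps only the w_1, w_5 and w_9 terms
    print(conv2d_valid(x, w))
```

Each pass through the two outer loops corresponds to one of the nine independent processing steps, which is why the steps can be distributed over several arithmetic units.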
When the input data 60 contains zero elements, the multiplication of such an element by the corresponding element of the kernel 61 need not be executed. For example, if the values of $x_4$, $x_5$, $x_9$, $x_{13}$ to $x_{18}$, $x_{21}$, $x_{22}$, and $x_{25}$ in the input data 60 are 0, the amount of arithmetic processing varies from one processing step to another.
FIG. 7 is a diagram showing, in table form, an example of the processing time of each processing step of the convolution operation shown in FIG. 6. Since the first embodiment assumes that all arithmetic units have the same processing performance, the processing time of each processing step is proportional to its amount of arithmetic processing. When the input data 60 contains the zero elements listed above, the processing time of each processing step is as shown in FIG. 7.
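How the zero elements make the workload uneven can be illustrated by counting, for each processing step, the multiplications that remain after zero inputs are skipped. The sketch below assumes a 5×5 row-major input and a 3×3 kernel with stride 1, and takes the number of remaining multiplications as a stand-in for processing time. The names `ZERO_INDICES`, `window_indices`, and `nonzero_multiplications` are hypothetical, and the counts are not claimed to reproduce the table of FIG. 7.

```python
# Input indices stated to be zero in the example above (1-based).
ZERO_INDICES = {4, 5, 9, 13, 14, 15, 16, 17, 18, 21, 22, 25}


def window_indices(step, width=5, kernel=3):
    """Return the 1-based input indices used by processing step 1..9."""
    out_cols = width - kernel + 1
    top, left = divmod(step - 1, out_cols)
    return [width * (top + a) + (left + b) + 1
            for a in range(kernel) for b in range(kernel)]


def nonzero_multiplications(step, zeros=ZERO_INDICES):
    """Count the multiplications left after skipping zero inputs."""
    return sum(1 for idx in window_indices(step) if idx not in zeros)


if __name__ == "__main__":
    for step in range(1, 10):
        print(step, nonzero_multiplications(step))
```

Under these assumptions the per-step counts range from 3 to 8, so steps that are nominally identical differ considerably in cost.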
FIG. 8 is a diagram showing, in table form, the processing time of each arithmetic unit when the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units in a regular order. In FIG. 8, processing steps 1 to 4 are assigned to arithmetic units 1 to 4, respectively, processing steps 5 to 8 are likewise assigned to arithmetic units 1 to 4, respectively, and the remaining processing step 9 is assigned to arithmetic unit 1. When processing steps 1 to 9 are assigned to the arithmetic units 44_1 to 44_4 in this regular manner, the processing times of the arithmetic units 44 vary, as shown in FIG. 8. The overall processing time is therefore determined by the arithmetic unit 44_1, which has the longest processing time.
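The regular assignment described for FIG. 8 and the resulting per-unit load can be sketched as follows. The function names are hypothetical. The sample costs are the non-zero multiplication counts implied by the zero elements listed earlier, computed under the assumption of a 5×5 row-major input; they are not taken from FIG. 7 or FIG. 8.

```python
def assign_round_robin(m, n):
    """Assign steps 1..m to units 1..n in fixed cyclic order."""
    return {i: (i - 1) % n + 1 for i in range(1, m + 1)}


def unit_loads(assignment, step_cost, n):
    """Sum the cost of the steps assigned to each unit 1..n."""
    loads = {j: 0 for j in range(1, n + 1)}
    for step, unit in assignment.items():
        loads[unit] += step_cost[step]
    return loads


if __name__ == "__main__":
    # Assumed per-step costs (see the lead-in); NOT the values of FIG. 7.
    step_cost = {1: 8, 2: 5, 3: 3, 4: 5, 5: 4, 6: 4, 7: 3, 8: 4, 9: 4}
    loads = unit_loads(assign_round_robin(9, 4), step_cost, n=4)
    # The slowest unit determines the overall processing time.
    print(loads, "overall time:", max(loads.values()))
```

With these assumed costs, unit 1 receives steps 1, 5, and 9 and carries the largest load, which agrees with the statement that the arithmetic unit 44_1 determines the overall time.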
FIG. 9 is a diagram showing, in table form, an example in which the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units by the procedure shown in FIG. 5. As shown in FIG. 9, in the arithmetic circuit 40 of the first embodiment, the CPU 32 randomly assigns each of processing steps 1 to 9 to one of the arithmetic units 44 based on uniform random numbers.
FIG. 10 is a diagram showing, in table form, the processing time of each arithmetic unit for the assignment example shown in FIG. 9. Compared with the case in which the processing steps are assigned to the arithmetic units 44 in a regular order as shown in FIG. 8, the first embodiment shown in FIG. 10 reduces the variation in processing time among the arithmetic units.
In an example as small as the one above, the variation may actually increase, depending on the random numbers generated. In actual arithmetic processing in the arithmetic circuit 40, however, both the number of processing steps and the number of arithmetic units 44 are far larger than in this example. Randomly assigning each processing step to one of the arithmetic units 44 based on uniform random numbers can therefore suppress the variation in processing time among the arithmetic units 44.
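The tendency described above, namely that random assignment balances better as the workload grows, can be explored with a small Monte Carlo sketch. The imbalance measure (largest load divided by mean load), the function name `imbalance`, and the workload sizes are all assumptions chosen for illustration; no measured result from the disclosure is reproduced here.

```python
import random


def imbalance(costs, n, rng):
    """Randomly assign each cost to one of n units; return max load / mean load.

    A value of 1.0 means perfectly balanced units. Costs must be positive.
    """
    loads = [0.0] * n
    for c in costs:
        loads[rng.randrange(n)] += c     # uniform random choice of unit
    mean = sum(loads) / n
    return max(loads) / mean


if __name__ == "__main__":
    rng = random.Random(0)
    small = [rng.randint(1, 9) for _ in range(9)]         # 9 steps, as in the example
    large = [rng.randint(1, 9) for _ in range(100_000)]   # hypothetical larger workload
    print("m = 9,      n = 4 :", imbalance(small, 4, rng))
    print("m = 100000, n = 64:", imbalance(large, 64, rng))
```

Running the sketch with different seeds gives a feel for how far the small case can stray from balance compared with the large one.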
[Effect of Embodiment 1]
FIG. 11 is a diagram for explaining the effect of the first embodiment. In FIGS. 11(A) and 11(B), it is assumed that, as a result of assigning each processing step to one of the arithmetic units, the processing times of arithmetic units 1 to n are $N_1, \ldots, N_n$, respectively.
FIG. 11(A) shows a case in which the processing steps are assigned to arithmetic units 1 to n in a regular order and the processing times of the arithmetic units consequently vary. As shown in FIG. 11(A), the processing time of the entire arithmetic circuit 40 is determined by the processing time $N_1$ of arithmetic unit 1, which has the longest processing time.
FIG. 11(B) shows a case in which the processing steps are randomly assigned to arithmetic units 1 to n according to uniform random numbers. In this case, the variation in processing time among the arithmetic units can be suppressed, so the processing time of the entire arithmetic circuit 40 can be made shorter than in the case of FIG. 11(A).
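The comparison between FIG. 11(A) and FIG. 11(B) can be stated compactly. Writing $T$ for the processing time of the entire arithmetic circuit 40 (the symbol $T$ is introduced here for convenience and does not appear in the original), and assuming the arithmetic units run in parallel:

$$
T = \max_{1 \le j \le n} N_j \;\ge\; \frac{1}{n} \sum_{j=1}^{n} N_j ,
$$

with equality only when all $N_j$ are equal. For a fixed total amount of work, reducing the spread of the $N_j$ therefore brings $T$ closer to this lower bound.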
Next, the arithmetic circuit 40 of the present embodiment is compared with the arithmetic circuit of the aforementioned International Publication No. 2019/053835 (Patent Document 1).
In the arithmetic circuit of Patent Document 1, arithmetic processing is assigned to the arithmetic units so that the processing amounts of the arithmetic units are leveled. In the case of the product-sum operation described with reference to FIGS. 6 and 7, it is therefore necessary to search in advance for the non-zero elements contained in the input data 60 and to estimate the amount of arithmetic processing of each processing step. As a result, the preprocessing that assigns each processing step to one of the arithmetic units takes time.
In the arithmetic circuit 40 of the present embodiment, by contrast, each processing step is simply assigned to one of the arithmetic units uniformly at random. There is therefore no need to estimate in advance the amount of arithmetic processing of each processing step based on the result of a search for non-zero elements. Consequently, the time required for preprocessing can be made shorter than in the arithmetic circuit of Patent Document 1, which in turn shortens the overall processing time including preprocessing.
Embodiment 2.
[Outline of Embodiment 2]
The arithmetic circuit 40 of the first embodiment randomly assigns each processing step for the input data to one of the arithmetic units based on uniform random numbers. In this case, depending on the random numbers generated, the number of processing steps assigned to each arithmetic unit varies. As a result, the processing time may vary from one arithmetic unit to another.
In the arithmetic circuit of the second embodiment, therefore, the processing steps are distributed among the arithmetic units almost evenly; that is, a condition is imposed that the numbers of processing steps assigned to any two arithmetic units differ by at most one. Subject to this condition, the CPU 32 randomly assigns each processing step to one of the arithmetic units. A specific description is given below with reference to the drawings.
[Arithmetic unit assignment procedure]
FIG. 12 is a flowchart showing another way of implementing the process in step S110 of FIG. 4. As in the case of FIGS. 4 and 5 of the first embodiment, the parallel processing calculation unit 35 includes n arithmetic units 44_1 to 44_n. Further, the arithmetic processing executed by the arithmetic circuit 40 includes m processing steps (m > n) that can be processed in parallel. The m processing steps are given identification numbers from 1 to m.
Let i denote the count of random number generation operations (where 1 ≤ i ≤ m and i is an integer), and let j denote the identification number of an arithmetic unit (where 1 ≤ j ≤ n and j is an integer). In the flowchart of FIG. 12, the first through m-th random number generation operations are executed in order.
In step S400 of FIG. 12, the CPU 32 initializes both the count i of random number generation operations and the identification number j of the arithmetic unit to 1.
In the next step S410, the CPU 32 generates a uniform random integer in the range from 1 to m that is not equal to any random number already generated. Hereinafter in the second embodiment, a uniform random integer is simply called an integer random number unless this would cause confusion. Let r(i) denote the integer random number generated in the i-th operation. Then 1 ≤ r(i) ≤ m, and r(i) is not equal to any of r(1) to r(i-1).
In the next step S420, the CPU 32 uses the integer random number r(i) generated in the i-th operation to assign the r(i)-th processing step to the j-th arithmetic unit 44_j.
In the next step S430, the CPU 32 increments the count i of random number generation operations by 1 and increments the identification number j of the arithmetic unit by 1. When the identification number j exceeds n (YES in step S440), the CPU 32 resets the identification number j to 1 (step S450).
Steps S410, S420, and S430 are repeated until the count i of random number generation operations exceeds m (that is, until the determination in step S460 is YES). As a result, the first through m-th processing steps are assigned to the first arithmetic unit 44_1 through the n-th arithmetic unit 44_n almost evenly, that is, so that the numbers of processing steps assigned to any two arithmetic units differ by at most one.
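As one possible reading of FIG. 12 (steps S400 to S460), the following Python sketch draws the step numbers in random order without repetition and hands them to the arithmetic units cyclically. The function name `assign_balanced` and the list-based way of avoiding repeated draws are assumptions of this sketch; the description does not specify how non-repeating random numbers are to be produced.

```python
import random


def assign_balanced(m, n, rng=None):
    """Assign steps 1..m to units 1..n so that unit step counts differ by at most 1.

    Steps are drawn in random order without repetition (S410) and handed to
    the units in cyclic order 1, 2, ..., n, 1, 2, ... (S420 to S450).
    """
    if rng is None:
        rng = random.Random()
    remaining = list(range(1, m + 1))   # step numbers not yet drawn
    assignment = {}
    j = 1                               # S400: start with unit 1
    for _ in range(m):                  # one draw per count i = 1..m (S460)
        r = remaining.pop(rng.randrange(len(remaining)))   # S410: unused integer
        assignment[r] = j               # S420: step r(i) goes to unit j
        j += 1                          # S430: next unit
        if j > n:                       # S440
            j = 1                       # S450: wrap around to unit 1
    return assignment


if __name__ == "__main__":
    print(assign_balanced(m=9, n=4, rng=random.Random(0)))
```

Because the units are visited cyclically, the step counts per unit are fixed in advance (three, two, two, and two for m = 9 and n = 4); only which steps land on which unit is random.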
[Specific example of arithmetic processing]
The operation of the arithmetic circuit 40 of the second embodiment is further described below using the same example as the convolution operation shown in FIGS. 6 and 7 of the first embodiment. The parallel processing calculation unit 35 includes four arithmetic units 44_1 to 44_4 (n = 4), and the arithmetic processing is divided into nine processing steps (m = 9).
FIG. 13 is a diagram showing, in table form, an example in which the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units by the procedure shown in FIG. 12. As shown in FIG. 13, each processing step is assigned to an arithmetic unit based on non-repeating integer random numbers from 1 to 9.
Specifically, the integer random number 5 generated in the first operation is first used to assign the fifth processing step to the first arithmetic unit 44_1. The integer random numbers 1, 6, and 4 generated in the second to fourth operations are then used to assign the first, sixth, and fourth processing steps to the second to fourth arithmetic units 44_2 to 44_4, respectively.
Next, the integer random numbers 9, 7, 2, and 8 generated in the fifth to eighth operations are used to assign the ninth, seventh, second, and eighth processing steps to the first to fourth arithmetic units 44_1 to 44_4, respectively.
Finally, the integer random number 3 generated in the ninth operation is used to assign the third processing step to the first arithmetic unit 44_1.
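The walkthrough above can be checked mechanically by replaying the quoted sequence of integer random numbers. In the sketch below only the sequence 5, 1, 6, 4, 9, 7, 2, 8, 3 comes from the text; the helper name `assign_from_sequence` is hypothetical.

```python
def assign_from_sequence(sequence, n):
    """Hand the step numbers in `sequence` to units 1..n in cyclic order."""
    per_unit = {j: [] for j in range(1, n + 1)}
    for i, step in enumerate(sequence):
        per_unit[i % n + 1].append(step)   # the i-th draw goes to unit (i mod n) + 1
    return per_unit


if __name__ == "__main__":
    draws = [5, 1, 6, 4, 9, 7, 2, 8, 3]    # integer random numbers quoted in the text
    for unit, steps in assign_from_sequence(draws, n=4).items():
        print(f"arithmetic unit 44_{unit}: steps {steps}")
```

Unit 44_1 ends up with steps 5, 9, and 3, and the other three units with two steps each, matching the description.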
FIG. 14 is a diagram showing, in table form, the processing time of each arithmetic unit for the processing step assignment example shown in FIG. 13. As shown in FIG. 14, the processing steps can be assigned to the first to fourth arithmetic units 44_1 to 44_4 almost evenly, that is, so that each arithmetic unit is assigned two or three processing steps. As a result, the variation in processing time among the arithmetic units can be suppressed.
FIG. 15 is a diagram conceptually showing the difference between the processing step assignment method shown in FIG. 9 and that shown in FIG. 13. FIG. 15(A) conceptually shows the assignment method of the first embodiment shown in FIG. 9, and FIG. 15(B) conceptually shows the assignment method of the second embodiment shown in FIG. 13.
Referring to FIG. 15(A), in the first embodiment, the processing steps are selected in order from the first processing step to the ninth processing step, and an arithmetic unit is assigned to each selected processing step. The arithmetic unit to be assigned is selected at random using a uniform random number.
Referring to FIG. 15(B), in the second embodiment, the arithmetic units are selected cyclically in order from the first arithmetic unit to the fourth arithmetic unit, and a processing step is assigned to each selected arithmetic unit. The processing step assigned to each arithmetic unit is selected at random using uniform random integers generated in the range from 1 to m without repetition.
[Effect of Embodiment 2]
As described above, in the arithmetic circuit of the second embodiment, the arithmetic unit assigned to each processing step is selected cyclically in a fixed order, whereas the processing step assigned to each arithmetic unit is selected at random using uniform random numbers generated without repetition. This allows the m processing steps to be assigned to the n arithmetic units almost evenly, that is, so that the numbers of processing steps assigned to any two arithmetic units differ by at most one, without losing randomness. As a result, the variation in processing time among the arithmetic units can be suppressed.
[Modification]
Step S310 in the flowchart of FIG. 5 may be modified. Specifically, in step S310, the CPU 32 generates integer random numbers in the range from 1 to n without repetition. Once n integer random numbers have been generated, the CPU 32 again generates non-repeating integer random numbers in the range from 1 to n. This procedure is repeated from i = 1 to i = m. This method also yields a result equivalent to that of FIG. 12.
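The modification can be sketched as drawing the unit numbers from successive random permutations of 1 to n while the processing steps are visited in order. The function name `assign_by_unit_permutations` and the shuffle-based implementation are assumptions of this sketch, not the method prescribed by the description.

```python
import random


def assign_by_unit_permutations(m, n, rng=None):
    """Assign steps 1..m in order, drawing units from repeated shuffles of 1..n."""
    if rng is None:
        rng = random.Random()
    assignment = {}
    pool = []                           # unit numbers left in the current round
    for i in range(1, m + 1):
        if not pool:                    # all n numbers used: start a new round
            pool = list(range(1, n + 1))
            rng.shuffle(pool)
        assignment[i] = pool.pop()      # non-repeating draw within the round
    return assignment


if __name__ == "__main__":
    print(assign_by_unit_permutations(m=9, n=4, rng=random.Random(0)))
```

Every complete round gives each unit exactly one step, so the per-unit step counts again differ by at most one.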
Embodiment 3.
[Outline of Embodiment 3]
The first and second embodiments assume that the arithmetic units 44 constituting the parallel processing calculation unit 35 have substantially the same processing performance. The third embodiment describes a case in which the processing performance differs among the arithmetic units 44. In this case, the identification numbers of the arithmetic units are generated at random with a frequency proportional to the processing performance of each of the n arithmetic units, and the arithmetic unit corresponding to the generated identification number is assigned to each processing step. This makes the processing times of the arithmetic units approximately equal.
As a method of generating random numbers with a given frequency distribution, the inverse function method or von Neumann's rejection method can be used, for example. Any other known method may also be used.
Specifically, in the inverse function method, a distribution function is assumed whose domain is the identification numbers 1 to n of the arithmetic units 44_1 to 44_n and whose values are the processing performances of the respective arithmetic units, and the cumulative distribution function of this distribution is denoted by F. A new random number generation function is then obtained by applying the inverse function $F^{-1}$ of this cumulative distribution function to a uniform random number generation function. A specific description is given below with reference to the drawings.
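Written out for this discrete case, the inverse function method may be expressed as follows. The symbols $s_j$ (processing performance of arithmetic unit $j$), $p_j$, and $u$ (a uniform random number on $[0, 1)$) are notational assumptions introduced here; normalizing the performances so that they sum to 1 is likewise an assumption needed to treat them as probabilities.

$$
p_j = \frac{s_j}{\sum_{k=1}^{n} s_k}, \qquad
F(j) = \sum_{k=1}^{j} p_k, \qquad
r = F^{-1}(u) = \min\{\, j \in \{1, \dots, n\} : F(j) > u \,\}.
$$

Since $u$ falls in the interval $[F(j-1), F(j))$ with probability $p_j$ (taking $F(0) = 0$), the identification number $r = j$ is generated with a frequency proportional to $s_j$.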
[Arithmetic unit assignment procedure]
FIG. 16 is a flowchart showing yet another way of implementing the process in step S110 of FIG. 4. As in the case of FIGS. 4 and 5 of the first embodiment, the parallel processing calculation unit 35 includes n arithmetic units 44_1 to 44_n. Further, the arithmetic processing executed by the arithmetic circuit 40 includes m processing steps (m > n) that can be processed in parallel. In the flowchart of FIG. 16, the identification number of a processing step is denoted by i (where 1 ≤ i ≤ m and i is an integer), and the first through m-th processing steps are handled in order.
In step S500 of FIG. 16, the CPU 32 initializes the identification number i of the processing step to 1.
In the next step S510, the CPU 32 generates an integer random number in the range from 1 to n, as the identification number of one of the n arithmetic units, with a frequency proportional to the processing performance of the arithmetic units 44_1 to 44_n. The inverse function method described above is used, for example, to generate such integer random numbers. Let r(i) denote the generated integer random number, so that 1 ≤ r(i) ≤ n.
In the next step S520, the CPU 32 uses the generated integer random number r(i) to assign the i-th processing step to the r(i)-th arithmetic unit 44_r(i).
In the next step S530, the CPU 32 increments the identification number i of the processing step by 1.
Steps S510 to S530 are repeated until the parameter i representing the identification number of the processing step exceeds m (that is, until the determination in step S540 is YES). As a result, each of the m processing steps is assigned to one of the n arithmetic units 44.
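The procedure of FIG. 16 (steps S500 to S540) may be sketched with a discrete inverse-function lookup as follows. The function name `assign_weighted`, the representation of processing performance as a list of relative weights, and the use of `bisect` on an unnormalized cumulative sum are assumptions of this sketch.

```python
import bisect
import itertools
import random


def assign_weighted(m, performance, rng=None):
    """Assign steps 1..m to units with probability proportional to performance.

    `performance[j-1]` is the relative processing performance of unit j.
    """
    if rng is None:
        rng = random.Random()
    cumulative = list(itertools.accumulate(performance))   # unnormalized F
    total = cumulative[-1]
    assignment = {}
    for i in range(1, m + 1):                       # S500, S530, S540
        u = rng.random() * total                    # uniform on [0, total)
        r = bisect.bisect_right(cumulative, u) + 1  # S510: first unit with F > u
        assignment[i] = r                           # S520: step i goes to unit r(i)
    return assignment


if __name__ == "__main__":
    # Hypothetical relative performances of units 1..4.
    print(assign_weighted(m=9, performance=[1, 1, 2, 5], rng=random.Random(0)))
```

Working with the unnormalized cumulative sum avoids a separate normalization pass; scaling the uniform number by the total has the same effect.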
FIG. 17 is a diagram conceptually showing the processing step assignment method shown in FIG. 16. In FIG. 17, the parallel processing calculation unit 35 includes four arithmetic units 1 to 4 (n = 4), and the arithmetic processing includes nine processing steps (m = 9). The fourth arithmetic unit 4 has the highest processing performance, arithmetic unit 3 has the next highest, and arithmetic units 1 and 2 have low processing performance.
As shown in FIG. 17, the five processing steps 2, 3, 5, 7, and 9 are assigned to arithmetic unit 4, which has the highest processing performance, and the two processing steps 1 and 6 are assigned to arithmetic unit 3, which has the next highest. Processing steps 4 and 8 are assigned to arithmetic units 1 and 2, respectively, which have low processing performance. Varying the number of processing steps assigned to each arithmetic unit according to its processing performance in this way suppresses the variation in processing time among the arithmetic units.
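The effect of the assignment of FIG. 17 can be illustrated numerically. In the sketch below the assignment of steps to units is the one stated in the text, whereas the relative performance values and the assumption that every step has the same cost are hypothetical and chosen only to make the point visible.

```python
def unit_times(steps_per_unit, performance, step_cost=1.0):
    """Estimate each unit's time as (number of steps x cost per step) / performance."""
    return {unit: len(steps) * step_cost / performance[unit]
            for unit, steps in steps_per_unit.items()}


if __name__ == "__main__":
    # Assignment described for FIG. 17.
    steps_per_unit = {1: [4], 2: [8], 3: [1, 6], 4: [2, 3, 5, 7, 9]}
    # Hypothetical relative performances (not given in the text).
    performance = {1: 1.0, 2: 1.0, 3: 2.0, 4: 5.0}
    print(unit_times(steps_per_unit, performance))
```

With these assumed values each unit finishes in the same time even though unit 4 handles five steps and units 1 and 2 handle one each.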
[Effect of Embodiment 3]
As described above, in the arithmetic circuit of the third embodiment, the CPU 32 generates the identification numbers of the arithmetic units at random with a frequency proportional to the processing performance of each of the n arithmetic units, and assigns the arithmetic unit corresponding to the generated identification number to each processing step. An arithmetic unit with higher processing performance is thus assigned a larger number of processing steps, so the variation in processing time among the arithmetic units can be suppressed.
Embodiment 4.
The fourth embodiment describes a method of designing the arithmetic circuit of the third embodiment. Specifically, it presents a design method that takes into account the differences in processing speed and in circuit area among the arithmetic units and can optimize both the processing speed and the circuit area of the arithmetic circuit as a whole.
For example, suppose that configuring the arithmetic circuit with all n arithmetic units, which differ in processing speed and circuit area, would cause the area of the entire arithmetic circuit to exceed the allowable range. In this case, the arithmetic units to be incorporated into the arithmetic circuit need to be selected so that the area of the entire arithmetic circuit falls within the allowable range. If arithmetic units were simply excluded from the arithmetic circuit in descending order of circuit area, the processing speed might fail to meet the specification. Both the processing speed and the circuit area of the entire arithmetic circuit therefore need to be optimized. A specific description is given below with reference to FIG. 18.
FIG. 18 is a flowchart showing a procedure for designing the arithmetic circuit. The design procedure of FIG. 18 is executed, for example, by the CPU of a design support device.
 図18のステップS600において、演算器1~nの各々の回路面積に基づいて、CPUは、演算回路に含めない少なくとも1つの演算器を選択する。この場合、演算回路全体の面積が許容範囲にぎりぎり収まるように、少なくとも1つの演算器が選択される。CPUは、選択した少なくとも1つの演算器を割り当て禁止に設定する。 In step S600 of FIG. 18, the CPU selects at least one arithmetic unit not included in the arithmetic circuit based on the circuit area of each of the arithmetic units 1 to n. In this case, at least one arithmetic unit is selected so that the area of the entire arithmetic circuit is within the permissible range. The CPU sets at least one selected arithmetic unit to non-allocation.
 なお、実施の形態3の場合と同様に、演算回路で実行される演算処理は、並列処理が可能なm個(m>n)の処理ステップを含むものとする。処理ステップの識別番号をi(ただし、1≦i≦m、iは整数)とし、第1番目の処理ステップから第m番目の処理ステップまでが順に選択される。 As in the case of the third embodiment, the arithmetic processing executed by the arithmetic circuit includes m (m> n) processing steps capable of parallel processing. The identification number of the processing step is i (however, 1 ≦ i ≦ m, i is an integer), and the first processing step to the mth processing step are sequentially selected.
 次のステップS610において、CPUは、処理ステップの識別番号iを1に初期化する。 In the next step S610, the CPU initializes the identification number i of the processing step to 1.
 その次のステップS620において、CPUは、n個の演算器の識別番号として1からnの範囲の整数乱数を、演算器1~nの処理性能に比例した頻度で生成する。このような整数乱数の生成には、たとえば、前述の逆関数法またはフォンノイマンの棄却法などが用いられる。生成された整数乱数をr(i)とする。1≦r(i)≦nが成り立つ。 In the next step S620, the CPU generates integer random numbers in the range of 1 to n as identification numbers of n arithmetic units at a frequency proportional to the processing performance of the arithmetic units 1 to n. For the generation of such an integer random number, for example, the above-mentioned inverse function method or the von Neumann rejection method is used. Let the generated integer random number be r (i). 1 ≦ r (i) ≦ n holds.
In the next step S630, the CPU determines whether allocation to the r(i)-th arithmetic unit is prohibited. If allocation is prohibited (YES in step S630), the CPU returns the process to step S620. If allocation is permitted (NO in step S630), the CPU advances the process to step S640.
In step S640, the CPU uses the generated integer random number r(i) to assign the i-th processing step to the r(i)-th arithmetic unit. In the next step S650, the CPU increments the identification number i of the processing step by 1.
Steps S620 to S650 are repeated until the parameter i representing the identification number of the processing step exceeds m (that is, until the determination in step S660 is YES). As a result, each of the m processing steps is assigned to one of the arithmetic units other than those for which allocation is prohibited.
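The loop of steps S610 to S660 can be sketched as follows. This is a minimal illustration, not the implementation of the design support device: the function name `assign_steps`, the use of a set for the allocation-prohibited units, and the use of Python's `random.choices` in place of the inverse function method or the rejection method are all assumptions.

```python
import random


def assign_steps(m, performance, prohibited, rng=random):
    """Assign steps 1..m to unit IDs 1..n, skipping prohibited units.

    Hypothetical helper: performance[k - 1] is the performance of unit k,
    and `prohibited` is a set of unit IDs excluded in step S600.
    """
    n = len(performance)
    unit_ids = list(range(1, n + 1))
    allowed_weight = sum(w for u, w in zip(unit_ids, performance)
                         if u not in prohibited)
    if allowed_weight <= 0:
        raise ValueError("no usable arithmetic unit")  # avoids an endless loop

    assignment = {}
    i = 1                                                   # S610
    while i <= m:                                           # S660
        r = rng.choices(unit_ids, weights=performance)[0]   # S620
        if r in prohibited:                                 # S630: YES
            continue                                        # draw again
        assignment[i] = r                                   # S640
        i += 1                                              # S650
    return assignment
```

For example, `assign_steps(8, [1, 1, 2, 4], {2})` returns a dictionary that maps each of the eight processing steps to arithmetic unit 1, 3, or 4.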
In the next step S670, the CPU calculates, by simulation or the like, the time required for the arithmetic processing based on the above assignment of the processing steps.
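One simple way to estimate the time in step S670 is sketched below. The text leaves the calculation to "simulation or the like", so the model used here is an assumption: the arithmetic units run in parallel, the processing steps assigned to one unit run one after another, and the time of a step is its processing amount divided by the processing speed of its unit. The name `estimate_processing_time` and the dictionary arguments are hypothetical.

```python
def estimate_processing_time(assignment, workload, speed):
    """Estimate the total time as the busy time of the slowest-finishing unit.

    assignment: {step ID: unit ID}
    workload:   {step ID: processing amount}      (assumed input)
    speed:      {unit ID: processing amount per unit time}  (assumed input)
    """
    busy = {}
    for step, unit in assignment.items():
        # Steps on the same unit are assumed to execute sequentially.
        busy[unit] = busy.get(unit, 0.0) + workload[step] / speed[unit]
    # Units operate in parallel, so the longest busy time dominates.
    return max(busy.values(), default=0.0)
```

Transfer time, memory contention, and control overhead are ignored in this sketch; a real simulation would likely account for them.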
In the next step S680, if a different arithmetic unit is to be set to allocation-prohibited, the CPU returns the process to step S600 and repeats the above steps. For example, the CPU may assign the processing steps and calculate the processing time for every combination of arithmetic units for which the area of the entire arithmetic circuit just fits within the permissible area.
In the next step S690, the CPU selects the combination of arithmetic units that gives the shortest processing time as the arithmetic units to be incorporated in the arithmetic circuit. This allows both the processing speed and the circuit area of the entire arithmetic circuit to be optimized.
The above method for designing the arithmetic circuit can be summarized as the following procedures (i) to (iv). Procedures (i) to (iv) are realized, for example, by causing a computer serving as a design support device to execute a program.
(i) From a plurality of arithmetic units that differ from one another in processing performance and circuit area, the computer determines a combination of arithmetic units such that the total circuit area is equal to or less than a predetermined upper limit (step S600).
(ii) For each of the m processing steps, the computer randomly generates one of the identification numbers of the arithmetic units constituting the combination, at a frequency proportional to the processing performance of each arithmetic unit constituting the combination. The computer then assigns each processing step to the arithmetic unit corresponding to the generated identification number (steps S610 to S660).
(iii) The computer estimates the processing time of the m processing steps based on the result of assigning the m processing steps to the arithmetic units constituting the combination (step S670).
(iv) The computer executes procedure (i) a plurality of times to determine a plurality of different combinations of arithmetic units, and executes procedures (ii) and (iii) for each of the combinations to estimate the processing time of the m processing steps for each combination (YES in step S680). The computer determines the combination of arithmetic units with the shortest processing time as the arithmetic units to be used in the arithmetic circuit (step S690).
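Procedures (i) to (iv) can be sketched end to end as follows. This is an illustrative sketch under several assumptions: the name `design_circuit` and its arguments are hypothetical, every subset of units whose total area is within the limit is tried, the processing speed of a unit is taken to be equal to its performance value, and the time model is the simple parallel model in which a unit's steps run sequentially. Drawing directly among the members of a combination gives the same distribution as redrawing whenever a prohibited unit is generated.

```python
import itertools
import random


def design_circuit(area, performance, workload, max_area, seed=0):
    """Return (estimated time, combination) for the fastest feasible combination.

    area, performance: {unit ID: value}; performance values must be positive.
    workload: list of processing amounts, one per processing step.
    Returns None if no combination fits within max_area.
    """
    rng = random.Random(seed)
    units = sorted(area)
    best = None
    for k in range(1, len(units) + 1):
        for combo in itertools.combinations(units, k):
            # (i) keep only combinations within the area limit
            if sum(area[u] for u in combo) > max_area:
                continue
            # (ii) assign each step at a frequency proportional to performance
            weights = [performance[u] for u in combo]
            busy = dict.fromkeys(combo, 0.0)
            for amount in workload:
                u = rng.choices(combo, weights=weights)[0]
                busy[u] += amount / performance[u]
            # (iii) estimate the processing time of this combination
            t = max(busy.values())
            # (iv) remember the combination with the shortest time
            if best is None or t < best[0]:
                best = (t, combo)
    return best
```

Because each combination is evaluated with a single random assignment, the estimate is noisy; averaging over several assignments per combination would be one way to make the comparison more stable.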
FIG. 19 is a diagram conceptually showing a specific example of the method for designing the arithmetic circuit shown in FIG. 18. In FIG. 19, three of the four arithmetic units 1 to 4 (n = 4) are selected to be actually incorporated in the arithmetic circuit. The arithmetic processing is assumed to include eight processing steps (m = 8). Arithmetic unit 4 is assumed to have the highest processing performance, followed by arithmetic unit 3, while arithmetic units 1 and 2 have low processing performance. Arithmetic unit 2 is assumed to have the largest circuit area, followed by arithmetic unit 4, while arithmetic units 1 and 3 have small circuit areas.
As shown in FIG. 19, to keep the area of the entire arithmetic circuit within the permissible range, no processing step is assigned to arithmetic unit 2, which has the largest circuit area. Processing steps are assigned to the other arithmetic units 1, 3, and 4 according to their processing speed. Specifically, five processing steps 2, 3, 5, 7, and 9 are assigned to arithmetic unit 4, which has the highest processing performance, and two processing steps 1 and 6 are assigned to arithmetic unit 3, which has the next highest processing performance. Processing steps 4 and 8 are assigned to arithmetic unit 1, which has low processing performance.
The processing time of the entire arithmetic processing is then calculated based on the arithmetic processing amount of each processing step and the processing speeds of arithmetic units 1, 3, and 4. Finally, the combination of arithmetic units to be incorporated in the arithmetic circuit is determined so that the processing time is shortest while the area of the entire arithmetic circuit stays within the permissible range.
Embodiment 5.
In the fifth embodiment, the random number generation method described in the second embodiment is applied to optimizing the circuit layout of logic cells.
For example, consider a case in the design of an LSI (Large Scale Integration) in which logic cells composed of arithmetic units are randomly assigned to a plurality of circuit areas in a semiconductor chip. In this case, depending on how the random numbers turn out, the number of logic cells assigned to each circuit area varies. As a result, the circuit area may vary from one circuit area to another.
Therefore, in the fifth embodiment, the logic cells are assigned to the circuit areas almost evenly, that is, so as to satisfy the condition that the number of logic cells differs by no more than one between circuit areas.
Specifically, as described in the second embodiment, the circuit areas are selected cyclically in a fixed order. The logic cell assigned to each circuit area, on the other hand, is selected randomly using uniform random numbers generated without duplication. In this way, n logic cells can be assigned to m circuit areas almost evenly, that is, with the number of logic cells differing by no more than one between circuit areas, without losing randomness. As a result, variation in circuit area between circuit areas can be suppressed.
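The assignment described for the fifth embodiment can be sketched as follows. The function name `place_cells` is hypothetical, and shuffling the list of logic cells is used here as one way to obtain uniform random numbers without duplication; the circuit areas are then visited cyclically in a fixed order.

```python
import random


def place_cells(n_cells, m_areas, rng=random):
    """Assign logic cells 1..n_cells to circuit areas 1..m_areas.

    The per-area counts differ by at most one, while which cell lands in
    which area remains random.
    """
    cells = list(range(1, n_cells + 1))
    rng.shuffle(cells)  # non-repeating random order of logic cells
    placement = {area: [] for area in range(1, m_areas + 1)}
    for k, cell in enumerate(cells):
        # Circuit areas are selected cyclically: 1, 2, ..., m, 1, 2, ...
        placement[k % m_areas + 1].append(cell)
    return placement
```

With 10 logic cells and 3 circuit areas, for example, the areas receive 4, 3, and 3 logic cells.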
The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of this application is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.
30 image recognition system, 31 signal input unit, 32 CPU, 33 memory, 35 parallel processing calculation unit, 36 reader/writer, 37 network interface, 38 bus interconnect, 40 arithmetic circuit, 41 dedicated memory, 42, 60 input data, 43 input data control unit, 44 arithmetic unit, 61 kernel, 62 output data.

Claims (9)

  1.  An arithmetic circuit that performs arithmetic processing on input data, wherein
    the arithmetic processing on the input data includes m processing steps that can be processed in parallel with one another, and the arithmetic processing amount of each of the m processing steps differs from the arithmetic processing amount of at least one of the remaining processing steps,
    the arithmetic circuit comprising:
    n arithmetic units that execute the m processing steps, where n is an integer that is 2 or more and less than m; and
    a control processor,
    wherein the control processor randomly assigns, based on random numbers, each of the m processing steps to any one of the n arithmetic units.
  2.  The arithmetic circuit according to claim 1, wherein
    the n arithmetic units are respectively associated with n identification numbers, and
    the control processor uniformly and randomly generates any one of the n identification numbers for each of the m processing steps, and assigns each processing step to the arithmetic unit corresponding to the generated identification number.
  3.  The arithmetic circuit according to claim 1, wherein the control processor randomly assigns, based on random numbers, each of the m processing steps to any one of the n arithmetic units such that the number of processing steps assigned differs by 1 or less between arithmetic units.
  4.  The arithmetic circuit according to claim 3, wherein
    the control processor cyclically selects, in a fixed order, the arithmetic unit to be assigned to each of the m processing steps, and
    the control processor randomly selects the processing step to be assigned to each of the n arithmetic units using uniform random numbers generated without duplication.
  5.  The arithmetic circuit according to claim 1, wherein
    the n arithmetic units differ from one another in processing performance,
    the n arithmetic units are respectively associated with n identification numbers, and
    the control processor randomly generates, for each of the m processing steps, any one of the n identification numbers at a frequency proportional to the processing performance of each of the n arithmetic units, and assigns each processing step to the arithmetic unit corresponding to the generated identification number.
  6.  The arithmetic circuit according to any one of claims 1 to 5, wherein the arithmetic processing on the input data includes a convolution operation in a convolutional neural network model.
  7.  An arithmetic method for input data, wherein
    the arithmetic processing on the input data includes m processing steps that can be processed in parallel with one another, and the arithmetic processing amount of each of the m processing steps differs from the arithmetic processing amount of at least one of the remaining processing steps,
    the arithmetic method comprising:
    a step in which a control processor randomly assigns, based on random numbers, each of the m processing steps to any one of n arithmetic units; and
    a step in which each of the n arithmetic units executes the at least one processing step assigned to it.
  8.  A program for causing an arithmetic circuit to execute arithmetic processing on input data, wherein
    the arithmetic processing on the input data includes m processing steps that can be processed in parallel with one another, and the arithmetic processing amount of each of the m processing steps differs from the arithmetic processing amount of at least one of the remaining processing steps,
    the arithmetic circuit includes:
    n arithmetic units, where n is an integer that is 2 or more and less than m; and
    a control processor, and
    the program
    causes the control processor to randomly assign, based on random numbers, each of the m processing steps to any one of the n arithmetic units, and
    causes each of the n arithmetic units to execute the at least one processing step assigned to it.
  9.  A method for designing the arithmetic circuit according to claim 5, the method comprising:
    a step of determining, from a plurality of arithmetic units that differ from one another in processing performance and circuit area, a combination of arithmetic units such that the total circuit area is equal to or less than a predetermined upper limit;
    a step of randomly generating, for each of the m processing steps, any one of the identification numbers of the arithmetic units constituting the combination at a frequency proportional to the processing performance of each arithmetic unit constituting the combination, and assigning each processing step to the arithmetic unit corresponding to the generated identification number;
    a step of estimating the processing time of the m processing steps based on the result of assigning the m processing steps to the arithmetic units constituting the combination; and
    a step of executing the determining step a plurality of times to determine a plurality of different combinations of arithmetic units, executing the assigning step and the estimating step for each of the combinations to estimate the processing time of the m processing steps for each combination, and determining the combination of arithmetic units with the shortest processing time as the n arithmetic units to be used in the arithmetic circuit.
PCT/JP2021/021922 2020-06-24 2021-06-09 Computation circuit, computation method, program, and computation circuit design method WO2021261252A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-108599 2020-06-24
JP2020108599 2020-06-24

Publications (1)

Publication Number Publication Date
WO2021261252A1 (en) 2021-12-30

Family

ID=79281110

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/021922 WO2021261252A1 (en) 2020-06-24 2021-06-09 Computation circuit, computation method, program, and computation circuit design method

Country Status (1)

Country Link
WO (1) WO2021261252A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004509386A (en) * 2000-06-30 2004-03-25 タレス ネデルラント ベー.フェー. How to automatically assign software functions to multiple processors
JP2005011331A (en) * 2003-05-26 2005-01-13 Toshiba Corp Load distribution system and computer management program
JP2014096113A (en) * 2012-11-12 2014-05-22 Nippon Telegr & Teleph Corp <Ntt> Load balancer


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21828720

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21828720

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP