WO2021261252A1

WO2021261252A1 - Computation circuit, computation method, program, and computation circuit design method

Info

Publication number: WO2021261252A1
Application number: PCT/JP2021/021922
Authority: WO
Inventors: 和茂橋本; 正志森
Original assignee: 三菱電機株式会社
Priority date: 2020-06-24
Filing date: 2021-06-09
Publication date: 2021-12-30

Abstract

A computation circuit (40) performs computation processing on input data (42). The computation processing on the input data includes m processing steps which can be processed in parallel to each other. The computation processing amount of each of the m processing steps differs from the computation processing amount of at least one of the remaining processing steps. The computation circuit (40) comprises: n computing units (44) that execute the m processing steps, where n is an integer greater than 1 and less than m; and a control processor (32). The control processor (32) randomly allocates each of the m processing steps to one of the n computing units (44) on the basis of a random number.

Description

Arithmetic circuit, arithmetic method, program, and arithmetic circuit design method

The present disclosure relates to an arithmetic circuit, an arithmetic method, a program for executing this arithmetic method, and a design method of the arithmetic circuit.

For processing in which the processing time increases as the amount of input data increases, it is common practice to shorten the processing time by parallel processing using multiple arithmetic units. In parallel processing, the original processing is divided into partial processing of the same algorithm, and each partial processing is calculated independently by the corresponding arithmetic unit. Then, by integrating the results of each partial process with a relatively small amount of calculation, it is possible to obtain the same or approximately the same result as the result of the original process. Parallel processing is suitably used for signal processing, certain processing of artificial intelligence (for example, calculation of the average value of a plurality of random variables according to the same distribution), and the like.

One of the problems with parallel processing is that the amount of data input to each arithmetic unit varies. If the amount of input data varies from one arithmetic unit to another, there is a disadvantage that the total processing time is determined by the processing time of the arithmetic unit having the largest amount of input data.

International Publication No. 2019/053835 (Patent Document 1) discloses a method for solving the above problem. Specifically, according to this document, in the multiplication of a coefficient matrix and an input vector, the multiplication of each non-zero element of the coefficient matrix by the corresponding element of the input vector is treated as one unit of processing. The multiplications are then assigned to the arithmetic units so that the number of processing units handled by each arithmetic unit is leveled.

Patent Document 1: International Publication No. 2019/053835

Although the calculation method described in International Publication No. 2019/053835 (Patent Document 1) is effective in shortening the processing time of the multiplication itself, it takes time and effort to determine the processing to be assigned to each arithmetic unit. This is because the number of unit processes to be assigned to each arithmetic unit is not determined until the search for non-zero elements has been completed for all input data. As a result, the circuit scale for the preprocessing that allocates the arithmetic processing to each arithmetic unit increases, and the total processing time also increases.

The present disclosure has been made in consideration of the above-mentioned problems. One of its purposes is to provide an arithmetic circuit, an arithmetic method, and a program for executing this arithmetic method that can shorten the processing time of the entire arithmetic processing by suppressing, with a relatively simple method, the variation in processing time among the arithmetic units.

The arithmetic circuit of one embodiment performs arithmetic processing on input data. The arithmetic processing on the input data includes m processing steps that can be processed in parallel with each other. The amount of arithmetic processing of each of the m processing steps differs from that of at least one of the other processing steps. The arithmetic circuit includes n arithmetic units that execute the m processing steps and a control processor, where n is an integer of 2 or more and smaller than m. The control processor randomly assigns each of the m processing steps to any one of the n arithmetic units based on a random number.

According to the above embodiment, randomly assigning each of the m processing steps to any one of the n arithmetic units based on a random number suppresses the variation in processing time among the arithmetic units, so that the processing time of the entire arithmetic processing can be shortened.

FIG. 1 is a block diagram showing an example of an image recognition system including the arithmetic circuit according to Embodiment 1.
FIG. 2 is a block diagram showing an example of the configuration of the signal input unit of FIG. 1.
FIG. 3 is a flowchart showing an outline of arithmetic processing using the arithmetic circuit of FIG. 1.
FIG. 4 is a flowchart showing details of arithmetic processing using the arithmetic circuit of FIG. 1.
FIG. 5 is a flowchart showing in detail an example of the process in step S110 of FIG. 4.
FIG. 6 is a diagram conceptually showing the convolution operation in a convolution layer of a convolutional neural network.
FIG. 7 is a diagram showing, in table form, an example of the processing time of each processing step of the convolution operation shown in FIG. 6.
FIG. 8 is a diagram showing, in table form, the processing time of each arithmetic unit when the processing steps of the convolution operation shown in FIG. 6 are regularly assigned to the arithmetic units.
FIG. 9 is a diagram showing, in table form, an example in which the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units by the procedure shown in FIG. 5.
FIG. 10 is a diagram showing, in table form, the processing time of each arithmetic unit in the allocation example shown in FIG. 9.
FIG. 11 is a diagram for explaining the effect of Embodiment 1.
FIG. 12 is a flowchart showing another method of realizing the process in step S110 of FIG. 4.
FIG. 13 is a diagram showing, in table form, an example in which the processing steps of the convolution operation shown in FIG. 6 are assigned to the arithmetic units by the procedure shown in FIG. 12.
FIG. 14 is a diagram showing, in table form, the processing time of each arithmetic unit in the allocation example shown in FIG. 13.
FIG. 15 is a diagram conceptually showing the difference between the processing step allocation method shown in FIG. 9 and the processing step allocation method shown in FIG. 13.
FIG. 16 is a flowchart showing yet another method of realizing the process in step S110 of FIG. 4.
FIG. 17 is a diagram conceptually showing the processing step allocation method shown in FIG. 16.
FIG. 18 is a flowchart showing a design procedure for the arithmetic circuit.
FIG. 19 is a diagram conceptually showing a concrete example of the arithmetic circuit design method shown in FIG. 18.

Hereinafter, each embodiment will be described in detail with reference to the drawings. The convolution operation in a CNN (Convolutional Neural Network) model is used as an example below, but the arithmetic circuit and the arithmetic method of the present disclosure are not limited to the convolution operation. In the following description, the same or corresponding parts are designated by the same reference numerals, and their description may not be repeated.

Embodiment 1.
[Overall configuration of image recognition system]
FIG. 1 is a block diagram showing an example of an image recognition system including an arithmetic circuit according to the first embodiment.

The image recognition system 30 has a system configuration assuming execution of an image recognition application in a surveillance camera for the purpose of person detection, an in-vehicle camera for the purpose of object detection, or the like. The image recognition system 30 has a function of detecting a specific object from image data according to a CNN model. Specifically, the image recognition system 30 performs a two-dimensional convolution operation, that is, a product-sum operation such as a product (Ax + b) of a matrix A and a vector x.

As shown in FIG. 1, the image recognition system 30 includes a signal input unit 31, a CPU (Central Processing Unit) 32, a memory 33, a DMAC (Direct Memory Access Controller) 34, a parallel processing calculation unit 35, a reader / writer 36, and a network interface 37. These components are interconnected via the bus interconnect 38.

The signal input unit 31 generates image data by converting light incident through an optical system (not shown) into an electric signal. The image data is arithmetically processed by the arithmetic circuit 40. A configuration example of the signal input unit 31 will be described later with reference to FIG. 2.

The CPU 32 functions as a control processor that controls the entire image recognition system 30. The CPU 32 also accesses the dedicated memory 41 inside the parallel processing calculation unit 35.

The memory 33 stores instructions and control data executed by the CPU 32. The memory 33 includes a volatile memory such as a DRAM (Dynamic Random Access Memory) and a SRAM, and an electrically rewritable non-volatile memory such as a flash memory.

The DMAC 34 controls direct data transfer between the signal input unit 31, the memory 33, and the dedicated memory 41 of the parallel processing calculation unit 35, without going through the CPU 32.

The parallel processing calculation unit 35 performs two-dimensional convolution processing. As its internal configuration, the parallel processing calculation unit 35 includes a dedicated memory 41, an input data control unit 43, and n arithmetic units 44_1 to 44_n. The arithmetic units 44_1 to 44_n are referred to as the arithmetic unit 44 when referred to generically or when no specific unit is meant. In the first embodiment, the n arithmetic units 44 have the same processing capacity, or processing capacities that are substantially the same to the extent that the difference in processing time among the arithmetic units does not matter.

The n arithmetic units 44 can operate in parallel with each other. The number n of the arithmetic units 44 is determined depending on the number m of processing steps that can be processed in parallel included in the processing program for the input data 42. The relation 2 ≦ n < m holds. The arithmetic unit 44 can be configured by a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), a multi-core processor, or the like.

The dedicated memory 41 stores the input data 42 processed by the n arithmetic units 44 and the arithmetic results of each arithmetic unit 44. The input data control unit 43 assigns each of the above processing steps to each arithmetic unit 44. The arithmetic unit 44 executes arithmetic processing for the assigned processing step.

The arithmetic circuit 40 is configured by the CPU 32, the memory 33, and the parallel processing calculation unit 35. The CPU 32 and the memory 33 may be provided inside the parallel processing calculation unit 35.

The reader / writer 36 writes data or a program to the storage medium and reads out the data or program stored in the storage medium. The storage medium stores data or programs non-temporarily by magnetic or optical methods or by using semiconductor memory. As a storage medium, a CD (Compact Disc), a DVD (Digital Versatile Disc), a Blu-ray disc, a hard disk, a flash memory, or the like can be used.

The network interface 37 is connected to an external device via the network. The program executed by the CPU 32 and the program executed by the arithmetic unit 44 may be provided via the network via the network interface 37, or may be provided by the storage medium via the reader / writer 36.

FIG. 2 is a block diagram showing an example of the configuration of the signal input unit of FIG. 1. The block diagram of FIG. 2 shows a configuration example in which a CMOS (Complementary Metal Oxide Semiconductor) device is used as the photoelectric element, but a CCD (Charge Coupled Device) or another type of optical sensor may be used instead.

With reference to FIG. 2, the signal input unit 31 includes an optical system (not shown) for condensing light, a sensor array 51, a column ADC (Analog-to-Digital Converter) 52, a vertical scanning circuit 53, a horizontal scanning circuit 54, an output amplifier 55, a frame buffer 56, and a signal processing circuit 57.

The sensor array 51 includes a plurality of photoelectric elements arranged in a matrix. The light input to the signal input unit 31 is focused on the sensor array 51 by the optical system.

The vertical scanning circuit 53 is connected to a plurality of control signal lines extending in the row direction of the sensor array 51, and drives a readout circuit of each photoelectric element via each control signal line. The horizontal scanning circuit 54 is connected to a plurality of output signal lines extending in the column direction of the sensor array 51, and reads an optical signal from each photoelectric element via each output signal line.

The column ADC 52 converts an optical signal read from each photoelectric element into a digital signal. The output amplifier 55 amplifies the converted digital signal. The frame buffer 56 temporarily stores the amplified digital signal frame by frame. The signal processing circuit 57 removes noise and the like contained in the digital signal and executes various image corrections.

[Operation processing]
Next, the operation of the arithmetic circuit 40 of FIG. 1 will be described. FIG. 3 is a flowchart showing an outline of arithmetic processing using the arithmetic circuit of FIG. 1.

With reference to FIG. 3, the process executed by the arithmetic circuit 40 is divided into a preprocessing S100 that is executed only once at the beginning and an arithmetic process S200 that is repeatedly executed a plurality of times according to the input data.

As an example, the preprocessing S100 is executed by the CPU 32 according to the program. The preprocessing S100 may be executed by another general-purpose CPU. The arithmetic processing S200 is mainly executed by the parallel processing calculation unit 35 according to the program, and the overall control of the arithmetic processing S200 is executed by, for example, the CPU 32.

FIG. 4 is a flowchart showing details of arithmetic processing using the arithmetic circuit of FIG. 1. In the following description, the arithmetic processing for the input data 42 executed by the arithmetic circuit 40 includes m processing steps that can be processed in parallel. The number m of processing steps is larger than the number n of the arithmetic units 44.

With reference to FIG. 4, in step S110 of the preprocessing S100, the CPU 32 allocates each of the m processing steps to any one of the n arithmetic units 44 based on the random number. Therefore, each arithmetic unit 44 is assigned at least one of m processing steps.

In step S210 of the arithmetic processing S200, each arithmetic unit 44 of the parallel processing calculation unit 35 executes the assigned processing step. If the execution of all the processing steps assigned in each arithmetic unit 44 is not completed (NO in step S220), the above step S210 is repeated. When each arithmetic unit 44 executes all the assigned processing steps, the arithmetic processing ends.

FIG. 5 is a flowchart showing in detail an example of the process in step S110 of FIG. 4. In the flowchart of FIG. 5, the identification number of a processing step is denoted by i (where 1 ≦ i ≦ m and i is an integer), and the first through m-th processing steps are handled in order.

In step S300 of FIG. 5, the CPU 32 initializes the identification number i of the processing step to 1.

In the next step S310, the CPU 32 generates a uniformly distributed random integer in the range of 1 to n, which serves as the identification number of one of the n arithmetic units. Let r (i) be the generated random integer, so that 1 ≦ r (i) ≦ n holds. A known pseudo-random number generation algorithm, such as a linear congruential generator or multiply-with-carry, may be used to generate the uniform random number. In the first embodiment, since the n arithmetic units have the same or substantially the same processing capacity, it is desirable to generate uniform random numbers in which every value appears with the same probability. As described in the third embodiment, when the processing capacity differs among the arithmetic units, non-uniform random numbers must be generated instead.

In the next step S320, the CPU 32 allocates the i-th processing step to the r (i) -th arithmetic unit 44_r (i) using the generated integer uniform random number r (i).

In the next step S330, the CPU 32 increments the identification number i of the processing step by 1.

The above steps S310 to S330 are repeated until the identification number i of the processing step exceeds m (until YES is set in step S340). As described above, each of the m processing steps is assigned to any one of the n arithmetic units 44.
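
The loop of steps S300 to S340 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `assign_uniform` and the use of Python's standard `random` module as the pseudo-random number generator are assumptions made for the example.

```python
import random

def assign_uniform(m, n, rng=None):
    """Assign processing steps 1..m to arithmetic units 1..n at random.

    Hypothetical helper mirroring steps S300 to S340 of FIG. 5.
    Returns a dict mapping step number i to unit number r(i).
    """
    rng = rng or random.Random()
    allocation = {}
    i = 1                        # S300: initialize the step number
    while i <= m:                # S340: stop once i exceeds m
        r = rng.randint(1, n)    # S310: uniform random integer in [1, n]
        allocation[i] = r        # S320: step i goes to unit r(i)
        i += 1                   # S330: next step
    return allocation

# Example with m = 9 steps and n = 4 units; the seed is arbitrary.
allocation = assign_uniform(m=9, n=4, rng=random.Random(0))
print(allocation)
```

No information about the input data is consulted here, which is what keeps this preprocessing cheap.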

[Specific example of arithmetic processing]
Hereinafter, the operation of the arithmetic circuit 40 in FIG. 1 will be further described by taking the convolutional operation in the convolutional layer of the convolutional neural network as an example. It is assumed that the parallel processing calculation unit 35 is provided with four arithmetic units 44_1 to 44_4 (n = 4).

FIG. 6 is a diagram conceptually showing the convolution operation in the convolution layer of the convolutional neural network. As shown in FIG. 6, the output data 62 is generated by the convolution operation between the input data 60 and the kernel 61. Further, after the bias is added to each element of the output data 62, an activation function is applied to each element of the output data 62.

In the convolution operation, while sliding the kernel 61 on the input data 60 at regular intervals, the elements of the kernel 61 and the corresponding elements of the input data 60 are multiplied, and the sum of them is obtained. The interval for sliding the kernel (that is, stride) is 1. In this case, in the convolution operation, the operations of the following equations (1) to (9) are executed. For the sake of simplicity, the bias addition and the activation function operation are omitted.

$$y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_6 w_4 + x_7 w_5 + x_8 w_6 + x_{11} w_7 + x_{12} w_8 + x_{13} w_9 \tag{1}$$
$$y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_7 w_4 + x_8 w_5 + x_9 w_6 + x_{12} w_7 + x_{13} w_8 + x_{14} w_9 \tag{2}$$
$$y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_8 w_4 + x_9 w_5 + x_{10} w_6 + x_{13} w_7 + x_{14} w_8 + x_{15} w_9 \tag{3}$$
$$y_4 = x_6 w_1 + x_7 w_2 + x_8 w_3 + x_{11} w_4 + x_{12} w_5 + x_{13} w_6 + x_{16} w_7 + x_{17} w_8 + x_{18} w_9 \tag{4}$$
$$y_5 = x_7 w_1 + x_8 w_2 + x_9 w_3 + x_{12} w_4 + x_{13} w_5 + x_{14} w_6 + x_{17} w_7 + x_{18} w_8 + x_{19} w_9 \tag{5}$$
$$y_6 = x_8 w_1 + x_9 w_2 + x_{10} w_3 + x_{13} w_4 + x_{14} w_5 + x_{15} w_6 + x_{18} w_7 + x_{19} w_8 + x_{20} w_9 \tag{6}$$
$$y_7 = x_{11} w_1 + x_{12} w_2 + x_{13} w_3 + x_{16} w_4 + x_{17} w_5 + x_{18} w_6 + x_{21} w_7 + x_{22} w_8 + x_{23} w_9 \tag{7}$$
$$y_8 = x_{12} w_1 + x_{13} w_2 + x_{14} w_3 + x_{17} w_4 + x_{18} w_5 + x_{19} w_6 + x_{22} w_7 + x_{23} w_8 + x_{24} w_9 \tag{8}$$
$$y_9 = x_{13} w_1 + x_{14} w_2 + x_{15} w_3 + x_{18} w_4 + x_{19} w_5 + x_{20} w_6 + x_{23} w_7 + x_{24} w_8 + x_{25} w_9 \tag{9}$$
Each of the operations of the above equations (1) to (9) corresponds to a processing step capable of parallel processing with each other. Hereinafter, the identification numbers of the processing steps represented by the formulas (1) to (9) are set to 1 to 9, respectively.
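
The index pattern of these nine product-sums can be reproduced with a short sketch. It assumes that the input data 60 is a 5 × 5 array stored row by row as x_1 to x_25 and that the kernel 61 is a 3 × 3 array stored as w_1 to w_9; the function name `conv2d_valid` and the sample values are hypothetical, and bias and activation are omitted as in the text.

```python
def conv2d_valid(x, w, width=5, k=3, stride=1):
    """Slide a k-by-k kernel over a width-by-width input (no padding).

    x: flat list of input elements, x[0] holding x_1.
    w: flat list of kernel elements, w[0] holding w_1.
    Returns the outputs y_1, y_2, ... as a flat list.
    """
    out = (width - k) // stride + 1
    y = []
    for r in range(out):            # output row
        for c in range(out):        # output column
            acc = 0
            for dr in range(k):     # kernel row
                for dc in range(k): # kernel column
                    xi = (r * stride + dr) * width + (c * stride + dc)
                    acc += x[xi] * w[dr * k + dc]
            y.append(acc)           # one processing step per output
    return y

x = list(range(1, 26))   # hypothetical input: x_i = i
w = [1] * 9              # hypothetical kernel: all ones
y = conv2d_valid(x, w)
print(y[0], y[8])        # 63 171
```

Each iteration of the two outer loops is independent of the others, which is why the nine outputs can be treated as nine parallel processing steps.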

Note that when the input data 60 contains zero-valued elements, the multiplications of those elements by the corresponding elements of the kernel 61 need not be executed. For example, if the values of $x_4$, $x_5$, $x_9$, $x_{13}$ to $x_{18}$, $x_{21}$, $x_{22}$, and $x_{25}$ of the input data 60 are 0, the amount of arithmetic processing varies from one processing step to another.

FIG. 7 is a diagram showing, in table form, an example of the processing time of each processing step of the convolution operation shown in FIG. 6. Since the processing performance of each arithmetic unit is assumed to be the same in the first embodiment, the processing time of each processing step is proportional to its amount of arithmetic processing. When the input data 60 includes the zero elements described above, the processing time of each processing step is as shown in FIG. 7.

FIG. 8 is a diagram showing, in table form, the processing time of each arithmetic unit when the processing steps of the convolution operation shown in FIG. 6 are regularly assigned to the arithmetic units. In FIG. 8, processing steps 1 to 4 are assigned to arithmetic units 1 to 4 in order, processing steps 5 to 8 are assigned to arithmetic units 1 to 4 in order, and the remaining processing step 9 is assigned to arithmetic unit 1. When the processing steps 1 to 9 are regularly assigned to the arithmetic units 44_1 to 44_4 in this way, the processing time varies among the arithmetic units 44 as shown in FIG. 8. The total processing time is therefore determined by the processing time of the arithmetic unit 44_1, which has the longest processing time.
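
The imbalance caused by regular assignment can be illustrated numerically. The sketch below assumes, purely for illustration, that each multiplication by a non-zero input element costs one unit of time and that multiplications by zero elements are skipped; the tables of FIGS. 7 and 8 are not reproduced here, so the numbers are derived only from the zero elements listed in the text. The names `step_costs` and `round_robin_loads` are hypothetical.

```python
# Zero-valued input elements named in the text (1-based indices).
ZERO_IDS = {4, 5, 9, 13, 14, 15, 16, 17, 18, 21, 22, 25}

def step_costs(zero_ids, width=5, k=3):
    """Count the non-zero multiplications in each processing step."""
    out = width - k + 1
    costs = []
    for r in range(out):
        for c in range(out):
            ids = [(r + dr) * width + (c + dc) + 1
                   for dr in range(k) for dc in range(k)]
            costs.append(sum(1 for i in ids if i not in zero_ids))
    return costs

def round_robin_loads(costs, n):
    """Assign steps 1, 2, 3, ... to units 1..n in fixed cyclic order."""
    loads = [0] * n
    for step, cost in enumerate(costs):
        loads[step % n] += cost
    return loads

costs = step_costs(ZERO_IDS)
loads = round_robin_loads(costs, n=4)
print(costs)  # [8, 5, 3, 5, 4, 4, 3, 4, 4]
print(loads)  # [16, 9, 6, 9]
```

Under this assumed cost model the first unit carries 16 of the 40 units of work, so it alone determines the total time, in line with the observation that arithmetic unit 44_1 has the longest processing time.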

FIG. 9 is a diagram showing an example in which each processing step of the convolution operation shown in FIG. 6 is assigned to each arithmetic unit by the procedure shown in FIG. 5 in a table format. As shown in FIG. 9, in the arithmetic circuit 40 of the first embodiment, the CPU 32 randomly assigns each of the processing steps 1 to 9 to each arithmetic unit 44 based on a uniform random number.

FIG. 10 is a diagram showing, in table form, the processing time of each arithmetic unit in the allocation example shown in FIG. 9. Compared with the case where the processing steps are regularly assigned to the arithmetic units 44 as shown in FIG. 8, the first embodiment shown in FIG. 10 reduces the variation in processing time among the arithmetic units.

In the above example, the degree of variation may increase depending on the generation of random numbers. However, in the arithmetic processing in the actual arithmetic circuit 40, both the number of processing steps and the number of arithmetic units 44 are much larger than in the case of the above example. Therefore, by randomly assigning each processing step to any one arithmetic unit 44 based on a uniform random number, it is possible to suppress variations in the processing time for each arithmetic unit 44.

[Effect of Embodiment 1]
FIG. 11 is a diagram for explaining the effect of the first embodiment. In FIGS. 11A and 11B, assume that, as a result of allocating each processing step to any one arithmetic unit, the processing times of arithmetic units 1 to n are $N_1, \ldots, N_n$, respectively.

FIG. 11A shows a case where the processing time varies among the arithmetic units as a result of regularly allocating the processing steps to arithmetic units 1 to n. As shown in FIG. 11A, the processing time of the entire arithmetic circuit 40 is determined by the processing time $N_1$ of arithmetic unit 1, which has the longest processing time.

FIG. 11B shows a case where each processing step is randomly assigned to the arithmetic units 1 to n according to a uniform random number. In this case, since the variation in the processing time for each arithmetic unit can be suppressed, the processing time of the entire arithmetic circuit 40 can be shortened as compared with the case of FIG. 11A.
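
The comparison between the two cases can be written compactly. Assuming that the n arithmetic units start together and do not wait on one another (an assumption of this sketch), and writing $T_{\mathrm{total}}$ for the processing time of the parallel part (a symbol introduced here for illustration), the total time equals the longest per-unit time and can never fall below the average per-unit time:

$$T_{\mathrm{total}} = \max_{1 \le j \le n} N_j \;\ge\; \frac{1}{n} \sum_{j=1}^{n} N_j$$

Equality holds only when all $N_j$ are equal, so the closer the $N_j$ are to one another, the closer $T_{\mathrm{total}}$ comes to this lower bound.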

Next, the arithmetic circuit 40 of the present embodiment is compared with the arithmetic circuit of the above-mentioned International Publication No. 2019/053835 (Patent Document 1).

In the case of the arithmetic circuit of Patent Document 1, arithmetic processing is assigned to each arithmetic unit so that the processing amount in each arithmetic unit is leveled. Therefore, in the case of the product-sum operation described with reference to FIGS. 6 and 7, it is necessary to search in advance for non-zero elements included in the input data 60 and estimate the operation processing amount of each processing step. Therefore, it takes time for preprocessing to assign each processing step to any one arithmetic unit.

On the other hand, in the case of the arithmetic circuit 40 of the present embodiment, each processing step is uniformly and randomly assigned to any one arithmetic unit. Therefore, it is not necessary to estimate the arithmetic processing amount of each processing step in advance based on the search result for the non-zero element. As a result, the time required for preprocessing can be shortened as compared with the case of the arithmetic circuit of Patent Document 1, and thereby the entire processing time including preprocessing can be shortened.

Embodiment 2.
[Outline of Embodiment 2]
The arithmetic circuit 40 of the first embodiment randomly assigns each processing step of input data to any one arithmetic unit based on a uniform random number. In this case, the number of processing steps assigned to each arithmetic unit varies depending on the generation of random numbers. As a result, the processing time of each arithmetic unit may vary.

Therefore, the arithmetic circuit of the second embodiment imposes the condition that the number of processing steps assigned to each arithmetic unit be substantially equal, that is, that the numbers of processing steps assigned to any two arithmetic units differ by at most one. While satisfying this condition, the CPU 32 randomly assigns each processing step to any one of the arithmetic units. A specific description is given below with reference to the drawings.

[Calculator allocation procedure]
FIG. 12 is a flowchart showing another method of realizing the process in step S110 of FIG. 4. As in the case of FIGS. 4 and 5 of the first embodiment, the parallel processing calculation unit 35 includes n arithmetic units 44_1 to 44_n. Further, the arithmetic processing executed by the arithmetic circuit 40 includes m (m > n) processing steps that can be processed in parallel. The m processing steps are assigned identification numbers from 1 to m.

Let i be the index of the random number generation process (where 1 ≦ i ≦ m and i is an integer), and let j be the identification number of the arithmetic unit (where 1 ≦ j ≦ n and j is an integer). In the flowchart of FIG. 12, the first through m-th random number generation processes are executed in order.

In step S400 of FIG. 12, the CPU 32 initializes each of the number i of the random number generation processing and the identification number j of the arithmetic unit to 1.

In the next step S410, the CPU 32 generates a uniformly distributed random integer in the range of 1 to m that differs from every random number already generated. Hereinafter, in the second embodiment, a uniformly distributed random integer is simply referred to as an integer random number unless this would cause confusion. Let r (i) be the integer random number generated the i-th time. Then 1 ≦ r (i) ≦ m, and r (i) is not equal to any of r (1) to r (i-1).

In the next step S420, the CPU 32 allocates the r (i) th processing step to the jth arithmetic unit 44_j using the random number r (i) generated in the i-th time.

In the next step S430, the CPU 32 increments the number i of the random number generation processing by 1 and increments the identification number j of the arithmetic unit by 1. When the identification number j of the arithmetic unit exceeds n (YES in step S440), the CPU 32 initializes the identification number j of the arithmetic unit to 1 (step S450).

The above steps S410, S420, and S430 are repeated until the number i of the random number generation process exceeds m (until YES is obtained in step S460). As a result, the first to m-th processing steps are distributed almost evenly among the first arithmetic unit 44_1 to the n-th arithmetic unit 44_n, that is, they are assigned so that the numbers of processing steps assigned to any two arithmetic units differ by at most one.
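
The procedure of FIG. 12 can be sketched as below. Instead of redrawing until an unused number appears, this sketch shuffles the list 1 to m once, which yields the same kind of non-repeating random sequence r (1), ..., r (m); this substitution, the function name `assign_balanced`, and the use of Python's `random` module are choices made for the example, not details taken from the disclosure.

```python
import random
from collections import Counter

def assign_balanced(m, n, rng=None):
    """Assign steps 1..m to units 1..n, cycling through the units in order.

    Returns a dict mapping step number r(i) to unit number j.
    """
    rng = rng or random.Random()
    order = list(range(1, m + 1))
    rng.shuffle(order)            # r(1..m): each step number exactly once
    allocation = {}
    j = 1                         # S400: start with unit 1
    for r_i in order:             # S410: next non-repeating random number
        allocation[r_i] = j       # S420: step r(i) goes to unit j
        j += 1                    # S430: next unit
        if j > n:                 # S440: past the last unit?
            j = 1                 # S450: wrap around to unit 1
    return allocation

allocation = assign_balanced(m=9, n=4, rng=random.Random(0))
counts = Counter(allocation.values())
print(sorted(counts.items()))     # unit 1 gets 3 steps, units 2 to 4 get 2
```

Because the units are visited cyclically, the step counts are fixed in advance; only which steps land on which unit is random.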

[Specific example of arithmetic processing]
Hereinafter, the operation of the arithmetic circuit 40 of the second embodiment will be further described using the same convolution example as shown in FIGS. 6 and 7 of the first embodiment. The parallel processing calculation unit 35 is provided with four arithmetic units 44_1 to 44_4 (n = 4), and the arithmetic processing is divided into nine processing steps (m = 9).

FIG. 13 is a diagram showing an example in which each processing step of the convolution operation shown in FIG. 6 is assigned to each arithmetic unit by the procedure shown in FIG. 12 in a table format. As shown in FIG. 13, each processing step is assigned to the arithmetic unit based on a non-overlapping integer random number from 1 to 9.

Specifically, first, the fifth processing step is assigned to the first arithmetic unit 44_1 using the integer random number 5 generated the first time. Subsequently, using the integer random numbers 1, 6, and 4 generated the second to fourth times, the first, sixth, and fourth processing steps are assigned to the second to fourth arithmetic units 44_2 to 44_4, respectively.

Next, using the integer random numbers 9, 7, 2, and 8 generated the fifth to eighth times, the ninth, seventh, second, and eighth processing steps are assigned to the first to fourth arithmetic units 44_1 to 44_4, respectively.

Finally, the third processing step is assigned to the first arithmetic unit 44_1 using the integer random number 3 generated the ninth time.

FIG. 14 is a diagram showing, in table form, the processing time of each arithmetic unit in the processing step allocation example shown in FIG. 13. As shown in FIG. 14, the processing steps can be assigned so that the number of processing steps assigned to each of the first to fourth arithmetic units 44_1 to 44_4 is almost equal, that is, two or three. As a result, it is possible to suppress variations in the processing time among the arithmetic units.
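
The allocation walked through above can be replayed with the random number sequence 5, 1, 6, 4, 9, 7, 2, 8, 3 given in the text. The per-step costs used below are an assumption of this sketch (one unit of time per multiplication by a non-zero input element, with the zero elements listed for the first embodiment); they are not read from FIG. 7 or FIG. 14.

```python
# Hypothetical per-step costs: number of non-zero multiplications per step.
costs = {1: 8, 2: 5, 3: 3, 4: 5, 5: 4, 6: 4, 7: 3, 8: 4, 9: 4}

# Non-repeating integer random numbers r(1)..r(9) from the example.
draws = [5, 1, 6, 4, 9, 7, 2, 8, 3]
n = 4

steps_per_unit = {j: [] for j in range(1, n + 1)}
for i, step in enumerate(draws):
    unit = i % n + 1                  # units are visited in cyclic order
    steps_per_unit[unit].append(step)

loads = {j: sum(costs[s] for s in steps)
         for j, steps in steps_per_unit.items()}

print(steps_per_unit)  # {1: [5, 9, 3], 2: [1, 7], 3: [6, 2], 4: [4, 8]}
print(loads)           # {1: 11, 2: 11, 3: 9, 4: 9}
```

Under the same assumed costs, assigning steps 1 to 9 in fixed order would place 16 units of work on the first arithmetic unit, so in this sketch the longest per-unit time drops from 16 to 11.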

FIG. 15 is a diagram conceptually showing the difference between the processing step allocation method shown in FIG. 9 and the processing step allocation method shown in FIG. 13. FIG. 15(A) conceptually shows the method of allocating the processing steps in the case of the first embodiment shown in FIG. 9, and FIG. 15(B) conceptually shows the method of allocating the processing steps in the case of the second embodiment shown in FIG. 13.

With reference to FIG. 15A, in the case of the first embodiment, the processing steps are selected in order from the first processing step to the ninth processing step, and an arithmetic unit is assigned to each selected processing step. The arithmetic unit to be assigned is randomly selected using a uniform random number.

With reference to FIG. 15 (B), in the case of the second embodiment, the arithmetic units are selected cyclically in order from the first arithmetic unit to the fourth arithmetic unit, and a processing step is assigned to the selected arithmetic unit. The processing step assigned to each arithmetic unit is randomly selected using uniform integer random numbers generated so as not to overlap in the range of 1 to m.
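The two allocation methods contrasted in FIG. 15 can be sketched as follows. This is only an illustrative sketch, not the implementation of the disclosure: the function names are hypothetical, and a shuffle of the list 1..m is assumed as one possible way to obtain uniform integer random numbers that do not overlap.

```python
import random
from collections import Counter


def allocate_uniform(m, n, rng):
    """FIG. 15 (A) style: take steps 1..m in order, pick a unit uniformly at random."""
    return {step: rng.randint(1, n) for step in range(1, m + 1)}


def assign_cyclic(draws, n):
    """FIG. 15 (B) style: the k-th drawn step (k = 0, 1, ...) goes to unit (k mod n) + 1."""
    return {step: (k % n) + 1 for k, step in enumerate(draws)}


def allocate_cyclic(m, n, rng):
    """Draw steps 1..m in random order without repetition, then assign cyclically."""
    draws = list(range(1, m + 1))
    rng.shuffle(draws)  # assumption: a shuffle stands in for the non-overlapping random numbers
    return assign_cyclic(draws, n)


def steps_per_unit(allocation, n):
    """Number of processing steps assigned to units 1..n."""
    counts = Counter(allocation.values())
    return [counts[unit] for unit in range(1, n + 1)]


# Draw order quoted in the description: 5, 1, 6, 4, 9, 7, 2, 8, 3 (m = 9, n = 4).
example = assign_cyclic([5, 1, 6, 4, 9, 7, 2, 8, 3], 4)
print(steps_per_unit(example, 4))  # -> [3, 2, 2, 2]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(steps_per_unit(allocate_uniform(9, 4, rng), 4))  # counts may be uneven
print(steps_per_unit(allocate_cyclic(9, 4, rng), 4))   # counts differ by at most 1
```

With the draw order from the description, the first arithmetic unit receives three processing steps (5, 9, 3) and the others receive two each, which matches the "two or three" distribution stated for FIG. 14.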

[Effect of Embodiment 2]
As described above, according to the arithmetic circuit of the second embodiment, the arithmetic units assigned to the processing steps are selected cyclically in a fixed order. On the other hand, the processing steps assigned to each arithmetic unit are randomly selected using uniform random numbers generated so as not to overlap. As a result, the m processing steps can be allocated to the n arithmetic units almost evenly, that is, so that the difference in the number of processing steps between arithmetic units is within 1, without losing randomness. This makes it possible to suppress variations in the processing time for each arithmetic unit.

[Modification example]
Step S310 in the flowchart of FIG. 5 may be modified. Specifically, in step S310, the CPU 32 generates an integer random number that does not overlap in the range of 1 to n. When the generation of n integer random numbers is completed, the CPU 32 again generates non-overlapping integer random numbers in the range of 1 to n. The above procedure is repeated from i = 1 to i = m. Also by this method, the same result as in the case of FIG. 12 can be obtained.
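The modification above can be sketched as follows, under the assumption that shuffling the list 1..n is an acceptable way to produce non-overlapping integer random numbers; the function name is hypothetical.

```python
import random
from collections import Counter


def allocate_by_unit_permutations(m, n, rng):
    """For steps i = 1..m, draw unit numbers from 1..n without repetition;
    once all n numbers have been used, start a fresh non-overlapping sequence."""
    allocation = {}
    pool = []
    for step in range(1, m + 1):
        if not pool:                      # n numbers consumed: regenerate
            pool = list(range(1, n + 1))
            rng.shuffle(pool)
        allocation[step] = pool.pop()
    return allocation


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
allocation = allocate_by_unit_permutations(9, 4, rng)
counts = Counter(allocation.values())
print(sorted(counts.values()))  # -> [2, 2, 2, 3]
```

Because every block of n consecutive processing steps uses each arithmetic unit exactly once, the number of steps per unit differs by at most 1, as in the second embodiment.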

Embodiment 3.
[Outline of Embodiment 3]
In the first and second embodiments, it is assumed that the processing performance of each arithmetic unit 44 constituting the parallel processing calculation unit 35 is substantially the same. In the third embodiment, a case where the processing performance differs for each arithmetic unit 44 will be described. In this case, the identification number of an arithmetic unit is randomly generated at a frequency proportional to the processing performance of each of the n arithmetic units, and the arithmetic unit corresponding to the generated identification number is assigned to each processing step. As a result, the processing time for each arithmetic unit can be made almost equal.

As a method of generating random numbers with a given frequency distribution, for example, the inverse function method or the von Neumann rejection method can be used. In addition, any known method may be used.

Specifically, in the inverse function method, a distribution function is assumed whose domain is the identification numbers 1 to n of the arithmetic units 44_1 to 44_n and whose range is the processing performance of each arithmetic unit, and the cumulative distribution function of this distribution function is denoted by F. Then, a new random number generation function is obtained by applying the inverse function F^{-1} of this cumulative distribution function to the uniform random number generation function. Hereinafter, a specific description will be given with reference to the drawings.
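A minimal sketch of the inverse function method for a discrete distribution is given below. It assumes that the processing performance values are normalized so that they sum to 1 before the cumulative distribution function F is formed; the function name and the performance values are hypothetical.

```python
import bisect
import itertools
import random
from collections import Counter


def make_unit_sampler(performance, rng):
    """Return a function that draws a unit number in 1..n with probability
    proportional to performance[unit - 1], using the inverse of the CDF F."""
    total = sum(performance)
    cdf = list(itertools.accumulate(p / total for p in performance))
    cdf[-1] = 1.0  # guard against floating-point rounding in the last entry

    def sample():
        u = rng.random()  # uniform random number in [0, 1)
        # F^{-1}(u): the smallest unit number k with F(k) > u
        return bisect.bisect_right(cdf, u) + 1

    return sample


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = make_unit_sampler([1.0, 1.0, 2.0, 4.0], rng)  # hypothetical relative performance
frequency = Counter(sample() for _ in range(8000))
print(sorted(frequency.items()))  # unit 4 is drawn most often, then unit 3
```

An arithmetic unit whose performance is set to 0 is never drawn, because its step in F has zero height.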

[Calculator allocation procedure]
FIG. 16 is a flowchart showing still another realization method of the process in step S110 of FIG. Similar to the case of FIGS. 4 and 5 of the first embodiment, the parallel processing calculation unit 35 includes n arithmetic units 44_1 to 44_n. Further, the arithmetic processing executed by the arithmetic circuit 40 includes m (m> n) processing steps capable of parallel processing. In the flowchart of FIG. 16, the identification number of the processing step is i (however, 1 ≦ i ≦ m, i is an integer), and the first processing step to the mth processing step are executed in order.

In step S500 of FIG. 16, the CPU 32 initializes the identification number i of the processing step to 1.

In the next step S510, the CPU 32 generates integer random numbers in the range of 1 to n as identification numbers of n arithmetic units at a frequency proportional to the processing performance of the arithmetic units 44_1 to 44_n. For example, the above-mentioned inverse function method is used to generate such an integer random number. Let the generated integer random number be r (i). 1 ≦ r (i) ≦ n holds.

In the next step S520, the CPU 32 allocates the i-th processing step to the r (i) -th arithmetic unit 44_r (i) using the generated integer random number r (i).

In the next step S530, the CPU 32 increments the identification number i of the processing step by 1.

The above steps S510 to S530 are repeated until the parameter i representing the identification number of the processing step exceeds m (until YES is obtained in step S540). As described above, each of the m processing steps is assigned to any one of the n arithmetic units 44.

FIG. 17 is a diagram conceptually showing the method of allocating the processing steps shown in FIG. In FIG. 17, the parallel processing calculation unit 35 is provided with four arithmetic units 1 to 4 (n = 4), and the arithmetic processing includes nine processing steps (m = 9). Further, it is assumed that the fourth arithmetic unit 4 has the highest processing performance, the processing performance of the arithmetic unit 3 is the second highest, and the processing performance of the arithmetic unit 1 and the arithmetic unit 2 is low.

As shown in FIG. 17, five processing steps 2, 3, 5, 7, and 9 are assigned to the arithmetic unit 4 having the highest processing performance, and two processing steps 1 and 6 are assigned to the arithmetic unit 3 having the next highest processing performance. Processing steps 4 and 8 are assigned to the arithmetic units 1 and 2 having low processing performance, respectively. By making the number of processing steps assigned to the arithmetic unit different according to the processing performance of the arithmetic unit in this way, it is possible to suppress variations in the processing time for each arithmetic unit.
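The loop of steps S500 to S540 and its effect on the processing time can be sketched as follows. Several things here are assumptions made only for illustration: the standard-library call random.choices with weights is swapped in for the inverse function method, the relative performance values 1, 1, 2, 5 are hypothetical, and every processing step is given the same amount of work even though the disclosure assumes unequal amounts.

```python
import random


def allocate_weighted(m, performance, rng):
    """Assign step i to unit r(i), drawn at a frequency proportional to performance."""
    unit_ids = list(range(1, len(performance) + 1))
    allocation = {}
    i = 1                                                      # S500
    while i <= m:                                              # S540
        r_i = rng.choices(unit_ids, weights=performance)[0]    # S510
        allocation[i] = r_i                                    # S520
        i += 1                                                 # S530
    return allocation


def unit_times(allocation, workload, performance):
    """Processing time of each unit: sum of assigned work divided by its performance."""
    times = [0.0] * len(performance)
    for step, unit in allocation.items():
        times[unit - 1] += workload[step] / performance[unit - 1]
    return times


performance = [1.0, 1.0, 2.0, 5.0]                 # hypothetical relative performance
workload = {step: 1.0 for step in range(1, 10)}    # hypothetical: equal work per step

# Allocation described for FIG. 17: unit 4 gets five steps, unit 3 two, units 1 and 2 one each.
fig17 = {2: 4, 3: 4, 5: 4, 7: 4, 9: 4, 1: 3, 6: 3, 4: 1, 8: 2}
print(unit_times(fig17, workload, performance))

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
random_allocation = allocate_weighted(9, performance, rng)
print(unit_times(random_allocation, workload, performance))
```

Under these assumed values, the allocation of FIG. 17 gives every arithmetic unit approximately the same processing time. A single random allocation of only nine steps will generally deviate from this ideal; the frequencies approach the performance ratio as the number of processing steps grows.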

[Effect of Embodiment 3]
As described above, according to the arithmetic circuit of the third embodiment, the CPU 32 randomly generates the identification number of an arithmetic unit at a frequency proportional to the processing performance of each of the n arithmetic units, and assigns the arithmetic unit corresponding to the generated identification number to each processing step. As a result, a larger number of processing steps are assigned to an arithmetic unit having higher processing performance, which makes it possible to suppress variations in the processing time for each arithmetic unit.

Embodiment 4.
In the fourth embodiment, the design method of the arithmetic circuit of the third embodiment will be described. Specifically, we present a design method that can optimize both the processing speed and the circuit area of the entire arithmetic circuit, taking into consideration the difference in processing speed and the difference in circuit area for each arithmetic unit.

For example, suppose that if an arithmetic circuit is configured using all n arithmetic units, which differ from each other in processing performance and circuit area, the area of the entire arithmetic circuit exceeds the allowable range. In this case, it is necessary to select the arithmetic units to be incorporated in the arithmetic circuit so that the area of the entire arithmetic circuit is within the allowable range. If arithmetic units are simply excluded from the arithmetic circuit in descending order of circuit area, the processing speed may not meet the specifications. Therefore, it is necessary to optimize both the processing speed of the entire arithmetic circuit and the circuit area. Hereinafter, a specific description will be given with reference to FIG.

FIG. 18 is a flowchart showing the design procedure of the arithmetic circuit. The design procedure of FIG. 18 is executed, for example, by the CPU of the design support device.

In step S600 of FIG. 18, the CPU selects, based on the circuit area of each of the arithmetic units 1 to n, at least one arithmetic unit to be excluded from the arithmetic circuit. The at least one arithmetic unit is selected so that the area of the entire arithmetic circuit is within the permissible range. The CPU sets each selected arithmetic unit to allocation prohibited.

As in the case of the third embodiment, the arithmetic processing executed by the arithmetic circuit includes m (m> n) processing steps capable of parallel processing. The identification number of the processing step is i (however, 1 ≦ i ≦ m, i is an integer), and the first processing step to the mth processing step are sequentially selected.

In the next step S610, the CPU initializes the identification number i of the processing step to 1.

In the next step S620, the CPU generates integer random numbers in the range of 1 to n as identification numbers of n arithmetic units at a frequency proportional to the processing performance of the arithmetic units 1 to n. For the generation of such an integer random number, for example, the above-mentioned inverse function method or the von Neumann rejection method is used. Let the generated integer random number be r (i). 1 ≦ r (i) ≦ n holds.

In the next step S630, the CPU determines whether or not the allocation to the r (i) th arithmetic unit is prohibited. If the allocation is prohibited (YES in step S630), the CPU returns the process to step S620. On the other hand, if the allocation is possible (NO in step S630), the CPU advances the process to step S640.

In step S640, the CPU allocates the i-th processing step to the r (i) -th arithmetic unit using the generated integer random number r (i). In the next step S650, the CPU increments the identification number i of the processing step by 1.

The above steps S620 to S650 are repeated until the parameter i representing the identification number of the processing step exceeds m (until YES is obtained in step S660). As described above, each of the m processing steps is assigned to any of the arithmetic units except the arithmetic unit whose allocation is prohibited.

In the next step S670, the CPU calculates the time required for the arithmetic processing by simulation or the like based on the allocation result of the above processing step.

In the next step S680, if the CPU is to set a different arithmetic unit to allocation prohibited (YES in step S680), the process returns to step S600 and each of the above steps is repeated. For example, the CPU may allocate processing steps and calculate the processing time for all combinations of arithmetic units for which the area of the entire arithmetic circuit fits within the allowable area.

In the next step S690, the CPU selects the combination of arithmetic units when the processing time is the shortest as the arithmetic unit to be incorporated in the arithmetic circuit. This makes it possible to optimize both the processing speed of the entire arithmetic circuit and the circuit area.

The above arithmetic circuit design method can be summarized in the following procedures (i) to (iv). Procedures (i) to (iv) are realized, for example, by causing a computer serving as a design support device to execute a program.

(i) The computer determines a combination of a plurality of arithmetic units having different processing performances and circuit areas from each other so that the total circuit area is equal to or less than a predetermined upper limit value (step S600).

(ii) For each of the m processing steps, the computer randomly generates one of the identification numbers of the plurality of arithmetic units constituting the above combination, at a frequency proportional to the processing performance of each arithmetic unit constituting the combination. Then, the computer assigns each processing step to the arithmetic unit corresponding to the generated identification number (steps S610 to S660).

(iii) The computer estimates the processing time of the m processing steps based on the result of allocating the m processing steps to the plurality of arithmetic units constituting the above combination (step S670).

(iv) The computer determines a plurality of combinations of arithmetic units by executing procedure (i) a plurality of times, and estimates the processing time of the m processing steps for each of the plurality of combinations by executing procedures (ii) and (iii) for each combination (when YES in step S680). The computer determines the combination of arithmetic units having the shortest processing time as the arithmetic units used in the arithmetic circuit (step S690).
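Procedures (i) to (iv) can be sketched as follows. This is an illustration under several assumptions, not the design tool of the disclosure: all names and numerical values are hypothetical, the processing time is estimated as the busy time of the slowest arithmetic unit, random.choices with weights is swapped in for the inverse function or rejection method, and random numbers are drawn only among the allowed arithmetic units, which yields the same distribution as redrawing whenever a prohibited unit is generated in step S630.

```python
import itertools
import random


def estimate_time(units, performance, workload, rng):
    """Procedures (ii) and (iii): weighted random allocation, then the slowest unit's time."""
    busy = {u: 0.0 for u in units}
    weights = [performance[u] for u in units]
    for work in workload:
        u = rng.choices(units, weights=weights)[0]
        busy[u] += work / performance[u]
    return max(busy.values())


def design(performance, area, workload, area_limit, rng):
    """Procedures (i) and (iv): try every combination within the area limit, keep the fastest."""
    all_units = sorted(performance)
    best = None
    for size in range(1, len(all_units) + 1):
        for combo in itertools.combinations(all_units, size):
            if sum(area[u] for u in combo) > area_limit:
                continue  # this combination exceeds the allowable area
            t = estimate_time(list(combo), performance, workload, rng)
            if best is None or t < best[1]:
                best = (combo, t)
    return best  # None if no combination fits


# Hypothetical values: unit 4 is fastest, unit 2 has the largest circuit area.
performance = {1: 1.0, 2: 1.0, 3: 2.0, 4: 4.0}
area = {1: 1.0, 2: 4.0, 3: 1.0, 4: 3.0}
workload = [3.0, 1.0, 2.0, 5.0, 1.0, 4.0, 2.0, 3.0]  # unequal work per processing step

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
best = design(performance, area, workload, area_limit=5.0, rng=rng)
print(best)
```

Because each combination is evaluated with a single random allocation, the estimate is noisy; averaging several allocations per combination would be one way to stabilize the comparison.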

FIG. 19 is a diagram conceptually showing a specific example of the design method of the arithmetic circuit shown in FIG. In FIG. 19, among the four arithmetic units 1 to 4 (n = 4), three arithmetic units actually incorporated in the arithmetic circuit are selected. The arithmetic processing shall include eight processing steps (m = 8). Further, it is assumed that the fourth arithmetic unit 4 has the highest processing performance, the processing performance of the arithmetic unit 3 is the second highest, and the processing performance of the arithmetic unit 1 and the arithmetic unit 2 is low. Further, it is assumed that the second arithmetic unit 2 has the largest circuit area, the circuit area of the arithmetic unit 4 is the second largest, and the circuit areas of the arithmetic unit 1 and the arithmetic unit 3 are small.

As shown in FIG. 19, in order to keep the area of the entire arithmetic circuit within the allowable range, no processing step is assigned to the arithmetic unit 2 having the largest circuit area. Processing steps are assigned to the other arithmetic units 1, 3, and 4 according to the processing speed. Specifically, five processing steps 2, 3, 5, 7, and 9 are assigned to the arithmetic unit 4 having the highest processing performance, and two processing steps 1 and 6 are assigned to the arithmetic unit 3 having the next highest processing performance. Processing steps 4 and 8 are assigned to the arithmetic unit 1 having low processing performance.

After that, the processing time of the entire arithmetic processing is calculated based on the arithmetic processing amount of each processing step and the processing speed of the arithmetic units 1, 3, and 4. Finally, the combination of arithmetic units incorporated in the arithmetic circuit is determined so that the area of the entire arithmetic circuit is within the allowable range and the processing time is the shortest.

Embodiment 5.
In the fifth embodiment, the random number generation method described in the second embodiment is applied to the optimization of the circuit layout of the logic cell.

For example, in the design of an LSI (Large Scale Integration) circuit, suppose that logic cells, each composed of an arithmetic unit, are randomly assigned to a plurality of circuit areas in a semiconductor chip. In this case, the number of logic cells assigned to each circuit area varies depending on the random numbers generated. As a result, the occupied area may vary from one circuit area to another.

Therefore, in the fifth embodiment, the logic cells are assigned so as to satisfy the condition that the number of logic cells assigned to each circuit area is almost equal, that is, that the difference in the number of logic cells between circuit areas is within one.

Specifically, as described in the second embodiment, the circuit areas are selected cyclically in a fixed order. On the other hand, the logic cells assigned to each circuit area are randomly selected using uniform random numbers generated so as not to overlap. As a result, n logic cells can be allocated almost evenly to m circuit areas, that is, so that the difference in the number of logic cells between circuit areas is within 1, without losing randomness. This makes it possible to suppress variations in the occupied area for each circuit area.

The embodiments disclosed this time should be considered to be exemplary in all respects and not restrictive. The scope of this application is indicated by the scope of claims rather than the above description, and is intended to include all modifications within the meaning and scope of the claims.

30 image recognition system, 31 signal input unit, 32 CPU, 33 memory, 35 parallel processing calculation unit, 36 reader / writer, 37 network interface, 38 bus interconnect, 40 arithmetic circuit, 41 dedicated memory, 42, 60 input data, 43 Input data control unit, 44 arithmetic unit, 61 kernel, 62 output data.

Claims

An arithmetic circuit that performs arithmetic processing on input data, wherein
the arithmetic processing for the input data includes m processing steps that can be processed in parallel with each other, and the arithmetic processing amount of each of the m processing steps differs from the arithmetic processing amount of at least one of the remaining processing steps,
the arithmetic circuit comprises:
n arithmetic units that execute the m processing steps, where n is an integer greater than or equal to 2 and smaller than m; and
a control processor, and
the control processor randomly assigns each of the m processing steps to any one of the n arithmetic units based on a random number.
The arithmetic circuit according to claim 1, wherein
the n arithmetic units are associated with n identification numbers, respectively, and
the control processor uniformly and randomly generates any one of the n identification numbers for each of the m processing steps, and assigns each processing step to the arithmetic unit corresponding to the generated identification number.
The arithmetic circuit according to claim 1, wherein the control processor randomly assigns each of the m processing steps to any one of the n arithmetic units based on a random number so that the difference in the number of processing steps assigned to each arithmetic unit is 1 or less.
The arithmetic circuit according to claim 3, wherein
the control processor cyclically selects, in a fixed order, the arithmetic unit to be assigned to each of the m processing steps, and
the control processor randomly selects the processing step to be assigned to each of the n arithmetic units using uniform random numbers generated so as not to overlap.
The arithmetic circuit according to claim 1, wherein
the n arithmetic units have different processing performances from each other,
the n arithmetic units are associated with n identification numbers, respectively, and
the control processor randomly generates any one of the n identification numbers for each of the m processing steps at a frequency proportional to the processing performance of each of the n arithmetic units, and assigns each processing step to the arithmetic unit corresponding to the generated identification number.
The arithmetic circuit according to any one of claims 1 to 5, wherein the arithmetic processing for the input data includes a convolutional arithmetic in a convolutional neural network model.
An arithmetic method for input data, wherein
the arithmetic processing for the input data includes m processing steps that can be processed in parallel with each other, and the arithmetic processing amount of each of the m processing steps differs from the arithmetic processing amount of at least one of the remaining processing steps,
the arithmetic method comprising:
a step in which a control processor randomly assigns each of the m processing steps to any one of n arithmetic units based on a random number; and
a step in which each of the n arithmetic units executes at least one assigned processing step.
A program for causing an arithmetic circuit to execute arithmetic processing on input data, wherein
the arithmetic processing for the input data includes m processing steps that can be processed in parallel with each other, and the arithmetic processing amount of each of the m processing steps differs from the arithmetic processing amount of at least one of the remaining processing steps,
the arithmetic circuit includes:
n arithmetic units, where n is an integer greater than or equal to 2 and smaller than m; and
a control processor, and
the program
causes the control processor to randomly assign each of the m processing steps to any one of the n arithmetic units based on a random number, and
causes each of the n arithmetic units to execute at least one assigned processing step.
A method for designing the arithmetic circuit according to claim 5, comprising:
a step of determining a combination of a plurality of arithmetic units having different processing performances and circuit areas from each other so that the total circuit area is equal to or less than a predetermined upper limit value;
a step of randomly generating, for each of the m processing steps, one of the identification numbers of the plurality of arithmetic units constituting the combination at a frequency proportional to the processing performance of each arithmetic unit constituting the combination, and assigning each processing step to the arithmetic unit corresponding to the generated identification number;
a step of estimating the processing time of the m processing steps based on the result of allocating the m processing steps to the plurality of arithmetic units constituting the combination;
a step of determining a plurality of combinations of arithmetic units by executing the step of determining the combination a plurality of times, and estimating the processing time of the m processing steps for each of the plurality of combinations by executing the step of assigning and the step of estimating for each of the plurality of combinations; and
a step of determining the combination of arithmetic units having the shortest processing time as the n arithmetic units used in the arithmetic circuit.