WO2020084692A1

WO2020084692A1 - Computation processing device and computation processing device control method

Info

Publication number: WO2020084692A1
Application number: PCT/JP2018/039370
Authority: WO
Inventors: 洋征和田
Original assignee: 富士通株式会社
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2020-04-30
Also published as: JP6984762B2; JPWO2020084692A1

Abstract

A random number generation circuit (121) generates a random number. On the basis of the location in which a number to be rounded is positioned and decimal point location information for output data, a multiplier (125) shifts the location of the random number such that the beginning of the random number matches the rounding location of the number to be rounded. An adder (129) adds the random number that was shifted by the multiplier (125) and the number to be rounded, which was positioned in a prescribed location. A rounding circuit (135) outputs, as output data, data of a prescribed range that includes a prescribed number of digits of significant figures from the rounding location in the addition result from the adder (129).

Description

Arithmetic processing device and method for controlling arithmetic processing device

The present invention relates to an arithmetic processing device and a control method for the arithmetic processing device.

Deep learning, which has become increasingly important in recent years, requires a huge amount of calculation and memory. As the amount of calculation and memory consumption increases, the computational load and the memory load increase, and the learning time becomes long. Therefore, in order to reduce the amount of calculation and memory consumption and shorten the learning time, it is desirable to perform calculation with the lowest possible precision while maintaining the learning and inference capabilities. In such a method, fixed-point operations are often used.

However, a problem that occurs when calculating with low precision is that the rounded values of operation results tend to be biased toward particular values. Once the rounded values are biased, learning has difficulty progressing from that point.

There is a conventional technique that introduces probabilistic rounding in order to eliminate such a bias in values at low precision. In this technique, the rounded value is selected, with a probability that depends on the value below the rounding digit in the calculation result before rounding, from either the value truncated at the rounding digit or that value plus 1, so that the expected value of the rounding result equals the value before rounding. For example, when the operation result 1.8 is rounded to an integer using probabilistic rounding, the result is 2 with a probability of 80% and 1 with a probability of 20%. As a result, the expected value of the rounding result is 1.8, which is the same as the value before rounding.
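The 1.8 example can be checked with a minimal Python sketch. The function names `stochastic_round` and `expected_value` are hypothetical and do not appear in the publication; the sketch only illustrates the general idea of probabilistic rounding, not any circuit described here.

```python
import random
from fractions import Fraction


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x to an integer, rounding up with probability equal to its fraction."""
    floor = int(x // 1)   # value truncated at the rounding digit
    frac = x - floor      # part below the rounding digit, in [0, 1)
    return floor + 1 if rng.random() < frac else floor


def expected_value(x: Fraction) -> Fraction:
    """Exact expected value of the probabilistic rounding of x."""
    floor = x.numerator // x.denominator
    frac = x - floor
    return frac * (floor + 1) + (1 - frac) * floor


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only so that the demo is repeatable
    samples = [stochastic_round(1.8, rng) for _ in range(100_000)]
    print(sum(samples) / len(samples))        # sample mean, close to 1.8
    print(expected_value(Fraction(18, 10)))   # exactly 9/5, i.e. 1.8
```

Using `Fraction` for the expected value avoids floating-point error, so the equality with the value before rounding holds exactly.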

In deep learning, the product-sum accumulation (multiply-accumulate) operation is often applied to matrix elements. The product-sum accumulation operation has the form C' = C + A × B and accumulates the product of the next elements onto the calculation result obtained so far. While the accumulation continues, the accumulated result is generally held with considerably higher precision than the inputs to the multiplication. For example, when A and B, the inputs to the multiplication in the above expression, are 16 bits wide, the multiplication result A × B is 32 bits wide. The accumulation register that stores the accumulated multiplication results then preferably has a size sufficient to hold the accumulation of 32-bit values, and may have a width of 40 bits, for example.
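The bit widths in this example can be illustrated with a short Python sketch. The helper names `to_signed` and `mac`, and the use of two's-complement wrap-around to model the register, are assumptions made for illustration only.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc: int, a: int, b: int, acc_bits: int = 40) -> int:
    """One step of C' = C + A * B with 16-bit A and B and an acc_bits-wide C."""
    a16, b16 = to_signed(a, 16), to_signed(b, 16)
    product = a16 * b16                         # always fits in 32 bits
    return to_signed(acc + product, acc_bits)   # wraps like a fixed-width register


if __name__ == "__main__":
    acc = 0
    for _ in range(256):
        acc = mac(acc, 32767, 32767)
    # The sum of 256 maximal products exceeds the signed 32-bit range
    # but still fits in a 40-bit accumulator.
    print(acc, acc > 2**31 - 1, acc < 2**39)
```

The extra 8 bits of a 40-bit register leave headroom for 256 worst-case products, which is why the accumulator is wider than a single product.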

From the above, if each multiplication result is rounded, the accuracy may be greatly reduced. Therefore, in hardware that performs deep learning, it is desirable to perform probabilistic rounding on the result of the product-sum accumulation operation, which has more bits than the inputs to the multiplication, and to output a low-precision result.

When performing rounding, there is a method in which a number is added to the lower digits that will be discarded by rounding, so that a carry is generated when desired and the desired rounding result is obtained in the upper digits above the rounding position. This method is also effective for probabilistic rounding. Therefore, a conventional technique has been proposed in which probabilistic rounding is performed by using the addition circuit in the product-sum calculator to perform addition to the lower digits. Further, there is a conventional technique in which an output from a random noise circuit is added to the fractional part of data to perform a rounding process. Further, as a device for rounding an accumulation register, there is a conventional technique in which a rounding circuit receives a value from a random number generator and uses it for the rounding decision.
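The add-then-truncate idea can be sketched on plain integers as follows. The function names are hypothetical, and `discard_bits` (the number of low bits dropped) is a parameter introduced only for this illustration.

```python
import random


def round_by_addition(value: int, discard_bits: int, addend: int) -> int:
    """Add `addend` into the bits that will be discarded, then truncate.

    A carry out of the low `discard_bits` bits raises the kept part by one.
    """
    if not 0 <= addend < (1 << discard_bits):
        raise ValueError("addend must fit in the discarded bits")
    return (value + addend) >> discard_bits


def round_half_up(value: int, discard_bits: int) -> int:
    """Conventional rounding: the addend is half of the lowest kept digit."""
    return round_by_addition(value, discard_bits, 1 << (discard_bits - 1))


def round_stochastic(value: int, discard_bits: int, rng: random.Random) -> int:
    """Probabilistic rounding: the addend is a uniform random number."""
    return round_by_addition(value, discard_bits, rng.getrandbits(discard_bits))


if __name__ == "__main__":
    value = 0b1_1101   # 29, i.e. 1.8125 with 4 fractional bits
    print(round_half_up(value, 4))
    # Averaging over all 16 possible addends gives 29/16 = 1.8125 exactly.
    print(sum(round_by_addition(value, 4, r) for r in range(16)) / 16)
```

With a uniform addend covering all discarded bits, the average over every possible addend equals the unrounded value, which is the property probabilistic rounding relies on.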

U.S. Patent Application Publication No. 2017/0220341; U.S. Patent Application Publication No. 2017/0102920; Japanese Laid-open Patent Publication No. 03-63722; Japanese National Publication of International Patent Application No. 2004-506365

However, in the conventional technique that uses the adder circuit in the product-sum calculator for probabilistic rounding, rounding is performed on the multiplication result. As described above, the bit width of the input values to the multiplication is generally much narrower than the bit width of the register that accumulates the multiplication results. Therefore, if rounding of the accumulation register value to low precision is substituted by rounding of the multiplication result, an incorrect rounded value may be obtained. Further, the conventional technique in which the rounding circuit makes the rounding decision using a value obtained from the random number generator requires a large-scale modification of the rounding circuit. The rounding process by the rounding circuit lies on the critical path of floating-point arithmetic. Therefore, a large-scale modification of the rounding circuit for probabilistic rounding risks worsening the critical-path delay of existing operations. That is, it is difficult to make a large-scale modification to a floating-point product-sum calculator.

The disclosed technique has been made in view of the above, and an object thereof is to provide an arithmetic processing device that executes appropriate probabilistic rounding with a simple configuration, and a control method for the arithmetic processing device.

In one aspect of the arithmetic processing device and the control method of the arithmetic processing device disclosed in the present application, the random number generation unit generates a random number. The random number moving unit moves the position of the random number, based on the position where the rounding target number is arranged and the decimal point position information of the output data, so that the beginning of the random number coincides with the rounding position of the rounding target number. The adding unit adds the random number moved by the random number moving unit and the rounding target number arranged at the predetermined position. The output unit outputs, as the output data, data in a predetermined range of the addition result of the adding unit, the range containing a predetermined number of significant digits counted from the rounding position.

In one aspect, the present invention can perform appropriate probabilistic rounding with a simple configuration.

FIG. 1 is a diagram showing an overall configuration diagram of an information processing apparatus. FIG. 2 is a circuit diagram of the product-sum calculator. FIG. 3 is a diagram showing an outline of the calculation of the probabilistic rounding process by the product-sum calculator according to the embodiment. FIG. 4 is a diagram for explaining digit shift of random numbers. FIG. 5 is a diagram for explaining alignment of added values by the normalization shifter. FIG. 6 is a diagram illustrating a specific example of the probabilistic rounding process. FIG. 7 is a flowchart of the entire processing executed by the product-sum calculation unit. FIG. 8 is a flowchart of the probabilistic rounding process by the product-sum calculator according to the embodiment.

An embodiment of the arithmetic processing device and the control method for the arithmetic processing device disclosed in the present application will be described below in detail with reference to the drawings. It should be noted that the arithmetic processing device and the method for controlling the arithmetic processing device disclosed in the present application are not limited by the following embodiments.

FIG. 1 is an overall configuration diagram of the information processing device. The information processing device 50 includes a PCI (Peripheral Component Interconnect) card 1 and a host computer 2. The PCI card 1 and the host computer 2 are connected by a PCI bus and exchange data with each other.

The host computer 2, for example, performs overall management when executing deep learning. When executing deep learning, the host computer 2 instructs the PCI card 1 to execute a predetermined calculation in deep learning such as a convolution calculation.

The PCI card 1 receives a command from the host computer 2, executes a calculation, and outputs the calculation result to the host computer 2. As shown in FIG. 1, the PCI card 1 has a plurality of processing units 10, an overall instruction control unit 11, a memory controller 12, a memory 13 and a PCI control unit 14. The PCI card 1 corresponds to an example of “arithmetic processing device”.

The PCI control unit 14 receives from the host computer 2 an input of an operation instruction for instructing execution of operation and operation data used in the operation. Then, the PCI control unit 14 outputs the acquired operation command and operation data to the memory controller 12.

Further, the PCI control unit 14 receives from the memory controller 12 the calculation result for the designated calculation. Then, the PCI control unit 14 outputs the calculation result to the host computer 2. Specifically, the PCI control unit 14 instructs the memory controller 12 to read the calculation result from the memory 13, and outputs the read data to the host computer 2.

The memory controller 12 receives, from the PCI control unit 14, input of operation instructions and operation data used in the operation. Then, the memory controller 12 stores the acquired operation instruction and operation data in the memory 13.

Further, the memory controller 12 receives from the overall instruction control unit 11 an instruction to store the operation data used when executing the operation in the vector register 111. Then, the memory controller 12 stores the designated operation data in the designated vector register 111. Here, when transmitting data to the subsequent processing unit 10 of the processing units 10 arranged in series, the memory controller 12 bypasses the product-sum calculation unit 100 and outputs the calculation data to the multiplexer 103.

When the operation is completed, the memory controller 12 receives a notification of the completion of the operation from the overall instruction control unit 11 and, as directed by a predetermined instruction, reads the operation results in the vector registers 111 out through the serially arranged processing units 10 and stores them in the memory 13.

The overall instruction control unit 11 performs overall management of the operations that the host computer 2 instructs it to execute. The overall instruction control unit 11 receives an instruction from the host computer 2 via the PCI control unit 14, and sequentially reads and executes the overall instruction sequence stored in the memory 13. The overall instructions include an instruction for transferring an arithmetic instruction sequence from the memory 13 to the arithmetic instruction buffer 102, an instruction for storing operation data from the memory 13 in the vector register 111, an instruction for causing the arithmetic instruction control unit 101 to start executing the arithmetic instruction sequence stored in the arithmetic instruction buffer 102, an instruction for storing the operation result held in the vector register 111 in the memory 13, an instruction for ending the execution of the instruction sequence, and the like.

The overall instruction control unit 11 causes the processing unit 10 to execute the arithmetic instruction sequence. When causing the processing unit 10 to execute an operation, the overall instruction control unit 11 instructs the memory controller 12 to acquire the operation data used in the operation. Further, when the calculation in the processing unit 10 is completed, the overall instruction control unit 11 instructs the memory controller 12 to store the calculation result. Further, when all the processing of the operation it was instructed to execute is completed, the overall instruction control unit 11 notifies the memory controller 12 of the completion of the operation.

Next, the processing unit 10 will be described. A plurality of processing units 10 are mounted on one PCI card 1 as shown in FIG. 1. The processing units 10 are connected in parallel and in series. The number of processing units 10 is, for example, 128. Each processing unit 10 includes a product-sum calculation unit 100, an arithmetic instruction control unit 101, an arithmetic instruction buffer 102, and a multiplexer 103.

The arithmetic instruction control unit 101 manages and controls the execution of arithmetic instructions. The arithmetic instruction control unit 101 receives an instruction to execute an arithmetic instruction sequence from the overall instruction control unit 11. Here, an instruction that can be executed by the processing unit 10 is called an arithmetic instruction, in contrast to the overall instruction. Arithmetic instructions include arithmetic instructions in the narrow sense, which cause the product-sum calculator to perform an operation, as well as instructions that operate on general-purpose registers (not illustrated), branch instructions, repeat instructions, and instructions that stop the execution of an instruction sequence.

Then, the arithmetic instruction control unit 101 sequentially acquires the arithmetic instructions stored in the arithmetic instruction buffer 102. Next, the arithmetic instruction control unit 101 instructs the vector register 111 to output the operation data designated by the acquired arithmetic instruction. Further, the arithmetic instruction control unit 101 instructs the product-sum calculator 112 to execute the operation according to the acquired arithmetic instruction. After that, the arithmetic instruction control unit 101 repeats the operation in a loop, using the operation result held in the product-sum calculator 112.

Then, when the operation is completed, the arithmetic instruction control unit 101 instructs the product-sum calculator 112 to execute the probabilistic rounding process and outputs to it decimal point position information, which indicates which bit of the accumulation register is to be the decimal point position of the output. The decimal point position information represents the decimal point position calculated from the learning results so that the weight parameters of each layer of the neural network keep as many significant digits as possible within the bit width that can be calculated. This value is determined while the deep learning program is running, and is a variable value for the information processing device 50. Then, when an instruction is given from the memory controller 12, the probabilistic rounding result stored in the vector register 111 is stored in the memory 13 via the chain of processing units 10 and the memory controller 12.

The arithmetic instruction control unit 101 issues, for example, an instruction such as VECTOR.h.accstrnd ELE#, QNUM, DST#. Here, QNUM represents the decimal point position information. ELE# is a number indicating which element is to be subjected to probabilistic rounding when the product-sum calculator 112 has a register holding a plurality of elements. DST# is the number of the register that stores the result of the probabilistic rounding. When this instruction is executed, the bit range corresponding to QNUM in the element of the fixed-point accumulation register specified by ELE# is probabilistically rounded based on the value below that range, and the result is stored as a fixed-point value in the register designated by DST#.

The arithmetic instruction buffer 102 is a storage area for storing an arithmetic instruction sequence. The arithmetic instruction buffer 102 stores the arithmetic instruction sequence input from the memory controller 12 in input order, starting from the designated address. After that, in response to a request for an arithmetic instruction from the arithmetic instruction control unit 101, the arithmetic instruction buffer 102 outputs the arithmetic instruction at the requested address to the arithmetic instruction control unit 101.

The product-sum calculation unit 100 has a vector register 111 and a product-sum calculator 112. Note that the vector register 111 included in the product-sum calculation unit 100 corresponds to a part of the entire vector register mounted in the processing unit 10.

The vector register 111 receives from the memory controller 12 the operation data used when executing an operation, and stores it. After that, on receiving an instruction from the arithmetic instruction control unit 101, the vector register 111 outputs the operation data used in the operation to the product-sum calculator 112. In addition, in the case of the product-sum accumulation operation, the vector register 111 receives the operation result subjected to the probabilistic rounding process from the product-sum calculator 112 after the loop processing by the product-sum calculator 112 is completed. When instructed by the memory controller 12 to output to the memory 13, the vector register 111 outputs to the multiplexer 103 the operation result of the product-sum calculator 112 that has been subjected to the probabilistic rounding process.

The product-sum calculator 112 receives an instruction to execute a calculation from the arithmetic instruction control unit 101. Then, the product-sum calculator 112 executes the product-sum calculation using the operation data input from the vector register 111. After that, the product-sum calculator 112 outputs the calculation result to the vector register 111. When the instruction specifies accumulation, the product-sum calculator 112 holds the accumulated result in a register (accumulator) inside the calculator and uses it in subsequent accumulation instructions. The product-sum calculator 112 repeats the product-sum calculation on the values input from the vector register 111 until the product-sum accumulation operation is completed.

After that, when the loop processing of the product-sum accumulation operation is completed, the product-sum calculator 112 receives an instruction to execute the probabilistic rounding process from the arithmetic instruction control unit 101. At that time, the product-sum calculator 112 also receives the decimal point position information from the arithmetic instruction control unit 101. Then, the product-sum calculator 112 takes the calculation result held in its internal register after the loop processing as the probabilistic rounding target number, and executes the probabilistic rounding process on it using the circuit that performs the product-sum calculation. After that, the product-sum calculator 112 outputs the calculation result that has been subjected to the probabilistic rounding process to the vector register 111, where it is stored.

Here, the function of the product-sum calculator 112 for performing probabilistic rounding will be described in detail with reference to FIG. FIG. 2 is a circuit diagram of the product-sum calculator.

The product-sum calculator 112 includes a random number generation circuit 121, a power generation unit 122, multiplexers 123 and 124, a multiplier 125, an exponent sign calculation unit 126, a digit shifter 127, a multiplexer 128, and an adder 129. Further, the product-sum calculator 112 includes a fixed-point register 130, a digit loss amount prediction unit 131, a shift amount calculation unit 132, a multiplexer 133, a normalization shifter 134, and a rounding circuit 135. Here, the product-sum calculator 112 performs two processes: an actual calculation process requested by the host computer 2, such as convolution, and a process of calculating a probabilistically rounded value of the calculation result. In the following, the former is called the actual calculation, and the latter is called the probabilistic rounding process.

The operation of each part in the actual calculation will now be described. In the actual calculation, the multiplexer 124 receives one of the two pieces of operation data to be multiplied, and the multiplexer 123 receives the other. The multiplexers 123 and 124 output the operation data to the multiplier 125. The multiplier 125 multiplies the two pieces of operation data input from the multiplexers 123 and 124. Then, the multiplier 125 outputs the multiplication result to the adder 129.

Meanwhile, the exponent sign calculation unit 126 receives from the vector register 111 the three pieces of operation data used in the actual calculation. When the arithmetic instruction is a floating-point product-sum operation, the exponent sign calculation unit 126 calculates the shift amount for aligning the digits of the mantissas of the product and the addend. When the arithmetic instruction is a fixed-point operation, the exponent sign calculation unit 126 uses a constant built into the hardware in advance as the shift amount of the digit shifter 127, so that the digits of the addend match the digits of the multiplication result. The exponent sign calculation unit 126 also calculates the sign of the calculation result. Then, the exponent sign calculation unit 126 outputs to the digit shifter 127 the remaining operation data (the addend), that is, the operation data other than those input to the multiplexers 123 and 124, together with the shift amount for aligning the addend with the multiplication result of the multiplier 125.

Further, when the arithmetic instruction is a floating-point product-sum operation, the exponent sign calculation unit 126 receives the predicted digit loss amount from the digit loss amount prediction unit 131. Then, from the obtained digit loss amount, the exponent sign calculation unit 126 calculates the left shift amount used to normalize the mantissa of the calculation result. Normalizing the mantissa means shifting the bit positions of the entire mantissa so that its most significant digit is 1. After that, the exponent sign calculation unit 126 outputs the calculated left shift amount to the multiplexer 133.
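Mantissa normalization as defined here can be sketched in a few lines of Python. The function name `normalize` and the `width` parameter are hypothetical, and the left shift amount is computed directly by counting leading zeros, whereas the circuit described uses a predicted value.

```python
from typing import Tuple


def normalize(mantissa: int, width: int) -> Tuple[int, int]:
    """Shift a width-bit mantissa left until its top bit is 1.

    Returns (normalized mantissa, left shift amount). Zero is returned
    unchanged because it has no leading 1 to align.
    """
    if not 0 <= mantissa < (1 << width):
        raise ValueError("mantissa does not fit in the given width")
    if mantissa == 0:
        return 0, 0
    shift = width - mantissa.bit_length()   # number of leading zero bits
    return mantissa << shift, shift


if __name__ == "__main__":
    normalized, shift = normalize(0b00010110, 8)
    print(format(normalized, "08b"), shift)
```

In a floating-point result, the exponent is reduced by the same shift amount so that the represented value is unchanged.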

The digit loss amount obtained by the digit loss amount prediction unit 131 may, depending on the circuit configuration, contain an error within a predetermined range of the true digit loss amount. Whether the predicted digit loss amount contains an error can be determined by checking whether the mantissa is correctly normalized after the normalization shifter 134 has shifted it by the designated amount. When an error is found in the digit loss amount, the normalization shifter 134 performs an additional shift for adjustment and notifies the exponent sign calculation unit 126 that the prediction was in error. When the exponent sign calculation unit 126 receives the error notification from the normalization shifter 134, it calculates the exponent taking the additional shift for error adjustment into account. When it does not receive the error notification, it calculates the exponent for the case without an additional shift. In addition, the exponent sign calculation unit 126 receives a notification from the rounding circuit 135 and adjusts the exponent when a carry occurs due to rounding of the mantissa. The exponent sign calculation unit 126 outputs the finally obtained sign and exponent, which are concatenated with the rounded mantissa output from the rounding circuit 135 described later to complete the floating-point operation result. The operation result is output to the vector register 111.

The digit shifter 127 shifts the operation data input from the exponent sign calculation unit 126 by the shift amount specified by the exponent sign calculation unit 126. Then, the digit shifter 127 outputs the digit-aligned operation data to the multiplexer 128.

The multiplexer 128 selects the input from the digit shifter 127 in the case of a floating-point actual calculation. In a fixed-point accumulation, while the accumulation is in progress the numbers to be added need no digit alignment, so the input from the fixed-point register 130 is selected. In the first operation of a fixed-point accumulation, the number input from the vector register 111 to the product-sum calculator 112 is used as the addend, so the input from the digit shifter 127 is selected. Then, the multiplexer 128 outputs the selected operation data to the adder 129.

The adder 129 receives the multiplication result of the two pieces of operation data from the multiplier 125. The adder 129 also receives from the multiplexer 128 the remaining, digit-aligned operation data (the addend). Then, the adder 129 adds the multiplication result and the digit-aligned addend, and outputs the addition result to the normalization shifter 134 and the digit loss amount prediction unit 131. In practice, the adder 129 performs the addition in two stages, namely carry-save addition, which yields two numbers (a sum signal and a carry signal), followed by carry-propagate addition; here, the two stages together are simply called addition. What the adder 129 actually outputs to the digit loss amount prediction unit 131 is the result of the carry-save addition.
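The two-stage addition can be illustrated with a generic carry-save sketch in Python. That the first stage takes exactly three inputs, and the function names below, are assumptions for illustration; the text does not specify the internal structure of the adder 129 at this level.

```python
from typing import Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """Stage 1 (3:2 compression): reduce three addends to a sum word and a carry word.

    No carry ripples between bit positions, so every bit is computed independently.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return partial_sum, carry


def carry_propagate_add(partial_sum: int, carry: int, width: int) -> int:
    """Stage 2: an ordinary addition that resolves the carries, truncated to width bits."""
    return (partial_sum + carry) & ((1 << width) - 1)


if __name__ == "__main__":
    s, c = carry_save_add(0b1011, 0b0110, 0b1101)
    print(s, c, carry_propagate_add(s, c, 8))   # the final value is 11 + 6 + 13
```

The intermediate pair is available before the slower carry propagation finishes, which is consistent with the digit loss amount prediction unit 131 working from the carry-save result.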

The digit loss amount prediction unit 131 receives the intermediate addition result from the adder 129. Then, the digit loss amount prediction unit 131 predicts the digit loss amount from the acquired addition result. After that, the digit loss amount prediction unit 131 outputs the predicted digit loss amount to the exponent sign calculation unit 126.

The multiplexer 133 selects the input from the exponent sign calculation unit 126 in the case of actual calculation. Then, the multiplexer 133 outputs the left shift amount input from the exponent sign calculation unit 126 to the normalization shifter 134.

The normalization shifter 134 receives the addition result from the adder 129. Further, the normalization shifter 134 receives the left shift amount from the multiplexer 133. Then, the normalization shifter 134 shifts the addition result to the left by the input left shift amount to adjust the output position. After that, the normalization shifter 134 outputs the left-shifted operation result to the rounding circuit 135. Here, when the alignment by the left shift reveals an error in the predicted digit loss amount, the normalization shifter 134 performs an additional shift for error adjustment and sends an error notification to the exponent sign calculation unit 126.

The rounding circuit 135 receives the input of the calculation result from the normalization shifter 134. Then, the rounding circuit 135 executes rounding with a predetermined number of digits. After that, the rounding circuit 135 outputs the calculation result rounded to a predetermined digit to the vector register 111.

Next, the operation of each part in the probabilistic rounding process will be explained. In the case of probabilistic rounding processing, the random number generation circuit 121 generates an n-bit uniform random number. Then, the random number generation circuit 121 outputs the generated random number to the multiplexer 123. Here, the random number generated by the random number generation circuit 121 does not have to be a truly uniform random number, but may be a pseudo random number within a practically usable range.

The random number generation circuit 121 uses, for example, an LFSR (Linear Feedback Shift Register) arranged in the product-sum calculator 112. For example, in the case of using the LFSR, in the present embodiment, the internal state of the LFSR is updated every time the probabilistic rounding instruction is executed, and the LFSR outputs a new random number value when the next probabilistic rounding instruction is issued.
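As an illustration of an LFSR-based generator, the following Python sketch models a 16-bit Fibonacci LFSR. The register width, the tap positions (the widely used polynomial x^16 + x^14 + x^13 + x^11 + 1), the seed, and the class name `Lfsr16` are all assumptions; the publication does not specify them, nor how many shifts are performed per rounding instruction.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR, taps 16, 14, 13, 11 (an assumed, common choice)."""

    def __init__(self, seed: int = 0xACE1) -> None:
        if seed & 0xFFFF == 0:
            raise ValueError("an all-zero state never changes")
        self.state = seed & 0xFFFF

    def step(self) -> int:
        """Advance the internal state by one shift and return the new state."""
        s = self.state
        feedback = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (feedback << 15)
        return self.state

    def next_random(self, n: int) -> int:
        """Collect n output bits, shifting once per bit (assumed policy)."""
        value = 0
        for _ in range(n):
            value = (value << 1) | (self.state & 1)
            self.step()
        return value


if __name__ == "__main__":
    lfsr = Lfsr16()
    # One call per probabilistic rounding instruction yields a fresh value.
    print([lfsr.next_random(8) for _ in range(4)])
```

Because the state is updated on every call, each rounding instruction sees a different pseudo random value, and the sequence repeats only after the full period of the register.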

However, the random number generation circuit 121 is not limited to the LFSR, and may be a pseudo random number generation circuit having higher randomness, or a circuit that acquires a random bit from the fluctuation of the environment. The random number generation circuit 121 may be a circuit attached to each product-sum calculation unit 112 or a circuit shared by a plurality of product-sum calculation units 112. The random number generation circuit 121 corresponds to an example of “random number generation unit”.

Here, how to choose the number of bits of the random number will be described. When rounding is performed probabilistically, the bits of the probabilistic rounding target number below the rounding position are ultimately discarded. Let n be the number of bits of the random number and m the number of discarded bits. If the random number is uniform and n is equal to or greater than m, the expected value after probabilistic rounding can be made equal to the probabilistic rounding target number. However, if the relationship n ≧ m is to be maintained even when m is large, the number of bits to be calculated increases, and the circuits for digit alignment and addition grow accordingly. If they grow too large, the existing hardware of the product-sum calculator 112 can no longer accommodate an arithmetic circuit of the required bit width. In that case, adding a dedicated arithmetic circuit for probabilistic rounding to cover the shortfall is not preferable, because it increases the circuit amount.

Therefore, a method can be considered in which the relationship n ≧ m is not required, n is allowed to be smaller than m, and n is kept to a number of bits that the existing product-sum calculator 112 can accommodate. In that case, of the m bits to be discarded, the bits below the n-th bit counted down from the rounding position have no random number added to them and do not contribute to the result after rounding. The expected value after rounding deviates from the value before rounding by that amount. However, the deviation of the expected value is roughly halved each time n is increased by 1. Therefore, if n is made sufficiently large, the deviation that arises when m is large becomes small enough not to be a practical problem. Accordingly, it is preferable to determine an appropriate value of n according to practical requirements and the amount of existing arithmetic circuitry.
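The size of this deviation can be checked by exhaustive enumeration. In the following Python sketch the function names are hypothetical, and the n-bit random number is assumed to be aligned so that its top bit coincides with the highest discarded bit.

```python
from fractions import Fraction


def expected_rounded(value: int, m: int, n: int) -> Fraction:
    """Exact expected result of rounding away m bits using an n-bit uniform random number."""
    if not 0 < n <= m:
        raise ValueError("this sketch assumes 0 < n <= m")
    shift = m - n   # low bits that receive no random contribution
    total = sum((value + (r << shift)) >> m for r in range(1 << n))
    return Fraction(total, 1 << n)


def deviation(value: int, m: int, n: int) -> Fraction:
    """Value before rounding minus the expected value after rounding."""
    return Fraction(value, 1 << m) - expected_rounded(value, m, n)


def worst_deviation(m: int, n: int) -> Fraction:
    """Largest deviation over every possible pattern of the m discarded bits."""
    return max(deviation(v, m, n) for v in range(1 << m))


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, worst_deviation(8, n))
```

Under these assumptions the worst deviation is (2^(m-n) - 1) / 2^m, so it is zero when n = m and roughly halves for each additional random bit.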

In the case of probabilistic rounding processing, the multiplexer 123 selects the n-bit uniform random number input from the random number generation circuit 121. Then, the multiplexer 123 outputs the n-bit uniform random number to the multiplier 125.

The power generation unit 122 receives an input of a probabilistic rounding instruction from the arithmetic instruction control unit 101. At the same time, the power generation unit 122 acquires the decimal point position information of the operand included in the probabilistic rounding instruction. Then, the power generation unit 122 uses the decimal point position information to generate a power of 2 according to the position to be rounded. Here, the position to be rounded is the digit whose immediately higher digit is the least significant of the significant digits. Then, the power generation unit 122 outputs the generated power of 2 to the multiplexer 124.

In the case of probabilistic rounding processing, the multiplexer 124 selects the power of 2 input from the power generation unit 122. Then, the multiplexer 124 outputs the power of 2 to the multiplier 125.

The multiplier 125 receives an n-bit uniform random number input from the multiplexer 123. Further, the multiplier 125 receives an input of a power of 2 corresponding to the position to be rounded from the multiplexer 124. Then, the multiplier 125 multiplies the obtained random number by the obtained power of 2 to match the leading digit of the random number with the position to be rounded in the probabilistic rounding target number.

Here, with reference to FIG. 3, an outline of calculation of the probabilistic rounding processing by the product-sum calculator 112 according to the present embodiment will be described. FIG. 3 is a diagram showing an outline of the calculation of the probabilistic rounding process by the product-sum calculator according to the embodiment. The position P in the probabilistic rounding target number 200 represents a rounding position. The range L is a range used as a calculation result.

First, the random number generation circuit 121 generates a random number 201 which is an n-bit uniform random number. Then, the multiplier 125 multiplies the random number 201 by a power of 2, which left-shifts the random number 201 so that its head matches the rounding position P of the probabilistic rounding target number 200. At this time, the random number 201 occupies the n bits extending downward from the rounding position P.

Now, referring to FIG. 4, the digit alignment shift of the random number will be described in more detail. FIG. 4 is a diagram for explaining the digit alignment shift of the random number. Here, a case will be described in which the random number 201 is a 12-bit random number with 3 bits below its decimal point. Further, a case will be described where the decimal point position of the fixed-point decimal accumulator 211 (the probabilistic rounding target number) is immediately to the right of the least significant bit.

The multiplier 125 shifts the random number by multiplying the random number 201 by 2 ^ E, which is a power of 2. Here, the decimal point position information is represented as QNUM, and E = QNUM - 9. When QNUM = 0, all digits of the random number lie below the least significant bit of the fixed-point decimal accumulator 211. Further, when QNUM = 12, the least significant digit of the random number is located at the least significant digit of the fixed-point decimal accumulator 211. Here, the position D represents the decimal point position.

In FIG. 4, a frame 210 represents the shift of the random number 201 by the multiplier 125, which performs a 24-bit × 24-bit multiplication in single precision. In FIG. 4, the 12-bit random number 201A shows the position when QNUM = 0. Here, the least significant digit of the multiplier 125 in this case is 2 ^ -8 when considered as a 16-bit fixed-point number. When QNUM = 0, E = -9, so the multiplier 125 would have to multiply by 2 ^ -9, but the multiplier 125 has no circuit for that digit. However, when QNUM = 0, all digits of the random number lie below the least significant bit of the fixed-point decimal accumulator 211 and are truncated, so the random number does not need to be added. Therefore, when QNUM = 0, the multiplier 125 sets 2 ^ E, which is the value to be multiplied, to 0.

When QNUM = 1, E = -8, and the multiplier 125 multiplies the 12-bit random number 201 by 2 ^ -8 to obtain the shifted 12-bit random number 201B. Further, when QNUM = 8, E = -1, and the multiplier 125 multiplies the 12-bit random number 201 by 2 ^ -1 to obtain the shifted 12-bit random number 201C. Further, when QNUM = 24, E = 15, and the multiplier 125 multiplies the 12-bit random number 201 by 2 ^ 15 to obtain the shifted 12-bit random number 201D.

That is, when the output result is the least significant 16-bit data 213 of the fixed-point decimal accumulator 211, QNUM = 0. Further, when the most significant 16-bit data 212 of the fixed-point decimal accumulator 211 is taken as the output result, QNUM = 24. In other words, the QNUM given by the instruction determines the digit of the fixed-point decimal accumulator 211 at and above which the digits are treated as significant. In the case of FIG. 4, the multiplier 125 multiplies the random number 201 by a power of 2 obtained from QNUM, which takes a value between 0 and 24, and thereby aligns the leading position of the random number 201 with the digit immediately below the least significant of the significant digits in the fixed-point decimal accumulator 211.
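The mapping from QNUM to the multiplication factor in the FIG. 4 example can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the constant `FRACTION_BITS = 3` is an assumption inferred from the statement that the least significant digit of the random number reaches the least significant digit of the accumulator when QNUM = 12.

```python
from fractions import Fraction

RANDOM_BITS = 12    # n in the text
FRACTION_BITS = 3   # assumed: bits of the random number below its decimal point


def power_of_two(qnum: int) -> Fraction:
    """Factor 2^E with E = QNUM - 9; for QNUM = 0 the factor is set to 0."""
    if not 0 <= qnum <= 24:
        raise ValueError("QNUM must be between 0 and 24")
    if qnum == 0:
        return Fraction(0)
    return Fraction(2) ** (qnum - 9)


def shifted_random_span(qnum: int) -> tuple:
    """Weight exponents (leading bit, last bit) of the shifted random number.

    Exponent 0 is the least significant bit of the accumulator.
    """
    e = qnum - 9
    top = (RANDOM_BITS - FRACTION_BITS - 1) + e   # equals qnum - 1
    bottom = -FRACTION_BITS + e                   # equals qnum - 12
    return top, bottom
```

Under these assumptions the leading bit of the random number always lands one digit below bit QNUM of the accumulator, which is the lowest bit of the 16-bit range selected by QNUM.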

Returning to FIG. 2, the description continues. The multiplier 125 outputs to the adder 129, as the multiplication result, a random number whose leading digit is aligned with the digit immediately below the least significant of the significant digits. The multiplier 125 corresponds to an example of the “random number moving unit”.

The fixed-point register 130 is an accumulation register (accumulator) used in product-sum accumulation. The fixed-point register 130 stores the probabilistic rounding target number, which is the target of probabilistic rounding. The probabilistic rounding target number is a calculation result obtained in the actual calculation. Since the multiplication result is added to the value of the accumulation register in the accumulation operation, the fixed-point register 130 has a bit width sufficient to hold the value of the accumulation register. Then, in the case of probabilistic rounding processing, the fixed-point register 130 outputs the probabilistic rounding target number to the multiplexer 128.

In the case of probabilistic rounding processing, the multiplexer 128 selects the probabilistic rounding target number input from the fixed-point register 130. Then, the multiplexer 128 outputs the probabilistic rounding target number to the adder 129.

The adder 129 receives from the multiplier 125 an input of the random number whose leading digit is aligned with the digit immediately below the least significant of the significant digits. Further, the adder 129 receives the input of the probabilistic rounding target number from the multiplexer 128. Then, the adder 129 adds the random number to the probabilistic rounding target number. As a result, probabilistic rounding is performed according to the digits of the probabilistic rounding target number at and below the head of the random number. That is, the adder 129 performs probabilistic rounding by adding the random number to the probabilistic rounding target number. After that, the adder 129 outputs the addition result to the normalization shifter 134. The adder 129 corresponds to an example of the “adding unit”.

Here, the probabilistic rounding processing by the adder 129 will be further described with reference to FIG. 3. The probabilistic rounding target number 200 is provided from the fixed-point register 130. The range L represents the range of digits to be used as the result of the rounding process. The adder 129 adds the random number 201, whose head has been shifted to the rounding position P by the multiplier 125, to the probabilistic rounding target number 200. Here, when the least significant digit of the probabilistic rounding target number 200 is higher than the least significant digit of the random number 201, the adder 129 appends bits having a value of 0 to the probabilistic rounding target number 200 so that the least significant digits match, and then performs the addition. This addition yields the added value 202, in which the carry M1 is generated stochastically according to the value of the digits at and below the rounding position P.
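The alignment and addition described for FIG. 3 can be modeled with plain integers. The sketch below is an assumption-laden illustration rather than the adder 129 itself: the function names and the parameters `n` (random number width) and `m` (number of target bits at and below the rounding position P) are chosen here for clarity.

```python
def add_random_at_rounding_position(target: int, random_value: int, n: int, m: int):
    """Add an n-bit random number whose head sits at the rounding position P.

    The lowest m bits of `target` lie at and below P. Returns (total,
    low_bits): the sum in units of the lowest digit of either operand,
    and the number of bits of `total` that lie below the range L.
    """
    if not 0 <= random_value < (1 << n):
        raise ValueError("random_value must fit in n bits")
    if n > m:
        # The random number extends below the target's least significant
        # digit: append (n - m) zero bits to the target before adding.
        total = (target << (n - m)) + random_value
        low_bits = n
    else:
        # The random number ends at or above the target's least
        # significant digit: shift it up so that its head sits at P.
        total = target + (random_value << (m - n))
        low_bits = m
    return total, low_bits


def carry_m1(target: int, random_value: int, n: int, m: int) -> bool:
    """True when the addition carried into the digit just above P."""
    total, low_bits = add_random_at_rounding_position(target, random_value, n, m)
    return (total >> low_bits) != (target >> m)
```

For example, with the binary target 1011.10 (m = 2) and a 4-bit random number, the value 1000 produces a carry into the kept digits while 0111 does not, so the discarded half is rounded up for half of the possible random values.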

Here, the carry occurs stochastically according to the value of the lower digits. As described above, the expected value of the probabilistic rounding result may deviate from the value before rounding depending on how the bit number n of the random number is selected, and the amount of deviation differs depending on the value of n. In this embodiment, n = 12 is set based on the number of multiplication and addition bits that the existing single-precision floating-point product-sum calculator 112 can accommodate and on a practicality judgment from convergence simulations of deep learning. In this case, the maximum deviation of the expected value after rounding is about 0.00025. However, the number of bits of the random number is not limited to this, and it is preferable to select an appropriate number of bits of 1 or more according to the requirements of the operation and the balance between the existing circuit and the allowable amount of additional circuitry.

Returning to FIG. 2, the description continues. The shift amount calculation unit 132 receives input of decimal point position information from the arithmetic instruction control unit 101. Then, the shift amount calculation unit 132 calculates the shift amount according to the decimal point position information. Specifically, the shift amount calculation unit 132 calculates, from QNUM, the shift amount used to move the valid data from its position in the data output from the adder 129 to the valid data position in the output of the normalization shifter 134. After that, the shift amount calculation unit 132 outputs the calculated shift amount to the multiplexer 133.

In the case of probabilistic rounding processing, the multiplexer 133 selects the shift amount input from the shift amount calculation unit 132. Then, the multiplexer 133 outputs the shift amount to the normalization shifter 134.

The normalization shifter 134 receives from the adder 129 the input of the probabilistic rounding target number subjected to the probabilistic rounding process. The normalization shifter 134 also receives an input of the shift amount from the multiplexer 133. Then, the normalization shifter 134 shifts the probabilistic rounding target number according to the shift amount. Specifically, the normalization shifter 134 performs a left shift that moves the digit immediately above the rounding position of the input probabilistic rounding target number to the least significant digit of the valid digits output from the product-sum calculator 112. The normalization shifter 134 outputs the left-shifted probabilistic rounding target number to the rounding circuit 135. The normalization shifter 134 is an example of the “moving unit”.

For example, in FIG. 3, the normalization shifter 134 shifts the added value 202 so that the least significant digit of the range L used as the calculation result in the added value 202 matches the least significant digit of the output data, thereby obtaining the shift value 203.

Alignment of the added value by the normalization shifter 134 will be described with reference to FIG. 5. FIG. 5 is a diagram for explaining alignment of added values by the normalization shifter.

In FIG. 5, the following conditions are used as an example. The 40-bit added value 202 is output from the adder 129. The value output from the adder 129 occupies bits 16 to 55 of the intermediate result bus preceding the output of the normalization shifter 134. Of this value, 16 bits of data are output. On the operation result bus, to which the normalization shifter 134 outputs data, bits 48 to 63 are output as the data.

The normalization shifter 134 shifts the added value 202 to the left. As a result, the normalization shifter 134 moves the 16 bits to be used in the added value 202 to the output position 214, which corresponds to bits 48 to 63 of the operation result bus.

For example, in FIG. 5, let QNUM be the value indicating the decimal point position information, and assume that when QNUM = 0 the lowest 16-bit data 213 of the fixed-point decimal accumulator 211 is the output target. In this case, the normalization shifter 134 moves the data 213 from bits 16 to 31 of the intermediate result bus to the output position 214 at bits 48 to 63 of the operation result bus, and the left shift amount is 32. Further, assume that when QNUM = 24 the uppermost 16-bit data 212 of the fixed-point decimal accumulator 211 is the output target. In this case, the normalization shifter 134 moves the data 212 from bits 40 to 55 of the intermediate result bus to the output position 214 at bits 48 to 63 of the operation result bus, and the left shift amount is 8. That is, in the example of FIG. 5, the normalization shifter 134 performs the left shift by using the left shift amount obtained as “32 - QNUM” by the shift amount calculation unit 132.
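The shift amount of the FIG. 5 example can be checked with a short model of the 64-bit buses. This is a sketch under the bit positions stated in the text; the constant and function names are hypothetical.

```python
ACC_LOW_BIT = 16          # the adder output occupies bits 16..55 of the intermediate result bus
OUT_LOW_BIT = 48          # output position 214 is bits 48..63 of the operation result bus
OUT_WIDTH = 16
MASK64 = (1 << 64) - 1


def left_shift_amount(qnum: int) -> int:
    """Left shift amount "32 - QNUM" used in the FIG. 5 example."""
    return 32 - qnum


def source_range(qnum: int) -> tuple:
    """(high bit, low bit) of the 16 bits selected by QNUM before the shift."""
    low = ACC_LOW_BIT + qnum
    return low + OUT_WIDTH - 1, low


def normalize(bus_value: int, qnum: int) -> int:
    """Left-shift a 64-bit bus value; bits shifted past bit 63 are lost."""
    return (bus_value << left_shift_amount(qnum)) & MASK64
```

For every QNUM from 0 to 24 the low bit of the selected range plus the shift amount equals 48, so the selected 16 bits always land on bits 48 to 63.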

Returning to FIG. 2, the description continues. The rounding circuit 135 receives, from the normalization shifter 134, an input of the left-shifted probabilistic rounding target number. Then, in the case of the probabilistic rounding processing, the rounding circuit 135 truncates the digits of the input probabilistic rounding target number below a predetermined digit. Then, the rounding circuit 135 outputs the probabilistic rounding target number whose digits below the predetermined digit have been truncated. The output from the rounding circuit 135 is sent to the vector register 111 as output data. The rounding circuit 135 is an example of the “output unit”.

For example, in FIG. 3, the rounding circuit 135 performs the truncation M2 on the digits of the shift value 203 below the range L to be used. The range 204, which consists of the digits above the range L to be used, is not included in the output data 205 and is discarded. Then, the output data 205 is sent to the vector register 111.

Here, the rounding circuit 135 is an ordinary floating-point arithmetic circuit. In normal rounding of floating-point arithmetic, the rounding circuit 135 performs one of the following two processes according to the round bit and sticky bit obtained from the lower digits, the value of the least significant digit to be kept, the sign of the operation result, and the specified rounding mode. The first process is a process in which the rounding circuit 135 outputs, as it is, the value obtained by truncating the input value below the rounding position. The second process is a process in which the rounding circuit 135 adds 1 to the value obtained by truncating the input value below the rounding position and outputs the result. When the probabilistic rounding process is executed, the rounding circuit 135 is designated to always select the first process.
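The two processes of the rounding circuit can be illustrated with a minimal model. The sketch below is an assumption, not the circuit of the embodiment: the function names are hypothetical, and round-to-nearest-even for non-negative values is shown only as one familiar example of an ordinary rounding mode that chooses between the two processes.

```python
def rounding_circuit(value: int, discard_bits: int, increment: bool) -> int:
    """First process: truncate. Second process: truncate, then add 1."""
    truncated = value >> discard_bits
    return truncated + 1 if increment else truncated


def nearest_even_increment(value: int, discard_bits: int) -> bool:
    """Example decision for ordinary rounding of a non-negative value.

    Uses the round bit, the sticky bit, and the least significant kept
    digit; requires discard_bits >= 1.
    """
    round_bit = (value >> (discard_bits - 1)) & 1
    sticky = (value & ((1 << (discard_bits - 1)) - 1)) != 0
    lsb = (value >> discard_bits) & 1
    return bool(round_bit and (sticky or lsb))


def probabilistic_mode(value: int, discard_bits: int) -> int:
    """In probabilistic rounding the first process is always selected."""
    return rounding_circuit(value, discard_bits, increment=False)
```

Because the random number has already been added by the time the value reaches this stage, plain truncation is sufficient here; the probabilistic carry has already been decided by the addition.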

Next, the probabilistic rounding process will be described using specific data with reference to FIG. FIG. 6 is a diagram illustrating a specific example of the probabilistic rounding process. In FIG. 6, a case where processing is performed using a 64-bit wide bus will be described. Further, the position of each data will be described as [x: y], where x represents the most significant bit and y represents the least significant bit.

In this case, the arithmetic instruction control unit 101 outputs QNUM = 16 as the decimal point position information. The probabilistic rounding target number 300 output from the fixed-point register 130 is located at [55:16]. Here, bits 15 and below correspond to the fractional part. Bits 31 and below of the probabilistic rounding target number 300 are rounded, and [47:32] is the range to be used.

The random number 301 generated by the random number generation circuit 121 is logically arranged with its digits at bit 15 and below, in the place surrounded by the broken line. On the actual circuit, the random number 301 is initially arranged at the position for QNUM = 0. Specifically, the position for QNUM = 0 is the position where the leading bit of the random number 301 falls on the bit immediately below the least significant bit of the probabilistic rounding target number 300. Then, the power generation unit 122 obtains 2 ^ 7 as the power of 2 corresponding to QNUM = 16. The multiplier 125 multiplies the random number by 2 ^ 7 to shift the random number 301 to the position [31:20].

The adder 129 appends bits having a value of 0 below the random number 301 located at [31:20] so that its least significant bit coincides with that of the probabilistic rounding target number 300, adds it to the probabilistic rounding target number 300, and calculates the added value 302. The added value 302 is located at [55:16]. Here, the output range 303 is [63:48].

Therefore, the normalization shifter 134 shifts the probabilistic rounding target number 300 so that the data in the range [47:32] to be used in the probabilistic rounding target number 300 is located at [63:48].

After that, the rounding circuit 135 truncates and discards bits 47 and below of the shifted probabilistic rounding target number 300. Furthermore, bits 64 and above of the shifted probabilistic rounding target number 300 are not output and are discarded. As a result, the remaining 16-bit output data 304 is output as the calculation result.
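The FIG. 6 example can be traced on a 64-bit integer that stands in for the bus. This is a sketch restricted to QNUM = 16 with the bit positions given in the text; the function name `fig6_example` and the test values are illustrative assumptions.

```python
MASK64 = (1 << 64) - 1
QNUM = 16


def fig6_example(target_bus: int, random12: int) -> int:
    """Trace the FIG. 6 data path for QNUM = 16 on a 64-bit bus value.

    target_bus holds the rounding target at [55:16]; random12 is the
    12-bit random number. Returns the 16-bit output data.
    """
    if not 0 <= random12 < (1 << 12):
        raise ValueError("random12 must fit in 12 bits")
    random_bus = random12 << 20                   # random number placed at [31:20]
    added = target_bus + random_bus               # added value, bits [19:16] of the random side are 0
    shifted = (added << (32 - QNUM)) & MASK64     # [47:32] moves to [63:48]; bits past 63 are dropped
    return shifted >> 48                          # bits below the output range [63:48] are discarded
```

With 0x1234 in [47:32] and exactly one half in the rounded bits [31:16], a random number of 0x800 yields 0x1235 and a random number of 0x7FF yields 0x1234, so the result is rounded up for half of the 4096 possible random values.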

Next, with reference to FIG. 7, an overall flow of processing executed by the product-sum calculation unit 100 will be described. FIG. 7 is a flowchart of the entire processing executed by the product-sum calculation unit.

The product-sum calculation unit 100 executes the product-sum calculation of the actual calculation using the product-sum calculator 112 (step S1).

Then, the product-sum calculation unit 100 determines whether the product-sum accumulation calculation is completed (step S2). When the product-sum accumulation calculation is not completed (step S2: No), the product-sum calculation unit 100 returns to step S1 and repeats the product-sum calculation.

On the other hand, when the product-sum accumulation calculation is completed (step S2: Yes), the product-sum calculation unit 100 uses the product-sum calculator 112 to perform the probabilistic rounding process (step S3).

Thereafter, according to the instruction from the memory controller 12, the calculation result stored in the vector register 111 is output to the memory 13 via the chain of processing units and the memory controller 12 (step S4).

Next, with reference to FIG. 8, a flow of the probabilistic rounding process by the product-sum calculator according to the present embodiment will be described. FIG. 8 is a flowchart of the probabilistic rounding process by the product-sum calculator 112 according to the embodiment. The process shown in the flowchart of FIG. 8 is an example of the process performed in step S3 of FIG. 7.

The random number generation circuit 121 generates an n-bit uniform random number (step S101). Then, the random number generation circuit 121 outputs the generated random number to the multiplier 125 via the multiplexer 123.

Further, the power generation unit 122 generates a power of 2 according to QNUM (step S102). Then, the power generation unit 122 outputs the generated power of 2 to the multiplier 125 via the multiplexer 124.

The multiplier 125 receives, via the multiplexer 123, the random number generated by the random number generation circuit 121. The multiplier 125 also receives from the multiplexer 124 an input of a power of 2 according to QNUM. Then, the multiplier 125 multiplies the random number by the power of 2, and shifts the random number so that the head of the random number is located at the rounding position (step S103). After that, the multiplier 125 outputs the multiplication result to the adder 129.

The adder 129 acquires the probabilistic rounding target number stored in the fixed-point register 130 via the multiplexer 128 (step S104).

Further, the adder 129 receives the input of the multiplication result from the multiplier 125. Then, the adder 129 executes the probabilistic rounding by adding the multiplication result of the multiplier 125 to the probabilistic rounding target number (step S105). Then, the adder 129 outputs the addition result, which represents the probabilistic rounding target number subjected to the probabilistic rounding process, to the normalization shifter 134.

The shift amount calculation unit 132 calculates the shift amount according to the decimal point position information acquired from the arithmetic instruction control unit 101 (step S106). Then, the shift amount calculation unit 132 outputs the calculated shift amount to the normalization shifter 134 via the multiplexer 133.

The normalization shifter 134 receives, from the adder 129, an input of the addition result representing the probabilistic rounding target number subjected to the probabilistic rounding process. Further, the normalization shifter 134 receives the input of the shift amount from the shift amount calculation unit 132. Then, the normalization shifter 134 shifts the addition result to the left by the shift amount (step S107). Then, the normalization shifter 134 outputs the left-shifted value to the rounding circuit 135.

The rounding circuit 135 receives the input of the left-shifted value from the normalization shifter 134. Then, the rounding circuit 135 truncates the digits of the left-shifted value at and below a predetermined digit, discarding the bits below the output range (step S108).

After that, the rounding circuit 135 outputs a predetermined number of bits from the lower order as a result (step S109).
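Steps S101 to S109 can be put together in a compact software model and checked statistically. The following is an idealized sketch, not the hardware data path: the function name `stochastic_round`, the use of Python's `random.Random` as the random source, and the treatment of the accumulator as an integer whose least significant bit has weight 1 are all assumptions.

```python
import random


def stochastic_round(acc: int, qnum: int, rng: random.Random, n: int = 12) -> int:
    """Model of steps S101 to S109 for an integer accumulator value.

    Bits qnum..qnum+15 of `acc` form the output range; the bits below
    bit qnum are rounded probabilistically.
    """
    r = rng.getrandbits(n)                 # S101: n-bit uniform random number
    shift = qnum - n                       # S102: exponent placing the head just below bit qnum
    if shift >= 0:                         # S103: align the random number
        aligned, scale = r << shift, 0
    else:
        aligned, scale = r, -shift         # random number reaches below the accumulator LSB
    total = (acc << scale) + aligned       # S104, S105: fetch the target and add
    kept = total >> (qnum + scale)         # S106 to S108: shift and drop the lower bits
    return kept & 0xFFFF                   # S109: output the 16-bit result


if __name__ == "__main__":
    rng = random.Random(0)
    acc = (1000 << 16) | 0x4000            # 1000.25 when QNUM = 16
    samples = [stochastic_round(acc, 16, rng) for _ in range(20000)]
    print(sum(samples) / len(samples))     # close to 1000.25
```

In this model a value of 1000.25 is output as 1000 for three quarters of the random values and as 1001 for one quarter, so the average over many roundings approaches 1000.25. With QNUM = 0 the random number lies entirely below the accumulator and the low 16 bits are returned unchanged.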

The product-sum calculator 112 according to the present embodiment is obtained by adding the random number generation circuit 121, the power generation unit 122, the shift amount calculation unit 132, and the multiplexers 123, 124, and 133 to a circuit that performs floating-point calculation. It thereby executes the probabilistic rounding process using fixed point.

As described above, the product-sum calculator 112 according to the present embodiment can execute fixed-point probabilistic rounding by adding only a small amount of circuitry to the circuit used for floating-point product-sum calculation. Further, the product-sum calculator 112 according to the present embodiment executes probabilistic rounding on the cumulative calculation result of multiplication. Therefore, it is possible to execute appropriate probabilistic rounding with a simple configuration.

1 PCI card
2 Host computer
10 Processing unit
11 Overall command control unit
12 Memory controller
13 Memory
14 PCI control unit
50 Information processing device
100 Product-sum operation unit
101 Operation command control unit
102 Operation command buffer
103 Multiplexer
111 Vector register
112 Product-sum operation unit
121 Random number generation circuit
122 Power multiplier generation unit
125 Multiplier
126 Exponent code calculator
127 Digit shifter
129 Adder
130 Fixed-point register
131 Digit loss predictor
132 Shift amount calculator
134 Normalization shifter
135 Rounding circuit
123, 124, 128, 133 Multiplexer

Claims

1. An arithmetic processing device comprising:
a random number generating unit that generates a random number;
a random number moving unit that, based on a predetermined position where a rounding target number is arranged and decimal point position information of output data, moves the position of the random number so that the head of the random number matches the rounding position of the rounding target number;
an adding unit that adds the random number moved by the random number moving unit and the rounding target number arranged at the predetermined position; and
an output unit that outputs, as the output data, data in a predetermined range including a predetermined number of significant digits from the rounding position in the addition result of the adding unit.
2. The arithmetic processing device according to claim 1, further comprising a moving unit that moves the addition result of the adding unit so that the predetermined number of significant digits from the rounding position matches a predetermined output position,
wherein the output unit discards the values outside the output position in the addition result that has been moved so that the predetermined number of significant digits from the rounding position matches the output position, and outputs the remaining value.
3. The arithmetic processing device according to claim 1, further comprising a power multiplier generating unit that obtains, based on the position where the rounding target number is arranged and the decimal point position information of the output data, a power of 2 for matching the head of the random number with the rounding position of the rounding target number,
wherein the random number generating unit generates a random number represented by a binary number, and the random number moving unit moves the position of the random number by multiplying the random number by the power of 2 obtained by the power multiplier generating unit.
4. The arithmetic processing device according to claim 1, wherein the random number moving unit multiplies two pieces of input data to perform a floating-point operation, and
the adding unit performs a floating-point operation by adding input data to the multiplication result of the random number moving unit.
5. A method of controlling an arithmetic processing device, the method comprising:
generating a random number;
moving, based on a predetermined position where a rounding target number is arranged and decimal point position information of output data, the position of the random number so that the head of the random number matches the rounding position of the rounding target number;
adding the moved random number and the rounding target number arranged at the predetermined position; and
outputting, as the output data, data including a predetermined number of significant digits from the rounding position in the addition result.