WO2023121666A1

WO2023121666A1 - Iterative divide circuit

Info

Publication number: WO2023121666A1
Application number: PCT/US2021/064961
Authority: WO
Inventors: Michael Dibrino
Original assignee: Futurewei Technologies, Inc.
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2023-06-29

Abstract

A divide circuit, according to certain embodiments disclosed herein, includes a prescaler and an iterator. The prescaler is configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor. The iterator is configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate of the current iteration from a partial remainder of a prior iteration and generate a partial remainder of the current iteration from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.

Description

ITERATIVE DIVIDE CIRCUIT

FIELD

[0001] The following is related generally to the field of electronic circuits and, more specifically, to electronic circuits for performing arithmetic functions such as division.

BACKGROUND

[0002] Computer systems frequently include one or more circuits for performing arithmetic functions such as addition, subtraction, multiplication, and division. Division operations may take more time and resources than other operations such as addition, subtraction and multiplication. In general-purpose computer architectures, one or more circuits for performing arithmetic operations such as division may be integrated as execution units within a central processing unit (CPU). For example, computer systems frequently include combinatorial logic circuits (e.g., in a microprocessor) such as a Vector Unit, or VU, an arithmetic logic unit, or ALU, and/or a floating-point unit, or FPU, often referred to as a math coprocessor. One or more such units may include divide circuits (divide units) to perform division operations. In general-purpose computer architectures, one or more VUs, ALUs and/or FPUs may be integrated as execution units within the central processing unit. In some cases, division operations performed by one or more such units may take significant time and may affect other operations. For example, division operations that take significant time may cause other operations (e.g., other operations in a reorder buffer) to be delayed, which may impact performance. Circuits that perform divide operations in an efficient manner may mitigate such delays.

SUMMARY

[0003] According to one aspect of the present disclosure, a divide circuit includes a prescaler and an iterator. The prescaler is configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor. The iterator is configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.

[0004] Optionally, in the preceding aspect, the prescaler is further configured to obtain the prescaling factor from the divisor.

[0005] Optionally, in any of the preceding aspects, the prescaling factor is approximately the reciprocal of the divisor.

[0006] Optionally, in any of the preceding aspects, the prescaler is further configured to obtain the prescaling factor from one or more lookup tables that indicate prescaling factor values as a function of divisor values.

[0007] Optionally, in the preceding aspect, the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

[0008] Optionally, in any of the preceding aspects, the divide circuit further comprises a normalizer connected to the prescaler, the normalizer configured to normalize the dividend and the divisor for the prescaler.

[0009] Optionally, in any of the preceding aspects, the divide circuit further comprises an output processing circuit connected to the iterator, the output processing circuit configured to perform at least one of floating point rounding or 2’s complementation for signed integer division.

[0010] Optionally, in any of the preceding aspects, the divide circuit further comprises a bypass connection between the prescaler and the output processing circuit, bypassing the iterator, the bypass connection configured to provide the prescaled dividend as a low-accuracy quotient to the output processing circuit.

[0011] Optionally, in any of the preceding aspects, the divide circuit further comprises one or more Booth Recoders.

[0012] Optionally, in any of the preceding aspects, the divide circuit further comprises one or more Redundant Binary Signed Digit Full Adders.

[0013] According to an additional aspect of the present disclosure, there is provided a method of dividing a dividend by a divisor, comprising: prescaling a dividend by a prescaling factor to generate a prescaled dividend in a prescaler stage; prescaling a divisor by the prescaling factor to generate a prescaled divisor in the prescaler stage; setting the prescaled dividend as an initial partial remainder for an initial iteration; and in an iterator stage connected to the prescaler stage, for each iteration of one or more iterations: generating a partial quotient estimate and a partial remainder from a partial quotient estimate and a remainder of a prior iteration.

[0014] Optionally, the preceding aspect of the method further includes obtaining the prescaling factor from the divisor.

[0015] Optionally, in any of the preceding aspects of the method the prescaling factor is approximately the reciprocal of the divisor.

[0016] Optionally, in any of the preceding aspects, the method further comprises obtaining the prescaling factor from one or more lookup tables that include prescaling factor values as a function of divisor values.

[0017] Optionally, in any of the preceding aspects of the method the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

[0018] Optionally, in any of the preceding aspects, the method further comprises normalizing the dividend prior to prescaling the dividend; and normalizing the divisor prior to prescaling the divisor.

[0019] Optionally, in any of the preceding aspects, the method further comprises performing floating point rounding on at least one of a quotient and a remainder.

[0020] Optionally, in any of the preceding aspects, the method further comprises prescaling another dividend by another prescaling factor obtained from another divisor to generate another prescaled dividend; and providing the prescaled dividend as a low-accuracy quotient via a bypass connection that bypasses the iterator stage.

[0021] Optionally, in any of the preceding aspects, generating the partial remainder includes subtracting a product of the partial quotient estimate of the current iteration and the prescaled divisor from the partial remainder of the prior iteration.

[0022] Optionally, in any of the preceding aspects, the method further comprises generating the product of the partial quotient estimate of the current iteration and the prescaled divisor using a plurality of Redundant Binary Signed Digit Full Adders.

[0023] According to a further aspect, a divide circuit includes: a plurality of pipelined divide circuits, each pipelined divide circuit comprising: a prescaler configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor; and an iterator connected to the prescaler, the iterator configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of a current iteration and a partial remainder of a prior iteration.

[0024] Optionally, in the preceding aspect, the plurality of pipelined divide circuits include one or more first pipelined divide circuits and one or more second pipelined divide circuits, the first pipelined divide circuits being separately selectively powered from the second pipelined divide circuits.

[0025] Optionally, in any of the preceding aspects, each of the plurality of pipelined divide circuits further comprises a normalizer.

[0026] Optionally, in any of the preceding aspects, the prescaler is further configured to obtain the prescaling factor from the divisor.

[0027] Optionally, in any of the preceding aspects, the prescaling factor is approximately the reciprocal of the divisor.

[0028] Optionally, in any of the preceding aspects, the prescaler is further configured to obtain the prescaling factor from one or more lookup tables that indicate prescaling factor values as a function of divisor values.

[0029] Optionally, in any of the preceding aspects, the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

[0030] Optionally, in any of the preceding aspects, each of the pipelined divide circuits includes a bypass connection between the prescaler and an output processing circuit, bypassing the iterator, the bypass connection configured to provide the prescaled dividend as a low-accuracy quotient to the output processing circuit.

[0031] Optionally, in any of the preceding aspects, the prescaler further comprises one or more Booth Recoders.

[0032] Optionally, in any of the preceding aspects, the iterator further comprises one or more Redundant Binary Signed Digit Full Adders.

[0033] According to other aspects, a microprocessor includes one or more divide circuits, each divide circuit comprising: a plurality of pipelined components including at least a normalizer, a prescaler and an iterator, the normalizer configured to remove leading sign bits from integer values or leading zeros from floating-point values, the prescaler configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor, the iterator configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate of the current iteration from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.

[0034] Optionally, in the preceding aspect, the prescaler is further configured to obtain the prescaling factor from the divisor.

[0035] Optionally, in any of the preceding aspects, the prescaling factor is approximately the reciprocal of the divisor.

[0036] Optionally, in any of the preceding aspects, the prescaler is further configured to obtain the prescaling factor from one or more lookup tables that indicate prescaling factor values as a function of divisor values.

[0037] Optionally, in any of the preceding aspects, the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

[0038] According to another aspect, a divide circuit includes means for prescaling a dividend by a prescaling factor to generate a prescaled dividend and prescaling a divisor by the prescaling factor to generate a prescaled divisor; and means for iteratively calculating a quotient, the means for iteratively calculating the quotient configured to, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.

[0039] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.

BRIEF DESCRIPTION OF THE DRAWINGS

[0040] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying Figures (FIGs.), in which like references indicate like elements.

[0041] Figures 1A and 1B are respectively block diagrams of a computer system and a microprocessor that can be incorporated into such a computer system.

[0042] Figures 2A-C illustrate division operations including prescaling and iterating.

[0043] Figure 3 illustrates an example of a divide circuit with multiple pipelines.

[0044] Figure 4 illustrates an example of a divide circuit with multiple independently powered individual divide circuits.

[0045] Figures 5A-D illustrate circuits that may be used to implement a divide circuit.

[0046] Figures 6A-D are pipeline diagrams for pipelined divide operations.

[0047] Figure 7 illustrates an example of a method of dividing, according to an embodiment of the present technology.

[0048] Figure 8 illustrates a detailed example of a method of dividing, according to an embodiment of the present technology.

[0049] Figures 9A-H illustrate a detailed example of a division operation, according to an embodiment of the present technology.

[0050] Figure 10 illustrates an example of a computing system configured to implement embodiments of the present technology.

DETAILED DESCRIPTION

[0051] The following presents circuits and methods for performing divide operations efficiently (e.g., in a microprocessor). A divide circuit may include a prescaler and an iterator. The prescaler prescales a Dividend (x) and a Divisor (d) (e.g., where the division operation is to find a quotient q = x/d) by multiplication of each operand by a common factor (f). The value of f may be approximately the reciprocal (inverse) of d (e.g., f ≈ 1/d) so that the prescaled Divisor (d*f) is approximately 1 and the prescaled Dividend (x*f) is approximately the quotient. In some cases, the prescaled Dividend (x*f) is a sufficiently accurate quotient value and may be provided as an output without further operation (e.g., a rapidly-obtained low-accuracy quotient value, which may be adequate in some cases). A bypass connector may allow the prescaled Dividend (x*f) to bypass the iterator and pass directly to output processing circuits. When the prescaled Dividend (x*f) is not sufficiently accurate for a given calculation (e.g., as specified by a circuit or routine initiating the calculation), one or more iterations may be performed by the iterator, with each iteration providing additional accuracy. The prescaled Dividend (x*f) may provide an initial partial remainder and the partial quotient estimate for each iteration may be obtained from the partial remainder (e.g., first b bits of the partial remainder, where b is the number of bits per cycle). The partial remainder for each iteration may be obtained by subtracting the product of the partial quotient estimate of the current iteration and the prescaled divisor (d*f) from the partial remainder of the prior iteration. Input processing circuits may process (e.g., normalize) dividends and divisors prior to prescaling. Output processing circuits may process (e.g., perform floating-point rounding) quotient values. Multiple divide operations may be performed in parallel by prescalers and iterators that are pipelined so that divide calculations may be rapidly performed, which may provide various benefits (e.g., avoiding stalling a reorder buffer). Pipelines may be formed in different circuits with independent power control (e.g., pipelines can be powered on/off independently) so that unused circuits are powered off to reduce power consumption.

[0052] The present embodiments of the disclosure may be implemented in many different forms and the claim scope should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided to convey the inventive embodiment concepts to those skilled in the art. Indeed, the disclosure is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present embodiments of the disclosure, numerous specific details are set forth in order to provide a thorough understanding. However, it will be clear to those of ordinary skill in the art that the present embodiments of the disclosure may be practiced without such specific details.

[0053] Figures 1A and 1B are respectively block diagrams of a computer system and a microprocessor that can be incorporated into such a computer system that may be used to implement one or more embodiments of the disclosure. In the simplified representation of Figure 1A, the computer system 100 includes a computer 105, one or more input devices 101 and one or more output devices 103. Common examples of input devices 101 include a keyboard or mouse or even a voice recognition system to support voice commands. Common examples of output devices 103 include monitors, speakers or printers. The computer 105 includes a memory 107 and a microprocessor 120, where in this simplified representation the memory 107 is represented as a single block. The memory 107 can include ROM memory, RAM memory and/or non-volatile memory and, depending on the embodiment, can include separate memory for data and instructions.

[0054] Figure 1B illustrates one embodiment for the microprocessor 120 of Figure 1A and also includes the memory 107. In the representation of Figure 1B, the microprocessor 120 includes control logic 125, a processing block 140, an input interface 121, and an output interface 123. The dashed lines represent control signals exchanged between the control logic 125 and the other elements of the microprocessor 120 and the memory 107. The solid lines represent the flow of data and instructions within the microprocessor 120 and between the microprocessor 120 and the memory 107.

[0055] The processing block 140 includes combinatorial logic 143 that is configured to execute instructions and registers 141 in which the combinatorial logic stores instructions and data while executing these instructions. In the simplified representation of Figure 1B, certain elements or units, such as a vector unit (VU) 146, an arithmetic logic unit (ALU) 147, and a floating-point unit (FPU) 149, are shown, while other elements are not explicitly shown in the combinatorial logic 143. The combinatorial logic 143 is connected to the memory 107 to receive and execute instructions and supply back the results. The combinatorial logic 143 is also connected to the input interface 121 to receive input from input device(s) 101 and/or other sources. The combinatorial logic 143 is also connected to the output interface 123 to provide output to output device(s) 103 and/or other destinations.

[0056] A microprocessor, such as the microprocessor 120, may perform a range of different functions including arithmetic operations. For example, a microprocessor may perform addition, subtraction, multiplication and/or division. Performing division in an efficient manner is generally more challenging than performing other types of arithmetic operations (e.g., more challenging than addition, subtraction, and multiplication). Aspects of the present technology may enable efficient division operations to be performed by a microprocessor. More generally, aspects of the present technology can be applied to embodiments for one or more central processing units (CPUs), graphic processing units (GPUs), artificial intelligence (AI) accelerators, Tensor Processing Units (TPUs), and/or any other digital logic that performs division.

[0057] Division operations may be performed by one or more divide units in the microprocessor 120. For example, one or more of the VU 146, the ALU 147 and the FPU 149 may include one or more divide units. Other elements of combinatorial logic 143 may alternatively or additionally include one or more divide units. The present technology is not limited to divide units in any particular location (in a microprocessor or elsewhere) or used for any particular purpose and can be applied wherever efficient division operations are desired.

[0058] A number of different approaches may be used to perform division. Divides are typically high-latency arithmetic instructions (e.g., 15-60 cycles, or more in some cases) that are typically executed in an iterative multiplicative implementation (Goldschmidt or Newton-Raphson) or a purely iterative digit-recurrence implementation requiring a look-up table (SRT, named for Sweeney, Robertson, and Tocher). High-latency instructions can cause instructions to queue up in the reorder buffer (ROB) and delay retirement in an out-of-order machine and can halt execution in an in-order machine.

[0059] In general, it would be beneficial to reduce divide latency below that of traditional iterative digit-recurrence (SRT) or multiplicative/iterative (Goldschmidt/Newton-Raphson) methods using circuits that support all common 8/16/32/64 bit integer and floating-point formats, support both scalar and vector implementations, are cost (area) effective, are scalable to reasonably high clock speeds and allow at least partial pipelining. Aspects of the present technology provide such benefits as detailed in examples below.

[0060] Figure 2A shows an example of a block diagram of a divide circuit 200 (divide unit) that may be implemented in a larger circuit (e.g., in the combinatorial logic 143 of the microprocessor 120). The divide circuit 200 is shown receiving a dividend (x), which may also be referred to as a numerator, and a divisor (d), which may also be referred to as a denominator. The divide circuit 200 calculates a quotient (q) from the dividend (x) and the divisor (d) such that q=x/d. Aspects of the present technology allow such quotients to be rapidly calculated (low latency) in compact (area efficient) and power efficient circuits.

[0061] The divide circuit 200 includes an input processing circuit 202, which receives the dividend (x) and the divisor (d) and performs some processing (preprocessing) to convert these numbers into a suitable format. In some cases, the dividend (x) and the divisor (d) are received in a suitable format and no processing is required at this point (e.g., the input processing circuit 202 may pass the dividend (x) and the divisor (d) unchanged to the prescaler 204). In some circuits, no input processing circuit may be required (e.g., where the dividend (x) and the divisor (d) are always provided to a divide circuit in a suitable format). In one example, the input processing circuit 202 may perform normalization of received numbers (e.g., may remove leading zeros or otherwise normalize a received number). An example of a suitable format may be a floating-point format and examples below use the floating-point format. However, the present technology is not limited to floating-point operations.
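By way of illustration only, normalization by removal of leading zeros may be sketched in Python as follows. The function name normalize, the width parameter, and the returned shift count are hypothetical and are not part of the disclosure; the sketch assumes an unsigned, nonzero input of a known bit width.

```python
def normalize(value: int, width: int = 64) -> tuple[int, int]:
    """Shift a nonzero unsigned value left until bit (width - 1) is set.

    Returns (normalized_value, shift), where shift is the number of
    leading zeros that were removed.
    """
    if value <= 0 or value >= 1 << width:
        raise ValueError("value must be in the range [1, 2**width)")
    shift = width - value.bit_length()  # count of leading zeros
    return value << shift, shift


normalized, shift = normalize(0b0001_0110, width=8)
print(bin(normalized), shift)
```

The shift count may be retained so that a later stage can undo the scaling (for example, by adjusting an exponent or shifting the quotient).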

[0062] The output of the input processing circuit 202 (e.g., dividend (x) and divisor (d) as normalized or otherwise processed) is sent to the prescaler 204 (prescaling circuit). The prescaler 204 may prescale one or more numbers. For example, the prescaler 204 may prescale the dividend (x) and the divisor (d) received from the input processing circuit 202 by multiplying both numbers by a factor (f). The value of f may be selected, and selection of this value may be performed by the prescaler 204 in this example (in other examples, selection of f may be performed separately). Thus, the prescaler 204 may obtain f and then calculate values for a prescaled dividend (x*f) and a prescaled divisor (d*f).

[0063] The output of the prescaler 204 (e.g., prescaled dividend (x*f) and prescaled divisor (d*f)) is then sent to the iterator 206 (iterative circuit), which may perform one or more iterations to calculate the quotient q from the prescaled dividend (x*f) and the prescaled divisor (d*f). The output of the prescaler 204 may also be provided, via a bypass connection 210 (bypass) that bypasses the iterator 206, to the output processing circuit 208. The bypass connection 210 may allow a value for the quotient (q) to be obtained without using the iterator 206 in some cases, e.g., as described in examples below.

[0064] The output of the iterator 206 (e.g., a value for quotient (q)) is sent to the output processing circuit 208. The output processing circuit 208 may perform suitable processing to convert the value for the quotient (q) to a suitable format or otherwise process the value for the quotient (q) before outputting a value of quotient (q). For example, the output processing circuit 208 may perform floating point rounding on the value of quotient (q) or may perform 2’s complement operations for signed integer division.
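As a non-limiting sketch of one way 2's complement output processing for signed integer division might be arranged, the Python below divides operand magnitudes and then applies the sign and wraps the result to a fixed width. The sign-magnitude arrangement, the function names, and the use of Python's integer division as a stand-in for the divide datapath are all assumptions made for illustration.

```python
def to_twos_complement(value: int, width: int) -> int:
    """Encode a signed integer as a width-bit 2's complement bit pattern."""
    return value & ((1 << width) - 1)


def signed_quotient(x: int, d: int, width: int = 32) -> int:
    """Divide magnitudes, then negate the quotient if the operand signs differ."""
    if d == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    magnitude = abs(x) // abs(d)        # stand-in for the unsigned divide datapath
    negative = (x < 0) != (d < 0)       # quotient sign from the operand signs
    result = -magnitude if negative else magnitude
    return to_twos_complement(result, width)


print(hex(signed_quotient(-7, 2, width=8)))
```

In this sketch the quotient is truncated toward zero, so -7 divided by 2 yields -3 before encoding.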

[0065] Figure 2B illustrates an example of operations performed by the prescaler 204 to obtain scaled dividend (x*f or Sdvd) and scaled divisor (d*f or Sdiv) from dividend (x) and divisor (d) respectively. The operations include receiving dividend (x) and divisor (d) 212 and obtaining a value (f) of the prescaling factor such that f ≈ 1/d 214. For example, the value of f may be obtained from an approximate calculation, from one or more lookup tables, or from some combination of calculation and lookup table(s). The operations further include prescaling dividend (x) to obtain scaled dividend (x*f) 216, prescaling divisor (d) to obtain scaled divisor (d*f) 218 and outputting scaled dividend (x*f) and scaled divisor (d*f) (e.g., to iterator 206). By using a value of f that is approximately equal to 1/d, the scaled divisor (d*f) is approximately d*(1/d), or approximately 1 (d*f ≈ d*(1/d) = 1). This means that the quotient (the answer to the division equation to be solved) may be approximated by the scaled dividend (x*f):

q = x/d = (x*f)/(d*f)

If f ≈ 1/d: q = (x*f)/(d*f) ≈ (x*f)/(d*(1/d)) = (x*f)/1 = x*f

Thus, the scaled dividend is an approximate value for the quotient (q) in this case. The accuracy of this approximation may depend on how close the value of f is to 1/d (the accuracy of the approximation f ≈ 1/d). In some cases, the scaled dividend (x*f) may be a sufficiently accurate value and may be passed directly to the output processing circuit 208 (e.g., via the bypass connection 210) and after output processing may be provided by the divide circuit 200 as the calculated quotient value (a low-accuracy quotient). In other cases, the scaled dividend (x*f) may be passed to the iterator 206 where one or more iterations may be applied to obtain a value for the quotient (q). The steps of Figure 2B may be performed by the prescaler 204, which may be considered an example of a prescaler configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor. Other prescalers may also perform such steps, and the present technology is not limited to any particular prescaler design.
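The prescaling operations of Figure 2B may be modeled, purely as a sketch, in Python using exact rational arithmetic. The table size TABLE_BITS, the interval-midpoint construction of each table entry, and the function names are assumptions made for illustration; the disclosure does not specify how lookup table entries are derived. The divisor is assumed to be normalized to the range [1, 2).

```python
from fractions import Fraction

TABLE_BITS = 6  # hypothetical: number of divisor fraction bits used as the table index


def prescale_factor(d: Fraction) -> Fraction:
    """Return f ≈ 1/d for a normalized divisor d in [1, 2)."""
    if not 1 <= d < 2:
        raise ValueError("d must be normalized to [1, 2)")
    index = int((d - 1) * 2**TABLE_BITS)  # leading fraction bits of d
    # Midpoint of the divisor interval covered by this table index.
    midpoint = 1 + Fraction(2 * index + 1, 2 ** (TABLE_BITS + 1))
    # Entry rounded to TABLE_BITS + 2 fraction bits, as a stored value might be.
    entry_scale = 2 ** (TABLE_BITS + 2)
    return Fraction(round(entry_scale / midpoint), entry_scale)


def prescale(x: Fraction, d: Fraction) -> tuple[Fraction, Fraction]:
    """Return (x*f, d*f): the scaled dividend and the scaled divisor."""
    f = prescale_factor(d)
    return x * f, d * f


s_dvd, s_div = prescale(Fraction(3, 2), Fraction(5, 4))
print(float(s_dvd), float(s_div))  # s_div is close to 1; s_dvd is close to x/d = 1.2
```

With these assumed table parameters the scaled divisor differs from 1 by less than 2^-6 for any normalized divisor, so the scaled dividend alone approximates the quotient to roughly that relative accuracy.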

[0066] Figure 2C illustrates an example of a method performed by the iterator 206 to generate a value for the quotient (q), which can also be referred to as a quotient value. The method includes receiving the scaled divisor (d*f) and the scaled dividend (x*f) at step 230 (e.g., from the prescaler 204) and setting an initial partial remainder value equal to the scaled dividend, pr_0 = pr_0[n-1, ..., n-k, n-k-1, ..., 0] = x*f, where k ≥ b and b is the number of bits per cycle, at step 232. The notation pr_0[n-1, ..., n-k, n-k-1, ..., 0] indicates an n-bit number that includes first k bits (n-1 to n-k), where k is greater than or equal to b. For example, b may be 8, 16 or 32 bits. Each iteration may generate a b-bit partial quotient estimate and k may be equal to b (e.g., if no sign bit is used) or may be greater than b (e.g., if sign bit(s) or other bits are included in a value). The method further includes initializing a counter to zero, i=0, at step 234 for an initial iteration, or iteration zero.

[0067] A partial quotient estimate value q_(i+1) for iteration i+1 is then generated from the first k bits of the partial remainder pr_i of the prior iteration, iteration i, q_(i+1) = pr_i[n-1, ..., n-k], at step 236. For the initial iteration, where i=0, q_1 is the first k bits of pr_0 (pr_0[n-1, ..., n-k]), that is, the first k bits of the scaled dividend x*f. For each iteration of one or more iterations, a partial quotient estimate may be generated from a partial remainder of a prior iteration in this way. The partial quotient estimate is then used to obtain a partial remainder. The partial remainder for iteration i+1 (pr_(i+1)) is obtained by subtracting the partial quotient estimate q_(i+1) of the current iteration times the scaled divisor Sdiv (q_(i+1)*Sdiv or q_(i+1)*d*f) from the partial remainder of the prior iteration, pr_i (e.g., pr_i - q_(i+1)*Sdiv), and multiplying by 2^b (shifting the result b bits to the left) to give pr_(i+1) = (pr_i - q_(i+1)*Sdiv)*2^b at step 238. For each iteration of one or more iterations, a partial remainder may be generated from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration using this equation. The method includes incrementing the i value: i = i+1 at step 240 (e.g., from zero to one) and then obtaining q_(i+1) for the next value of i (e.g., obtaining q_2 from q_2 = pr_1[n-1, ..., n-k], where pr_1 was generated in step 238). Steps 236, 238 and 240 may be considered an iteration 242 and may be repeated any suitable number of times (e.g., the value of i may be incremented repeatedly before termination). For example, where b=16 and d*f is a 64 bit number, four iterations may be performed. Where b=16 and d*f is a 512 bit number, 32 iterations may be performed. In some cases, the number of iterations may be determined by the accuracy required (e.g., how accurately the value of q should be calculated) and iterating may terminate before reaching the last bits of d*f. Partial quotient estimates from one or more iterations may be combined (e.g., in a buffer) to obtain a quotient value q that is sent to the output processing circuit 208, q = [q_1, q_2, q_3, ...] at step 244 (e.g., the first b bits of q from q_1, the next b bits from q_2, the next b bits from q_3 and so on).
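The iteration of Figure 2C may be sketched in Python with exact rational arithmetic as shown below. This is an illustrative model under stated assumptions rather than the disclosed circuit: the function name iterate, the choice b = 4, and the example prescaling factor f = 51/64 are hypothetical. The sketch also combines the partial quotient estimates by weighted addition rather than by simple concatenation, because in this model an estimate may overlap the next digit position; how a particular circuit combines estimates is not specified here. The sketch further assumes that the scaled divisor is within 2^-b of 1, so that the partial remainder stays bounded.

```python
from fractions import Fraction
from math import floor


def iterate(s_dvd: Fraction, s_div: Fraction, b: int, iterations: int) -> Fraction:
    """Refine a quotient from a scaled dividend (x*f) and scaled divisor (d*f)."""
    scale = 2**b
    pr = s_dvd                 # pr_0 = x*f (step 232)
    quotient = Fraction(0)
    weight = Fraction(1)       # positional weight of the current estimate
    for _ in range(iterations):
        # Step 236: partial quotient estimate from the leading bits of pr_i,
        # modeled here as pr_i truncated to b fraction bits.
        q_est = Fraction(floor(pr * scale), scale)
        quotient += q_est * weight
        # Step 238: pr_(i+1) = (pr_i - q_(i+1) * Sdiv) * 2^b
        pr = (pr - q_est * s_div) * scale
        weight /= scale        # the next estimate is b bits less significant
    return quotient


x, d = Fraction(3, 2), Fraction(5, 4)
f = Fraction(51, 64)           # hypothetical prescaling factor, close to 1/d = 0.8
q = iterate(x * f, d * f, b=4, iterations=8)
print(float(q))                # approaches x/d = 1.2 as iterations increase
```

Each pass through the loop contributes roughly b further bits of quotient accuracy, which is consistent with the iteration counts given above (for example, four iterations for 64 bits when b=16).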

[0068] The steps illustrated in Figure 2C may be performed by the iterator 206, which may be considered an example of an iterator configured to receive a prescaled dividend and a prescaled divisor from a prescaler (e.g., the prescaler 204) and, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate and a partial remainder of a prior iteration. For example, partial quotient estimate q_(i+1) is generated from q_(i+1) = pr_i[n-1, ..., n-k] after i is incremented so that pr_i[n-1, ..., n-k] comes from step 238 of a prior iteration, and in step 238 the partial remainder pr_(i+1) is generated from a partial quotient estimate and a partial remainder of a prior iteration as (pr_i - q_(i+1)*Sdiv) * 2^b. It will be understood that the order of these steps does not have to follow the order shown and that the present technology may be implemented in various ways.

[0069] Aspects of the present technology allow at least certain portions of a divide unit to be pipelined. For example, normalizing, 1/x estimating, prescaling, and iterating may be implemented in a pipelined manner so that multiple divide operations may be performed in parallel.

[0070] Figure 3 illustrates an example of a divide unit 300 that implements pipelining. The divide unit 300 includes an input processing circuit 302, a prescaler stage 304, an iterator stage 306, and an output processing circuit 308. The input processing circuit 302 may receive multiple dividends and divisors in parallel. The input processing circuit 302 includes n normalizers (normalizer 0, normalizer 1, ... normalizer n-1). Normalizers 0 to n-1 may operate in parallel so that up to n values may be normalized in parallel. For example, normalizers 0 to n-1 may normalize floating-point denormal or subnormal values by removing leading zeros. Integers (signed and unsigned) may also be normalized.

[0071] The prescaler stage 304 includes n 1/x estimator/prescalers (1/x estimator/prescaler 0, 1/x estimator/prescaler 1, ... 1/x estimator/prescaler n-1). These 1/x estimator/prescalers may operate in parallel so that up to n values may have appropriate 1/x values estimated and may be prescaled accordingly in parallel. For example, each 1/x estimator/prescaler may implement 1/x estimating and prescaling as discussed with respect to the prescaler 204 and shown in Figure 2B.

[0072] While shown as a combined circuit stage, in some implementations, a 1/x estimator may be formed as a separate circuit stage from a prescaler. For example, step 214 of Figure 2B may be performed by a 1/x estimator and subsequent prescaling (e.g., steps 216, 218) may be performed by a separate prescaler stage using a 1/x value obtained from a corresponding 1/x estimator stage.

[0073] The iterator stage 306 includes n iterators (iterator 0, iterator 1, ... iterator n-1). Iterators 0 to n-1 may operate in parallel so that iterations may be performed on up to n values in parallel. For example, each iterator 0 to n-1 may implement an iterative process as discussed with respect to iterator 206 and as shown in Figure 2C.

[0074] Normalizers, 1/x estimator/prescalers and iterators may be operated in a pipelined arrangement so that the output of one pipelined component in a pipelined divide circuit is passed to the next pipelined component. Figure 3 shows n pipelined divide circuits or pipelines, including a first pipeline 310, a second pipeline 312, ... and an nth pipeline 314. The first pipeline 310 is formed by pipelined components including normalizer 0, 1/x estimator/prescaler 0 and iterator 0 so that given values (e.g., dividend and divisor) may be normalized by normalizer 0 and the normalized values passed to 1/x estimator/prescaler 0, which prescales the values according to the 1/x value estimated and passes the prescaled values to iterator 0. Similarly, the second pipeline 312 is formed by normalizer 1, 1/x estimator/prescaler 1 and iterator 1 so that given values may be normalized by normalizer 1 and the normalized values passed to 1/x estimator/prescaler 1, which prescales the values according to the 1/x value estimated and passes the prescaled values to iterator 1. Each of the n pipelines may operate independently in parallel to calculate multiple quotients from multiple values in parallel (e.g., as described with respect to Figures 2A-C).

[0075] Quotients calculated by each of iterators 0 to n-1 are sent to the output processing circuit 308, which may perform post-processing to convert quotient values to a suitable format (e.g., performing floating-point rounding) similarly to the output processing circuit 208. While shown as a single unit in this example, in some cases, multiple output processing circuits (e.g., n output processing circuits) may be provided, one for each pipeline, so that pipelining extends to output processing.

[0076] The divide circuit 300 may be formed as a single circuit with all components managed together (e.g., all powered on/off together) or may include components that are separately managed (e.g., separately powered on/off). Components of a given stage (e.g., normalizers of the input processing circuit 302, estimator/prescalers of the prescaler stage 304, iterators of the iterator stage 306) may be identical or may be different. For example, components of different pipelines may be configured for different operations (e.g., to process different numbers of bits) so that different pipelines may have different capacities.

[0077] Figure 4 shows another example of a divide circuit 400 that has a pipelined arrangement that includes multiple individual divide circuits 440, 460, 480 that may be separately operated (e.g., separately selectively powered on/off). This arrangement may provide flexibility and power efficiency by allowing one or more pipelines to operate while circuits for one or more other pipelines are powered off to save power (e.g., divide circuit 440 may be powered on while divide circuits 460 and 480 are powered off).

[0078] The divide circuit 440 includes a normalizer stage 442 that includes a single normalizer (normalizer 0) and an estimator/prescaler stage 444 that includes a single 1/x estimator/prescaler (1/x estimator/prescaler 0). Two iterators (iterator 0 and iterator 1) are provided in an iterator stage 446 with outputs of the iterators provided to an output processing circuit 448. As an example, each of the iterators of the iterator stage 446 may process 16 bits per cycle (e.g., b = 16) so that up to 32 bits may be processed by the divide circuit 440 in parallel.

[0079] The divide circuit 460 includes two normalizers (normalizer 0 and normalizer 1) in a normalizer stage 462, two estimator/prescalers (1/x estimator/prescaler 0, 1/x estimator/prescaler 1) in an estimator/prescaler stage 464, and two iterators (iterator 0, iterator 1) in an iterator stage 466. These components form two pipelines with components configured to process 8 bits per cycle (e.g., a first pipeline formed by normalizer 0, 1/x estimator/prescaler 0, and iterator 0, each configured to process 8 bits per cycle, and a second pipeline formed by normalizer 1, 1/x estimator/prescaler 1, and iterator 1, each configured to process 8 bits per cycle). Thus, the divide circuit 460 may be used to process 16 bits at a time or may be used to process 8 bits at a time (using just one pipeline).

[0080] The divide circuit 480 is similar to the divide circuit 460 and includes two normalizers (normalizer 0 and normalizer 1) in a normalizer stage 482, two estimator/prescalers (1/x estimator/prescaler 0, 1/x estimator/prescaler 1) in an estimator/prescaler stage 484, and two iterators (iterator 0, iterator 1) in an iterator stage 486. These components form two pipelines with components configured to process 8 bits per cycle (e.g., a first pipeline formed by normalizer 0, 1/x estimator/prescaler 0, and iterator 0, each configured to process 8 bits per cycle, and a second pipeline formed by normalizer 1, 1/x estimator/prescaler 1, and iterator 1, each configured to process 8 bits per cycle). Thus, the divide circuit 480 may be used to process 16 bits at a time or may be used to process 8 bits at a time (using just one pipeline).

[0081] Individual divide circuits 440, 460, 480 of the divide circuit 400 may be independently managed. For example, power may be individually controlled for divide circuits 440, 460, 480 so that any one or more of divide circuits 440, 460, 480 may be powered on while any one or more other ones of divide circuits 440, 460, 480 may be powered off. In this way, divide circuit 400 is adaptable to accommodate different numbers of bits per cycle up to 64 bits (e.g., 8, 16, 32, or 64 bits). When fewer than 64 bits are processed, one or more of divide circuits 440, 460, 480 may be powered off to save power. Each divide circuit 440, 460, 480 may comprise one or more pipelined divide circuits (e.g., a pipelined arrangement of at least a 1/x estimator/prescaler and an iterator) so that, for example, first pipelined divide circuits of divide circuit 460 may be separately selectively powered from second pipelined divide circuits of divide circuit 480. This flexible arrangement may support a variety of scalar and vector processing.
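As an illustration of this adaptability, the following sketch picks which of the divide circuits 440, 460, 480 to power on for a requested number of bits per cycle. The capacities follow the description of Figure 4 (32, 16 and 16 bits per cycle). The selection policy (least total capacity, then fewest circuits) and the name circuits_to_power are assumptions made for this example only; the text does not specify a policy.

```python
from itertools import combinations

# Bits per cycle each separately powered divide circuit can process, following
# the description of Figure 4 (440: two 16-bit iterators; 460 and 480: two
# 8-bit pipelines each).
CAPACITY_BITS = {440: 32, 460: 16, 480: 16}


def circuits_to_power(bits_per_cycle):
    """Return the divide circuits to power on (hypothetical policy)."""
    if not 0 < bits_per_cycle <= sum(CAPACITY_BITS.values()):
        raise ValueError("unsupported number of bits per cycle")
    best = None
    for count in range(1, len(CAPACITY_BITS) + 1):
        for subset in combinations(sorted(CAPACITY_BITS), count):
            capacity = sum(CAPACITY_BITS[c] for c in subset)
            if capacity >= bits_per_cycle:
                # Prefer the least capacity, then the fewest circuits.
                key = (capacity, count)
                if best is None or key < best[0]:
                    best = (key, subset)
    return best[1]
```

Under this policy a 32-bit request powers only divide circuit 440, while a 64-bit request powers all three circuits.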

[0082] Any of the divide circuits described above may be implemented in any suitable manner using appropriate components. An example implementation of portions of a divide circuit is shown in Figures 5A-D.

[0083] Figure 5A shows an example implementation of a prescale stage 510 that may be used in any of the divide circuits described (e.g., as prescaler 204, or any 1/x estimator/prescaler above). The prescale stage 510 receives a normalized divisor (x) from the register 516 and receives a normalized dividend (d) from the buffer 518. (Input processing circuits that may perform normalization are not shown in this Figure.) An estimate for the reciprocal of the divisor (1/x) is obtained from the 1/x Estimate table(s) 520. The 1/x Estimate table(s) 520 may include a single table or multiple tables to obtain an approximate value of 1/x from the normalized value of x from register 516. For example, different tables may be provided for different ranges of x (e.g., each of a plurality of lookup tables may characterize a different range of divisor values and/or may provide a different level of accuracy). Because the relationship between x and 1/x is not linear, using different tables for different ranges may provide higher accuracy. The estimate of 1/x (f) from the 1/x Estimate table(s) 520 is provided to a multiplier circuit 522 (implemented by a Prescale (multiply) Compression Tree in combination with two Carry Propagate Adders (CPAs) in this example), which also receives the normalized divisor (x) from the register 516 and multiplies these values to obtain a value x*f that is approximately equal to one (e.g., between 1.000 and 1.001). The x*f value is sent to the buffer 523 for an iterator stage (not shown). The 1/x estimate (f) is also sent to the multiplier 524, which also receives the normalized dividend (d) from the buffer 518 and multiplies these values to obtain a value for d*f. This value (which may be considered an approximate value for the quotient) is sent to the multiplexer 526, which also receives the input 529 (a partial remainder for the next iteration) from an iterator stage. The multiplexer 526 selects either the d*f value (for the initial iteration) or the partial remainder to send to the buffer 528 for the iterator stage.

[0084] Figure 5B illustrates an example implementation of an iterator stage 530 that may be used in any of the divide circuits described (e.g., as iterator 206, or any iterators above). The iterator stage 530 may be connected in line (pipelined) with the prescale stage 510 of Figure 5A and shares the buffer 523 (for the scaled divisor) and the buffer 528 (for the next partial remainder). The partial remainder from the buffer 528 may be sent as two vectors, a + vector and a - vector, which may be used by the quotient multiple select recorder 532 to determine a partial quotient estimate value (e.g., q_(i+1)) to provide to the multiplier 534 (in a first iteration, the scaled dividend is selected by the multiplexer 526 for use as the initial partial quotient estimate). The quotient multiple select recorder 532 also provides partial quotient estimate values on the output line 533 (e.g., to a quotient buffer). The multiplier 534 multiplies the partial quotient estimate by the scaled divisor from the buffer 523 to obtain a result (e.g., q_(i+1)*Sdiv, or q_(i+1)*x*f) that is passed to a Redundant Binary Signed Digit Full Adder (RBSD FA) 536. The RBSD FA 536 also receives a partial remainder (e.g., pr_i) from the buffer 528 through multiplexer 538. Multiplexer 538 may conditionally swap the + vector and the - vector (PR_Plus and PR_Minus) to invert a partial remainder (e.g., pr_i) if, and only if, Sign(Partial_Remainder) is positive, as indicated by an input, the partial remainder sign 540. The RBSD FA 536 then combines the value from the multiplier 534 (e.g., q_(i+1)*Sdiv) and the value from the multiplexer 538 (e.g., pr_i) to generate a value that is a multiple of the scaled divisor (Sdiv, x*f) +/- the partial remainder (e.g., q_(i+1)*Sdiv +/- pr_i), which is sent to multiplexer 542. The multiplexer 542 may conditionally swap plus and minus to invert the combination of the multiple of the scaled divisor and the partial remainder if, and only if, Sign(Partial_Remainder) is positive (partial remainder positive). The result from the multiplexer 542 is a partial remainder +/- the multiple of the scaled divisor (e.g., pr_i - q_(i+1)*Sdiv). This result is passed to the left shift register 544, which shifts all bits by b bits to the left (b=16 in this example), e.g., by multiplying by 2^b. This results in a value (e.g., (pr_i - q_(i+1)*Sdiv)*2^b), which may be used as a partial remainder for a subsequent iteration, e.g., pr_(i+1) = (pr_i - q_(i+1)*Sdiv)*2^b as previously illustrated in step 238 of Figure 2C. This result is sent back to the multiplexer 526 for a subsequent iteration.

[0085] Figure 5C shows an example implementation of the multiplier 534. The multiplier 534 is shown as formed by four multiplexers, Mux 550, 551, 552, 553, connected to an RBSD FA 556 and an RBSD FA 558, which are in turn connected to a third RBSD FA 559. The multiplier 534 also receives the input 535 from the quotient multiple select recorder 532. The input 535 may be provided to one or more of the multiplexers 550-553. Other implementations are also possible.

[0086] Figure 5D illustrates an implementation of a 1/x estimate circuit (e.g., 1/x estimate table(s) 520) that includes a lookup circuit 589 and a 1/x estimate multiplier circuit 563. The lookup circuit 589 includes three lookup tables (LUTs), the LUT 560, the LUT 561 and the LUT 562, that relate values of x to corresponding constant values. Each LUT may cover a different range of x to achieve higher precision with a lower total number of bits. A normalized divisor (x) from the register 516 (buffer) may be compared with values in one or more of the LUTs 560-562 and a constant value (c) selected by multiplexer 564 and passed to a Booth Recoder 566. The Booth-recoded value is then sent to the buffer 568 and on to the 1/x estimate multiplier circuit 563. In an example implementation, the LUT 560 (m=7) includes 48 entries of 15 bits each, the LUT 561 (m=8) includes 152 entries of 16 bits each and the LUT 562 (m=9) includes 16 entries of 18 bits each.

[0087] In the 1/x estimate multiplier circuit 563, the Booth-recoded value of constant c is passed to the Booth Multiplexers 570, 571, which also receive a divisor value from register 516, via buffer 569. Outputs of Booth Multiplexers 570, 571 are passed through a series of Carry-Save Adders (CSAs) 574-57 and on to the Redundant Booth Recoder 580. The output of the Redundant Booth Recoder 580 (a prescaling factor value that is an estimate of 1/x) is sent to buffer 582 for use (e.g., by a multiplier such as the multiplier circuit 522).

[0088] Figures 6A-D are pipeline diagrams for divide operations using a divide circuit according to the present technology (e.g., the divide circuits 200, 300, 400). Each pipeline diagram extends over twelve cycles (cycles 1-12 from left to right). The left column indicates the steps being performed including normalization (“norm”), prescaling (“prescale”), iteration (“Iter 0” and “Iter 1” corresponding to two iterators), and output processing or quotient processing (“Quotient”) in corresponding pipelined components. Numbered entries correspond to a given divide operation (e.g., a dividend and a divisor) as the operation progresses through a divide circuit. Each pipeline diagram corresponds to a different level of accuracy.

[0089] Figure 6A shows an example of multiple divide operations with 64-bit accuracy for a scalar integer (“INT64”) in a 16 bit/cycle divide circuit. Normalization for operation 0 occurs in cycle 1, followed by prescaling in cycle 2, iteration (in Iter 0) in cycles 3-6 (four cycles of 16 bits to achieve 64-bit accuracy) and output processing in cycle 7. Because Iter 0 is busy until cycle 7, a new operation for Iter 0 (operation 2) is not started until cycle 5. Similarly, operation 1 occupies Iter 1 for cycles 4-7 so that Iter 1 is unavailable until cycle 8 and a new operation for Iter 1 (operation 3) is not started until cycle 6. It can be seen that the latency in this example is seven cycles (e.g., operation 0 extends from cycle 1 to cycle 7).

[0090] Figure 6B shows an example of multiple divide operations with 32-bit accuracy for a scalar integer (“INT32”) in a 16 bit/cycle divide circuit. Normalization for operation 0 occurs in cycle 1, followed by prescaling in cycle 2, iteration (in Iter 0) in cycles 3-4 (two cycles of 16 bits to achieve 32-bit accuracy) and output processing in cycle 5. Because Iter 0 is available in cycle 5, a new operation for Iter 0 (operation 2) is started in cycle 3 (immediately after normalization of operation 1). Operation 1 occupies Iter 1 only for cycles 4-5 so that Iter 1 is available at cycle 6 and a new operation for Iter 1 (operation 3) is started at cycle 4. It can be seen that the latency in this example is five cycles (e.g., operation 0 extends from cycle 1 to cycle 5).
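The start cycles described for Figures 6A and 6B may be reproduced with a small scheduling model. The sketch below rests on assumptions that are not stated in the text: at most one operation is normalized per cycle, operations are assigned to the iterators in round-robin order, and an operation is started as early as possible so that its iterator is free when prescaling completes. The name issue_cycles is hypothetical.

```python
def issue_cycles(num_ops, iterations, num_iterators=2):
    """Cycle in which each operation is normalized (hypothetical model).

    An operation normalized in cycle c is prescaled in cycle c+1 and occupies
    its iterator for cycles c+2 .. c+1+iterations.
    """
    free_at = [1] * num_iterators   # first cycle in which each iterator is free
    next_issue = 1                  # earliest cycle for the next normalization
    cycles = []
    for op in range(num_ops):
        it = op % num_iterators     # assumption: round-robin iterator assignment
        # Iteration starts two cycles after normalization, so the operation
        # may start two cycles before its iterator becomes free.
        start = max(next_issue, free_at[it] - 2)
        cycles.append(start)
        free_at[it] = start + 2 + iterations
        next_issue = start + 1
    return cycles
```

With four iterations per operation (the 64-bit case) the model starts operations 0 to 3 in cycles 1, 2, 5 and 6, and with two iterations (the 32-bit case) in cycles 1, 2, 3 and 4, matching the start cycles given above.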

[0091] Figure 6C shows an example of multiple divide operations with 16-bit accuracy for a scalar integer (“INT16”) in a 16 bit/cycle divide circuit. Normalization for operation 0 occurs in cycle 1, followed by prescaling in cycle 2. A dividend that is scaled by a factor of about 1/x (where x is the divisor) is approximately equal to the quotient (a low-accuracy quotient). For 16-bit accuracy, the prescaled dividend may be sufficiently accurate and may be sent directly to the output processing stage (e.g., via a bypass connection that bypasses the iterator(s)) so that output processing of operation 0 occurs in cycle 3, without performing any iterations in Iter 0 or Iter 1. It can be seen that the latency in this example is three cycles (e.g., operation 0 extends from cycle 1 to cycle 3), which indicates the low latency achievable by using a bypass connection where low-accuracy quotient values are adequate.

[0092] Figure 6D shows an example of multiple divide operations with 8-bit accuracy for a scalar integer (“INT8”) in a 16 bit/cycle divide circuit. Normalization for operation 0 occurs in cycle 1, followed by prescaling in cycle 2. A dividend that is scaled by a factor of about 1/x (where x is the divisor) is approximately equal to the quotient. For 8-bit accuracy (like the 16-bit accuracy of Figure 6C), the prescaled dividend may be sufficiently accurate and may be sent directly to the output processing stage (e.g., via a bypass connection that bypasses the iterator(s)) so that output processing of operation 0 occurs in cycle 3. It can be seen that the latency in this example is three cycles (e.g., operation 0 extends from cycle 1 to cycle 3).
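The latencies described for Figures 6A-D follow a simple pattern: one cycle each for normalization, prescaling and output processing, plus one iterator cycle per 16 bits of accuracy unless the bypass connection is used. The sketch below expresses this pattern; the function name divide_latency and the rule that the bypass is taken whenever the required accuracy does not exceed the bits per cycle are assumptions drawn from these four examples only.

```python
import math


def divide_latency(accuracy_bits, bits_per_cycle=16):
    """Latency in cycles for one divide operation (hypothetical helper)."""
    if accuracy_bits <= bits_per_cycle:
        # Bypass: normalization, prescaling, output processing.
        return 3
    iterations = math.ceil(accuracy_bits / bits_per_cycle)
    # Normalization + prescaling + iterations + output processing.
    return 2 + iterations + 1
```

For example, divide_latency(64) gives seven cycles and divide_latency(8) gives three cycles, as in Figures 6A and 6D.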

[0093] While the examples of Figures 6A-D refer to division of 64-bit scalar integers, aspects of the present technology may be applied to operands (e.g., dividends and divisors) having different numbers of bits (e.g., 8, 16, 32, 256, 512) and/or in other formats including scalar floating point, vector integer, and vector floating point.

[0094] Figure 7 illustrates a method according to an embodiment of the present technology. The method includes prescaling a dividend by a prescaling factor to generate a prescaled dividend in a prescaler stage at step 702 and prescaling a divisor by the prescaling factor to generate a prescaled divisor in the prescaler stage at step 704 (e.g., prescaling d and x by f as illustrated in Figure 2B using circuits illustrated in Figure 5A). The method further includes, at step 706, setting the prescaled dividend as an initial partial remainder for an initial iteration. The method also includes, at step 708, in an iterator stage connected to the prescaler stage, for each iteration of one or more iterations: generating a partial quotient estimate from a partial remainder of a prior iteration and generating a partial remainder from the partial quotient estimate and the partial remainder of the prior iteration (e.g., as illustrated in Figure 2C using circuits illustrated in Figures 5B-C).

[0095] Figure 8 illustrates a more detailed example of a method according to an embodiment of the present technology. Figure 8 includes additional steps that are not illustrated in the example of Figure 7. The method includes normalizing the divisor prior to prescaling the divisor at step 810, normalizing the dividend prior to prescaling the dividend at step 812, and obtaining a prescaling factor from a plurality of lookup tables that include prescaling factor values as a function of divisor values at step 814, each of the plurality of lookup tables characterizing a different range of divisor values (e.g., from LUTs 560-562). The method in Figure 8 also includes steps 702, 704, 706 and 708, which were described above with respect to Figure 7, and thus need not be described again. The method further includes performing floating point rounding or a 2’s complement operation at step 816 (e.g., by output processing circuit 208). A 2’s complement operation may be used for signed integer division. The method also includes prescaling another dividend by another prescaling factor obtained from another divisor to generate another prescaled dividend at step 818 (e.g., prescaling new numbers for another division operation) and providing the prescaled dividend as a low-accuracy quotient via a bypass connection that bypasses the iterator stage at step 820 (e.g., providing a prescaled dividend directly from the prescaler 204 to the output processing circuit 208 via the bypass connection 210 to reduce latency as illustrated in Figures 6C-D).

[0096] Figures 9A-H illustrate a worked example of a division operation (to divide X/D) that may be performed according to any of the methods and using any of the circuits described above. This example uses single-precision, 32-bit floating point (FP32) source operands but the present technology is not limited to such operands.

[0097] In the example illustrated in Figure 9A, the Dividend (X) is 98279774 and the Divisor (D) is 2559543, giving a “True” Quotient of 38.3973795 (in decimal values). These values are also given in different formats including Decimal as a power of 2, Hexadecimal (“Hex. Value”), floating point hexadecimal (in the box) and with mantissas padded to 53-bit dataflow to handle double-precision operands. The True Quotient is also given in binary and Hexadecimal.

[0098] Figure 9B illustrates normalization of the Dividend and Divisor to 53 bits including counting the leading zeros in both numbers to identify 29 leading zeros so that subsequent steps may ignore the 29 leading zeros, which may allow faster performance of subsequent steps. Normalization of Figure 9B may be performed by input processing circuits (e.g., input processing circuit 202 or any of the normalizers described above).
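One reading of the 29 leading zeros is that a 24-bit FP32 significand (23 stored fraction bits plus the implicit leading one) is right-aligned in the 53-bit dataflow, which leaves 53 - 24 = 29 leading zeros. The sketch below illustrates that reading; the alignment and the helper names fp32_mantissa, leading_zeros and normalize are assumptions, not details taken from Figure 9B.

```python
import struct

WIDTH = 53  # dataflow width used in the worked example


def fp32_mantissa(value):
    """24-bit significand of value as an FP32 number (sign ignored)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    hidden = 1 if exponent != 0 else 0   # implicit one for normal numbers
    return (hidden << 23) | fraction


def leading_zeros(mantissa, width=WIDTH):
    """Leading zeros of mantissa when right-aligned in a width-bit field."""
    return width - mantissa.bit_length()


def normalize(mantissa, width=WIDTH):
    """Shift left so the most significant one lands in the top bit."""
    shift = leading_zeros(mantissa, width)
    return mantissa << shift, shift
```

For both operands of Figure 9A, leading_zeros(fp32_mantissa(...)) evaluates to 29 under these assumptions.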

[0099] Figure 9C illustrates generating an estimate of 1/D (the reciprocal of the divisor D). The normalized Divisor is shown in both hexadecimal and binary formats. Table 910 shows different values of m (7, 8 and 9 in this example) that correspond to different tables (e.g., LUTs 560-562), each having a valid table range as shown. Different values of m correspond to different accuracy. Choosing m=8 and checking in the corresponding table (e.g., LUT 561) provides a constant value “C” of 0x2AF31. A value D’ is then obtained (shown in both binary and hexadecimal) and then 1/D (or D’^-1) is obtained by multiplying the constant C (0x2AF31) by D’ (0x1_388E_47FF_FFFF_FFFF_FFFF). This provides the answers shown in different formats (e.g., 0x1A380D in hexadecimal truncated to 21 bits). This value (f) is an approximation for 1/D, which may be tested by multiplying it by D to check if the result is approximately 1 (although this is not necessary in normal operation).
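The hexadecimal values in this paragraph are consistent with forming D’ by keeping the leading one and the next m bits of the normalized divisor and inverting all lower bits, and then truncating the product C*D’ to 21 bits. That construction is an inference from the listed values, not a statement in the text, and the assumed scale of C (2^-18) and the helper names below are hypothetical. Under those assumptions, the following sketch reproduces the value 0x1A380D.

```python
M = 8            # table selector chosen in the text
C = 0x2AF31      # constant from the m=8 table; assumed scale 2**-18
WIDTH = 53       # dataflow width of the worked example


def normalize(value, width=WIDTH):
    """Left-align an integer so its most significant one is the top bit."""
    return value << (width - value.bit_length())


def d_prime(norm, m=M, width=WIDTH):
    """Keep the top 1+m bits and invert the remaining bits (inferred rule)."""
    low_bits = width - 1 - m
    mask = (1 << low_bits) - 1
    return (norm & ~mask) | (~norm & mask)


def reciprocal_estimate(norm, c=C, m=M, width=WIDTH, out_bits=21):
    """Estimate f of 1/D with scale 2**-out_bits, truncated."""
    product = c * d_prime(norm, m, width)        # scale 2**-(18 + width - 1)
    return product >> (18 + width - 1 - out_bits)
```

Calling reciprocal_estimate(normalize(2559543)) yields 0x1A380D, and multiplying that estimate by the normalized divisor gives a value within about 0.00001 of one.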

[00100] Figure 9D shows prescaling of the dividend by multiplying it by f (e.g., by the value f or D’^-1, obtained in Figure 9C, 0x1A380D). This provides the scaled Dividend, x*f, or Sdvd shown in different formats including padded to 76 bits (e.g., 21 bits of reciprocal, up to 53 bits of dividend, and one bit for sign extension) and scaled plus and minus values (borrow-save notation). It can be seen that the first seventeen bits of the scaled dividend shown, 0x1332_D, are identical to the first seventeen bits of the quotient value shown in Figure 9A, which illustrates that the scaled dividend of this example is a sufficiently good approximation for 16-bit accuracy.
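The agreement between the leading bits of the scaled dividend and the quotient may be checked numerically. The sketch below uses the operands of Figure 9A and the reciprocal estimate 0x1A380D, assumed here to have a scale of 2^-21; the helper name top_bits and the use of exact rational arithmetic are choices made for this example.

```python
from fractions import Fraction
import math

DIVIDEND = 98279774
DIVISOR = 2559543
F_EST = Fraction(0x1A380D, 2**21)   # reciprocal estimate f; scale is assumed


def top_bits(value, bits=17):
    """First `bits` bits of a positive value after scaling it into [1, 2)."""
    value = Fraction(value)
    while value >= 2:
        value /= 2
    while value < 1:
        value *= 2
    return math.floor(value * 2 ** (bits - 1))


scaled_dividend = DIVIDEND * F_EST
quotient = Fraction(DIVIDEND, DIVISOR)
print(hex(top_bits(scaled_dividend)), hex(top_bits(quotient)))
```

Both values share the leading pattern 0x1332D, so in this example the prescaled dividend alone already matches the quotient over those bits.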

[00101] Figure 9E shows prescaling of the divisor by multiplying it by f (e.g., by the value f or D’^-1, obtained in Figure 9C, 0x1A380D). This provides the scaled Divisor, D*f, D*D’^-1 or Sdvr, shown as a scaled divisor without padding and padded to 75 bits (e.g., 21 bit reciprocal estimate + 1 bit sign extension).

[00102] Figure 9F illustrates values obtained from the prescaled divisor including the scaled divisor value found in Figure 9E (the “True” value shown on the top line), the 2’s complement of the True value (second line), the x3 multiple of the True value (third line) and the x3 multiple of the 2’s complement of the True value (fourth line), each shown as 75-bit values. These values may be used in iterations below in some implementations.

[00103] Figure 9G illustrates operation of an initial (first) iteration. In this iteration, the partial quotient estimate is multiplied by the prescaled divisor (e.g., from Figure 9E) and the result is subtracted from the partial remainder (e.g., using plus and minus partial remainder values labeled 920) to obtain the Partial Remainder - Quotient*Scaled Divisor plus and minus values labeled 922. The result is then shifted left by 16 bits to obtain the shifted values labeled 924. These values are then used in the next iteration.

[00104] Figure 9H illustrates operation of a second iteration. In this iteration, the shifted Partial Remainder - Quotient*Scaled Divisor values labeled 924 from the previous iteration (initial iteration) become the partial remainder values labeled 926 for the second iteration. In this iteration, the partial quotient estimate is multiplied by the prescaled divisor (e.g., from Figure 9E) and the result is subtracted from the partial remainder (e.g., using plus and minus partial remainder values labeled 926) to obtain the Partial Remainder - Quotient*Scaled Divisor plus and minus values labeled 928. The result is then shifted left by 16 bits to obtain the shifted values labeled 930 (e.g., to obtain pr_(i+1) = (pr_i - q_(i+1)*Sdiv)*2^b as in step 238). These values are then used as partial remainder values in the next iteration. The number of iterations performed may depend on the number of bits per cycle and the accuracy desired.

[00105] Figure 10 is a high-level block diagram of a computing system 1400 that can be used to implement various embodiments of the microprocessors described above. In one example, the computing system 1400 is a network system 1400. Specific devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc.

[00106] The network system may comprise a computing system 1401 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like. The computing system 1401 may include a central processing unit (CPU) 1410, a memory 1420, a mass storage device 1430, and an I/O interface 1460 connected to a bus 1470, where the CPU can include a microprocessor such as described above with respect to Figures 1A-B. The computing system 1401 is configured to connect to various input and output devices (keyboards, displays, etc.) through the I/O interface 1460. The bus 1470 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus or the like.

[00107] The CPU 1410 may comprise any type of electronic data processor, including the microprocessor 120 of Figure 1B, which includes divide circuits (e.g., any of the divide circuits described above). The CPU 1410 may be configured to implement any of the schemes described herein with respect to division, using any one or combination of steps described in the embodiments. The memory 1420 may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory 1420 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.

[00108] The mass storage device 1430 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1470. The mass storage device 1430 may comprise, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.

[00109] The computing system 1401 also includes one or more network interfaces 1450, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or one or more networks 1480. The network interface 1450 allows the computing system 1401 to communicate with remote units via the network 1480. For example, the network interface 1450 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the computing system 1401 is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like. In one embodiment, the network interface 1450 may be used to receive and/or transmit interest packets and/or data packets in an ICN. Herein, the term “network interface” will be understood to include a port.

[00110] The components depicted in the computing system of Figure 10 are those typically found in computing systems suitable for use with the technology described herein, and are intended to represent a broad category of such computer components that are well known in the art. Many different bus configurations, network platforms, and operating systems can be used.

[00111] The technology described herein can be implemented using hardware, firmware, software, or a combination of these. Depending on the embodiment, these elements of the embodiments described above can include hardware only or a combination of hardware and software (including firmware). For example, logic elements programmed by firmware to perform the functions described herein are one example of elements of the described divide unit. A divide unit can include a processor, FPGA, ASIC, integrated circuit or other type of circuit. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.

[00112] Communication media typically embody computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and include any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

[00113] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.

[00114] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.

[00115] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[00116] The description of the present disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.

[00117] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

[00118] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

CLAIMS

What is claimed is:

1. A divide circuit, comprising: a prescaler configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor; and an iterator configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.

2. The divide circuit of claim 1, wherein the prescaler is further configured to obtain the prescaling factor from the divisor.

3. The divide circuit of any of claims 1-2, wherein the prescaling factor is approximately a reciprocal of the divisor.

4. The divide circuit of any of claims 1-3, wherein the prescaler is further configured to obtain the prescaling factor from one or more lookup tables that indicate lookup values as a function of divisor values and the prescaling factor is a function of the lookup value and divisor value.

5. The divide circuit of claim 4, wherein the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

6. The divide circuit of any of claims 1-5, further comprising a normalizer configured to normalize the dividend and the divisor for the prescaler.

7. The divide circuit of any of claims 1-6, further comprising an output processing circuit connected to the iterator, the output processing circuit configured to perform at least one of floating point rounding or 2’s complementation.

8. The divide circuit of any one of claims 1-7, further comprising a bypass connection between the prescaler and the output processing circuit, bypassing the iterator, the bypass connection configured to provide the prescaled dividend as a low-accuracy quotient to the output processing circuit.

9. The divide circuit of any one of claims 1-7, wherein the prescaler comprises a plurality of lookup tables and a Booth recoder configured to perform Booth encoding of a value obtained from the one or more lookup tables.

10. The divide circuit of any of claims 1-9, wherein the iterator comprises one or more Redundant Binary Signed Digit Full Adders to multiply a partial quotient estimate and the prescaled divisor.

11. A method of dividing a dividend by a divisor, comprising: prescaling a dividend by a prescaling factor to generate a prescaled dividend in a prescaler stage; prescaling a divisor by the prescaling factor to generate a prescaled divisor in the prescaler stage; setting the prescaled dividend as an initial partial remainder for an initial iteration; and in an iterator stage connected to the prescaler stage, for each iteration of one or more iterations: generating a partial quotient estimate and a partial remainder from a partial quotient estimate and a remainder of a prior iteration.

12. The method of claim 11, further comprising obtaining the prescaling factor from the divisor.

13. The method of any one of claims 11-12, wherein the prescaling factor is approximately the reciprocal of the divisor.

14. The method of any one of claims 11-13, further comprising obtaining the prescaling factor from one or more lookup tables that include prescaling factor values as a function of divisor values.

15. The method of claim 14, wherein the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

16. The method of any one of claims 11-15, further comprising: normalizing the dividend prior to prescaling the dividend; and normalizing the divisor prior to prescaling the divisor.

17. The method of any one of claims 11-16, further comprising performing floating point rounding to obtain a rounded floating point quotient or performing a 2’s complement operation to generate a signed integer quotient.

18. The method of any one of claims 11-17, further comprising: prescaling another dividend by another prescaling factor obtained from another divisor to generate another prescaled dividend; and providing the prescaled dividend as a low-accuracy quotient via a bypass connection that bypasses the iterator stage.

19. The method of any one of claims 11-18, wherein generating the partial remainder includes subtracting a product of the partial quotient estimate of the current iteration and the prescaled divisor from the partial remainder of the prior iteration.

20. The method of claim 19, further comprising generating the product of the partial quotient estimate of the current iteration and the prescaled divisor using a plurality of Redundant Binary Signed Digit Full Adders.

21. A divide circuit, comprising: a plurality of pipelined divide circuits, each pipelined divide circuit comprising: a prescaler configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor; and an iterator connected to the prescaler, the iterator configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.

22. The divide circuit of claim 21, wherein the plurality of pipelined divide circuits include one or more first pipelined divide circuits and one or more second pipelined divide circuits, the first pipelined divide circuits being separately selectively powered from the second pipelined divide circuits.

23. The divide circuit of any of claims 21-22, wherein each of the plurality of pipelined divide circuits further comprises a normalizer.

24. The divide circuit of any of claims 21-23, wherein the prescaler is further configured to obtain the prescaling factor from the divisor.

25. The divide circuit of any of claims 21-24, wherein the prescaling factor is approximately the reciprocal of the divisor.

26. The divide circuit of any of claims 21-25, wherein the prescaler is further configured to obtain the prescaling factor from one or more lookup tables that indicate prescaling factor values as a function of divisor values.

27. The divide circuit of claim 26, wherein the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

28. The divide circuit of any of claims 21-27, wherein each of the pipelined divide circuits includes a bypass connection between the prescaler and an output processing circuit, bypassing the iterator, the bypass connection configured to provide the prescaled dividend as a low-accuracy quotient to the output processing circuit.

29. The divide circuit of any of claims 21-28, wherein the prescaler further comprises a plurality of lookup tables and one or more Booth Recoders configured to perform Booth encoding of a value obtained from one or more lookup tables.

30. The divide circuit of any of claims 21-29, wherein the iterator further comprises one or more Redundant Binary Signed Digit Full Adders configured to multiply a partial quotient estimate and the prescaled divisor.

31. A microprocessor, comprising: one or more divide circuits, each divide circuit comprising: a plurality of pipelined components including at least a normalizer, a prescaler and an iterator, the normalizer configured to remove leading zeros from integer values, the prescaler configured to prescale a dividend by a prescaling factor to generate a prescaled dividend and to prescale a divisor by the prescaling factor to generate a prescaled divisor, the iterator configured to receive the prescaled dividend and the prescaled divisor from the prescaler and, for each iteration of one or more iterations, generate a partial quotient estimate from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of a current iteration and a partial remainder of a prior iteration.

32. The microprocessor of claim 31, wherein the prescaler is further configured to obtain the prescaling factor from the divisor.

33. The microprocessor of any of claims 31-32, wherein the prescaling factor is approximately the reciprocal of the divisor.

34. The microprocessor of any of claims 31-33, wherein the prescaler is further configured to obtain the prescaling factor from one or more lookup tables that indicate prescaling factor values as a function of divisor values.

35. The microprocessor of claim 34, wherein the one or more lookup tables include a plurality of lookup tables, each of the plurality of lookup tables characterizing a different range of divisor values.

36. A divide circuit, comprising: means for prescaling a dividend by a prescaling factor to generate a prescaled dividend and prescaling a divisor by the prescaling factor to generate a prescaled divisor; and means for iteratively calculating a quotient, the means for iteratively calculating the quotient configured to, for each iteration of one or more iterations, generate a partial quotient estimate of the current iteration from a partial remainder of a prior iteration and generate a partial remainder from a partial quotient estimate of the current iteration and a partial remainder of a prior iteration.
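
By way of example, and not limitation, the prescale-and-iterate flow recited in claims 11 and 19 may be modeled in software as sketched below. This is an illustrative behavioral model only, not the disclosed circuit. The names `prescale_factor`, `iterative_divide`, `TABLE_BITS` and `DIGIT_BITS`, the table indexing scheme, the choice of rounding to form the partial quotient estimate, and the radix shift applied to the partial remainder are all assumptions introduced for illustration. Exact rational arithmetic is used so that the relation between quotient and residual can be checked exactly.

```python
from fractions import Fraction

TABLE_BITS = 6  # assumed: divisor bits used to index the lookup table
DIGIT_BITS = 4  # assumed: quotient bits retired per iteration


def prescale_factor(divisor):
    """Approximate 1/divisor for a normalized divisor in [1, 2).

    Hypothetical stand-in for a lookup table: the leading bits of the
    divisor pick an interval, and the entry is the reciprocal of the
    interval midpoint kept to TABLE_BITS + 2 fractional bits.
    """
    index = int(divisor * (1 << TABLE_BITS))
    midpoint = Fraction(2 * index + 1, 2 << TABLE_BITS)
    scale = 1 << (TABLE_BITS + 2)
    return Fraction(round(scale / midpoint), scale)


def iterative_divide(dividend, divisor, iterations=8):
    """Return (quotient, residual) with quotient + residual == dividend / divisor."""
    dividend, divisor = Fraction(dividend), Fraction(divisor)
    if not 1 <= divisor < 2:
        raise ValueError("divisor must be normalized to [1, 2)")

    factor = prescale_factor(divisor)
    remainder = factor * dividend      # prescaled dividend = initial partial remainder
    scaled_divisor = factor * divisor  # prescaled divisor, close to 1

    radix = 1 << DIGIT_BITS
    quotient, weight = Fraction(0), Fraction(1)
    for _ in range(iterations):
        # Partial quotient estimate taken from the prior partial remainder;
        # plain rounding suffices because the prescaled divisor is near 1.
        estimate = round(remainder)
        quotient += estimate * weight
        # New partial remainder: subtract estimate * prescaled divisor,
        # then shift left by one digit (modeling assumption).
        remainder = (remainder - estimate * scaled_divisor) * radix
        weight /= radix

    residual = remainder * weight / scaled_divisor
    return quotient, residual
```

For instance, `iterative_divide(Fraction(3, 2), Fraction(5, 4))` returns a quotient and a residual whose sum is exactly 6/5, with the residual shrinking by roughly a factor of the radix per iteration under the stated assumptions.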