CN212411183U

CN212411183U - Arithmetic circuit, chip and computing device for executing hash algorithm

Info

Publication number: CN212411183U
Application number: CN202021746320.9U
Authority: CN
Inventors: 范志军; 刘建波; 杨作兴
Original assignee: Shenzhen MicroBT Electronics Technology Co Ltd
Current assignee: Shenzhen MicroBT Electronics Technology Co Ltd
Priority date: 2020-08-19
Filing date: 2020-08-19
Publication date: 2021-01-26
Anticipated expiration: 2030-08-19

Abstract

The present disclosure relates to an arithmetic circuit, a chip and a computing device for performing a hash algorithm. The arithmetic circuit comprises a plurality of arithmetic stages arranged in a pipeline structure, each arithmetic stage comprising: a set of inputs and a set of outputs, the inputs being correspondingly coupled to the outputs of the previous arithmetic stage and the outputs being correspondingly coupled to the inputs of the next arithmetic stage; a plurality of combinational logic modules, each having an input coupled to at least a portion of the set of inputs; a plurality of delay modules, each having an input coupled to one of the set of inputs and an output coupled to one of the set of outputs not coupled to a combinational logic module, such that each of such outputs is coupled to one of the delay modules; and a plurality of complementary delay modules, each having an input coupled to the output of a corresponding combinational logic module and an output coupled to one of the set of outputs, wherein each delay module and each complementary delay module is formed of identical delay cells connected in series such that the computation delay from the set of inputs to each of the outputs of each arithmetic stage is substantially equal.

Description

Arithmetic circuit, chip and computing device for executing hash algorithm

Technical Field

The present disclosure relates to bitcoin mining and, more particularly, to an arithmetic circuit for performing a hash algorithm, and to a chip and a computing device including the arithmetic circuit.

Background

Bitcoin is a virtual encrypted digital currency in the form of P2P (Peer-to-Peer), the concept of which was originally proposed by Satoshi Nakamoto on November 1, 2008 and which was formally launched on January 3, 2009. Bitcoin is unique in that it is not issued by a specific monetary institution, but is generated through a large number of computations according to a specific algorithm.

The core of a mining machine for bitcoin mining is that rewards are obtained according to the machine's capability to compute the SHA-256 algorithm. For a mining machine, chip size, chip running speed and chip power consumption are the three factors that determine performance: chip size determines chip cost, chip running speed determines the running speed of the mining machine (i.e., its computing power), and chip power consumption determines the electricity consumed (i.e., the mining cost). In practical applications, the most important performance index for a mining machine is the power consumed per unit of computing power, i.e., the power-to-computing-power ratio.

Fig. 1 shows a prior art arithmetic circuit 100 for bitcoin mining. The arithmetic circuit 100 implements the SHA-256 algorithm using a pipeline structure.

As shown in fig. 1, the arithmetic circuit 100 includes N operational stages arranged in a pipeline structure, wherein each operational stage has a set of inputs 101 and a set of outputs 102, the set of inputs of each operational stage is correspondingly coupled to the set of outputs of the previous operational stage, and the set of outputs of each operational stage is correspondingly coupled to the set of inputs of the next operational stage.

Each operational stage comprises a plurality of combinational logic modules 111, 112, 113 for performing combinational logic operations based on data input to the operational stage.

In addition, each arithmetic stage includes a set of registers for storing data. As shown in fig. 1, each group of registers includes 8 buffer registers A, B, C, D, E, F, G, H and 16 extension registers W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15.

It should be noted that, for ease of understanding, the registers in each set in fig. 1 are numbered in correspondence with the SHA-256 algorithm, and the connection relationship between each register and the respective combinational logic modules 111, 112, 113 is schematically depicted in correspondence with the SHA-256 algorithm. For the sake of clarity, the connections between the registers and the respective combinational logic modules 111, 112, 113 are depicted only in the first arithmetic stage.

Each set of registers is clocked to pass data along the operational stages in sequence. Each set of registers is triggered at each clock cycle, passing the set of data stored therein to the next arithmetic stage for computation. At the same time, a new set of input data is input at the input 101 of the arithmetic circuit 100 and passed to the first arithmetic stage via the first set of registers to start the computation, and a new set of output data is output from the output 102 of the arithmetic circuit 100 via the last set of registers. That is, the clock is used to trigger the registers, feed the input data, and extract the output data.

When a register is triggered, the signal at its input should already have stabilized so that it can be passed onward by the register. The period of this clock is therefore limited by the computation delay of each operational stage, i.e., the clock period should be greater than or equal to the computation delay of each operational stage. In general, the clock period is selected to be substantially equal to the computation delay of each operational stage.

For the arithmetic circuit 100, the register delay (e.g., the clock-to-output (Ck2q) delay when the register is a latch), the clock tree delay, etc. are typically much smaller than the computation delay of the combinational logic modules. Thus, the clock period may be selected to be substantially equal to the computation delay of the combinational logic modules of each operational stage.

Thus, the throughput and the computational power of the arithmetic circuit 100 for performing the hash algorithm are determined by the clock frequency for the registers, i.e., by the computational delay of the combinational logic block of each arithmetic stage.

However, it is desirable to increase the computation frequency and throughput of the arithmetic circuit 100 without reducing the computation delay of the combinational logic modules of each arithmetic stage, thereby reducing the power-to-computing-power ratio. There is therefore a need for new techniques.

Summary of the Utility Model

It is an object of the present disclosure to provide an arithmetic circuit for performing a hash algorithm.

According to an aspect of the present disclosure, there is provided an arithmetic circuit for performing a hash algorithm, characterized in that the arithmetic circuit includes a plurality of arithmetic stages arranged in a pipeline structure, wherein each arithmetic stage includes: a set of inputs and a set of outputs, the set of inputs being correspondingly coupled to a set of outputs of a previous operational stage and the set of outputs being correspondingly coupled to a set of inputs of a subsequent operational stage; a plurality of combinational logic modules, each combinational logic module having an input coupled to at least a portion of the set of inputs; a plurality of delay modules, each delay module having an input coupled to one of the set of inputs and an output coupled to one of the set of outputs not coupled to the combinational logic module, such that the outputs of the set of outputs not coupled to the combinational logic module are each coupled to one delay module; and a plurality of complementary delay modules, each complementary delay module having an input coupled to the output of a corresponding combinational logic module and an output coupled to one of the set of outputs, wherein each of the delay modules and complementary delay modules of each operational stage is comprised of identical delay cells connected in series and configured such that the calculated delay from the set of inputs to each of the set of outputs of each operational stage is substantially equal.

In one implementation, the computational delay of each operational stage is substantially equal to k times a period of a clock used to feed the input data to the set of inputs, where k is an integer greater than or equal to 2.

In one implementation, each delay module is made up of M delay cells connected in series, where M is a multiple of k.

In one implementation, k is 2 or 3.

In one implementation, M is greater than or equal to 10 and less than or equal to 20.

In one implementation, M is 3 to 10 times k.

In one implementation, each delay cell is formed by a buffer or a pair of inverters.

In one implementation, the number of complementary delay blocks per arithmetic stage is equal to the number of combinational logic blocks, such that each of the set of outputs is coupled to one of the delay blocks and the complementary delay blocks.

In one implementation, the arithmetic circuitry is configured to execute the SHA256 algorithm.

According to another aspect of the present disclosure, there is provided a chip including the arithmetic circuit as described above.

According to yet another aspect of the present disclosure, there is provided a computing device comprising a chip as described above.

Other features of the present disclosure and advantages thereof will become more apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.

The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:

Fig. 1 shows a schematic diagram of an arithmetic circuit for performing a hashing algorithm according to the prior art.

Fig. 2 shows a schematic diagram of an arithmetic circuit for performing a hashing algorithm according to one or more exemplary embodiments of the present disclosure.

Fig. 3 shows a schematic diagram of an operational stage in the operational circuit shown in fig. 2.

Fig. 4 is a timing diagram illustrating the operation circuit shown in fig. 2 executing the hash algorithm.

Note that in the embodiments described below, the same reference numerals are used in common between different drawings to denote the same portions or portions having the same functions, and a repetitive description thereof will be omitted. In some cases, similar reference numbers and letters are used to denote similar items, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.

For convenience of understanding, the positions, sizes, ranges, and the like of the respective structures shown in the drawings sometimes do not indicate the actual positions, sizes, ranges, and the like. Therefore, the present disclosure is not limited to the positions, dimensions, ranges, and the like disclosed in the drawings.

Detailed Description

Various exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. That is, the structures and methods herein are shown by way of example to illustrate different embodiments of the structures and methods of the present disclosure. Those skilled in the art will understand, however, that they are merely illustrative of exemplary ways in which the disclosure may be practiced and not exhaustive. Furthermore, the figures are not necessarily to scale, some features may be exaggerated to show details of particular components.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

Fig. 2 shows a schematic diagram of an arithmetic circuit 200 for performing a hashing algorithm according to one or more exemplary embodiments of the present disclosure. The arithmetic circuit 200 may be used to perform the SHA-256 algorithm.

As shown in fig. 2, the arithmetic circuit 200 includes N operational stages (N is a positive integer) arranged in a pipeline structure, wherein each operational stage includes: a set of inputs and a set of outputs, a plurality of combinational logic modules 211, 212, 213, a plurality of delay modules 230, and a plurality of complementary delay modules 221, 222, 223.

For ease of understanding, the inputs and outputs of each operational stage and their connection relationships with the respective combinational logic modules 211, 212, 213 in fig. 2 are schematically depicted in correspondence with the SHA-256 algorithm. For the sake of clarity, the connections between the various inputs, outputs and the combinational logic modules 211, 212, 213 are depicted only in the first operational stage.

For example, the first operational stage includes a set of inputs 201-1 and a set of outputs 202-1, where the inputs 201-1 and the outputs 202-1 each comprise 24 data items, which correspond respectively to the data stored in the 8 buffer registers A, B, C, D, E, F, G, H and the 16 extension registers W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15 in the arithmetic circuit 100 shown in fig. 1. For ease of understanding, the labels of the prior-art registers corresponding to the respective data items are schematically indicated at each set of inputs and outputs.

The first arithmetic stage further comprises a plurality of combinational logic blocks 211, 212, 213, each having an input coupled to at least a portion of the set of inputs 201-1. For example, the inputs of the combinational logic module 213 are coupled to the inputs numbered W0, W1, W9, W14 of the set of inputs 201-1. The configuration and function of the combinational logic blocks 211, 212, 213 in the arithmetic circuit 200 correspond to the configuration and function of the combinational logic blocks 111, 112, 113 in the arithmetic circuit 100 shown in fig. 1, respectively.

In addition, the first operational stage further includes a plurality of delay modules 230 and a plurality of complementary delay modules 221, 222, 223.

Each delay module 230 has an input coupled to one of the set of inputs 201-1 and an output coupled to one of the set of outputs 202-1 that is not coupled to a combinational logic module, such that the outputs of the set of outputs 202-1 that are not coupled to a combinational logic module are each coupled to one delay module. For example, the input of the uppermost delay module 230 in fig. 2 is coupled to the input labeled A in the input 201-1 and its output is coupled to the output labeled B in the output 202-1. In fig. 2, the outputs labeled B, C, D, F, G, H, W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14 in the output 202-1 are not coupled to combinational logic modules, and each of them is coupled to one delay module 230.

Each complementary delay module 221, 222, 223 has an input coupled to the output of the corresponding combinational logic module 211, 212, 213 and an output coupled to one of the set of outputs 202-1. For example, the inputs of the complementary delay modules 221, 222, 223 are coupled to the outputs of the combinational logic modules 211, 212, 213, respectively, and their outputs are coupled to the outputs labeled A, E and W15, respectively, in the output 202-1.
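The data flow of one stage can be sketched in software. The following is a minimal model, assuming the standard SHA-256 round and message-expansion functions; it models only the logic, not the delays. The outputs A, E and W15 are computed, which corresponds to the role of the combinational logic modules, and the remaining 21 outputs are plain copies of inputs, which corresponds to the role of the delay modules. The function names, the choice of 64 stages, and the single-block helper are illustrative assumptions, not details from the disclosure.

```python
import hashlib
import math

MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def icbrt(n):
    """Largest integer x with x**3 <= n (exact binary search)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


PRIMES = first_primes(64)
# SHA-256 constants: fractional bits of cube roots / square roots of primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def stage(state, k_t):
    """One stage: 24 words in (A..H, W0..W15), 24 words out."""
    a, b, c, d, e, f, g, h = state[:8]
    w = state[8:]
    # Round logic feeding the A and E outputs.
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + k_t + w[0]) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    # Message expansion feeding the W15 output; reads W0, W1, W9, W14.
    s0 = rotr(w[1], 7) ^ rotr(w[1], 18) ^ (w[1] >> 3)
    s1 = rotr(w[14], 17) ^ rotr(w[14], 19) ^ (w[14] >> 10)
    new_w15 = (w[0] + s0 + w[9] + s1) & MASK
    new_a = (t1 + t2) & MASK
    new_e = (d + t1) & MASK
    # All other outputs are shifted copies of inputs (B <- A, C <- B, ...).
    return [new_a, a, b, c, new_e, e, f, g] + w[1:] + [new_w15]


def sha256_single_block(message):
    """Hash a message of at most 55 bytes by chaining 64 stages."""
    if len(message) > 55:
        raise ValueError("sketch handles a single 64-byte block only")
    block = (message + b"\x80" + b"\x00" * (55 - len(message))
             + (8 * len(message)).to_bytes(8, "big"))
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    state = H0 + w
    for k_t in K:
        state = stage(state, k_t)
    digest = [(x + y) & MASK for x, y in zip(H0, state[:8])]
    return "".join(f"{x:08x}" for x in digest)
```

Comparing `sha256_single_block(b"abc")` with `hashlib.sha256(b"abc").hexdigest()` is a quick way to check that the stage wiring in the sketch is consistent with SHA-256.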

In the embodiment shown in fig. 2, the number of complementary delay blocks per arithmetic stage is preferably equal to the number of combinational logic blocks, such that each of the sets of outputs is coupled to one of the delay blocks and the complementary delay blocks. In other embodiments, the number of complementary delay blocks per arithmetic stage may be less than the number of combinational logic blocks.

Fig. 3 shows a schematic diagram of an operational stage 300 in the operational circuit 200 shown in fig. 2.

As shown in fig. 3, the operational stage 300 includes: a set of inputs 301 and a set of outputs 302, a plurality of combinational logic modules 311, 312, 313, a delay module 330, and complementary delay modules 322, 323.

The delay module 330 and the complementary delay modules 322 and 323 are all composed of identical delay cells 340 connected in series. For example, in the embodiment shown in fig. 3, the complementary delay modules 322, 323 are composed of 1 delay cell 340 and 3 delay cells 340 connected in series, respectively, and each delay module 330 is composed of M delay cells 340 connected in series (M is a positive integer).

Because the delay module 330 and the complementary delay modules 322 and 323 are formed of identical delay cells 340 connected in series, the delay errors among the delay cells 340 can largely offset one another, so that the resulting delays of the delay module 330 and the complementary delay modules 322 and 323 are more accurate. Such delay errors between the respective delay cells 340 are caused by various factors (e.g., process, temperature, etc.) during the manufacture, installation, and operation of the delay cells 340.

In a preferred embodiment, each delay cell 340 may be formed of a buffer or a pair of inverters. In other embodiments, the delay unit 340 may be composed of one or more elements capable of implementing the delay function.

The delay modules and the complementary delay modules of each arithmetic stage should be configured such that the computation delay from the set of inputs to each of the set of outputs of each arithmetic stage is substantially equal. That is, the delay module 330 and the complementary delay modules 322, 323 in the arithmetic stage 300 should be configured such that the computation delays from the set of inputs 301 to the outputs labeled A, B, C, D, E, F, G, H, W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15 in the set of outputs 302 are substantially equal.

As mentioned above, the register delay, clock tree delay, etc. are much smaller than the computation delay of the combinational logic modules. In other words, the delay module 330 and the complementary delay modules 322, 323 in the arithmetic stage 300 should be configured such that the following are substantially equal:

1. the computation delay from the input 301 to the output labeled A in the output 302, i.e., the sum of the computation delays of the combinational logic modules 311 and 312;

2. the computation delay from the input 301 to the output labeled E in the output 302, i.e., the sum of the computation delays of the combinational logic module 312 and 1 delay cell 340;

3. the computation delay from the input 301 to the output labeled W15 in the output 302, i.e., the sum of the computation delays of the combinational logic module 313 and 3 delay cells 340;

4. the computation delay from the input 301 to each of the other outputs (B, C, D, F, G, H, W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14) in the output 302, i.e., the sum of the computation delays of the M delay cells 340.
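The four conditions above amount to padding every path up to the delay of the longest path. The following minimal sketch computes the number of delay cells on each path, assuming the path to A is the longest and that every gap is an exact multiple of the cell delay. The delay values in the usage note are hypothetical and chosen only so that the result reproduces the 1-cell and 3-cell modules of fig. 3; the function name is illustrative.

```python
def balance_stage(d311, d312, d313, cell):
    """Return (stage_delay, cells per path) for a stage wired as in fig. 3.

    d311, d312, d313: delays of the combinational logic modules.
    cell: delay of one delay cell, in the same unit.
    Assumes the path to output A (modules 311 and 312) is the longest.
    """
    target = d311 + d312  # path to A sets the stage delay
    gaps = {
        "E": target - d312,    # padded by complementary module 322
        "W15": target - d313,  # padded by complementary module 323
        "others": target,      # delay module 330, M cells
    }
    cells = {}
    for name, gap in gaps.items():
        count, remainder = divmod(gap, cell)
        if gap < 0 or remainder:
            raise ValueError(f"path {name} cannot be balanced exactly")
        cells[name] = count
    return target, cells
```

With hypothetical delays of 10, 110 and 90 ps for modules 311, 312 and 313 and a 10 ps cell, `balance_stage(10, 110, 90, 10)` yields a 120 ps stage delay with 1 cell on the E path, 3 cells on the W15 path, and M = 12 cells in each delay module.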

Those skilled in the art will appreciate that the number and configuration of the delay modules 330 and the complementary delay modules 322, 323 in fig. 3 are exemplary and may be adjusted according to the hash algorithm implemented by the arithmetic circuit 200 and the specific configuration of the chip.

In the embodiment shown in fig. 3, the number of complementary delay modules 322, 323 is less than the number of combinational logic modules, and the output labeled A is not coupled to a complementary delay module, but is directly coupled to the combinational logic module 311. In some embodiments, the output in the set of outputs whose corresponding combinational logic has the longest computation delay may be coupled directly to the corresponding combinational logic module rather than to a complementary delay module. In other words, the computation delay from the input 301 to the output (A) with the longest combinational logic delay in the output 302 is taken directly as the computation delay of the arithmetic stage 300, and the computation delays to the other outputs (B, C, ..., W15) in the output 302 are padded to match it by the delay module 330 and the complementary delay modules 322, 323. An advantage of such an embodiment is that no additional computation delay is introduced, minimizing the overall computation delay of the arithmetic stage 300.

In such embodiments, the number of delay cells 340 included in the delay module 330 and the complementary delay modules 322, 323 may be determined by the amount of computation delay that needs to be made up. For example, in the embodiment shown in fig. 3, to make up the difference between the computation delay from the input 301 to the output labeled E in the output 302 and the computation delay from the input 301 to the output labeled A in the output 302, i.e., to make up the computation delay of the combinational logic module 311, the complementary delay module 322 is composed of 1 delay cell 340.

In other embodiments, the number of complementary delay modules may be equal to the number of combinational logic modules, and the number of delay cells 340 included in the delay module 330 and the complementary delay modules 322, 323 may also be determined in combination with other factors. For example, to better offset the delay errors between the delay cells 340, the number of delay cells 340 included in the delay module 330 and the complementary delay modules 322, 323 may be increased appropriately. However, the number of delay cells 340 should not be excessively large, in consideration of the manufacturing cost and power consumption of the chip.

In a preferred embodiment, the number M of the delay units 340 included in the delay module 330 may be greater than or equal to 10 and less than or equal to 20. In a further preferred embodiment, M may be greater than or equal to 12 and less than or equal to 18.

It is noted that the expression "substantially equal" in this context means that the two are approximately equal within a certain error, but not necessarily exactly equal. For example, "substantially equal" means that the two are approximately equal within a 2% error. Preferably, the two are approximately equal within a 1% error. In some contexts, the error may be about 5%. It should be understood by those skilled in the art that this is in accordance with technical principles and engineering practices.
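The tolerance can be expressed as a simple relative-error check. The sketch below is one possible reading, assuming the error is measured relative to the larger of the two delays; the text does not specify how the error is normalized, and the function name is illustrative.

```python
def substantially_equal(a, b, tolerance=0.02):
    """True if a and b differ by at most `tolerance`, relative to the larger.

    The default of 2% follows the example in the text; 0.01 and 0.05 are the
    other tolerances it mentions.
    """
    return abs(a - b) <= tolerance * max(abs(a), abs(b))
```

For example, path delays of 100 and 101.5 (in any common unit) pass at the 2% tolerance, while 100 and 103 pass only at the 5% tolerance.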

Because the computation delay from the set of inputs of each arithmetic stage to each of its set of outputs is substantially equal, data can be passed along the arithmetic stages in sequence and in step without being triggered through registers. In other words, the arithmetic circuit 200 of the present disclosure does not require the buffer registers and extension registers of the related art (i.e., the 8 buffer registers A, B, C, D, E, F, G, H and the 16 extension registers W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15 in the arithmetic circuit 100 shown in fig. 1).

Further, as described above, the period of the clock for triggering the register, feeding the input data, and extracting the output data in the related art should be greater than or equal to the computation delay of each operation stage. However, the clock period for feeding the input data and extracting the output data in the arithmetic circuit 200 of the present disclosure does not need to be greater than or equal to the computation delay of the combinational logic block of each arithmetic stage. Thus, the computational frequency and throughput of the operational circuit 200 of the present disclosure is not limited by the computational latency of the combinational logic blocks of each operational stage.

Fig. 4 is a timing diagram illustrating the operation circuit 200 shown in fig. 2 executing the hash algorithm.

As shown in FIG. 4, a clock CLK is used to feed input data to the operational circuit 200 at input 201-1. The period of the clock CLK is T. At each rising edge of the clock CLK a new set of input data is fed to the operational circuit 200 at input 201-1.

Those skilled in the art will appreciate that the sets of input data in fig. 4 are fed to the input 201-1 of the operational circuit 200 at the rising edge of the clock CLK, by way of example only. In other embodiments, the input data may also be fed to the input 201-1 of the operational circuit 200 at the falling edge of the clock CLK.

As described above, the period T of the clock CLK of the operational circuit 200 does not need to be greater than or equal to the computation delay of each operational stage. Instead, the period T of the clock CLK may be smaller than the computation delay of each operational stage, so that the computation frequency and throughput of the operational circuit 200 are increased, thereby increasing the computing power of the operational circuit 200 and decreasing its power-to-computing-power ratio.

In a preferred embodiment, the computational delay of each operational stage may be substantially equal to k times the period T of the clock CLK, where k is an integer greater than or equal to 2. This allows each operational stage to accommodate exactly k sets of data when the operational circuit 200 is in operation.
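The relation between the stage delay, k and the feed clock can be sketched as follows. This is a minimal illustration, assuming an ideal circuit in which the stage delay is exactly k times T; the function names and the example figures in the usage note are hypothetical.

```python
def feed_clock_period(stage_delay, k=1):
    """Clock period T such that the stage delay equals k * T.

    k = 1 corresponds to the registered pipeline of fig. 1, where the clock
    period is roughly the stage delay.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return stage_delay / k


def sets_in_flight(num_stages, k):
    """Number of data sets held in the whole pipeline at steady state."""
    return num_stages * k
```

With a hypothetical stage delay of 1000 ps, `feed_clock_period(1000, 2)` gives a 500 ps clock period, i.e., twice the feed rate of the k = 1 case, and a pipeline of 64 such stages would hold `sets_in_flight(64, 2)` = 128 data sets at once.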

Given that the computation delay of each operational stage is essentially fixed, increasing the value of k helps to increase the throughput of the operational circuit 200 and to decrease its power-to-computing-power ratio. However, when k is large, the negative influence of the delay errors between the delay cells 340 grows, which increases the risk of delay skew and data corruption in each operational stage. Preferably, k may be selected to be 2 or 3.

To control the negative effects of delay errors between the individual delay units 340, M may preferably be chosen to be 3 to 10 times k. Further preferably, M may be selected to be 4 to 8 times k. Further preferably, M may be selected to be 5 to 7 times k.
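The preferred ranges for M can be enumerated for a given k. The sketch below combines three preferences stated separately in the disclosure (M is a multiple of k, M lies between 10 and 20, and M is 3 to 10 times k); applying all three at once is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
def candidate_m(k, m_min=10, m_max=20, low_mult=3, high_mult=10):
    """List values of M that are multiples of k within both preferred ranges."""
    return [
        m
        for m in range(m_min, m_max + 1)
        if m % k == 0 and low_mult * k <= m <= high_mult * k
    ]
```

Under these combined constraints, k = 2 admits the even values from 10 to 20, and k = 3 admits 12, 15 and 18. Tightening `low_mult` and `high_mult` to 5 and 7 narrows the choice to 10, 12 and 14 for k = 2.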

Fig. 4 exemplarily shows a timing diagram of the arithmetic circuit 200 executing the hash algorithm in the case where k is 2.

In the embodiment shown in fig. 4, the computation delay of each operation stage is 2T. In other words, the calculated delay from one set of inputs to each of one set of outputs of each operational stage of the operational circuit 200 is 2T.

That is, in each operational stage of the operational circuit 200, the following are all 2T: the sum of the computation delays of the combinational logic modules 211, 212 and the complementary delay module 221 (i.e., the computation delay from the set of inputs to the output labeled A of each operational stage); the sum of the computation delays of the combinational logic module 212 and the complementary delay module 222 (i.e., the computation delay from the set of inputs to the output labeled E of each operational stage); the sum of the computation delays of the combinational logic module 213 and the complementary delay module 223 (i.e., the computation delay from the set of inputs to the output labeled W15 of each operational stage); and the computation delay of the delay module 230 (i.e., the computation delay from the set of inputs to each of the other outputs (B, C, D, F, G, H, W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14) of each operational stage).

As shown in fig. 4, at t = 0, at the first rising edge of the clock CLK, a first set of data (data 1) is fed to the input 201-1 of the first arithmetic stage of the arithmetic circuit 200, and then passed to the combinational logic modules 211, 212, 213 of the first arithmetic stage, as well as to the delay module 230 and the complementary delay modules 221, 222, 223. After a computation delay of 2T, at t = 2T, data 1 arrives at the output 202-1 of the first arithmetic stage and is passed on to the input 201-2 of the second arithmetic stage.

Thereafter, data 1 is passed to the combinational logic modules 211, 212, 213 of the second arithmetic stage as well as to the delay module 230 and the complementary delay modules 221, 222, 223. Likewise, after a computation delay of 2T, at t = 4T, data 1 arrives at the output 202-2 of the second arithmetic stage and is passed on to the input 201-3 of the third arithmetic stage.

Thereafter, again after a computation delay of 2T, at t = 6T, data 1 arrives at the output 202-3 of the third arithmetic stage and is passed on to the input 201-4 of the fourth arithmetic stage.

Furthermore, at t = T, at the second rising edge of the clock CLK, a second set of data (data 2) is fed to the input 201-1 of the first arithmetic stage of the arithmetic circuit 200 and then passed to the combinational logic modules 211, 212, 213 of the first arithmetic stage as well as to the delay module 230 and the complementary delay modules 221, 222, 223. Between t = T and t = 2T, data 1 and data 2 are both accommodated in the first arithmetic stage of the arithmetic circuit 200. After a computation delay of 2T, at t = 3T, data 2 arrives at the output 202-1 of the first arithmetic stage and is passed on to the input 201-2 of the second arithmetic stage.

Thereafter, data 2 is passed to the combinational logic modules 211, 212, 213 of the second arithmetic stage as well as to the delay module 230 and the complementary delay modules 221, 222, 223. Between t = 3T and t = 4T, data 1 and data 2 are both accommodated in the second arithmetic stage of the arithmetic circuit 200. Again after a computation delay of 2T, at t = 5T, data 2 arrives at the output 202-2 of the second arithmetic stage and is passed on to the input 201-3 of the third arithmetic stage. Between t = 5T and t = 6T, data 1 and data 2 are both accommodated in the third arithmetic stage of the arithmetic circuit 200.

Furthermore, at t = 2T, at the third rising edge of the clock CLK, a third set of data (data 3) is fed to the input 201-1 of the first arithmetic stage of the arithmetic circuit 200 and then passed to the combinational logic modules 211, 212, 213 of the first arithmetic stage as well as to the delay module 230 and the complementary delay modules 221, 222, 223. Between t = 2T and t = 3T, data 2 and data 3 are both accommodated in the first arithmetic stage of the arithmetic circuit 200. After a computation delay of 2T, at t = 4T, data 3 arrives at the output 202-1 of the first arithmetic stage and is passed on to the input 201-2 of the second arithmetic stage.

Thereafter, data 3 is passed to the combinational logic modules 211, 212, 213 of the second arithmetic stage as well as to the delay module 230 and the complementary delay modules 221, 222, 223. Between t = 4T and t = 5T, data 2 and data 3 are both accommodated in the second arithmetic stage of the arithmetic circuit 200. Again after a computation delay of 2T, at t = 6T, data 3 arrives at the output 202-2 of the second arithmetic stage and is passed on to the input 201-3 of the third arithmetic stage.

Furthermore, at the fourth rising edge of the clock CLK, at T = 3T, a fourth set of data (data 4) is fed to the input 201-1 of the first arithmetic stage of the arithmetic circuit 200 and then passed to the combinational logic blocks 211, 212, 213 of the first arithmetic stage as well as the delay block 230 and the complementary delay blocks 221, 222, 223. Between T = 3T and T = 4T, data 3 and data 4 are both accommodated in the first arithmetic stage of the arithmetic circuit 200. After a computation delay of 2T, at T = 5T, data 4 arrives at the output 202-1 of the first arithmetic stage and is passed on to the input 201-2 of the second arithmetic stage.

Thereafter, data 4 is passed to the combinational logic blocks 211, 212, 213 of the second arithmetic stage as well as the delay block 230 and the complementary delay blocks 221, 222, 223. Between T = 5T and T = 6T, data 3 and data 4 are both accommodated in the second arithmetic stage of the arithmetic circuit 200.

Furthermore, at the fifth rising edge of the clock CLK, at T = 4T, a fifth set of data (data 5) is fed to the input 201-1 of the first arithmetic stage of the arithmetic circuit 200 and then passed to the combinational logic blocks 211, 212, 213 of the first arithmetic stage as well as the delay block 230 and the complementary delay blocks 221, 222, 223. Between T = 4T and T = 5T, data 4 and data 5 are both accommodated in the first arithmetic stage of the arithmetic circuit 200. After a computation delay of 2T, at T = 6T, data 5 arrives at the output 202-1 of the first arithmetic stage and is passed on to the input 201-2 of the second arithmetic stage.

Furthermore, at the sixth rising edge of the clock CLK, at T = 5T, a sixth set of data (data 6) is fed to the input 201-1 of the first arithmetic stage of the arithmetic circuit 200 and then passed to the combinational logic blocks 211, 212, 213 of the first arithmetic stage as well as the delay block 230 and the complementary delay blocks 221, 222, 223. Between T = 5T and T = 6T, data 5 and data 6 are both accommodated in the first arithmetic stage of the arithmetic circuit 200.
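The timing walk-through above can be condensed into a small model. The following Python sketch is illustrative only and is not part of the disclosed circuit: it assumes that data 1 is fed at T = 0 (consistent with the occupancy intervals stated above), that one new set of data is fed per clock period, and that each arithmetic stage has a computation delay of k = 2 clock periods. The names stage_window and occupants are hypothetical helpers introduced for this example, and all times are expressed in units of the clock period T.

```python
def stage_window(n, s, k=2):
    """Return (enter, leave) times, in units of T, of data set n in stage s.

    Assumption: data set n is fed to stage 1 at time (n - 1) * T and every
    stage adds a computation delay of k * T.
    """
    enter = (n - 1) + (s - 1) * k
    return enter, enter + k


def occupants(stage, period_start, k=2, max_n=6):
    """List the data sets inside `stage` during [period_start, period_start + 1) * T."""
    inside = []
    for n in range(1, max_n + 1):
        enter, leave = stage_window(n, stage, k)
        if enter <= period_start < leave:
            inside.append(n)
    return inside


if __name__ == "__main__":
    # Print the schedule for six data sets travelling through three stages.
    for n in range(1, 7):
        for s in range(1, 4):
            enter, leave = stage_window(n, s)
            print(f"data {n}: stage {s} from {enter}T to {leave}T")
    # Print which data sets share each stage in every clock period up to 6T.
    for start in range(0, 6):
        for s in range(1, 4):
            print(f"[{start}T, {start + 1}T) stage {s}: {occupants(s, start)}")
```

Under these assumptions the model gives the same instants as the description, for example data 2 leaving the first arithmetic stage at 3T and data 3 and data 4 sharing the second arithmetic stage between 5T and 6T.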

It can be seen that, during normal operation of the arithmetic circuit 200, each arithmetic stage can accommodate k sets of data, i.e., N arithmetic stages can simultaneously compute k × N sets of data. In contrast, the prior art arithmetic circuit 100 including N arithmetic stages can only compute N sets of data simultaneously. This is one of the significant advantages of the present invention over the prior art.
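The k × N figure can be checked by counting the sets of data that are inside the pipeline during one clock period. The sketch below is a simplified model, not the disclosed circuit: it assumes one set of data is fed per clock period starting at T = 0 and a total latency of N × k clock periods, and it models the prior art arithmetic circuit 100 as the case k = 1, which is an assumption made only for this comparison. The function name sets_in_flight is hypothetical.

```python
def sets_in_flight(period, num_stages, k):
    """Count the data sets inside the pipeline during [period, period + 1) * T.

    Assumption: data set n is fed at time (n - 1) * T and leaves the last
    stage at time (n - 1 + num_stages * k) * T.
    """
    total_latency = num_stages * k
    newest = period + 1                          # last data set already fed
    oldest = max(1, period - total_latency + 2)  # first data set not yet finished
    return max(0, newest - oldest + 1)


if __name__ == "__main__":
    N = 3  # three arithmetic stages, as in the walk-through
    for k in (1, 2, 3):
        # Sample a period late enough for the pipeline to be full.
        print(f"k = {k}: {sets_in_flight(100, N, k)} sets in flight")
```

Once the pipeline has filled, the count stays at k × N for every later clock period; before that it grows by one set of data per period.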

The arithmetic circuit according to the present disclosure can be implemented in various suitable manners such as software, hardware, a combination of software and hardware, and the like. In one implementation, a chip for bitcoin mining may include an arithmetic circuit as described above, and the chip may also be included in a computing device for bitcoin mining.

The terms "front", "back", "top", "bottom", "over", "under" and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

As used herein, the word "exemplary" means "serving as an example, instance, or illustration," and not as a "model" that is to be reproduced exactly. Any implementation exemplarily described herein is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, the disclosure is not limited by any expressed or implied theory presented in the preceding technical field, background, utility model content, or detailed description.

As used herein, the term "substantially" is intended to encompass any minor variation resulting from design or manufacturing imperfections, device or component tolerances, environmental influences, and/or other factors. The word "substantially" also allows for differences from a perfect or ideal situation due to parasitics, noise, and other practical considerations that may exist in a practical implementation.

In addition, the foregoing description may refer to elements or nodes or features being "connected" or "coupled" together. As used herein, unless expressly stated otherwise, "connected" means that one element/node/feature is directly connected to (or directly communicates with) another element/node/feature, either electrically, mechanically, logically, or otherwise. Similarly, unless expressly stated otherwise, "coupled" means that one element/node/feature may be mechanically, electrically, logically, or otherwise joined to another element/node/feature in a direct or indirect manner to allow for interaction, even though the two features may not be directly connected. That is, to be "coupled" is intended to include both direct and indirect connections of elements or other features, including connections that utilize one or more intermediate elements.

In addition, "first," "second," and like terms may also be used herein for reference purposes only and are thus not intended to be limiting. For example, the terms "first", "second", and other such numerical terms referring to structures or elements do not imply a sequence or order unless clearly indicated by the context.

It will be further understood that the terms "comprises/comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or components, and/or groups thereof.

In the present disclosure, the term "providing" is used in a broad sense to encompass all ways of obtaining an object, and thus "providing an object" includes, but is not limited to, "purchasing," "preparing/manufacturing," "arranging/setting," "installing/assembling," and/or "ordering" the object, and the like.

Those skilled in the art will appreciate that the boundaries between the above-described operations are merely illustrative. Multiple operations may be combined into a single operation, a single operation may be distributed among additional operations, and operations may be performed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments. Other modifications, variations, and alternatives are also possible. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. The various embodiments disclosed herein may be combined in any combination without departing from the spirit and scope of the present disclosure. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims

1. An arithmetic circuit for performing a hash algorithm, the arithmetic circuit comprising a plurality of arithmetic stages arranged in a pipeline structure, wherein each arithmetic stage comprises:

a set of inputs and a set of outputs, the set of inputs being correspondingly coupled to a set of outputs of a previous arithmetic stage and the set of outputs being correspondingly coupled to a set of inputs of a subsequent arithmetic stage;

a plurality of combinational logic modules, each combinational logic module having an input coupled to at least a portion of the set of inputs;

a plurality of delay modules, each delay module having an input coupled to one of the set of inputs and an output coupled to one of the set of outputs not coupled to the combinational logic module, such that the outputs of the set of outputs not coupled to the combinational logic module are each coupled to one delay module; and

a plurality of complementary delay modules, each complementary delay module having an input coupled to the output of a corresponding combinational logic module and an output coupled to one of the set of outputs, wherein,

each of the delay modules and the complementary delay modules of each arithmetic stage is composed of identical delay cells connected in series and is configured such that the computation delay from the set of inputs to each of the set of outputs of each arithmetic stage is substantially equal.

2. The arithmetic circuit of claim 1, wherein the computation delay of each arithmetic stage is substantially equal to k times the period of a clock used to feed input data to the set of inputs, where k is an integer greater than or equal to 2.

3. The arithmetic circuit of claim 2, wherein each delay module is composed of M delay cells connected in series, where M is a multiple of k.

4. The arithmetic circuit of claim 2, wherein k is 2 or 3.

5. The arithmetic circuit of claim 3, wherein M is greater than or equal to 10 and less than or equal to 20.

6. The arithmetic circuit of claim 3, wherein M is 3 to 10 times k.

7. The arithmetic circuit of any of claims 1-6, wherein each delay cell is composed of a buffer or a pair of inverters.

8. The arithmetic circuit of any of claims 1-6, wherein the number of complementary delay modules in each arithmetic stage is equal to the number of combinational logic modules, such that each of the set of outputs is coupled to one of a delay module and a complementary delay module.

9. The arithmetic circuit of any of claims 1-6, wherein the arithmetic circuit is configured to perform a SHA256 algorithm.

10. A chip, characterized in that it comprises an arithmetic circuit according to any one of claims 1-9.

11. A computing device, characterized in that it comprises a chip according to claim 10.