WO1999013610A2 - Pipelined completion for asynchronous communication - Google Patents
Pipelined completion for asynchronous communication Download PDFInfo
- Publication number
- WO1999013610A2 WO1999013610A2 PCT/US1998/019192 US9819192W WO9913610A2 WO 1999013610 A2 WO1999013610 A2 WO 1999013610A2 US 9819192 W US9819192 W US 9819192W WO 9913610 A2 WO9913610 A2 WO 9913610A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- circuit
- completion
- data
- tree
- stage
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8053—Vector processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F9/3871—Asynchronous instruction pipeline, e.g. using handshake signals between stages
Definitions
- the present invention relates to information processing, and more specifically to architecture and operation of asynchronous circuits and processors.
- Many information processing devices operate based on a control clock signal to synchronize operations of different processing components and therefore are usually referred to as "synchronous" processing devices.
- Different processing components may operate at different speeds due to various factors, including the nature of different functions, different characteristics of the components, or properties of the signals processed by the components. Synchronizing these different processing components requires the speed of the control clock signal to accommodate the slowest processing component. Thus, some processing components may complete respective operations earlier than other, slower components and have to wait until all processing components complete their operations.
- Although the speed of a synchronous processor can be improved by increasing the clock speed to a certain extent, synchronous processing is not an efficient way of utilizing available resources.
- Such an asynchronous processor can be optimized for high-speed processing by special pipelining techniques based on unique properties of the asynchronous architecture.
- Asynchronous pipelining allows multiple instructions to be executed at the same time. This has the effect of executing instructions in a different order than originally intended.
- An asynchronous processor compensates for this out-of-order execution by maintaining the integrity of the output data without a synchronizing clock signal.
- A synchronous processor relies on the control clock signal to indicate when an operation of a component is completed and when the next operation of another component may start. With such clock synchronization eliminated, a pipelined processing component in an asynchronous processor instead generates a completion signal to inform the previous processing component of the completion of an operation.
- P1 and P2 are two adjacent processing components in an asynchronous pipeline.
- The component P1 receives and processes data X to produce an output Y.
- The component P2 processes the output Y to produce a result Z.
- At least two communication channels are formed between P1 and P2: a data channel that sends Y from P1 to P2, and a request/acknowledgment channel by which P2 acknowledges receipt of Y to P1 and requests the next Y from P1.
- The messages communicated to P1 via the request/acknowledgment channel are produced by P2 according to a completion signal internal to P2.
- this completion signal can introduce an extra delay that degrades the performance of the asynchronous processor. Such extra delay is particularly problematic when operations of a datum are decomposed into two or more concurrent elementary operations on different portions of the datum.
- Each elementary operation requires a completion signal.
- the completion signals for all elementary operations are combined into one global completion signal that indicates completion of operations on that datum.
- A completion circuit ("completion tree") is needed to collect all elementary completion signals to generate that global completion signal. The complexity of such a completion tree increases with the number of elementary completion signals.
- the present disclosure provides a pipelined completion tree for asynchronous processors.
- a high throughput and a low latency can be achieved by decomposing any pipeline unit into an array of simple pipeline blocks.
- Each block operates only on a small portion of the datapath.
- Global synchronization between stages, when needed, is implemented by copy trees and slack matching. More specifically, one way to reduce the delay in the completion tree uses asynchronous pipelining to decompose a long critical cycle in a datapath into two or more short cycles.
- One or more decoupling buffers may be disposed in the datapath between two pipelined stages.
- Another way to reduce the delay in the completion tree is to reduce the delay caused by distribution of a signal to all N bits in an N-bit datapath. Such delay can be significant when N is large.
- One embodiment of the asynchronous circuit uses the above two techniques to form a pipelined completion tree in each stage to process data without a clock signal.
- This circuit comprises a first processing stage receiving an input data and producing a first output data, and a second processing stage, connected to communicate with said first processing stage without prior knowledge of delays associated with said first and second processing stages and to receive said first output data to produce an output.
- Each processing stage includes: a first register and a second register connected in parallel relative to each other to respectively receive a first portion and a second portion of a received data, a first logic circuit connected to said first register to produce a first completion signal indicating whether all bits of said first portion of said received data are received by said first register, a second logic circuit connected to said second register to produce a second completion signal indicating whether all bits of said second portion of said received data are received by said second register, a third logic circuit connected to receive said first and second completion signals and configured to produce a third completion signal to indicate whether all bits of said first and second portions of said received data are received by said first and second registers, a first buffer circuit connected between said first logic circuit and the third logic circuit to pipeline said first and third logic circuits, and a second buffer circuit connected between said second logic circuit and the third logic circuit to pipeline said second and third logic circuits.
- FIG. 1 shows two communicating processing stages in an asynchronous pipeline circuit based on a quasi-delay- insensitive four-phase handshake protocol.
- FIG. 2 shows a prior-art completion tree formed by two-input C-elements.
- FIG. 3A is a simplified diagram showing the asynchronous pipeline in FIG. 1.
- FIG. 3B shows an improved asynchronous pipeline with a decoupling buffer connected between two processing stages.
- FIG. 3C shows one implementation of the circuit of FIG. 3B using a C-element as the decoupling buffer.
- FIG. 4 shows an asynchronous circuit implementing a pipelined completion tree and a pipelined distribution circuit in each processing stage.
- FIG. 5 shows a copy tree circuit.
- FIG. 6 shows one embodiment of the copy tree in FIG. 5.
- FIG. 7A is a diagram illustrating decomposition of an N-bit datapath of an asynchronous pipeline into two or more parallel datapaths with each having a processing block to process a portion of the N-bit data.
- FIG. 7B is a diagram showing different datapath structures at different stages in an asynchronous pipeline.
- FIG. 7C shows a modified circuit of the asynchronous pipeline in FIG. 7A where a processing stage is decomposed into two pipelined small processing stages to improve the throughput.
- FIG. 8 shows an asynchronous circuit having a control circuit to synchronize decomposed processing blocks of two different processing stages.
- FIG. 9A shows a balanced binary tree.
- FIG. 9B shows a skewed binary tree.
- FIG. 9C shows a 4-leaf skewed completion tree.
- FIG. 9D shows a 4-leaf balanced completion tree.
- The asynchronous circuits disclosed herein are quasi-delay-insensitive in the sense that such circuits do not use any assumption on, or knowledge of, delays in most operators and wires.
- One of various implementations of such quasi- delay-insensitive communication is a four-phase protocol for communication between two adjacent processing stages in an asynchronous pipeline. This four-phase protocol will be used in the following to illustrate various embodiments and should not be construed as limitations of the invention.
- FIG. 1 is a block diagram showing the implementation of the four-phase protocol in an asynchronous pipeline.
- Two adjacent stages (or processing components) 110 ("A") and 120 ("B") are connected to send an N-bit data from the first stage 110 to the second stage 120 via data channels 130.
- a communication channel 140 is implemented to send a request/acknowledgment signal "ra" by the second stage 120 to the first stage 110.
- the signal ra either requests data to be sent or acknowledges reception of data to the first stage 110.
- the processing stages 110 and 120 are not clocked or synchronized to a control clock signal.
- The first stage 110 includes a register part "R_A", 112, and a control part "C_A", 114.
- The register part 112 stores data to be sent to the second stage 120.
- the control part 114 generates an internal control parameter "x" 116 to the data channels 130, e.g., triggering sending data or resetting the data channels.
- the control part 114 also controls data processing in the first stage 110 which generates the data to be sent to the second stage 120.
- The second stage 120 includes a register part 122 that stores received data from the register part 112, a control part "C_B", 124, that generates the request/acknowledgment signal ra over the channel 140 and controls data processing in the second stage, and a completion tree 126 that connects the register part 122 and the control part 124.
- the completion tree 126 is a circuit that checks the status of the register part 122 and determines whether the processing of the second stage 120 on the received data from the first stage 110 is completed.
- An internal control parameter "y" 128 is generated by the completion tree 126 to control the operation of the control part 124.
- One possible four-phase handshake protocol is as follows. When the completion tree 126 detects that the second stage 120 has completed processing of the received data and is ready to receive the next data from the first stage 110, a request signal is generated by the control part 124 in response to a value of the control parameter y (128) and is sent to the control part 114 via the channel 140 to inform the first stage 110 that the stage 120 is ready to receive the next data. This is the "request" phase.
- Next, in a data transmission phase, the first stage 110 responds to the request by sending out the next data to the second stage 120 via the data channels 130. More specifically, the control part 114 processes the request from the control part 124 and instructs the register part 112, by using the control parameter x (116), to send the next data.
- An acknowledgment phase follows. Upon completion of receiving the data from the first stage 110, the completion tree 126 changes the value of the control parameter y (128) so that the control part 124 produces an acknowledgment signal via the channel 140 to inform the first stage 110 (i.e., the control part 114) of completion of the data transmission.
- The control part 114 then changes the value of the control parameter x (116), which instructs the register part 112 to stop data transmission. This action resets the data channels 130 to a "neutral" state so that the next data can be transmitted when desired.
- Finally, the completion tree 126 resets the value of the control parameter y so that the control part 124 can produce another request. This completes an operation cycle of request, data transmission, acknowledgment, and reset.
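The cycle just described can be traced as a small behavioral sketch. This is a Python model for illustration only, not a circuit description; the signal names ra, x, D, and y follow FIG. 1, and the sequential trace is an assumption about one possible ordering of the handshake events.

```python
# Hypothetical trace of one four-phase cycle between stages 110 (A) and
# 120 (B): request (ra up), transmission (x up, D valid, y up),
# acknowledgment (ra down), and reset (x down, D neutral, y down).

def four_phase_cycle(data):
    """Trace one request/transmit/ack/reset cycle; returns (trace, data)."""
    trace = []
    ra = 1; trace.append("ra+")       # request phase: B's control raises ra
    x = 1; trace.append("x+")         # A's control part raises x
    D = data; trace.append("D valid")  # register 112 drives the data channels
    y = 1; trace.append("y+")         # completion tree 126 sees all bits
    ra = 0; trace.append("ra-")       # acknowledgment phase
    x = 0; trace.append("x-")         # A stops driving the channels
    D = None; trace.append("D neutral")  # channels return to neutral
    y = 0; trace.append("y-")         # completion tree resets y
    return trace, data

trace, received = four_phase_cycle(0b1011)
print(trace)
```

The first four events form the set phase and the last four the reset phase, matching the handshake expansion discussed below.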
- Each processing component or stage operates as fast as possible to complete a respective processing step and then proceeds to start the next processing step.
- Such asynchronous pipelined operation can achieve a processing speed, on average, higher than that of a synchronous operation.
- A delay-insensitive code is characterized by the fact that the data rails alternate between a neutral state that does not represent a valid encoding of a data value, and a valid state that represents a valid encoding of a data value. See Alain J. Martin, "Asynchronous Datapaths and the Design of an Asynchronous Adder," in Formal Methods in System Design, 1:1, Kluwer, 117-137, 1992.
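One common delay-insensitive code with exactly this neutral/valid alternation is the dual-rail code; the patent text does not fix a particular code, so the encoding below is an assumption for illustration. Each bit uses a "true" rail and a "false" rail: (0,0) is neutral, (1,0) and (0,1) are the two valid states, and (1,1) never occurs.

```python
# Sketch of a dual-rail delay-insensitive code (an assumed encoding).

NEUTRAL = (0, 0)

def encode(bit):
    """Valid state: exactly one of the two rails is raised."""
    return (1, 0) if bit else (0, 1)

def is_valid(rails):
    t, f = rails
    return t + f == 1          # exactly one rail high

def is_neutral(rails):
    return rails == NEUTRAL

word = [encode(b) for b in (1, 0, 1, 1)]
assert all(is_valid(r) for r in word)   # data phase: every bit valid
# between data items, every rail pair returns to NEUTRAL
```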
- the above four-phase protocol can be broken down into a set phase and a rest phase.
- The set phase includes the sequence of transitions performed in the request phase and the transmission phase (assuming that all wires are initially set low): ra↑; x↑; D↑; y↑.
- Each transition is a process where a signal (e.g., ra, x, D, or y) changes its value.
- The reset phase includes the sequence of transitions in the acknowledgment phase and the final reset phase: ra↓; x↓; D↓; y↓.
- This notation is known as a handshake expansion (HSE).
- The false value, y↓, of the completion signal y represents completion of processing and instructs the control part 124 to send out a request.
- The true value, y↑, represents completion of receiving data and instructs the control part 124 to send out an acknowledgment.
- The architecture of the completion tree 126 and the generation of the completion signals, y↑ and y↓, are now described in detail.
- A write-acknowledgment signal wack_k is raised when the corresponding bit b_k is received. When all write-acknowledgment signals are raised, y can be raised to produce the completion signal y↑. Similarly, wack_k is lowered when the corresponding bit b_k is reset to its neutral value according to a chosen delay-insensitive protocol. Hence, y can be reset to zero when all write-acknowledgment signals are reset to zero (the neutral value). This can be expressed as follows: y↑ when every wack_k = 1, and y↓ when every wack_k = 0.
- The completion tree 126 is constructed and configured to perform the above logic operations to generate the proper completion signals (i.e., either y↑ or y↓).
- The completion tree uses a tree of two-input C-elements as shown in FIG. 2.
- The two-input C-element, also known as a Muller C-element, is a logic gate whose output goes high or low only when both inputs are high or low, respectively; the output remains unchanged from its previous value when the inputs differ from each other.
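The Muller C-element's behavior is simple enough to capture in a few lines. The following is a behavioral model for illustration, not a gate-level description:

```python
# Behavioral model of a two-input Muller C-element: the output copies
# the inputs when they agree and otherwise holds its previous value.

class CElement:
    def __init__(self, initial=0):
        self.out = initial

    def step(self, a, b):
        if a == b:            # inputs agree: output follows them
            self.out = a
        return self.out       # inputs differ: output holds

c = CElement()
assert c.step(1, 1) == 1      # both high -> output goes high
assert c.step(1, 0) == 1      # inputs differ -> output holds high
assert c.step(0, 0) == 0      # both low -> output goes low
```

This state-holding behavior is exactly what makes a tree of C-elements act as a completion detector: the root rises only after every leaf has risen, and falls only after every leaf has fallen.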
- the number of C-elements in FIG. 2 may be reduced by using C-elements of more than two inputs, such as three or even four inputs.
- the existing VLSI technology limits the number of inputs in such C-elements since as the number of p-transistors connected in series to form the C- elements increases, the performance of the C-elements is usually degraded.
- the number of the inputs of a C-element may be up to 4 with acceptable performance.
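A completion tree built from such limited-fan-in C-elements can be sketched as follows. The gate model and single-pass evaluation over stable inputs are simplifications for illustration; the fan-in limit of 4 follows the text above.

```python
# Sketch of a completion tree: N write-acknowledge signals merged by a
# tree of C-elements with fan-in capped at 4 (larger fan-in degrades
# the series p-transistor stack, per the text). Gates start low, so one
# pass over stable inputs suffices for this illustration.

class CGate:
    """C-element generalized to k inputs: output moves only when all agree."""
    def __init__(self):
        self.out = 0

    def step(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out

def completion_tree(signals, fanin=4):
    """Build and evaluate a tree over stable inputs; return (y, depth)."""
    level = list(signals)
    depth = 0
    while len(level) > 1:
        level = [CGate().step(level[i:i + fanin])
                 for i in range(0, len(level), fanin)]
        depth += 1
    return level[0], depth

y, depth = completion_tree([1] * 32)
assert y == 1 and depth == 3        # 32 -> 8 -> 2 -> 1 with fan-in 4
y, _ = completion_tree([1] * 31 + [0])
assert y == 0                       # output withheld until every input rises
```

With fan-in 2 the same 32-bit tree needs 5 levels, which is the depth assumed in the delay discussion below.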
- the throughput of an asynchronous system is determined by the delay through the longest cycle of transitions. Such a cycle is called a "critical cycle.”
- A delay t_C through the sequence C is a good estimated lower bound for the critical cycle delay.
- The target throughput in the 0.6-μm CMOS technology is around 300 MHz.
- The critical cycle delay is thus about 3 ns.
- The completion tree delay is around 1 ns.
- one third of the critical cycle delay is caused by the completion tree. This is a significant portion of the critical delay.
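The quoted figures can be checked with back-of-envelope arithmetic. The per-level gate delay of 0.2 ns below is an assumed value, not from the patent; the point is that a log-depth tree over a 32-bit datapath consumes a large fixed fraction of a ~3 ns cycle.

```python
import math

# Rough check of the numbers above (per-level delay is assumed).
target_mhz = 300
cycle_ns = 1e3 / target_mhz              # ~3.3 ns cycle budget at 300 MHz
depth = math.ceil(math.log2(32))         # 5 levels of 2-input C-elements
tree_ns = depth * 0.2                    # assumed 0.2 ns/level -> ~1 ns

print(round(cycle_ns, 2), depth, tree_ns)
assert tree_ns / cycle_ns > 0.25         # roughly a third of the cycle
```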
- FIGs. 3A and 3B show an example of breaking a long critical cycle between two pipelined stages A and B into two short cycles by pipelining A and B through a buffer.
- FIG. 3A shows two components 310 (A) and 320 (B) that communicate with each other through two simple handshake channels 312 (a) and 322 (b).
- The protocol may include the following sequence of transitions: A↑; a↑; B↑; b↑; A↓; a↓; B↓; b↓, where A↑, B↑, A↓, B↓ represent the transitions inside A and B.
- A simple buffer 330 can be introduced to form an asynchronous pipeline between A and B, as in FIG. 3B, to reduce this long cycle into two short cycles.
- the two handshakes are synchronized by the buffer, not by a clock signal.
- the buffer can be implemented in various ways.
- FIG. 3C shows one simple implementation that uses a single C-element 340 with two inputs a1, b2 and two outputs a2, b1.
- The C-element 340 receives the input a1 and an inverted version of b2 to produce two duplicated outputs a2, b1.
- the two handshakes are synchronized in the following way:
- This particular buffer allows the downgoing phase of A to overlap with the upgoing phase of B and the upgoing phase of A to overlap with the downgoing phase of B. Such overlap reduces the duration of the handshaking process.
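The overlap property of this buffer can be seen in a small behavioral model. This is an illustrative Python sketch of the FIG. 3C element (names follow the figure), not a circuit netlist:

```python
# Decoupling buffer of FIG. 3C: a two-input C-element whose second
# input (b2) arrives inverted, with its single output fanned out as
# both a2 (acknowledgment back to A) and b1 (request forward to B).

class CElement:
    def __init__(self):
        self.out = 0

    def step(self, a, b):
        if a == b:
            self.out = a
        return self.out

buf = CElement()

def buffer_step(a1, b2):
    out = buf.step(a1, 1 - b2)    # b2 enters through an inverter
    return out, out               # duplicated as (a2, b1)

a2, b1 = buffer_step(a1=1, b2=0)  # A raises a1 while B is still idle
assert (a2, b1) == (1, 1)         # A is acknowledged as B's phase begins
a2, b1 = buffer_step(a1=1, b2=1)  # B answers with b2: inputs now differ
assert (a2, b1) == (1, 1)         # output holds
a2, b1 = buffer_step(a1=0, b2=1)  # A's downgoing phase
assert (a2, b1) == (0, 0)         # overlaps with B's upgoing phase
```

Because the acknowledgment to A and the request to B are the same node, the two handshakes proceed concurrently rather than strictly in sequence.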
- adding additional stages in an asynchronous pipeline may not necessarily increase the forward latency of the pipeline and may possibly reduce the forward latency.
- The above technique of decomposing a long cycle into two or more pipelined short cycles can reduce the delay along the datapath of a pipeline. However, this does not address another delay caused by distribution of a signal to all N bits in an N-bit datapath, e.g., controlling bits in a 32-bit register that sends out data (e.g., the register 112 in the stage 110).
- Such delay can also be significant, especially when N is large (e.g., 32, 64, or even 128).
- The N-bit datapath can be decomposed into m small datapaths of n bits each. These m small datapaths are connected in parallel to one another and can transmit data simultaneously relative to one another.
- the N-bit register of a stage in the N-bit datapath can also be replaced by m small registers of n bits.
- the number m and thereby n are determined by the processing tasks of the two communicating stages.
- A 32-bit datapath, for example, can be decomposed into four 8-bit blocks, eight 4-bit blocks, sixteen 2-bit blocks, or even thirty-two 1-bit blocks to achieve a desired performance.
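The slicing itself is lossless, which is what makes the decomposition transparent to the rest of the pipeline. A small Python illustration (a software model, not hardware):

```python
# Decompose a 32-bit word into parallel n-bit blocks and recombine it.

def decompose(word, n_bits, width=32):
    """Split `word` into width//n_bits little-endian n-bit slices."""
    mask = (1 << n_bits) - 1
    return [(word >> (i * n_bits)) & mask for i in range(width // n_bits)]

def recompose(slices, n_bits):
    out = 0
    for i, s in enumerate(slices):
        out |= s << (i * n_bits)
    return out

word = 0xDEADBEEF
for n in (8, 4, 2, 1):                    # four 8-bit, eight 4-bit, ... blocks
    slices = decompose(word, n)
    assert len(slices) == 32 // n
    assert recompose(slices, n) == word   # no information lost
```

Each slice would be handled by its own register and completion logic, so no control signal ever has to fan out to all 32 bits at once.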
- Decomposition of a long cycle into two or more small cycles can be applied in two directions: one along the pipelined stages, by adding decoupling buffers between them, and another "orthogonal" direction, by decomposing a single datapath into two or more small datapaths that are connected in parallel.
- FIG. 4 shows a 32-bit asynchronous pipeline with a pipelined completion tree based on the above two-dimensional decomposition.
- Four 8-bit registers 401A, 401B, 401C, 401D in the sending stage 110 are connected with respect to one another in parallel.
- four 8-bit registers 402A, 402B, 402C, 402D in the receiving stage 120 that respectively correspond to the registers in the sending stage 110 are also connected with respect to one another in parallel. This forms four parallel 8-bit datapaths.
- Decomposition along the datapaths is accomplished by using the decoupling buffer shown in FIGs. 3B and 3C.
- the control part 114 responds to this signal 411 to control the registers 401A, 401B, 401C, and 401D to send the next data.
- At least two decoupling buffers, such as 412A and 422A, are introduced in each datapath, with one in the sending stage 110 and another in the receiving stage 120.
- The buffer 412A, for example, is disposed on wires (ct1, ra1) to interconnect the control part 114, the completion tree 410, the register 401A, and the request/acknowledgment signal for the first datapath.
- The buffer 422A is disposed on wires (x1, ra1) to interconnect the first completion tree 403A, the control part 124, the completion tree 420, and the completion tree 410.
- the completion trees 403A, 403B, 403C, and 403D are pipelined to the completion tree 420 via buffers 422A, 422B, 422C, and 422D, respectively.
- the completion trees in the stage 110 are also pipelined through buffers 412A, 412B, 412C, and 412D.
- Such pipelined completion significantly reduces the delay in generating the completion signal for the respective control part.
- the above decoupling technique can be repeated until all completion trees have a delay below an acceptable level to achieve a desired throughput.
- buffers 414 and 424 may be optionally added on wires (ra, x) and (ra, y) to decouple the control parts 114 and 124, respectively.
- Because decoupling buffers may increase the latency of an asynchronous pipeline, a proper balance between the latency requirement and the throughput requirement should be maintained when introducing such buffers.
- a stage in an asynchronous circuit usually performs both sending and receiving.
- One simple example is a one-place buffer having a register, an input port L, and an output port R. This buffer repeatedly receives data on the port L and sends the data on the port R.
- the register that holds the data is repeatedly written and read.
- The completion mechanism for the control 114 in the sending stage 110 and the completion mechanism for the control 124 in the receiving stage 120 are similar in circuit construction and function. Since data is almost never read and written simultaneously, such similarity can be advantageously exploited to share a portion of the pipelined completion mechanism between sending data and receiving data within a stage. This simplifies the circuit and reduces the circuit size. In particular, distributing the control signals from the control part in each stage to data cells and merging the signals from all data cells to the control part can be implemented by sharing many circuit elements. In FIG. 4, a portion of the circuit, a "copy tree", is used in both stages. This copy tree is shown in FIG. 5. The copy tree includes two pipelined circuits: a pipelined completion tree circuit for sending a completion signal, based on completion signals from data cells, to the global control part in each stage, and a pipelined distribution circuit for sending control signals from the global control part to data cells.
- FIG. 6 shows one embodiment of a copy tree for a stage that has k data cells.
- This copy tree is used for both distributing k control signals from the control part (e.g., 114 in FIG. 4) to all data cells and merging k signals from all data cells to the control part.
- The signals r_i (1 ≤ i ≤ k) are signals going to the data cells as requests to receive or send.
- The completion signal ct_i comes from data cell i as a request/acknowledgment signal.
- the copy tree shown in FIG. 6 is only an example. Other configurations are possible.
- a program specification of a copy tree for both sending and receiving is as follows:
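As a stand-in for the program text, the behavior of such a copy tree can be sketched in Python rather than in a hardware description language; the cell interface below (a callable per data cell) is purely illustrative:

```python
# Behavioral sketch of a copy tree for k data cells: broadcast one
# control value down to every cell (the r_i requests), then merge the
# k per-cell request/acknowledgment signals (ct_i) into one.

def copy_tree(control_value, data_cells):
    """data_cells: one callable per cell; returns the merged completion."""
    acks = [cell(control_value) for cell in data_cells]  # distribute r_i
    return all(acks)                                     # merge ct_i

k = 4
cells = [lambda v: v == "send" for _ in range(k)]
assert copy_tree("send", cells) is True      # every cell completed
cells[2] = lambda v: False                   # one cell not yet done
assert copy_tree("send", cells) is False     # merged completion withheld
```

In hardware, the distribute and merge halves would be the pipelined distribution circuit and the pipelined completion tree described above, sharing much of their structure.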
- Each data cell i contains a control part that communicates with a respective copy tree through the channel L.
- the copy tree and the control for each data cell may be eliminated.
- Because each small datapath handles only a small number of bits of the N bits, the data processing logic and the control can be integrated together to form a single processing block without a separate control part and register.
- the registers in each stage shown in FIG. 4 can be eliminated. Therefore, the global control part in each stage is distributed into the multiple processing blocks in the small datapaths. Without the register, the data in each processing block can be stored in a buffer circuit incorporated in the processing block.
- Such implementation can usually be accomplished based on reshuffling of half buffer, precharged half buffer, and precharged full buffer disclosed in U.S. Application No.
- FIG. 7A shows one embodiment of an asynchronous circuit by implementing multiple processing blocks.
- the datapaths between different stages in an N-bit asynchronous pipeline may have different datapath structures to reduce the overall delay. The difference in the datapaths depends on the nature and complexity of these different stages.
- One part of the N-bit pipeline, for example, may have a single N-bit datapath while another part may have m n-bit datapaths.
- FIG. 7B shows three different datapath structures implemented in four pipelined stages.
- FIG. 7C shows another example of decomposing a long cycle into small cycles based on the circuit in FIG. 7A.
- The pipelined stage A can be decomposed into two pipelined stages A1 and A2.
- Each processing block of the stages A1 and A2 is simplified compared to the processing block in the original stage A.
- Each stage, A1 or A2, performs a portion of the processing task of the original stage A.
- When A1 and A2 are properly constructed, the average throughput of the stages A1 and A2 is higher than that of the original stage A.
- Decomposition of an N-bit datapath into multiple small datapaths shown in FIG. 7A allows each small datapath to process and transmit a portion of the data.
- For example, the first small datapath handles bits 1 through 8, the second small datapath handles bits 9 through 16, and so on.
- When each small datapath operates independently in this manner, synchronization of different small datapaths and a global completion mechanism are not needed. This rarely occurs in most practical asynchronous processors, except in some local processing or pure buffering of data.
- the pipelined stages are often part of a logic unit (e.g., a fetch unit or a decode unit) .
- Each processing block in stage k+1 usually needs to read some information from two or more different processing blocks in the stage k.
- the decomposed small datapaths need to be synchronized relative to one another.
- In FIG. 8, a control circuit is introduced between the stage k and stage k+1 to gather global information from each processing block of stage k and compute appropriate control signals to control the related processing blocks in stage k+1.
- Decomposed datapaths are not shown in FIG. 8.
- For example, suppose the stage k compares two 32-bit numbers A and B and the operations of the stage k+1 depend on the comparison result.
- the control circuit produces a control signal indicating the difference (A-B) based on the signals from the decomposed datapaths in the stage k. This control signal is then distributed to all decomposed blocks in the stage k+1.
- One aspect of the control circuit is to synchronize the operations of the two stages k and k+1. Similar to the connections between the control part 114 and the data cells in the stage 110 of FIG. 4, a copy tree can be used to connect the control circuit to each of the stages k and k+1.
- the copy trees are preferably implemented as pipelined completion circuits. For example, each processing block in the stage k is connected to a block completion tree for that block. The block completion tree is then pipelined to a global completion tree via a decoupling buffer. The output of the global completion tree is then connected to the control circuit. This forms the pipelined completion tree in the copy tree that connects the stage k to the control circuit.
- the single control wire of a basic completion tree needs to be replaced with a set of wires encoding the different values of the control signal.
- The copy tree shown in FIG. 6 can be extended to the case of a two-valued signal encoded by wires r and s.
- The control circuit in FIG. 8 can introduce an extra delay between the stages k and k+1, in particular since the pipelined completion tree usually has a plurality of decoupling buffers. This delay can form a bottleneck for the speed of the pipeline. Therefore, it may be necessary in certain applications to add buffers in a datapath between the stages k and k+1 in order to substantially equalize the length of different channels between the two stages. This technique is called "slack matching".
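The buffer-count calculation behind slack matching is simple to state. The sketch below is illustrative only; channel names and buffer counts are assumed, and it equalizes channel depths by padding shorter channels up to the longest one:

```python
# Minimal slack-matching sketch: given the number of pipeline buffers
# already on each channel between stage k and stage k+1, compute how
# many decoupling buffers to add so every channel has equal depth.

def slack_match(channel_depths):
    """Return buffers to add per channel so all channels match the deepest."""
    target = max(channel_depths.values())
    return {name: target - d for name, d in channel_depths.items()}

# Assumed example: the direct datapath has 1 buffer, while the path
# through the control circuit's pipelined completion tree has 3.
extra = slack_match({"datapath": 1, "control_path": 3})
assert extra == {"datapath": 2, "control_path": 0}
```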
- FIG. 9A shows a balanced binary tree.
- A tree used in the present invention need not be balanced or binary.
- For example, a binary tree can be skewed as shown in FIG. 9B.
- FIG. 9C shows a 4-leaf skewed completion tree, and FIG. 9D shows a 4-leaf balanced completion tree.
- the above embodiments provide a high throughput and a low latency by decomposing any pipeline unit into an array of simple pipeline blocks. Each block operates only on a small portion of the datapath. The global completion delay is essentially eliminated. Global synchronization between stages is implemented by copy trees and slack matching.
- control circuit in FIG. 8 may be connected between any two stages other than two adjacent stages as shown.
- the number of decoupling buffers between two stages can be varied.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Communication Control (AREA)
- Information Transfer Systems (AREA)
- Multi Processors (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU94851/98A AU9485198A (en) | 1997-09-12 | 1998-09-11 | Pipelined completion for asynchronous communication |
GB0006105A GB2345168B (en) | 1997-09-12 | 1998-09-11 | Pipelined completion for asynchronous communication |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US5866297P | 1997-09-12 | 1997-09-12 | |
US60/058,662 | 1997-09-12 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO1999013610A2 true WO1999013610A2 (en) | 1999-03-18 |
WO1999013610A3 WO1999013610A3 (en) | 1999-06-17 |
Family
ID=22018157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1998/019192 WO1999013610A2 (en) | 1997-09-12 | 1998-09-11 | Pipelined completion for asynchronous communication |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU9485198A (en) |
GB (3) | GB2345168B (en) |
WO (1) | WO1999013610A2 (en) |
-
1998
- 1998-09-11 AU AU94851/98A patent/AU9485198A/en not_active Abandoned
- 1998-09-11 GB GB0006105A patent/GB2345168B/en not_active Expired - Fee Related
- 1998-09-11 WO PCT/US1998/019192 patent/WO1999013610A2/en active Application Filing
-
2002
- 2002-11-08 GB GBGB0226015.6A patent/GB0226015D0/en not_active Ceased
- 2002-11-08 GB GBGB0226044.6A patent/GB0226044D0/en not_active Ceased
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3290511A (en) * | 1960-08-19 | 1966-12-06 | Sperry Rand Corp | High speed asynchronous computer |
Non-Patent Citations (7)
Title |
---|
BURNS et al., "Syntax-Directed Translation of Concurrent Programs into Self-Time Circuits", In: ADVANCED RESEARCH IN VLSI, PROCEEDINGS OF THE 5TH MIT CONFERENCE, MIT PRESS, 1988, p. 35-50, XP002920254 * |
BURNS et al., "Synthesis of Self-Timed Circuits by Program Transformation", In: THE FUSION OF HARDWARE DESIGN AND VERIFICATION, Edited by G.J. MILNE, North Holland, 1988, p. 1-18, XP002920255 * |
CHO et al., "Design of a 32-bit Fully Asynchronous Microprocessor (FAM)", In: PROCEEDINGS OF THE 35TH MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, IEEE, August 1992, Vol. 2, p. 1500-1503, XP002920249 * |
FURBER et al., "Dynamic Logic in Four-Phase Micropipelines", In: PROCEEDINGS, 2ND INTERNATIONAL SYMPOSIUM ON ADVANCED RESEARCH IN ASYNCHRONOUS CIRCUITS AND SYSTEMS, IEEE, March 1996, p. 11-16, XP002920250 * |
FURBER et al., "Four-Phase Micropipeline Latch Control Circuits", In: IEEE TRANSACTIONS ON VLSI SYSTEMS, IEEE, June 1996, Vol. 4, No. 2, p. 247-253, XP002920252 * |
KEARNEY et al., "Performance Evaluation of Asynchronous Logic Pipelines with Data Dependent Processing Delays", In: PROCEEDINGS, 2ND WORKING CONFERENCE ON ASYNCHRONOUS DESIGN METHODOLOGIES, IEEE, May 1995, p. 4-13, XP002920253 * |
MARTIN, "Asynchronous Datapaths and the Design of an Asynchronous Adder", In: DEPARTMENT OF COMPUTER SCIENCE, CALIFORNIA INSTITUTE OF TECHNOLOGY, Pasadena, CA, USA, June 1991, p. 1-24, XP002920251 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2365583A (en) * | 2000-02-18 | 2002-02-20 | Hewlett Packard Co | Pipeline decoupling buffer for handling early data and late data |
US6629167B1 (en) | 2000-02-18 | 2003-09-30 | Hewlett-Packard Development Company, L.P. | Pipeline decoupling buffer for handling early data and late data |
GB2365583B (en) * | 2000-02-18 | 2004-08-04 | Hewlett Packard Co | Pipeline decoupling buffer for handling early data and late data |
Also Published As
Publication number | Publication date |
---|---|
GB0006105D0 (en) | 2000-05-03 |
GB0226044D0 (en) | 2002-12-18 |
AU9485198A (en) | 1999-03-29 |
GB2345168A (en) | 2000-06-28 |
GB0226015D0 (en) | 2002-12-18 |
WO1999013610A3 (en) | 1999-06-17 |
GB2345168B (en) | 2002-12-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6038656A (en) | Pipelined completion for asynchronous communication | |
US6502180B1 (en) | Asynchronous circuits with pipelined completion process | |
CA1325286C (en) | Method and apparatus for interfacing a system control unit for a multi-processor system with input/output units | |
Scott et al. | The impact of pipelined channels on k-ary n-cube networks | |
US7882278B2 (en) | Utilizing programmable channels for allocation of buffer space and transaction control in data communications | |
US6167502A (en) | Method and apparatus for manifold array processing | |
US7647435B2 (en) | Data communication method and apparatus utilizing credit-based data transfer protocol and credit loss detection mechanism | |
JP3869726B2 (en) | High capacity asynchronous pipeline processing circuit and method | |
US5386585A (en) | Self-timed data pipeline apparatus using asynchronous stages having toggle flip-flops | |
JP2019079526A (en) | Synchronization in multi-tile, multi-chip processing arrangement | |
McAuley | Four state asynchronous architectures | |
US7249207B2 (en) | Internal data bus interconnection mechanism utilizing central interconnection module converting data in different alignment domains | |
US8669779B2 (en) | Systems, pipeline stages, and computer readable media for advanced asynchronous pipeline circuits | |
WO1994017488A1 (en) | Multipipeline multiprocessor system | |
JP2006518058A (en) | Pipeline accelerator, related system and method for improved computing architecture | |
US8106683B2 (en) | One phase logic | |
US5999961A (en) | Parallel prefix operations in asynchronous processors | |
US20060174050A1 (en) | Internal data bus interconnection mechanism utilizing shared buffers supporting communication among multiple functional components of an integrated circuit chip | |
Kol et al. | A doubly-latched asynchronous pipeline | |
EP1121631B1 (en) | Reshuffled communications processes in pipelined asynchronous circuits | |
JP2003337807A (en) | High speed operation method and system for cross bar | |
WO1999013610A2 (en) | Pipelined completion for asynchronous communication | |
Hoare et al. | Bitwise aggregate networks | |
KR100947446B1 (en) | Vliw processor | |
US7698535B2 (en) | Asynchronous multiple-order issue system architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
NENP | Non-entry into the national phase in: |
Ref country code: KR |
|
ENP | Entry into the national phase in: |
Ref country code: GB Ref document number: 200006105 Kind code of ref document: A Format of ref document f/p: F |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
NENP | Non-entry into the national phase in: |
Ref country code: CA |
|
122 | Ep: pct application non-entry in european phase |