WO1997007451A2  Method and system for implementing data manipulation operations  Google Patents
 Publication number: WO1997007451A2 (application PCT/US1996/013195)
 Authority: WO
 Grant status: Application
 Prior art keywords: control, bit, stage, data, system
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRICAL DIGITAL DATA PROCESSING
 G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F7/76—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
 G06F7/762—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data having at least two separately controlled rearrangement levels, e.g. multistage interconnection networks

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRICAL DIGITAL DATA PROCESSING
 G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F7/76—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRICAL DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for programme control, e.g. control unit
 G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30003—Arrangements for executing specific machine instructions
 G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
 G06F9/30018—Bit or string instructions; instructions using a mask

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRICAL DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for programme control, e.g. control unit
 G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30003—Arrangements for executing specific machine instructions
 G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
 G06F9/30025—Format conversion instructions, e.g. FloatingPoint to Integer, decimal conversion

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRICAL DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for programme control, e.g. control unit
 G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30003—Arrangements for executing specific machine instructions
 G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
 G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
Abstract
Description
METHOD AND SYSTEM FOR IMPLEMENTING DATA MANIPULATION OPERATIONS
FIELD OF THE INVENTION
The present invention relates to the field of bit and byte permutations, and particularly to bit and byte permutations performed in order to carry out operations in a digital data processing system such as a digital computer system.
TERMINOLOGY
This section defines several terms which are used in the rest of this document. The term "crossbar" refers to an operation which, in general, takes as input some number of values, a, and produces as output some number of values, b, where each of the output values may take its value from any one of the input values. Each output value is completely independent of the other output values. A crossbar therefore functions as a general switching mechanism. It is very common for the number of input and output values to be the same, i.e., a = b. In this case, it is sometimes referred to as an a-way, or a-wide, crossbar. The term crossbar is also used in a physical sense, in which case it refers to a switch which is capable of performing a crossbar operation.
The term "multiplexer" is very similar to the term "crossbar". In its most basic sense, a multiplexer is like a crossbar which produces a single output rather than several output values. In that sense, a crossbar may be constructed from several multiplexers, each of which takes the same input. In some cases, the term multiplex may implicitly refer to a set of multiplex operations which are independently applied to produce several output values. In this use it is synonymous with the term crossbar. It is usually clear from context whether the term multiplexer, or the term multiplex operation, is being used in the sense of a single output value vs. multiple output values. The term multiplex may be used in either a physical or an operational sense.
The term "perfect shuffle", or just "shuffle", refers to the operation of effectively splitting a sequence of input values into several piles, then shuffling the piles together by perfectly alternating between the piles in a cyclic manner. In general, an a-way perfect shuffle indicates that a piles are involved. If this number is unspecified in a reference to a shuffle operation, it may refer to a two-way shuffle, or it may refer to a multi-way shuffle with an unspecified number of piles, depending on the context. It is generally assumed that the total number of elements is a multiple of the number of piles involved in the shuffle, so that each pile has the same size. Perfect shuffles are discussed in more detail later. The term perfect shuffle may be used in either a physical or an operational sense.
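The shuffle just defined can be sketched in a few lines of Python. This is only an illustration: the function name `perfect_shuffle` is hypothetical, and the sketch assumes that the piles are consecutive runs of the input and that the first pile supplies the first output element, a convention this definition does not spell out.

```python
def perfect_shuffle(seq, a):
    """a-way perfect shuffle of a list (illustrative sketch).

    Assumption: the piles are consecutive runs of seq, and the output
    alternates cyclically between piles starting with the first pile.
    """
    n = len(seq)
    if n % a != 0:
        raise ValueError("number of elements must be a multiple of the pile count")
    b = n // a  # size of each pile
    piles = [seq[p * b:(p + 1) * b] for p in range(a)]
    # Take one element from each pile in turn, cycling through the piles.
    return [piles[p][q] for q in range(b) for p in range(a)]


two_way = perfect_shuffle(list(range(8)), 2)    # [0, 4, 1, 5, 2, 6, 3, 7]
three_way = perfect_shuffle(list(range(6)), 3)  # [0, 2, 4, 1, 3, 5]
```

With two piles this is the familiar interleaving of the two halves of a deck.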
The term "perfect deal", or just "deal", refers to the operation of effectively dealing out a sequence of input values into several piles in a cyclic manner, then stacking the piles together. The dealing is done in a way which preserves the relative order of elements within each pile (as opposed to reversing them), so in a sense it is like dealing from the bottom of the "deck". In general, an a-way perfect deal indicates that a piles are involved. If this number is unspecified in a reference to a deal operation, it may refer to a two-way deal, or it may refer to a multi-way deal with an unspecified number of piles, depending on the context. It is generally assumed that the total number of elements is a multiple of the number of piles involved in the deal, so that each pile has the same size. Perfect deals are discussed in more detail later. The term perfect deal may be used in either a physical or an operational sense.
Perfect shuffles and perfect deals are closely related. Perfect shuffles and perfect deals which use the same number of "piles" are inverses of each other. Furthermore, a multi-way perfect shuffle is always equivalent to some multi-way perfect deal. In particular, if the number of elements is ab, then an a-way perfect shuffle is equivalent to a b-way perfect deal.
BACKGROUND OF THE INVENTION
Digital data processing systems, and particularly digital computer systems, are generally able to perform some set of operations on words of data. In some systems the set of possible operations may be quite small, whereas in other systems the set may be fairly large. It is convenient to group data processing operations into classes which are related in function and/or use. For example, floating point arithmetic (addition, subtraction, multiplication, etc.) often forms such a class in systems which support such operations. Similarly, integer arithmetic may form a class, logical operations may form a class, and memory operations (loads and stores) may form a class. Many systems also support a class of shift/rotate operations.
The shift/rotate class of operations may include operations which shift data left and right (possibly sign extending or zero filling), operations which rotate data left or right, or shift/merge operations which first shift a data field, then merge the result with another operand. This class of operations differs from the arithmetic classes in that it is primarily permuting and copying data rather than generating fundamentally new values from old values. Some systems also include operations for reversing the bits in a word. Other permutation and copying operations, which can't be easily expressed as a simple sequence of shift, rotate, or shift/merge operations, are typically performed by utilizing lookup tables which are stored in memory. For example, to perform any fixed operation in which all data bits in the result are derived from specific bits in the source operand, one can first break the source operand into several smaller fields (which serves to reduce the number of table entries required). For each such field there is a corresponding table which is indexed by the value of that field. In general, the width of each table entry is the size of the final combined result. Each entry in the table contains zeros for all result bits not derived from values in the field used to index the table, and the corresponding values of the index for all result bits which are derived from values in that field. The final result of the operation is formed by logically ORing the partial results from each of the tables.
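The table-driven technique described above can be sketched as follows. The word width, field width, and all names (`W`, `FIELD`, `build_tables`, `apply_tables`) are illustrative assumptions, as is the choice of bit reversal as the fixed operation; none of them come from the text.

```python
W = 16      # word width in bits (illustrative assumption)
FIELD = 4   # width of each index field in bits (illustrative assumption)


def build_tables(src_of, w=W, field=FIELD):
    """Build one lookup table per field of the source operand.

    src_of[j] is the source bit index that supplies result bit j.
    Each entry is w bits wide and is zero in every result bit that is
    not derived from the field used to index that table.
    """
    tables = []
    for t in range(w // field):
        lo = t * field  # lowest source bit covered by this field
        table = []
        for value in range(1 << field):
            entry = 0
            for j in range(w):
                s = src_of[j]
                if lo <= s < lo + field and (value >> (s - lo)) & 1:
                    entry |= 1 << j
            table.append(entry)
        tables.append(table)
    return tables


def apply_tables(tables, x, field=FIELD):
    """Extract each field, look up its partial result, and OR them together."""
    mask = (1 << field) - 1
    result = 0
    for t, table in enumerate(tables):
        result |= table[(x >> (t * field)) & mask]
    return result


# Example fixed operation: reverse the bits of a 16-bit word.
reverse = [W - 1 - j for j in range(W)]
tables = build_tables(reverse)
reversed_word = apply_tables(tables, 0x1234)
```

In this sketch a 16-bit word with 4-bit fields needs four tables of 16 entries each. Widening the fields to 8 bits would halve the number of lookups but raise the entry count per table to 256.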
Although such a method is clearly very general, it has several disadvantages. One disadvantage is that the tables themselves may occupy a significant amount of memory. Another disadvantage is that this method is usually fairly slow. In order to use it, each field in the source operand must first be extracted from the operand, then used as an index for a load from the corresponding table, and finally the partial results must be combined to form the final result. As the number of fields grows, the number of operations increases linearly. On the other hand, using larger fields results in exponential growth of the number of table entries, and therefore in the amount of memory required.
SUMMARY OF THE INVENTION
The present invention is a general method for arbitrarily permuting a sequence of elements, an extension to the general method which allows some extensions of permutations which involve the copying of individual elements, and a system based on the extended general method which implements a class of operations which can perform the primitive steps of the extended general method, as well as a much larger class of operations which generally involve the copying and/or permuting of data. In addition, the present invention includes several classes of instructions for performing operations which generally involve the copying and/or permuting of elements.
The general method can perform an arbitrary permutation of w elements, by breaking it down into an n-dimensional rectangle whose sides correspond to any set of factors of w, i.e., w = f_{1}f_{2}...f_{n}. In one embodiment, the elements to be permuted are bits. The method consists of a sequence of 2n - 1 sets of permutations across the various dimensions. This method is not restricted to values of w which are powers of two.
The extended general method is obtained by replacing each of the permutation steps of the general method with multiplexing steps. For example, when permuting across dimension i, each permutation of f_{i} elements is replaced with a full f_{i}-to-f_{i} crossbar, or equivalently, f_{i} independent f_{i}-to-1 multiplexers. These crossbar, or multiplex, operations can perform permutations as a special case. Therefore, the extended general method supports all of the permutations performed by the general method, and in addition allows certain types of copying to be performed.
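As a minimal sketch of the difference between a permutation step and a multiplexing step, a crossbar can be modeled as a list of independent select values, one per output. The name `crossbar` and the list-of-indices encoding are illustrative assumptions.

```python
def crossbar(inputs, select):
    """Each output independently takes its value from any one input.

    select[k] is the index of the input that drives output k.
    """
    return [inputs[s] for s in select]


data = ["w", "x", "y", "z"]

# A permutation is the special case in which every input is selected exactly once.
permuted = crossbar(data, [2, 0, 1, 3])   # ['y', 'w', 'x', 'z']

# Copying: some inputs are selected more than once, others not at all.
copied = crossbar(data, [0, 0, 3, 3])     # ['w', 'w', 'z', 'z']
```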
For a given embodiment of either the general or extended general method, the correspondence between the elements and the n-dimensional rectangle may take one of many forms. For example, the elements may in fact already be arranged in the shape of the n-dimensional rectangle, or alternatively in the shape of a lower-dimensional rectangle which results from some of the original dimensions being expanded to reduce the total number of dimensions. In another embodiment, the elements may exist purely as a one-dimensional sequence, with some specified correspondence to the rectangle (the most obvious choices are row-major and column-major, and simple mixtures of the two). In some embodiments, it may be just as easy to permute/multiplex across one dimension as across another. However, in other embodiments, it may be more difficult, or even impossible, to permute/multiplex across some dimensions. One way to avoid this problem is to restrict the set of dimensions over which the permute/multiplex operations need to occur. This can be achieved by reshaping (i.e., transposing) the data between successive permutation/multiplex steps. Using this technique, it is possible to restrict the set of dimensions over which the permutation/multiplex operations need to occur down to a single dimension.
Furthermore, in the case where the elements exist as a one-dimensional sequence, the permute/multiplex operations can be constrained so as to always operate on groups of consecutive elements. In a row-major or column-major representation, or a simple mixture of the two, a sufficient subset of the set of possible transposes can be achieved by performing multi-way perfect shuffle/deal operations. Assuming a row-major representation (i.e., the last dimension varies most rapidly), an f_{1} × f_{2} × ... × f_{n} row-major n-dimensional rectangle can be transposed into an f_{i+1} × ... × f_{n} × f_{1} × ... × f_{i} rectangle by performing an (f_{1}f_{2}...f_{i})-way perfect shuffle, which in this context is equivalent to an (f_{i+1}f_{i+2}...f_{n})-way perfect deal. In cases where the elements exist as a one-dimensional sequence, the addition of shuffle/deal operations can constrain the permute/multiplex operations to operate on groups of consecutive elements. Furthermore, the number of elements which must be accessible within a given group will never exceed the largest dimension in the n-dimensional rectangle, i.e., the maximum of the f values. Thus, in one embodiment of the present invention each of the 2n - 1 steps in the general method or extended general method can be implemented with a single operation by combining the shuffle/deal operation with the permute/multiplex operation, in either order (i.e., either shuffle before or after the permutation).
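The transpose claim above can be checked on a small case. The sketch below uses a 2 × 3 × 4 row-major rectangle with i = 1, so the shuffle is f_{1} = 2-way and the equivalent deal is f_{2}f_{3} = 12-way. The function names and the pile convention (consecutive piles, first pile first) are assumptions made for illustration.

```python
from itertools import product


def perfect_shuffle(seq, a):
    """a-way perfect shuffle: split into a consecutive piles, then interleave."""
    b = len(seq) // a
    return [seq[p * b + q] for q in range(b) for p in range(a)]


def perfect_deal(seq, a):
    """a-way perfect deal: deal cyclically into a piles, then stack the piles."""
    return [x for p in range(a) for x in seq[p::a]]


def row_major(dims):
    """All coordinate tuples of a rectangle, in row-major order."""
    return list(product(*(range(f) for f in dims)))


# Each element is labeled with its own original coordinates (x1, x2, x3).
flat = row_major((2, 3, 4))

# i = 1: an f1-way shuffle should yield the 3 x 4 x 2 rectangle, in which the
# element at new coordinates (x2, x3, x1) is the old element (x1, x2, x3).
shuffled = perfect_shuffle(flat, 2)
expected = [(x1, x2, x3) for (x2, x3, x1) in row_major((3, 4, 2))]

same_as_deal = perfect_deal(flat, 3 * 4) == shuffled
inverse_ok = perfect_deal(shuffled, 2) == flat
```

The last two lines also exercise the relationships stated in the terminology section: the a-way shuffle matches the b-way deal on ab elements, and the a-way deal undoes the a-way shuffle.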
The system of the present invention for implementing the general and extended general methods of the present invention consists of 2n - 1 sequential stages, which may easily be pipelined. Each stage performs its corresponding permute/multiplex steps as described above. Each stage is connected to the previous and next stages. A variation of this system implements a smaller number of stages, possibly only one stage, and cycles the elements through it multiple times (transposing between iterations) to achieve the equivalent of 2n - 1 stages. Of course, doing so may inhibit pipelining.
In one embodiment, the system is a two-dimensional implementation of the extended general method of the present invention. The data elements are bits, and the width of a data word is w bits. In this embodiment, the data is arranged as a two-dimensional a × b rectangle (a rows and b columns), where w = ab. There are three stages in the data path: stage 1 consists of a groups of b, b-to-1 data multiplexers which operate within a given row, stage 2 consists of b groups of a, a-to-1 data multiplexers which operate within a given column, and stage 3 consists of a groups of b, b-to-1 data multiplexers which operate within a given row.
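The three-stage data path can be modeled behaviorally as below. This is a sketch of the data flow only, not of the circuit: the function names and the representation of the select values as a × b arrays of indices are assumptions, and control generation and fill are left out.

```python
def row_stage(rect, sel):
    """Stages 1 and 3: output bit (r, c) selects any bit of row r."""
    return [[rect[r][sel[r][c]] for c in range(len(rect[0]))]
            for r in range(len(rect))]


def column_stage(rect, sel):
    """Stage 2: output bit (r, c) selects any bit of column c."""
    return [[rect[sel[r][c]][c] for c in range(len(rect[0]))]
            for r in range(len(rect))]


def three_stage(rect, sel1, sel2, sel3):
    """Row multiplexers, then column multiplexers, then row multiplexers."""
    return row_stage(column_stage(row_stage(rect, sel1), sel2), sel3)


rect = [["a", "b", "c"],
        ["d", "e", "f"]]          # a = 2 rows, b = 3 columns
identity_rows = [[0, 1, 2], [0, 1, 2]]

# A permutation: exchange the two rows using only the column stage.
swapped = three_stage(rect, identity_rows, [[1, 1, 1], [0, 0, 0]], identity_rows)

# A copy: broadcast element (0, 0) to every position.
broadcast = three_stage(rect, [[0, 0, 0], [0, 0, 0]],
                        [[0, 0, 0], [0, 0, 0]], identity_rows)
```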
The data multiplexers within each stage are physically arranged as an a × b rectangle. This allows the data buses to be easily shared within each stage. Each data multiplexer is controlled by an encoded (log_{2} b)-bit value (for stages 1 and 3) or (log_{2} a)-bit value (for stage 2), with the log_{2} values being rounded up if the result is non-integral (which will happen if the operand is not a power of 2). Each bit of the control value for a given data bit in a given stage is independently chosen from several values by a control multiplexer.
Although each bit of a given control value for a given data bit in a given stage is independently controlled, there is still some sharing of data, which greatly reduces and simplifies the wiring. Each of the select signals for a control multiplexer for a given control bit in a given stage is shared across the entire row (for stages 1 and 3) or column (for stage 2). Furthermore, most of the inputs for a control multiplexer for a given control bit in a given stage are shared across the entire column (for stages 1 and 3) and row (for stage 2).
This system also allows a "fill" value to override some of the result bits (i.e., the bits of the data word after all of the stages). This is implemented by providing a bus containing a set of fill values which are selected on a bit-by-bit basis. The selection is controlled by the output of another set of control multiplexers in stage 3. As with the other control multiplexers in stage 3, these multiplexers are controlled by select signals which are shared across rows, and the inputs to these multiplexers come from signals which are shared across columns.
This system implements various shift and rotate operations, a class of shuffle/multiplex operations which can perform the primitive steps of the extended general method, and several other classes such as copy/swap, select, expand, compress, and bit field deposit/withdraw. Many of these operations are supported in a "group" form, which allows a single data word to be viewed as several smaller data words packed together. Such a group operation then acts independently on each of the smaller data words.
In one embodiment of the system, w = 128, a = 16, and b = 8 (so log_{2} a = 4 and log_{2} b = 3). In this embodiment, the data is arranged as a 16 × 8 rectangle (16 rows and 8 columns).
The classes of operations of the present invention generally involve the copying and/or permuting of elements. In general, these operations can apply to any sequence of elements which may be permuted and, in some cases, copied. In one embodiment, the operations are instructions in a digital computer, and the elements are bits. The following classes of operations are included in the present invention:
1. A general class of perfect shuffle/deal operations.
2. A general class of data multiplexing operations.
3. A general class of combined perfect shuffle/deal operations and data multiplexing operations.
4. An extension to the general class of perfect shuffle/deal operations which permits an arbitrary reordering of dimensions (i.e., an arbitrary transpose).
5. An extension to the general class of combined perfect shuffle/deal operations and data multiplexing operations which permits an arbitrary reordering of dimensions (i.e., an arbitrary transpose) in place of the perfect shuffle/deal component of the operation.
6. A general class of data selection operations.
7. A general class of copy/swap operations which support certain patterns of data copying and/or data reversal.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a first embodiment of the general system of the present invention including a single control generation unit and a permute/multiplex unit.
Figure 2 illustrates a second embodiment of the general system of the present invention including multiple control generation units, a control select unit and a permute/multiplex unit.
Figure 3 illustrates a third embodiment of the general system of the present invention having low-level simplification of control signals and including a single control generation unit and a final control selection and permute/multiplex unit.
Figure 4 illustrates a fourth embodiment of the general system of the present invention having low-level simplification of control signals including multiple control generation units, a control select unit, and a final control selection and permute/multiplex unit.
Figure 5 illustrates one embodiment of a two-dimensional system of the present invention.
Figure 6 illustrates the layout of a stage 1 cell of a two-dimensional embodiment of the system of the present invention designed to operate on 128 bits.
Figure 7 illustrates the layout of a stage 2 cell in a two-dimensional embodiment of the system of the present invention designed to operate on 128 bits.
Figure 8 illustrates the layout of a stage 3 cell in a two-dimensional embodiment of the system of the present invention designed to operate on 128 bits.
Figures 9A-9D illustrate alternative physical layout arrangements of stages 1-3 of the system of the present invention.
Figure 10 illustrates a first embodiment of a stage 1 cell in a two-dimensional embodiment of the system of the present invention as shown in Figure 6.
Figures 11A and 11B illustrate a first embodiment of a stage 2 cell in a two-dimensional embodiment of the system of the present invention as shown in Figure 7.
Figures 12A and 12B illustrate a first embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention as shown in Figure 8.
Figures 13A and 13B illustrate a second embodiment of a stage 1 cell in a two-dimensional embodiment of the system of the present invention, including an additional input data bus.
Figures 14A-14D illustrate a second embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention, including fill operation circuitry.
Figures 15A-15F illustrate a third embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention, including circuitry for providing additional multiplexer control to the stage.
Figures 16A-16F illustrate a fourth embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention, including fill operation circuitry and circuitry for providing additional multiplexer control to the stage.
Figure 17 illustrates an embodiment of a cell that employs bus overloading by using the same bus to provide fill data and additional multiplexer control data.
Figure 18 illustrates an embodiment of two adjacent cells that employ bus overloading by using the same bus to provide fill data and additional multiplexer control data to each of the adjacent cells and also employ bus sharing by using the same additional multiplexer control data for both adjacent cells.
Figure 19 illustrates a 16-bit data word having bit index numbers ranging from 0 to 15.
Figure 20 illustrates an example of a simple rotate operation being performed on a 16-bit data word.
Figure 21 illustrates an example of a bit reversal operation being performed on a 16-bit data word.
Figure 22 illustrates an example of a two-way shuffle operation being performed on a 16-bit data word.
Figure 23 illustrates an example of a two-way deal operation being performed on a 16-bit data word.
Figure 24 illustrates the equivalency between performing a transpose operation on a 3 × 5 rectangle and performing a three-way shuffle operation on the 15 elements within the rectangle.
Figure 25 illustrates the equivalency shown in Figure 24 where the rows and columns of the rectangles have been renumbered.
Figure 26 illustrates the bit index of an element in a 4 × 8 rectangle before and after a transpose operation.
Figure 27 illustrates an example of an outer group shuffle operation being performed on a 32-bit data word, where the outer group size for the operation is 8.
Figure 28 illustrates an example of an inner group shuffle operation being performed on a 32-bit data word, where the inner group size for the operation is 4.
Figure 29 illustrates an example of an outer/inner group shuffle operation being performed on a 128-bit data word, where the inner group size for the operation is 8 and the outer group size is 32.
DETAILED DESCRIPTION
A general method for arbitrarily permuting a sequence of elements, and an extension to the general method which allows some extensions to permutations which involve the copying of individual elements, is described in detail hereinafter. A system based on the extended general method which implements a class of operations which can perform the primitive steps of the extended general method, as well as a much larger class of operations which generally involve the copying and/or permuting of data, is also described. Finally, several classes of instructions for performing operations which generally involve the copying and/or permuting of elements are described.
In the following description, numerous specific details are set forth, such as data path width within a microprocessor and specific microprocessor operations, in order to provide a thorough understanding of the present invention. It will be obvious, however, to one skilled in the art that these specific details need not be employed to practice the present invention. In other instances, well-known logic gates, as well as some simple combinatorial circuits which may be built from such gates, have not been described in detail in order to avoid unnecessarily obscuring the present invention.
General Method
The general method of the present invention is a method for arbitrarily permuting a sequence of elements. The elements could be physical objects (for example, glass bottles), or they could be data values (for example, data bits). The general method for permuting a sequence of elements can be used as long as the primitive steps of the method can be performed on the elements.
The general method can perform an arbitrary permutation of w elements, by breaking it down into an n-dimensional rectangle whose sides correspond to any set of factors of w, i.e., w = f_{1}f_{2}...f_{n}. In one embodiment, the elements to be permuted are bits. The method consists of a sequence of 2n - 1 sets of permutations across the various dimensions. The specific order of the sets of permutations is shown below. This method is not restricted to values of w which are powers of two.
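One ordering of the 2n - 1 sets of permutations that fits the pattern this section relies on (steps 1 and 2n - 1 use the same dimension, steps 2 and 2n - 2 use the same dimension, and one dimension is used once, at step n) can be generated as follows. The function name is hypothetical, and numbering the dimensions in the order they are first used is an assumption; the text notes that the labeling of dimensions is flexible.

```python
def step_dimensions(n):
    """Dimension used by each of the 2n - 1 steps, with steps counted from 1.

    Assumption: dimension k is the one first used at step k, so that
    steps k and 2n - k always share a dimension.
    """
    return [min(k, 2 * n - k) for k in range(1, 2 * n)]


two_d = step_dimensions(2)     # [1, 2, 1]
three_d = step_dimensions(3)   # [1, 2, 3, 2, 1]
```

For n = 2 this gives the row, column, row sequence used by the two-dimensional embodiment, up to the choice of which dimension is called dimension 1.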
For a given dimension i, to perform independent permutations of f_{i} elements along dimension i means that each one-dimensional slice of the rectangle obtained by holding constant the coordinates of each dimension, except dimension i, is independently permuted. More formally, for each slice through dimension i, a permutation function p can be defined such that element (x_{1}, ..., x_{n}) in the new, permuted rectangle comes from element (x_{1}, ..., x_{i-1}, p(x_{i}), x_{i+1}, ..., x_{n}) in the old, unpermuted rectangle. Note that there are w/f_{i} independent p functions involved in permuting across dimension i, one for each one-dimensional slice through dimension i obtained by holding the coordinates of the other dimensions constant. In other words, there is a separate p function for each set of (x_{1}, ..., x_{i-1}, x_{i+1}, ..., x_{n}) values. Furthermore, since each p function specifies a permutation, no two values of a given p function may be the same, i.e., x ≠ y ⇒ p(x) ≠ p(y). This ensures that each element of the old, unpermuted rectangle appears exactly once in the new, permuted rectangle.
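A single set of independent permutations along one dimension can be sketched directly from this definition. The rectangle is held as a dictionary from coordinate tuples to elements; that representation, the names `permute_along` and `p_for_slice`, and the use of 0-based indices (the text counts dimensions and coordinates from 1) are all illustrative assumptions.

```python
from itertools import product


def permute_along(rect, dims, i, p_for_slice):
    """Independently permute every one-dimensional slice through dimension i.

    rect maps coordinate tuples to elements; i is a 0-based dimension index.
    p_for_slice(others) returns the p function for one slice, as a list,
    given the coordinates of all the other dimensions.
    """
    new = {}
    for x in product(*(range(f) for f in dims)):
        others = x[:i] + x[i + 1:]
        p = p_for_slice(others)
        # p must be a permutation: no two of its values may be the same.
        assert sorted(p) == list(range(dims[i]))
        source = x[:i] + (p[x[i]],) + x[i + 1:]
        new[x] = rect[source]
    return new


dims = (2, 3)
rect = {(r, c): 10 * r + c for r in range(2) for c in range(3)}

# Permute within rows: row 0 is rotated, row 1 is reversed.
ps = {(0,): [1, 2, 0], (1,): [2, 1, 0]}
new = permute_along(rect, dims, 1, lambda others: ps[others])
```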
The choice of which dimension to call dimension 1, which dimension to call dimension 2, etc. is completely flexible, and may be made in whatever way best suits the needs of a particular embodiment. The only requirement is that the pattern shown above is followed, i.e., permutation steps 1 and 2n - 1 involve the same dimension, permutation steps 2 and 2n - 2 involve the same dimension, etc., with each dimension but one being selected twice, and the remaining dimension being selected once in permutation step n.
Determining the Permutation Steps of the General Method
The following procedures show how the individual permutation steps of the general method may be determined for any permutation which the general method is to perform. At the same time, it should become clear that this method can be used to perform any arbitrary permutation of w elements.
First, a procedure which solves a simpler problem is described.
Procedure 1. Given an a × b rectangle (a rows and b columns) containing b separate copies of each of the values from 1...a, arbitrarily distributed throughout the rectangle, find a set of a independent permutations of the a rows such that each column in the permuted rectangle contains the values 1...a, in some unspecified order. In other words, for each row, this procedure finds a permutation of the elements within that row, such that, after each row has been permuted, each column in the rectangle contains the values 1...a.
The following explanation of this procedure describes a series of element interchanges which transform the original rectangle into one which satisfies the condition of each column containing the values 1...a. Although at some points the procedure describes temporarily interchanging elements from different rows, these temporary interchanges are always reversed before the procedure completes. The resulting effective element interchanges always involve elements from within the same row. The required permutation can be obtained by composing the series of effective element interchanges as the procedure proceeds, or it can be obtained from a direct comparison of the initial rectangle with the final rectangle. If a given row contains multiple copies of some value, the permutation obtained by the latter method may be different from that obtained by the former, although either method will yield an acceptable solution.
1. If b = 1, the condition is already satisfied.
2. If b = 2, first mark each row as unprocessed. Then proceed as follows, starting with step 2a:
a. If there are no remaining unprocessed rows, the procedure is complete. Otherwise, pick some unprocessed row. Mark that row as processed. Let A be the value in column 1 and B be the value in column 2. Now proceed to step 2b.
b. If value B equals value A, a cycle has been completed. In this case, return to step 2a. Otherwise, find the remaining unprocessed row which contains value B, switch the two elements in that row if necessary (so that B is moved to column 1), mark that row as processed, set the new B to be the new value in column 2, and repeat step 2b.
It should be obvious that an instance of each value ends up in both column 1 and column 2, which is what is required.
3. If b > 2, proceed as follows:
a. If column 1 contains no missing values, recursively solve the a × (b - 1) rectangle formed by removing column 1, and the procedure is complete.
b. Otherwise, let A be some value in column 1 which appears more than once in that column, and let B be some value which is missing from column 1. Find some other column, k, which contains at least one instance of B. Temporarily swap an A from column 1 with a B from column k, and mark the two swapped values so that they can be located later. This reduces the number of missing values in column 1 by one, so the new a × b rectangle can be solved recursively. After recursively solving the new rectangle, reswap the marked A and B values which were previously swapped. If they ended up in the same column, the procedure is complete. Otherwise, they're in two different columns, in which case those two columns can be solved as a simple a × 2 case. In fact, the one or two rows affected by the reswap must end up in the same cycle, so an optimization of the procedure is to begin step 2a of the a × 2 procedure with one of the row(s) affected by the reswap, then quit after processing that cycle, as opposed to repeating step 2a. In practice, it may be more efficient to iterate rather than recurse on the missing values in column 1 during step 3b.
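The b = 2 case (step 2) can be sketched as follows. The function name and the list-of-pairs representation are illustrative assumptions, and the sketch relies on the precondition of Procedure 1: every value from 1...a appears exactly twice in the a × 2 rectangle.

```python
def solve_two_columns(rows):
    """Step 2 of Procedure 1 for an a x 2 rectangle (illustrative sketch).

    rows is a list of [column 1, column 2] pairs in which every value
    appears exactly twice. Elements are only ever switched within a row.
    """
    rows = [list(r) for r in rows]
    unprocessed = set(range(len(rows)))
    while unprocessed:                        # step 2a
        r = min(unprocessed)                  # pick some unprocessed row
        unprocessed.remove(r)
        a_val, b_val = rows[r]
        while b_val != a_val:                 # step 2b: follow the cycle
            r = next(k for k in sorted(unprocessed) if b_val in rows[k])
            if rows[r][0] != b_val:
                rows[r].reverse()             # move B into column 1
            unprocessed.remove(r)
            b_val = rows[r][1]
    return rows


solved = solve_two_columns([[1, 2], [1, 3], [2, 3]])
# solved == [[1, 2], [3, 1], [2, 3]]: each column now holds 1, 2 and 3 once.
```

Each cycle starts with A in column 1 and ends when a row is found whose column 2 value is A, so both copies of every value on the cycle end up in different columns.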
The resulting permutations performed by this procedure can clearly be reduced to a single set of a independent permutations of the a rows. The following is an example of applying Procedure 1 to a 4 × 3 rectangle, i.e. matrix, shown below as Matrix 1. The numbers inside the parentheses indicate the original positions of the corresponding values in Matrix 1. They are shown to help distinguish between different instances of the same value in the matrix. Since the procedure is recursive, in this example the current invocation of the procedure is indicated by a parenthesized number following the current step.
The matrix contains three instances of each of the values from 1 to 4. In the initial matrix, it can be seen that some columns contain multiple copies of some values and no copies of other values. When the procedure terminates, each column will contain a single instance of each of the values from 1 to 4.
Procedure 2. Given w elements, broken down into a 2-dimensional a × b rectangle (a rows and b columns, with w = ab), find the 3 sets of permutations required by the general method to perform some given permutation of the elements. Specifically, find (1) a set of a independent permutations of the a rows, (2) a set of b independent permutations of the b columns, and (3) another set of a independent permutations of the a rows, such that, when the three sets of permutations are performed, in order, on the rectangle, they will achieve the desired permutation of the w elements.
1. First, look at the destination row of each value in the rectangle, and ignore the destination column for the time being. Viewed this way, the rectangle contains b separate copies of the values from 1... a , arbitrarily distributed throughout the rectangle. Procedure 1 can therefore be used to find a set of a independent permutations of the a rows such that each column of the permuted rectangle contains the values 1...a , in some unspecified order. Now, looking at both the row and the column information once again, it is clear that each column contains one value for each row, in some unspecified order.
2. At this point, each column contains one value for each row, in some unspecified order. Permute each column so that the value for each row is in that row.
3. At this point, each value is in the correct row. Permute each row so that the values are also in the correct column.
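The three steps of Procedure 2 can be sketched as follows. Note that this sketch does not implement Procedure 1 for step 1; as a plainly different, swapped-in technique it finds the first set of row permutations by repeatedly extracting a perfect bipartite matching between source rows and destination rows (a standard augmenting-path search). All names (`perfect_matching`, `three_stage_route`, the flat row-major numbering of positions) are assumptions made for illustration.

```python
def perfect_matching(adj):
    """Augmenting-path bipartite matching.

    adj[r] lists the destination rows reachable from source row r.
    Returns match, where match[t] is the source row paired with destination row t.
    """
    match = [-1] * len(adj)

    def augment(r, seen):
        for t in adj[r]:
            if t in seen:
                continue
            seen.add(t)
            if match[t] == -1 or augment(match[t], seen):
                match[t] = r
                return True
        return False

    for r in range(len(adj)):
        if not augment(r, set()):
            raise ValueError("no perfect matching; perm is not a permutation")
    return match


def three_stage_route(perm, a, b):
    """perm[s] = d sends the element at flat row-major position s to position d.

    Returns the rectangle (holding source positions) after each of the 3 steps.
    """
    dest_row = lambda s: perm[s] // b
    dest_col = lambda s: perm[s] % b
    remaining = [[r * b + c for c in range(b)] for r in range(a)]

    # Step 1: permute within rows so each column holds one element per destination row.
    stage1 = [[None] * b for _ in range(a)]
    for j in range(b):
        adj = [sorted({dest_row(s) for s in row}) for row in remaining]
        for t, r in enumerate(perfect_matching(adj)):
            s = next(x for x in remaining[r] if dest_row(x) == t)
            remaining[r].remove(s)
            stage1[r][j] = s

    # Step 2: permute within columns so every element reaches its destination row.
    stage2 = [[None] * b for _ in range(a)]
    for r in range(a):
        for j in range(b):
            stage2[dest_row(stage1[r][j])][j] = stage1[r][j]

    # Step 3: permute within rows so every element reaches its destination column.
    stage3 = [[None] * b for _ in range(a)]
    for r in range(a):
        for s in stage2[r]:
            stage3[r][dest_col(s)] = s
    return stage1, stage2, stage3
```

Each of the three returned rectangles differs from its predecessor only by permutations within rows (steps 1 and 3) or within columns (step 2), which is the property the procedure requires.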
Finally, a procedure which determines the individual permutation steps of the general method in the general, n-dimensional case is described.
Procedure 3. Given w elements, broken down into an n-dimensional f_{1} × f_{2} × ... × f_{n} rectangle (where w = f_{1}f_{2}...f_{n}), find the 2n - 1 sets of permutations required by the general method to perform some given permutation of the elements.
1. If n = 1, permute the w elements along the one (and only) dimension, and the procedure is complete. This is the degenerate case.
2. If n = 2, apply Procedure 2.
3. If n > 2, first reduce the number of dimensions by one by collapsing the first two dimensions into a single, larger dimension, resulting in an (n - 1)-dimensional (f_{1}f_{2}) × f_{3} × ... × f_{n} rectangle. The correspondence between the elements in each two-dimensional f_{1} × f_{2} slice of the original rectangle and the corresponding one-dimensional f_{1}f_{2}-element slice of the reduced rectangle can be chosen arbitrarily. The only important thing is that the last n - 2 coordinates of a given element be the same in both the original and the reduced rectangles.
Now, recursively find the 2(n - 1) - 1 = 2n - 3 sets of permutations required to permute the reduced, (n - 1)-dimensional rectangle. Once this is done, the first n - 2 and last n - 2 sets of permutations each permute across one of the last n - 2 dimensions, and can be transferred directly from the reduced rectangle to the original rectangle. It only remains to transform the set of permutations for step n - 1 in the reduced rectangle, which permute across the first, f_{1}f_{2}-wide dimension of the reduced rectangle, into three sets of permutations in the original rectangle, first across the f_{2}-wide second dimension, then across the f_{1}-wide first dimension, then once again across the f_{2}-wide second dimension. This can be achieved by applying Procedure 2 independently to each two-dimensional f_{1} × f_{2} slice of the original rectangle (after the first n - 2 permutation steps have been performed).
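The recursion in Procedure 3 determines the dimension across which each of the 2n - 1 steps permutes. The following sketch tracks only that ordering, not the permutations themselves; the function name `stage_dimensions` is hypothetical, and dimensions are numbered from 1 as in the text.

```python
def stage_dimensions(n):
    """Return the dimension permuted across by each of the 2n - 1 steps."""
    if n == 1:
        return [1]                      # degenerate case: a single permutation
    if n == 2:
        return [2, 1, 2]                # Procedure 2: rows, columns, rows
    order = []
    for d in stage_dimensions(n - 1):   # solve the reduced (n - 1)-dimensional rectangle
        if d == 1:
            order.extend([2, 1, 2])     # expand the collapsed f1*f2-wide dimension
        else:
            order.append(d + 1)         # the other dimensions carry over, renumbered
    return order
```

Under these assumptions the ordering for n = 3 is 3, 2, 1, 2, 3, and in general the steps run from dimension n down to dimension 1 and back up to dimension n.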
Extended General Method
The extended general method of the present invention is obtained by replacing each of the permutation steps of the general method with multiplexing steps. For example, when permuting across dimension i, each permutation of f_{i} elements is replaced with a full f_{i}-to-f_{i} crossbar, or equivalently, f_{i} independent f_{i}-to-1 multiplexers. Note that these crossbar, or multiplex, operations can perform permutations as a special case. Therefore, the extended general method supports all of the permutations performed by the general method, and in addition allows certain types of copying to be performed.
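A single multiplex step across one group of elements can be sketched as below. The function name `multiplex` and the convention that index 0 holds the most significant bit are assumptions made for this illustration only.

```python
def multiplex(inputs, selects):
    """An f-to-f crossbar modelled as f independent f-to-1 multiplexers:
    output k takes whichever input selects[k] names."""
    return [inputs[s] for s in selects]


bits = ["b3", "b2", "b1", "b0"]                 # index 0 is the most significant bit

# A permutation is the special case in which every input is selected exactly once.
reversed_bits = multiplex(bits, [3, 2, 1, 0])

# Copying: an arithmetic right shift by 2 replicates the sign bit b3.
shifted = multiplex(bits, [0, 0, 0, 1])
```

The second call selects input 0 three times, which a pure permutation step could not do.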
It is important to note, however, that the extended general method does not, in general, support arbitrary combinations of copying and permuting. That is to say, when applied to a sequence of w elements, it can not, in general, perform an arbitrary crossbar, or multiplex, operation on those elements. It can, however, perform any arbitrary permutation of those elements (just as the general method can), and in addition supports many useful forms of copying.
Some examples of these operations are described later.
General System
Both the general and extended general methods may be employed in the absence of a physical system which is based upon them. For example, they may be used in computer software, particularly on a computer system which provides some support for the primitive permute operations of the general method or multiplex operations of the extended general method. However, a physical system based on the general or extended general method can perform many useful functions, and in addition offers many advantages over alternative approaches.
The essential feature of a system based on the general or extended general method is that it employs 2n - 1 sequential stages. Each stage permutes/multiplexes its input along the appropriate dimension of the corresponding n-dimensional rectangle, as outlined in the descriptions of the general and extended general methods. The output of a given stage then becomes the input to the next stage.
One way to implement such a system is to build 2n - 1 independent stages. An alternative is to build fewer stages, possibly only a single stage, and then cycle the elements through multiple times to achieve the effect of 2n - 1 stages. In the latter case, it may be desirable to transpose the data before cycling through so that the same groups of elements are involved in the permute/multiplex operations (i.e., the permute/multiplex operations are performed across some fixed set of dimensions, possibly only a single dimension).
Physical Placement of Cells
Regardless of whether all 2n - 1 stages are physically implemented or whether some smaller number of stages are implemented, a choice of where to physically place the individual cells within each stage must be made, where a cell refers to the portion of a stage which is responsible for producing a single element of the output of that stage. Furthermore, when more than one stage is physically implemented, a choice of where to physically place the stages with respect to each other must also be made.
Several issues must be considered with respect to the placement of the stages in relation to each other, and the placement of cells within a given stage. For instance, in one embodiment of the system of the present invention, the cells of each stage may be physically placed in a single row, with each stage being placed directly beneath the preceding stage. However, since different stages may permute across different dimensions, it is inevitable that the cells in some stages will be forced to permute/multiplex from among input elements which are widely separated in the horizontal dimension. In embodiments in which this is undesirable, the cells could be physically reordered within each stage so that the permute/multiplex operations for that stage always involve consecutive groups of elements. However, this merely moves the problem from within a given stage to the interface between stages, since now the elements would have to be physically reordered (e.g., transposed) between stages.
In another simple placement strategy the cells are physically arranged within each stage as an n-dimensional rectangle whose elements correspond to the elements of the n-dimensional rectangle of the general or extended general method upon which the system is based. (Of course, this isn't physically possible for values of n which are greater than 3, or even 2 in many embodiments, but for purposes of the current discussion, this limitation will be temporarily ignored.) In such a physical arrangement, the groups of elements being permuted/multiplexed by a given stage each consist of all the elements in a linear slice through the appropriate dimension. In particular, the elements in each group are always contiguous.
Further, note that these properties will be preserved by any permutation of the slices across any dimension of the n-dimensional rectangle, or by any combination of permutations of the slices across any set of dimensions of the n-dimensional rectangle.
As for the relative placement of the different stages with respect to each other, there are two basic approaches. One is to physically arrange all cells into a single n-dimensional rectangle, where each element in the n-dimensional rectangle contains the corresponding cells from each stage, i.e., to physically interleave the stages with each other. The other basic approach is to keep each stage physically separate. With this approach, the output of one stage must be shipped to the next stage, which then performs a permute/multiplex operation along the appropriate dimension. The elements from one stage would move in a straight line to the corresponding point in the next stage, then either turn 90 degrees along the dimension to be permuted/multiplexed (if the dimension is different from the one along which the element was moved), or else continue in a straight line through the n-dimensional rectangle of the next stage (if the dimension is the same as the one along which the element was moved). Therefore, with this approach it is desirable to move the elements from one stage to the next along the same dimension across which the next stage is going to permute/multiplex. Finally, mixtures of these two approaches may also be used.
However, as was mentioned above, a placement strategy based on an n-dimensional rectangle can't be used directly if n is greater than 3, or even 2 in many embodiments. In such cases, the number of dimensions may be effectively reduced by first treating groups of cells, corresponding to slices of the n-dimensional rectangle, as single cells, and then applying the general strategy to each slice.
General Control Structure for System
So far the physical connections between the stages of a system and the physical placement of the stages (both the placement of cells within each stage and the relative placement of stages with respect to each other) have been discussed. However, in order to use such a system, there must be some way to control the permute/multiplex operations of each stage.
In the simplest embodiment, there is a single control generation unit (i.e. CGU(1)) which produces control for each stage of the system in permute/multiplex unit PMU(1) (see Figure 1). The control generation unit may take several control parameters as input, and from these it produces all of the control information needed to perform the permute/multiplex operations on Data In in each stage of the system to generate Data Out.
In a slightly more complicated embodiment (Figure 2), the system may be used to perform several unrelated functions. In such a system, it may be simplest to build several independent control generation units (i.e. CGU(1), CGU(2), CGU(3) . . . etc.). The output of each control generation unit feeds into a control select unit (i.e. CSU(1)), which is controlled by a function select input. Depending on the function select, the output of one of the control generation units is chosen and becomes the output of the control select unit, which controls PMU(1).
It may be the case that some functions do not require all stages of the system. In such cases, the control for the unused stages is extremely simple, since those stages simply perform the identity permutation on their input. The control generation unit for such a function need not produce any control for the unused stages. Instead, the control select unit may simply use a fixed control pattern for that stage of that function. Furthermore, this control pattern may be shared by more than one function.
It may also be the case that several functions can use the same control information for some stages. In such cases, it may make sense to merge the control generation units for those stages of those functions.
Low-level Control Simplification
The amount of control information required for a given stage is fairly large. For a stage which permutes across dimension i, there are w/f_{i} independent permutations/multiplexes of f_{i} elements along dimension i. In the case of permutation operations (as required by the general method), there are (f_{i}!)^{w/f_{i}} possible sets of control values for this stage, which requires (w/f_{i}) log_{2}(f_{i}!) bits of information to describe. In the case of multiplex operations (as required by the extended general method), there are f_{i}^{w} possible sets of control values for this stage, which requires w log_{2} f_{i} bits of information to describe. Generating this amount of information for each stage can require a substantial amount of logic. Even after it is generated, it must still be relayed to the permute/multiplex units for that stage, which may require a substantial amount of wiring. For example, if the control information is represented as digital values transmitted through wires, then a large number of wires have to run a potentially long distance to reach the point where the control information is used. If there are multiple control generation units which produce independent control for that stage, the problem is compounded.
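The size of these control requirements can be checked numerically with a short sketch. It assumes that a stage across dimension i consists of w/f_{i} independent groups of f_{i} elements each, so that a permutation stage needs (w/f_{i}) log_{2}(f_{i}!) bits and a multiplex stage needs w log_{2} f_{i} bits; the function names and the example sizes (w = 64, f_{i} = 8) are illustrative assumptions, not values taken from the description.

```python
import math


def permute_control_bits(w, f):
    """Bits needed to describe w/f independent permutations of f elements."""
    return (w // f) * math.log2(math.factorial(f))


def multiplex_control_bits(w, f):
    """Bits needed when each of the w cells independently selects one of f inputs."""
    return w * math.log2(f)


w, f = 64, 8
perm_bits = permute_control_bits(w, f)    # 8 groups, log2(8!) bits each
mux_bits = multiplex_control_bits(w, f)   # 64 cells, 3 bits each
```

For these sizes the multiplex stage needs 192 bits of control, and the permutation stage somewhat fewer, which illustrates why sharing control across cells is attractive.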
One way to reduce the amount of control information which must be generated and relayed to the appropriate stage of the system is to take advantage of the regularity of the functions to be performed by the system, and to use this regularity to partition the control for a given stage into some small number of sets which are shared across some or all other dimensions of the n-dimensional rectangle. The other dimensions of the n-dimensional rectangle would then have a corresponding set of shared control select information which would be used to determine the control information for a given cell in that stage. In some cases, it might make sense to precombine some of the control select and/or control information at the physical periphery of the stage, with the final control selection taking place at each cell in the stage. In fact, for some stages of some functions, a single set of control values may suffice, in which case the control select values are constant for those stages of that function, and the other sets of control values for those stages of that function are unused. In general, a given stage of a given function may require fewer sets of control values than are implemented for that stage, in which case the remaining control values for that stage of that function are unused, and the control select values for that stage of that function come from a restricted range which excludes the unused control values.
This control selection may take several forms. In its simplest form, one of the control values for a given slice through the dimension being permuted/multiplexed is selected for a given cell and the other control values for that slice are ignored. A more complicated form of selection would permit portions of one control value to be combined with portions of another control value. For example, if the final control for a given cell is in the form of a binary number, it may be desirable to independently select each bit of the control value. In fact, for some functions the simplest way to compute the control for a given cell of a given stage is to bitwise XOR a value from each (n - 1)-dimensional slice through the n-dimensional rectangle which intersects that cell. This can be fit into this general scheme by noting that a two-operand XOR operation is merely a special case of a 2-to-1 multiplex operation. One operand acts as the select input to the multiplexer, and the inputs to the multiplexer are the second operand and its complement. If the first operand is true, the complemented value of the second operand is selected. Otherwise the value of the second operand is selected.
Physical Placement of Cells with Low-level Control Simplification
In an earlier section, some of the issues which affect the physical placement of cells within a stage, as well as the relative placement of different stages with respect to each other, were discussed. The low-level control simplification utilizes sets of control and control select values which are shared across various slices of the n-dimensional rectangle. Therefore, if the cells are physically arranged as an n-dimensional rectangle, all cells which share a given control or control select value will lie in a straight line. If fewer than n dimensions are available, arrangements which tend to preserve physical linearity across dimensions will generally result in the simplest connectivity of control select and control signals.
Control Structure for General System with Low-level Control Simplification
In an earlier section, various control structures for the general system were described. These structures can easily be adapted to accommodate the low-level control simplification described above. Instead of generating a single set of unshared control values for a given stage, the control generation units now generate several sets of shared control values for a given stage, as well as a set of control select values for that stage. Although more sets of values are generated, each set has far fewer values in it due to the sharing. The result is a greatly reduced number of total control values.
In the simplest embodiment, there is a single control generation unit (i.e. CGU(1)) which in general produces several sets of shared control values for each stage of the system (see Figure 3). Final control selection for a given stage takes place at each cell within that stage. The final control selection and corresponding permute/multiplex units for each stage are shown in Figure 3 as FCS/PMU(1).
In a slightly more complicated embodiment, there are several independent control generation units (i.e. CGU(1), CGU(2), CGU(3) . . . etc.), each generating control values, i.e. control and control select 1, control and control select 2, control and control select 3, . . . etc., respectively (see Figure 4). Both the control and control select outputs of each control generation unit feed into control select unit CSU(1), which is controlled by a function select input. Depending on the function select, the output of one of the control generation units is chosen and coupled to FCS/PMU(1), where final control selection and data permutation/multiplexing are performed.
Functional Extensions to the General System
A basic general system, based on either the general or extended general methods, has been described. Some simple modifications which extend the functionality in various useful ways are now described.
The most obvious extension is to base the system on the extended general method rather than the general method. This has already been mentioned as a variation. In order to replace the permutation operations with unrestricted multiplex operations, it must be possible to copy the elements. Although this may not be possible in systems in which the elements are physical objects (for example, glass bottles), it will usually be possible in systems in which the elements are data values (for example, bits). In such systems, the simplest implementation of the general method may in fact also be an implementation of the extended general method.
Although the extended general method cannot, in general, perform an arbitrary crossbar operation, it can nevertheless support a large number of useful extensions to permutations which involve the copying of some elements. For example, if the elements are bits, the extended general method can be used to implement various right shift operations which perform sign extension.
A system based on the extended general method and which takes advantage of the low-level control simplification technique described above can be extended by allowing the control for one or more stages to be directly, independently specified for each cell in that stage, i.e., to either not implement or to bypass the low-level control simplification for those stages. If this is done in the final stage of a function (stage 2n - 1), it is possible to combine some other operations of the system with a subsequent arbitrary multiplex operation across the dimension which the final stage multiplexes.
In particular, if the final stage is unused by a function (so it has a single set of control values for that stage for that function, and those values specify the identity permutation), or if the control values for the final stage each take a very simple form (such as the identity permutation (whose encoded, zero-based control values are the identity function), or a permutation which reverses the elements along that dimension (whose encoded, zero-based control values are the bitwise complement of the identity function)), then a simple substitution of the unshared, encoded, zero-based control values and/or their complements for the original shared, encoded, zero-based control values will result in the functional composition of the original function followed by the unrestricted multiplex operation across the dimension which the final stage multiplexes. It turns out that there exist control values for a generalized transpose operation, which includes all perfect shuffle/deal operations as a special case, for which the control values of the final stage satisfy this condition. It is therefore possible to use this extension to implement a general transpose/mux, or as a special case a shuffle/mux, operation. Note that such an operation is capable of performing the primitive operations of the extended general method.
Another extension to the general system is to permit "fill" values to be introduced, before or after some stages in the system. A fill value may be some fixed value (for example, in the case where the elements are bits, the values 0 and 1 are obvious choices for fill values), or it may be supplied as an additional input to the system, either as some small set of values which may be introduced into various positions, or as a complete set of alternate values, one for each element position. Regardless of how the fill values are specified, a mechanism is needed to control when they are to be used in place of the corresponding value from the previous stage or from the input to the system. One way to do this is to introduce a new set of control values at each fill point in the system, one for each element, where each control value indicates whether the corresponding fill value should override the value at that point. One way to simplify this is to use a structure similar to that described for the low-level control simplification. In this case, one or more sets of binary control values would be defined for each slice along one dimension, and a set of control select values would be defined for slices across the other dimensions. The control select values would ultimately select one of the control values for a given cell. The constant control values 0 and 1 may be made implicitly available to reduce the number of control signals. In any case, if the final selected control value is 0, then the fill function is disabled for that cell. If it is 1, it is enabled, and the value for that cell is taken from the appropriate fill value. Such a fill mechanism may be used to support left and right logical shift operations (where the fill value is 0), or bit field deposit operations (where the fill values are taken from a bus which contains the data being deposited into). A particularly useful place to introduce the fill mechanism is in the output of the final stage (stage 2n - 1).
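The per-element fill control described above can be sketched as follows, using a logical right shift as the example. This is an illustration under stated assumptions: the names `apply_fill` and `logical_shift_right` are hypothetical, index 0 is taken to be the most significant bit, and a simple rotation stands in for whatever the permute/multiplex stages would actually produce ahead of the fill point.

```python
def apply_fill(values, fill_values, fill_enable):
    """Where the enable bit is 1, the fill value overrides the stage output."""
    return [f if en else v
            for v, f, en in zip(values, fill_values, fill_enable)]


def logical_shift_right(bits, k):
    """Rotate right by k, then zero-fill the k most significant positions."""
    w = len(bits)
    rotated = [bits[(i - k) % w] for i in range(w)]    # stand-in for the stages
    enable = [1 if i < k else 0 for i in range(w)]     # fill only the vacated bits
    return apply_fill(rotated, [0] * w, enable)
```

A bit field deposit would use the same `apply_fill` step with the fill values taken from the word being deposited into, rather than from a constant.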
Another fairly obvious extension to the general system is the ability to tap into or out of the system between stages. For example, tapping into the system between stages s and s + 1 could be used for functions which don't need stages 1 through s . This would allow more time for the elements to arrive as input to the system. One way to implement this would be to modify the output portion of stage s to conditionally use an alternate set of inputs as its output.
Tapping out of the system between stages s and s + 1 could be used for functions which don't need stages s + 1 through 2n - 1. This would make the result of such functions available earlier.
Simplifying Restrictions to Functional Extensions to the General System
Some simplifying restrictions to some of the functional extensions to the general system described above may be desirable in some embodiments. For example, a system based on the extended general method and which takes advantage of the low-level control simplification technique described above, and which has been extended by allowing the control for one or more stages to be independently specifiable for each cell, and which has also been extended to allow a full set of fill values to be used for fill operations, takes a large number of additional inputs. If the additional multiplex control values and fill values are never needed by the same function, then these values can coexist, and may share some of the same inputs. For example, if the additional inputs are supplied via data buses, some of the same buses may be used for both functions. Of course, if the two sets of inputs are used at different times, it may be necessary to buffer the inputs which are needed later so that both sets of inputs are expected at the same time, since otherwise it may not be possible to mix the two functions in a pipelined environment.
For a system based on the extended general method and which takes advantage of the low-level control simplification technique described above, and which has been extended by allowing the control for one or more stages to be directly, independently specified for each cell in that stage, it may be desirable to limit the extension so that some sharing of the direct multiplexer control still takes place. Although the resulting functionality is less powerful, it will reduce the number of additional inputs which must be provided. For example, the number of additional inputs can be halved if the values are shared between adjacent slices of the n-dimensional rectangle. If the slices across a given dimension are ordered such that low-order and high-order slices are alternated, then cells whose corresponding logical element positions differ by a fixed offset will be physically adjacent to each other, so it is very convenient to share multiplexer control for such pairs of cells. The functional result is that the multiplex portion of the function will be duplicated on the high-order and low-order halves of the input.
Special Cases of the General System
Although the system described thus far is very general and somewhat abstract, certain special cases are particularly useful, and for that reason they are singled out here.
One noteworthy special case is when the number of dimensions is 2. A system based on a 2-dimensional rectangle, with only 3 stages, is easier to understand and control than a system based on a higher-dimensional rectangle, with more than 3 stages. It is also easier to physically place the cells and stages of such a system.
Another special case of interest is when the number of elements is a power of two. In this case, the size of any dimension of any n-dimensional rectangular representation of the elements will also be a power of two. If the control values for such a system are encoded as zero-based binary numbers, then all possible values for a given number of control bits can be meaningfully defined, i.e., there will be no out-of-range control values. The control can therefore be very efficiently represented. Some control generation issues are also simplified when dealing with sizes which are powers of two.
Basic Two-dimensional System
This section describes an embodiment of the general system described earlier. This embodiment is based on a two-dimensional rectangle, and it incorporates the low-level control simplification described earlier. The system has three permute/multiplex stages which are implemented with multiplexers. In this description, the first and third stages multiplex within rows and the second stage multiplexes within columns. However, it should be understood that a system may be implemented in which the first and third stages multiplex within columns and the second stage multiplexes within rows. In any case, these multiplexers are referred to as data multiplexers, in order to distinguish them from other multiplexers in the system. The data multiplexers are controlled by encoded control values which are individually constructed at each cell in the rectangle for each of the three stages in the system.
Stages 1 and 3 are controlled by control signals which are shared within each column and control select signals which are shared within each row. Stage 2 is controlled by control signals which are shared within each row and control select signals which are shared within each column. For all stages, the control select signals select the control signal to use for each control bit of the encoded control value for each individual cell in that stage. This selection is performed independently for each bit of the encoded control value for each cell of each stage in the system. The selection is performed by multiplexers. These multiplexers are referred to as control multiplexers in order to distinguish them from other multiplexers in the system.
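The per-bit selection performed by the control multiplexers at a single cell can be sketched as below. The names `cell_control` and `xor_via_mux`, and the representation of control values as Python integers, are assumptions made for illustration. The second function shows the special case noted earlier in this description, in which a two-operand XOR is obtained from a 2-to-1 multiplexer whose data inputs are one operand and its complement.

```python
def cell_control(control_values, control_select, n_bits):
    """Build one cell's encoded control value bit by bit.

    control_values: candidate control values shared along one axis.
    control_select: for each bit position, the index of the candidate to use.
    """
    value = 0
    for bit in range(n_bits):
        chosen = control_values[control_select[bit]]   # one control multiplexer per bit
        value |= ((chosen >> bit) & 1) << bit
    return value


def xor_via_mux(x, y, n_bits):
    """x XOR y as a 2-to-1 multiplex: each bit of y selects x or its complement."""
    mask = (1 << n_bits) - 1
    select = [(y >> bit) & 1 for bit in range(n_bits)]
    return cell_control([x, ~x & mask], select, n_bits)
```

For instance, with candidates 5 (binary 101) and 2 (binary 010) and selects of 0, 1, 0 for bits 0, 1, 2, the cell receives bits 0 and 2 from the first candidate and bit 1 from the second.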
Figure 5 illustrates a general block diagram of one embodiment of the two-dimensional system of the present invention including stages 1-3 (which include control and data multiplexers CDM(1), CDM(2), and CDM(3)), control generation units CGU(1)-CGU(n), and control select unit CSU(1). A decoded instruction, along with any associated control operands, is coupled to each of the control generation units CGU(1)-CGU(n). The decoded instruction typically originates from a program being executed by a computer system. The decoded instruction, together with its associated control operands, defines a particular operation that is to be performed on a given word of data, DATA/IN (Figure 5). The data may be retrieved from memory storage, a register file, register file bypass logic, or directly from the result of a previous instruction.
Stage 1 performs a set of independent row multiplex operations on DATA/IN, controlled by its corresponding control values (i.e., XC(1)) and control select values (i.e., XCS(1)), producing D2, as shown in Figure 5. Stage 2 then performs a set of independent column multiplex operations on D2, controlled by its corresponding control values (i.e., XC(2)) and control select values (i.e., XCS(2)), producing D3, as shown in Figure 5. Finally, Stage 3 performs a set of independent row multiplex operations on D3, controlled by its corresponding control values (i.e., XC(3)) and control select values (i.e., XCS(3)), producing DATA/OUT, as shown in Figure 5. The value of DATA/OUT is the value which results from applying the operation specified by the decoded instruction, together with its associated control operands, on DATA/IN.
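The data path through the three stages can be sketched as follows. The sketch takes the final, per-cell encoded control values as given (that is, after control selection has already taken place at each cell), and all function and parameter names are hypothetical. The 2 × 2 transpose used as the example, and its control pattern, were worked out by hand for this illustration and are not taken from the description.

```python
def row_stage(grid, ctl):
    """Each cell (r, c) selects the element in column ctl[r][c] of its own row."""
    return [[grid[r][ctl[r][c]] for c in range(len(grid[0]))]
            for r in range(len(grid))]


def column_stage(grid, ctl):
    """Each cell (r, c) selects the element in row ctl[r][c] of its own column."""
    return [[grid[ctl[r][c]][c] for c in range(len(grid[0]))]
            for r in range(len(grid))]


def three_stage(data_in, ctl1, ctl2, ctl3):
    d2 = row_stage(data_in, ctl1)      # stage 1: multiplex within rows
    d3 = column_stage(d2, ctl2)        # stage 2: multiplex within columns
    return row_stage(d3, ctl3)         # stage 3: multiplex within rows


# In this tiny case every stage can use the control value r XOR c at cell (r, c).
ctl = [[r ^ c for c in range(2)] for r in range(2)]
data_out = three_stage([["a", "b"], ["c", "d"]], ctl, ctl, ctl)
```

Here `data_out` is the transpose of the input. That the per-cell control is the XOR of a row-shared value and a column-shared value is a property of this small example only.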
In general, each control generation unit takes several input values and produces several output values. The input values originate from the decoded instruction and its associated control operands. The output values are the sets of control and control select values needed to control the stages of the system when it performs a corresponding operation. A given control generation unit may be implemented with combinatorial logic networks composed of standard logic gates (AND, OR, XOR, etc.), memories (either read-only or read/write), programmable logic arrays, or any combination of these items. It should be understood, however, that other techniques for implementing a control generation unit should be obvious to one skilled in the art of logic design.
In general, each control generation unit corresponds to some class of operations which have similar control characteristics, and each control generation unit generates multiple sets of control values (shown as C(1) - C(n) in Figure 5) for each stage and a single set of control select values (shown as CS(1) - CS(n) in Figure 5) for each stage. The control and control select values produced by each control generation unit form the input to the control select unit CSU(1). The control select unit consists of a set of multiplexers, referred to here as control select multiplexers. The data inputs of the control select multiplexers are coupled to the control and control select outputs produced by the control generation units CGU(1) - CGU(n). The data outputs of the control select multiplexers comprise the set of control and control select values that are coupled to the data and select inputs of the control multiplexers in stages 1, 2, and 3. In Figure 5, XC(1), XC(2), and XC(3) are the sets of control values for stages 1, 2, and 3, respectively, and XCS(1), XCS(2), and XCS(3) are the sets of control select values for stages 1, 2, and 3, respectively. In general, each bit of each control or control select signal for each stage of the system is generated by one control select multiplexer, whose inputs are the corresponding control or control select values produced by the control generation units. The control select multiplexers are controlled by a set of function select signals SEL(1). In the simplest case, the function select signal selects all of the control and control select values produced by the control generation unit which corresponds to the class of operation being performed, with the control and control select values produced by all of the other control generation units being discarded (i.e., not selected) by the control select unit.
It should be understood, however, that in a pipelined implementation, a different operation may be active in each stage of the system at any given time, with the control for each stage of the system for a given operation being utilized at different points in the pipeline. In particular, the control for stage 1 for a given operation will be utilized before the control for stage 2 for that operation, which in turn will be utilized before the control for stage 3 for that operation.
The above description implies that each control generation unit generates a complete set of control and control select values for each stage of the system. In most cases, however, this is more control information than is needed to perform a given operation. The final set of control and control select signals generated by the control select unit is designed to be flexible enough to control every operation that the system is required to perform. However, this is more control information than is needed for most operations, and in fact it may be the case that no single operation requires all of the control and control select signals which are generated by the control select unit. Because of this, in a typical embodiment of the present invention, most of the control generation units only generate a subset of the possible control and control select signals. The unneeded control and control select values for a given operation may be taken from some simple set of constant values, or may be duplicated from other control or control select values produced by the corresponding control generation unit, or may even be irrelevant, "don't care" values (in the case of control values which are never selected by the corresponding control select values).
One example of a situation in which some control and control select signals are unused is when a given operation doesn't require the use of a given stage of the system. In this case, the unneeded stage must propagate its input to its output unaltered. Only one set of corresponding encoded control values is needed for that stage, and those values specify the identity operation for that stage. For example, if the unneeded stage is one which multiplexes within rows (i.e., if it is stage 1 or stage 3), then the encoded control value for column 0 is 0, for column 1 is 1, for column 2 is 2, etc. Since only one set of control values is needed, the corresponding control select values for that stage are also constant, and always specify the one set of defined control signals. The other, unused control signals are therefore irrelevant, "don't care" values. In this example, the control generation unit corresponding to the operation may generate a single set of constant control values for that stage and a single set of constant control select values for that stage.
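The identity setting for a row-multiplexing stage can be sketched as follows. This is only an illustration under the 8-column example used elsewhere in the description; the helper names `identity_controls` and `row_multiplex` are hypothetical.

```python
COLS = 8  # columns per row in the 16 x 8 example


def identity_controls(cols=COLS):
    """One encoded control value per column: column c selects input c of its row."""
    return list(range(cols))


def row_multiplex(row_bits, column_controls):
    """Each cell in a row picks the input bit named by its column's control value."""
    return [row_bits[column_controls[c]] for c in range(len(row_bits))]


controls = identity_controls()
row = [1, 0, 1, 1, 0, 0, 1, 0]
passed = row_multiplex(row, controls)  # the row is propagated unaltered
```

Because the same column control values are shared by every row, one constant set of eight values is enough to make the whole stage transparent.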
However, since it is likely that other operations may need the same control and/or control select values for that stage, and since no logic is needed to generate those values (since they're constant), it makes sense to make them available as fixed inputs to the corresponding control select multiplexers, which may be used by multiple classes of operations. The "don't care" control values may be taken from any existing inputs to the corresponding control select multiplexers, eliminating the need to add additional inputs to those multiplexers.
Another example of a situation in which some control and control select signals are unused is when a given operation requires fewer sets of control values for a given stage than are supported by the system, possibly only a single set. In this case, the control select values corresponding to the unneeded control values are constant (with value 0), and the unneeded control values themselves are irrelevant, "don't care" values. The constant and "don't care" control and control select values may be handled as in the previous example.
For some operations, some control or control select values for a given stage may be duplicated in some way. For example, although the system described here permits bitwise selection of the encoded control values that are generated by the control multiplexers, many operations don't make use of this feature. In those cases, control bits for a given row (stages 1 and 3) or column (stage 2) come from the same set of control values. Because of this, the corresponding control generation units do not generate a separate set of control select values for each bit of encoded control, but instead generate a single set of control select values which are shared across the encoded control bits. Each of these control select values is used as input to several control select multiplexers.
For other operations, it may be the case that the high and low halves of the data word have identical control or control select values. In those cases, only one set of the duplicate control or control select values needs to be generated. Those control or control select values then provide the input to the corresponding control select multiplexers for both the high and low halves of the data word.
The above examples demonstrate that in many cases a given control generation unit is able to generate fewer control or control select values than the system supports, which can reduce the amount of control logic, and in some cases the amount of wiring, required to implement the system.
It should also be noted that the present invention is not limited to the methods of reducing the number of control and control select signals, as described above, and it should be obvious to one skilled in the art of logic design that other techniques for reducing the number of control and control select signals produced by a given control generation unit may be employed.
Stage 1 of the Basic Two-dimensional System
Figure 6 illustrates a block diagram of stage 1 of the system shown in Figure 5 for the particular case in which the data word has 128 bits and is viewed as a two-dimensional rectangle having 16 rows and 8 columns. It should be noted that not all of the stage 1 cells have been shown in Figure 6, in order to simplify the illustration of the stage. For instance, although row(0) actually comprises cells S1(0) - S1(7), Figure 6 only shows cells S1(0), S1(1), and S1(7). Similarly, row(2) through row(14) have been omitted; however, it should be understood that the rows not shown each comprise eight cells in the same manner as row(0), row(1), and row(15) shown in Figure 6.
Please also note with reference to Figure 6, and to any other figures henceforth, reference numbers used to indicate shared column control and column control select buses (e.g. C1A(0,0-2) or C1A(0,0)) are formatted in the following manner: 1) the prefix in front of the parentheses indicates the name of the bus or buses, 2) the number before the comma in the parentheses indicates the column number, and 3) the number after the comma indicates the bit or range of bits within the column. For instance, C1A(0,2) indicates bit 2 of column 0 of the C1A bus, and C1A(0,0-2) indicates bits 0-2 of column 0 of the C1A bus. Exceptions to this rule will be noted as they arise.
Similarly, reference numbers used to indicate shared row control and row control select buses (e.g. CS1(0,0-2) or CS1(0,0)) are formatted in the following manner: 1) the prefix in front of the parentheses indicates the name of the bus or buses, 2) the number before the comma in the parentheses indicates the row number, and 3) the number after the comma indicates the bit or range of bits within the row. For instance, CS1(0,2) indicates bit 2 of row 0 of the CS1 bus, and CS1(0,0-2) indicates bits 0-2 of row 0 of the CS1 bus. Exceptions to this rule will be noted as they arise.
Finally, reference numbers used to indicate data buses (as opposed to control or control select buses) are formatted in the following manner: 1) the prefix in front of the parentheses indicates the name of the bus or buses, 2) the number before the comma in the parentheses indicates the row or range of rows, and 3) the number after the comma indicates the column or range of columns. For instance, D1(0,2) indicates the bit associated with row 0, column 2 of the D1 bus, D1(0,0-7) indicates the bits associated with row 0, columns 0-7 of the D1 bus, and D2(0-15,0) indicates the bits associated with column 0, rows 0-15 of the D2 bus. This convention applies to data buses which run either horizontally or vertically. Exceptions to this rule will be noted as they arise.
Referring to Figure 6, stage 1 comprises 128 cells (i.e. S1(0) - S1(127)), each cell generating one bit of multiplexed data onto one of buses D2(0-15,0) - D2(0-15,7). Each column of 16 cells generates 16 bits of data. For instance, the cells in column(0), i.e. cells S1(0), S1(8), S1(16), . . . S1(120), generate 16 bits of data which are coupled to output bus D2(0-15,0). Similarly, the cells in column(1), i.e. cells S1(1), S1(9), S1(17), . . . S1(121), generate 16 bits of data which are coupled to output bus D2(0-15,1).
Stage 1 performs eight independent 8-to-1 multiplex operations for each of the 16 rows. Sixteen 8-bit input buses, D1(0,0-7) - D1(15,0-7), comprise the 16 input row buses. Each 8-bit input bus is coupled to each of the 8 cells in the corresponding row. For instance, 8-bit input bus D1(0,0-7) is coupled to each of cells S1(0) - S1(7) in row(0) as shown in Figure 6, input bus D1(1,0-7) is coupled to the row(1) cells, S1(8) - S1(15), and input bus D1(15,0-7) is coupled to the row(15) cells, S1(120) - S1(127).
All of the cells in a given row generate 8 bits (i.e., one bit per cell) of multiplexed data. For instance, each of cells S1(0) - S1(7) generates 1 bit of multiplexed data; collectively S1(0) - S1(7) generate 8 bits total. Each of the cells in a given row couples its corresponding multiplexed data bit to a different output column bus. For instance, in row(0), cell S1(0) couples its output data bit to D2(0,0) in column(0), S1(1) couples its output data bit to D2(0,1) in column(1), and S1(7) couples its output data bit to D2(0,7) in column(7). In the next row, S1(8) couples a bit to D2(1,0) in column(0), cell S1(9) couples a bit to D2(1,1) in column(1), and cell S1(15) couples a bit to D2(1,7) in column(7). In the last row shown in Figure 6, S1(120) couples a bit to D2(15,0) in column(0), cell S1(121) couples a bit to D2(15,1) in column(1), and cell S1(127) couples a bit to D2(15,7) in column(7).
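The relationship between a cell number and its row, column, and output line in the consecutive arrangement of Figure 6 can be written as a short sketch. The function names are hypothetical; the numbering itself follows the examples given above.

```python
ROWS, COLS = 16, 8


def cell_position(n):
    """Return (row, column) of cell S1(n) in the consecutive arrangement."""
    return divmod(n, COLS)


def stage1_output_line(n):
    """Name of the D2 bit line driven by cell S1(n), in the bus notation of the text."""
    row, col = cell_position(n)
    return f"D2({row},{col})"


# Cells that share column(0), and therefore share output bus D2(0-15,0).
column0_cells = [n for n in range(ROWS * COLS) if cell_position(n)[1] == 0]
```

For example, `stage1_output_line(9)` gives `D2(1,1)`, matching the description of cell S1(9).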
Each of the 128 cells is controlled by a set of 6 control bits: 3 bits provided from control bus C1A and 3 bits provided from control bus C1B. All of the cells in a given column share the same set of 6 control bits. For instance, cells S1(0), S1(8), . . . S1(120) share the same 6 control bits; however, each column in stage 1 is controlled by a different set of 6 control bits. Hence, for the eight columns in stage 1, there are eight 3-bit C1A control buses (C1A(0,0-2) - C1A(7,0-2)) and eight 3-bit C1B control buses (C1B(0,0-2) - C1B(7,0-2)).
In addition to the 6 control bits coupled to each of the 128 cells, a 3-bit control select signal is coupled to each of the cells. All of the cells in a given row share the same set of 3 control select bits. For instance, cells S1(0) - S1(7) each share the same 3 control select bits. Each row of cells in stage 1 is controlled by a different set of control select signals. Thus, there are sixteen 3-bit control select buses, CS1(0,0-2) - CS1(15,0-2), a different one per row. The control select signals allow for a bitwise selection between each of the C1A and C1B control bits coupled to each cell. Thus, although control bits C1A and C1B are shared within columns, the CS1 signals allow for a bitwise selection of control for each cell. For instance, in the case of cell S1(0), control select signal CS1(0,0) selects between control bits C1A(0,0) and C1B(0,0), control select signal CS1(0,1) selects between control bits C1A(0,1) and C1B(0,1), and control select signal CS1(0,2) selects between control bits C1A(0,2) and C1B(0,2).
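The bitwise selection between the C1A and C1B control bits can be sketched as below. The polarity is an assumption made for illustration: a control select bit of 0 is taken to choose the C1A bit and 1 to choose the C1B bit, which the description does not state explicitly. The function names are hypothetical.

```python
def select_control(c1a, c1b, cs1, width=3):
    """Per bit position i: take bit i of c1b where cs1 bit i is 1, else bit i of c1a.

    Polarity (0 -> A, 1 -> B) is an assumption of this sketch.
    """
    mask = (1 << width) - 1
    return ((c1a & ~cs1) | (c1b & cs1)) & mask


def stage1_cell(row_bits, c1a, c1b, cs1):
    """8-to-1 data multiplex driven by the bitwise-selected 3-bit control value."""
    return row_bits[select_control(c1a, c1b, cs1)]


row_bits = [0, 0, 0, 0, 0, 1, 0, 0]                 # D1(0,0-7); only column 5 is set
effective = select_control(0b011, 0b101, 0b110)      # bit 0 from A, bits 1-2 from B
out = stage1_cell(row_bits, 0b011, 0b101, 0b110)
```

Here the effective control value is 0b101, so the cell passes input 5 of its row.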
The flow of data into and out of the block diagram of stage 1 shown in Figure 6 is such that sixteen 8-bit buses, D1(0,0-7) - D1(15,0-7), enter horizontally and eight 16-bit buses, D2(0-15,0) - D2(0-15,7), exit vertically and provide the data for the next stage. It can also be seen in Figure 6 that the data provided on buses D2(0-15,0) - D2(0-15,7) is arranged to make column multiplexing in the next stage possible, since buses of contiguous column bits are provided to stage 2. It should also be noted that the manner in which control is provided to the cells greatly reduces the number of control signals. The reason for this is that to perform an 8-to-1 multiplex operation with a decoded multiplexer in each cell, three control signals per cell are required (i.e. number of control bits = log_2 8 = 3 bits), or 384 bits (i.e. 128 cells × 3 bits/cell) of control data for stage 1. Obviously, 384 lines of control data would represent a very large number of wires to be coupled to stage 1. However, the present invention avoids using these large numbers of control wires by sharing control values within columns and using control select signals shared within rows to perform bitwise selection of control values as shown in Figure 6. As a result, the present invention greatly reduces the number of control signals needed to be coupled to stage 1. Specifically, the present invention only uses 48 control select bits for row(0) - row(15), i.e.:
16 rows × 3 CS1 bits = 48 row control select bits;
and 48 control bits for column(0) - column(7), i.e.:
8 columns × (3 C1A bits + 3 C1B bits) = 48 column control bits.
Thus, 96 total control and control select bits are used in stage 1 (as compared to 384 bits) in the case in which control and control select bits are shared between columns and rows.
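The arithmetic above can be reproduced with a small sketch. The function name and parameter names are hypothetical; the counting rule is the one stated in the text for stage 1 (control sets shared per column, one control select set shared per row).

```python
def stage1_control_bits(rows=16, cols=8, mux_inputs=8, control_sets=2):
    """Count control wires for stage 1 with and without sharing."""
    bits = (mux_inputs - 1).bit_length()          # log2(8) = 3 encoded control bits
    unshared = rows * cols * bits                 # one private control value per cell
    column_control = cols * control_sets * bits   # C1A and C1B, shared per column
    row_select = rows * bits                      # CS1, shared per row
    return unshared, column_control, row_select, column_control + row_select


unshared, column_control, row_select, shared_total = stage1_control_bits()
```

With the default parameters this yields 384 unshared bits against 48 + 48 = 96 shared bits, matching the figures given above.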
Stage 2 of the Basic Two-dimensional System
Figure 7 illustrates a block diagram of a second stage of the system shown in Figure 5 for the particular case in which the data word has 128 bits and is viewed as a two-dimensional rectangle having 16 rows and 8 columns. As can be seen, stage 2 comprises 128 cells, each cell providing one bit of data by performing a 16-to-1 column multiplex operation. Each cell couples one bit of data to one bit line of the sixteen 8-bit output buses, D3(0,0-7) - D3(15,0-7). The input data provided from the previous stage, i.e. from buses D2(0-15,0) - D2(0-15,7), are each coupled to the stage 2 columns. For instance, input bus D2(0-15,0) is coupled to each of the column(0) cells, S2(0), S2(8), . . . S2(120).
Similar to stage 1, control is provided by two control buses (C2A and C2B) and a control select bus (CS2). Since a 16-to-1 multiplex operation is being performed by each cell in stage 2, 4 bits of control are required (i.e. number of control bits = log_2 16 = 4 bits). The two 4-bit control buses are shared by the cells within the same given row. For example, C2A(0,0-3) and C2B(0,0-3) are shared between row(0) cells, S2(0) - S2(7), and C2A(1,0-3) and C2B(1,0-3) are shared between row(1) cells, S2(8) - S2(15). Control select signals, CS2(0,0-3) - CS2(7,0-3), are used to perform the bitwise selection of each of the shared control bits provided to each cell. Each of the 4-bit control select buses is common to all cells within the same column. For example, column(0) cells, S2(0), S2(8), . . . S2(120), share control select bus CS2(0,0-3) and column(1) cells, S2(1), S2(9), . . . S2(121), share control select bus CS2(1,0-3).
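The way row-shared control values and column-shared control select values combine in stage 2 can be illustrated with a behavioural sketch. As an assumption of this sketch, a control select bit of 0 chooses the C2A bit and 1 chooses the C2B bit; the function names and the example control tables are hypothetical.

```python
ROWS, COLS, WIDTH = 16, 8, 4


def select_control(a, b, cs, width=WIDTH):
    """Bitwise choice between a and b; cs bit 1 -> b, 0 -> a (assumed polarity)."""
    mask = (1 << width) - 1
    return ((a & ~cs) | (b & cs)) & mask


def stage2(d2, c2a, c2b, cs2):
    """c2a and c2b hold one 4-bit value per row; cs2 holds one 4-bit value per column."""
    return [[d2[select_control(c2a[r], c2b[r], cs2[c])][c] for c in range(COLS)]
            for r in range(ROWS)]


# Position labels are used instead of bits so that data movement is visible.
labels = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]
c2a = list(range(ROWS))                                    # identity: row r reads row r
c2b = [(r - 1) % ROWS for r in range(ROWS)]                # row r reads the row above it
cs2 = [0b1111 if c % 2 else 0b0000 for c in range(COLS)]   # odd columns use C2B
d3 = stage2(labels, c2a, c2b, cs2)
```

In this example the even columns are passed through unchanged while the odd columns are moved down by one row, even though only two control values per row and one control select value per column are supplied.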
Stage 3 of the Basic Two-dimensional System
Figure 8 illustrates a block diagram of a third stage of the system shown in Figure 5 for the particular case in which the data word has 128 bits and is viewed as a two-dimensional rectangle having 16 rows and 8 columns. As can be seen, stage 3 comprises 128 cells, each cell providing one bit of data by performing an 8-to-1 row multiplex operation. Each cell couples one bit of data to one bit line of the sixteen 8-bit output buses, D4(0,0-7) - D4(15,0-7). The input data provided from the previous stage, i.e. D3(0,0-7) - D3(15,0-7), are each coupled to a row of cells. For instance, input bus D3(0,0-7) is coupled to each of the row(0) cells, S3(0), S3(1), . . . S3(7).
Control is provided to each cell by three 3-bit control buses (C3A, C3B, and C3C) and three 3-bit control select buses (CS3A, CS3B, and CS3C). There are 8 sets of C3A, C3B, and C3C control buses, i.e. C3A(0,0-2) - C3A(7,0-2), C3B(0,0-2) - C3B(7,0-2), and C3C(0,0-2) - C3C(7,0-2). Each set of control buses is shared by one column of cells. For example, buses C3A(0,0-2), C3B(0,0-2), and C3C(0,0-2) are shared between column(0) cells, S3(0), S3(8), . . . S3(120), and buses C3A(1,0-2), C3B(1,0-2), and C3C(1,0-2) are shared between column(1) cells, S3(1), S3(9), . . . S3(121).
There are 16 sets of CS3A, CS3B, and CS3C control select signals, i.e. CS3A(0,0-2) - CS3A(15,0-2), CS3B(0,0-2) - CS3B(15,0-2), and CS3C(0,0-2) - CS3C(15,0-2). Each set of CS3A, CS3B, and CS3C control select buses is shared by a given row of cells. For example, row(0) cells, S3(0) - S3(7), share control select buses CS3A(0,0-2), CS3B(0,0-2), and CS3C(0,0-2), and row(1) cells, S3(8) - S3(15), share control select buses CS3A(1,0-2), CS3B(1,0-2), and CS3C(1,0-2). For a given cell, three control select bits (one from each of the three control select buses) are used to select between three control bits (one from each of the three control buses). For instance, control select bits CS3A(0,0), CS3B(0,0), and CS3C(0,0) select between the three control bits C3A(0,0), C3B(0,0), and C3C(0,0), control select bits CS3A(0,1), CS3B(0,1), and CS3C(0,1) select between the three control bits C3A(0,1), C3B(0,1), and C3C(0,1), and control select bits CS3A(0,2), CS3B(0,2), and CS3C(0,2) select between the three control bits C3A(0,2), C3B(0,2), and C3C(0,2). To achieve this type of selection, one of the three control select bits is "1" while the other two are "0".
Physical Placement of Cells and Stages in the Two-Dimensional System
As can be seen in Figures 6-8, the cell and bus configurations are kept straight to reduce wire length. In a circuit comprising so many buses and cells, this particular feature of the present invention greatly optimizes implementation of the system and method of the present invention. For instance, the data from stage 1 flows in horizontally on bus D1 and flows out vertically on bus D2; the data from stage 2 is designed to flow in vertically and flow out horizontally; and finally, the data from stage 3 is designed to flow in horizontally and flow out horizontally.
Thus, the data flow for stages 1-3 as a unit is horizontal, while internally, buses are positioned to enhance the flow of data between stages.
Figures 9A-9D illustrate several embodiments of the physical layout of the three stages. These embodiments optimize the data flow of the three stages as a unit (i.e. horizontal data flow) and also the internal data flow between stages. It should be understood that other layouts may be possible and the present invention is not limited to the configurations shown in Figures 9A-9D.
Figure 9A illustrates a first embodiment of the physical layout of the three stages. In this embodiment, the first and second stage cells are physically interleaved to form a single block. Each cell within the merged first and second stage block comprises a first stage cell and a second stage cell. In this configuration, data flows horizontally into the merged first and second stage block and flows horizontally into and out of the third stage block. Within the merged first and second stage block, data flows vertically from the first stage to the second stage. In the next embodiment of the layout (Figure 9B), each of the stages is separate and data flows as would be expected: horizontally into stage 1, vertically out of stage 1, vertically into stage 2, horizontally out of stage 2, horizontally into stage 3, and horizontally out of stage 3. It can be seen that the overall horizontal flow of data through the unit is preserved. However, the horizontal flow of data is skewed, such that the data flowing into stage 1 is positioned higher than the data flowing out of stage 3.
Another embodiment of the layout of the three stages merges the second and third stages (Figure 9C) while still another merges all three of the stages (Figure 9D).
It should be understood that the cells in Figures 6-8 are shown being consecutively arranged in rows from S(0) - S(127). For instance, row(0) comprises cells S(0) - S(7), row(1) comprises S(8) - S(15), and so on until the last row(15), which comprises cells S(120) - S(127). However, it should be understood that the cells may also be arranged in a different order. For instance, in one embodiment of the present invention, rows are interleaved such that row(0) comprises cells S(0) - S(7), row(1) comprises cells S(64) - S(71), row(2) comprises cells S(8) - S(15), and row(3) comprises cells S(72) - S(79). The last two rows in this interleaved embodiment are arranged such that row(14) comprises S(56) - S(63) and row(15) comprises cells S(120) - S(127). One reason for arranging the cells in this manner is that if the data provided to the input of the three stages originates from two interleaved 64-bit registers, it may be more convenient to preserve this interleaving. Another reason that cell interleaving may be employed is to facilitate the sharing of control buses, control select buses, and any other control or data buses as will be described below.
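The interleaved row arrangement can be sketched as a mapping from physical row to cell numbers. The text lists only rows 0-3, 14, and 15; the formula below assumes the same alternating pattern holds for the rows in between, and the function name is hypothetical.

```python
def interleaved_row_cells(row, rows=16, cols=8):
    """Cell numbers placed in a physical row of the interleaved arrangement.

    Even rows are assumed to come from the low 64 bits and odd rows from the
    high 64 bits, extending the pattern of the rows listed in the text.
    """
    source_row = row // 2 + (rows // 2) * (row % 2)
    start = source_row * cols
    return list(range(start, start + cols))


first_rows = [interleaved_row_cells(r)[0] for r in range(4)]   # first cell of rows 0-3
```

Under this assumption, rows 0 to 3 begin with cells S(0), S(64), S(8), and S(72), and rows 14 and 15 hold S(56) - S(63) and S(120) - S(127), as stated above.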
Simplified Embodiment of a Stage 1 Cell of the Two-dimensional System
Figure 10 illustrates a simplified embodiment of a stage 1 cell S1(0) as shown in Figure 6 comprising data multiplexer DMX1, control multiplexers CMX1(1-3), and flip-flops FF1(1-4).
DMX1 is a combined decoder and multiplexer, such that the control signals coupled to the select inputs of the multiplexer are encoded and the multiplexer decodes these control signals to determine which data on its input to pass to its output. For instance, a control input signal of "011" (i.e. 3 in decimal) on the control input of DMX1 passes D1(0,3) to the output of DMX1.
Each of the eight inputs of DMX1 is coupled to one bit of data provided from bus D1(0,0-7). The three data select inputs of DMX1 are each coupled to the output of one of CMX1(1-3) through each of FF1(1-3). In response to the control signals provided from CMX1(1-3), DMX1 outputs one of its eight input data bits through FF1(4) to one bit line within bus D2(0-15,0).
As shown in Figures 6 and 10, bus D2(0-15,0) is a sixteen-bit vertical column bus that runs along cells S1(0), S1(8), . . . S1(120). Each of these sixteen cells in this particular column couples one bit of data to a different bit line within bus D2(0-15,0). For instance, cell S1(0) couples one bit of data to data line D2(0,0), as shown in Figure 10. The cell directly below cell S1(0), i.e. S1(8) as shown in Figure 6, couples one bit of data to D2(1,0), and so on for all cells within that column. Since an 8-to-1 decoder/multiplexer is being used, three bits of control are required to select 1 of the 8 inputs. Each of control buses C1A(0,0-2) and C1B(0,0-2) provides three bits of control values. One bit from each of control buses C1A(0,0-2) and C1B(0,0-2) is coupled to each of CMX1(1-3). The 3-bit control select bus CS1(0,0-2) performs a bitwise selection of the control values from each of control buses C1A(0,0-2) and C1B(0,0-2) and determines whether the selected control value of a particular control multiplexer comes from the A or B control bus.
It should be noted that in the embodiment shown in Figure 10, FF1(1-4) are used in order to implement a pipelined system. However, these flip-flops may not be required if pipelining is not performed. In other instances, more flip-flops may be added or moved to different data paths to achieve different types of pipelining. Furthermore, these flip-flops may also be replaced with latches which perform the same function as the flip-flop in a pipelined system.
Figure 10 illustrates a first stage cell in a simplified form. However, other implementations of the first stage cell may comprise other circuitry to enhance its capabilities in order to increase the number of operations that the system of the present invention is capable of supporting.
Simplified Embodiment of a Stage 2 Cell of the Two-dimensional System
Figure 11 (including Figures 11A and 11B) illustrates one embodiment of a stage 2 cell S2(0) including data multiplexer DMX2, control multiplexers CMX2(1-4), and flip-flops FF2(1-5). As with the previous stage, DMX2 is a combined decoder/multiplexer, FF2(1-5) may or may not be needed depending on whether and what type of pipelining is used, and the flip-flops may be substituted with latches.
Bus D2(0-15,0) from the previous stage provides the input data bits to DMX2. The control values for DMX2 are provided by CMX2(1-4) through FF2(1-4). The output of DMX2 is coupled to bit line D3(0,0) within output bus D3(0,0-7) through FF2(5). Output bus D3(0,0-7) comprises eight bit lines, one for each cell within the same row.
As shown in Figures 7 and 11, bus D3(0,0-7) is an eight-bit horizontal row bus that runs along row(0) cells, S2(0) - S2(7). Each of the eight cells in this row couples one bit of data to a different bit line within bus D3(0,0-7). For instance, cell S2(0) couples one bit of data to data line D3(0,0), as shown in Figure 11. The cell directly adjacent to cell S2(0) in the same row, i.e. cell S2(1) as shown in Figure 7, couples one bit of data to D3(0,1), and so on for all cells within that row.
Control data lines C2A(0,0-3) and C2B(0,0-3) provide one control bit to each of CMX2(1-4). One control select data bit from control select bus CS2(0,0-3) is coupled to a corresponding one of the select inputs of CMX2(1-4). The select signal provided by bus CS2(0,0-3) selects a control bit from either of control buses C2A(0,0-3) and C2B(0,0-3) so as to allow a bitwise selection of the control bits provided from these buses.
Simplified Embodiment of a Stage 3 Cell of the Two-dimensional System
Figure 12 (including Figures 12A and 12B) illustrates one embodiment of a third stage cell S3(0) comprising data multiplexer DMX3, control multiplexers CMX3(1-3), and flip-flops FF3(1-4). DMX3 is a combined decoder and multiplexer and CMX3(1-3) are conventional multiplexers. Thus, DMX3 requires 3 bits of control to select between its 8 data inputs, whereas CMX3(1-3) each require 3 exclusive control bits to select 1 of 3 data inputs. As with the previous stages, FF3(1-4) may or may not be employed or may be substituted with latches.
Each of the eight data inputs of DMX3 is coupled to one bit of data provided from bus D3(0,0-7). The three data select inputs of DMX3 are each coupled to the output of one of CMX3(1-3) through each of FF3(1-3). In response to the control signals provided from CMX3(1-3), DMX3 outputs one of its eight input data bits to the input D of FF3(4). Output Q of FF3(4) passes the data output from DMX3 to line D4(0,0) in output bus D4(0,0-7). As can be seen in Figure 12, the input and output buses are both horizontal. However, for the embodiment shown, the output bus could have easily been a vertical bus.
As shown in Figures 8 and 12, bus D4(0,0-7) is an eight-bit horizontal row bus that runs along row(0) cells, S3(0) - S3(7). Each of the eight cells in this row couples one bit of data to a different bit line within bus D4(0,0-7). For instance, cell S3(0) couples one bit of data to data line D4(0,0), as shown in Figure 12. The cell directly adjacent to cell S3(0) in the same row, i.e. cell S3(1) as shown in Figure 8, couples one bit of data to D4(0,1), and so on for all cells within that row.
Three 3-bit control buses, C3A(0,0-2), C3B(0,0-2), and C3C(0,0-2), provide one control bit to each of CMX3(1-3). Three 3-bit control select buses, CS3A(0,0-2), CS3B(0,0-2), and CS3C(0,0-2), perform a bitwise selection between the control bits provided by buses C3A(0,0-2), C3B(0,0-2), and C3C(0,0-2) and determine whether the selected control value of a particular control multiplexer comes from the A, B, or C control bus.
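The three-way, one-hot selection performed by CMX3(1-3) can be sketched as follows. The function name is hypothetical, and raising an error on a select pattern that is not one-hot is a choice of this sketch, since the description does not say how the hardware behaves in that case.

```python
def select_control_3way(c3a, c3b, c3c, cs3a, cs3b, cs3c, width=3):
    """For each bit position, pick the control bit whose select bit is the single 1."""
    result = 0
    for i in range(width):
        hot = ((cs3a >> i) & 1, (cs3b >> i) & 1, (cs3c >> i) & 1)
        if sum(hot) != 1:
            raise ValueError(f"control select for bit {i} is not one-hot: {hot}")
        source = (c3a, c3b, c3c)[hot.index(1)]
        result |= ((source >> i) & 1) << i
    return result


# Bit 0 is taken from C3A, bit 1 from C3C, bit 2 from C3B.
effective = select_control_3way(0b000, 0b111, 0b010,
                                cs3a=0b001, cs3b=0b100, cs3c=0b010)
```

The resulting 3-bit value (0b110 in this example) is what would be applied to the select inputs of DMX3.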
A First Modification to Stage 1 Cell of the Two-dimensional System
Figure 13 (including Figures 13A and 13B) illustrates a modified stage 1 cell S1(0) having an additional 16-bit input load align data bus DCH1(0-15,0) which provides data to override the data provided by multiplexer DMX1. This modification is employed when performing load align operations. As shown in Figure 13, DCH1(0-15,0) runs parallel to output bus D2(0-15,0). Data line DCH1(0,0) is coupled to the 1 input of multiplexer MX1 and the other input 0 of MX1 is coupled to the output of DMX1. The select input of MX1 is coupled to control signal SDCH. Control signal SDCH determines whether the data coupled to output bus D2 comes from the override bit line DCH1(0,0) or whether the data comes from the multiplex operation performed by DMX1. There are eight 16-bit DCH1 buses, for a total of 128 bits, with each bus being common to a given column of cells and providing one bit of data to each cell in the given column. As shown in Figure 13, DCH1(0-15,0) is a sixteen-bit vertical column bus that runs along cells S1(0), S1(8), . . . S1(120). Each of these sixteen cells in this particular column receives its one bit of override data from a different bit line within DCH1(0-15,0). For instance, cell S1(0) receives one bit of data from data line DCH1(0,0), as shown in Figure 13. The cell directly below cell S1(0), i.e. S1(8) as shown in Figure 6, receives one bit of data from DCH1(1,0), and so on for all cells within that column. It should be noted that since each bit line in the DCH1 bus is used by a single cell, this bus may be oriented either vertically (as shown in Figure 13) or horizontally. It should be obvious that if oriented horizontally, an 8-bit horizontal DCH1 bus would be used to provide the load align data to the eight row(0) cells instead of the vertical 16-bit DCH1 bus shown in Figure 13. Furthermore, in an embodiment in which the orientation of the DCH1 bus is horizontal, a total of sixteen 8-bit buses are used to provide the load align data to all of the stage 1 cells. Finally, it should be understood that the orientation of the DCH1 bus is dependent on the direction that the load align data is supplied to stage 1.
Note that the SDCH signal is common to all cells in stage 1 for this embodiment. The SDCH signal is distributed to all cells by creating eight copies of the signal, each of which is shared by all cells within the same column. Note, this signal may be distributed in other manners, such as creating 16 copies each of which is shared by cells within the same row or by creating eight copies which are shared by cells in adjacent rows. It should be noted that the previously described elements shown in Figure 13 perform the same function as described in conjunction with Figure 10.
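The override path described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit itself: the function names, the list-based representation of the buses, and the indexing `dch1[col][row]` for bit line DCH1(row, col) are assumptions made for the example.

```python
ROWS, COLS = 16, 8  # sixteen rows of eight stage 1 cells (128 cells total)

def mx1(dmx1_bit: int, dch1_bit: int, sdch: int) -> int:
    """MX1: input 0 is the DMX1 output, input 1 is the DCH1 override bit."""
    return dch1_bit if sdch else dmx1_bit

def stage1_outputs(dmx1_bits, dch1, sdch):
    """Return the 128 stage 1 output bits.

    dmx1_bits[n]   -- bit produced by DMX1 in cell S1(n)
    dch1[col][row] -- bit line DCH1(row, col) of the vertical column bus
    sdch           -- select signal shared by all stage 1 cells
    """
    out = []
    for row in range(ROWS):
        for col in range(COLS):
            n = row * COLS + col  # cell S1(n); S1(8) sits directly below S1(0)
            out.append(mx1(dmx1_bits[n], dch1[col][row], sdch))
    return out
```

With SDCH low every cell forwards its DMX1 bit; with SDCH high every cell forwards its own DCH1 bit line instead, so cell S1(8) takes DCH1(1,0).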
A First Modification to Stage 3 Cell of the Two-dimensional System
Figure 14 (including Figures 14A - 14D) illustrates a modified stage 3 cell S3(0). The implementation of the third stage cell shown in Figure 14 is designed to support fill operations performed by the system of the present invention in which one or more bit locations are filled with a bit provided from a fill bus (i.e. F3(0,0-7), Figure 14). For some operations, the fill bus may contain an additional data operand, while for other operations it may contain all ones, or all zeros. There are sixteen 8-bit F3 buses, for a total of 128 bits, with each bus being common to a given row of cells and providing one bit of data to each cell in the given row. As shown in Figure 14, F3(0,0-7) is an eight-bit horizontal row bus that runs along cells S3(0) - S3(7). Each of these eight cells in this particular row receives its fill bit from a different bit line within F3(0,0-7). For instance, cell S3(0) receives one bit of data from data line F3(0,0), as shown in Figure 14. The cell adjacent to cell S3(0), i.e. S3(1) as shown in Figure 8, receives one bit of data from F3(0,1), and so on for all cells within that row.
In Figure 14, all of the elements described in Figure 12, including DMX3, CMX3(1-3), and FF3(1-4), function the same as the elements in Figure 12. In addition, Figure 14 includes conventional multiplexers CMX3(4) and MX3, and FF3(5).
Input 0 of MX3 is coupled to the output Q of DMX3 and input 1 of MX3 is coupled to a data bit provided from fill line F3(0,0) in fill bus F3(0,0-7). The select input of MX3 is coupled to the output Q of FF3(5). The input D of FF3(5) is coupled to the output Q of CMX3(4). CMX3(4) provides the control signal to MX3 and determines whether the data coupled to output data line D4(0,0) comes from the fill bus or from DMX3. As can be seen, conventional multiplexer CMX3(4) is controlled by 4 bit lines ZS3A(0) - ZS3D(0), of which exactly one is "hot" (i.e. only one is high and the remainder are low). Control select bit lines ZS3A(0) - ZS3D(0) are shared by all cells in the same row.
Control is generated by CMX3(4) in the following manner:
1) In the case in which ZS3A(0) is "1", CMX3(4) passes a "1" to the select input of MX3 and causes MX3 (and the MX3 of every other cell in that row) to couple the fill data provided from F3(0,0-7) to bus D4. For example, MX3 will couple the fill bit from F3(0,0) to D4(0,0);
2) In the case in which ZS3D(0) is "1", CMX3(4) passes a "0" to the select input of MX3 and causes MX3 (and the MX3 of every other cell in that row) to couple the data provided from the data multiplexer to bus D4. For example, MX3 will couple the data provided from DMX3 to D4(0,0);
3) In the case in which either of ZS3B(0) or ZS3C(0) is "1", CMX3(4) passes the control bit provided on control line Z3B(0) or Z3C(0), respectively, so that MX3 couples either the data from the data multiplexer or the data from the fill bus to the data output bus. For instance, if ZS3C(0) is "1" and control bit Z3C(0) is "1", then a fill bit is passed to the data output bus. However, if Z3C(0) is "0", the data comes from DMX3. Similarly, when ZS3B(0) is "1", control bit Z3B(0) determines whether the data coupled to the data output bus comes from the fill bit or from DMX3.
As shown in Figure 14, each of bit lines Z3B(0) and Z3C(0) provides one bit of control to cell S3(0). Bit lines Z3B(0) and Z3C(0) are column bit lines that run vertically along the column(0) cells (refer to Figure 8). Thus, Z3B(0) and Z3C(0) are also coupled to each of the cells in column(0). Furthermore, each of columns(0-7) in stage 3 has a unique set of Z3B and Z3C bit lines, making a total of 8 Z3B bits and 8 Z3C bits. For instance, the column of cells adjacent to column(0), i.e. column(1), is coupled to bit lines Z3B(1) and Z3C(1).
As shown in Figure 14, each of bit lines ZS3A(0) - ZS3D(0) provides one bit of control to cell S3(0). Bit lines ZS3A(0) - ZS3D(0) are row bit lines that run horizontally along the row(0) cells, S3(0) - S3(7). Thus, ZS3A(0) - ZS3D(0) are also coupled to each of the cells in row(0). Furthermore, each of rows(0-15) in stage 3 has a unique set of ZS3A - ZS3D bit lines, making a total of 16 ZS3A bits, 16 ZS3B bits, 16 ZS3C bits, and 16 ZS3D bits. For instance, the row of cells below row(0), i.e. row(1), is coupled to bit lines ZS3A(1) - ZS3D(1). Providing control in this manner allows for selection of whether the data coupled to output bus D4 from each cell in a given row is taken from the fill bus or the data multiplexers on a column-by-column basis, thereby greatly enhancing the flexibility of fill-type operations.
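The fill control described above, with one-hot row selects ZS3A - ZS3D and per-column control bits Z3B and Z3C, can be modelled with the following Python sketch. The function names and the list-based data layout are hypothetical, and the flip-flops (such as FF3(5)) are omitted, so this models only the combinational behavior.

```python
def cmx3_4(zs3a, zs3b, zs3c, zs3d, z3b, z3c):
    """CMX3(4): one-hot select among constant 1, Z3B, Z3C and constant 0."""
    assert zs3a + zs3b + zs3c + zs3d == 1, "exactly one select line must be hot"
    if zs3a:
        return 1      # always take the fill bit
    if zs3b:
        return z3b    # column control bit Z3B decides
    if zs3c:
        return z3c    # column control bit Z3C decides
    return 0          # ZS3D: always take the DMX3 data

def stage3_fill(dmx3_bits, fill_bits, row_selects, z3b, z3c):
    """Return the 128 D4 output bits.

    row_selects[row] -- tuple (ZS3A, ZS3B, ZS3C, ZS3D) for that row
    z3b[col], z3c[col] -- column control bits
    fill_bits[n] -- fill bit F3(row, col) for cell n = row * 8 + col
    """
    out = []
    for row in range(16):
        for col in range(8):
            n = row * 8 + col
            use_fill = cmx3_4(*row_selects[row], z3b[col], z3c[col])
            out.append(fill_bits[n] if use_fill else dmx3_bits[n])  # MX3
    return out
```

Setting ZS3C hot in a row and loading Z3C with the pattern 1,1,1,1,0,0,0,0 fills only the first four columns of that row, which is the column-by-column selection the text describes.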
A Second Modification to Stage 3 Cell of the Two-dimensional System
Figure 15 (including Figures 15A - 15F) illustrates a third embodiment of a third stage cell S3(0) including a set of buses that provide additional multiplexer control to the third stage cell. To implement this embodiment, additional control bits and control select bits are added to the stage 3 cell. Referring to Figure 15, in addition to buses C3A(0,0-2) - C3C(0,0-2), buses M3A(0,0-7) - M3C(0,0-7) also provide control bits to the inputs of multiplexers CMX3(1-3). Also, in addition to buses CS3A(0,0-2) - CS3C(0,0-2), buses CS3D(0,0-2) and CS3E(0,0-2) provide control select bits to each of CMX3(1-3). Each of buses M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) comprises 8 bits of data. As shown in Figure 15, M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) are three eight-bit horizontal row buses that run along cells S3(0) - S3(7). Each cell in this particular row receives one bit from each of the M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) buses. For instance, cell S3(0) receives one bit from bus M3A(0,0-7), i.e. M3A(0,0), one bit from bus M3B(0,0-7), i.e. M3B(0,0), and one bit from bus M3C(0,0-7), i.e. M3C(0,0), as shown in Figure 15. The cell directly adjacent to cell S3(0), i.e. S3(1), receives three bits of data from different bit lines within buses M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7), i.e. from bit lines M3A(0,1), M3B(0,1), and M3C(0,1), and so on for all cells within that row.
Furthermore, each row of cells in stage 3 has a different set of corresponding M3A - M3C buses. For instance, buses M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) provide data to the cells in row(0), buses M3A(1,0-7), M3B(1,0-7), and M3C(1,0-7) provide data to the cells in row(1), and so on for each row of cells in stage 3.
For the stage 3 cell shown in Figure 15, i.e. S3(0), bus M3A(0,0-7) provides two additional control bits to the inputs of CMX3(3). Similarly, bus M3B(0,0-7) provides two additional control bits to CMX3(2) and M3C(0,0-7) provides two additional control bits to CMX3(1). As can be seen in Figure 15, one of the two additional control bits applied to CMX3(3) is provided from bus M3A(0,0-7) and the other of the two additional bits is the complement of the bit provided from M3A(0,0-7) (as indicated by inverted input 3 of CMX3(3) shown in Figure 15). Similarly, one of the additional control bits coupled to CMX3(2) is the bit provided from the M3B(0,0-7) bus and the other is its complement. Further, one of the additional control bits coupled to CMX3(1) is the bit provided from bus M3C(0,0-7) and the other is its complement.
Since CMX3(1-3) are conventional multiplexers, additional control select inputs are needed to allow for selection of the additional control bits provided by buses M3A(0,0-7) - M3C(0,0-7). The additional control select bits are provided from buses CS3D(0,0-2) and CS3E(0,0-2). Referring to Figure 15, 3-bit bus CS3D(0,0-2) provides one bit of control to each of CMX3(1-3) and 3-bit bus CS3E(0,0-2) provides one bit of control to each of CMX3(1-3). Specifically, CMX3(1) receives control from bit lines CS3D(0,2) and CS3E(0,2), CMX3(2) receives control from CS3D(0,1) and CS3E(0,1), and CMX3(3) receives control from CS3D(0,0) and CS3E(0,0). Buses CS3D(0,0-2) and CS3E(0,0-2) are shared across rows in the same manner as buses CS3A(0,0-2) - CS3C(0,0-2).
As with buses CS3A - CS3C, each row of cells in stage 3 has a different set of corresponding CS3D and CS3E buses. For instance, buses CS3D(0,0-2) and CS3E(0,0-2) provide data to the cells in row(0), buses CS3D(1,0-2) and CS3E(1,0-2) provide data to the cells in row(1), and so on for each row of cells in stage 3.
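A minimal Python sketch of one extended control multiplexer follows. It assumes that each CMX3 is a one-hot multiplexer with five inputs, that input ports 0-2 carry the original control bits and are selected by CS3A - CS3C, and that CS3D and CS3E select ports 3 and 4 respectively, with port 3 being the inverted one. The select-to-port mapping is an assumption made for illustration; the text states only that ports 3 and 4 carry the M3 bit and its complement.

```python
def cmx3(control_bits, m_bit, selects):
    """Model one of CMX3(1-3) with the two additional inputs.

    control_bits -- three bits, one each from the C3A, C3B and C3C buses
    m_bit        -- the bit from the M3A, M3B or M3C bus for this cell
    selects      -- five one-hot select bits (assumed order CS3A..CS3E)
    """
    inputs = list(control_bits) + [m_bit ^ 1, m_bit]  # port 3 inverted (assumed)
    assert len(inputs) == 5 and len(selects) == 5
    assert sum(selects) == 1, "exactly one select line must be hot"
    return inputs[list(selects).index(1)]
```

Because the complement is available as a separate input, the same M3 bit can drive the multiplexer control in either polarity without a second bus.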
A Third Modification to Stage 3 Cell of the Two-dimensional System
Figure 16 (including Figures 16A - 16F) illustrates still another embodiment of a third stage cell S3(0), which incorporates circuitry that both allows for fill-type operations and provides additional multiplexer control. As shown in Figure 16, this embodiment includes circuitry to allow for a fill operation to be performed, i.e. fill bus F3(0,0-7) for providing fill bits to MX3, control bit lines Z3B(0) and Z3C(0) for providing data to multiplexer CMX3(4), and control select bits ZS3A(0) - ZS3D(0) for providing select bits to CMX3(4). In addition, Figure 16 includes circuitry which provides additional stage 3 multiplexer control, i.e. buses M3A(0,0-7) - M3C(0,0-7) and additional control select buses CS3D(0,0-2) and CS3E(0,0-2).
A Fourth Modification to Stage 3 Cell of the Two-dimensional System Employing Bus Overloading
Figures 12 and 14-16 illustrate a third stage cell for generating a single bit of data. The cell is implemented such that it employs buses that are used exclusively to provide a particular type of data to the cell. For instance, the M3A(0,0-7) - M3C(0,0-7) buses shown in Figure 16 are used exclusively to provide control bits to the CMX3(1-3) multiplexers. Similarly, fill bus F3(0,0-7) is used to provide bits of fill data to MX3. However, in another embodiment of the present invention, a single bus is used to provide two types of data (referred to as bus overloading).
Figure 17 illustrates the arrangement of the control and data buses for an embodiment of the S3(0) third stage cell in which the bus providing fill values also provides additional multiplex control values to one of control multiplexers CMX3(1-3). Figure 17 shows bit lines F3(0,0)/M3A(0,0), M3B(0,0), and M3C(0,0).
Figure 17 also shows the other buses or bit lines coupled to the cell, i.e. D4(0,0), D3(0,0-7), CS3A(0,0-2) - CS3E(0,0-2), Z3B(0), Z3C(0), ZS3A(0) - ZS3D(0), and C3A(0,0-2) - C3C(0,0-2). Bit lines M3B(0,0) and M3C(0,0) are each coupled to input ports 3 and 4 of CMX3(2) and CMX3(1), respectively, as in Figure 16. However, bit line F3(0,0)/M3A(0,0) is coupled both to input 1 of MX3, through an additional flip-flop (not shown), and to input ports 3 and 4 of CMX3(3). The system is designed such that bit line F3(0,0)/M3A(0,0) is used for either providing a fill bit to MX3 or an input bit to input ports 3 and 4 of CMX3(3), but typically not both. If bus F3(0,0)/M3A(0,0) is providing a fill bit to MX3 in a particular operation, then the control bits on inputs 3 and 4 of CMX3(3) are generally not used. Conversely, if the data on bus F3(0,0)/M3A(0,0) is providing control to inputs 3 and 4 of CMX3(3), then a fill operation is not being performed. Furthermore, it is unlikely that data on this overloaded bus would be meaningful to both operations at once.
It should be noted that the reason that data is coupled through the additional flip-flop to MX3 (as described above) is so that the data on bus F3(0,0)/M3A(0,0) is used in the same relative clock cycle regardless of its use. If this were not the case, the system would not be able to support full pipelining. Of course, this additional flip-flop is only required in embodiments which include FF3(1-3).
It should also be obvious that either M3B or M3C could be used in place of M3A for purposes of bus overloading.
A Fifth Modification to Stage 3 Cell of the Two-dimensional System Employing Bus Sharing and Bus Overloading
Figure 18 illustrates still another embodiment of the third stage cell of the present invention in which both bus sharing and bus overloading are employed. In this embodiment, each third stage cell is designed to actually include two cells. This embodiment is particularly adaptable when input data is stored in two 64-bit registers and bit lines of the registers are interleaved in a particular regular pattern. The interleave pattern is such that the first row of bits includes bits S(0) - S(7), the second row of bits includes bits S(64) - S(71), the third row includes S(8) - S(15), and the fourth row includes S(72) - S(79), etc.
Put in terms of interleaving rows, the above interleaving sequence is achieved by the following row interleaving configuration: row(0), row(8), row(1), row(9), row(2), row(10) . . . row(7), row(15). Thus row(0) and row(8) are adjacent rows, row(1) and row(9) are adjacent, and so on. In this case the first two adjacent rows are configured such that cells S3(0) and S3(64) are adjacent, cells S3(1) and S3(65) are adjacent, cells S3(2) and S3(66) are adjacent, and so on.
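The row interleaving sequence above can be generated with a short Python sketch. The function names are hypothetical; the 16-row by 8-column layout is taken from the text.

```python
def interleaved_row_order(rows: int = 16) -> list:
    """Physical row order row(0), row(8), row(1), row(9), ..., row(7), row(15)."""
    half = rows // 2
    order = []
    for k in range(half):
        order.extend([k, k + half])  # a low-half row next to its high-half partner
    return order

def first_bit_of_row(row: int, cols: int = 8) -> int:
    """Index of the first bit in a logical row, e.g. row(8) starts at S(64)."""
    return row * cols
```

For example, the first four physical rows are row(0), row(8), row(1) and row(9), which start at bits S(0), S(64), S(8) and S(72) respectively, matching the pattern in the text.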
Figure 18 illustrates a pair of adjacent cells for the case in which the row interleaving described above is employed: a first cell S3(0) corresponding to bit 0 from row(0) and a second cell S3(64) corresponding to bit 64 from row(8). Please note that the following description refers to buses using the following format: the prefix indicates the bus name, the first number in the parentheses indicates the row or column number of the given bus in the interleaving configuration, and the second number in the parentheses indicates the bit number within that given bus.
Each of cells S3(0) and S3(64) includes all of the circuit elements shown in Figure 16. Specifically, both S3(0) and S3(64) include DMX3, CMX3(1-4), MX3, and FF3(1-5). In addition, cell S3(0) is shown being coupled to other buses or bit lines in the same manner as described in conjunction with Figure 16, i.e. D4(0,0), D3(0,0-7), CS3A(0,0-2) - CS3E(0,0-2), and ZS3A(0) - ZS3D(0).
Similarly, S3(64) is shown being coupled to buses D4(8,0), D3(8,0-7), CS3A(8,0-2) - CS3E(8,0-2), and ZS3A(8) - ZS3D(8). Cells S3(0) and S3(64) also share some control buses/bit lines, i.e. Z3B(0), Z3C(0), and C3A(0,0-2) - C3C(0,0-2), as described in previous embodiments.
The fill bus and the additional multiplexer control buses are both shared and overloaded in the cell shown in Figure 18. First, S3(0) and S3(64) share the same M3A - M3C buses (instead of each having a separate set of M3A - M3C buses). Referring to Figure 18, each of buses F3(0,0)/M3A(0,0)/M3A(8,0), F3(8,0)/M3B(0,0)/M3B(8,0), and M3C(0,0)/M3C(8,0) is coupled to the CMX3(1-3) multiplexers of both S3(0) and S3(64). Thus, F3(0,0)/M3A(0,0)/M3A(8,0) provides the M3A(0,0) and M3A(8,0) data bits to input ports 3 and 4 of the CMX3(3) control multiplexers in S3(0) and S3(64), respectively. Similarly, F3(8,0)/M3B(0,0)/M3B(8,0) provides the M3B(0,0) and M3B(8,0) data bits to input ports 3 and 4 of the CMX3(2) control multiplexers in S3(0) and S3(64), respectively. And finally, M3C(0,0)/M3C(8,0) provides the M3C(0,0) and M3C(8,0) data bits to input ports 3 and 4 of the CMX3(1) control multiplexers in S3(0) and S3(64), respectively.
Due to this type of bus sharing between contiguous rows of cells, pairs of contiguous rows of cells receive the same additional multiplexer control instead of each row receiving a unique set of additional multiplexer control buses. However, the number of M3A - M3C buses is halved.
The same type of bus overloading as shown in Figure 17 is employed in the embodiment of the third stage cell shown in Figure 18. As shown in Figure 18, bus F3(0,0)/M3A(0,0)/M3A(8,0) provides the F3(0,0) data bit to cell S3(0) and bus F3(8,0)/M3B(0,0)/M3B(8,0) provides the F3(8,0) data bit to cell S3(64). Buses F3(0,0)/M3A(0,0)/M3A(8,0) and F3(8,0)/M3B(0,0)/M3B(8,0) are employed to either provide data that is used for a fill operation or data that is used for control values to the CMX3(3) control multiplexers. As can be seen, these buses are both shared and overloaded. It should be noted that in this particular embodiment, there are still 128 distinct fill bits that may be provided to the third stage cells as with the previous embodiments described above. It should be further noted that as with the previous embodiment shown in Figure 17, the signal coupled from bit line F3(0,0)/M3A(0,0)/M3A(8,0) to MX3 in S3(0) and the signal coupled from bit line F3(8,0)/M3B(0,0)/M3B(8,0) to MX3 in S3(64) are each passed through an additional flip-flop in order to support full pipelining. Of course, this additional flip-flop is only required in embodiments which include FF3(1-3).
The embodiment shown in Figure 17 illustrates bus overloading and the embodiment shown in Figure 18 illustrates both bus sharing and bus overloading. It should be understood that other embodiments of the present invention may employ bus sharing of the additional multiplexer control data buses, without employing bus overloading with the fill bus. For instance, in one embodiment, adjacent rows share one set of M3A - M3C buses, but each of the adjacent rows is coupled to a separate, non-overloaded, unshared fill bus. Furthermore, buses M3A - M3C may also be shared by adjacent rows in embodiments which do not include any fill buses (i.e., F3 buses).
Supported Instructions
The system described above can perform many useful operations on data words. In particular, when the system is part of a computer, it can be used to implement these operations as computer instructions. It should be understood that a given operand for a computer instruction may be taken from an immediate field of the instruction, from a register in the computer, or from some other memory in the computer. Although the choice of which combinations of operand sources to implement as instructions for a given operation is an important architectural consideration in the design of a computer, this choice can for the most part be ignored in the design of the functional unit which performs those operations. In this instance, the functional unit is the system described above.
Special case 2. In this special case, assume that the transpose function r is bijective, that is, both injective (i.e., one-to-one) and surjective (i.e., onto). Then r is invertible, and n' = n. In fact, r is a permutation of the dimensions of the initial rectangle into the dimensions of the final rectangle, and each element in the initial rectangle will appear exactly once in the final rectangle. This case is a pure transpose operation, and it is also a pure permutation of the initial word into the final word. The transpose operation is invertible, with the inverse being another transpose operation defined by the transpose function r^{-1}.
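The invertibility of a pure transpose can be illustrated with a small Python sketch. It assumes a word stored as a flat list in row-major order over the dimensions `dims`, and it represents the transpose function as a tuple `perm` in which final dimension k is taken from initial dimension `perm[k]`. That convention, and all of the names used, are assumptions for the example and may differ from the exact definition of r used elsewhere in this document.

```python
from itertools import product

def flat_index(coords, dims):
    """Row-major position of an element with the given coordinates."""
    idx = 0
    for c, d in zip(coords, dims):
        idx = idx * d + c
    return idx

def transpose(word, dims, perm):
    """Permute the dimensions of a row-major rectangle; returns (word, dims)."""
    new_dims = tuple(dims[p] for p in perm)
    out = [None] * len(word)
    for coords in product(*(range(d) for d in dims)):
        new_coords = tuple(coords[p] for p in perm)
        out[flat_index(new_coords, new_dims)] = word[flat_index(coords, dims)]
    return out, new_dims

def inverse(perm):
    """Inverse permutation, i.e. the transpose function that undoes perm."""
    inv = [0] * len(perm)
    for k, p in enumerate(perm):
        inv[p] = k
    return tuple(inv)
```

Applying `transpose` with `perm` and then with `inverse(perm)` returns the original word and dimensions, and every element appears exactly once in the intermediate result, as expected for a pure permutation.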
Special case 3. This is special case 2 with the added restriction that the number of elements in the initial and final rectangles is a power of two, i.e., a combination of special cases 1 and 2. This combination is singled out here so that it can be referred to later.
Shuffle/BitMux. The combination of the multiway perfect shuffle operations and the bitmux operations described above can be used to apply the extended general method described earlier. In fact, since the multiway perfect shuffle is capable of aligning any consecutive sequence of bits in a bit index on any boundary (in particular, right-justifying them, so that a subsequent bitmux operation can multiplex across the corresponding dimension), the combination supports the general, n-dimensional version of the extended general method, provided none of the dimensions are larger than the bitmux operation can multiplex. In the normal case, the sequence of operations would be bitmux, shuffle, bitmux, shuffle, ..., shuffle, bitmux. The shuffle/bitmux operation combines these two operations, effectively performing first a multiway perfect shuffle, followed by a bitmux operation. This combined operation can therefore significantly reduce the number of operations necessary to apply the extended general method.
Internally, the system described earlier supports any perfect shuffle operation (including the group forms) in combination with any bitmux operation supported in that system, where the group sizes associated with the shuffle part of the operation are independent of the group sizes associated with the bitmux part of the operation. The version of the system described earlier in which the explicit multiplexer control buses in the final stage are not shared across the high and low halves of the data path supports this operation in a form where the bitmux control for the high and low halves of the word is independent, and the version of the system described earlier in which the explicit multiplexer control buses in the final stage are shared across the high and low halves of the data path supports this operation in a form where the bitmux control for the high and low halves of the word is shared.
It turns out that the internal control generated by the system for performing the multiway perfect shuffle operation can be generated in such a way that the explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them. If this were not the case, it would probably be cheaper to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
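The way a perfect shuffle moves bits of the element index can be sketched in Python. The sketch covers only the shuffle half of the combined operation and relies on the standard property that, for a word of 2^k elements, a perfect shuffle sends each element to the position obtained by rotating its k-bit index, with a rotation by m bits corresponding to a 2^m-way shuffle. The function names and the rotation direction are assumptions for illustration, not the control scheme used by the system.

```python
def rotl(index: int, k: int, m: int) -> int:
    """Rotate a k-bit index left by m bit positions."""
    m %= k
    mask = (1 << k) - 1
    return ((index << m) | (index >> (k - m))) & mask

def shuffle(word, m: int = 1):
    """Perfect shuffle of a power-of-two-length word via index rotation."""
    n = len(word)
    k = n.bit_length() - 1
    assert k >= 1 and n == 1 << k, "length must be a power of two, at least 2"
    out = [None] * n
    for src, value in enumerate(word):
        out[rotl(src, k, m)] = value  # element moves to its rotated index
    return out
```

With eight elements, `shuffle(list(range(8)))` interleaves the two halves to give `[0, 4, 1, 5, 2, 6, 3, 7]`, and rotating by a total of k bits restores the original order. Choosing the rotation amount is what lets a chosen group of index bits be right-justified ahead of a bitmux step.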
BitMux/Shuffle. This operation is exactly like the shuffle/bitmux operation, except the bitmux portion of the operation is effectively performed before the shuffle portion, rather than after the shuffle portion. The most practical way for the system described earlier to support this operation would be to build a separate bitmux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the multiway perfect shuffle operation cannot be generated in such a way that first-stage explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them.
Transpose/BitMux. This operation is similar to the shuffle/bitmux operation, except the multiway perfect shuffle part of the operation is replaced with the more general transpose operation, either the pure transpose operation or the extended transpose operation.
First consider the case where the transpose part of the operation is a pure transpose. Internally, the system described earlier supports any pure transpose operation in combination with any bitmux operation supported in that system, where the group sizes associated with the pure transpose part of the operation are independent of the group sizes associated with the bitmux part of the operation. The version of the system described earlier in which the explicit multiplexer control buses in the final stage are not shared across the high and low halves of the data path supports this operation in a form where the bitmux control for the high and low halves of the word is independent, and the version of the system described earlier in which the explicit multiplexer control buses in the final stage are shared across the high and low halves of the data path supports this operation in a form where the bitmux control for the high and low halves of the word is shared.
It turns out that the internal control generated by the system for performing the pure transpose operation can be generated in such a way that the explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them. If this were not the case, it would probably be cheaper to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
Now consider the case where the transpose part of the operation is an extended transpose. In this case, it is probably more practical to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
BitMux/Transpose. This operation is exactly like the transpose/bitmux operation, except the bitmux portion of the operation is effectively performed before the transpose portion, rather than after the transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bitmux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the transpose operation cannot be generated in such a way that first-stage explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them.
Reverse/Transpose/BitMux. This operation is similar to the shuffle/bitmux and transpose/bitmux operations, except the first part of the operation is replaced with the more general reverse/transpose operation, either the pure reverse/transpose operation or the extended reverse/transpose operation.
First consider the case where the reverse/transpose part of the operation is a pure reverse/transpose. Internally, the system described earlier supports any pure reverse/transpose operation in combination with any bitmux operation supported in that system, where the group sizes associated with the pure reverse/transpose part of the operation are independent of the group sizes associated with the bitmux part of the operation. The version of the system described earlier in which the explicit multiplexer control buses in the final stage are not shared across the high and low halves of the data path supports this operation in a form where the bitmux control for the high and low halves of the word is independent, and the version of the system described earlier in which the explicit multiplexer control buses in the final stage are shared across the high and low halves of the data path supports this operation in a form where the bitmux control for the high and low halves of the word is shared.
It turns out that the internal control generated by the system for performing the pure reverse/transpose operation can be generated in such a way that the explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them. If this were not the case, it would probably be cheaper to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
Now consider the case where the reverse/transpose part of the operation is an extended reverse/transpose. In this case, it is probably more practical to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
BitMux/Reverse/Transpose. This operation is exactly like the reverse/transpose/bitmux operation, except the bitmux portion of the operation is effectively performed before the reverse/transpose portion, rather than after the reverse/transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bitmux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the reverse/transpose operation cannot be generated in such a way that first-stage explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them.
Copy/Reverse/Transpose/BitMux. This operation is similar to the shuffle/bitmux and transpose/bitmux operations, except the first part of the operation is replaced with the more general copy/reverse/transpose operation, either the pure copy/reverse/transpose operation or the extended copy/reverse/transpose operation. It is probably more practical to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
BitMux/Copy/Reverse/Transpose. This operation is exactly like the copy/reverse/transpose/bitmux operation, except the bitmux portion of the operation is effectively performed before the copy/reverse/transpose portion, rather than after the copy/reverse/transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bitmux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the copy/reverse/transpose operation cannot be generated in such a way that first-stage explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them.
SuperTranspose/BitMux. This operation is similar to the shuffle/bitmux and transpose/bitmux operations, except the first part of the operation is replaced with the more general supertranspose operation. It is probably more practical to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
BitMux/SuperTranspose. This operation is exactly like the supertranspose/bitmux operation, except the bitmux portion of the operation is effectively performed before the supertranspose portion, rather than after the supertranspose portion. The most practical way for the system described earlier to support this operation would be to build a separate bitmux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the supertranspose operation cannot be generated in such a way that first-stage explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them.
Copy/Reverse/SuperTranspose/BitMux. This operation is similar to the shuffle/bitmux and transpose/bitmux operations, except the first part of the operation is replaced with the more general copy/reverse/supertranspose operation. It is probably more practical to build a separate bitmux unit than to attempt to perform the bitmux portion of the combined operation within the system described earlier.
BitMux/Copy/Reverse/SuperTranspose. This operation is exactly like the copy/reverse/supertranspose/bitmux operation, except the bitmux portion of the operation is effectively performed before the copy/reverse/supertranspose portion, rather than after the copy/reverse/supertranspose portion. The most practical way for the system described earlier to support this operation would be to build a separate bitmux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the copy/reverse/supertranspose operation cannot be generated in such a way that first-stage explicit multiplexer control buses can contain the bitmux control operands directly, with no need to modify them.
Select. In describing the bitmux operation above, a hypothetical version which is not supported by the system described earlier was first defined. It was then shown that the bitmux operations that are supported by the system described earlier may be thought of as outer group versions of the hypothetical bitmux operation. In the same sense, the select operation may be thought of as an inner group version of the hypothetical bitmux operation.
Recall that the general, hypothetical version of the bitmux operation takes a vw-bit control operand u which represents either a w × v or a v × w row-major rectangle. In this case, however, it is more natural to assume the former, so the bit index function p(i) can be defined by $p(i)[j] \leftarrow u[iv + j]$, or simply
$p(i) = u[(i + 1)v - 1 \rightarrow iv]$.
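The index function above can be illustrated with a minimal Python sketch. It assumes the control operand u is held as an integer with bit 0 as the least significant bit; the names `index_field` and `select`, and the element width parameter `b`, are illustrative and do not come from the source. With `b = 1` the function behaves like the hypothetical bitmux; with a larger `b` it moves whole b-bit elements, which is the inner group reading.

```python
def index_field(u, i, v):
    """Return p(i): bits (i + 1) * v - 1 down to i * v of the control operand u."""
    return (u >> (i * v)) & ((1 << v) - 1)


def select(src, u, w, v, b):
    """Result element i (b bits wide) is source element p(i).

    w is the number of elements, v the width of each index field.
    """
    elem_mask = (1 << b) - 1
    out = 0
    for i in range(w):
        p = index_field(u, i, v)
        out |= ((src >> (p * b)) & elem_mask) << (i * b)
    return out


# Usage: reverse the order of 16 bytes using sixteen 4-bit index fields.
src = int.from_bytes(bytes(range(16)), "little")      # byte i holds the value i
u = sum((15 - i) << (4 * i) for i in range(16))       # p(i) = 15 - i
reversed_bytes = select(src, u, w=16, v=4, b=8)
```

In the usage example the control operand occupies 16 × 4 = 64 bits, one index field per result byte.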
Furthermore, if the inner group size b is greater than or equal to , where d is the width of the first (i.e., leftmost) dimension of the n-dimensional rectangle upon which the system is based, then the system can perform this operation simply by using u as the control for a corresponding stage of the system.
Load Alignment. The system described earlier is able to perform a load alignment function on data which is retrieved from the memory system. Since this operation doesn't require the first stage of the system, a separate load-align bus is used which bypasses the first stage and therefore allows more time for the data to arrive from the memory system. The raw data from the memory system consists of a full-width word. Within that word, the desired field has a size which is a power of two greater than or equal to 8, and is guaranteed to be aligned on a boundary which is a multiple of that size. Depending on the type of memory reference, the load align operation must optionally reverse the order of the bytes. In any case, the resultant data must be right-justified, and either zero-filled or sign-extended, depending on the type of the memory reference.
Store Alignment. The system described earlier is able to perform a store alignment function on data which is to be sent to the memory system. The memory system is responsible for storing only those bytes which are being written, so the store align operation only needs to know the size of the data being stored and whether or not it needs to reverse the byte order. The value to be stored is right-justified in the source, so the store align operation simply replicates the value across the entire word, possibly reversing the bytes within the value, depending on the type of the memory reference. In either case, the effect of the operation is always equivalent to some copy/reverse operation.
Additional Details of Operations in a Specific Embodiment of the System
This section adds a few more details about some of the operations described above, in the context of a specific embodiment of the system described earlier.
In this embodiment, the system is part of a microprocessor, and serves as a functional unit which implements some of the instructions of the computer, and also performs some internal functions, such as load and store alignment. In this embodiment, the full data path width is 128 bits, and the system is based on a two-dimensional, 16 × 8 rectangle (resulting in a three-stage implementation). The microprocessor supports two basic word sizes: 64-bit words and 128-bit words. Machine registers are 64 bits wide. Instructions which operate on 64-bit words use individual 64-bit registers, while instructions which operate on 128-bit words operate on adjacent pairs of 64-bit registers, where the even-numbered register corresponds to the low-order half of the 128-bit word and the odd-numbered register corresponds to the high-order half of the 128-bit word. (There are some instructions that operate on 128-bit operands which are constructed from arbitrary pairs of 64-bit registers. The main reason this isn't done for other 128-bit operands is that it requires more instruction fields to specify the two 64-bit registers.) The 128-bit data path consists of two 64-bit halves, the high-order half and the low-order half, which are interleaved at the byte level so that each half physically spans the entire width of the data path. In other words, the physical order of the bytes is 0, 8, 1, 9, ..., 7, 15. (Another way to look at this is to view the physical order of the bits in the data path as being the result of a two-way perfect shuffle with an inner group size of 8.) This embodiment shares the explicit multiplexer control buses across the high and low halves of the data path in the final (i.e., third) stage.
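The byte interleaving just described can be written out explicitly. The following Python sketch lists, for each physical byte position, the logical byte that occupies it; the function name `physical_byte_order` is illustrative and not taken from the source.

```python
def physical_byte_order(n_bytes=16):
    """Logical byte numbers in physical order for a byte-interleaved data path.

    Bytes 0 .. n/2 - 1 form the low-order half and bytes n/2 .. n - 1 the
    high-order half; the two halves alternate byte by byte.
    """
    half = n_bytes // 2
    order = []
    for j in range(half):
        order.append(j)          # byte j of the low-order half
        order.append(j + half)   # byte j of the high-order half
    return order


order = physical_byte_order()
```

Every even physical position carries a low-order byte and every odd position a high-order byte, so each 64-bit half spans the full physical width.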
Internal to the microprocessor, operations which require a 64-bit data operand are guaranteed that the operand will be replicated on both the high- and low-order 64-bit halves of the 128-bit data path, and operations which produce a 64-bit result must replicate that value on both the high- and low-order 64-bit halves of the 128-bit data path. This convention allows the even-numbered registers to always receive their new values from the low-order half of the data path and the odd-numbered registers to always receive their new values from the high-order half of the data path. It is also generally fairly easy to support this convention by treating 64-bit operations as 128-bit operations with an outer group size of 64 (or less), which in most cases is sufficient.
In most cases, the instruction set defines group operations as reading and writing 128-bit values, and non-group operations as reading and writing 64-bit values. There are, however, exceptions to this, as well as some operations which use both 64-bit and 128-bit data operands (i.e., source and/or destination operands). The list of operations described above is now revisited, giving any applicable comments about this specific embodiment.
Rotate. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 2 to 128. Note that only the low-order 6 bits (for the 64-bit, non-group version) or x bits (for the 128-bit, group version) of the rotate amount affect the result. An outer group size of 1 is excluded since it is a no-op. Both immediate and non-immediate rotate amounts are supported.
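As a rough illustration of the group rotate, the Python sketch below rotates each outer group independently. It rests on two assumptions that the text here does not state: that x equals the base-2 logarithm of the outer group size, and that the rotation is toward the high-order end. The function name `group_rotate` is hypothetical.

```python
def group_rotate(value, amount, group_size, width=128):
    """Rotate every group_size-bit group of value left by amount (assumed direction)."""
    assert group_size & (group_size - 1) == 0 and 2 <= group_size <= width
    amount &= group_size - 1              # keep only the low-order log2(group_size) bits
    mask = (1 << group_size) - 1
    out = 0
    for base in range(0, width, group_size):
        g = (value >> base) & mask
        rotated = ((g << amount) | (g >> (group_size - amount))) & mask
        out |= rotated << base
    return out


# Usage: two 8-bit groups, each rotated by one position.
result = group_rotate(0x0180, 1, group_size=8, width=16)
```

Setting `group_size` and `width` both to 64 gives the non-group case, in which only the low-order 6 bits of the amount matter.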
Shift. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 2 to 128. Only the low-order 6 bits (for the 64-bit, non-group version) or x bits (for the 128-bit, group version) of the shift amount are used. An outer group size of 1 is excluded since it is a no-op. Both immediate and non-immediate shift amounts are supported.
Bit Field Deposit. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 1 to 64. Only immediate forms of these instructions are supported in this embodiment, i.e., the shift amount, field size, and outer group size are all encoded as immediates. For the 64-bit, non-group version, the shift amount must be greater than or equal to zero and less than 64, and the field size must be greater than or equal to one and less than or equal to 64 minus the shift amount. For the 128-bit, group version, the shift amount must be greater than or equal to zero and less than α, and the field size must be greater than or equal to one and less than or equal to α minus the shift amount.
Bit Field Withdraw. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 1 to 64. Only immediate forms of these instructions are supported in this embodiment, i.e., the shift amount, field size, and outer group size are all encoded as immediates. For the 64-bit, non-group version, the shift amount must be greater than or equal to zero and less than 64, and the field size must be greater than or equal to one and less than or equal to 64 minus the shift amount. For the 128-bit, group version, the shift amount must be greater than or equal to zero and less than α, and the field size must be greater than or equal to one and less than or equal to α minus the shift amount.
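The operand constraints above can be checked mechanically. The Python sketch below validates a shift amount and field size against a group size and then applies simple versions of the two operations. The semantics are assumptions made for illustration only: withdraw is taken to extract the field and right-justify it with zero fill, and deposit is taken to place a right-justified field at the shift position with zeros elsewhere. All function names are hypothetical.

```python
def check_field(shift, size, group_size):
    """Enforce 0 <= shift < group_size and 1 <= size <= group_size - shift."""
    if not 0 <= shift < group_size:
        raise ValueError("shift amount out of range")
    if not 1 <= size <= group_size - shift:
        raise ValueError("field size out of range")


def group_withdraw(value, shift, size, group_size, width):
    """Assumed semantics: right-justify the field of each group, zero-filled."""
    check_field(shift, size, group_size)
    gmask, fmask = (1 << group_size) - 1, (1 << size) - 1
    out = 0
    for base in range(0, width, group_size):
        g = (value >> base) & gmask
        out |= ((g >> shift) & fmask) << base
    return out


def group_deposit(value, shift, size, group_size, width):
    """Assumed semantics: move the low-order field of each group up to shift."""
    check_field(shift, size, group_size)
    gmask, fmask = (1 << group_size) - 1, (1 << size) - 1
    out = 0
    for base in range(0, width, group_size):
        g = (value >> base) & gmask
        out |= ((g & fmask) << shift) << base
    return out


withdrawn = group_withdraw(0xABCD, shift=4, size=4, group_size=8, width=16)
deposited = group_deposit(withdrawn, shift=4, size=4, group_size=8, width=16)
```

The non-group 64-bit forms correspond to `group_size = width = 64`.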
Expand. This embodiment only supports group versions. The source is 64 bits wide and the destination is 128 bits wide, with initial outer group sizes ranging from 1 to 64 and final outer group sizes ranging from 2 to 128. Only the low-order x bits of the shift amount are used, where x is derived from the final, larger group size. Both immediate and non-immediate shift amounts are supported.
Compress. This embodiment only supports group versions. The source is 128 bits wide and the destination is 64 bits wide, with initial outer group sizes ranging from 2 to 128 and final outer group sizes ranging from 1 to 64. Only the low-order x bits of the shift amount are used, where x is derived from the initial, larger group size. Both immediate and non-immediate shift amounts are supported.
Copy. This embodiment does not support this directly, since it is subsumed by copy/reverse (copy/swap).
Bit Reverse (Swap). This embodiment does not support this directly, since it is subsumed by copy/reverse (copy/swap).
Copy/Reverse (Copy/Swap). This embodiment supports 64-bit and 128-bit versions. Only immediate versions are supported, i.e., both the mask and invert values come from immediates.
Shuffle/Deal. This embodiment supports 64-bit group versions and 128-bit group versions. Only immediate versions are supported, i.e., the outer group size, inner group size, and shuffle amount are encoded as immediates. This embodiment uses a third-degree polynomial to encode all meaningful combinations of these values. One of the encoded values is reserved for an identity (i.e., no-op) shuffle, which is useful in the context of a shuffle/bitmux instruction.
Transpose. This embodiment does not support this instruction. The primary reason is that the control operand is too large to fit into an immediate value in this instruction format, whereas the multi-way shuffle/deal control can be encoded very compactly in an immediate, and the multi-way shuffle/deal instruction covers most of the important cases of the transpose instruction. Furthermore, although generating the internal control for the pure transpose instruction is relatively easy, the internal control for the extended transpose instruction is more complicated to generate and requires significantly more logic to implement. The pure transpose instruction has the added complication of having to detect control values which do not specify pure transpose operations.
Reverse/Transpose. This embodiment does not support this instruction. The primary reason is that the control operand is too large to fit into an immediate value in this instruction format. Furthermore, although generating the internal control for the pure reverse/transpose instruction is relatively easy, the internal control for the extended reverse/transpose instruction is more complicated to generate and requires significantly more logic to implement. The pure reverse/transpose instruction has the added complication of having to detect control values which do not specify pure reverse/transpose operations.
Copy/Reverse/Transpose. This embodiment does not support this instruction. The primary reason is that the control operand is too large to fit into an immediate value in this instruction format. Furthermore, generating the internal control for the copy/reverse/transpose instruction requires a significant amount of logic to implement. The pure copy/reverse/transpose instruction has the added complication of having to detect control values which do not specify pure copy/reverse/transpose operations.
SuperTranspose. This embodiment does not support this instruction. The primary reason is that the internal control requires a substantial amount of logic to compute. One possible way to avoid this problem would be to generate the control information in software, and then load the control information into internal control registers. However, that would only make sense if the same control were to be used many times before being changed.
Copy/Reverse/SuperTranspose. This embodiment does not support this instruction. The primary reason is that the internal control requires a substantial amount of logic to compute. One possible way to avoid this problem would be to generate the control information in software, and then load the control information into internal control registers. However, that would only make sense if the same control were to be used many times before being changed.
BitMux. This embodiment supports 64-bit and 128-bit versions, with an outer group size of 8. The 128-bit version shares the multiplexer control across the high- and low-order halves of the data path. Both versions therefore require 192 bits of multiplexer control, 64 of which come from a 64-bit operand and 128 of which come from a 128-bit operand. An outer group size of 4 is effectively supported through the shuffle/bitmux instruction, where it is possible to specify the identity (i.e., no-op) shuffle.
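The 192-bit figure follows from giving each of 64 result bits an index wide enough to pick any bit of its own 8-bit group (3 bits each). The Python sketch below computes that count and models the outer group bitmux itself. Passing the selectors as a list of per-bit indices is an assumption of this sketch; it says nothing about how the embodiment packs them into the 64-bit and 128-bit control operands. The names `control_bits` and `bitmux` are illustrative.

```python
def control_bits(width, group_size):
    """Multiplexer control bits needed: one log2(group_size)-bit index per result bit."""
    index_bits = (group_size - 1).bit_length()
    return width * index_bits


def bitmux(src, sel, width=64, group_size=8):
    """Result bit i is the source bit at offset sel[i] within bit i's own group."""
    out = 0
    for i in range(width):
        base = i - (i % group_size)          # first bit of the group containing i
        out |= ((src >> (base + sel[i])) & 1) << i
    return out


# Usage: reverse the bits inside a single 8-bit group.
reversed_byte = bitmux(0xB2, [7 - i for i in range(8)], width=8, group_size=8)
```

With an outer group size of 4 the same count gives 64 × 2 = 128 control bits.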
Shuffle/BitMux. This embodiment supports 64-bit and 128-bit versions, with outer group sizes of 4 and 8. The 128-bit version shares the multiplexer control across the high- and low-order halves of the data path.
When the outer group size is 8, 192 bits of multiplexer control are needed, 64 of which come from a 64-bit operand and 128 of which come from a 128-bit operand. In this case, there is no room in the instruction encoding to specify the type of shuffle to perform. Therefore, only one fixed shuffle is supported for this case (in addition to the identity shuffle, which is supported as a plain bitmux instruction). The fixed shuffle has an outer group size of 64, an inner group size of 1, and is an 8-way shuffle. This corresponds to a transpose of an 8 × 8 rectangle in the 64-bit case, and to a pair of 8 × 8 rectangles in the 128-bit case. Therefore, three of these instructions, or a bitmux instruction followed by two of these instructions, are sufficient to perform any 64-bit permutation. Note that internally the system is capable of combining an arbitrary multi-way group shuffle with a bitmux with an outer group size of 8. The only reason this isn't supported in the instruction set is the lack of an additional immediate operand field.
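The correspondence between the fixed 8-way shuffle and an 8 × 8 transpose can be checked with a small model. The Python sketch below treats a k-way shuffle as cutting each outer group into k equal piles and interleaving them element by element; this reading of "multi-way shuffle" and the name `multiway_shuffle` are assumptions of the sketch rather than definitions quoted from the source.

```python
def multiway_shuffle(value, ways, outer, inner=1, width=64):
    """Interleave `ways` equal piles of inner-bit elements inside each outer group."""
    elems = outer // inner            # elements per outer group
    per_pile = elems // ways          # elements in each pile
    emask = (1 << inner) - 1
    out = 0
    for base in range(0, width, outer):
        for p in range(ways):
            for j in range(per_pile):
                src_pos = base + (p * per_pile + j) * inner
                dst_pos = base + (j * ways + p) * inner
                out |= ((value >> src_pos) & emask) << dst_pos
    return out


# Usage: the bit at row 2, column 5 of an 8 x 8 bit matrix moves to row 5, column 2.
moved = multiway_shuffle(1 << (2 * 8 + 5), ways=8, outer=64, inner=1, width=64)
```

Because the transpose of a square matrix is its own inverse, applying this particular shuffle twice returns the original 64-bit value.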
When the outer group size is 4, 128 bits of multiplexer control are needed, which come from a 128-bit operand. This eliminates the need for the 64-bit control operand required by the previous case. This additional operand field is used to encode an arbitrary multi-way group shuffle, using the same encoding used by the shuffle/deal instruction. Note that this includes the identity (i.e., no-op) encoding, so a plain bitmux with an outer group size of 4 can be obtained.
BitMux/Shuffle. This embodiment could support this if the bitmux portion were performed outside of the system described earlier.
Transpose/BitMux. This embodiment does not support this instruction, since it does not support the general transpose instruction.
BitMux/Transpose. This embodiment cannot support this, since it does not support the general transpose instruction.
Reverse/Transpose/BitMux. This embodiment does not support this instruction, since it does not support the reverse/transpose instruction.
BitMux/Reverse/Transpose. This embodiment does not support this instruction, since it does not support the reverse/transpose instruction.
Copy/Reverse/Transpose/BitMux. This embodiment does not support this instruction, since it does not support the copy/reverse/transpose instruction.
BitMux/Copy/Reverse/Transpose. This embodiment does not support this instruction, since it does not support the copy/reverse/transpose instruction.
SuperTranspose/BitMux. This embodiment does not support this instruction, since it does not support the supertranspose instruction.
BitMux/SuperTranspose. This embodiment does not support this instruction, since it does not support the supertranspose instruction.
Copy/Reverse/SuperTranspose/BitMux. This embodiment does not support this instruction, since it does not support the copy/reverse/supertranspose instruction.
BitMux/Copy/Reverse/SuperTranspose. This embodiment cannot support this, since it does not support the copy/reverse/supertranspose instruction.
Select. This embodiment supports 64-bit and 128-bit versions, with an inner group size of 8. Both versions take a 64-bit control operand. The 128-bit version uses all 64 control bits, using the packing described earlier. The 64-bit version only needs 24 control bits. However, rather than being densely packed in the low-order 24 bits of the control operand, they are sparsely packed in the low-order 32 bits of the control operand, with every fourth bit being ignored. This was done in order to make it easier to generate control values for the 64-bit case (since they are now on power-of-two boundaries, it becomes much more natural to use group operations to generate them).
Load Alignment. This embodiment supports this internally.
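The load alignment function described earlier (extract an aligned power-of-two field, optionally reverse its bytes, then right-justify with zero fill or sign extension) can be modelled as follows. This is a behavioural sketch only: it assumes the memory word is a Python integer whose byte 0 is least significant, and the name `load_align` and its parameters are hypothetical.

```python
def load_align(word, offset_bytes, size_bits, reverse_bytes=False, signed=False, width=128):
    """Return the aligned field of word, right-justified in a width-bit result."""
    size_bytes = size_bits // 8
    assert size_bits >= 8 and size_bits & (size_bits - 1) == 0 and size_bits <= width
    assert offset_bytes % size_bytes == 0          # field is aligned to its own size
    field = (word >> (8 * offset_bytes)) & ((1 << size_bits) - 1)
    if reverse_bytes:
        field = int.from_bytes(field.to_bytes(size_bytes, "little"), "big")
    if signed and field >> (size_bits - 1):
        # Sign extension: set every bit above the field.
        field |= ((1 << width) - 1) & ~((1 << size_bits) - 1)
    return field


# Usage: a 16-bit field at byte offset 2 of a 64-bit word.
plain = load_align(0x1122334455667788, 2, 16, width=64)
swapped = load_align(0x1122334455667788, 2, 16, reverse_bytes=True, width=64)
```
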
Store Alignment. This embodiment supports this internally.
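The store alignment function described earlier (take the right-justified value, optionally reverse its bytes, and replicate it across the whole word) admits an equally small model. As with any behavioural sketch, the integer representation, the byte numbering (byte 0 least significant), and the name `store_align` are assumptions and not part of the source.

```python
def store_align(value, size_bits, reverse_bytes=False, width=128):
    """Replicate the low-order size_bits of value across a width-bit word."""
    assert size_bits >= 8 and size_bits & (size_bits - 1) == 0 and size_bits <= width
    size_bytes = size_bits // 8
    field = value & ((1 << size_bits) - 1)
    if reverse_bytes:
        field = int.from_bytes(field.to_bytes(size_bytes, "little"), "big")
    out = 0
    for base in range(0, width, size_bits):
        out |= field << base        # one copy per aligned slot
    return out


# Usage: a 16-bit value replicated across a 64-bit word.
replicated = store_align(0x1234, 16, width=64)
replicated_swapped = store_align(0x1234, 16, reverse_bytes=True, width=64)
```

Since every aligned slot holds the same value, the memory system can write whichever slot the address selects without further shifting.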
Priority Applications (2)
Application Number  Priority Date  Filing Date  Title
US51639895  1995-08-16  1995-08-16
US08/516,398  1995-08-16
Publications (2)
Publication Number  Publication Date
WO1997007451A2 (en)  1997-02-27
WO1997007451A3 (en)  1997-04-10
Family
ID=24055396
Family Applications (1)
Application Number  Publication Number  Priority Date  Filing Date  Title
PCT/US1996/013195  WO1997007451A3 (en)  1995-08-16  1996-08-14  Method and system for implementing data manipulation operations
Country Status (1)
Country  Link
WO (1)  WO1997007451A3 (en)
Citations (4)
Publication number  Priority date  Publication date  Assignee  Title
US3812467A (en) *  1972-09-25  1974-05-21  Goodyear Aerospace Corp  Permutation network
WO1990001740A1 (en) *  1988-08-01  1990-02-22  Board Of Regents, The University Of Texas System  Dynamic address mapping for conflict-free vector access
EP0376769A2 (en) *  1988-11-25  1990-07-04  France Telecom  Apparatus for line-column matrix transposition using shift registers and permutation operators
US5159690A (en) *  1988-09-30  1992-10-27  Massachusetts Institute Of Technology  Multidimensional cellular data array processing system which separately permutes stored data elements and applies transformation rules to permuted elements
Non-Patent Citations (3)
Title
CVGIP IMAGE UNDERSTANDING, vol. 57, no. 1, 1 January 1993, pages 24-41, XP000382765 ANTZOULATOS D G ET AL: "HYPERMATRIX ALGEBRA: THEORY" *
IBM TECHNICAL DISCLOSURE BULLETIN, vol. 17, no. 6, 1 November 1974, page 1575/1576 XP002011206 HANNA C A ET AL: "BIT MANIPULATOR" *
JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, vol. 23, no. 2, April 1976, pages 298-309, XP002025764 FRASER: "Array Permutation by Index-Digit Permutation" *
Cited By (4)
Publication number  Priority date  Publication date  Assignee  Title
EP2267903A3 (en) *  1998-12-04  2012-04-04  Qualcomm Incorporated  Turbo code interleaver using linear congruential sequences
US6446198B1 (en) *  1999-09-30  2002-09-03  Apple Computer, Inc.  Vectorized table lookup
US7000099B2 (en)  1999-09-30  2006-02-14  Apple Computer Inc.  Large table vectorized lookup by selecting entries of vectors resulting from permute operations on sub-tables
US8484532B2 (en)  2001-02-23  2013-07-09  Qualcomm Incorporated  Random-access multi-directional CDMA2000 turbo code interleaver
Also Published As
Publication number  Publication date  Type
WO1997007451A3 (en)  1997-04-10  application
Legal Events
Date  Code  Title  Description
AL  Designated countries for regional patents  Kind code of ref document: A2 Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM
AK  Designated states  Kind code of ref document: A2 Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM
AK  Designated states  Kind code of ref document: A3 Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM
AL  Designated countries for regional patents  Kind code of ref document: A3 Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM
121  EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE  Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 2004-01-01)
REG  Reference to national code  Ref country code: DE Ref legal event code: 8642
122  EP: PCT application non-entry in European phase
NENP  Non-entry into the national phase in:  Ref country code: CA