US5909552A - Method and apparatus for processing packed data - Google Patents
- Publication number
- US5909552A (application US08/032,764)
- Authority
- US
- United States
- Prior art keywords
- word
- multidigit
- packed
- operands
- operand
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/50—Adding; Subtracting
- G06F7/505—Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30038—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F2207/38—Indexing scheme relating to groups G06F7/38 - G06F7/575
- G06F2207/3804—Details
- G06F2207/3808—Details concerning the type of numbers or the way they are handled
- G06F2207/3828—Multigauge devices, i.e. capable of handling packed numbers without unpacking them
Definitions
- FIG. 9 is a block diagram of a process according to a preferred embodiment of the invention for converting a point in window coordinates to a point in screen coordinates. This process decreases the number of cycles needed to accomplish the window to screen coordinate conversion process.
- the first step 400 is to load the packed coordinate value into a first register from memory.
- In step 410, the buffer bit between the X coordinate value and the Y coordinate value in the packed coordinate is cleared by ANDing the packed coordinate with the hexadecimal value FFFF7FFF.
- In step 420, the X and Y coordinates are then converted to screen coordinates by adding a packed hexadecimal value for the window origin.
- Optional step 430 again clears the buffer bit by ANDing the packed coordinate with FFFF7FFF. This step is not required if the following process ignores or is not affected by the uncleared buffer bits.
- In a final step 440, the final value of the packed coordinate is then transmitted on a bus or stored in memory in the next cycle.
- the preferred process takes only 5 cycles as compared to 9 cycles for the prior art process, resulting in a savings of 4 cycles. This savings is primarily due to not unpacking and then repacking the packed coordinate value.
- the Cohen-Sutherland clipping algorithm is an algorithm designed to minimize the number of calculations necessary to determine whether a line between two endpoints is within a window.
- The algorithm efficiently identifies many lines which can be trivially accepted or rejected by using region checks on the endpoints. Intersection calculations are then required only for lines that are neither accepted nor rejected.
- This algorithm is particularly efficient when a window is very large or very small relative to the size of the image to be rendered. That is, when most of the lines of the image to be rendered will be either inside the window or outside the window.
- the algorithm starts by assigning to each end point of a line a 4 bit region code based on the nine regions shown in FIG. 10.
- Each bit in the region code is set to 1 (true) if the following relation between the end point and the window is true (the bits can be in any order as long as the order is consistent for all region codes):
- Bit 1--point is above window
- Bit 2--point is below window
- Bit 3--point is to right of window
- Bit 4--point is to left of window.
- If both endpoints are in the window (i.e. their region codes are equal to 0000), then the line is trivially accepted as being completely inside the window. If only one endpoint is in the window, then the line is neither trivially accepted nor rejected and computationally intensive line intersection calculations should be performed. If neither endpoint is in the window, then the resulting region codes for the two endpoints of the line are ANDed together. If the resulting joint region code is equal to 0000, then the line can be neither trivially accepted nor rejected and computationally intensive line intersection calculations should be performed. If the resulting joint region code is not equal to 0000, then the line is trivially rejected as being completely outside the window.
- FIG. 11 is a block diagram of the prior art process for generating region codes for the Cohen-Sutherland clipping algorithm.
- In a first step 510, the X and Y coordinate values are loaded into separate registers from memory, which may require unpacking, and a sign bit mask with hexadecimal value 80000000 is formed to be used during the algorithm.
- the region code is generated by comparing the X and Y coordinate values to the minimum and maximum X and Y values of the window.
- the first region code bit is equal to the X coordinate minus the minimum X window value with the result ANDed with the sign bit mask.
- the second region code bit is equal to the Y coordinate minus the Y minimum window value with the result ANDed with the sign bit mask. The second region code bit is then left shifted one bit and ORed with the first region code bit, resulting in an initial region code.
- the third region code bit is equal to the maximum X window value minus the X coordinate with the result ANDed with the sign bit mask.
- In step 528, the fourth region code bit is equal to the maximum Y window value minus the Y coordinate with the result ANDed with the sign bit mask.
- the fourth region code bit is then left shifted three bits and ORed with the previous region code, resulting in the final region code.
- Step 522 takes 2 cycles, step 524 takes 4 cycles, step 526 takes 4 cycles and step 528 takes 4 cycles. This results in a total of 14 cycles to generate the region code once the variables are set up in step 510.
- In step 530, the region code is then tested to determine whether it is equal to 0000.
- FIG. 12 is a block diagram of a process for generating region codes for the Cohen-Sutherland clipping algorithm using a preferred embodiment of the invention.
- In a first step 610, the packed X and Y coordinate value is loaded into a register from memory, which does not require unpacking, and a packed sign bit mask with hexadecimal value 80004000 is formed to be used during the algorithm.
- the region code is generated by comparing the X and Y coordinate value to the minimum and maximum X and Y values of the window.
- the first two region code bits are equal to the X and Y coordinate value minus the minimum X and Y window value with the result ANDed with the packed sign bit mask.
- the third and fourth region code bits are equal to the X and Y coordinate value minus the maximum X and Y window values with the result negated and then ANDed with the packed sign bit mask.
- the third and fourth region code bits are then left shifted two bits and ORed with the previous region code bits, resulting in the final region code.
- Step 622 takes 2 cycles, step 624 takes 3 cycles and step 626 takes 2 cycles. This results in a total of 7 cycles to generate the region code once the variables are set up in step 610.
- In step 630, the region code is then tested to determine whether it is equal to 0000.
- the present invention provides for a large increase in speed in handling certain types of processes and algorithms. This increase in speed is substantial if the window coordinates used in the region code generation process shown in FIG. 12 were generated using the process shown in FIG. 9.
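The packed region code generation and the trivial accept/reject test described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the layout (X in bits 31..16, Y in bits 14..0, buffer bit 15) is inferred from the masks FFFF7FFF and 80004000; each subtraction is realized as an add of packed, per-field negated window limits so that carries land in the buffer bit; the "negated" step is taken to be a bitwise NOT of the comparison against maximum + 1; and the final bit positions of the region code are chosen for convenience. All function names are hypothetical, and coordinates are assumed small enough that field differences do not overflow.

```python
WORD_MASK = 0xFFFFFFFF
SIGN_MASK = 0x80004000   # X sign (bit 31) and Y sign (bit 14)


def pack_xy(x, y):
    """Pack X into bits 31..16 and Y into bits 14..0; buffer bit 15 stays 0."""
    return ((x & 0xFFFF) << 16) | (y & 0x7FFF)


def make_window(xmin, ymin, xmax, ymax):
    """Precompute packed, per-field negated limits (assumed approach)."""
    neg_min = pack_xy(-xmin, -ymin)
    neg_max_plus_1 = pack_xy(-(xmax + 1), -(ymax + 1))
    return neg_min, neg_max_plus_1


def region_code(point, window):
    """Return a 4 bit region code spread over bits 31, 30, 14 and 13."""
    neg_min, neg_max_plus_1 = window
    # Sign bits are set where coordinate - minimum is negative.
    below_min = ((point + neg_min) & WORD_MASK) & SIGN_MASK
    # coordinate - (maximum + 1) is negative when inside, so invert it.
    above_max = ~(point + neg_max_plus_1) & SIGN_MASK
    return below_min | (above_max >> 1)


def classify(code_a, code_b):
    """Trivial accept/reject test on the two endpoint region codes."""
    if code_a == 0 and code_b == 0:
        return "accept"
    if code_a & code_b:
        return "reject"
    return "clip"   # intersection calculations are needed


window = make_window(0, 0, 1279, 1023)
inside = region_code(pack_xy(10, 20), window)
left = region_code(pack_xy(-5, 20), window)
right = region_code(pack_xy(1300, 20), window)
```

A line from (-5, 20) to (1300, 20) has endpoint codes with no common bit, so `classify(left, right)` reports that clipping is required, while two endpoints that are both left of the window are rejected.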
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Pure & Applied Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Mathematical Optimization (AREA)
- Executing Machine-Instructions (AREA)
- Image Generation (AREA)
Abstract
A method and an apparatus for processing a plurality of operands in parallel including packing the operands into a word with at least one cleared buffer bit between each operand and processing the packed word.
Description
This is a continuation of application Ser. No. 07/614,353 filed Nov. 15, 1990.
1. Technical Field
The present invention relates generally to parallel computing and more specifically to executing arithmetic and logical operations on packed data.
2. Background Art
A block diagram of a typical scalar computer is shown in FIG. 1. The computer 100 includes a main processor 110 coupled to a memory 120, an input device 130 and an output device 140. Input device 130 may include a keyboard, mouse, tablet or other types of input devices. Output device 140 may include a text monitor, plotter or other types of output devices. The main processor may also be coupled to a graphics output device 160 such as a graphics display through a graphics processor 150. Graphics processor 150 receives instructions regarding graphics from main processor 110. The graphics processor then executes those instructions and generates RGB signals to the graphics display 160 thereby rendering the desired graphics output from the workstation processor.
Multiple main processors are sometimes used to perform parallel processing. That is, more than one processor is executing instructions at the same time. However, due to the linear nature of many algorithms and processes, this approach may not be advantageous in all circumstances.
The present invention provides a method and an apparatus for processing a plurality of operands in parallel including packing the operands into a word with at least one cleared buffer bit between each operand and processing the packed word.
The present invention also provides a method of generating a region code for an X coordinate and a Y coordinate by packing the X coordinate and the Y coordinate into a first packed word with at least one buffer bit between the X and Y coordinates, packing the minimum and maximum values of the window into a second and a third packed words, generating a first and a second region subcode by comparing the first packed word with the second and third packed words, and then generating the region code by combining the first region subcode with the second region subcode.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
FIG. 1 is a block diagram of a typical scalar computer;
FIG. 2 is a diagram of multiple operands packed into a single word according to a preferred embodiment of the invention;
FIG. 3 is an example of three add operations using separate words as performed in the prior art;
FIG. 4 is an example of the same add operations using a packed word according to a preferred embodiment of the invention;
FIG. 5 is a block diagram of a processor with a memory for packing, processing and unpacking words;
FIGS. 6 and 7 are flowcharts of the packing and unpacking of words;
FIG. 8 is a block diagram of a prior art technique for converting a point in window coordinates to a point in screen coordinates;
FIG. 9 is a block diagram of a technique according to a preferred embodiment of the invention for converting a point in window coordinates to a point in screen coordinates;
FIG. 10 is a diagram of the 4 bit region codes for the nine regions around and in a window based on the Cohen-Sutherland clipping algorithm;
FIG. 11 is a block diagram of the prior art process for generating region codes for the Cohen-Sutherland clipping algorithm; and
FIG. 12 is a block diagram of a process for generating region codes for the Cohen-Sutherland clipping algorithm using a preferred embodiment of the invention.
Both the main processor and the graphics processor typically use 8, 16, or 32 bit registers to store words of data, instructions, etc. The instructions are used by the processors to perform various arithmetic and logical operations on the words of data. Negative numbers are often represented in two's complement to facilitate these operations. Communication within the processors and between processors is also usually with 8, 16 or 32 bit instructions or words of data.
FIG. 2 is a diagram of three operands 210, 220 and 230 packed into a single 32 bit word 200 according to a preferred embodiment of the invention. Each operand includes a number of bits for data 212, 222, 232, plus a single sign bit 214, 224, 234. Also included between the operands are buffer bits 240, 250 that are set to zero prior to arithmetic or logical operations being performed upon the packed word 200. In two's complement arithmetic, a negative result in a signed arithmetic operation causes propagation of a sign bit to the most significant bit. To prevent this propagation from flowing into bits of another operand's result, at least one buffer bit is reserved between each operand. These bits are preferably cleared (set to zero) before any arithmetic or logical operation involving signed arithmetic and again before evaluating results. The operand at the most significant bit of the word does not require a buffer bit because overflows will propagate into an ALU (arithmetic logic unit) carry bit or will be lost without affecting the arithmetic result. If more than one buffer bit is used between the operands, then multiple arithmetic and logical operations (at least equal to the number of buffer bits) may be performed on the packed word between clearing the buffer bits without concern as to whether the operations will cause sign bit overflow. Each operand in the packed word can be varied in size and location. The number of operands used can also vary. The number of parallel operations achievable is limited to the number of useful operands with buffer bits that can be placed in a single ALU word.
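The effect of the number of buffer bits can be illustrated with a small Python sketch. The 16 bit layout used here (a high operand, the buffer bits, then an 8 bit low operand) and the operand values are assumptions chosen for illustration and do not come from the figures. Each add of -1 to the low operand produces a carry, and the sketch clears the buffer bits only after all adds are done.

```python
LO_BITS = 8          # width of the low operand (hypothetical layout)
WORD_MASK = 0xFFFF   # 16 bit ALU word


def repeated_add(buffer_bits, adds):
    """Add (hi=+1, lo=-1) to (hi=5, lo=10) `adds` times.

    The buffer bits are cleared only once, after the last add.
    Returns the resulting (hi, lo) pair as unsigned field values.
    """
    hi_shift = LO_BITS + buffer_bits
    buffer_mask = ((1 << buffer_bits) - 1) << LO_BITS

    def pack(hi, lo):
        return ((hi << hi_shift) | (lo & 0xFF)) & WORD_MASK

    acc = pack(5, 10)
    addend = pack(1, -1)
    for _ in range(adds):
        acc = (acc + addend) & WORD_MASK   # no clearing between adds
    acc &= ~buffer_mask & WORD_MASK        # clear the buffer bits once
    return acc >> hi_shift, acc & 0xFF


two_buffers = repeated_add(buffer_bits=2, adds=2)
one_buffer = repeated_add(buffer_bits=1, adds=2)
```

With two buffer bits, two consecutive adds give the expected (7, 8). With a single buffer bit, the second carry escapes into the high operand and the result is (8, 8), which is why that layout requires clearing the buffer bit between operations.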
FIG. 3 is an example of three add operations using 16 bit words as performed in the prior art. Each of the operations is shown in decimal, hexadecimal and binary. In a first operation, 50 is added to -1 resulting in a value of 49. In a second operation, 4 is added to 3 for a result of 7. In a third operation, -1 is added to -1 resulting in a value of -2. The typical computer uses at least three cycles to perform these operations, not including any load or store functions.
FIG. 4 shows the same add operations using a packed 16 bit word with cleared buffer bits. As can be seen from the operation, the first operand (OP1) is 7 bits long, the second operand (OP2) is 4 bits long, and the third operand (OP3) is 3 bits long. In addition, there are 2 buffer bits (b) between the operands resulting in a total of 16 bits. The first set of operands (OP_a) contains the values 50, 4 and -1. The second set of operands (OP_b) contains the values -1, 3 and -1. The third set of operands (OP_c) contains the results of the addition. One of the resulting buffer bits was set to 1 because of a carry forward of a sign bit. The buffer bits are set to 0 by using a logical AND operation with a mask. The resulting fourth set of operands (OP_d) contains the desired values of 49, 7 and -2 which match the results of FIG. 3. The final masking step is not necessary if there are not going to be any further arithmetic or logical operations to be performed on the packed word.
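The packed addition can be reproduced with the following Python sketch. It assumes the field order OP1, buffer, OP2, buffer, OP3 from most to least significant bit, and it takes the operand pairs from the three additions of FIG. 3 (50 + -1, 4 + 3 and -1 + -1). The helper names are hypothetical.

```python
WORD_MASK = 0xFFFF
# (shift, width) of OP1, OP2 and OP3 in the assumed 16 bit layout:
# bits 15..9 = OP1, bit 8 = buffer, bits 7..4 = OP2, bit 3 = buffer, bits 2..0 = OP3
FIELDS = [(9, 7), (4, 4), (0, 3)]
BUFFER_CLEAR = WORD_MASK & ~((1 << 8) | (1 << 3))   # 0xFEF7


def pack(values):
    """Pack signed values into their fields, leaving buffer bits at 0."""
    word = 0
    for (shift, width), value in zip(FIELDS, values):
        word |= (value & ((1 << width) - 1)) << shift
    return word


def unpack(word):
    """Extract each field and interpret it as a two's complement value."""
    values = []
    for shift, width in FIELDS:
        raw = (word >> shift) & ((1 << width) - 1)
        if raw & (1 << (width - 1)):
            raw -= 1 << width
        values.append(raw)
    return values


op_a = pack([50, 4, -1])
op_b = pack([-1, 3, -1])
op_c = (op_a + op_b) & WORD_MASK   # one add performs all three additions
op_d = op_c & BUFFER_CLEAR         # clear the buffer bit set by the carry
results = unpack(op_d)
```

Under these assumptions the single add yields hexadecimal 627E, in which the buffer bit below OP2 has been set by the carry out of OP3. Masking gives 6276, which unpacks to 49, 7 and -2.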
FIG. 5 is a block diagram of a processor 250 with a memory 255 for packing, processing and unpacking words. FIGS. 6 and 7 are flowcharts of the packing and unpacking of words using the processor and memory of FIG. 5. In the preferred embodiment, processor 250 can be found in a computer main processor or in a graphics processor. Processor 250 includes a central processing unit (CPU) 260, an arithmetic logic unit (ALU) 262, and registers 264, 266 and 268.
FIG. 6 is a flowchart of the process for packing multiple operands into a single word. In a first step 270, a first operand stored in memory 255 is loaded into a first register 264 by CPU 260. In a second step 271, the operand in the first register is left shifted with zero fill a predetermined number of bits by ALU 262. In the third step 272, a second operand is loaded from memory into the second register 266 by the CPU. In the fourth step 273, a mask is loaded from memory or is generated by the CPU into the third register 268. In step 274, the ALU then uses the mask to mask off the left hand bits of the operand in the second register. If the operand in the second register is not the last operand to be packed into the packed word, then the ALU will left shift the second register a predetermined number of bits in step 275. In step 276, the operand in the second register is then ORed with the contents of the first register into the first register by the ALU. The CPU then determines whether to pack another operand into the packed word in step 277.
The buffer bit is generated and cleared between operands by either the left shift zero fills or by the masking. In addition, the buffer bits may be cleared after the packing process with a mask. In alternative embodiments, the packed word may be filled with operands from left to right, right to left, or in any order that is desired.
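A minimal Python sketch of a shift, mask and OR packing loop in the spirit of FIG. 6 is given below. It is a simplification rather than a step-by-step transcription of the flowchart: the accumulated word is shifted left with zero fill before each new operand is masked and ORed in, so the first operand ends up in the most significant position. The function name and parameters are hypothetical.

```python
def pack_operands(operands, widths, buffer_bits=1, word_bits=32):
    """Pack signed operands into one word with cleared buffer bits between them."""
    if len(operands) != len(widths):
        raise ValueError("one width is required per operand")
    total = sum(widths) + buffer_bits * (len(widths) - 1)
    if total > word_bits:
        raise ValueError("operands and buffer bits do not fit in the word")

    word = 0
    for index, (value, width) in enumerate(zip(operands, widths)):
        if index > 0:
            # Left shift with zero fill: makes room for the buffer bits
            # (left at 0) and for the next operand.
            word <<= buffer_bits + width
        # Mask off the left hand (sign extension) bits of the operand.
        field = value & ((1 << width) - 1)
        word |= field
    return word


packed_16 = pack_operands([50, 4, -1], [7, 4, 3], word_bits=16)
packed_xy = pack_operands([10, -1], [16, 15])
```

Here the buffer bits are created by the zero fill of the left shift, which is one of the two ways mentioned above.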
FIG. 7 is a flowchart of the process for unpacking multiple operands from a single packed word. In a first step 280, a packed word is loaded into the first register 264 from memory 255 by ALU 262. In a second step, a mask is loaded from memory or is generated by the CPU into the second register 266. In step 281, the ALU then uses the mask to mask off the undesired operands, resulting in a desired operand in the third register 268. In the next step 283, the operand in the third register is left shifted and then right shifted to sign extend the operand. The resulting operand is then loaded into memory by the CPU in step 284. The CPU then determines whether to unpack another operand from the packed word in step 285.
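The mask, left shift and right shift sequence of the unpacking process can be sketched in Python as follows. Python integers are unbounded, so the sketch emulates a fixed width register explicitly. The function name, its parameters and the example field positions are assumptions made for illustration.

```python
def unpack_operand(word, shift, width, word_bits=32):
    """Extract the field at `shift` with `width` bits and sign extend it."""
    field_mask = ((1 << width) - 1) << shift
    field = word & field_mask                      # mask off the undesired operands

    # Left shift so the operand's sign bit becomes the register's top bit.
    left = (field << (word_bits - shift - width)) & ((1 << word_bits) - 1)
    if left & (1 << (word_bits - 1)):
        left -= 1 << word_bits                     # reinterpret as a signed register

    # Arithmetic right shift replicates the sign bit (sign extension).
    return left >> (word_bits - width)


word = 0x6276   # hypothetical 16 bit packed word: 7 bit, 4 bit and 3 bit operands
op1 = unpack_operand(word, shift=9, width=7, word_bits=16)
op2 = unpack_operand(word, shift=4, width=4, word_bits=16)
op3 = unpack_operand(word, shift=0, width=3, word_bits=16)
```

For this word the three fields unpack to 49, 7 and -2. The third value shows the sign extension at work, since its 3 bit field holds binary 110.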
This technique is particularly useful for two's complement integer arithmetic using adds, subtracts and compares with operand values that are limited in size. One of the better uses of this technique is in two dimensional graphics operations and in windowing applications. The X and Y coordinates for each vertex or point on the screen are limited to 16 bits in most applications. For example, graphics displays are commonly 1280 by 1024 pixels in size. The number of bits necessary to represent any pixel in X and Y integer coordinates is 11 bits in X and 10 bits in Y. If organized in memory sequentially, a single 32 bit read can carry both X and Y coordinates of a vertex into a processor with 32 bit wide registers. Because each coordinate needs fewer than 16 bits, a high-order bit of the lower coordinate field is left over to serve as the buffer bit. Assuming the numbers are in screen or window coordinates, this buffer bit can then be cleared without any loss of data.
Upon completion of all computations on a vertex, it must be output to the rendering hardware. In general, to increase effective bus bandwidth and reduce I/O addresses required, it is necessary to output these operands in a packed format. A major advantage of the technique outlined here is that it allows all the processing to occur in packed form. This circumvents the need to first unpack, process then repack the vertex data.
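Processing a vertex entirely in packed form can be sketched as follows. The sketch assumes X occupies the upper 16 bits of a 32 bit word and Y the lower 16 bits, with bit 15 acting as the buffer bit, which is the bit cleared by the hexadecimal mask FFFF7FFF used in the FIG. 9 process. The vertex (10, -1) and window origin (100, 200) are the example values used with Table 1. Function names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF
BUFFER_CLEAR = 0xFFFF7FFF   # clears bit 15, the buffer bit between X and Y


def pack_xy(x, y):
    """Store X in the upper 16 bits and Y in the lower 16 bits."""
    return ((x & 0xFFFF) << 16) | (y & 0xFFFF)


def window_to_screen_packed(packed_coord, packed_origin):
    """Convert window to screen coordinates without unpacking."""
    word = packed_coord & BUFFER_CLEAR        # clear the buffer bit
    word = (word + packed_origin) & WORD_MASK # one add handles X and Y
    return word & BUFFER_CLEAR                # clear the buffer bit again


vertex = pack_xy(10, -1)       # 000AFFFF
origin = pack_xy(100, 200)     # 006400C8
screen = window_to_screen_packed(vertex, origin)
screen_x = screen >> 16
screen_y = screen & 0x7FFF

# For comparison: adding without clearing the buffer bit first lets the
# carry out of the Y field corrupt the X field.
naive = (vertex + origin) & WORD_MASK
```

The packed result is 006E00C7, that is X = 110 and Y = 199, whereas the unmasked add produces 006F00C7 with an X value that is off by one.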
FIGS. 9 and 12 describe processes using the processor and memory of FIG. 5 in which packed values can be used in conjunction with packed arithmetic to increase processor throughput. Portions of the processes are grouped to show the relative number of cycles per operation. The examples shown are important because they combine to form a loop for two dimensional vector processing. The cycles saved in this loop dramatically affect two dimensional vector performance.
FIG. 8 is a block diagram of a prior art process for converting a point in window coordinates to a point in screen coordinates. In 32 bit computers, the X and Y coordinates are typically stored in memory in a packed format. That is, the X coordinate is stored in the first 16 bits of the 32 bit word and the Y coordinate is stored in the second 16 bits of the 32 bit word.
In a first step 300, the packed coordinate value is loaded into a first register from memory. The packed coordinate is then unpacked into X and Y coordinates in a second step 310. This can be accomplished in step 312 by first loading the packed word into an X coordinate register then right shifting the register by 16 bits. In many computers, this can be performed in a single step. The Y coordinate is then unpacked in step 314 by left shifting the packed word by 16 bits then right shifting the word by 16 bits. This process of shifting the Y coordinate to the left and right sign extends the coordinate if its value is negative. Step 314 is typically performed in 2 cycles.
During step 320, the X and Y coordinates are converted from window coordinates to screen coordinates. That is, the X origin value is added to the X coordinate in step 322 and the Y origin is added to the Y coordinate in step 324. The add operations usually take one cycle each to be performed.
During step 330, the X and Y coordinates are repacked into the packed coordinate for being transmitted on a bus or stored in memory. This can be accomplished by left shifting the X coordinate by 16 bits as shown in step 332 and then ORing the X coordinate and Y coordinate into the packed coordinate in step 334. This usually takes a total of 2 cycles to be performed.
As shown in step 340, the final value of the packed coordinate is then transmitted on a bus or stored in memory in the next cycle.
Table 1 shows the value of variables during the prior art window to screen conversion process shown in FIG. 8 for a vertex with window coordinates of 10, -1 in a window with an origin of 100, 200 on a screen. The value of the packed coordinate (Packed_Coord), the X coordinate (X_Coord) and the Y coordinate (Y_Coord) are given at the end of each referenced step. For example, at the end of step 300, the packed coordinate value is 000AFFFF with the X and Y coordinate values being equal to 000A and FFFF for 10 and -1, respectively.
TABLE 1: Variable Values during Prior Art Window to Screen Conversion Process
Step | Packed_Coord | X_Coord | Y_Coord |
---|---|---|---|
300 | 000AFFFF | xxxxxxxx | xxxxxxxx |
310 | 000AFFFF | 0000000A | FFFFFFFF |
320 | 000AFFFF | 0000006E | 000000C7 |
330 | 006E00C7 | 006E0000 | 000000C7 |
340 | 006E00C7 | 006E0000 | 000000C7 |
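The prior art steps 300 through 340 can be sketched as follows for the same vertex and window origin. This is a minimal illustration rather than the patent's own code: the function names are hypothetical, Python integers are masked to emulate 32 bit registers, and the Y value is masked to 16 bits before the OR as a safety measure that the description does not mention.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def prior_art_window_to_screen(packed: int, x_origin: int, y_origin: int) -> int:
    # Step 310: unpack the two coordinates.
    x = sign_extend16(packed >> 16)   # step 312: shift right by 16
    y = sign_extend16(packed)         # step 314: shift left, then right, by 16
    # Step 320: convert window coordinates to screen coordinates.
    x += x_origin                     # step 322
    y += y_origin                     # step 324
    # Step 330: repack.
    x_shifted = (x << 16) & MASK32    # step 332
    return x_shifted | (y & 0xFFFF)   # step 334


result = prior_art_window_to_screen(0x000AFFFF, 100, 200)
print(f"{result:08X}")  # 006E00C7
```

The result matches the final Packed_Coord value in Table 1.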
FIG. 9 is a block diagram of a process according to a preferred embodiment of the invention for converting a point in window coordinates to a point in screen coordinates. This process decreases the number of cycles needed to accomplish the window to screen coordinate conversion process.
As in FIG. 8, the first step 400 is to load the packed coordinate value into a first register from memory. In step 410, the buffer bit between the X coordinate value and the Y coordinate value in the packed coordinate is cleared by ANDing the packed coordinate with the hexadecimal value FFFF7FFF. The X and Y coordinates are then converted to screen coordinates in step 420 by adding a packed hexadecimal value for the window origin. Optional step 430 again clears the buffer bit by ANDing the packed coordinate with FFFF7FFF. This step is not required if the following process ignores or is not affected by the uncleared buffer bits. In a final step 440, the final value of the packed coordinate is then transmitted on a bus or stored in memory in the next cycle.
Table 2 shows the value of variables during the preferred window to screen conversion process shown in FIG. 9 for a vertex with window coordinates of 10, -1 in a window with an origin of 100, 200 on a screen. The value of the packed coordinate (Packed_Coord) is given at the end of each referenced step. For example, at the end of step 400, the packed coordinate value is 000AFFFF.
TABLE 2: Variable Values during Preferred Window to Screen Conversion Process
Step | Packed_Coord |
---|---|
400 | 000AFFFF |
410 | 000A7FFF |
420 | 006E80C7 |
430 | 006E00C7 |
440 | 006E00C7 |
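The packed steps 410 through 430 can be sketched in the same way. The function name and the constant `PACKED_ORIGIN` are illustrative; the origin value 006400C8 is simply the window origin 100, 200 packed with X in the upper half, which is an assumption consistent with the values in Table 2.

```python
MASK32 = 0xFFFFFFFF
BUFFER_CLEAR = 0xFFFF7FFF          # clears bit 15, the buffer bit
PACKED_ORIGIN = (100 << 16) | 200  # 0x006400C8, assumed packed window origin


def packed_window_to_screen(packed: int, packed_origin: int):
    """Return the packed word after steps 410, 420 and 430."""
    step_410 = packed & BUFFER_CLEAR                # clear the buffer bit
    step_420 = (step_410 + packed_origin) & MASK32  # one add converts X and Y
    step_430 = step_420 & BUFFER_CLEAR              # optional second clear
    return step_410, step_420, step_430


for value in packed_window_to_screen(0x000AFFFF, PACKED_ORIGIN):
    print(f"{value:08X}")
# 000A7FFF
# 006E80C7
# 006E00C7
```

In step 420 the carry out of the Y field lands in the buffer bit (giving 80C7 in the lower half) instead of propagating into the X field, which is why a single add can serve both coordinates.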
As shown by FIGS. 8 and 9, the preferred process takes only 5 cycles as compared to 9 cycles for the prior art process, resulting in a savings of 4 cycles. This savings is primarily due to not unpacking and then repacking the packed coordinate value.
The Cohen-Sutherland clipping algorithm is an algorithm designed to minimize the number of calculations necessary to determine whether a line between two endpoints is within a window. The algorithm efficiently identifies many lines which can be trivially accepted or rejected by using region checks on the endpoints. Intersection calculations are then required only for lines that are neither accepted nor rejected. This algorithm is particularly efficient when a window is very large or very small relative to the size of the image to be rendered. That is, when most of the lines of the image to be rendered will be either inside the window or outside the window.
The algorithm starts by assigning to each end point of a line a 4 bit region code based on the nine regions shown in FIG. 10. Each bit in the region code is set to 1 (true) if the corresponding one of the following relations between the end point and window is true (the bits can be in any order as long as the order is consistent for all region codes):
If both endpoints are in the window (i.e. their region codes are equal to 0000), then the line is trivially accepted as being completely inside the window. If only one endpoint is in the window, then the line is neither trivially accepted nor rejected and computationally intensive line intersection calculations should be performed. If neither endpoint is in the window, then the resulting region codes for the two endpoints of the line are ANDed together. If the resulting joint region code is equal to 0000, then the line can be neither trivially accepted nor rejected and computationally intensive line intersection calculations should be performed. If the resulting joint region code is not equal to 0000, then the line is trivially rejected as being completely outside the window.
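The accept and reject decision just described can be sketched as a small function. The bit names below are hypothetical and follow the bit order used for FIG. 11 (X below minimum, Y below minimum, X above maximum, Y above maximum); any consistent order would work equally well.

```python
# Hypothetical bit assignments for the 4 bit region code.
X_LT_MIN = 0b0001
Y_LT_MIN = 0b0010
X_GT_MAX = 0b0100
Y_GT_MAX = 0b1000


def classify(code_a: int, code_b: int) -> str:
    """Classify a line from the region codes of its two endpoints."""
    if code_a == 0 and code_b == 0:
        return "accept"   # both endpoints inside the window
    if code_a & code_b:
        return "reject"   # both endpoints outside on the same side
    return "clip"         # intersection calculations are needed


print(classify(0, 0))                # accept
print(classify(X_LT_MIN, X_LT_MIN))  # reject
print(classify(0, X_GT_MAX))         # clip
print(classify(X_LT_MIN, X_GT_MAX))  # clip
```

When only one endpoint is inside the window, the AND of the two codes is necessarily 0000, so the single AND test covers that case as well.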
FIG. 11 is a block diagram of the prior art process for generating region codes for the Cohen-Sutherland clipping algorithm. In a first step 510 the X and Y coordinate values are loaded into separate registers from memory, which may require unpacking, and a sign bit mask with value 80000000 is formed to be used during the algorithm.
In a second step 520, the region code is generated by comparing the X and Y coordinate values to the minimum and maximum X and Y values of the window. In step 522, the first region code bit is equal to the X coordinate minus the minimum X window value with the result ANDed with the sign bit mask. In step 524, the second region code bit is equal to the Y coordinate minus the Y minimum window value with the result ANDed with the sign bit mask. The second region code bit is then left shifted one bit and ORed with the first region code bit, resulting in an initial region code. In step 526, the third region code bit is equal to the maximum X window value minus the X coordinate with the result ANDed with the sign bit mask. The third region code bit is then left shifted two bits and ORed with the previous region code. In step 528, the fourth region code bit is equal to the maximum Y window value minus the Y coordinate with the result ANDed with the sign bit mask. The fourth region code bit is then left shifted three bits and ORed with the previous region code, resulting in the final region code. In a typical computer, step 522 takes 2 cycles, step 524 takes 4 cycles, step 526 takes 4 cycles and step 528 takes 4 cycles. This results in a total of 14 cycles to generate the region code once the variables are set up in step 510.
In step 530, the region code is then tested to determine whether the region code is equal to 0000.
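The sign bit comparisons of steps 522 through 528 can be sketched as below. The function names are illustrative, and the sketch makes one simplification: each sign bit is reduced to 0 or 1 before being shifted into its position, rather than being kept at bit 31 as in the mask-and-shift wording above.

```python
MASK32 = 0xFFFFFFFF
SIGN_MASK = 0x80000000  # sign bit mask formed in step 510


def sign_bit(difference: int) -> int:
    """Return 1 if the 32 bit two's complement difference is negative."""
    return 1 if (difference & MASK32) & SIGN_MASK else 0


def region_code(x, y, x_min, y_min, x_max, y_max) -> int:
    code = sign_bit(x - x_min)            # step 522: X below window minimum
    code |= sign_bit(y - y_min) << 1      # step 524: Y below window minimum
    code |= sign_bit(x_max - x) << 2      # step 526: X above window maximum
    code |= sign_bit(y_max - y) << 3      # step 528: Y above window maximum
    return code


window = (100, 200, 500, 600)  # example window, chosen for illustration
print(region_code(300, 300, *window))  # 0
print(region_code(50, 300, *window))   # 1
print(region_code(600, 700, *window))  # 12
```

Each of the four comparisons needs its own subtract and mask, which is the source of the cycle count given for the prior art process.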
FIG. 12 is a block diagram of a process for generating region codes for the Cohen-Sutherland clipping algorithm using a preferred embodiment of the invention. In a first step 610 the packed X and Y coordinate value is loaded into a register from memory, which will not require unpacking, and a packed sign bit mask with value 80004000 is formed to be used during the algorithm.
In a second step 620, the region code is generated by comparing the X and Y coordinate value to the minimum and maximum X and Y values of the window. In step 622, the first two region code bits are equal to the X and Y coordinate value minus the minimum X and Y window value with the result ANDed with the packed sign bit mask. In step 624, the third and fourth region code bits are equal to the X and Y coordinate value minus the maximum X and Y window values with the result negated and then ANDed with the packed sign bit mask. In step 626, the third and fourth region code bits are then left shifted two bits and ORed with the previous region code bits, resulting in the final region code. Using the preferred embodiment of the invention, step 622 takes 2 cycles, step 624 takes 3 cycles, and step 626 takes 2 cycles. This results in a total of 7 cycles to generate the region code once the variables are set up in step 610.
In step 630, the region code is then tested to determine whether the region code is equal to 0000.
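A packed version of the region test can be sketched as follows. Several details here are assumptions of this sketch and not statements from the description: the layout (16 bit X field in bits 16 to 31, buffer bit at bit 15, 15 bit Y field in bits 0 to 14) is inferred from the mask values FFFF7FFF and 80004000; the buffer bit is set in the minuend before subtracting so that a borrow out of the Y field is absorbed instead of reaching the X field; the maximum test subtracts the coordinate from the packed maximum directly rather than subtracting and negating; and the second pair of flags is moved right by one bit so that it does not collide with the first pair. All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
BUFFER_BIT = 0x00008000
PACKED_SIGN_MASK = 0x80004000  # sign bits of the X field and the Y field


def pack(x: int, y: int) -> int:
    """Pack X into 16 bits and Y into 15 bits, leaving bit 15 clear."""
    return ((x & 0xFFFF) << 16) | (y & 0x7FFF)


def packed_sub(a: int, b: int) -> int:
    """Field-wise a - b; the pre-set buffer bit absorbs a borrow from Y."""
    return ((a | BUFFER_BIT) - (b & (MASK32 ^ BUFFER_BIT))) & MASK32


def packed_region_word(point: int, packed_min: int, packed_max: int) -> int:
    below = packed_sub(point, packed_min) & PACKED_SIGN_MASK  # X<Xmin, Y<Ymin
    above = packed_sub(packed_max, point) & PACKED_SIGN_MASK  # X>Xmax, Y>Ymax
    return below | (above >> 1)  # flags end up at bits 31, 30, 14 and 13


p_min, p_max = pack(100, 200), pack(500, 600)  # example window
print(f"{packed_region_word(pack(300, 300), p_min, p_max):08X}")  # 00000000
print(f"{packed_region_word(pack(50, 300), p_min, p_max):08X}")   # 80000000
print(f"{packed_region_word(pack(600, 700), p_min, p_max):08X}")  # 40002000
```

The resulting word is zero exactly when the point is inside the window, and two such words can be ANDed together for the trivial reject test, so the zero test of step 630 applies unchanged. The sketch is valid only while each field difference fits in its field (15 bits signed for Y, 16 bits signed for X).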
As can be seen in FIGS. 9 and 12, the present invention provides for a large increase in speed in handling certain types of processes and algorithms. This increase in speed is substantial if the window coordinates used in the region code generation process shown in FIG. 12 were generated using the process shown in FIG. 9.
Although the present invention has been fully described above with reference to specific embodiments, other alternative embodiments will be apparent to those of ordinary skill in the art. For example, multiply or shift operations can be performed on a packed word if an appropriate number of buffer bits are used between operands. Therefore, the above description should not be taken as limiting the scope of the present invention which is defined by the appended claims.
Claims (17)
1. A method of processing a plurality of multidigit operands in parallel comprising the steps of:
a) packing the multidigit operands into a first word with at least one buffer bit between each multidigit operand; and
b) performing arithmetic operations on the first packed word with a second word, thereby providing a processed packed word.
2. The method of claim 1 wherein the step of performing arithmetic operations includes adding the first packed word with the second word.
3. The method of claim 1 further comprising the step of clearing the buffer bits in the processed packed word.
4. The method of claim 3 further comprising the step of unpacking at least one of the plurality of operands from the processed packed word.
5. The method of claim 4 wherein the step of unpacking comprises:
a) masking the processed packed word; and
b) sign extending the masked word.
6. A method of generating a region code for a first multidigit operand and a second multidigit operand from a third multidigit operand, a fourth multidigit operand, a fifth multidigit operand and a sixth multidigit operand comprising the steps of:
a) packing the first multidigit operand and the second operand into a first word with at least one buffer bit between the first and second multidigit operands;
b) packing the third multidigit operand and the fourth multidigit operand into a second packed word;
c) packing the fifth multidigit operand and the sixth multidigit operand into a third word;
d) generating a first region subcode by comparing the first packed word with the second packed word;
e) generating a second region subcode by comparing the first packed word with the third packed word; and
f) generating a region code by combining the first region subcode with the second region subcode.
7. A method of performing at least one arithmetic operation to a first plurality of multidigit operands with a second plurality of multidigit operands comprising the steps of:
a) packing the first plurality of multidigit operands into a first packed word with at least one buffer bit between each of the plurality of multidigit operands;
b) packing the second plurality of multidigit operands into a second packed word; and
c) performing the at least one arithmetic operation with the second packed word to the first packed word.
8. The method of claim 7 further comprising the steps of clearing the at least one buffer bits in the first packed word subsequent to performing the at least one arithmetic operation.
9. The method of claim 8 further comprising the steps of unpacking the first plurality of operands from the cleared first packed word.
10. An apparatus for performing arithmetic operations on a plurality of multidigit operands in parallel comprising:
a) means for packing the multidigit operands into a first packed word with at least one buffer bit between each multidigit operand; and
b) means for performing arithmetic operations on the first packed word with a second word, thereby providing a processed packed word.
11. An apparatus of claim 10 further comprising means for clearing the buffer bits in the processed packed word.
12. The apparatus of claim 11 further comprising means for unpacking at least one of the plurality of operands from the processed packed word.
13. An apparatus for performing arithmetic operations on a plurality of multidigit operands in parallel comprising a computer, the computer including:
a) memory means for storing multidigit operands;
b) means for packing stored multidigit operands into a first packed word with at least one buffer bit between each multidigit operand; and
c) means for performing arithmetic operations on the first packed word with a second word.
14. The method of claim 6 further comprising the step of performing arithmetic operations including adding the first packed word with the second packed word.
15. The method of claim 7 wherein the step of performing arithmetic operations includes adding the first packed word with the second word.
16. The method of claim 10 wherein the means for performing arithmetic operations includes means for adding the first packed word with the second word.
17. The apparatus of claim 13 wherein the means for arithmetic operations includes means for adding the first packed word with the second word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/032,764 US5909552A (en) | 1990-11-15 | 1993-03-16 | Method and apparatus for processing packed data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US61435390A | 1990-11-15 | 1990-11-15 | |
US08/032,764 US5909552A (en) | 1990-11-15 | 1993-03-16 | Method and apparatus for processing packed data |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US61435390A Continuation | 1990-11-15 | 1990-11-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
US5909552A true US5909552A (en) | 1999-06-01 |
Family
ID=24460888
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US08/032,764 Expired - Fee Related US5909552A (en) | 1990-11-15 | 1993-03-16 | Method and apparatus for processing packed data |
Country Status (3)
Country | Link |
---|---|
US (1) | US5909552A (en) |
EP (1) | EP0486143A3 (en) |
JP (1) | JP2601960B2 (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6243803B1 (en) * | 1998-03-31 | 2001-06-05 | Intel Corporation | Method and apparatus for computing a packed absolute differences with plurality of sign bits using SIMD add circuitry |
US6295561B1 (en) * | 1998-06-30 | 2001-09-25 | At&T Corp | System for translating native data structures and specific message structures by using template represented data structures on communication media and host machines |
US20020059355A1 (en) * | 1995-08-31 | 2002-05-16 | Intel Corporation | Method and apparatus for performing multiply-add operations on packed data |
US6516406B1 (en) * | 1994-12-02 | 2003-02-04 | Intel Corporation | Processor executing unpack instruction to interleave data elements from two packed data |
US20030200373A1 (en) * | 2001-12-07 | 2003-10-23 | David Kent | Computer system component |
US20040073589A1 (en) * | 2001-10-29 | 2004-04-15 | Eric Debes | Method and apparatus for performing multiply-add operations on packed byte data |
US6751725B2 (en) | 1995-12-19 | 2004-06-15 | Intel Corporation | Methods and apparatuses to clear state for operation of a stack |
US20040117422A1 (en) * | 1995-08-31 | 2004-06-17 | Eric Debes | Method and apparatus for performing multiply-add operations on packed data |
US20050038977A1 (en) * | 1995-12-19 | 2005-02-17 | Glew Andrew F. | Processor with instructions that operate on different data types stored in the same single logical register file |
US7516307B2 (en) | 1998-03-31 | 2009-04-07 | Intel Corporation | Processor for computing a packed sum of absolute differences and packed multiply-add |
US20130275728A1 (en) * | 2011-12-22 | 2013-10-17 | Intel Corporation | Packed data operation mask register arithmetic combination processors, methods, systems, and instructions |
WO2023020984A1 (en) * | 2021-08-19 | 2023-02-23 | International Business Machines Corporation | Masked shifted add operation |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5651121A (en) * | 1992-12-18 | 1997-07-22 | Xerox Corporation | Using mask operand obtained from composite operand to perform logic operation in parallel with composite operand |
US5408670A (en) * | 1992-12-18 | 1995-04-18 | Xerox Corporation | Performing arithmetic in parallel on composite operands with packed multi-bit components |
US5375080A (en) * | 1992-12-18 | 1994-12-20 | Xerox Corporation | Performing arithmetic on composite operands to obtain a binary outcome for each multi-bit component |
US5437045A (en) * | 1992-12-18 | 1995-07-25 | Xerox Corporation | Parallel processing with subsampling/spreading circuitry and data transfer circuitry to and from any processing unit |
US5689592A (en) * | 1993-12-22 | 1997-11-18 | Vivo Software, Inc. | Parallel processing of digital signals in a single arithmetic/logic unit |
EP0661624A1 (en) * | 1994-01-04 | 1995-07-05 | Sun Microsystems, Inc. | Pseudo-superscalar technique for video processing |
US5598362A (en) * | 1994-12-22 | 1997-01-28 | Motorola Inc. | Apparatus and method for performing both 24 bit and 16 bit arithmetic |
GB2362731B (en) * | 2000-05-23 | 2004-10-06 | Advanced Risc Mach Ltd | Parallel processing of multiple data values within a data word |
EP4235399A1 (en) * | 2022-02-28 | 2023-08-30 | Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CMS) | Method for the computation of a narrow bit width linear algebra operation |
Citations (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4064556A (en) * | 1975-06-23 | 1977-12-20 | Sperry Rand Corporation | Packed loop memory with data manipulation capabilities |
US4321668A (en) * | 1979-01-02 | 1982-03-23 | Honeywell Information Systems Inc. | Prediction of number of data words transferred and the cycle at which data is available |
JPS57164334A (en) * | 1981-04-02 | 1982-10-08 | Nec Corp | Operating device |
US4371923A (en) * | 1970-12-28 | 1983-02-01 | Hyatt Gilbert P | Computer system architecture |
US4569016A (en) * | 1983-06-30 | 1986-02-04 | International Business Machines Corporation | Mechanism for implementing one machine cycle executable mask and rotate instructions in a primitive instruction set computing system |
US4595911A (en) * | 1983-07-14 | 1986-06-17 | Sperry Corporation | Programmable data reformat system |
US4598365A (en) * | 1983-04-01 | 1986-07-01 | Honeywell Information Systems Inc. | Pipelined decimal character execution unit |
USH472H (en) * | 1987-01-12 | 1988-05-03 | Method and apparatus for processing binary-coded/packed decimal data | |
US4780842A (en) * | 1986-03-26 | 1988-10-25 | Alcatel Usa, Corp. | Cellular processor apparatus capable of performing floating point arithmetic operations |
EP0296457A2 (en) * | 1987-06-26 | 1988-12-28 | International Business Machines Corporation | A high performance parallel binary byte adder |
US4878166A (en) * | 1987-12-15 | 1989-10-31 | Advanced Micro Devices, Inc. | Direct memory access apparatus and methods for transferring data between buses having different performance characteristics |
US4918647A (en) * | 1986-04-05 | 1990-04-17 | Burr-Brown Limited | Programmable interface unit which generates dedicated control signals in response to a single control word |
US4942516A (en) * | 1970-12-28 | 1990-07-17 | Hyatt Gilbert P | Single chip integrated circuit computer architecture |
US4949246A (en) * | 1988-06-23 | 1990-08-14 | Ncr Corporation | Adapter for transmission of data words of different lengths |
US4963867A (en) * | 1989-03-31 | 1990-10-16 | Ampex Corporation | Apparatus for packing parallel data words having a variable width into parallel data words having a fixed width |
US5040136A (en) * | 1989-10-23 | 1991-08-13 | Nec Corporation | Arithmetic circuit for calculating and accumulating absolute values of the difference between two numerical values |
US5068819A (en) * | 1988-06-23 | 1991-11-26 | International Business Machines Corporation | Floating point apparatus with concurrent input/output operations |
US5079548A (en) * | 1989-09-20 | 1992-01-07 | Fujitsu Limited | Data packing circuit in variable length coder |
US5113516A (en) * | 1989-07-31 | 1992-05-12 | North American Philips Corporation | Data repacker having controlled feedback shifters and registers for changing data format |
US5123091A (en) * | 1987-08-13 | 1992-06-16 | Digital Equipment Corporation | Data processing system and method for packetizing data from peripherals |
US5140322A (en) * | 1990-08-28 | 1992-08-18 | Ricoh Company, Ltd. | Coder unit for variable word length code |
US5146220A (en) * | 1990-04-05 | 1992-09-08 | Canon Kabushiki Kaisha | Data conversion method and apparatus for converting undefined length data to fixed length data |
US5162795A (en) * | 1990-03-28 | 1992-11-10 | Sony Corporation | Coding and decoding apparatus of variable length data |
US5189636A (en) * | 1987-11-16 | 1993-02-23 | Intel Corporation | Dual mode combining circuitry |
US5237701A (en) * | 1989-03-31 | 1993-08-17 | Ampex Systems Corporation | Data unpacker using a pack ratio control signal for unpacked parallel fixed m-bit width into parallel variable n-bit width word |
US5251321A (en) * | 1990-06-20 | 1993-10-05 | Bull Hn Information Systems Inc. | Binary to binary coded decimal and binary coded decimal to binary conversion in a VLSI central processing unit |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS61224037A (en) * | 1985-03-29 | 1986-10-04 | Fujitsu Ltd | Arithmetic system using multiple-bit input of arithmetic element |
JPH04116718A (en) * | 1990-09-07 | 1992-04-17 | Toshiba Corp | Arithmetic unit |
-
1991
- 1991-09-09 JP JP3257058A patent/JP2601960B2/en not_active Expired - Lifetime
- 1991-09-30 EP EP19910308961 patent/EP0486143A3/en not_active Withdrawn
-
1993
- 1993-03-16 US US08/032,764 patent/US5909552A/en not_active Expired - Fee Related
Patent Citations (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4371923A (en) * | 1970-12-28 | 1983-02-01 | Hyatt Gilbert P | Computer system architecture |
US4942516A (en) * | 1970-12-28 | 1990-07-17 | Hyatt Gilbert P | Single chip integrated circuit computer architecture |
US4064556A (en) * | 1975-06-23 | 1977-12-20 | Sperry Rand Corporation | Packed loop memory with data manipulation capabilities |
US4321668A (en) * | 1979-01-02 | 1982-03-23 | Honeywell Information Systems Inc. | Prediction of number of data words transferred and the cycle at which data is available |
JPS57164334A (en) * | 1981-04-02 | 1982-10-08 | Nec Corp | Operating device |
US4598365A (en) * | 1983-04-01 | 1986-07-01 | Honeywell Information Systems Inc. | Pipelined decimal character execution unit |
US4569016A (en) * | 1983-06-30 | 1986-02-04 | International Business Machines Corporation | Mechanism for implementing one machine cycle executable mask and rotate instructions in a primitive instruction set computing system |
US4595911A (en) * | 1983-07-14 | 1986-06-17 | Sperry Corporation | Programmable data reformat system |
US4780842A (en) * | 1986-03-26 | 1988-10-25 | Alcatel Usa, Corp. | Cellular processor apparatus capable of performing floating point arithmetic operations |
US4918647A (en) * | 1986-04-05 | 1990-04-17 | Burr-Brown Limited | Programmable interface unit which generates dedicated control signals in response to a single control word |
USH472H (en) * | 1987-01-12 | 1988-05-03 | Method and apparatus for processing binary-coded/packed decimal data | |
EP0296457A2 (en) * | 1987-06-26 | 1988-12-28 | International Business Machines Corporation | A high performance parallel binary byte adder |
US5123091A (en) * | 1987-08-13 | 1992-06-16 | Digital Equipment Corporation | Data processing system and method for packetizing data from peripherals |
US5189636A (en) * | 1987-11-16 | 1993-02-23 | Intel Corporation | Dual mode combining circuitry |
US4878166A (en) * | 1987-12-15 | 1989-10-31 | Advanced Micro Devices, Inc. | Direct memory access apparatus and methods for transferring data between buses having different performance characteristics |
US5068819A (en) * | 1988-06-23 | 1991-11-26 | International Business Machines Corporation | Floating point apparatus with concurrent input/output operations |
US4949246A (en) * | 1988-06-23 | 1990-08-14 | Ncr Corporation | Adapter for transmission of data words of different lengths |
US4963867A (en) * | 1989-03-31 | 1990-10-16 | Ampex Corporation | Apparatus for packing parallel data words having a variable width into parallel data words having a fixed width |
US5237701A (en) * | 1989-03-31 | 1993-08-17 | Ampex Systems Corporation | Data unpacker using a pack ratio control signal for unpacked parallel fixed m-bit width into parallel variable n-bit width word |
US5113516A (en) * | 1989-07-31 | 1992-05-12 | North American Philips Corporation | Data repacker having controlled feedback shifters and registers for changing data format |
US5079548A (en) * | 1989-09-20 | 1992-01-07 | Fujitsu Limited | Data packing circuit in variable length coder |
US5040136A (en) * | 1989-10-23 | 1991-08-13 | Nec Corporation | Arithmetic circuit for calculating and accumulating absolute values of the difference between two numerical values |
US5162795A (en) * | 1990-03-28 | 1992-11-10 | Sony Corporation | Coding and decoding apparatus of variable length data |
US5146220A (en) * | 1990-04-05 | 1992-09-08 | Canon Kabushiki Kaisha | Data conversion method and apparatus for converting undefined length data to fixed length data |
US5251321A (en) * | 1990-06-20 | 1993-10-05 | Bull Hn Information Systems Inc. | Binary to binary coded decimal and binary coded decimal to binary conversion in a VLSI central processing unit |
US5140322A (en) * | 1990-08-28 | 1992-08-18 | Ricoh Company, Ltd. | Coder unit for variable word length code |
Non-Patent Citations (12)
Title |
---|
IBM TDB, "Partial Byte Computations," vol. 17, No. 17, Dec. 1974, pp. 1931-1932. |
IBM Technical Disclosure Bulletin, vol. 23, No. 9, pp. 4357-4360, Feb. 1981 "Double-Speed, Single-Precision Vector Register Organization Using Double-Port Chips". |
IBM Technical Disclosure Bulletin, vol. 27, No. 11, pp. 6777-6778, Apr. 1985, "Technique for Halving Vector Load Times". |
IBM Technical Disclosure Bulletin, vol. 31, No. 8, pp. 204-205, Jan. 1989, "Fastener Decimal Division". |
IBM Technical Disclosure Bulletin, vol. 32, No. 3A, pp. 325-329, Aug. 1989, "Easy Biased Exponent Handling Via 2's Complement Arithmetic". |
The 11th Annual International Symposium on Computer Architecture, Jul. 1984, Michigan, "A High Performance Factoring Machine", Rudd et al, pp. 297-300. |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8190867B2 (en) | 1994-12-02 | 2012-05-29 | Intel Corporation | Packing two packed signed data in registers with saturation |
US9389858B2 (en) | 1994-12-02 | 2016-07-12 | Intel Corporation | Orderly storing of corresponding packed bytes from first and second source registers in result register |
US9361100B2 (en) | 1994-12-02 | 2016-06-07 | Intel Corporation | Packing saturated lower 8-bit elements from two source registers of packed 16-bit elements |
US6516406B1 (en) * | 1994-12-02 | 2003-02-04 | Intel Corporation | Processor executing unpack instruction to interleave data elements from two packed data |
US20030131219A1 (en) * | 1994-12-02 | 2003-07-10 | Alexander Peleg | Method and apparatus for unpacking packed data |
US9223572B2 (en) | 1994-12-02 | 2015-12-29 | Intel Corporation | Interleaving half of packed data elements of size specified in instruction and stored in two source registers |
US9182983B2 (en) | 1994-12-02 | 2015-11-10 | Intel Corporation | Executing unpack instruction and pack instruction with saturation on packed data elements from two source operand registers |
US9141387B2 (en) | 1994-12-02 | 2015-09-22 | Intel Corporation | Processor executing unpack and pack instructions specifying two source packed data operands and saturation |
US9116687B2 (en) | 1994-12-02 | 2015-08-25 | Intel Corporation | Packing in destination register half of each element with saturation from two source packed data registers |
US9015453B2 (en) | 1994-12-02 | 2015-04-21 | Intel Corporation | Packing odd bytes from two source registers of packed data |
US8838946B2 (en) | 1994-12-02 | 2014-09-16 | Intel Corporation | Packing lower half bits of signed data elements in two source registers in a destination register with saturation |
US8793475B2 (en) | 1994-12-02 | 2014-07-29 | Intel Corporation | Method and apparatus for unpacking and moving packed data |
US20060236076A1 (en) * | 1994-12-02 | 2006-10-19 | Alexander Peleg | Method and apparatus for packing data |
US8639914B2 (en) | 1994-12-02 | 2014-01-28 | Intel Corporation | Packing signed word elements from two source registers to saturated signed byte elements in destination register |
US8601246B2 (en) | 1994-12-02 | 2013-12-03 | Intel Corporation | Execution of instruction with element size control bit to interleavingly store half packed data elements of source registers in same size destination register |
US8521994B2 (en) | 1994-12-02 | 2013-08-27 | Intel Corporation | Interleaving corresponding data elements from part of two source registers to destination register in processor operable to perform saturation |
US8495346B2 (en) | 1994-12-02 | 2013-07-23 | Intel Corporation | Processor executing pack and unpack instructions |
US20110093682A1 (en) * | 1994-12-02 | 2011-04-21 | Alexander Peleg | Method and apparatus for packing data |
US7966482B2 (en) | 1994-12-02 | 2011-06-21 | Intel Corporation | Interleaving saturated lower half of data elements from two source registers of packed data |
US20110219214A1 (en) * | 1994-12-02 | 2011-09-08 | Alexander Peleg | Microprocessor having novel operations |
US8396915B2 (en) | 1995-08-31 | 2013-03-12 | Intel Corporation | Processor for performing multiply-add operations on packed data |
US8725787B2 (en) | 1995-08-31 | 2014-05-13 | Intel Corporation | Processor for performing multiply-add operations on packed data |
US20040117422A1 (en) * | 1995-08-31 | 2004-06-17 | Eric Debes | Method and apparatus for performing multiply-add operations on packed data |
US20090265409A1 (en) * | 1995-08-31 | 2009-10-22 | Peleg Alexander D | Processor for performing multiply-add operations on packed data |
US8495123B2 (en) | 1995-08-31 | 2013-07-23 | Intel Corporation | Processor for performing multiply-add operations on packed data |
US8185571B2 (en) | 1995-08-31 | 2012-05-22 | Intel Corporation | Processor for performing multiply-add operations on packed data |
US20020059355A1 (en) * | 1995-08-31 | 2002-05-16 | Intel Corporation | Method and apparatus for performing multiply-add operations on packed data |
US8793299B2 (en) | 1995-08-31 | 2014-07-29 | Intel Corporation | Processor for performing multiply-add operations on packed data |
US8626814B2 (en) | 1995-08-31 | 2014-01-07 | Intel Corporation | Method and apparatus for performing multiply-add operations on packed data |
US8745119B2 (en) | 1995-08-31 | 2014-06-03 | Intel Corporation | Processor for performing multiply-add operations on packed data |
US7149882B2 (en) | 1995-12-19 | 2006-12-12 | Intel Corporation | Processor with instructions that operate on different data types stored in the same single logical register file |
US6751725B2 (en) | 1995-12-19 | 2004-06-15 | Intel Corporation | Methods and apparatuses to clear state for operation of a stack |
US7373490B2 (en) | 1995-12-19 | 2008-05-13 | Intel Corporation | Emptying packed data state during execution of packed data instructions |
US20050038977A1 (en) * | 1995-12-19 | 2005-02-17 | Glew Andrew F. | Processor with instructions that operate on different data types stored in the same single logical register file |
US20040181649A1 (en) * | 1995-12-19 | 2004-09-16 | David Bistry | Emptying packed data state during execution of packed data instructions |
US7516307B2 (en) | 1998-03-31 | 2009-04-07 | Intel Corporation | Processor for computing a packed sum of absolute differences and packed multiply-add |
US6243803B1 (en) * | 1998-03-31 | 2001-06-05 | Intel Corporation | Method and apparatus for computing a packed absolute differences with plurality of sign bits using SIMD add circuitry |
US6295561B1 (en) * | 1998-06-30 | 2001-09-25 | At&T Corp | System for translating native data structures and specific message structures by using template represented data structures on communication media and host machines |
US20040073589A1 (en) * | 2001-10-29 | 2004-04-15 | Eric Debes | Method and apparatus for performing multiply-add operations on packed byte data |
US20030200373A1 (en) * | 2001-12-07 | 2003-10-23 | David Kent | Computer system component |
US6954818B2 (en) * | 2001-12-07 | 2005-10-11 | Renesas Technology Corp. | Providing a burst mode data transfer proxy for bridging a bus |
CN104126170A (en) * | 2011-12-22 | 2014-10-29 | Intel Corporation | Packed data operation mask register arithmetic combination processors, methods, systems and instructions |
US20130275728A1 (en) * | 2011-12-22 | 2013-10-17 | Intel Corporation | Packed data operation mask register arithmetic combination processors, methods, systems, and instructions |
US9760371B2 (en) * | 2011-12-22 | 2017-09-12 | Intel Corporation | Packed data operation mask register arithmetic combination processors, methods, systems, and instructions |
CN104126170B (en) * | 2011-12-22 | 2018-05-18 | Intel Corporation | Packaged data operation mask register arithmetic combining processor, method, system and instruction |
WO2023020984A1 (en) * | 2021-08-19 | 2023-02-23 | International Business Machines Corporation | Masked shifted add operation |
US20230075534A1 (en) * | 2021-08-19 | 2023-03-09 | International Business Machines Corporation | Masked shifted add operation |
Also Published As
Publication number | Publication date |
---|---|
EP0486143A2 (en) | 1992-05-20 |
JPH05204605A (en) | 1993-08-13 |
EP0486143A3 (en) | 1993-03-24 |
JP2601960B2 (en) | 1997-04-23 |
Similar Documents
Publication | Title |
---|---|
US5909552A (en) | Method and apparatus for processing packed data |
USRE44190E1 (en) | Long instruction word controlling plural independent processor operations |
US6009451A (en) | Method for generating barrel shifter result flags directly from input data |
US6173394B1 (en) | Instruction having bit field designating status bits protected from modification corresponding to arithmetic logic unit result |
US5001662A (en) | Method and apparatus for multi-gauge computation |
US6219688B1 (en) | Method, apparatus and system for sum of plural absolute differences |
US7003542B2 (en) | Apparatus and method for inverting a 4×4 matrix |
EP0656582B1 (en) | Parallel adding and averaging circuit and method |
US20060149804A1 (en) | Multiply-sum dot product instruction with mask and splat |
JP2022545414A (en) | Coprocessor for cryptographic operations |
JP2006107463A (en) | Apparatus for performing multiply-add operations on packed data |
JPH07210368A (en) | Efficient handling method by hardware of positive and negative overflows generated as result of arithmetic operation |
JPH04172533A (en) | Electronic computer |
US5717616A (en) | Computer hardware instruction and method for computing population counts |
US5767867A (en) | Method for alpha blending images utilizing a visual instruction set |
Lee et al. | AIR: Iterative refinement acceleration using arbitrary dynamic precision |
US5386534A (en) | Data processing system for generating symmetrical range of addresses of instructing-address-value with the use of inverting sign value |
GB2262637A (en) | Padding scheme for optimized multiplication. |
KR20030034213A (en) | Apparatus, methods, and compilers enabling processing of multiple signed independent data elements per register |
US7051062B2 (en) | Apparatus and method for adding multiple-bit binary-strings |
JP2000039995A (en) | Flexible accumulate register file to be used in high performance microprocessor |
US4947364A (en) | Method in a computing system for performing a multiplication |
WO2002035471A1 (en) | Image processing apparatus using a cascade of poly-point operations |
US7653674B2 (en) | Parallel operations on multiple signed elements in a register |
Voorhies | Reduced-complexity graphics |
Legal Events
Code | Title | Description |
---|---|---|
FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
FPAY | Fee payment | Year of fee payment: 4 |
REMI | Maintenance fee reminder mailed | |
LAPS | Lapse for failure to pay maintenance fees | |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 2007-06-01 |