US6272512B1 - Data manipulation instruction for enhancing value and efficiency of complex arithmetic - Google Patents
 Publication number
 US6272512B1 (application US09/170,473)
 Authority
 US
 United States
 Prior art keywords
 result
 data
 complex
 processor
 operand
 Prior art date
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Expired - Lifetime
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for program control, e.g. control units
 G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30003—Arrangements for executing specific machine instructions
 G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
 G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for program control, e.g. control units
 G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30003—Arrangements for executing specific machine instructions
 G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
 G06F9/30025—Format conversion instructions, e.g. FloatingPoint to Integer, decimal conversion

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for program control, e.g. control units
 G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30003—Arrangements for executing specific machine instructions
 G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
 G06F9/30036—Instructions to perform operations on packed data, e.g. vector operations

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for program control, e.g. control units
 G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
 G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
 G06F9/30098—Register arrangements
 G06F9/30105—Register structure
 G06F9/30112—Register structure for variable length data, e.g. single or double registers

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F7/76—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
 G06F7/768—Data position reversal, e.g. bit reversal, byte swapping
Abstract
A method and apparatus for performing complex arithmetic is disclosed. In one embodiment, a method comprises decoding a single instruction, and in response to decoding the single instruction, moving a first operand occupying lower order bits of a first storage area to higher order bits of a result, moving a second operand occupying higher order bits of a second storage area to lower order bits of the result, and negating one of the first and second operands of the result.
Description
1. Field of the Invention
The present invention relates generally to the field of computer systems, and specifically, to a data manipulation instruction for enhancing value and efficiency of performing complex arithmetic instructions.
2. Background Information
To improve the efficiency of multimedia applications, as well as other applications with similar characteristics, a Single Instruction, Multiple Data (SIMD) architecture has been implemented in computer systems to enable one instruction to operate on several operands simultaneously, rather than on a single operand. In particular, SIMD architectures take advantage of packing many data elements within one register or memory location. With parallel hardware execution, multiple operations can be performed on separate data elements with one instruction, resulting in a significant performance improvement. The SIMD architecture applies to both integer and floating-point operands.
The SIMD data format of packing data elements within a register or memory location is a natural format for representing complex data. That is, first and second data elements of an operand may comprise real and imaginary components of the complex number, respectively. Many applications require the multiplication of complex numbers such as, for example, signal processing applications. To increase the efficiency of these applications, it is therefore desirable to reduce the number of instructions required for performing a complex multiply.
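For reference, the product of two complex numbers A = A_r + iA_i and B = B_r + iB_i expands by standard algebra as shown below. The subscript notation is chosen here for illustration. Computed with scalar operations, the product takes four multiplications plus one subtraction and one addition, which is the cost a packed approach aims to reduce.

```latex
(A_r + iA_i)(B_r + iB_i) = (A_r B_r - A_i B_i) + i\,(A_r B_i + A_i B_r)
```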
The present invention comprises a method and apparatus for performing complex arithmetic. In one embodiment, a method comprises decoding a single instruction, and in response to decoding the single instruction, moving a first operand occupying lower order bits of a first storage area to higher order bits of a result, moving a second operand occupying higher order bits of a second storage area to lower order bits of the result, and negating one of the first and second operands of the result.
FIG. 1 is a block diagram illustrating an exemplary computer system according to one embodiment of the invention.
FIGS. 2A-2C illustrate floating-point swap instructions for performing complex arithmetic according to one embodiment of the present invention.
FIG. 3A illustrates a technique for performing a complex multiply operation using little endian byte ordering according to one embodiment of the present invention.
FIG. 3B illustrates a technique for performing a complex multiply operation using big endian byte ordering according to one embodiment of the present invention.
FIG. 4 illustrates a technique for performing a complex multiply operation where one of the operands is reused according to one embodiment of the present invention.
FIG. 1 is a block diagram illustrating an exemplary computer system 100 according to one embodiment of the invention. The exemplary computer system 100 includes a processor 105, a storage device 110, and a bus 115. The processor 105 is coupled to the storage device 110 by the bus 115. In addition, a number of user input/output devices, such as a keyboard 120 and a display 125, are also coupled to the bus 115. The processor 105 represents a central processing unit of any type of architecture, such as a CISC, RISC, VLIW, or hybrid architecture. In addition, the processor 105 could be implemented on one or more chips. The storage device 110 represents one or more mechanisms for storing data. For example, the storage device 110 may include read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage mediums, optical storage mediums, flash memory devices, and/or other machine-readable mediums. The bus 115 represents one or more busses (e.g., PCI, ISA, X-Bus, EISA, VESA, etc.) and bridges (also termed bus controllers). While this embodiment is described in relation to a single processor computer system, the invention could be implemented in a multiprocessor computer system. In addition, while this embodiment is described in relation to a 64-bit computer system, the invention is not limited to a 64-bit computer system.
In addition to other devices, one or more of a network 130, a TV broadcast signal receiver 132, a fax/modem 134, a digitizing unit 136, and a sound unit 138 may optionally be coupled to bus 115. The network 130 represents one or more network connections (e.g., an Ethernet connection), the TV broadcast signal receiver 132 represents a device for receiving TV broadcast signals, and the fax/modem 134 represents a fax and/or modem for receiving and/or transmitting analog signals representing data. The digitizing unit 136 represents one or more devices for digitizing images (e.g., a scanner, camera, etc.). The sound unit 138 represents one or more devices for inputting and/or outputting sound (e.g., microphones, speakers, magnetic storage devices, optical storage devices, etc.). An analog-to-digital converter (not shown) may optionally be coupled to the bus 115 for converting complex values received externally into digital form. These complex values may be received as a result of, for example, a signal processing application (e.g., sonar, radar, seismology, speech communication, data communication, etc.) running on the computer system 100.
FIG. 1 also illustrates that the storage device 110 has stored therein, among other data formats, complex data 140 and software 145. Software 145 represents the necessary code for performing any and/or all of the techniques described with reference to FIGS. 2 through 5. Of course, the storage device 110 preferably contains additional software (not shown), which is not necessary to understanding the invention.
FIG. 1 additionally illustrates that the processor 105 includes a decode unit 150, a set of registers 155, an execution unit 160, and an internal bus 165 for executing instructions. Of course, the processor 105 contains additional circuitry, which is not necessary to understanding the invention. The decode unit 150, registers 155, and execution unit 160 are coupled together by internal bus 165. The decode unit 150 is used for decoding instructions received by processor 105 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, the execution unit 160 performs the appropriate operations. The decode unit 150 may be implemented using any number of different mechanisms (e.g., a lookup table, a hardware implementation, a PLA, etc.).
The decode unit 150 is shown including a data manipulation instruction set 170 for performing operations on packed data. In one embodiment, the data manipulation instruction set 170 includes floating-point swap instructions 175. The floating-point swap instructions include a floating-point swap (“FSWAP”), floating-point swap negate-left (“FSWAPNL”), and floating-point swap negate-right (“FSWAPNR”) instructions, as will be further described herein. While the floating-point swap instructions 175 can be implemented to perform any number of different operations, in one embodiment they operate on packed data. Furthermore, in one embodiment, the processor 105 is a pipelined processor (e.g., the Pentium® II processor) capable of completing one or more of these data manipulation instructions per clock cycle (ignoring any data dependencies and pipeline freezes). In addition to the data manipulation instructions, processor 105 can include new instructions and/or instructions similar to or the same as those found in existing general-purpose processors. For example, in one embodiment the processor 105 supports an instruction set which is compatible with the Intel® Architecture instruction set used by existing processors, such as the Pentium® II processor. Alternative embodiments of the invention may contain more or fewer, as well as different, data manipulation instructions and still utilize the teachings of the invention.
The registers 155 represent a storage area on processor 105 for storing information, including control/status information, packed integer data, and packed floating-point data. It is understood that one aspect of the invention is the described floating-point data manipulation instructions for operating on packed data. According to this aspect of the invention, the storage area used for storing the packed data is not critical. The term data processing system is used herein to refer to any machine for processing data, including the computer system(s) described with reference to FIG. 1. The term operand as used herein refers to the data on which an instruction operates.
Moreover, the floating-point instructions operate on packed data located in floating-point registers and/or memory. When floating-point values are stored in memory, they can be stored as single precision format (32 bits), double precision format (64 bits), double extended precision format (80 bits), etc. In one embodiment, a floating-point register is eighty-two (82) bits wide to store an unpacked floating-point value in extended precision format. However, in the case of a packed floating-point value having first and second data elements, each data element is stored in the floating-point register as single precision format (32 bits) to occupy bits 0-63 of the floating-point register. In such a case, the highest order bits (bits 64-81) of the floating-point register are ignored.
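As an illustration of the packed layout just described, the following Python sketch packs two single-precision values into one 64-bit integer, with the first data element in the low 32 bits and the second in the high 32 bits. The helper names (`f32_bits`, `bits_f32`, `pack2`, `unpack2`) are hypothetical and not part of the disclosure, and the integer merely stands in for the low 64 bits of a floating-point register.

```python
import struct

MASK32 = 0xFFFFFFFF

def f32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Interpret the low 32 bits of b as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b & MASK32))[0]

def pack2(low, high):
    """Pack two singles: `low` in the low 32 bits, `high` in the high 32 bits."""
    return (f32_bits(high) << 32) | f32_bits(low)

def unpack2(packed):
    """Return (low element, high element); bits above the low 64 are ignored."""
    return bits_f32(packed), bits_f32(packed >> 32)

reg = pack2(1.5, -2.0)
print(hex(reg))      # 0xc00000003fc00000
print(unpack2(reg))  # (1.5, -2.0)
```

The values 1.5 and -2.0 are exactly representable in single precision, so the round trip is lossless here; arbitrary Python floats would be rounded to single precision on packing.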
FIGS. 2A-2C illustrate floating-point swap instructions for performing complex arithmetic according to one embodiment of the present invention. Referring to FIG. 2A, a first operand F1 occupies the lower order bits (bits 0-31) of a first storage area 210 and a second operand F2 occupies the higher order bits (bits 32-63) of a second storage area 220. The FSWAP instruction causes the first operand F1 to be placed in the higher order bits (bits 32-63) of a third storage area 230, and the second operand F2 to be placed in the lower order bits (bits 0-31) of the third storage area 230. In essence, the FSWAP instruction concatenates the first operand F1 with the second operand F2 (in the case where storage areas 210 and 220 are different), and then swaps the concatenated pair.
Referring now to FIG. 2B, a first operand F1 occupies the lower order bits (bits 0-31) of a first storage area 210 and a second operand F2 occupies the higher order bits (bits 32-63) of a second storage area 220. The FSWAPNL instruction causes the first operand F1 to be placed in the higher order bits (bits 32-63) of a third storage area 230, and the most significant bit of the first operand F1 (bit 63) is negated. In addition, the second operand F2 is placed in the lower order bits (bits 0-31) of the third storage area 230. As can be seen, the FSWAPNL instruction concatenates the first operand F1 with the second operand F2 in a third storage area (in the case where storage areas 210 and 220 are different), swaps the concatenated pair, and negates the first operand F1.
Turning now to FIG. 2C, a first operand F1 occupies the lower order bits (bits 0-31) of a first storage area 210 and a second operand F2 occupies the higher order bits (bits 32-63) of a second storage area 220. The FSWAPNR instruction causes the first operand F1 to be placed in the higher order bits (bits 32-63) of a third storage area 230. In addition, the second operand F2 is placed in the lower order bits (bits 0-31) of the third storage area 230, and the most significant bit of the second operand F2 (bit 31) is negated. Thus, the FSWAPNR instruction concatenates the first operand F1 with the second operand F2 in a third storage area 230 (in the case where storage areas 210 and 220 are different), swaps the concatenated pair, and negates the second operand F2.
Continuing to refer to FIGS. 2A-2C, the first, second, and third storage areas 210, 220, and 230 may comprise registers, memory locations, or a combination thereof. The first and second storage areas 210 and 220 may be the same storage area or may comprise different storage areas. The first and second operands F1 and F2 each represent a data element of a packed floating-point value. In the case where the storage areas 210 and 220 are the same storage area, a packed floating-point value comprises operands F1 (bits 0-31) and F2 (bits 32-63). On the other hand, in the case where the storage areas 210 and 220 are different storage areas, the higher order bits (bits 32-63) of the first storage area 210 and the lower order bits (bits 0-31) of the second storage area 220 are not shown because they are “don't care” values. The result F3 represents a packed floating-point value. If the storage area 230 is a floating-point register, then the highest order bits (bits 64-81) are ignored. Additionally, the third storage area 230 may be the same storage area as one of the storage areas 210 and 220. The floating-point swap instructions are especially useful in complex arithmetic, as will be illustrated below.
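The behavior of the three swap instructions in FIGS. 2A-2C can be modeled at the bit level. The sketch below is a software emulation on 64-bit integers, not the hardware implementation; the Python function names mirror the mnemonics, and the assumption is that "negating" an operand means flipping its IEEE 754 sign bit (bit 63 for the upper element, bit 31 for the lower element), as the text describes.

```python
MASK32 = 0xFFFFFFFF
SIGN_HI = 1 << 63  # sign bit of the element held in the upper 32 bits
SIGN_LO = 1 << 31  # sign bit of the element held in the lower 32 bits

def fswap(src1, src2):
    """F1 (low half of src1) goes to the high half; F2 (high half of src2) goes to the low half."""
    f1 = src1 & MASK32
    f2 = (src2 >> 32) & MASK32
    return (f1 << 32) | f2

def fswapnl(src1, src2):
    """Swap, then negate the left (upper) element by flipping bit 63."""
    return fswap(src1, src2) ^ SIGN_HI

def fswapnr(src1, src2):
    """Swap, then negate the right (lower) element by flipping bit 31."""
    return fswap(src1, src2) ^ SIGN_LO

# One storage area used for both sources: upper element 2.0, lower element 1.0.
src = 0x400000003F800000
print(hex(fswap(src, src)))    # 0x3f80000040000000 -> (1.0, 2.0)
print(hex(fswapnl(src, src)))  # 0xbf80000040000000 -> (-1.0, 2.0)
print(hex(fswapnr(src, src)))  # 0x3f800000c0000000 -> (1.0, -2.0)
```

Passing the same value for both sources corresponds to the case where storage areas 210 and 220 are the same; with different sources, the unused halves are simply masked away, matching their "don't care" status.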
Microprocessors follow either the little endian or the big endian byte ordering protocol. The little endian protocol states that the lowest address byte contains the least significant byte of a larger data value, while the highest address byte contains the most significant byte of the larger data value. The big endian protocol is the exact opposite. For complex numbers, the little endian protocol states that the low address byte contains the real component of a complex number whereas the high address byte contains the imaginary component of the complex number. Again, the big endian protocol states the opposite. The FSWAPNL and FSWAPNR instructions are both provided so that the technique can be used with both the little and big endian protocols.
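The byte ordering distinction can be seen with a small Python example that serializes the same 32-bit value both ways using the standard `struct` module. The value `0x11223344` is an arbitrary illustration, not taken from the disclosure.

```python
import struct

value = 0x11223344

little = struct.pack("<I", value)  # least significant byte at the lowest address
big = struct.pack(">I", value)     # most significant byte at the lowest address

print(little.hex())  # 44332211
print(big.hex())     # 11223344
```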
FIG. 3A illustrates a technique for performing a complex multiply operation using little endian byte ordering according to one embodiment of the present invention. In this illustration, data is represented by ovals, while instructions are represented by rectangles.
At block 300, a complex number A and a complex number B are stored in a first packed data item 305 and a second packed data item 310, respectively. The first packed data item 305 stores data elements representing the complex number A in a first format (such that the data elements are Ai, Ar), while the second packed data item 310 stores data elements representing the complex number B in a second format (such that the data elements are Bi, Br). Of course, one or both of these numbers could be real numbers. In such situations, the real number(s) would be stored in these complex formats by storing zero as the imaginary components.
At block 315, a floating-point pack low instruction is performed on the first data element (Ar) of the first packed data item 305 to generate a first intermediate packed data item 320. Similarly, at block 325 a floating-point pack high instruction is performed on the second data element (Ai) of the first packed data item 305 to generate a second intermediate packed data item 330. As a result, the first intermediate packed data item 320 contains first and second data elements each storing Ar (the real component of the complex number A) whereas the second intermediate packed data item 330 contains first and second data elements each storing Ai (the imaginary component of the complex number A).
FIG. 3A also shows the advantage of using the FSWAPNR instruction 335. In particular, the FSWAPNR instruction is performed on the second packed data item 310 to generate a resulting packed data item 340. The FSWAPNR instruction places the first data element (Br) of the second packed data item 310, which occupies the lower data element, in the second data element of the resulting packed data item 340 (i.e., the higher data element). Additionally, the FSWAPNR instruction places the second data element (Bi) of the second packed data item 310, which occupies the higher data element, in the first data element of the resulting packed data item 340 (i.e., the lower data element), and negates the first data element. Thus, the first and second data elements of the resulting packed data item 340 store −Bi and Br, respectively.
At block 345, a floating-point multiply instruction is performed on the resulting packed data item 340 and the second intermediate packed data item 330 to generate a second resulting packed data item 350. In particular, the floating-point multiply instruction multiplies the first data element of the resulting packed data item 340 (−Bi) with the first data element of the second intermediate packed data item 330 (Ai), and multiplies the second data element of the resulting packed data item 340 (Br) with the second data element of the second intermediate packed data item 330 (Ai). The second resulting packed data item 350 contains a first data element storing −AiBi and a second data element storing AiBr.
At block 355, a multiply-add instruction is performed on the first intermediate packed data item 320, the second packed data item 310, and the second resulting packed data item 350. In particular, the multiply-add instruction multiplies the first data elements of the first intermediate packed data item 320 (Ar) with the second packed data item 310 (Br), adds the multiplied data elements to the first data element of the second resulting packed data item 350 (−AiBi), and places the result in a first data element of the final resulting packed data item 360. The multiply-add instruction also multiplies the second data elements of the first intermediate packed data item 320 (Ar) with the second packed data item 310 (Bi), adds the multiplied data elements to the second data element of the second resulting packed data item 350 (AiBr), and places the result in a second data element of the final resulting packed data item 360. Thus, the final resulting packed data item 360 contains the first data element storing ArBr−AiBi (the real component of multiplying together complex numbers A and B), and the second data element storing ArBi+AiBr (the imaginary component of multiplying together complex numbers A and B).
Thus, by using the FSWAPNR instruction together with arranging data representing complex numbers in the appropriate formats, the multiplication of two complex numbers may be performed in five instructions, namely instructions at blocks 315, 325, 335, 345, and 355. This provides a significant performance advantage over prior art techniques of performing complex multiplication. Of course, the advantages of this invention are greater when many such complex multiplication operations are required.
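The five-instruction sequence of FIG. 3A can be traced with a small Python model. This is a sketch under stated assumptions rather than the disclosed hardware: each packed data item is modeled as a tuple `(first element, second element)`, i.e. (lower, higher), and the helpers `fpack_low`, `fpack_high`, `fmul`, and `fma` are hypothetical stand-ins for the pack, multiply, and multiply-add instructions, whose exact encodings the text does not give.

```python
def fpack_low(p):
    """Replicate the first (lower) element into both positions."""
    return (p[0], p[0])

def fpack_high(p):
    """Replicate the second (higher) element into both positions."""
    return (p[1], p[1])

def fswapnr(src1, src2):
    """Swap the elements and negate the one that lands in the lower position."""
    f1 = src1[0]  # lower element of the first source
    f2 = src2[1]  # higher element of the second source
    return (-f2, f1)

def fmul(p, q):
    """Element-wise packed multiply."""
    return (p[0] * q[0], p[1] * q[1])

def fma(p, q, r):
    """Element-wise packed multiply-add: p * q + r."""
    return (p[0] * q[0] + r[0], p[1] * q[1] + r[1])

def complex_multiply(a, b):
    """a and b are (real, imaginary) packed items; returns the product in the same layout."""
    ar_ar = fpack_low(a)            # block 315: (Ar, Ar)
    ai_ai = fpack_high(a)           # block 325: (Ai, Ai)
    swapped = fswapnr(b, b)         # block 335: (-Bi, Br)
    partial = fmul(swapped, ai_ai)  # block 345: (-AiBi, AiBr)
    return fma(ar_ar, b, partial)   # block 355: (ArBr - AiBi, ArBi + AiBr)

print(complex_multiply((1.0, 2.0), (3.0, 4.0)))  # (-5.0, 10.0)
```

The printed result agrees with (1 + 2i)(3 + 4i) = −5 + 10i.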
The block 300 of storing represents a variety of ways of storing the first and second packed data items in the appropriate formats. For example, the complex data may already be stored on a CD-ROM (represented by the storage device 110) in the described formats. In that case, block 300 may be performed by copying the complex data from the CD-ROM into the main memory (also represented by the storage device 110), and then into registers 155 on the processor 105. As another example, the fax/modem 134 (see FIG. 1) connecting the computer system 100 to network 130 may receive complex data and store it in the main memory in one or more of the formats described herein, storing two representations of each of the components of the complex data such that it may be read in as a packed data item in the described formats. This complex data may then be accessed as packed data and copied into registers on the processor 105. Since the data is stored in the disclosed formats, the processor 105 can easily and efficiently perform the complex multiplication (e.g., the processor 105 can access the second packed data item 310 in a single instruction). Although these formats for storing complex numbers require more storage space, the performance advantage for complex multiplication is worth the additional storage space in some situations.
The technique for performing a complex multiply operation as shown in FIG. 3A utilizes the little endian protocol. The same technique can also be used in a system using the big endian protocol, as shown in FIG. 3B. Note that at block 335 of FIG. 3B, the FSWAPNL instruction is used.
FIG. 4 illustrates a technique for performing a complex multiply operation where one of the operands is reused according to one embodiment of the present invention. In this illustration, a complex scalar A is multiplied by a complex vector X[i] and added to a complex vector Y[i], given by the following expression:
$$Y[i] = Y[i] + A \cdot X[i] \qquad (1)$$
This formula is used in many applications including, for example, but not limited or restricted to, signal processing applications (e.g., sonar, radar, seismology, speech communications, data communication, acoustics, etc.), image processing applications, and various other applications.
Referring to FIG. 4, a first packed data item 405 stores data elements representing a complex scalar number A. The first packed data item 405 has two data elements each containing, for example, 32 bits, although other numbers of bits may be used. The data elements of the first packed data item 405 are Ar and Ai.
At block 410, a floating-point pack low instruction is performed on the first data element (Ar) of the first packed data item 405 to generate a first intermediate packed data item 415. Similarly, at block 420 a floating-point pack high instruction is performed on the second data element (Ai) of the first packed data item 405 to generate a second intermediate packed data item 425. As a result, the first intermediate packed data item 415 contains first and second data elements each storing Ar (the real component of the complex number A) whereas the second intermediate packed data item 425 contains first and second data elements each storing Ai (the imaginary component of the complex number A). The packed data items 415 and 425 are reused for performing multiple complex multiplications.
Also shown is a second packed data item 430 representing a first complex vector X[i] and a third packed data item 435 representing a second complex vector Y[i]. The data elements for the second packed data item 430 are Xi and Xr, respectively, and the data elements for the third packed data item 435 are Yi and Yr, respectively. At block 440, a multiply-add instruction is performed on the first intermediate packed data item 415, the second packed data item 430, and the third packed data item 435. That is, the multiply-add instruction multiplies the first data elements of the first intermediate packed data item 415 (Ar) with the second packed data item 430 (Xr), adds the multiplied value to the first data element of the third packed data item 435 (Yr), and places the result in a first data element of a first resulting packed data item 445. The multiply-add instruction also multiplies the second data elements of the first intermediate packed data item 415 (Ar) with the second packed data item 430 (Xi), adds the multiplied value to the second data element of the third packed data item 435 (Yi), and places the result in a second data element of the first resulting packed data item 445. Thus, the first resulting packed data item 445 contains the first data element storing ArXr+Yr, and the second data element storing ArXi+Yi.
At block 450, an FSWAPNR instruction is performed on the second packed data item 430 to generate a second resulting packed data item 455. Note that the FSWAPNR instruction may be performed before, in parallel with, or after the multiply-add instruction 440. In particular, the FSWAPNR instruction places the first data element (Xr) of the second packed data item 430, which occupies the lower data element, in the second data element of the second resulting packed data item 455 (i.e., the higher data element). Additionally, the FSWAPNR instruction places the second data element (Xi) of the second packed data item 430, which occupies the higher data element, in the first data element of the second resulting packed data item 455 (i.e., the lower data element), and negates the first data element. Thus, the first and second data elements of the second resulting packed data item 455 store −Xi and Xr, respectively.
At block 460, a second multiply-add instruction is performed on the second intermediate packed data item 425, the second resulting packed data item 455, and the first resulting packed data item 445. The multiply-add instruction multiplies the first data elements of the second intermediate packed data item 425 (Ai) with the second resulting packed data item 455 (−Xi), adds the multiplied value to the first data element of the first resulting packed data item 445 (ArXr+Yr), and places the result in a first data element of a final resulting packed data item 465. The multiply-add instruction also multiplies the second data elements of the second intermediate packed data item 425 (Ai) with the second resulting packed data item 455 (Xr), adds the multiplied value to the second data element of the first resulting packed data item 445 (ArXi+Yi), and places the result in a second data element of the final resulting packed data item 465. Thus, the final resulting packed data item 465 contains the first data element storing ArXr−AiXi+Yr (the real component of equation (1)), and the second data element storing AiXr+ArXi+Yi (the imaginary component of equation (1)).
It must be noted that the final resulting packed data item 465 may be stored in the third packed data item 435 to reflect the updated Y[i] in the left-hand side of equation (1). This updated complex vector Y[i] is then used with the complex scalar A and the new X[i] to calculate a new Y[i], and so on. As can be seen from equation (1) and FIG. 4, it takes five instructions (blocks 410, 420, 440, 450, and 460) to calculate the vector Y[i] the first time. Thereafter, it only takes three instructions (blocks 440, 450, and 460) to calculate a next Y[i] because the data items 415 and 425 (the real and imaginary components of the scalar A) are reused after they are loaded the first time. As such, a further performance advantage is realized in looping operations.
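The looping pattern of FIG. 4 can be sketched in Python as follows. As before, this is an illustrative model and not the disclosed hardware: packed data items are tuples `(real, imaginary)` with the real component in the lower position, and `fswapnr`, `fma`, and `accumulate` are hypothetical helper names. The two pack operations on A run once, and each loop iteration then uses three operations, mirroring blocks 440, 450, and 460.

```python
def fswapnr(src1, src2):
    """Swap the elements and negate the one that lands in the lower position."""
    return (-src2[1], src1[0])

def fma(p, q, r):
    """Element-wise packed multiply-add: p * q + r."""
    return (p[0] * q[0] + r[0], p[1] * q[1] + r[1])

def accumulate(a, xs, ys):
    """Return the list of Y[i] + A * X[i] for packed (real, imaginary) items."""
    ar_ar = (a[0], a[0])  # block 410: pack low, done once
    ai_ai = (a[1], a[1])  # block 420: pack high, done once
    out = []
    for x, y in zip(xs, ys):
        t = fma(ar_ar, x, y)          # block 440: (ArXr + Yr, ArXi + Yi)
        s = fswapnr(x, x)             # block 450: (-Xi, Xr)
        out.append(fma(ai_ai, s, t))  # block 460: adds (-AiXi, AiXr)
    return out

a = (1.0, 2.0)                  # A = 1 + 2i
xs = [(1.0, 1.0), (2.0, -1.0)]  # X = [1 + i, 2 - i]
ys = [(0.5, 0.0), (0.0, 1.0)]   # Y = [0.5, i]
print(accumulate(a, xs, ys))    # [(-0.5, 3.0), (4.0, 4.0)]
```

In this model the FSWAPNR step depends only on X[i], which is consistent with the note that it may be issued before, in parallel with, or after the first multiply-add.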
In the embodiments illustrating the present invention, the processor 105, executing the FSWAP, FSWAPNL, and FSWAPNR instructions, operated on packed data in “packed double word” format, i.e., two data elements per operand or register. However, it is to be appreciated that the processor 105 can operate on packed data in other packed data formats. The processor can operate on packed data having more than two data elements per register and/or memory location. In one illustration, the processor can operate on packed data having four data elements in a 128-bit register. Other packed formats and/or register sizes are possible and within the scope of the present invention.
One application of the present invention involves speech communication and/or recognition. In such an application, an audio signal is recorded by the microphone of the sound unit 138 (or is received by the fax/modem 134) and converted into a digital audio stream by the analog-to-digital converter of the sound unit 138 for storage in the storage device 110. A filtering operation is then performed on the digital audio stream (which represents the audio signal) to smooth out the audio signal or to recognize the speech. The filtering operation may be performed using a fast Fourier transform (e.g., a radix-2 butterfly). The FSWAPNL and FSWAPNR instructions are used, as illustrated in FIGS. 3A, 3B, and 4, to perform complex multiplications during the filtering operation. The filtered digital audio stream is then transmitted to the sound unit 138, which converts the filtered audio stream into a filtered analog signal and outputs the audio signal to the speaker of the sound unit 138. In the case of speech recognition, the filtered audio stream is then compared with a glossary of predetermined terms stored in the storage device 110 to determine whether the audio signal is a recognized command.
In another embodiment involving video communications, a video signal is received by the digitizing unit 136 which converts the video signal into a digital video stream (represented by complex numbers) for storage. A filtering operation may also be performed on the digital video stream which involves the multiplication of complex numbers. The multiplication techniques of the present invention are used to enhance the efficiency of the filtering operation. Once the digital video stream is filtered, it is sent out to the display 125 for viewing. Based on the foregoing, the floating-point swap instructions may be used in a myriad of applications utilizing complex arithmetic for increasing efficiency of such applications.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention. Moreover, it is to be understood that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art.
Claims (20)
1. A method comprising the computer implemented acts of:
decoding a single instruction;
in response to decoding said single instruction,
moving a first operand occupying lower order bits of a first storage area to higher order bits of a result;
moving a second operand occupying higher order bits of a second storage area to lower order bits of the result; and
negating one of the first and second operands of the result.
2. The method of claim 1 wherein said negating includes:
negating the first operand of the result.
3. The method of claim 1 wherein said negating includes:
negating the second operand of the result.
4. The method of claim 1 wherein the first and second storage areas are the same storage area.
5. The method of claim 1 further comprising:
storing the result in one of the first and second storage areas also in response to decoding said instruction.
6. The method of claim 1 wherein the first and second storage areas are registers.
7. The method of claim 1 wherein the first and second storage areas are memory locations.
8. A processor, comprising:
a decoder to decode instructions; and
a circuit coupled to said decoder, said circuit in response to a single decoded instruction to,
move a first operand occupying lower order bits of a first storage area to higher order bits of a result,
move a second operand occupying higher order bits of a second storage area to lower order bits of the result, and
negate one of the first and second operands of the result.
9. The processor of claim 8 wherein the first operand of the result is negated.
10. The processor of claim 8 wherein the second operand of the result is negated.
11. The processor of claim 8 wherein the first and second storage areas are the same storage area.
12. The processor of claim 8 wherein said circuit in response to a single decoded instruction to also store the result in one of the first and second storage areas.
13. The processor of claim 8 wherein the first and second storage areas are registers.
14. A method of multiplying a first floating-point complex number with a second floating-point complex number where each floating-point complex number includes a real component and an imaginary component, the method comprising the computer implemented acts of:
packing the real component of the first complex number into first and second data elements of a first result;
packing the imaginary component of the first complex number into first and second data elements of a second result;
swapping the real and imaginary components of the second complex number to form a third result;
negating the imaginary component of the third result;
multiplying the first data element of the second result with the negated imaginary component of the third result to form a first data element of a fourth result, and the second data element of the second result with the real component of the third result to form a second data element of the fourth result; and
multiplying the first data element of the first result with the real component of the second complex number and adding it to the first data element of the fourth result to form a first data element of a final result, and the second data element of the first result with the imaginary component of the second complex number and adding it to the second data element of the fourth result to form a second data element of the final result.
15. A processor, comprising:
a decoder to decode instructions; and
a circuit coupled to said decoder, said circuit in response to one or more decoded instructions to,
pack a real component of a first complex number into first and second data elements of a first result,
pack an imaginary component of the first complex number into first and second data elements of a second result,
swap the real and imaginary components of the second complex number to form a third result;
negate the imaginary component of the third result;
multiply the first data element of the second result with the negated imaginary component of the third result to form a first data element of a fourth result, and the second data element of the second result with the real component of the third result to form a second data element of the fourth result; and
multiply the first data element of the first result with the real component of the second complex number and adding it to the first data element of the fourth result to form a first data element of a final result, and the second data element of the first result with the imaginary component of the second complex number and adding it to the second data element of the fourth result to form a second data element of the final result.
16. The processor of claim 15 wherein the first through fourth result and the final result are stored in registers.
17. A computer system, comprising:
a machine readable medium storing one or more instructions; and
a processor coupled to said machine readable medium, said processor in response to said one or more decoded instructions to,
pack a real component of a first complex number into first and second data elements of a first result,
pack an imaginary component of the first complex number into first and second data elements of a second result,
swap the real and imaginary components of the second complex number to form a third result;
negate the imaginary component of the third result;
multiply the first data element of the second result with the negated imaginary component of the third result to form a first data element of a fourth result, and the second data element of the second result with the real component of the third result to form a second data element of the fourth result; and
multiply the first data element of the first result with the real component of the second complex number and adding it to the first data element of the fourth result to form a first data element of a final result, and the second data element of the first result with the imaginary component of the second complex number and adding it to the second data element of the fourth result to form a second data element of the final result.
18. The computer system of claim 17 wherein the machine-readable medium comprises a volatile memory.
19. The computer system of claim 17 wherein the machine-readable medium comprises a disk.
20. The computer system of claim 17 further comprising an analog-to-digital converter coupled to the processor by way of a bus to provide the first and second complex numbers.
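For illustration only, the data movement recited in claims 1 to 4 can be modelled at the bit level. The sketch below is not part of the claims and rests on assumptions not stated in them: each storage area is taken to be 64 bits wide, each operand is taken to be a 32-bit IEEE 754 single-precision value, and negation is modelled by flipping the sign bit. The function and parameter names are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF   # selects one 32-bit operand
SIGN32 = 0x80000000   # sign bit of an IEEE 754 single-precision value

def f32_bits(value):
    """Return the 32-bit pattern of a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def bits_f32(bits):
    """Return the float encoded by a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def pack64(high, low):
    """Build a 64-bit storage area from two floats (hypothetical layout)."""
    return (f32_bits(high) << 32) | f32_bits(low)

def unpack64(reg):
    """Return (high, low) floats held in a 64-bit storage area."""
    return (bits_f32(reg >> 32), bits_f32(reg & MASK32))

def swap_negate(src1, src2, negate_high):
    """Model of the single-instruction behaviour recited in claim 1."""
    high = src1 & MASK32           # lower order bits of first area -> higher order bits
    low = (src2 >> 32) & MASK32    # higher order bits of second area -> lower order bits
    if negate_high:
        high ^= SIGN32             # negate the first operand of the result (claim 2)
    else:
        low ^= SIGN32              # negate the second operand of the result (claim 3)
    return (high << 32) | low
```

Passing the same value as both sources corresponds to the case of claim 4, in which the first and second storage areas are the same storage area. For example, a storage area holding (3.0, 4.0) yields (-4.0, 3.0) when `negate_high` is true and (4.0, -3.0) otherwise.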
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title

US09/170,473 US6272512B1 (en)  1998-10-12  1998-10-12  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
Applications Claiming Priority (2)
Application Number  Priority Date  Filing Date  Title

US09/170,473 US6272512B1 (en)  1998-10-12  1998-10-12  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
US09/874,865 US6502117B2 (en)  1998-10-12  2001-06-04  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
Related Child Applications (1)
Application Number  Priority Date  Filing Date  Title

US09/874,865 Continuation US6502117B2 (en)  1998-10-12  2001-06-04  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
Publications (1)
Publication Number  Publication Date

US6272512B1  2001-08-07
Family
ID=22619993
Family Applications (2)
Application Number  Priority Date  Filing Date  Title

US09/170,473 Expired - Lifetime US6272512B1 (en)  1998-10-12  1998-10-12  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
US09/874,865 Expired - Lifetime US6502117B2 (en)  1998-10-12  2001-06-04  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
Family Applications After (1)
Application Number  Priority Date  Filing Date  Title

US09/874,865 Expired - Lifetime US6502117B2 (en)  1998-10-12  2001-06-04  Data manipulation instruction for enhancing value and efficiency of complex arithmetic
Country Status (1)
Country  Link

US (2)  US6272512B1 (en)
Citations (4)
Publication number  Priority date  Publication date  Assignee  Title

US5473557A (en) *  1994-06-09  1995-12-05  Motorola, Inc.  Complex arithmetic processor and method
US5859997A (en) *  1995-08-31  1999-01-12  Intel Corporation  Method for performing multiply-substrate operations on packed data
US5936872A (en) *  1995-09-05  1999-08-10  Intel Corporation  Method and apparatus for storing complex numbers to allow for efficient complex multiplication operations and performing such complex multiplication operations
US5953241A (en) *  1995-08-16  1999-09-14  Microunity Engeering Systems, Inc.  Multiplier array processing system with enhanced utilization at lower precision for group multiply and sum instruction
Family Cites Families (3)
Publication number  Priority date  Publication date  Assignee  Title

US4161784A (en)  1978-01-05  1979-07-17  Honeywell Information Systems, Inc.  Microprogrammable floating point arithmetic unit capable of performing arithmetic operations on long and short operands
WO1996017293A1 (en)  1994-12-01  1996-06-06  Intel Corporation  A microprocessor having a multiply operation
US5634118A (en)  1995-04-10  1997-05-27  Exponential Technology, Inc.  Splitting a floating-point stack-exchange instruction for merging into surrounding instructions by operand translation
Also Published As
Publication number  Publication date

US20020004809A1 (en)  2002-01-10
US6502117B2 (en)  2002-12-31
Legal Events
Date  Code  Title  Description 

AS  Assignment 
Owner name: INTEL CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GOLLIVER, ROGER A.;DULONG, CAROLE;REEL/FRAME:009635/0413;SIGNING DATES FROM 19981207 TO 19981208 

STCF  Information on status: patent grant 
Free format text: PATENTED CASE 

FPAY  Fee payment 
Year of fee payment: 4 

FPAY  Fee payment 
Year of fee payment: 8 

FPAY  Fee payment 
Year of fee payment: 12 

SULP  Surcharge for late payment 
Year of fee payment: 11 