CN102707922B

CN102707922B - Apparatus for controlling bit correction of shifted packed data

Info

Publication number: CN102707922B
Application number: CN201210059426.5A
Authority: CN
Inventors: A. D. Peleg; Y. Yaari; M. Mittal; L. M. Mennemeier; B. Eitan; A. F. Glew; C. Dulong; E. Kowashi; W. Witt
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 1995-08-31
Filing date: 1996-07-17
Publication date: 2015-10-07
Anticipated expiration: 2016-07-17
Also published as: CN103092563B; CN101930352A; EP0847551A1; CN101794212A; CN102073475B; CN103383639A; CN1892589B; CN103064650A; TW310406B; CN1264085C; CN103455304B; WO1997008608A1; CN101794213A; CN103455304A; JPH11511575A; CN103092562B; BR9612911B1; CN103064652B; CN1515994A; CN1200822A

Abstract

The title of the present invention is "Apparatus for controlling bit correction of shifted packed data". The invention relates to an apparatus for controlling bit correction of shifted packed data, and provides an apparatus that adds to a processor a set of instructions supporting the operations on packed data required by typical multimedia applications. In one embodiment, the invention includes a processor having a storage area (150), a decoder (165), and a plurality of circuits (130). The plurality of circuits provide for the execution of a number of instructions that operate on packed data. In this embodiment, these instructions include pack, unpack, packed multiply, packed add, packed subtract, packed compare, and packed shift.

Description

Apparatus for controlling bit correction of shifted packed data

This application is a divisional of the patent application filed on July 17, 1996, with application number 201010623140.6, entitled "Apparatus for controlling bit correction of shifted packed data".

Technical field

The present invention relates to the field of computer systems. More specifically, the invention relates to the field of packed data operations.

Background technology

In typical computer systems, processors are implemented to operate on values represented by a large number of bits (e.g., 64) using instructions that produce one result. For example, executing an add instruction adds a first 64-bit value to a second 64-bit value and stores the result as a third 64-bit value. However, multimedia applications (e.g., applications aimed at computer-supported cooperation (CSC, the integration of teleconferencing with mixed-media data manipulation), 2D/3D graphics, image processing, video compression/decompression, recognition algorithms, and audio manipulation) require the manipulation of large amounts of data that can be represented in a small number of bits. For example, graphical data typically requires 8 or 16 bits, and sound data typically requires 8 or 16 bits. Each of these multimedia applications requires one or more algorithms, each requiring a number of operations. For example, an algorithm may require add, compare, and shift operations.

To improve the efficiency of multimedia applications (as well as other applications with the same characteristics), prior art processors provide packed data formats. In a packed data format, the bits typically used to represent a single value are divided into a number of fixed-size data elements, each of which represents a separate value. For example, a 64-bit register can be divided into two 32-bit elements, each of which represents a separate 32-bit value. In addition, these prior art processors provide instructions that operate on each element of these packed data types separately and in parallel. For example, a packed add instruction adds corresponding data elements from a first packed data and a second packed data. Thus, if a multimedia algorithm requires a loop containing five operations that must be performed on a large number of data elements, it is desirable to pack the data and perform these operations in parallel using packed data instructions. In this manner, these processors can process multimedia applications more efficiently.
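As a minimal sketch of the packed data format just described, the following Python models a 64-bit register as an integer and treats it as a row of fixed-width elements. The helper names (`split_elements`, `join_elements`, `packed_add`) and the choice of wraparound on overflow are illustrative assumptions, not part of the patent text.

```python
def split_elements(value, width):
    """Split a 64-bit value into fixed-width elements, lowest element first."""
    mask = (1 << width) - 1
    return [(value >> shift) & mask for shift in range(0, 64, width)]


def join_elements(elements, width):
    """Join fixed-width elements (lowest first) back into one 64-bit value."""
    mask = (1 << width) - 1
    value = 0
    for index, element in enumerate(elements):
        value |= (element & mask) << (index * width)
    return value


def packed_add(a, b, width):
    """Add corresponding elements; each sum is truncated to its own element."""
    sums = [x + y for x, y in zip(split_elements(a, width), split_elements(b, width))]
    return join_elements(sums, width)


# Two 32-bit elements held in one 64-bit value, as in the example above.
a = join_elements([5, 7], 32)
b = join_elements([10, 20], 32)
result = packed_add(a, b, 32)
print(split_elements(result, 32))  # [15, 27]
```

Each element is handled independently, so a single call stands in for what would otherwise be one scalar addition per element.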

However, if the loop of operations contains an operation that the processor cannot perform on packed data (i.e., the processor lacks the appropriate instruction), the data must be unpacked to perform that operation. For example, if a multimedia algorithm requires an add operation and the packed add instruction described above is not available, the programmer must unpack the first packed data and the second packed data (i.e., separate the elements making up the first packed data and the second packed data), add the separated elements individually, and then pack the results into a packed result for further packed processing. The processing time required to perform this packing and unpacking often negates the performance advantage for which packed data formats are provided. It is therefore desirable to include on a general-purpose processor a set of packed data instructions that provides all the operations required by typical multimedia algorithms. However, because of the limited die area on today's microprocessors, the number of instructions that can be added is limited.

One general-purpose processor that includes packed data instructions is the i860XP™ processor manufactured by Intel Corporation of Santa Clara, California. The i860XP processor includes several packed data types with different element sizes. In addition, the i860XP processor includes a packed add instruction and a packed compare instruction. However, the packed add instruction does not break the carry chain, so the programmer must ensure that the operations the software performs cannot cause an overflow, that is, an operation cannot cause bits from one element of the packed data to spill into the next element of that packed data. For example, if the value 1 is added to an 8-bit packed data element storing "11111111", an overflow occurs and the result is "100000000". In addition, the position of the radix point in the packed data types supported by the i860XP is fixed (the i860XP processor supports the number formats 8.8, 6.10, and 8.24, where a number i.j contains i most significant bits and j bits after the radix point). This limits the values the programmer can represent. Because the i860XP processor supports only these two instructions, it cannot perform many of the operations required by multimedia algorithms that use packed data.
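The overflow example above can be reproduced with a short Python sketch. It is only an illustration of the carry-chain issue, not a model of the i860XP itself; the function names and the 8-bit element width are assumptions chosen to match the "11111111" example.

```python
MASK64 = (1 << 64) - 1


def add_unbroken_carry(a, b):
    """Plain 64-bit add: a carry out of one element flows into the next."""
    return (a + b) & MASK64


def add_broken_carry(a, b, width=8):
    """Element-wise add: the carry out of each element is discarded."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (total & mask) << shift
    return result


packed = 0b11111111          # lowest 8-bit element is all ones, the rest are zero
print(bin(add_unbroken_carry(packed, 1)))  # 0b100000000: bit 8 belongs to the next element
print(bin(add_broken_carry(packed, 1)))    # 0b0: the neighbouring element is untouched
```

With an unbroken carry chain the ninth bit lands in the neighbouring element, which is exactly the corruption the programmer has to guard against.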

Another general-purpose processor that supports packed data is the MC88110™ processor manufactured by Motorola. The MC88110 processor supports several different packed data formats with elements of different sizes. In addition, the set of packed instructions supported by the MC88110 processor includes pack, unpack, packed add, packed subtract, packed multiply, packed compare, and packed rotate.

The MC88110 processor pack instruction operates by concatenating the (t*r)/64 most significant bits of each element in a first register pair (where t is the number of bits in an element of the packed data) to generate a field of width r. This field replaces the most significant bits of the packed data stored in a second register pair. The resulting packed data is then stored in a third register pair and rotated left by r bits. The supported values of t and r, and an example of the operation of this instruction, are shown in Tables 1 and 2 below.

X = undefined operation

Table 1

Table 2

This implementation of the pack instruction has two disadvantages. The first is that additional logic is needed to perform the rotation at the end of the instruction. The second is the number of instructions needed to generate a packed data result. For example, if four 32-bit values are to be used to generate the result in the third register (shown above), two instructions with t=32 and r=32 are needed, as shown in Table 3 below.

Table 3

The MC88110 processor unpack instruction operates by placing 4-, 8-, or 16-bit data elements from a packed data into the low-order half of data elements twice as long (8, 16, or 32 bits) and filling with zeros, that is, the high-order bits of the resulting data elements are set to zero. An example of the operation of this unpack instruction is shown in Table 4 below.
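The zero-fill behaviour described above can be sketched in Python as follows. This is a hedged illustration of the general idea only: the function name, the list-based input, and the element ordering (lowest element first) are assumptions, and the sketch does not reproduce the exact operand layout of the MC88110 instruction.

```python
def zero_extend_unpack(elements, width):
    """Place each width-bit element in the low half of a 2*width-bit element.

    The high half of every resulting element is left as zero.
    """
    mask = (1 << width) - 1
    result = 0
    for index, element in enumerate(elements):
        result |= (element & mask) << (index * 2 * width)
    return result


# Four 8-bit elements become four 16-bit elements with zeroed high bytes.
unpacked = zero_extend_unpack([0xAB, 0xCD, 0xEF, 0x12], 8)
print(hex(unpacked))  # 0x1200ef00cd00ab
```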

Table 4

The MC88110 processor packed multiply instruction multiplies each element of a 64-bit packed data by a 32-bit value, as if the packed data represented a single value, as shown in Table 5 below.

Table 5

This multiply instruction has two disadvantages. First, it does not break the carry chain, so the programmer must ensure that the operations performed on the packed data do not cause an overflow. As a result, the programmer sometimes has to add extra instructions to prevent such an overflow. Second, this multiply instruction multiplies every element of the packed data by a single value (i.e., the 32-bit value). As a result, the user has no flexibility to choose which elements of the packed data are multiplied by the 32-bit value. The programmer must therefore either prepare the data so that every element of the packed data requires the same multiplication, or waste processing time unpacking the data whenever fewer than all of the elements need to be multiplied. Consequently, the programmer cannot perform multiple multiplications in parallel using multiple multipliers. For example, multiplying eight different pieces of data, each one word long, requires four separate multiply operations. Each operation multiplies two words at a time, which in effect wastes the data lines and circuitry used for the bits above bit 16.

The MC88110 processor packed compare instruction compares corresponding 32-bit data elements from a first packed data and a second packed data. Each of the two comparisons may return either less than (<) or greater than or equal to (>=), giving four possible combinations. The instruction returns an 8-bit result string: four bits indicate which of the four possible conditions is met, and four bits indicate the complement of those bits. A conditional branch on the result of this instruction can be implemented in two ways: 1) with a sequence of conditional branches; or 2) with a jump table. The problem with this instruction is that it requires conditional branches based on data to perform functions such as: if Y > A then X = X + B else X = X. The compiled pseudocode representation of this function would be:

New microprocessors try to speed up execution by speculatively predicting where a branch will go. If the prediction is correct, no performance is lost and there is potential for a performance gain. If the prediction is wrong, however, performance is lost. The incentive to predict well is therefore great. However, branches based on data (such as the one above) behave in an unpredictable manner, which defeats the prediction algorithms and results in more mispredictions. As a result, using this compare instruction to set up conditional branches based on data comes at a high cost in performance.

The MC88110 processor rotate instruction rotates a 64-bit value to any modulo-4 boundary between 0 and 60 (see the example in Table 6 below).
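To make the cost of data-dependent branching concrete, here is a small Python sketch of the function "if Y > A then X = X + B else X = X" applied element by element. It is not the MC88110 listing; the function name and the `outcomes` list, which records the direction of each branch, are illustrative assumptions.

```python
def conditional_add_branching(xs, ys, a, b):
    """Apply 'if Y > A then X = X + B else X = X' one element at a time."""
    results = []
    outcomes = []  # direction taken by the data-dependent branch, per element
    for x, y in zip(xs, ys):
        taken = y > a
        outcomes.append(taken)
        if taken:
            results.append(x + b)
        else:
            results.append(x)
    return results, outcomes


results, outcomes = conditional_add_branching([1, 2, 3, 4], [9, 0, 7, 1], a=5, b=10)
print(results)   # [11, 2, 13, 4]
print(outcomes)  # [True, False, True, False]
```

The branch direction follows the values in `ys` rather than any pattern in the program, which is why a predictor has little to work with here.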

Table 6

Because the rotate instruction shifts the high-order bits that leave the register into the low-order bits of the register, the MC88110 processor does not support individually shifting each element of a packed data. As a result, programming algorithms that require each element of a packed data type to be shifted individually need to: 1) unpack the data, 2) perform the shift on each element individually, and 3) pack the results into a result packed data for further packed data processing.
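The difference between rotating a whole register and shifting each element separately can be sketched as follows. The function names and the 16-bit element width are assumptions made for illustration; the rotate is a generic 64-bit rotate, not a model of the MC88110 encoding.

```python
MASK64 = (1 << 64) - 1


def rotate_left64(value, count):
    """Rotate a 64-bit value left; bits leaving the top re-enter at the bottom."""
    count %= 64
    return ((value << count) | (value >> (64 - count))) & MASK64


def shift_left_per_element(value, count, width):
    """Shift every width-bit element left; bits leaving an element are dropped."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        element = (value >> shift) & mask
        result |= ((element << count) & mask) << shift
    return result


value = 0xF000F000F000F000   # four 16-bit elements, each 0xF000
print(hex(rotate_left64(value, 4)))               # 0xf000f000f000f
print(hex(shift_left_per_element(value, 4, 16)))  # 0x0
```

After the rotate, the top nibble of each element has moved into its neighbour (and the topmost one has wrapped around), whereas the per-element shift simply discards those bits.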

Summary of the invention

The invention describes a method and apparatus for including in a processor a set of packed data instructions that supports the operations required by typical multimedia applications. In one embodiment, the invention includes a processor and a storage area. The storage area contains a number of instructions for execution by the processor to operate on packed data. In this embodiment, these instructions include pack, unpack, packed add, packed subtract, packed multiply, packed shift, and packed compare.

In response to receiving a pack instruction, the processor packs a portion of the bits from the data elements of at least two packed data to form a third packed data. In comparison, in response to receiving an unpack instruction, the processor generates a fourth packed data containing at least one data element from a first packed data operand and at least one corresponding data element from a second packed data operand.
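One possible reading of the pack and unpack operations summarized above is sketched below in Python. Several details are assumptions made for the sake of a runnable example and are not stated in this passage: the element widths (16-bit words packed to 8-bit bytes), keeping the low byte of each word by plain truncation, placing the first operand in the low half of the result, and interleaving only the low four bytes in the unpack. All function names are hypothetical.

```python
def split(value, width):
    """Split a 64-bit value into width-bit elements, lowest element first."""
    mask = (1 << width) - 1
    return [(value >> shift) & mask for shift in range(0, 64, width)]


def join(elements, width):
    """Join width-bit elements (lowest first) into one integer."""
    mask = (1 << width) - 1
    value = 0
    for index, element in enumerate(elements):
        value |= (element & mask) << (index * width)
    return value


def pack_words_to_bytes(src1, src2):
    """Keep the low byte of every 16-bit word; src1 fills the low half."""
    words = split(src1, 16) + split(src2, 16)
    return join([word & 0xFF for word in words], 8)


def unpack_low_bytes(src1, src2):
    """Interleave the four low bytes of src1 with the matching bytes of src2."""
    interleaved = []
    for x, y in zip(split(src1, 8)[:4], split(src2, 8)[:4]):
        interleaved += [x, y]
    return join(interleaved, 8)


packed = pack_words_to_bytes(join([0x0101, 0x0202, 0x0303, 0x0404], 16),
                             join([0x0505, 0x0606, 0x0707, 0x0808], 16))
print(hex(packed))  # 0x807060504030201

a = join([0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7], 8)
b = join([0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7], 8)
print(hex(unpack_low_bytes(a, b)))  # 0xb3a3b2a2b1a1b0a0
```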

In response to receiving a packed add instruction, the processor separately adds together, in parallel, corresponding data elements from at least two packed data. In comparison, in response to receiving a packed subtract instruction, the processor separately subtracts, in parallel, corresponding data elements from at least two packed data.

In response to receiving a packed multiply instruction, the processor separately multiplies together, in parallel, corresponding data elements from at least two packed data.

In response to receiving a packed shift instruction, the processor separately shifts, in parallel, each data element in a packed data operand by an indicated count.

In response to receiving a packed compare instruction, the processor separately compares, in parallel, corresponding data elements from at least two packed data according to an indicated relationship, and stores a packed mask in a first register as the result. The packed mask contains at least a first mask element and a second mask element. Each bit in the first mask element represents the result of comparing one set of corresponding data elements, and each bit in the second mask element represents the result of comparing a second set of data elements.
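A minimal sketch of an element-wise packed multiply follows. The passage above does not say which part of each product is kept, so the sketch assumes 16-bit elements and keeps only the low 16 bits of each product; both the width and the truncation, like the function name, are assumptions.

```python
def packed_multiply_low(a, b, width=16):
    """Multiply corresponding elements, keeping the low bits of each product."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        product = ((a >> shift) & mask) * ((b >> shift) & mask)
        result |= (product & mask) << shift  # assumption: high bits are dropped
    return result


a = 0x0005000400030002   # elements 2, 3, 4, 5 (lowest first)
b = 0x0028001E0014000A   # elements 10, 20, 30, 40 (lowest first)
print(hex(packed_multiply_low(a, b)))  # 0xc80078003c0014
```

Unlike a multiply of every element by one shared value, each element here has its own multiplier, so four independent products come out of one operation.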
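The packed mask described above can be used to express "if Y > A then X = X + B else X = X" without a data-dependent branch in the instruction stream. The Python below is a sketch under stated assumptions: 16-bit elements, an unsigned greater-than relationship, and hypothetical function names. The `if` inside `packed_compare_gt` only models what comparison hardware would do per element.

```python
def packed_compare_gt(a, b, width):
    """Return a mask whose elements are all ones where a's element > b's."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        if ((a >> shift) & mask) > ((b >> shift) & mask):
            result |= mask << shift
    return result


def packed_add(a, b, width):
    """Add corresponding elements, discarding the carry out of each element."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (total & mask) << shift
    return result


y = 0x0001000900030007
a = 0x0005000500050005
x = 0x0001000100010001
b = 0x000A000A000A000A

mask = packed_compare_gt(y, a, 16)
x_new = packed_add(x, b & mask, 16)   # B is added only where the mask is set
print(hex(mask))   # 0xffff0000ffff
print(hex(x_new))  # 0x1000b0001000b
```

ANDing `b` with the mask zeroes the addend in every element where the condition failed, so one add handles both arms of the conditional.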

Brief description of the drawings

The present invention is illustrated by way of example, and not limitation, in the accompanying drawings. Like references indicate like elements.

Fig. 1 illustrates an exemplary computer system according to one embodiment of the invention.

Fig. 2 illustrates the register file of the processor according to one embodiment of the invention.

Fig. 3 is a flow diagram illustrating the general steps used by the processor to process data according to one embodiment of the invention.

Fig. 4 illustrates the packed data types according to one embodiment of the invention.

Fig. 5a illustrates an in-register packed data representation according to one embodiment of the invention.

Fig. 5b illustrates an in-register packed data representation according to one embodiment of the invention.

Fig. 5c illustrates an in-register packed data representation according to one embodiment of the invention.

Fig. 6a illustrates a control signal format for indicating the use of packed data according to one embodiment of the invention.

Fig. 6b illustrates a second control signal format for indicating the use of packed data according to one embodiment of the invention.

Packed add/subtract

Fig. 7a illustrates a method for performing packed addition according to one embodiment of the invention.

Fig. 7b illustrates a method for performing packed subtraction according to one embodiment of the invention.

Fig. 8 illustrates a circuit for performing packed addition and packed subtraction on individual bits of packed data according to one embodiment of the invention.

Fig. 9 illustrates a circuit for performing packed addition and packed subtraction on packed byte data according to one embodiment of the invention.

Fig. 10 is a logical view of a circuit for performing packed addition and packed subtraction on packed word data according to one embodiment of the invention.

Fig. 11 is a logical view of a circuit for performing packed addition and packed subtraction on packed doubleword data according to one embodiment of the invention.

Packed multiply

Fig. 12 is a flow diagram illustrating a method for performing packed multiply operations on packed data according to one embodiment of the invention.

Fig. 13 illustrates a circuit for performing packed multiplication according to one embodiment of the invention.

Multiply-add/subtract

Fig. 14 is a flow diagram illustrating a method for performing multiply-add and multiply-subtract operations on packed data according to one embodiment of the invention.

Fig. 15 illustrates a circuit for performing multiply-add and/or multiply-subtract operations on packed data according to one embodiment of the invention.

Packed shift

Fig. 16 is a flow diagram illustrating a method for performing a packed shift operation on packed data according to one embodiment of the invention.

Fig. 17 illustrates a circuit for performing a packed shift on individual bytes of packed data according to one embodiment of the invention.

Pack

Fig. 18 is a flow diagram illustrating a method for performing pack operations on packed data according to one embodiment of the invention.

Fig. 19a illustrates a circuit for performing pack operations on packed byte data according to one embodiment of the invention.

Fig. 19b illustrates a circuit for performing pack operations on packed word data according to one embodiment of the invention.

Unpack

Fig. 20 is a flow diagram illustrating a method for performing unpack operations on packed data according to one embodiment of the invention.

Fig. 21 illustrates a circuit for performing unpack operations on packed data according to one embodiment of the invention.

Population count

Fig. 22 is a flow diagram illustrating a method for performing a population count operation on packed data according to one embodiment of the invention.

Fig. 23 is a flow diagram illustrating a method for performing a population count operation on one data element of a packed data and generating a single result data element for a result packed data according to one embodiment of the invention.

Fig. 24 illustrates a circuit for performing a population count operation on a packed data having four word data elements according to one embodiment of the invention.

Fig. 25 illustrates a detailed circuit for performing a population count operation on one word data element of a packed data according to one embodiment of the invention.

Packed logical operations

Fig. 26 is a flow diagram illustrating a method for performing a number of logical operations on packed data according to one embodiment of the invention.

Fig. 27 illustrates a circuit for performing logical operations on packed data according to one embodiment of the invention.

Packed compare

Fig. 28 is a flow diagram illustrating a method for performing packed compare operations on packed data according to one embodiment of the invention.

Fig. 29 illustrates a circuit for performing packed compare operations on individual bytes of packed data according to one embodiment of the invention.

Detailed description

This application describes a method and apparatus for including in a processor a set of instructions that supports the operations on packed data required by typical multimedia applications. In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, it is understood that the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the invention unnecessarily.

Definitions

To provide a foundation for understanding the description of the embodiments of the invention, the following definitions are provided.

Bit X through bit Y:

Defines a subfield of a binary number. For example, bit 6 through bit 0 of the byte 00111010₂ (shown in base 2) represent the subfield 111010₂. The "2" following a binary number indicates base 2. Therefore, 1000₂ equals 8₁₀, and F₁₆ equals 15₁₀.

Rₓ: is a register. A register is any device capable of storing and providing data. Further functionality of a register is described below. A register is not necessarily part of the processor's package.

SRC1, SRC2, and DEST:

Identify storage areas (e.g., memory addresses, registers, etc.).

Source1-i and Result1-i: represent data.
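The "bit X through bit Y" notation and the base subscripts used in these definitions can be checked with a few lines of Python. The helper name `bit_field` is a hypothetical one introduced only for this illustration.

```python
def bit_field(value, high, low):
    """Return bits high..low (inclusive) of value, with bit 0 the least significant."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Bit 6 through bit 0 of the byte 00111010 (base 2); the leading zero is not shown.
print(bin(bit_field(0b00111010, 6, 0)))  # 0b111010

# The base subscripts in the definitions: 1000 in base 2 and F in base 16.
print(int("1000", 2))  # 8
print(int("F", 16))    # 15
```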

Computer system

Fig. 1 illustrates an exemplary computer system 100 according to one embodiment of the invention. Computer system 100 includes a bus 101, or other communication hardware and software, for communicating information, and a processor 109 coupled with bus 101 for processing information. Processor 109 represents a central processing unit of any type of architecture, including a CISC (complex instruction set computing) or RISC (reduced instruction set computing) type architecture. Computer system 100 further includes a random access memory (RAM) or other dynamic storage device (referred to as main memory 104), coupled to bus 101 for storing information and instructions to be executed by processor 109. Main memory 104 may also be used for storing temporary variables or other intermediate information while processor 109 executes instructions. Computer system 100 also includes a read-only memory (ROM) 106 and/or other static storage device, coupled to bus 101 for storing static information and instructions for processor 109. A data storage device 107 is coupled to bus 101 for storing information and instructions.

Fig. 1 also illustrates that processor 109 includes an execution unit 130, a register file 150, a cache memory 160, a decoder 165, and an internal bus 170. Processor 109 of course contains additional circuitry, which is not shown in order not to obscure the invention.

Execution unit 130 is used for executing instructions received by processor 109. In addition to recognizing the instructions typically implemented in general-purpose processors, execution unit 130 recognizes the instructions in packed instruction set 140, which perform operations on packed data formats. In one embodiment, packed instruction set 140 includes instructions that support, in the manner described later herein, pack operations, unpack operations, packed add operations, packed subtract operations, packed multiply operations, packed shift operations, packed compare operations, multiply-add operations, multiply-subtract operations, population count operations, and a set of packed logical operations (including packed AND, packed ANDNOT, packed OR, and packed XOR). While one embodiment is described in which packed instruction set 140 includes these instructions, alternative embodiments may contain a subset or a superset of these instructions.

By including these instructions, many of the operations required by the algorithms used in multimedia applications can be performed on packed data. Thus, these algorithms can be written to pack the necessary data and perform the necessary operations on the packed data, without having to unpack the packed data to perform one or more operations one data element at a time. As described above, this provides a performance advantage over prior art general-purpose processors that do not support the packed data operations required by certain multimedia algorithms (i.e., if a multimedia algorithm requires an operation that cannot be performed on packed data, the program must unpack the data, perform the operation on the separated elements individually, and then pack the results into a packed result for further packed processing). In addition, the disclosed manner in which several of these instructions are performed improves the performance of many multimedia applications.

Execution unit 130 is coupled to register file 150 by internal bus 170. Register file 150 represents a storage area on processor 109 for storing information, including data. It is understood that one aspect of the invention is the described instruction set for operating on packed data. According to this aspect of the invention, the storage area used for storing the packed data is not critical. However, one embodiment of register file 150 is described later with reference to Fig. 2. Execution unit 130 is coupled to cache memory 160 and decoder 165. Cache memory 160 is used to cache data and/or control signals from, for example, main memory 104. Decoder 165 is used for decoding instructions received by processor 109 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 130 performs the appropriate operations. For example, if an add instruction is received, decoder 165 causes execution unit 130 to perform the required addition; if a subtract instruction is received, decoder 165 causes execution unit 130 to perform the required subtraction; and so on. Decoder 165 may be implemented using any number of different mechanisms (e.g., a look-up table, a hardware implementation, a PLA, etc.). Thus, while the execution of the various instructions by the decoder and execution unit is represented by a series of if/then statements, it is understood that the execution of an instruction does not require serial processing of these if/then statements. Rather, any mechanism for logically performing this if/then processing is considered to be within the scope of the invention.
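The point that decoding need not be a serial chain of if/then tests can be illustrated with a look-up table, one of the mechanisms the text mentions. The sketch below is a loose software analogy only: the opcode strings, the `DISPATCH` table, and the `execute` function are hypothetical and do not represent decoder 165 or any real instruction encoding.

```python
import operator

MASK64 = (1 << 64) - 1

# Hypothetical table mapping an opcode to the operation the execution unit performs.
DISPATCH = {
    "add": operator.add,
    "sub": operator.sub,
}


def execute(opcode, source1, source2):
    """Select the operation by table look-up instead of a chain of if/then tests."""
    operation = DISPATCH[opcode]
    return operation(source1, source2) & MASK64   # result wraps to 64 bits


print(execute("add", 2, 3))       # 5
print(hex(execute("sub", 3, 5)))  # 0xfffffffffffffffe
```

Adding a further instruction means adding a table entry, and the cost of selecting an operation does not grow with the number of entries tested before it.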

Fig. 1 also shows data storage device 107, such as a magnetic disk or optical disk, and its corresponding disk drive. Computer system 100 can also be coupled via bus 101 to a display device 121 for displaying information to a computer user. Display device 121 can include a frame buffer, a specialized graphics rendering device, a cathode ray tube (CRT), and/or a flat-panel display. An alphanumeric input device 122, including alphanumeric and other keys, is typically coupled to bus 101 for communicating information and command selections to processor 109. Another type of user input device is cursor control 123, such as a mouse, a trackball, a pen, a touch screen, or cursor direction keys, for communicating direction information and command selections to processor 109 and for controlling cursor movement on display device 121. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allow the device to specify positions in a plane. However, the invention is not limited to input devices with only two degrees of freedom.

Another device that may be coupled to bus 101 is a hard copy device 124, which may be used for printing instructions, data, or other information on a medium such as paper, film, or similar types of media. In addition, computer system 100 can be coupled to a device 125 for sound recording and/or playback, such as an audio digitizer coupled to a microphone for recording information. The device may further include a speaker coupled to a digital-to-analog (D/A) converter for playing back the digitized sounds.

Computer system 100 can also be a terminal in a computer network (e.g., a LAN). Computer system 100 would then be a computer subsystem of the computer network. Computer system 100 optionally includes a video digitizing device 126. Video digitizing device 126 can be used to capture video images that can be transmitted to other computers on the computer network.

In one embodiment, processor 109 additionally supports an instruction set compatible with the x86 instruction set (the instruction set used by existing microprocessors, such as the processors manufactured by Intel Corporation of Santa Clara, California). Thus, in one embodiment, processor 109 supports all the operations supported by the IA™ (Intel Architecture), as defined by Intel Corporation of Santa Clara, California (see "Microprocessors", Intel Data Books volume 1 and volume 2, 1992 and 1993, available from Intel of Santa Clara, California). As a result, processor 109 can support existing x86 operations in addition to the operations of the invention. While the invention is described as being incorporated into an x86-based instruction set, alternative embodiments could incorporate the invention into other instruction sets. For example, the invention could be incorporated into a 64-bit processor using a new instruction set.

Fig. 2 illustrates the register file according to the processor of one embodiment of the present of invention.Register file 150 is used for storage information, comprises control/status information, integer data, floating data and integrated data.In the embodiment shown in Figure 2, register file 150 comprises integer registers 201, register 209, status register 208 and instruction pointer register 211.The state of status register 208 instruction processorunit 109.Instruction pointer register 211 stores the address of next instruction that will perform.Integer registers 201, register 209, status register 208 and instruction pointer register 211 are all coupling on internal bus 170.Any additional register is also coupling on internal bus 170.

In one embodiment, register 209 is not only for integrated data but also for floating data.In this embodiment, the flating point register that all register 209 must be located as stack at any given time of processor 109 or treat as the integrated data register that non-stack is located.In the present embodiment, switch between the register 209 including the integrated data register that a kind of mechanism allows processor 109 to locate at the flating point register of locating as stack and non-stack operates.In another embodiment, processor 109 can operate on the register 209 of the floating-point of locating as non-stack and integrated data register simultaneously.As another example, in another embodiment, these identical registers can be used to store integer data.

Certainly, the embodiment that can realize substituting comprises Parasites Fauna more or less.Such as, alternate embodiment can comprise independently flating point register group for storing floating data.As another example, alternate embodiment can comprise first group of register, respectively for storing control/status information, and second group of register, respectively can store integer, floating-point and integrated data.For the sake of clarity, the meaning of the register of an embodiment should be limited on the circuit of particular type.But the register of an embodiment only needs to store and provides data, and performs function as described herein.

Various Parasites Fauna (such as integer registers 201, register 209) can be embodied as the register comprising different number destination register and/or different size.Such as, in one embodiment, integer registers 201 is embodied as storage 32, and register 209 is embodied as storage 80 (whole 80 are used for storing floating data and only storing integrated data with 64).In addition, register 209 comprises 8 registers, R ₀212a to R ₇212h.R ₁212a, R ₂212b and R ₃212c is the example of each register in register 209.32 of in register 209 one register can be displaced in an integer registers in integer registers 201.Similarly, the value in integer registers can be moved in 32, a register in register 209.In another embodiment, integer registers 201 respectively comprises 64, and 64 bit data can transmit between integer registers 201 and register 209.

Fig. 3 is a flow chart illustrating the general steps used by a processor to process data according to one embodiment of the present invention. Such operations include, for example, a load operation that loads data from cache memory 160, main memory 104, read-only memory (ROM) 106 or data storage device 107 into a register in register file 150.

In step 301, decoder 202 receives control signal 207 from either cache memory 160 or bus 101. Decoder 202 decodes the control signal to determine the operation to be performed.

In step 302, decoder 202 accesses register file 150 or a location in memory. Registers in register file 150, or memory locations in memory, are accessed depending on the register address specified in control signal 207. For example, for an operation on packed data, control signal 207 can include SRC1, SRC2 and DEST register addresses. SRC1 is the address of the first source register. SRC2 is the address of the second source register. Because not all operations require two source addresses, the SRC2 address is optional in some cases. If an operation does not require the SRC2 address, only the SRC1 address is used. DEST is the address of the destination register in which the result data is stored. In one embodiment, SRC1 or SRC2 is also used as DEST. SRC1, SRC2 and DEST are described more fully in relation to Fig. 6a and Fig. 6b. The data stored in the corresponding registers is referred to as Source1, Source2 and Result, respectively. Each of these data is 64 bits long.

In another embodiment of the invention, any one, or all, of SRC1, SRC2 and DEST can define a memory location in the addressable memory space of processor 109. For example, SRC1 can identify a memory location in main memory 104, while SRC2 identifies a first register in integer registers 201 and DEST identifies a second register in registers 209. To simplify the description herein, the invention is described in relation to accesses to register file 150. However, these accesses could be made to memory instead.

In step 303, execution unit 130 is enabled to perform the operation on the accessed data. In step 304, the result is stored back into register file 150 according to the requirements of control signal 207.

Data and storage formats

Fig. 4 illustrates packed data types according to one embodiment of the present invention. Three packed data formats are shown: packed byte 401, packed word 402 and packed doubleword 403. In one embodiment of the invention, a packed byte is 64 bits long and contains eight data elements. Each data element is one byte long. In general, a data element is an individual piece of data that is stored in a single register (or memory location) together with other data elements of the same length. In one embodiment of the invention, the number of data elements stored in a register is 64 bits divided by the length in bits of one data element.

Packed word 402 is 64 bits long and contains four word 402 data elements. Each word 402 data element contains 16 bits of information.

Packed doubleword 403 is 64 bits long and contains two doubleword 403 data elements. Each doubleword 403 data element contains 32 bits of information.

Figs. 5a to 5c illustrate the in-register packed data storage representations according to one embodiment of the present invention. Unsigned packed byte in-register representation 510 illustrates the storage of an unsigned packed byte 401 in one of registers R₀ 212a through R₇ 212h. The information for each byte data element is stored in bits 7 to 0 for byte 0, bits 15 to 8 for byte 1, bits 23 to 16 for byte 2, bits 31 to 24 for byte 3, bits 39 to 32 for byte 4, bits 47 to 40 for byte 5, bits 55 to 48 for byte 6 and bits 63 to 56 for byte 7. Thus, all available bits in the register are used. This storage arrangement increases the storage efficiency of the processor. Moreover, with eight data elements accessed, one operation can be performed on all eight data elements simultaneously. Signed packed byte in-register representation 511 illustrates the storage of a signed packed byte 401. Note that only the eighth bit of each byte data element is needed for the sign indicator.

Unsigned packed word in-register representation 512 illustrates how word 3 through word 0 are stored in one register of registers 209. Bits 15 to 0 contain the data element information for word 0, bits 31 to 16 contain the information for word 1, bits 47 to 32 contain the information for word 2, and bits 63 to 48 contain the information for word 3. Signed packed word in-register representation 513 is similar to unsigned packed word in-register representation 512. Note that only the sixteenth bit of each word data element is needed for the sign indicator.

Unsigned packed doubleword in-register representation 514 shows how registers 209 store two doubleword data elements. Doubleword 0 is stored in bits 31 to 0 of the register. Doubleword 1 is stored in bits 63 to 32 of the register. Signed packed doubleword in-register representation 515 is similar to unsigned packed doubleword in-register representation 514. Note that the necessary sign bit is the thirty-second bit of the doubleword data element.
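The byte, word and doubleword layouts described above can be sketched in a few lines of Python. This is only an illustrative model of the bit positions given in the text, not the circuit itself; the function names `unpack`, `pack` and `as_signed` and the example register value are assumptions introduced here.

```python
# Illustrative model of the 64-bit register layouts (names are hypothetical).
ELEMENT_BITS = {"byte": 8, "word": 16, "doubleword": 32}

def unpack(reg64, kind):
    """Split a 64-bit value into data elements; element 0 is bits (width-1)..0."""
    width = ELEMENT_BITS[kind]
    mask = (1 << width) - 1
    count = 64 // width  # number of elements = 64 bits / element length
    return [(reg64 >> (i * width)) & mask for i in range(count)]

def pack(elements, kind):
    """Inverse of unpack: place element i at bits (i*width + width - 1)..(i*width)."""
    width = ELEMENT_BITS[kind]
    mask = (1 << width) - 1
    reg64 = 0
    for i, value in enumerate(elements):
        reg64 |= (value & mask) << (i * width)
    return reg64

def as_signed(value, kind):
    """Interpret an element as signed; the top bit of the element is the sign."""
    width = ELEMENT_BITS[kind]
    return value - (1 << width) if value & (1 << (width - 1)) else value

reg = 0x0807060504030201                        # arbitrary example value
print(unpack(reg, "byte"))                      # [1, 2, 3, 4, 5, 6, 7, 8]
print([hex(w) for w in unpack(reg, "word")])    # ['0x201', '0x403', '0x605', '0x807']
print(as_signed(0x80, "byte"))                  # -128
```

The same 64 bits are reinterpreted rather than converted: only the element width, and whether the top bit of each element is read as a sign, changes between formats.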

As mentioned above, registers 209 can be used for both packed data and floating-point data. In this embodiment of the invention, the individual programming processor 109 may be required to track whether an addressed register, for example R₀ 212a, is storing packed data or floating-point data. In an alternative embodiment, processor 109 can track the type of data stored in each register of registers 209. This alternative embodiment could then generate an error if, for example, a packed addition operation were attempted on floating-point data.

Control signal formats

The following describes one embodiment of the control signal formats used by processor 109 to operate on packed data. In one embodiment of the invention, control signals are represented as 32 bits. Decoder 202 can receive control signal 207 from bus 101. In another embodiment, decoder 202 can also receive such control signals from cache memory 160.

Fig. 6a illustrates a control signal format for indicating the use of packed data according to one embodiment of the present invention. Operation field OP 601 (bits 31 to 26) provides information about the operation to be performed by processor 109, for example packed addition, packed subtraction and so on. SRC1 602 (bits 25 to 20) provides the source register address of a register in registers 209. This source register contains the first packed data, Source1, to be used in the execution of the control signal. Similarly, SRC2 603 (bits 19 to 14) contains the address of a register in registers 209. This second source register contains the packed data, Source2, to be used during execution of the operation. DEST 605 (bits 5 to 0) contains the address of a register in registers 209. This destination register will store the result packed data, Result, of the packed data operation.

Control bits SZ 610 (bits 12 and 13) indicate the length of the data elements in the first and second packed data source registers. If SZ 610 equals 01₂, the packed data is formatted as packed byte 401. If SZ 610 equals 10₂, the packed data is formatted as packed word 402. SZ 610 equal to 00₂ or 11₂ is reserved; however, in another embodiment, one of these values could be used to indicate packed doubleword 403.

Control bit T 611 (bit 11) indicates whether the operation is to be carried out with saturation. If T 611 equals 1, a saturating operation is performed. If T 611 equals 0, a non-saturating operation is performed. Saturating operations are described later.

Control bit S 612 (bit 10) indicates the use of a signed operation. If S 612 equals 1, a signed operation is performed. If S 612 equals 0, an unsigned operation is performed.
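The field positions of the Fig. 6a format can be summarized as a small decoding sketch in Python. It assumes only the bit ranges stated above; the function name `decode_control_signal`, the opcode value 0x05 and the register numbers in the example are hypothetical, since the text does not assign numeric opcodes, and bits 9 to 6 are left uninterpreted because the text does not describe them.

```python
# Sketch of the Fig. 6a field layout; all concrete values below are made up.
SIZE_NAMES = {0b01: "packed byte", 0b10: "packed word"}  # 00 and 11 are reserved

def decode_control_signal(signal):
    """Extract the fields of a 32-bit control signal laid out as in Fig. 6a."""
    return {
        "OP":   (signal >> 26) & 0x3F,  # bits 31..26: operation
        "SRC1": (signal >> 20) & 0x3F,  # bits 25..20: first source register
        "SRC2": (signal >> 14) & 0x3F,  # bits 19..14: second source register
        "SZ":   (signal >> 12) & 0x3,   # bits 13..12: data element length
        "T":    (signal >> 11) & 0x1,   # bit 11: 1 = saturating
        "S":    (signal >> 10) & 0x1,   # bit 10: 1 = signed
        "DEST": signal & 0x3F,          # bits 5..0: destination register
    }

# Hypothetical encoding: opcode 0x05, SRC1 = 1, SRC2 = 2, DEST = 3,
# packed byte elements, saturating, unsigned.
example = (0x05 << 26) | (1 << 20) | (2 << 14) | (0b01 << 12) | (1 << 11) | (0 << 10) | 3
fields = decode_control_signal(example)
print(fields)
print(SIZE_NAMES.get(fields["SZ"], "reserved"))  # packed byte
```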

Fig. 6b illustrates a second control signal format for indicating the use of packed data according to one embodiment of the present invention. This format corresponds to the general integer opcode format described in the "Pentium Processor Family User's Manual," available from Intel Corporation, Literature Sales, P.O. Box 7641, Mt. Prospect, IL, 60056-7641. Note that OP 601, SZ 610, T 611 and S 612 are all merged into one large field. For some control signals, bits 3 to 5 are SRC1 602. In one embodiment, where there is a SRC1 602 address, bits 3 to 5 also correspond to DEST 605. In an alternative embodiment, where there is a SRC2 603 address, bits 0 to 2 also correspond to DEST 605. For other control signals, such as a packed shift immediate operation, bits 3 to 5 represent an extension to the opcode field. In one embodiment, this extension allows a programmer to include an immediate value, such as a shift count, with the control signal. In one embodiment, the immediate value follows the control signal. This is described in more detail in Appendix F, pages F-1 to F-3, of the "Pentium Processor Family User's Manual." Bits 0 to 2 represent SRC2 603. This general format allows register-to-register, memory-to-register, register-by-memory, register-by-register, register-by-immediate and register-to-memory addressing. In addition, in one embodiment, this general format can support integer-register-to-register and register-to-integer-register addressing.

Description of saturating/non-saturating operation

As mentioned above, T 611 indicates whether an operation optionally saturates. Where saturation is enabled, the result is clamped when the result of an operation overflows or underflows the range of the data. Clamping means setting the result to the maximum or minimum value whenever the result exceeds the maximum or minimum value of the range. In the case of underflow, saturation clamps the result to the lowest value in the range; in the case of overflow, to the highest value. The allowable range for each data format is shown in Table 7.

Table 7

As mentioned above, T 611 indicates whether saturating operations are performed. Therefore, using the unsigned byte data format, if the result of an operation is 258 and saturation is enabled, the result is clamped to 255 before being stored into the destination register of the operation. Similarly, if the result of an operation is -32999 and processor 109 is using the signed word data format with saturation enabled, the result is clamped to -32768 before being stored into the destination register of the operation.
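The clamping rule can be written out as a short Python sketch that reproduces the two examples above. The ranges are computed from the usual unsigned and two's-complement limits for a given element width, which is an assumption made here because the contents of Table 7 are not reproduced in this text; the function names are likewise illustrative.

```python
def element_range(bits, signed):
    """Return (minimum, maximum) for an element of the given width.
    Assumes standard unsigned / two's-complement ranges."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

def saturate(value, bits, signed):
    """Clamp an out-of-range result to the nearest end of the allowed range."""
    lowest, highest = element_range(bits, signed)
    return max(lowest, min(highest, value))

print(saturate(258, 8, signed=False))     # 255   (unsigned byte, overflow)
print(saturate(-32999, 16, signed=True))  # -32768 (signed word, underflow)
print(saturate(100, 8, signed=True))      # 100   (in range, unchanged)
```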

Packed addition

Packed addition operation

One embodiment of the present of invention can perform grouping additive operation in performance element 130.That is, the present invention makes each data element of the first integrated data can individually be added on each data element of the second integrated data.

Fig. 7a illustrates a method of performing packed addition according to one embodiment of the present invention. In step 701, decoder 202 decodes control signal 207 received by processor 109. Thus, decoder 202 decodes: the operation code for packed addition; the SRC1 602, SRC2 603 and DEST 605 addresses in registers 209; saturating/non-saturating; signed/unsigned; and the length of the data elements in the packed data. In step 702, decoder 202 accesses, via internal bus 170, the registers 209 in register file 150 given the SRC1 602 and SRC2 603 addresses. Registers 209 provide execution unit 130 with the packed data, Source1 and Source2, stored in the registers at these addresses, respectively. That is, registers 209 communicate the packed data to execution unit 130 via internal bus 170.

In step 703, decoder 202 enables execution unit 130 to perform the packed addition operation. Decoder 202 also communicates, via internal bus 170, the length of the packed data elements, whether saturation is to be used, and whether signed arithmetic is to be used. In step 704, the length of the data elements determines which step is executed next. If the length of the data elements in the packed data is 8 bits (byte data), execution unit 130 performs step 705a. If, however, the length of the data elements in the packed data is 16 bits (word data), execution unit 130 performs step 705b. In one embodiment of the invention, only packed addition with 8-bit and 16-bit data element lengths is supported. However, other embodiments can support different and/or additional lengths. For example, an alternative embodiment could additionally support packed addition with a 32-bit data element length.

Assuming the length of the data elements is 8 bits, step 705a is performed. Execution unit 130 adds bits 7 to 0 of Source1 to bits 7 to 0 of Source2, producing bits 7 to 0 of the Result packed data. In parallel with this addition, execution unit 130 performs the same addition on each of the remaining byte positions: bits 15 to 8, bits 23 to 16, bits 31 to 24, bits 39 to 32, bits 47 to 40, bits 55 to 48 and bits 63 to 56 of Source1 are each added to the same bits of Source2, producing the same bits of the Result packed data.

Assuming the length of the data elements is 16 bits, step 705b is performed. Execution unit 130 adds bits 15 to 0 of Source1 to bits 15 to 0 of Source2, producing bits 15 to 0 of the Result packed data. In parallel with this addition, execution unit 130 performs the same addition on each of the remaining word positions: bits 31 to 16, bits 47 to 32 and bits 63 to 48 of Source1 are each added to the same bits of Source2, producing the same bits of the Result packed data.

In step 706, decoder 202 enables a register in registers 209 with the DEST 605 address of the destination register. Thus, the Result is stored in the register addressed by DEST 605.

Table 8a illustrates the in-register representation of a packed addition operation. The first row of bits is the packed data representation of the Source1 packed data. The second row of bits is the packed data representation of the Source2 packed data. The third row of bits is the packed data representation of the Result packed data. The number below each data element is the data element number. For example, Source1 data element 0 is 10001000₂. Therefore, if the data elements are 8 bits long (byte data) and unsigned, non-saturating addition is performed, execution unit 130 produces the Result packed data shown.

Note that in one embodiment of the invention, where a result overflows or underflows and the operation is non-saturating, the result is simply truncated. That is, the carry bit is ignored. For example, in Table 8a, the in-register representation of result data element 1 would be: 10001000₂ + 10001000₂ = 00010000₂. Similarly, for underflow, the result is also truncated. This form of truncation enables a programmer to perform modular arithmetic easily. For example, the equation for result data element 1 can be expressed as: (Source1 data element 1 + Source2 data element 1) mod 256 = result data element 1. Furthermore, one skilled in the art will understand from this description that overflows and underflows could be detected by setting error bits in a status register.

Table 8a
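The non-saturating behavior described above amounts to adding each element modulo 2 raised to the element width. The following Python sketch models that lane by lane; it is a software illustration under that reading, not the adder circuit, and the function name `packed_add` and the operand values are chosen here for the example. Note that 0x88 + 0x88 = 0x110, which truncates to 0x10 in an 8-bit element.

```python
def packed_add(src1, src2, width):
    """Unsigned, non-saturating packed add of two 64-bit values.
    Each width-bit element is added separately and the carry out is dropped."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        a = (src1 >> shift) & mask
        b = (src2 >> shift) & mask
        result |= ((a + b) & mask) << shift  # (a + b) mod 2**width
    return result

# Two byte elements, each holding 10001000 (0x88), in both sources.
print(hex(packed_add(0x8888, 0x8888, 8)))   # 0x1010: each byte wraps on its own
print(hex(packed_add(0x8888, 0x8888, 16)))  # 0x1110: one word, bit 7 carries into bit 8
```

The second call shows why the element length matters: with 16-bit elements the carry out of bit 7 is kept and only the carry out of bit 15 is discarded.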

Table 8b illustrates the in-register representation of a packed word data addition operation. Therefore, if the data elements are 16 bits long (word data) and unsigned, non-saturating addition is performed, execution unit 130 produces the Result packed data shown. Note that in word data element 2, the carry from bit 7 (see the emphasized bit 1 below) propagates into bit 8, causing data element 2 to overflow (see the emphasized "overflow" below).

Table 8b

Table 8c illustrates the in-register representation of a packed doubleword data addition operation. This operation is supported in an alternative embodiment of the present invention. Therefore, if the data elements are 32 bits long (i.e., doubleword data) and unsigned, non-saturating addition is performed, execution unit 130 produces the Result packed data shown. Note that the carries from bit 7 and bit 15 of doubleword data element 1 propagate into bit 8 and bit 16, respectively.

Table 8c

To better illustrate the difference between packed addition and ordinary addition, the data from the above example is duplicated in Table 9. In this case, however, ordinary (64-bit) addition is performed on the data. Note that the carries from bit 7, bit 15, bit 23, bit 31, bit 39 and bit 47 are carried into bit 8, bit 16, bit 24, bit 32, bit 40 and bit 48, respectively.

Table 9
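The contrast between ordinary 64-bit addition and packed byte addition can also be seen on a small pair of operands. The sketch below is illustrative only: the operand values are made up for the example and are not the data of Table 9, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def ordinary_add64(a, b):
    """Plain 64-bit addition: carries ripple across every bit position."""
    return (a + b) & MASK64

def packed_add_bytes(a, b):
    """Packed byte addition: the carry out of each byte is discarded."""
    result = 0
    for shift in range(0, 64, 8):
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result

a, b = 0x01FF, 0x0101
print(hex(ordinary_add64(a, b)))    # 0x300: the carry from bit 7 lands in bit 8
print(hex(packed_add_bytes(a, b)))  # 0x200: byte 0 wraps to 0x00, byte 1 is 0x01 + 0x01
```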

Signed/non-saturating packed addition

Table 10 illustrates an example of signed packed addition in which the data element length of the packed data is 8 bits. Saturation is not used. Therefore, the results can overflow and underflow. Table 10 uses different data from Tables 8a-8c and Table 9.

Table 10

Signed/saturating packed addition

Table 11 illustrates an example of signed packed addition in which the data element length of the packed data is 8 bits. Saturation is used; therefore, overflow is clamped to the maximum value and underflow is clamped to the minimum value. Table 11 uses the same data as Table 10. Here, data element 0 and data element 2 are clamped to the minimum value, while data element 4 and data element 6 are clamped to the maximum value.

Table 11
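A signed, saturating packed addition can be sketched in Python as follows. The operand values are invented for illustration and are not the data of Tables 10 and 11; the function names are hypothetical, and the signed range is assumed to be the standard two's-complement range for the element width.

```python
def to_signed(value, width):
    """Interpret a width-bit pattern as a two's-complement number."""
    return value - (1 << width) if value & (1 << (width - 1)) else value

def packed_add_signed_saturating(src1, src2, width=8):
    """Add each signed element separately and clamp to the signed range."""
    mask = (1 << width) - 1
    lowest, highest = -(1 << (width - 1)), (1 << (width - 1)) - 1
    result = 0
    for shift in range(0, 64, width):
        a = to_signed((src1 >> shift) & mask, width)
        b = to_signed((src2 >> shift) & mask, width)
        clamped = max(lowest, min(highest, a + b))
        result |= (clamped & mask) << shift  # store back as a bit pattern
    return result

# element 0: 127 + 1      -> overflow, clamped to 127 (0x7F)
# element 1: -128 + (-1)  -> underflow, clamped to -128 (0x80)
# element 2: 16 + 32      -> in range, 48 (0x30)
print(hex(packed_add_signed_saturating(0x10807F, 0x20FF01)))  # 0x30807f
```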

Packed subtraction

Packed subtraction operation

One embodiment of the present of invention make to perform grouping subtraction in performance element 130.That is, the present invention makes each data element of the second integrated data can deduct respectively from each data element of the first integrated data.

Fig. 7b illustrates a method of performing packed subtraction according to one embodiment of the present invention. Note that steps 710-713 are similar to steps 701-704.

In the present embodiment of the invention, only packed subtraction with 8-bit and 16-bit data element lengths is supported. However, alternative embodiments can support different and/or additional lengths. For example, an alternative embodiment could additionally support packed subtraction with a 32-bit data element length.

Assuming the length of the data elements is 8 bits, steps 714a and 715a are performed. In step 714a, execution unit 130 takes the 2's complement of bits 7 to 0 of Source2. In parallel with this 2's complement, execution unit 130 takes the 2's complement of each of the remaining byte positions of Source2: bits 15 to 8, bits 23 to 16, bits 31 to 24, bits 39 to 32, bits 47 to 40, bits 55 to 48 and bits 63 to 56. In step 715a, execution unit 130 adds the 2's-complemented bits of Source2 to the bits of Source1, as generally described for step 705a.

Assuming the length of the data elements is 16 bits, steps 714b and 715b are performed. In step 714b, execution unit 130 takes the 2's complement of bits 15 to 0 of Source2. In parallel with this 2's complement, execution unit 130 takes the 2's complement of each of the remaining word positions of Source2: bits 31 to 16, bits 47 to 32 and bits 63 to 48. In step 715b, execution unit 130 adds the 2's-complemented bits of Source2 to the bits of Source1, as generally described for step 705b.

Note that steps 714 and 715 are the method used in one embodiment of the invention to subtract a first number from a second number. However, other forms of subtraction are known in the art, and the invention should not be considered limited to the use of 2's complement arithmetic.

In step 716, decoder 202 enables registers 209 with the destination address of the destination register. Thus, the result packed data is stored in the DEST register of registers 209.

Table 12 illustrates the in-register representation of a packed subtraction operation. Assuming the data elements are 8 bits long (byte data) and unsigned, non-saturating subtraction is performed, execution unit 130 produces the result packed data shown.

Table 12
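The two-step subtraction (take the 2's complement of each Source2 element, then add it to the corresponding Source1 element) can be modeled per element in Python. This is a sketch of the arithmetic only, for the unsigned, non-saturating case; the function name `packed_sub` and the operand values are assumptions made for the example and are not the data of Table 12.

```python
def packed_sub(src1, src2, width):
    """Unsigned, non-saturating packed subtract (src1 - src2), element by element."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        a = (src1 >> shift) & mask
        b = (src2 >> shift) & mask
        neg_b = ((b ^ mask) + 1) & mask           # step 714: invert the bits and add one
        result |= ((a + neg_b) & mask) << shift   # step 715: add, carry out dropped
    return result

# element 0: 0x05 - 0x07 wraps around to 0xFE; element 1: 0x10 - 0x01 = 0x0F
print(hex(packed_sub(0x1005, 0x0107, 8)))  # 0xffe
```

Because the carry out of each element is dropped, the result of each lane equals (a - b) mod 2 raised to the element width, which mirrors the truncation behavior described for packed addition.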

Packed data addition/subtraction circuits

Fig. 8 illustrates a circuit for performing packed addition and packed subtraction on individual bits of packed data according to one embodiment of the present invention. Fig. 8 shows a modified bit-slice adder/subtracter 800. Adder/subtracters 801a-b enable two bits from Source2 to be added to, or subtracted from, Source1. Operation and carry control 803 transmits control signals to control circuit 809a to enable an addition or a subtraction operation. Thus, adder/subtracter 801a adds the bit i received on Source2ᵢ 805a to, or subtracts it from, the bit i received on Source1ᵢ 804a, producing a result bit transmitted on Resultᵢ 806a. Cin 807a-b and Cout 808a-b represent the carry control circuitry commonly found in adder/subtracters.

Bit control 802 is enabled from operation and carry control 803, via packed data enable 811, to control Cinᵢ₊₁ 807b and Coutᵢ. For example, in Table 13a, an unsigned packed byte addition is performed. If adder/subtracter 801a adds Source1 bit 7 to Source2 bit 7, then operation and carry control 803 enables bit control 802, stopping the carry from propagating from bit 7 to bit 8.

Table 13a

However, if an unsigned packed word addition is performed, and adder/subtracter 801a is similarly used to add bit 7 of Source1 to bit 7 of Source2, bit control 802 propagates the carry to bit 8. Table 13b illustrates this result. This propagation is allowed for packed doubleword addition as well as for non-packed addition.

Table 13b

Adder/subtracter 801a subtracts bit Source2ᵢ 805a from Source1ᵢ 804a by first forming the 2's complement of Source2ᵢ 805a, that is, by inverting Source2ᵢ 805a and adding one. Adder/subtracter 801a then adds this result to Source1ᵢ 804a. Bit-slice 2's complement arithmetic techniques are well known in the art, and one skilled in the art would understand how to design such a bit-slice 2's complement arithmetic circuit. Note that the propagation of carries is controlled by bit control 802 and operation and carry control 803.

Fig. 9 illustrates a circuit for performing packed addition and packed subtraction on packed byte data according to one embodiment of the present invention. Source1 bus 901 and Source2 bus 902 carry the information signals into adder/subtracters 908a-h via Source1ᵢₙ 906a-h and Source2ᵢₙ 905a-h, respectively. Thus, adder/subtracter 908a adds Source2 bits 7 to 0 to, or subtracts them from, Source1 bits 7 to 0; adder/subtracter 908b adds Source2 bits 15 to 8 to, or subtracts them from, Source1 bits 15 to 8; and so on. CTRL 904a-h receive from operation control 903, via packed control 911, control signals for disabling carry propagation, enabling/disabling saturation, and enabling/disabling signed/unsigned arithmetic. Operation control 903 disables carry propagation by receiving carry information from CTRL 904a-h and not propagating it to the next most significant adder/subtracter 908a-h. Thus, operation control 903 performs the operations of operation and carry control 803 and bit control 802 for 64-bit packed data. Given the examples in Figs. 1-9 and the foregoing description, one skilled in the art could construct this circuit.

Adder/subtracters 908a-h communicate the result information of the various packed additions to result registers 910a-h via result outputs 907a-h. Each result register 910a-h stores, and subsequently transmits, the result information onto Result bus 909. This result information is then stored in the integer register specified by the DEST 605 register address.

Fig. 10 is a logical view of a circuit for performing packed addition and packed subtraction on packed word data according to one embodiment of the present invention. Here, packed word operations are being performed. Operation control 903 enables carry propagation between bit 8 and bit 7, bit 24 and bit 23, bit 40 and bit 39, and bit 56 and bit 55. Thus, adder/subtracters 908a and 908b, shown as virtual adder/subtracter 1008a, work together to add the first word (bits 15 to 0) of packed word data Source2 to, or subtract it from, the first word (bits 15 to 0) of packed word data Source1; adder/subtracters 908c and 908d, shown as virtual adder/subtracter 1008b, work together to add the second word (bits 31 to 16) of packed word data Source2 to, or subtract it from, the second word (bits 31 to 16) of packed word data Source1; and so on.

Virtual adder/subtracters 1008a-d communicate the result information to virtual result registers 1010a-d via result outputs 1007a-d (the combined result outputs 907a-b, 907c-d, 907e-f and 907g-h). Each virtual result register 1010a-d (the combined result registers 910a-b, 910c-d, 910e-f and 910g-h) stores a 16-bit result data element to be communicated onto Result bus 909.

Fig. 11 is a logical view of a circuit for performing packed addition and packed subtraction on packed doubleword data according to one embodiment of the present invention. Operation control 903 enables carry propagation between bit 8 and bit 7, bit 16 and bit 15, bit 24 and bit 23, bit 40 and bit 39, bit 48 and bit 47, and bit 56 and bit 55. Thus, adder/subtracters 908a-d, shown as virtual adder/subtracter 1108a, work together to add the first doubleword (bits 31 to 0) of packed doubleword data Source2 to, or subtract it from, the first doubleword (bits 31 to 0) of packed doubleword data Source1; adder/subtracters 908e-h, shown as virtual adder/subtracter 1108b, work together to add the second doubleword (bits 63 to 32) of packed doubleword data Source2 to, or subtract it from, the second doubleword (bits 63 to 32) of packed doubleword data Source1.

Virtual adder/subtracters 1108a-b communicate the result information to virtual result registers 1110a-b via result outputs 1107a-b (the combined result outputs 907a-d and 907e-h). Each virtual result register 1110a-b (the combined result registers 910a-d and 910e-h) stores a 32-bit result data element to be communicated onto Result bus 909.
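The way eight byte-wide adder slices combine into "virtual" word or doubleword adders can be modeled by a carry chain whose links are switched on or off according to the element length. The Python sketch below is a behavioral approximation of Figs. 9-11 for addition only, not a description of the actual circuit; the function names are hypothetical.

```python
def enabled_carry_boundaries(width):
    """List the (upper bit, lower bit) pairs across which carries may propagate."""
    return [(s, s - 1) for s in range(8, 64, 8) if s % width != 0]

def add_with_carry_control(src1, src2, width):
    """Add two 64-bit values with eight byte-wide slices.
    The carry into a slice is forced to 0 when the slice starts a new element."""
    result = 0
    carry = 0
    for slice_index in range(8):
        shift = 8 * slice_index
        if shift % width == 0:
            carry = 0  # element boundary: carry propagation disabled
        total = ((src1 >> shift) & 0xFF) + ((src2 >> shift) & 0xFF) + carry
        result |= (total & 0xFF) << shift
        carry = total >> 8  # carry out of this slice
    return result

print(enabled_carry_boundaries(16))
# [(8, 7), (24, 23), (40, 39), (56, 55)]
print(hex(add_with_carry_control(0x00FF, 0x0001, 8)))   # 0x0: carry stopped at bit 7
print(hex(add_with_carry_control(0x00FF, 0x0001, 16)))  # 0x100: carry passes into bit 8
```

For a width of 32 the helper lists the six boundaries named for Fig. 11, and for a width of 8 it lists none, which corresponds to the packed byte case of Fig. 9.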

Packed multiplication

Packed multiply operation

In one embodiment of the invention, the SRC1 register contains the multiplicand data (Source1), the SRC2 register contains the multiplier data (Source2), and the DEST register will contain a portion of the product (Result). That is, each data element of Source1 is independently multiplied by the corresponding data element of Source2. Depending on the type of multiplication, Result will contain either the high-order bits or the low-order bits of the product.

In one embodiment of the invention, the following multiply operations are supported: multiply high unsigned packed, multiply high signed packed and multiply low packed. High/low indicates which bits of the product are to be included in Result. This is needed because multiplying two N-bit numbers yields a product of 2N bits. Because each result data element is the same size as the multiplicand and multiplier data elements, the result can represent only half of the product. High causes the higher-order bits to be output as the result. Low causes the lower-order bits to be output as the result. For example, an unsigned high packed multiplication of Source1[7:0] × Source2[7:0] stores the high-order bits of the product in Result[7:0].

In one embodiment of the invention, the use of the high/low operation modifier removes the possibility of an overflow from one data element into the next higher data element. That is, this modifier allows the programmer to select which bits of the product are to be placed in the result without concern for overflow. The programmer can generate a complete 2N-bit product using a combination of packed multiply operations. For example, the programmer can use a multiply high unsigned packed operation followed by a multiply low packed operation, with the same Source1 and Source2, to obtain the complete (2N-bit) products. The multiply high operation is provided because the high-order bits of the product are often the only important part of the product. The programmer can obtain the high-order bits of the product without first having to perform any truncation, which is usually required for non-packed data operations.
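The combination of a multiply high and a multiply low operation into full 2N-bit products can be illustrated with a short Python sketch. It covers only the unsigned case, and the function names `packed_mul` and `full_products` as well as the operand values are assumptions introduced for this example.

```python
def packed_mul(src1, src2, width, high):
    """Unsigned packed multiply: keep the high or the low half of each product."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        product = ((src1 >> shift) & mask) * ((src2 >> shift) & mask)  # 2*width bits
        half = (product >> width) if high else product
        result |= (half & mask) << shift
    return result

def full_products(high_result, low_result, width):
    """Recombine the two packed results into one 2*width-bit product per element."""
    mask = (1 << width) - 1
    return [(((high_result >> s) & mask) << width) | ((low_result >> s) & mask)
            for s in range(0, 64, width)]

src1 = (0xFFFF << 16) | 0x1234   # word 1 = 0xFFFF, word 0 = 0x1234
src2 = (0xFFFF << 16) | 0x0010   # word 1 = 0xFFFF, word 0 = 0x0010
hi = packed_mul(src1, src2, 16, high=True)
lo = packed_mul(src1, src2, 16, high=False)
print([hex(p) for p in full_products(hi, lo, 16)])
# ['0x12340', '0xfffe0001', '0x0', '0x0']
```

Each element's high and low halves stay inside that element's own bit positions, so no product can spill into a neighboring element.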

In one embodiment of the invention, each data element in Source2 can have a different value. This gives the programmer the flexibility to use a different value as the multiplier for each multiplicand in Source1.

In step 1201, decoder 202 decodes control signal 207 received by processor 109. Thus, decoder 202 decodes: the operation code for the appropriate multiply operation; the SRC1 602, SRC2 603 and DEST 605 addresses in registers 209; signed/unsigned; high/low; and the length of the data elements in the packed data.

In step 1202, decoder 202 accesses, via internal bus 170, registers 209 in register file 150 given the SRC1 602 and SRC2 603 addresses. Registers 209 provide execution unit 130 with the integrated data (Source1) stored in the SRC1 602 register and the integrated data (Source2) stored in the SRC2 603 register. That is, registers 209 pass the integrated data to execution unit 130 via internal bus 170.

In step 1130, decoder 202 enables execution unit 130 to perform the appropriate grouping multiply operation. Decoder 202 also transmits, via internal bus 170, the length of the data elements and the high/low selection for the multiply operation.

In step 1210, the length of the data elements determines which step is performed next. If the length of the data elements is 8 bits (byte data), execution unit 130 performs step 1212. However, if the length of the data elements in the integrated data is 16 bits (word data), execution unit 130 performs step 1214. In one embodiment, only grouping multiplication of 16-bit data elements is supported. In another embodiment, grouping multiplication of both 8-bit and 16-bit data elements is supported. In yet another embodiment, grouping multiplication of 32-bit data elements is also supported.

Assuming the length of the data elements is 8 bits, step 1212 is performed. In step 1212, the following operations are performed. Source1 bits 7 to 0 are multiplied by Source2 bits 7 to 0 to generate Result bits 7 to 0. Source1 bits 15 to 8 are multiplied by Source2 bits 15 to 8 to generate Result bits 15 to 8. Source1 bits 23 to 16 are multiplied by Source2 bits 23 to 16 to generate Result bits 23 to 16. Source1 bits 31 to 24 are multiplied by Source2 bits 31 to 24 to generate Result bits 31 to 24. Source1 bits 39 to 32 are multiplied by Source2 bits 39 to 32 to generate Result bits 39 to 32. Source1 bits 47 to 40 are multiplied by Source2 bits 47 to 40 to generate Result bits 47 to 40. Source1 bits 55 to 48 are multiplied by Source2 bits 55 to 48 to generate Result bits 55 to 48. Source1 bits 63 to 56 are multiplied by Source2 bits 63 to 56 to generate Result bits 63 to 56.

Assuming the length of the data elements is 16 bits, step 1214 is performed. In step 1214, the following operations are performed. Source1 bits 15 to 0 are multiplied by Source2 bits 15 to 0 to generate Result bits 15 to 0. Source1 bits 31 to 16 are multiplied by Source2 bits 31 to 16 to generate Result bits 31 to 16. Source1 bits 47 to 32 are multiplied by Source2 bits 47 to 32 to generate Result bits 47 to 32. Source1 bits 63 to 48 are multiplied by Source2 bits 63 to 48 to generate Result bits 63 to 48.

In one embodiment, the multiplications of step 1212 are performed simultaneously. However, in another embodiment, these multiplications are performed serially. In another embodiment, some of these multiplications are performed simultaneously and some are performed serially. This discussion applies equally to the multiplications of step 1214.
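The element-wise multiplications of steps 1212 and 1214 can be modelled in software roughly as follows. This is a behavioural sketch in Python, not the circuit itself; the function name `packed_multiply` and its parameters are assumptions chosen for the example, and the loop runs serially even though the hardware may operate on all elements at once.

```python
def packed_multiply(src1, src2, elem_bits=16, high=False, signed=False):
    """Multiply corresponding elements of two 64-bit values, keeping one half of each product.

    Hypothetical model: elem_bits is 8 or 16, high selects the high-order half.
    """
    mask = (1 << elem_bits) - 1
    sign_bit = 1 << (elem_bits - 1)
    result = 0
    for pos in range(0, 64, elem_bits):
        a = (src1 >> pos) & mask
        b = (src2 >> pos) & mask
        if signed:
            # Reinterpret each element as a two's complement value.
            a = a - (1 << elem_bits) if a & sign_bit else a
            b = b - (1 << elem_bits) if b & sign_bit else b
        product = a * b
        half = (product >> elem_bits) if high else product
        # Masking keeps each result inside its own element, so no carry
        # reaches the neighbouring element.
        result |= (half & mask) << pos
    return result
```

For instance, with two byte elements of 0x10 each, the low halves are 0x00 and the high halves are 0x01, and the overflow of one element leaves the other untouched.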

In step 1220, Result is stored in DEST register.

Table 14 illustrates the register representation of a multiply high unsigned grouping operation on grouping word data. The bits of the first row are the integrated data representation of Source1. The bits of the second row are the data representation of Source2. The bits of the third row are the integrated data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 2 is 1111111100000000₂.

Table 14

Table 15 illustrates the register representation of a multiply high signed grouping operation on grouping word data.

Table 15

Table 16 illustrates the register representation of a multiply low grouping operation on grouping word data.

Table 16

Integrated data multiplication circuit

In one embodiment, the multiply operation can occur on multiple data elements in the same number of clock cycles as a single multiply operation on non-grouped data. To achieve execution in the same number of clock cycles, parallelism is used. That is, registers are simultaneously instructed to perform the multiply operation on the data elements. This is discussed in more detail below.

Figure 13 illustrates a circuit for performing grouping multiplication according to one embodiment of the invention. Operation control 1300 controls the circuits that perform the multiplication. Operation control 1300 processes the control signal for the multiply operation and has the following outputs: high/low enable 1380, byte/word enable 1381 and sign enable 1382. High/low enable 1380 identifies whether the high-order or the low-order bits of the product are to be included in the result. Byte/word enable 1381 identifies whether a byte or a word integrated data multiply operation is to be performed. Sign enable 1382 indicates whether signed multiplication is to be used.

Grouping word multiplier 1301 multiplies four word data elements simultaneously. Blocked byte multiplier 1302 multiplies eight byte data elements. Grouping word multiplier 1301 and blocked byte multiplier 1302 each have the following inputs: Source1 [63:0] 1331, Source2 [63:0] 1333, sign enable 1382 and high/low enable 1380.

Grouping word multiplier 1301 comprises four 16 × 16 multiplier circuits: 16 × 16 multiplier A 1310, 16 × 16 multiplier B 1311, 16 × 16 multiplier C 1312 and 16 × 16 multiplier D 1313. 16 × 16 multiplier A 1310 has inputs Source1 [15:0] and Source2 [15:0], 16 × 16 multiplier B 1311 has inputs Source1 [31:16] and Source2 [31:16], 16 × 16 multiplier C 1312 has inputs Source1 [47:32] and Source2 [47:32], and 16 × 16 multiplier D 1313 has inputs Source1 [63:48] and Source2 [63:48]. Each 16 × 16 multiplier is coupled to sign enable 1382. Each 16 × 16 multiplier produces a 32-bit product. For each multiplier, a multiplexer (Mx0 1350, Mx1 1351, Mx2 1352 and Mx3 1353, respectively) receives the 32-bit result. Depending on the value of high/low enable 1380, each multiplexer outputs the 16 high-order bits or the 16 low-order bits of the product. The outputs of the four multiplexers are combined into a 64-bit result. This result is optionally stored in result register 1 1371.

Blocked byte multiplier 1302 comprises eight 8 × 8 multiplier circuits: 8 × 8 multiplier A 1320 through 8 × 8 multiplier H 1327. Each 8 × 8 multiplier has an 8-bit input from each of Source1 [63:0] 1331 and Source2 [63:0] 1333. For example, 8 × 8 multiplier A 1320 has inputs Source1 [7:0] and Source2 [7:0], while 8 × 8 multiplier H 1327 has inputs Source1 [63:56] and Source2 [63:56]. Each 8 × 8 multiplier is coupled to sign enable 1382. Each 8 × 8 multiplier produces a 16-bit product. For each multiplier, a multiplexer (for example, Mx4 1360 and Mx11 1367) receives the 16-bit result. Depending on the value of high/low enable 1380, each multiplexer outputs the 8 high-order bits or the 8 low-order bits of the product. The outputs of the eight multiplexers are combined into a 64-bit result. This result is optionally stored in result register 2 1372. Depending on the length of the data elements required by the operation, byte/word enable 1381 enables the particular result register.

In one embodiment, the area used to implement the multiplications is reduced by building circuits that can multiply either two 8 × 8 numbers or one 16 × 16 number. In effect, two 8 × 8 multipliers and one 16 × 16 multiplier are combined into a single 8 × 8 and 16 × 16 multiplier. Operation control 1300 enables the appropriate size for the multiplication. In this embodiment, the physical area used by the multipliers is reduced, but it becomes difficult to perform both a blocked byte multiplication and a grouping word multiplication. In another embodiment that supports grouping doubleword multiplication, one multiplier can perform four 8 × 8 multiplications, two 16 × 16 multiplications or one 32 × 32 multiplication.

In one embodiment, only the grouping word multiply operation is provided. In this embodiment, blocked byte multiplier 1302 and result register 2 1372 are not included.

Advantages of including the above grouping multiply operation in the instruction set

Thus, the grouping multiply instruction described above provides independent multiplication of each data element in Source1 by its corresponding data element in Source2. Of course, an algorithm that requires each element of Source1 to be multiplied by the same number can be carried out by storing that number in every element of Source2. In addition, this multiply instruction prevents overflow by breaking the carry chain. This relieves the programmer of that responsibility, removes the need for instructions that prepare the data to prevent overflow, and results in more robust code.

By contrast, prior art general-purpose processors that do not support this instruction must perform this operation by decomposing the data elements, performing the multiplications, and then assembling the results for further grouped processing. Processor 109, on the other hand, can multiply different data elements of the integrated data by different multipliers in parallel using a single instruction.

Typical multimedia algorithms perform a large number of multiply operations. Thus, by reducing the number of instructions required to perform these multiply operations, the performance of these multimedia algorithms is improved. By providing this multiply instruction in the instruction set supported by processor 109, processor 109 can execute algorithms that require this functionality at a higher level of performance.

Multiply-add/subtract

Multiply-add/subtract operations

In one embodiment, two multiply-add operations are performed using a single multiply-add instruction, as shown in Table 17a and Table 17b below. Table 17a shows a simplified representation of the disclosed multiply-add instruction, while Table 17b shows a bit-level example of the disclosed multiply-add instruction.

Table 17a

Table 17b

The multiply-subtract operation is the same as the multiply-add operation, except that the addition is replaced with a subtraction. Table 12 shows the operation of an example multiply-subtract instruction that performs two multiply-subtract operations.

Table 12

In one embodiment of the invention, the SRC1 register contains integrated data (Source1), the SRC2 register contains integrated data (Source2), and the DEST register will contain the result (Result) of performing the multiply-add or multiply-subtract instruction on Source1 and Source2. In the first step of the multiply-add or multiply-subtract instruction, each data element of Source1 is independently multiplied by the corresponding data element of Source2 to generate a set of corresponding intermediate results. When the multiply-add instruction is executed, these intermediate results are added in pairs to generate two data elements, which are stored as data elements of Result. By contrast, when the multiply-subtract instruction is executed, these intermediate results are subtracted in pairs to generate two data elements, which are stored as data elements of Result.

Alternative embodiments may vary the number of bits in the data elements of the intermediate results and/or in the data elements of Result. In addition, alternative embodiments may vary the number of data elements in Source1, Source2 and Result. For example, if Source1 and Source2 each have 8 data elements, the multiply-add/subtract instructions can be implemented to produce a Result with 4 data elements (each data element in Result representing the sum of two intermediate results), with 2 data elements (each data element in Result representing the sum of four intermediate results), and so on.

Figure 14 is a flow diagram illustrating a method for performing multiply-add and multiply-subtract operations on integrated data according to one embodiment of the invention.

In step 1401, decoder 202 decodes the control signal 207 received by processor 109. Decoder 202 thus decodes the operation code for a multiply-add or multiply-subtract instruction.

In step 1402, decoder 202 accesses, via internal bus 170, registers 209 in register file 150 given the SRC1 602 and SRC2 603 addresses. Registers 209 provide execution unit 130 with the integrated data (Source1) stored in the SRC1 602 register and the integrated data (Source2) stored in the SRC2 603 register. That is, registers 209 pass the integrated data to execution unit 130 via internal bus 170.

In step 1403, decoder 202 enables execution unit 130 to execute the instruction. If the instruction is a multiply-add instruction, flow passes to step 1414. However, if the instruction is a multiply-subtract instruction, flow passes to step 1415.

In step 1414, the following operations are performed. Source1 bits 15 to 0 are multiplied by Source2 bits 15 to 0 to generate a first 32-bit intermediate result (intermediate result 1). Source1 bits 31 to 16 are multiplied by Source2 bits 31 to 16 to generate a second 32-bit intermediate result (intermediate result 2). Source1 bits 47 to 32 are multiplied by Source2 bits 47 to 32 to generate a third 32-bit intermediate result (intermediate result 3). Source1 bits 63 to 48 are multiplied by Source2 bits 63 to 48 to generate a fourth 32-bit intermediate result (intermediate result 4). Intermediate result 1 is added to intermediate result 2 to generate bits 31 to 0 of Result, and intermediate result 3 is added to intermediate result 4 to generate bits 63 to 32 of Result.

Step 1415 is the same as step 1414, except that intermediate result 1 and intermediate result 2 are subtracted to generate bits 31 to 0 of Result, and intermediate result 3 and intermediate result 4 are subtracted to generate bits 63 to 32 of Result.
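Steps 1414 and 1415 can be sketched as the following behavioural model in Python. Several details are assumptions made for illustration and are not stated in this passage: the function name `packed_multiply_add`, the treatment of the 16-bit elements as signed values, wraparound of each sum to 32 bits, and the subtraction order (intermediate result 1 minus 2, and 3 minus 4).

```python
def packed_multiply_add(src1, src2, subtract=False):
    """Model of the multiply-add (step 1414) and multiply-subtract (step 1415) data flow.

    Hypothetical helper: treats the four 16-bit elements as signed.
    """
    def signed_word(value, index):
        word = (value >> (16 * index)) & 0xFFFF
        return word - 0x10000 if word & 0x8000 else word

    # Four independent 16 x 16 multiplications give four intermediate results.
    inter = [signed_word(src1, i) * signed_word(src2, i) for i in range(4)]

    if subtract:
        low = inter[0] - inter[1]
        high = inter[2] - inter[3]
    else:
        low = inter[0] + inter[1]
        high = inter[2] + inter[3]

    # Result bits 31 to 0 hold the first pair, bits 63 to 32 the second pair.
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)
```

With Source1 elements 1, 2, 3, 4 and Source2 elements 5, 6, 7, 8 (element 0 first), the intermediate results are 5, 12, 21 and 32, so the multiply-add yields 17 in the low half and 53 in the high half.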

Different embodiments may perform the multiplications and additions/subtractions serially, in parallel, or in some combination of serial and parallel operations.

In step 1420, Result is stored in DEST register.

Integrated data multiply-add/subtract circuit

In one embodiment, the multiply-add and multiply-subtract instructions can each occur on multiple data elements in the same number of clock cycles as a single multiplication on non-grouped data. To achieve execution in the same number of clock cycles, parallelism is used. That is, registers are simultaneously instructed to perform the multiply-add or multiply-subtract operation on the data elements. This is discussed in more detail below.

Figure 15 illustrates a circuit for performing multiply-add and/or multiply-subtract operations on integrated data according to one embodiment of the invention. Operation control 1500 processes the control signals for the multiply-add and multiply-subtract instructions. Operation control 1500 outputs signals on enable 1580 to control grouping multiply-adder/subtracter 1501.

Grouping multiply-adder/subtracter 1501 has the following inputs: Source1 [63:0] 1531, Source2 [63:0] 1533 and enable 1580. Grouping multiply-adder/subtracter 1501 comprises four 16 × 16 multiplier circuits: 16 × 16 multiplier A 1510, 16 × 16 multiplier B 1511, 16 × 16 multiplier C 1512 and 16 × 16 multiplier D 1513. 16 × 16 multiplier A 1510 has inputs Source1 [15:0] and Source2 [15:0]. 16 × 16 multiplier B 1511 has inputs Source1 [31:16] and Source2 [31:16]. 16 × 16 multiplier C 1512 has inputs Source1 [47:32] and Source2 [47:32]. 16 × 16 multiplier D 1513 has inputs Source1 [63:48] and Source2 [63:48]. The 32-bit intermediate results generated by 16 × 16 multiplier A 1510 and 16 × 16 multiplier B 1511 are received by virtual adder/subtracter 1550, while the 32-bit intermediate results generated by 16 × 16 multiplier C 1512 and 16 × 16 multiplier D 1513 are received by virtual adder/subtracter 1551.

Based on whether the current instruction is a multiply-add or a multiply-subtract instruction, virtual adder/subtracters 1550 and 1551 either add or subtract their respective 32-bit inputs. The output of virtual adder/subtracter 1550 (that is, bits 31 to 0 of Result) and the output of virtual adder/subtracter 1551 (that is, bits 63 to 32 of Result) are combined into the 64-bit Result and delivered to result register 1571.

In one embodiment, virtual adder/subtracters 1551 and 1550 are implemented in a manner similar to virtual adder/subtracters 1108b and 1108a (that is, each of virtual adder/subtracters 1551 and 1550 is composed of four 8-bit adders with the appropriate propagation delays). However, alternative embodiments may implement virtual adder/subtracters 1551 and 1550 in a variety of ways.

To perform the equivalent of these multiply-add or multiply-subtract instructions on prior art processors that operate on non-grouped data, four separate 64-bit multiply operations and two 64-bit add or subtract operations would be needed, together with the necessary load and store operations. This wastes the data lines and circuitry used for the bits above bit 16 of Source1 and Source2 and above bit 32 of Result. Furthermore, the entire 64-bit result generated by such a prior art processor may be of no use to the programmer. The programmer would therefore have to truncate each result.

Advantages of including the above multiply-add operation in the instruction set

The multiply-add/subtract instructions described above can be used for a number of purposes. For example, the multiply-add instruction can be used for multiplication of complex numbers and for multiplication and accumulation of values. Several algorithms that utilize the multiply-add instruction are described later.
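As one illustration of the complex multiplication use case, the product (a + bi)(c + di) = (ac - bd) + (ad + bc)i can be obtained from a single multiply-add by arranging the operands suitably. The Python sketch below is only one possible arrangement and is not taken from the specification; the helper names (`to_signed`, `pack_words`, `multiply_add`, `complex_multiply`), the operand layout and the assumption of signed 16-bit elements are all hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack_words(words):
    """Place four 16-bit values into one 64-bit value, element 0 in the low bits."""
    result = 0
    for index, word in enumerate(words):
        result |= (word & 0xFFFF) << (16 * index)
    return result


def multiply_add(src1, src2):
    """Model of the multiply-add data flow, assuming signed 16-bit elements."""
    products = [to_signed(src1 >> (16 * i), 16) * to_signed(src2 >> (16 * i), 16)
                for i in range(4)]
    low = (products[0] + products[1]) & 0xFFFFFFFF
    high = (products[2] + products[3]) & 0xFFFFFFFF
    return (high << 32) | low


def complex_multiply(a, b, c, d):
    """Return (real, imag) of (a + bi)(c + di) using one multiply-add."""
    src1 = pack_words([a, b, a, b])
    src2 = pack_words([c, -d, d, c])   # assumes -d still fits in 16 bits
    result = multiply_add(src1, src2)
    return to_signed(result, 32), to_signed(result >> 32, 32)
```

The low half of Result then holds a·c + b·(-d), the real part, and the high half holds a·d + b·c, the imaginary part.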

Thus, by including the multiply-add and/or multiply-subtract instructions described above in the instruction set supported by processor 109, many functions can be performed with fewer instructions than on prior art general-purpose processors that lack these instructions.

Grouping shift

Grouping shift operation

In one embodiment of the invention, the SRC1 register contains the data to be shifted (Source1), the SRC2 register contains the data representing the shift count (Source2), and the DEST register will contain the result of the shift (Result). That is, each data element in Source1 is independently shifted by the shift count. In one embodiment, Source2 is interpreted as an unsigned 64-bit scalar. In another embodiment, Source2 is integrated data and contains a shift count for each corresponding data element in Source1.

In one embodiment of the invention, arithmetic shifts and logical shifts are supported. An arithmetic shift shifts the bits of each data element down by the specified number and fills the high-order bits of each data element with the initial value of the sign bit. A shift count greater than 7 for blocked byte data, greater than 15 for grouping word data, or greater than 31 for grouping doubleword data causes each Result data element to be filled with the initial value of the sign bit. A logical shift can operate by shifting up or down. In a logical right shift, the high-order bits of each data element are filled with zeros. In a logical left shift, the low-order bits of each data element are filled with zeros.

In one embodiment of the invention, arithmetic right shift, logical right shift and logical left shift are supported for blocked bytes and grouping words. In another embodiment of the invention, these operations are also supported for grouping doublewords.

In step 1601, decoder 202 decodes the control signal 207 received by processor 109. Decoder 202 thus decodes: the operation code for the appropriate shift operation; the SRC1 602, SRC2 603 and DEST 605 addresses in registers 209; saturate/unsaturate (not necessarily needed for shift operations), signed/unsigned (again not necessarily needed), and the length of the data elements in the integrated data.

In step 1602, decoder 202 accesses, via internal bus 170, registers 209 in register file 150 given the SRC1 602 and SRC2 603 addresses. Registers 209 provide execution unit 130 with the integrated data (Source1) stored in the SRC1 602 register and the scalar shift count (Source2) stored in the SRC2 603 register. That is, registers 209 pass the integrated data to execution unit 130 via internal bus 170.

In step 1603, decoder 202 enables execution unit 130 to perform the appropriate grouping shift operation. Decoder 202 also transmits, via internal bus 170, the data element length, the type of shift operation and the shift direction (for logical shifts).

In step 1610, the length of the data elements determines which step is performed next. If the length of the data elements is 8 bits (byte data), execution unit 130 performs step 1612. However, if the length of the data elements in the integrated data is 16 bits (word data), execution unit 130 performs step 1614. In one embodiment, only grouping shifts of 8-bit and 16-bit data elements are supported. However, in another embodiment, grouping shifts of 32-bit data elements are also supported.

Assuming the length of the data elements is 8 bits, step 1612 is performed. In step 1612, the following operations are performed. Source1 bits 7 to 0 are shifted by the shift count (Source2 bits 63 to 0) to generate Result bits 7 to 0. Source1 bits 15 to 8 are shifted by the shift count to generate Result bits 15 to 8. Source1 bits 23 to 16 are shifted by the shift count to generate Result bits 23 to 16. Source1 bits 31 to 24 are shifted by the shift count to generate Result bits 31 to 24. Source1 bits 39 to 32 are shifted by the shift count to generate Result bits 39 to 32. Source1 bits 47 to 40 are shifted by the shift count to generate Result bits 47 to 40. Source1 bits 55 to 48 are shifted by the shift count to generate Result bits 55 to 48. Source1 bits 63 to 56 are shifted by the shift count to generate Result bits 63 to 56.

Assuming the length of the data elements is 16 bits, step 1614 is performed. In step 1614, the following operations are performed. Source1 bits 15 to 0 are shifted by the shift count to generate Result bits 15 to 0. Source1 bits 31 to 16 are shifted by the shift count to generate Result bits 31 to 16. Source1 bits 47 to 32 are shifted by the shift count to generate Result bits 47 to 32. Source1 bits 63 to 48 are shifted by the shift count to generate Result bits 63 to 48.

In one embodiment, the shifts of step 1612 are performed simultaneously. However, in another embodiment, these shifts are performed serially. In another embodiment, some of these shifts are performed simultaneously and some are performed serially. This discussion applies equally to the shifts of step 1614.
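The per-element shifts of steps 1612 and 1614 can be sketched as the following Python model, using a scalar shift count. The function name `packed_shift` and the `kind` labels are assumptions made for the example. The handling of large counts for arithmetic shifts follows the description above (the element is filled with the sign bit); returning zero for logical shifts whose count reaches the element width is an assumption of this sketch.

```python
def packed_shift(src1, count, elem_bits=8, kind="sra"):
    """Shift every element of a 64-bit value independently by the same count.

    Hypothetical model. kind: "sra" arithmetic right, "srl" logical right,
    "sll" logical left.
    """
    mask = (1 << elem_bits) - 1
    result = 0
    for pos in range(0, 64, elem_bits):
        elem = (src1 >> pos) & mask
        if kind == "sra":
            signed = elem - (1 << elem_bits) if elem >> (elem_bits - 1) else elem
            # Counts beyond elem_bits - 1 leave only copies of the sign bit.
            out = signed >> min(count, elem_bits - 1)
        elif kind == "srl":
            out = elem >> count if count < elem_bits else 0  # assumed behaviour
        elif kind == "sll":
            out = elem << count if count < elem_bits else 0  # assumed behaviour
        else:
            raise ValueError("unknown shift kind: " + kind)
        # Masking discards bits that would otherwise enter a neighbouring element.
        result |= (out & mask) << pos
    return result
```

Shifting 0x0180 right logically by one gives 0x0040 when treated as two bytes but 0x00C0 when treated as one word, which shows how the element length changes the outcome.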

In step 1620, Result is stored in DEST register.

Table 19 illustrates the register representation of a blocked byte arithmetic right shift operation. The bits of the first row are the integrated data representation of Source1. The bits of the second row are the data representation of Source2. The bits of the third row are the integrated data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 3 is 10000000₂.

Table 19

Table 20 illustrates the register representation of a grouping logical right shift operation on blocked byte data.

Table 20

Table 21 illustrates the register representation of a grouping logical left shift operation on blocked byte data.

Table 21

Integrated data shift circuit

In one embodiment, the shift operation can occur on multiple data elements in the same number of clock cycles as a single shift operation on non-grouped data. To achieve execution in the same number of clock cycles, parallelism is used. That is, registers are simultaneously instructed to perform the shift operation on the data elements. This is discussed in more detail below.

Figure 17 illustrates a circuit for performing a grouping shift on the individual bytes of integrated data according to one embodiment of the invention. Figure 17 illustrates the use of a modified byte slice shift circuit, byte slice stage i 1799. Each byte slice (except the most significant data element byte slice) includes a shift unit and a bit control. The most significant data element byte slice needs only a shift unit.

Shift unit i 1711 and shift unit i+1 1771 each allow 8 bits from Source1 to be shifted by the shift count. In one embodiment, each shift unit operates like a known 8-bit shift circuit. Each shift unit has one Source1 input, one Source2 input, one control input, one next stage signal, one previous stage signal and one result output. Therefore, shift unit i 1711 has a Source1_i 1731 input, a Source2 [63:0] 1733 input, a control i 1701 input, a next stage i 1713 signal, a previous stage i 1712 input and a result stored in result register i 1751. Likewise, shift unit i+1 1771 has a Source1_i+1 1732 input, a Source2 [63:0] 1733 input, a control i+1 1702 input, a next stage i+1 1773 signal, a previous stage i+1 1772 input and a result stored in result register i+1 1752.

The Source1 input is typically an 8-bit portion of Source1. The 8 bits represent the smallest type of data element, one blocked byte data element. The Source2 input represents the shift count. In one embodiment, each shift unit receives the same shift count from Source2 [63:0] 1733. Operation control 1700 transmits control signals to enable each shift unit to perform the required shift. The control signals are determined from the shift type (arithmetic/logical) and the shift direction. The next stage signal is received from the bit control for that shift unit. Depending on the direction of the shift (left/right), the shift unit shifts the most significant bit out/in on the next stage signal. Similarly, depending on the direction of the shift (right/left), each shift unit shifts the least significant bit out/in on the previous stage signal. The previous stage signal is received from the bit control unit of the previous stage. The result output represents the result of the shift operation on the portion of Source1 on which the shift unit operates.

Bit control i 1720 is enabled by operation control 1700 via integrated data enable i 1706. Bit control i 1720 controls next stage i 1713 and previous stage i+1 1772. For example, assume shift unit i 1711 is responsible for the 8 least significant bits of Source1 and shift unit i+1 1771 is responsible for the next 8 bits of Source1. If a shift on blocked bytes is performed, bit control i 1720 does not allow the least significant bit from shift unit i+1 1771 to be communicated to the most significant bit of shift unit i 1711. However, when a shift on grouping words is performed, bit control i 1720 allows the least significant bit from shift unit i+1 1771 to be communicated to the most significant bit of shift unit i 1711.

For example, in Table 22, a blocked byte arithmetic right shift is performed. Assume that shift unit i+1 1771 operates on data element 1 and shift unit i 1711 operates on data element 0. Shift unit i+1 1771 shifts its least significant bit out. However, operation control 1700 causes bit control i 1720 to stop the propagation of that bit, received from previous stage i+1 1721, to next stage i 1713. Instead, shift unit i 1711 fills the high-order bits with the sign bit, Source1 [7].

Table 22

However, if a grouping word arithmetic shift is performed, the least significant bit of shift unit i+1 1771 is passed to the most significant bit of shift unit i 1711. Table 23 illustrates this result. The same passing of bits is allowed for grouping doubleword shifts.
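The role of the bit control between neighbouring byte slices can be illustrated with a small behavioural model in Python. This is a sketch under stated assumptions, not a description of the circuit of Figure 17: the function name `sliced_shift_right` is hypothetical, only right shifts are modelled, and the shift count is restricted to 0 through 7 so that incoming bits come from the adjacent stage only.

```python
def sliced_shift_right(src1, count, elem_bytes=1, arithmetic=True):
    """Right shift built from eight 8-bit stages with a gate between neighbours.

    Hypothetical model. elem_bytes: 1 for bytes, 2 for words, 4 for doublewords.
    """
    if not 0 <= count <= 7:
        raise ValueError("this sketch models shift counts 0..7 only")
    stages = [(src1 >> (8 * i)) & 0xFF for i in range(8)]
    low_mask = (1 << count) - 1
    result = 0
    for i, byte in enumerate(stages):
        is_top_byte = (i % elem_bytes) == elem_bytes - 1
        if is_top_byte:
            # The bit control blocks the neighbouring stage: fill with the
            # sign bit (arithmetic) or with zeros (logical).
            sign = byte >> 7
            incoming = low_mask if (arithmetic and sign) else 0
        else:
            # The bit control passes the low bits shifted out of stage i+1.
            incoming = stages[i + 1] & low_mask
        out = (byte >> count) | ((incoming << (8 - count)) & 0xFF)
        result |= out << (8 * i)
    return result
```

With 0x0180 and a logical shift by one, byte mode blocks the bit leaving the upper byte and yields 0x0040, whereas word mode passes it into the lower byte and yields 0x00C0.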

Table 23

Each shift unit is optionally coupled to a result register. The result register temporarily stores the result of the shift operation until the complete result, Result [63:0] 1760, can be transferred to the DEST register.

A complete 64-bit grouping shift circuit uses 8 shift units and 7 bit control units. This circuit can also be used to perform a shift on 64-bit non-grouped data, so the same circuit performs both non-grouped and grouping shift operations.

Advantages of including the above shift operation in the instruction set

The grouping shift instruction described above causes each element of Source1 to be shifted by the specified shift count. By including this instruction in the instruction set, each element of the integrated data can be shifted using a single instruction. By contrast, prior art general-purpose processors that do not support this operation must execute many instructions to decompose Source1, shift each decomposed data element individually, and then assemble the results into an integrated data format for further grouped processing.

Transfer operation

The transfer operation transfers data to or from registers 209. In one embodiment, SRC2 603 is the address containing the source data and DEST 605 is the address to which the data is to be transferred. In this embodiment, SRC1 602 is not used. In another embodiment, SRC1 602 is equal to DEST 605.

For the purpose of explaining the transfer operation, a distinction is made between registers and memory locations. Registers are found in register file 150, while memory can be, for example, in cache memory 160, main memory 104, ROM 106 or data storage device 107.

The transfer operation can transfer data from memory to registers 209, from registers 209 to memory, and from one register in registers 209 to a second register in registers 209. In one embodiment, integrated data is stored in registers different from those that store integer data. In this embodiment, the transfer operation can transfer data from integer registers 201 to registers 209. For example, in processor 109, if integrated data is stored in registers 209 and integer data is stored in integer registers 201, a transfer instruction can be used to transfer data from integer registers 201 to registers 209, and vice versa.

In one embodiment, when a memory address is specified for the transfer, the 8 bytes of data at that memory location (the memory location containing the least significant byte) are loaded into a register in registers 209 or stored to the specified memory location from that register. When a register in registers 209 is specified, the contents of that register are transferred to or loaded from a second register in registers 209. If integer registers 201 are 64 bits long and an integer register is specified, the 8 bytes of data in that integer register are loaded into a register in registers 209 or stored to the specified integer register from that register.

In one embodiment, integers are represented as 32 bits. When a transfer operation from registers 209 to integer registers 201 is performed, only the low-order 32 bits of the integrated data are transferred to the specified integer register. In one embodiment, the high-order 32 bits are zeroed. Similarly, when a transfer from integer registers 201 to registers 209 is performed, only the low-order 32 bits of a register in registers 209 are loaded. In one embodiment, processor 109 supports 32-bit transfer operations between a register in registers 209 and memory. In another embodiment, a transfer of only 32 bits is performed on the high-order 32 bits of the integrated data.

Assembly operation

In one embodiment of the invention, the SRC1 602 register contains data (Source1), the SRC2 603 register contains data (Source2), and the DEST 605 register will contain the result data (Result) of the operation. That is, parts of Source1 and parts of Source2 are assembled together to generate Result.

In one embodiment, the assembly operation converts grouping words (or doublewords) into blocked bytes (or grouping words) by assembling the low-order bytes (or words) of the source grouping words (or doublewords) into the bytes (or words) of Result. In one embodiment, the assembly operation converts four grouping words into grouping doublewords. This operation can optionally be performed with signed data. In addition, this operation can optionally be performed with saturation. In an alternative embodiment, additional assembly operations that operate on the high-order portion of each data element are included.

In step 1801, the control signal 207 that demoder 202 decoding processor 109 receives.Thus demoder 202 decodes: the operational code of suitable assembly operation; SRC1 602 in register 209, SRC2 603 and DEST605 address; Saturated/unsaturated, tape symbol/without the data element length in symbol and integrated data.As mentioned above, SRC1 602 (or SRC2603) can be used as DEST605.

In step 1802, demoder 202 is by the register 209 of SRC1 602 given in internal bus 170 access function resister file 150 with SRC2 603 address.Register 209 provides the integrated data (Source1) be stored in SRC1 602 register and the integrated data (Source2) be stored in SRC2 603 register to performance element 130.Namely integrated data is passed to performance element 130 by internal bus 170 by register 209.

In step 1803, decoder 202 enables execution unit 130 to perform the appropriate assembly operation. Decoder 202 also communicates, over internal bus 170, the saturation setting and the length of the data elements in Source1 and Source2. Saturation can optionally be used to clamp the value of the data in a result data element. If the value of a data element in Source1 or Source2 is greater than or less than the range of values that a data element of Result can represent, the corresponding result data element is set to its highest or lowest value. For example, if a signed value in a word data element of Source1 or Source2 is less than 0x80 (or 0x8000 for double words), the result byte (or word) data element is clamped to 0x80 (or 0x8000 for double words). If a signed value in a word data element of Source1 or Source2 is greater than 0x7F (or 0x7FFF for double words), the result byte (or word) data element is clamped to 0x7F (or 0x7FFF).

In step 1810, which step is the length of data element will perform below determining.If the length of data element is 16 (grouping word 402 data), then performance element 130 performs step 1812.But if the length of the data element in integrated data is 32 (grouping double word 403 data), then performance element 130 performs step 1814.

Assuming the length of the source data elements is 16 bits, step 1812 is performed. In step 1812, the following operations are performed. Source1 bits 7 to 0 are Result bits 7 to 0. Source1 bits 23 to 16 are Result bits 15 to 8. Source1 bits 39 to 32 are Result bits 23 to 16. Source1 bits 55 to 48 are Result bits 31 to 24. Source2 bits 7 to 0 are Result bits 39 to 32. Source2 bits 23 to 16 are Result bits 47 to 40. Source2 bits 39 to 32 are Result bits 55 to 48. Source2 bits 55 to 48 are Result bits 63 to 56. If saturation is set, the high-order bits of each word are tested to determine whether the Result data element should be clamped.

Assuming the length of the source data elements is 32 bits, step 1814 is performed. In step 1814, the following operations are performed. Source1 bits 15 to 0 are Result bits 15 to 0. Source1 bits 47 to 32 are Result bits 31 to 16. Source2 bits 15 to 0 are Result bits 47 to 32. Source2 bits 47 to 32 are Result bits 63 to 48. If saturation is set, the high-order bits of each double word are tested to determine whether the Result data element should be clamped.

In one embodiment, perform the assembling of step 1812 simultaneously.But in another embodiment, serial performs this assembling.In another embodiment, some assembling be perform simultaneously and some be serial perform.This discussion is also applicable to the assembling of step 1814.

In step 1820, Result is stored in DEST605 register.
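The assembly (pack) steps above can be sketched in software as follows. This is only an illustrative model, not the circuit of the embodiment: the function names `pack` and `_to_signed` and the parameter `signed_saturate` are hypothetical, 64-bit registers are modeled as Python integers, and only signed saturation is shown.

```python
def _to_signed(value, bits):
    """Interpret an unsigned bit pattern of the given width as two's complement."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def pack(source1, source2, elem_bits=16, signed_saturate=False):
    """Assemble the elements of two 64-bit sources into half-width result elements.

    Source1 elements fill the low half of Result and Source2 elements the high half.
    Without saturation only the low-order half of each element is kept; with
    signed saturation the element is clamped to the signed range of the result.
    """
    half = elem_bits // 2
    per_source = 64 // elem_bits
    lowest, highest = -(1 << (half - 1)), (1 << (half - 1)) - 1
    result = 0
    for i in range(2 * per_source):
        src = source1 if i < per_source else source2
        elem = (src >> ((i % per_source) * elem_bits)) & ((1 << elem_bits) - 1)
        if signed_saturate:
            elem = min(max(_to_signed(elem, elem_bits), lowest), highest)
        result |= (elem & ((1 << half) - 1)) << (i * half)
    return result


# Words 1..4 in Source1 and 5..8 in Source2 become bytes 1..8 of Result.
print(hex(pack(0x0004_0003_0002_0001, 0x0008_0007_0006_0005)))
```

With `signed_saturate=True`, a source word such as 0x0100 is clamped to 0x7F and 0x8000 is clamped to 0x80, matching the clamping values given for step 1803.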

Table 24 illustrates an in-register representation of the assemble-word operation. The subscripts H and L denote the high-order and low-order bits, respectively, of each 16-bit data element in Source1 and Source2. For example, A_L denotes the low-order 8 bits of data element A in Source1.

Table 24

Table 25 illustrates an in-register representation of the assemble-double-word operation, where the subscripts H and L denote the high-order and low-order bits, respectively, of each 32-bit data element in Source1 and Source2.

Table 25

assembling circuit

In one embodiment of the invention, in order to reach the efficient execution of assembly operation, have employed concurrency.Figure 19 a and 19b illustrates the circuit performing assembly operation according to one embodiment of the present of invention in integrated data.This circuit can perform the saturated assembly operation of band selectively.

The circuit of Figure 19 a and 19b comprises operation control 1900, result register 1952, result register 1953,8 16 to 8 bit test saturated circuits and 4 32 to 16 bit test saturated circuits.

Operation control 1900 receives information from decoder 202 to enable the assembly operation. Operation control 1900 uses the saturation value to enable saturation testing in each test saturated circuit. If the source integrated data is word integrated data 503, operation control 1900 sets output enable 1931. This enables the output of result register 1952. If the source integrated data is double word integrated data 504, operation control 1900 sets output enable 1932. This enables the output of result register 1953.

Each test saturated circuit can be tested saturated selectively.If forbid saturation testing, then respectively test saturated circuit and only low-order bit is delivered on position corresponding in result register.If allow test saturated, then respectively test saturated circuit test high-order position and determine whether to answer clamp result.

Test saturated 1910 through test saturated 1917 each have a 16-bit input and an 8-bit output. The 8-bit output is the low 8 bits of the input or, optionally, a clamped value (0x80, 0x7F or 0xFF). Test saturated 1910 receives Source1 bits 15 to 0 and outputs bits 7 to 0 to result register 1952. Test saturated 1911 receives Source1 bits 31 to 16 and outputs bits 15 to 8 to result register 1952. Test saturated 1912 receives Source1 bits 47 to 32 and outputs bits 23 to 16 to result register 1952. Test saturated 1913 receives Source1 bits 63 to 48 and outputs bits 31 to 24 to result register 1952. Test saturated 1914 receives Source2 bits 15 to 0 and outputs bits 39 to 32 to result register 1952. Test saturated 1915 receives Source2 bits 31 to 16 and outputs bits 47 to 40 to result register 1952. Test saturated 1916 receives Source2 bits 47 to 32 and outputs bits 55 to 48 to result register 1952. Test saturated 1917 receives Source2 bits 63 to 48 and outputs bits 63 to 56 to result register 1952.

Test saturated 1920 through test saturated 1923 each have a 32-bit input and a 16-bit output. The 16-bit output is the low 16 bits of the input or, optionally, a clamped value (0x8000, 0x7FFF or 0xFFFF). Test saturated 1920 receives Source1 bits 31 to 0 and outputs bits 15 to 0 to result register 1953. Test saturated 1921 receives Source1 bits 63 to 32 and outputs bits 31 to 16 to result register 1953. Test saturated 1922 receives Source2 bits 31 to 0 and outputs bits 47 to 32 to result register 1953. Test saturated 1923 receives Source2 bits 63 to 32 and outputs bits 63 to 48 to result register 1953.

For example, in Table 26, an unsigned word assembly without saturation is performed. Operation control 1900 enables result register 1952 to output Result [63:0] 1960.

Table 26

However, if an unsigned double-word assembly without saturation is performed, operation control 1900 enables result register 1953 to output Result [63:0] 1960. Table 27 illustrates this result.

Table 27

advantages of including the above assembly operation in the instruction set

The above assembly instruction assembles a predetermined number of bits from each data element in Source1 and Source2 to generate Result. In this way, processor 109 can assemble data in as few as half the instructions required by prior-art general-purpose processors. For example, generating a result containing four 16-bit data elements from four 32-bit data elements requires only one instruction (as opposed to two instructions), as shown below:

Table 28

Typical multimedia application assembling mass data.Thus, by the number of instructions needed for these data of assembling is reduced to half, just improve the performance of these multimedia application.

Operation splitting

operation splitting

In one embodiment, the decomposition operation interleaves the low-order grouping bytes, words or double words of two source integrated data to generate result grouping bytes, words or double words. This operation is referred to here as a decompose-low operation. In another embodiment, the decomposition operation may also interleave the high-order elements (referred to as a decompose-high operation).

First step 2001 and 2002 is performed.In step 2003, demoder 202 starts performance element 130 and goes to perform operation splitting.Demoder 202 transmits the length of the data element in Source1 and Source2 by internal bus 170.

In step 2010, the length of the data elements determines which step is performed next. If the length of the data elements is 8 bits (grouping byte 401 data), execution unit 130 performs step 2012. If the length of the data elements in the integrated data is 16 bits (grouping word 402 data), execution unit 130 performs step 2014. If the length of the data elements in the integrated data is 32 bits (grouping double word 403 data), execution unit 130 performs step 2016.

Assuming the length of the source data elements is 8 bits, step 2012 is performed. In step 2012, the following operations are performed. Source1 bits 7 to 0 are Result bits 7 to 0. Source2 bits 7 to 0 are Result bits 15 to 8. Source1 bits 15 to 8 are Result bits 23 to 16. Source2 bits 15 to 8 are Result bits 31 to 24. Source1 bits 23 to 16 are Result bits 39 to 32. Source2 bits 23 to 16 are Result bits 47 to 40. Source1 bits 31 to 24 are Result bits 55 to 48. Source2 bits 31 to 24 are Result bits 63 to 56.

Assuming the length of the source data elements is 16 bits, step 2014 is performed. In step 2014, the following operations are performed. Source1 bits 15 to 0 are Result bits 15 to 0. Source2 bits 15 to 0 are Result bits 31 to 16. Source1 bits 31 to 16 are Result bits 47 to 32. Source2 bits 31 to 16 are Result bits 63 to 48.

Assuming the length of the source data elements is 32 bits, step 2016 is performed. In step 2016, the following operations are performed. Source1 bits 31 to 0 are Result bits 31 to 0. Source2 bits 31 to 0 are Result bits 63 to 32.

In one embodiment, perform the decomposition of step 2012 simultaneously.But in another embodiment, serial performs this decomposition.In another embodiment, some decomposition are some simultaneously execution is then that serial performs.This discussion is also applicable to the decomposition of step 2014 and step 2016.

In step 2020, Result is stored in DEST 605 register.
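The decompose-low (unpack) steps 2012, 2014 and 2016 all follow the same interleaving pattern, which can be sketched as below. The function name `unpack_low` is hypothetical, and 64-bit registers are modeled as Python integers; this is a behavioral illustration only.

```python
def unpack_low(source1, source2, elem_bits):
    """Interleave the elements in the low 32 bits of two 64-bit sources.

    Result element 2*i comes from Source1 element i, and result element
    2*i + 1 comes from Source2 element i, for elem_bits of 8, 16 or 32.
    """
    mask = (1 << elem_bits) - 1
    result = 0
    for i in range(32 // elem_bits):
        a = (source1 >> (i * elem_bits)) & mask
        b = (source2 >> (i * elem_bits)) & mask
        result |= a << (2 * i * elem_bits)
        result |= b << ((2 * i + 1) * elem_bits)
    return result


# Byte case (step 2012): bytes 10,11,12,13 and 20,21,22,23 are interleaved.
print(hex(unpack_low(0x13121110, 0x23222120, 8)))
```

The high 32 bits of each source are ignored here, which corresponds to the decompose-low variant described in the text.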

Table 29 illustrates an in-register representation of the decompose-double-word operation (each of data elements A_0-1 and B_0-1 contains 32 bits).

Table 29

Table 30 illustrates an in-register representation of the decompose-word operation (each of data elements A_0-3 and B_0-3 contains 16 bits).

Table 30

Table 31 illustrates an in-register representation of the decompose-byte operation (each of data elements A_0-7 and B_0-7 contains 8 bits).

Table 31

decomposition circuit

Figure 21 illustrates the circuit performing operation splitting according to one embodiment of the present of invention in integrated data.The circuit of Figure 21 comprises operation control circuit 2100, result register 2152, result register 2153 and result register 2154.

Operation control 2100 receives information from decoder 202 to enable the decomposition operation. If the source integrated data is byte integrated data 502, operation control 2100 sets output enable 2132. This enables the output of result register 2152. If the source integrated data is word integrated data 503, operation control 2100 sets output enable 2133. This enables the output of result register 2153. If the source integrated data is double word integrated data 504, operation control 2100 sets output enable 2134. This enables the output of result register 2154.

Result register 2152 has following input.Source1 position 7 to 0 is the position 7 to 0 of result register 2152.Source2 position 7 to 0 is the position 15 to 8 of result register 2152.Source1 position 15 to 8 is the position 23 to 16 of result register 2152.Source2 position 15 to 8 is the position 31 to 24 of result register 2152.Source1 position 23 to 16 is the position 39 to 32 of result register 2152.Source2 position 23 to 16 is the position 47 to 40 of result register 2152.Source1 position 31 to 24 is the position 55 to 48 of result register 2152.Source2 position 31 to 24 is the position 63 to 56 of result register 2152.

Result register 2153 has following input.Source1 position 15 to 0 is the position 15 to 0 of result register 2153.Source2 position 15 to 0 is the position 31 to 16 of result register 2153.Source1 position 31 to 16 is the position 47 to 32 of result register 2153.Source2 position 31 to 16 is the position 63 to 48 of result register 2153.

Result register 2154 has following input.Source1 position 31 to 0 is the position 31 to 0 of result register 2154.Source2 position 31 to 0 is the position 63 to 32 of result register 2154.

For example, in Table 32, a decompose-word operation is performed. Operation control 2100 enables result register 2153 to output Result [63:0] 2160.

Table 32

But decompose double word if performed, startup result register 2154 is exported Result [63:0] 2160 by operation control 2100.Table 33 illustrates this result.

Table 33

advantages of including the above decomposition instruction in the instruction set

By adding the above decomposition instruction to the instruction set, integrated data can be either interleaved or decomposed. By making all the data elements in Source2 zero, this decomposition instruction can be used to decompose integrated data. An example of decomposing bytes is shown in Table 34a.

Table 34a

The same decomposition instruction can be used to interleave data, as shown in Table 34b. Interleaving is useful in many multimedia algorithms. For example, interleaving can be used to transpose matrices and to interpolate pixels.

Table 34b

Thus add this disassembly instruction by the instruction set supported at processor 109, processor 109 is more general and can perform the algorithm needing this function in higher performance level.
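The two uses described above (decomposing with a zero Source2, and interleaving two sources) can be illustrated for the byte case with a short sketch. The function name `interleave_low_bytes` and the sample values are hypothetical and are not taken from Tables 34a and 34b.

```python
def interleave_low_bytes(source1, source2):
    """Interleave the four low bytes of Source1 with the four low bytes of Source2."""
    result = 0
    for i in range(4):
        result |= ((source1 >> (8 * i)) & 0xFF) << (16 * i)
        result |= ((source2 >> (8 * i)) & 0xFF) << (16 * i + 8)
    return result


# Decomposing: with Source2 all zero, each byte is widened to a 16-bit word.
widened = interleave_low_bytes(0x80FF0102, 0)
print(hex(widened))

# Interleaving: bytes from the two sources alternate in the result.
mixed = interleave_low_bytes(0x0D0C0B0A, 0x1D1C1B1A)
print(hex(mixed))
```

In the first call the zero bytes from Source2 land in the upper half of each result word, so every source byte is zero-extended to 16 bits.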

Number calculates

number calculates

One embodiment of the present invention allows a number count operation to be performed on integrated data. That is, the invention generates one result data element for each data element of the first integrated data. Each result data element represents the number of bits that are set in the corresponding data element of the first integrated data. In one embodiment, the total number of bits set to 1 is counted.

Table 35a illustrates an in-register representation of the number count operation on integrated data. The bits of the first row are the integrated data representation of the Source1 integrated data. The bits of the second row are the integrated data representation of the Result integrated data. The number below each data element bit is the data element number. For example, Source1 data element 0 is 1000111110001000₂. Therefore, if the data element length is 16 bits (word data) and the number count operation is performed, execution unit 130 generates the Result integrated data shown.

Table 35a

In another embodiment, the number count is performed on 8-bit data elements. Table 35b illustrates an in-register representation of the number count on integrated data having eight 8-bit integrated data elements.

Table 35b

In another embodiment, individual counting number performs in 32 bit data elements.The register of the individual counting number that table 35c illustrates in the integrated data with two 32 integrated data elements represents.

Table 35c

Individual counting number also can perform on 64 integer datas.That is, obtain set in 64 bit data and become the total bit of 1.The register of the individual counting number that table 35d illustrates on 64 integer datas represents.

Table 35d

perform the method for a counting number

Figure 22 is a flow diagram illustrating a method of performing the number count operation on integrated data according to one embodiment of the present invention. In step 2201, decoder 202 decodes a control signal 207 in response to receiving it. In one embodiment, control signal 207 is supplied via bus 101. In another embodiment, control signal 207 is supplied by cache memory 160. Decoder 202 thus decodes: the operation code for the number count, and the SRC1 602 and DEST 605 addresses in registers 209. Note that SRC2 603 is not used in the present embodiment of the invention. Saturated/unsaturated, signed/unsigned and the length of the data elements in the integrated data are also not used in this embodiment. In the present embodiment of the invention, only a data element length of 16 bits is supported. However, a person skilled in the art will understand that a number count can also be performed on integrated data having eight grouping byte data elements or two grouping double-word data elements.

In step 2202, demoder 202 is by providing the register 209 of SRC1 602 address in internal bus 170 access function resister file 150.Register 209 provides the integrated data Source1 in the register be stored on this address to performance element 130, namely integrated data is passed to performance element 130 by internal bus 170 by register 209.

In step 2130, demoder 202 starts performance element 130 and goes to perform number counting operation.In an alternative embodiment, demoder 202 also transmits integrated data length of element by internal bus 170.

In step 2205, assuming a data element length of 16 bits, execution unit 130 totals the number of set bits in bits 15 to 0 of Source1, producing bits 15 to 0 of the Result integrated data. In parallel with this total, execution unit 130 totals bits 31 to 16 of Source1, producing bits 31 to 16 of the Result integrated data. In parallel with the generation of these totals, execution unit 130 totals bits 47 to 32 of Source1, producing bits 47 to 32 of the Result integrated data. In parallel with the generation of these totals, execution unit 130 totals bits 63 to 48 of Source1, producing bits 63 to 48 of the Result integrated data.

In step 2206, demoder 202 starts the register with the DEST605 address of destination register in register 209.Thus, Result integrated data is stored in by the register of DEST605 addressing.
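The number count of step 2205 can be modeled in a few lines. This is a behavioral sketch only: the function name `packed_popcount` is hypothetical, and the loop computes serially what the embodiment computes in parallel for the four 16-bit elements.

```python
def packed_popcount(source1, elem_bits=16):
    """Count the set bits in each element of a 64-bit value.

    Each result element occupies the same bit positions as the source element
    it was computed from and holds that element's number of set bits.
    """
    mask = (1 << elem_bits) - 1
    result = 0
    for shift in range(0, 64, elem_bits):
        elem = (source1 >> shift) & mask
        result |= bin(elem).count("1") << shift
    return result


# Element 0 is 1000111110001000 in binary (0x8F88), which has 7 set bits.
print(hex(packed_popcount(0xFFFF_0000_0001_8F88)))
```

Passing `elem_bits=8` or `elem_bits=32` models the byte and double-word variants mentioned as alternatives.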

method of performing a number count on one data element

Figure 23 illustrates a method, according to one embodiment of the present invention, of performing the number count operation on one data element of an integrated data and generating a single result data element of the result integrated data. In step 2310a, a column sum CSum1a and a column carry CCarry1a are generated from Source1 bits 15, 14, 13 and 12. In step 2310b, a column sum CSum1b and a column carry CCarry1b are generated from Source1 bits 11, 10, 9 and 8. In step 2310c, a column sum CSum1c and a column carry CCarry1c are generated from Source1 bits 7, 6, 5 and 4. In step 2310d, a column sum CSum1d and a column carry CCarry1d are generated from Source1 bits 3, 2, 1 and 0. In one embodiment of the invention, steps 2310a-d are performed in parallel. In step 2320a, a column sum CSum2a and a column carry CCarry2a are generated from CSum1a, CCarry1a, CSum1b and CCarry1b. In step 2320b, a column sum CSum2b and a column carry CCarry2b are generated from CSum1c, CCarry1c, CSum1d and CCarry1d. In one embodiment of the invention, steps 2320a-b are performed in parallel. In step 2330, a column sum CSum3 and a column carry CCarry3 are generated from CSum2a, CCarry2a, CSum2b and CCarry2b. In step 2340, Result is generated from CSum3 and CCarry3. In one embodiment, Result is represented in 16 bits. In this embodiment, because only bits 4 to 0 are needed to represent the maximum number of set bits in Source1, bits 15 to 5 are set to 0. The maximum number of set bits in Source1 is 16. This occurs when Source1 equals 1111111111111111₂. Result is then 16, represented as 0000000000010000₂.

Thus, in order to calculate 4 result data elements for the number counting operation in 64 integrated datas, the step of Figure 23 to be performed for each data element in integrated data.In one embodiment, 4 16 result data elements are parallel computations.

perform the circuit of a counting number

Figure 24 illustrates the circuit performing number counting operation according to one embodiment of the present of invention in the integrated data with 4 digital data elements.Figure 25 illustrates the detailed circuit performing number counting operation according to one embodiment of the present of invention on a digital data element of integrated data.

Figure 24 illustrates a circuit in which Source1 bus 2401 carries the information signals to popcnt circuits 2408a-d through Source1 IN 2406a-d. Popcnt circuit 2408a thus totals the number of set bits in bits 15 to 0 of Source1, generating bits 15 to 0 of Result. Popcnt circuit 2408b totals the number of set bits in bits 31 to 16 of Source1, generating bits 31 to 16 of Result. Popcnt circuit 2408c totals the number of set bits in bits 47 to 32 of Source1, generating bits 47 to 32 of Result. Popcnt circuit 2408d totals the number of set bits in bits 63 to 48 of Source1, generating bits 63 to 48 of Result. Enables 2404a-d receive, through control 2403, control signals from operation control 2410 that enable popcnt circuits 2408a-d to perform the number count operation and place Result on Result bus 2409. Given the above description and the descriptions and illustrations of Figures 1-6b and 22-25, a person skilled in the art would be able to build this circuit.

Popcnt circuit 2408a-d exports 2407a-d by result and is delivered in Result bus 2409 by the object information of grouping number counting operation.Then this object information is stored in the integer registers specified by DEST 605 register address.

circuit for performing a number count on one data element

Figure 25 illustrates the detailed circuit performing number counting operation on a digital data element of integrated data.Particularly, Figure 25 illustrates a part of popcnt circuit 2408a.In order to reach the maximum performance of the application adopting number counting operation, this operation should be completed within a clock period.Therefore, assuming that access function resister and event memory need the certain percentage of clock period, the circuit of Figure 24 completes its operation within the time of a clock period about 80%.This circuit has the advantage allowing processor 109 to perform number counting operation in a clock period in four 16 bit data elements.

Popcnt circuit 2408a uses 4->2 carry save adders (unless otherwise specified, CSA refers to a 4->2 carry save adder). The 4->2 carry save adders that may be used in popcnt circuits 2408a-d are well known in the art. A 4->2 carry save adder is an adder that adds four operands to produce two sums. Because the number count operation in popcnt circuit 2408a involves 16 bits, the first level comprises four 4->2 carry save adders. These four 4->2 carry save adders transform the sixteen 1-bit operands into eight 2-bit sums. The second level transforms the eight 2-bit sums into four 3-bit sums, and the third level transforms the four 3-bit sums into two 4-bit sums. A 4-bit full adder then adds the two 4-bit sums to generate the final result.

Although have employed 4-> 2 carry save adder, 3-> 2 carry save adder in alternate embodiment, can be adopted.In addition, also several full adders can be used; But this configuration can not resemble the embodiment shown in Figure 25 and provide result so fast.

Source1 IN 15-0 2406a carries bits 15 to 0 of Source1. The first four bits are coupled to the inputs of a 4->2 carry save adder (CSA 2510a). The next four bits are coupled to the inputs of CSA 2510b. The next four bits are coupled to the inputs of CSA 2510c. The last four bits are coupled to the inputs of CSA 2510d. Each of CSA 2510a-d generates two 2-bit outputs. The two 2-bit outputs of CSA 2510a are coupled to two inputs of CSA 2520a. The two 2-bit outputs of CSA 2510b are coupled to the other two inputs of CSA 2520a. The two 2-bit outputs of CSA 2510c are coupled to two inputs of CSA 2520b. The two 2-bit outputs of CSA 2510d are coupled to the remaining two inputs of CSA 2520b. Each of CSA 2520a-b generates two 3-bit outputs. The two 3-bit outputs of CSA 2520a are coupled to two inputs of CSA 2530. The two 3-bit outputs of CSA 2520b are coupled to the remaining two inputs of CSA 2530. CSA 2530 generates two 4-bit outputs.

These two 4-bit outputs are coupled to the two inputs of a full adder (FA 2550). FA 2550 adds the two 4-bit inputs and transmits the sum of these two inputs as bits 3 to 0 of Result output 2407a. FA 2550 generates bit 4 of Result output 2407a through its carry output (CO 2552). In an alternative embodiment, a 5-bit full adder is used to generate bits 4 to 0 of Result output 2407a. In either case, bits 15 to 5 of Result output 2407a are fixed at 0. Likewise, any carry input to the full adder is fixed at 0.

Although not shown in Figure 25, a person skilled in the art will understand that Result output 2407a can be multiplexed or buffered onto Result bus 2409. The multiplexer is controlled by enable 2404a. This allows other execution unit circuits to write data onto Result bus 2409.
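The three-level carry save adder tree described above can be checked with a small software model. The sketch assumes that each 4->2 carry save adder is built from two 3->2 stages, which is one common construction and is not stated in the text; the function names are hypothetical.

```python
def csa_3_to_2(a, b, c):
    """3->2 carry save adder: returns (sum, carry) with sum + carry == a + b + c."""
    column_sum = a ^ b ^ c
    column_carry = ((a & b) | (a & c) | (b & c)) << 1
    return column_sum, column_carry


def csa_4_to_2(a, b, c, d):
    """4->2 carry save adder modeled as two chained 3->2 stages (an assumption)."""
    s1, c1 = csa_3_to_2(a, b, c)
    return csa_3_to_2(s1, c1, d)


def popcount16_csa(element):
    """Count the set bits of a 16-bit element with three CSA levels and a final add."""
    bits = [(element >> i) & 1 for i in range(15, -1, -1)]  # bit 15 first
    level1 = []
    for i in range(0, 16, 4):  # four CSAs: sixteen 1-bit inputs -> eight 2-bit values
        level1.extend(csa_4_to_2(*bits[i:i + 4]))
    level2 = []
    for i in range(0, 8, 4):   # two CSAs: eight 2-bit values -> four 3-bit values
        level2.extend(csa_4_to_2(*level1[i:i + 4]))
    csum3, ccarry3 = csa_4_to_2(*level2)  # one CSA: four 3-bit values -> two 4-bit values
    return csum3 + ccarry3     # final full adder; the result fits in bits 4 to 0


print(popcount16_csa(0b1000111110001000))
```

Because every stage preserves the total of its inputs, the final addition yields the number of set bits, and the largest possible result (16) needs the carry output as bit 4.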

advantages of including the above number count operation in the instruction set

The above number count instruction counts the number of set bits in each data element of integrated data such as Source1. Thus, by including this instruction in the instruction set, the number count operation can be performed on integrated data in a single instruction. By contrast, a prior-art general-purpose processor must execute many instructions to decompose Source1, perform the function individually on each decomposed data element, and then assemble the results for further processing as integrated data.

Thus, add this counting number instruction by the instruction set supported at processor 109, just improve the performance of the algorithm needing this function.

Logical operation

logical operation

In one embodiment of the invention, SRC1 register comprises integrated data (Source1), SRC2 register comprises integrated data (Source2), and DEST register will be included in the result (Result) of the logical operation of Source1 and Source2 above selected by execution.Such as, if having selected logic "and" operation, then by Source1 and Source2 logical "and".

In one embodiment of the invention, the following logical operations are supported: logical AND, logical AND NOT (ANDN), logical OR and logical exclusive-OR (XOR). The logical AND, OR and XOR operations are well known in the art. The logical AND NOT (ANDN) operation ANDs Source2 with the logical NOT of Source1. Although the invention is described in terms of these logical operations, other embodiments can implement other logical operations.

Figure 26 is the process flow diagram that the method performing several logical operation according to one embodiment of the present of invention in integrated data is described.

In step 2601, decoder 202 decodes the control signal 207 received by processor 109. Decoder 202 thus decodes: the operation code for the appropriate logical operation (i.e. AND, ANDN, OR or XOR); and the SRC1 602, SRC2 603 and DEST 605 addresses in registers 209.

In step 2602, demoder 202 is by providing the register 209 of SRC1 602 and SRC2 603 address in internal bus 170 access function resister file 150.Register 209 provides the integrated data (Source1) be stored in SRC1 602 register and the integrated data (Source2) be stored in SRC2 603 register to performance element 130.That is, integrated data is passed to performance element 130 by internal bus 170 by register 209.

In step 2603, demoder 202 starts performance element 130 and goes to perform the one of grouping selected in logical operation.

In step 2610, which step is the one of grouping selected in logical operation perform below determining.If have selected logic "and" operation, performance element 130 performs step 2612; If have selected logic NOT AND operation, performance element 130 performs step 2613; If have selected logical "or" computing, performance element 130 performs step 2614; And if have selected logical exclusive-OR computing, performance element 130 performs step 2615.

Assuming that have selected logic "and" operation, just perform step 2612.In step 2612, Source1 position 63 to 0 and Source2 position 63 to 0 are carried out AND operation and are generated Result position 63 to 0.

Assuming that have selected logic NOT AND operation, just perform step 2613.In step 2613, Source1 position 63 to 0 and Source2 position 63 to 0 are carried out NOT AND operation and are generated Result position 63 to 0.

Assuming that have selected logical "or" computing, just perform step 2614.In step 2614, Source1 position 63 to 0 and Source2 position 63 to 0 are carried out inclusive-OR operation and are generated Result position 63 to 0.

Assuming that have selected logical exclusive-OR computing, just perform step 2615.In step 2615, Source1 position 63 to 0 and Source2 position 63 to 0 are carried out nonequivalence operation and are generated Result position 63 to 0.

In step 2620, Result is stored in DEST register.
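The selection among steps 2612 to 2615 can be modeled as a simple dispatch over 64-bit values. The function name `packed_logical` and the string operation codes are hypothetical; ANDN is taken to mean Source2 ANDed with the logical NOT of Source1, as described for this embodiment.

```python
MASK64 = (1 << 64) - 1


def packed_logical(operation, source1, source2):
    """Apply the selected bitwise operation to bits 63 to 0 of both sources."""
    if operation == "AND":
        result = source1 & source2
    elif operation == "ANDN":
        result = ~source1 & source2  # (NOT Source1) AND Source2
    elif operation == "OR":
        result = source1 | source2
    elif operation == "XOR":
        result = source1 ^ source2
    else:
        raise ValueError(f"unsupported operation: {operation}")
    return result & MASK64  # keep the result within 64 bits


print(hex(packed_logical("ANDN", 0xFF00FF00FF00FF00, MASK64)))
```

Because the operations are bitwise, the element length of the integrated data does not affect the result, which is why no data element length is decoded for these operations.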

Table 36 illustrates an in-register representation of the logical ANDN operation on integrated data. The bits of the first row are the integrated data representation of Source1. The bits of the second row are the integrated data representation of Source2. The bits of the third row are the integrated data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 2 is 1111111100000000₂.

Table 36

Although the present invention describes relative to the corresponding data element in Source1 and Source2 performing same logical operation, alternate embodiment can support the instruction allowing to select the logical operation that will perform on the data element of correspondence on the basis of element one by one.

integrated data logical circuit

In one embodiment, in the clock period with the identical number of unity logic computing in non-packet data, above-mentioned logical operation can be there is on multiple data element.In order to reach the execution in the clock period of identical number, have employed concurrency.

Figure 27 illustrates the circuit according to one embodiment of the present of invention actuating logic computing in integrated data.Operation control 2700 controls the circuit of actuating logic computing.Operation control 2700 processing control signals also exports selection signal on control line 2780.These select signals to transmit the one selected in "AND", " non-with ", "or" and nonequivalence operation to logical operation circuit 2701.

Logical operation circuit 2701 receives Source1 [63:0] and Source2 [63:0] and holds logical operation that row selection signal specifies to generate Result.Result [63:0] is passed to result register 2731 by logical operation circuit 2701.

advantages of including the above logical operations in the instruction set

Above-mentioned logical order actuating logic "AND", logic " non-with ", logical "or" and logical exclusive-OR.These instructions are useful in any application needing the logical operation of data.Add these instructions by the instruction set supported at processor 109, just in an instruction, these logical operations can be performed in integrated data.

Grouped comparison

Grouped comparison operation

In one embodiment of the invention, the SRC1 602 register contains the data to be compared (Source1), the SRC2 603 register contains the data against which it is compared (Source2), and the DEST 605 register will contain the result of the comparison (Result). That is, each data element of Source1 is compared independently with the corresponding data element of Source2 according to the specified relation.

In one embodiment of the invention, the following comparisons are supported: equal; signed greater than; signed greater than or equal; unsigned greater than; and unsigned greater than or equal. The relation is tested on each pair of corresponding data elements. For example, Source1[7:0] is tested for being greater than Source2[7:0], and the outcome is placed in Result[7:0]. If the comparison satisfies the relation, then in one embodiment the corresponding data element in Result is set to all 1s. If the comparison does not satisfy the relation, the corresponding data element in Result is set to all 0s.

In step 2801, decoder 202 decodes control signal 207 received by processor 109. Decoder 202 thus decodes: the operation code for the appropriate compare operation; the SRC1 602, SRC2 603, and DEST 605 addresses in registers 209; saturate/unsaturate (not needed for the compare operation); signed/unsigned; and the length of the data elements in the integrated data. As mentioned above, SRC1 602 (or SRC2 603) can be used as DEST 605.

In step 2802, decoder 202 accesses, via internal bus 170, the registers 209 in register file 150 given by the SRC1 602 and SRC2 603 addresses. Registers 209 provide execution unit 130 with the integrated data stored in the SRC1 602 register (Source1) and the integrated data stored in the SRC2 603 register (Source2). That is, registers 209 pass the integrated data to execution unit 130 via internal bus 170.

In step 2803, decoder 202 enables execution unit 130 to perform the appropriate grouped comparison operation. Decoder 202 also communicates, via internal bus 170, the length of the data elements and the relation for the compare operation.

In step 2810, the length of the data elements determines which step is performed next. If the length of the data elements is 8 bits (blocked byte 401 data), execution unit 130 performs step 2812. However, if the length of the data elements in the integrated data is 16 bits (grouping word 402 data), execution unit 130 performs step 2814. In one embodiment, only grouped comparisons of 8-bit and 16-bit data element lengths are supported. In another embodiment, however, grouped comparisons of 32-bit data element length (grouping double word 403) are also supported.

Assuming the length of the data elements is 8 bits, step 2812 is performed. In step 2812, the following is performed. Source1 bits 7 to 0 are compared with Source2 bits 7 to 0 to generate Result bits 7 to 0. Source1 bits 15 to 8 are compared with Source2 bits 15 to 8 to generate Result bits 15 to 8. Source1 bits 23 to 16 are compared with Source2 bits 23 to 16 to generate Result bits 23 to 16. Source1 bits 31 to 24 are compared with Source2 bits 31 to 24 to generate Result bits 31 to 24. Source1 bits 39 to 32 are compared with Source2 bits 39 to 32 to generate Result bits 39 to 32. Source1 bits 47 to 40 are compared with Source2 bits 47 to 40 to generate Result bits 47 to 40. Source1 bits 55 to 48 are compared with Source2 bits 55 to 48 to generate Result bits 55 to 48. Source1 bits 63 to 56 are compared with Source2 bits 63 to 56 to generate Result bits 63 to 56.

Assuming the length of the data elements is 16 bits, step 2814 is performed. In step 2814, the following is performed. Source1 bits 15 to 0 are compared with Source2 bits 15 to 0 to generate Result bits 15 to 0. Source1 bits 31 to 16 are compared with Source2 bits 31 to 16 to generate Result bits 31 to 16. Source1 bits 47 to 32 are compared with Source2 bits 47 to 32 to generate Result bits 47 to 32. Source1 bits 63 to 48 are compared with Source2 bits 63 to 48 to generate Result bits 63 to 48.

In one embodiment, the comparisons of step 2812 are performed simultaneously. In another embodiment, however, these comparisons are performed serially. In yet another embodiment, some of the comparisons are performed simultaneously and some serially. This discussion also applies to the comparisons of step 2814.

In step 2820, Result is stored in the DEST 605 register.
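The element-wise comparison of steps 2812 and 2814 can be sketched in software. The Python model below is only an illustration under stated assumptions, not the circuit of the embodiment: the function name `grouped_compare`, its parameters, and the relation strings are hypothetical, and a 64-bit register is modeled as a Python integer.

```python
def grouped_compare(source1, source2, elem_bits=8, relation="gt", signed=True):
    """Compare corresponding elements of two 64-bit values and return a mask.

    Each result element is all 1s if the relation holds, otherwise all 0s.
    Names and the relation strings ("eq", "gt", "ge") are illustrative only.
    """
    if elem_bits not in (8, 16, 32):
        raise ValueError("element length must be 8, 16, or 32 bits")
    lane_mask = (1 << elem_bits) - 1
    sign_bit = 1 << (elem_bits - 1)
    result = 0
    for shift in range(0, 64, elem_bits):
        a = (source1 >> shift) & lane_mask
        b = (source2 >> shift) & lane_mask
        if signed:
            # Reinterpret the raw bits as two's-complement values.
            a = (a ^ sign_bit) - sign_bit
            b = (b ^ sign_bit) - sign_bit
        if relation == "eq":
            holds = a == b
        elif relation == "gt":
            holds = a > b
        elif relation == "ge":
            holds = a >= b
        else:
            raise ValueError("unknown relation")
        if holds:
            result |= lane_mask << shift  # element set to all 1s
    return result
```

For instance, comparing byte 0x80 with byte 0x01 yields a true element for the unsigned relation but a false one for the signed relation, because 0x80 is -128 when read as a signed byte.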

Table 37 illustrates the register representation of the grouped comparison unsigned greater-than operation. The first row of bits is the integrated data representation of Source1. The second row of bits is the integrated data representation of Source2. The third row of bits is the integrated data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 3 is 10000000₂.

Table 37

Table 38 illustrates the register representation of the grouped comparison signed greater-than-or-equal operation on blocked byte data.

Table 38

Integrated data comparator circuit

In one embodiment, the compare operation can occur on multiple data elements in the same number of clock cycles as a single compare operation on non-grouped data. Parallelism is used to achieve execution in the same number of clock cycles. That is, registers are simultaneously instructed to perform the compare operation on the data elements. This is discussed in more detail below.

Figure 29 illustrates a circuit for performing grouped comparison operations on the individual bytes of integrated data according to one embodiment of the invention. Figure 29 illustrates the use of a modified chunk comparator circuit, byte chunk stage i 2999. Each chunk, except the most significant data element chunk, includes a comparing unit and a position control. The most significant data element chunk needs only a comparing unit.

Comparing unit i 2911 and comparing unit i+1 2971 each allow 8 bits from Source1 to be compared with the corresponding 8 bits from Source2. In one embodiment, each comparing unit operates like a known 8-bit comparing unit. Such a known 8-bit comparator circuit includes a chunk circuit that allows Source2 to be subtracted from Source1. The result of the subtraction is processed to determine the result of the compare operation. In one embodiment, the subtraction result includes overflow information. This overflow information is tested to determine whether the result of the compare operation is true.

Each comparing unit has a Source1 input, a Source2 input, a control input, a next stage signal, an upper level signal, and a result output. Thus, comparing unit i 2911 has a Source1_i 2931 input, a Source2_i 2933 input, a control i 2901 input, a next stage i 2913 signal, an upper level i 2912 input, and a result stored in result register i 2951. Likewise, comparing unit i+1 2971 has a Source1_{i+1} 2932 input, a Source2_{i+1} 2934 input, a control i+1 2902 input, a next stage i+1 2973 signal, an upper level i+1 2972 input, and a result stored in result register i+1 2952.

The Source1_n input is typically an 8-bit portion of Source1. Eight bits represent the smallest type of data element, one blocked byte 401 data element. The Source2 input is the corresponding 8-bit portion of Source2. Operation control 2900 transmits control signals that enable each comparing unit to perform the required comparison. The control signals are determined from the relation of the comparison (for example, signed greater than) and the length of the data elements (for example, byte or word). The next stage signal is received from the position control of that comparing unit. When data elements larger than byte length are used, the position control modules effectively combine comparing units. For example, when grouping word data is compared, the position control module between the first and second comparing units causes these two comparing units to work as one 16-bit comparing unit. Similarly, the control module between the third and fourth comparing units causes these two comparing units to work as one comparing unit. This continues for the four grouping word data elements.

Depending on the required relation and the values of Source1 and Source2, the comparing units perform the comparison by allowing the result of the higher-order comparing unit to propagate down to the lower-order comparing unit, or vice versa. That is, each comparing unit provides the comparison result using the information communicated by position control i 2920. If double word integrated data is used, four comparing units work together to form one 32-bit comparing unit for each data element. The result output of each comparing unit represents the result of the compare operation on the portions of Source1 and Source2 on which that comparing unit operates.

Position control i 2920 is enabled from operation control 2900 via integrated data enable i 2906. Position control i 2920 controls next stage i 2913 and upper level i+1 2972. For example, assume comparing unit i 2911 is responsible for the 8 least significant bits of Source1 and Source2, and comparing unit i+1 2971 is responsible for the next 8 bits of Source1 and Source2. If the comparison is performed on blocked byte data, position control i 2920 does not allow result information from comparing unit i+1 2971 to be passed to comparing unit i 2911, and vice versa. However, if the comparison is performed on grouping words, position control i 2920 allows result (in one embodiment, overflow) information from comparing unit i 2911 to be passed to comparing unit i+1 2971, and result (in one embodiment, overflow) information from comparing unit i+1 2971 to be passed to comparing unit i 2911.

For example, in Table 39 a blocked byte signed greater-than comparison is performed. Assume that comparing unit i+1 2971 operates on data element 1 and comparing unit i 2911 operates on data element 0. Comparing unit i+1 2971 compares the most significant 8 bits of a word and transmits the result information via upper level i+1 2972. Comparing unit i 2911 compares the least significant 8 bits of the word and transmits the result information via next stage i 2913. However, operation control 2900 causes position control i 2920 to stop the propagation between the comparing units of the result information received from upper level i+1 2972 and next stage i 2913.

Table 39

However, if a grouping word signed greater-than comparison is performed, the result of comparing unit i+1 2971 is passed to comparing unit i 2911, and vice versa. Table 40 illustrates this result. The same type of passing is also allowed for grouping double word comparisons.

Table 40

Each comparing unit is optionally coupled to a result register. The result register temporarily stores the result of the compare operation until the complete result, Result[63:0] 2960, can be transferred to the DEST 605 register.

For a complete 64-bit grouped comparison circuit, 8 comparing units and 7 position control modules are used. This circuit can also be used to perform comparisons on 64-bit non-grouped data, so that the same circuit performs both non-grouped and grouped compare operations.
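The way position control joins 8-bit comparing units into wider comparators can be modeled behaviorally. The sketch below is an assumption-laden illustration, not the circuit of Figure 29: it handles only the unsigned greater-than relation, treats the information passed between units as a subtraction borrow, and uses hypothetical names (`slice_borrow`, `sliced_compare_gt_unsigned`).

```python
def slice_borrow(a8, b8, borrow_in):
    """One 8-bit comparing unit: borrow out of (b8 - a8 - borrow_in).

    A borrow of 1 means the Source1 bits seen so far exceed the Source2 bits.
    """
    return 1 if b8 - a8 - borrow_in < 0 else 0


def sliced_compare_gt_unsigned(source1, source2, elem_bytes):
    """Unsigned greater-than on 64-bit values built from eight byte slices.

    elem_bytes (1, 2, or 4) plays the role of the position controls: the
    borrow is passed between slices inside an element and blocked at
    element boundaries.
    """
    result = 0
    for base in range(0, 8, elem_bytes):
        borrow = 0  # position control blocks information from other elements
        for i in range(base, base + elem_bytes):
            a8 = (source1 >> (8 * i)) & 0xFF
            b8 = (source2 >> (8 * i)) & 0xFF
            borrow = slice_borrow(a8, b8, borrow)
        if borrow:  # Source1 element > Source2 element
            result |= ((1 << (8 * elem_bytes)) - 1) << (8 * base)
    return result
```

With Source1 = 0x0100 and Source2 = 0x00FF, word mode reports one true 16-bit element, while byte mode reports a true upper byte and a false lower byte, which mirrors the difference between passing and blocking result information.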

Advantages of including the above grouped comparison operation in the instruction set

The grouped comparison instruction described above stores the result of comparing Source1 and Source2 as a grouping mask. As mentioned above, conditional branches on data are unpredictable and, because they defeat branch prediction algorithms, waste processor performance. By generating a grouping mask, however, this comparison instruction reduces the number of data-dependent conditional branches required. For example, the function (if Y > A then X = X + B; else X = X) can be performed on integrated data as shown in Table 41 below (the values shown in Table 41 are in hexadecimal notation).

Table 41

As the above example shows, conditional branches are no longer needed. Because no branch instruction is required, a processor that speculatively predicts branches suffers no performance loss when this comparison instruction is used to perform this and similar operations. Thus, by providing this comparison instruction in the instruction set supported by processor 109, processor 109 can execute algorithms that require this function at a higher performance level.
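The mask-based form of (if Y > A then X = X + B; else X = X) can be sketched as follows. This is a minimal model that assumes four signed 16-bit elements per 64-bit value and a compare, AND, add sequence; the contents of Table 41 are not reproduced here, and all helper names are hypothetical.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1


def to_lanes(value):
    """Split a 64-bit value into four 16-bit elements (element 0 is lowest)."""
    return [(value >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def from_lanes(lanes):
    """Join four 16-bit elements into a 64-bit value, wrapping each element."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (LANE_BITS * i)
    return value


def as_signed(lane):
    """Read a raw 16-bit element as a two's-complement number."""
    return lane - (1 << LANE_BITS) if lane & (1 << (LANE_BITS - 1)) else lane


def compare_gt(y, a):
    """Signed greater-than: all 1s in each element where Y > A, else all 0s."""
    return from_lanes([LANE_MASK if as_signed(p) > as_signed(q) else 0
                       for p, q in zip(to_lanes(y), to_lanes(a))])


def add_words(x, b):
    """Element-wise addition with wraparound."""
    return from_lanes([p + q for p, q in zip(to_lanes(x), to_lanes(b))])


def conditional_add(x, y, a, b):
    mask = compare_gt(y, a)        # all 1s where Y > A
    return add_words(x, b & mask)  # adds B only in those elements, no branch
```

Elements where the relation fails receive B AND 0, that is 0, so X passes through unchanged without any data-dependent branch.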

Multimedia algorithms example

To illustrate the versatility of the disclosed instruction set, several multimedia algorithm examples are described below. In some cases, similar integrated data instructions could be used to perform certain steps in these algorithms. In the examples below, a number of steps that would require general-purpose processor instructions to manage data movement, looping, and conditional branching have been omitted.

1) Complex multiplication

The disclosed multiply-add instruction can be used to multiply two complex numbers in a single instruction, as shown in Table 42a. The multiplication of two complex numbers (for example, r1 i1 and r2 i2, each given as a real part and an imaginary part) is performed according to the following equations:

Real part=r1r2-i1i2

Imaginary part=r1i2+r2i1

If this instruction is implemented to complete in one clock cycle, the invention can multiply two complex numbers in one clock cycle.
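The two equations above map onto a single multiply-add once the operands are arranged suitably. The sketch below assumes a multiply-add that multiplies four pairs of signed 16-bit elements and sums adjacent products into two 32-bit results; the operand layout is an illustrative assumption, since Table 42a is not reproduced here, and the function names are hypothetical.

```python
def multiply_add(source1, source2):
    """Multiply four pairs of signed 16-bit values and add adjacent products.

    Returns two sums: [s1[0]*s2[0] + s1[1]*s2[1], s1[2]*s2[2] + s1[3]*s2[3]].
    """
    assert len(source1) == len(source2) == 4
    for value in list(source1) + list(source2):
        assert -32768 <= value <= 32767, "elements must fit in 16 signed bits"
    products = [a * b for a, b in zip(source1, source2)]
    return [products[0] + products[1], products[2] + products[3]]


def complex_multiply(r1, i1, r2, i2):
    """(r1 + i1*j) * (r2 + i2*j) using one multiply-add.

    Assumed layout: Source1 = [r1, i1, r1, i1], Source2 = [r2, -i2, i2, r2].
    """
    real, imag = multiply_add([r1, i1, r1, i1], [r2, -i2, i2, r2])
    return real, imag
```

Negating i2 ahead of time is what turns the first sum into r1r2 - i1i2; the second sum is r1i2 + r2i1 directly.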

Table 42a

As another example, Table 42b illustrates the instructions used to multiply three complex numbers together.

Table 42b

2) Multiply-accumulate operation

The disclosed instructions can also be used to multiply and accumulate values. For example, two sets of four data elements (A_{1-4} and B_{1-4}) can be multiplied and accumulated as shown in Table 43 below. In one embodiment, each instruction shown in Table 43 is implemented to complete in one clock cycle.

Table 43

If the number of data elements in each set exceeds 8 and is a multiple of 4, the multiplication and accumulation of these sets requires fewer instructions when performed as shown in Table 44 below.

Table 44

As another example, Table 45 illustrates the separate multiplication and accumulation of sets A and B and of sets C and D, where each of these sets contains two data elements.

Table 45

As another example, Table 46 illustrates the separate multiplication and accumulation of sets A and B and of sets C and D, where each of these sets contains 4 data elements.

Table 46

3) Dot product algorithm

The dot product (also called the inner product) is used in signal processing and matrix operations. For example, the dot product is used when computing the product of matrices, in digital filtering operations (such as FIR and IIR filtering), and when computing correlation sequences. Because many speech compression algorithms (such as GSM, G.728, CELP, and VSELP) and high-fidelity compression algorithms (such as MPEG and subband coding) make extensive use of digital filtering and correlation computations, improving the performance of the dot product amounts to improving the performance of these algorithms.

The dot product of two sequences A and B of length N is defined as:

Result = \sum_{i=0}^{N-1} A_i \cdot B_i

Performing a dot product calculation makes extensive use of the multiply-accumulate operation, in which corresponding elements of each sequence are multiplied and the products are accumulated to form the dot product result.

By including move, grouping addition, multiply-add, and grouping shift operations, the invention allows the dot product calculation to be performed using integrated data. For example, if the integrated data type containing four 16-bit elements is used, the dot product calculation can be performed on two sequences each containing four values with the following operations:

1) use a move instruction to fetch four 16-bit values from the A sequence to generate Source1;

2) use a move instruction to fetch four 16-bit values from the B sequence to generate Source2; and

3) use the multiply-add, grouping addition, and shift instructions to multiply and accumulate as described above.

For vectors with more than a few elements, the method shown in Table 46 is used and the final results are added together at the end. Other supporting instructions include the grouping OR and XOR instructions for initializing the accumulator register, and the grouping shift instruction for shifting out unwanted values in the final stage of the calculation. Loop control operations are accomplished using instructions already present in the instruction set of processor 109.
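The dot product procedure can be sketched as a loop over groups of four elements. The model below is illustrative only: it assumes the multiply-add produces two 32-bit partial sums per group, keeps two running sums in an accumulator, and adds them together at the end; the names `multiply_add` and `dot_product` are hypothetical.

```python
def multiply_add(source1, source2):
    """Multiply four pairs of signed 16-bit values and add adjacent products."""
    assert len(source1) == len(source2) == 4
    products = [a * b for a, b in zip(source1, source2)]
    return [products[0] + products[1], products[2] + products[3]]


def dot_product(seq_a, seq_b):
    """Dot product of two equal-length sequences, four elements per step."""
    assert len(seq_a) == len(seq_b) and len(seq_a) % 4 == 0
    accumulator = [0, 0]  # cleared first, e.g. by XORing a register with itself
    for k in range(0, len(seq_a), 4):
        source1 = seq_a[k:k + 4]  # move four values of A
        source2 = seq_b[k:k + 4]  # move four values of B
        partial = multiply_add(source1, source2)
        # Element-wise addition of the two partial sums into the accumulator.
        accumulator = [accumulator[0] + partial[0], accumulator[1] + partial[1]]
    # Final stage: combine the two halves of the accumulator.
    return accumulator[0] + accumulator[1]
```

Only the last line reduces across elements; every other step works on whole groups, which is where the savings over element-at-a-time code come from.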

4) Two-dimensional loop filter

Two-dimensional loop filters are used in some multimedia algorithms. For example, the filter coefficients shown in Table 47 below can be used in video conferencing algorithms to perform low-pass filtering on pixel data.

\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}

Table 47

To compute the new value of the pixel at position (x, y), the following equation is used:

Result pixel = (x-1, y-1) + 2(x, y-1) + (x+1, y-1) + 2(x-1, y) + 4(x, y) + 2(x+1, y) + (x-1, y+1) + 2(x, y+1) + (x+1, y+1)

By including assembling, decomposition, move, grouping shift, and grouping addition instructions, the invention allows the two-dimensional loop filter to be performed using integrated data. According to one implementation of the above loop filter, the filter is applied as two simple one-dimensional filters; that is, the above two-dimensional filter can be applied as two 1-2-1 filters. The first filter is applied in the horizontal direction, and the second filter is then applied in the vertical direction.

Table 48 illustrates a representation of an 8 × 8 block of pixel data.
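That the 3 × 3 kernel separates into a horizontal and a vertical 1-2-1 filter can be checked numerically. The sketch below works on plain Python lists for interior pixels only; the function names are hypothetical and the image is indexed as image[y][x].

```python
KERNEL_3X3 = [[1, 2, 1],
              [2, 4, 2],
              [1, 2, 1]]


def direct_3x3(image, x, y):
    """Weighted sum of the 3 x 3 neighborhood of pixel (x, y)."""
    return sum(KERNEL_3X3[dy + 1][dx + 1] * image[y + dy][x + dx]
               for dy in (-1, 0, 1) for dx in (-1, 0, 1))


def horizontal_121(image, x, y):
    """1-2-1 filter along one row, centered on pixel (x, y)."""
    return image[y][x - 1] + 2 * image[y][x] + image[y][x + 1]


def separable_121(image, x, y):
    """Horizontal 1-2-1 pass on three rows, then a vertical 1-2-1 pass."""
    rows = [horizontal_121(image, x, y + dy) for dy in (-1, 0, 1)]
    return rows[0] + 2 * rows[1] + rows[2]
```

Both functions return the unnormalized weighted sum. The weights total 16, so a right shift by 4 bits recovers a pixel-range value.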

Table 48

The following steps are performed to implement the horizontal pass of the filter on this 8 × 8 block of pixel data:

1) use a move instruction to access eight 8-bit pixel values as integrated data;

2) decompose the eight 8-bit pixels into 16-bit integrated data containing four 8-bit pixels (Source1) to maintain precision during accumulation;

3) copy Source1 twice to generate Source2 and Source3;

4) perform a non-grouped right shift of Source1 by 16 bits;

5) perform a non-grouped left shift of Source3 by 16 bits;

6) generate (Source1 + 2*Source2 + Source3) by performing the following grouping additions:

a) Source1 = Source1 + Source2;

b) Source1 = Source1 + Source2;

c) Source1 = Source1 + Source3;

7) store the resulting grouping word data as part of an 8 × 8 intermediate result array; and

8) repeat these steps until the entire 8 × 8 intermediate result array shown in Table 49 below has been generated (for example, IA₀ represents the intermediate result for A₀ from Table 49).

Table 49
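Steps 2 through 6 of the horizontal pass can be sketched on one group of four pixels. The model below assumes a 64-bit register holding four 16-bit elements with element 0 in the lowest bits, and it assumes zeros are shifted in at the ends of the register; handling of neighbors that lie in an adjacent group is not modeled. All names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def words_to_register(words):
    """Place four values into the 16-bit elements of a 64-bit register."""
    register = 0
    for i, word in enumerate(words):
        register |= (word & 0xFFFF) << (16 * i)
    return register


def register_to_words(register):
    """Read the four 16-bit elements of a 64-bit register."""
    return [(register >> (16 * i)) & 0xFFFF for i in range(4)]


def add_words(a, b):
    """Element-wise 16-bit addition with wraparound."""
    return words_to_register(
        [x + y for x, y in zip(register_to_words(a), register_to_words(b))])


def horizontal_pass(pixels):
    """Compute Source1 + 2*Source2 + Source3 for four 8-bit pixels."""
    source2 = words_to_register(pixels)     # decomposed pixels, one per word
    source1 = source2 >> 16                 # step 4: right neighbor moves in
    source3 = (source2 << 16) & MASK64      # step 5: left neighbor moves in
    source1 = add_words(source1, source2)   # step 6a
    source1 = add_words(source1, source2)   # step 6b
    source1 = add_words(source1, source3)   # step 6c
    return register_to_words(source1)
```

After the two whole-register shifts, each element of Source1 holds the right-hand neighbor and each element of Source3 holds the left-hand neighbor, so three element-wise additions give the 1-2-1 weighted sum for all four pixels at once.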

The following steps are performed to implement the vertical pass of the filter on the 8 × 8 intermediate result array:

1) use move instructions to access a 4 × 4 block of data from the intermediate result array as integrated data to generate Source1, Source2, and Source3 (see Table 50 for an example);

Table 50

2) generate (Source1 + 2*Source2 + Source3) by performing the following grouping additions:

a) Source1 = Source1 + Source2;

b) Source1 = Source1 + Source2;

c) Source1 = Source1 + Source3;

3) perform a grouping right shift of the resulting Source1 by 4 bits to generate the sum of the weighted values, which in effect divides by 16;

4) assemble the resulting Source1 with saturation to convert the 16-bit values back into 8-bit pixel values;

5) store the resulting blocked byte data as part of an 8 × 8 result array (for the example shown in Table 50, these four bytes represent the new pixel values of B0, B1, B2, and B3); and

6) repeat these steps until the entire 8 × 8 result array has been generated.

It is worth noting that the top and bottom rows of the 8 × 8 result array are determined with a different algorithm, which is not described here so as not to obscure the invention.

Thus, by providing assembling, decomposition, move, grouping shift, and grouping addition instructions on processor 109, the invention achieves significantly higher performance than prior-art general-purpose processors, which must perform the operations required by these filters one data element at a time.

5) Motion estimation

Motion estimation is used in several multimedia applications (for example, video conferencing and MPEG (high quality television broadcasting)). For video conferencing, motion estimation is used to reduce the amount of data that must be transmitted between terminals. Motion estimation works by dividing video frames into video blocks of fixed size. For each block in frame 1, it is determined whether frame 2 has a block containing a similar image. If frame 2 contains such a block, that block can be described with a motion vector referencing frame 1. Thus, rather than transmitting all of the data representing the block, only a motion vector needs to be transmitted to the receiving terminal. For example, if a block in frame 1 is similar to a block in frame 2 at the same screen coordinates, only a motion vector of 0 needs to be sent for that block. However, if a block in frame 1 is similar to a block in frame 2 at different screen coordinates, only a motion vector indicating the new position of the block needs to be sent. According to one implementation, to determine whether block A in frame 1 is similar to block B in frame 2, the sum of the absolute differences between the pixel values is determined. The lower the sum, the more similar block A is to block B (that is, if the sum is 0, block A equals block B).

By including move, decomposition, grouping addition, grouping subtraction with saturation, and logical operations, the invention allows motion estimation to be performed using integrated data. For example, if two 16 × 16 video blocks are represented as two arrays of 8-bit pixel values stored as integrated data, the absolute differences of the pixel values in these two blocks can be calculated with the following steps:

1) use a move instruction to fetch eight 8-bit values from block A to generate Source1;

2) use a move instruction to fetch eight 8-bit values from block B to generate Source2;

3) perform a grouping subtraction with saturation that subtracts Source1 from Source2 to generate Source3; because the subtraction saturates, Source3 contains only the positive results of this subtraction (that is, negative results become 0);

4) perform a grouping subtraction with saturation that subtracts Source2 from Source1 to generate Source4; because the subtraction saturates, Source4 contains only the positive results of this subtraction (that is, negative results become 0);

5) perform a grouping OR operation on Source3 and Source4 to generate Source5; as a result of this OR operation, Source5 contains the absolute values of the differences between Source1 and Source2;

6) repeat these steps until the 16 × 16 blocks have been processed.

The resulting 8-bit absolute values are decomposed into 16-bit data elements to allow 16-bit precision, and are then summed using grouping addition.
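The absolute-difference steps above can be sketched with lists of unsigned 8-bit pixels. This is a minimal model under stated assumptions: elements are processed as Python lists rather than 64-bit registers, the widening to 16 bits is represented by ordinary integer addition, and the function names are hypothetical.

```python
def subtract_saturate_unsigned(minuend, subtrahend):
    """Element-wise unsigned subtraction; negative results become 0."""
    return [max(m - s, 0) for m, s in zip(minuend, subtrahend)]


def absolute_differences(source1, source2):
    """|Source1 - Source2| per element, using two saturating subtractions."""
    source3 = subtract_saturate_unsigned(source2, source1)  # step 3
    source4 = subtract_saturate_unsigned(source1, source2)  # step 4
    # Step 5: in every element at least one of the two operands is 0,
    # so the OR simply selects the nonzero difference.
    return [p | q for p, q in zip(source3, source4)]


def sum_of_absolute_differences(block_a, block_b):
    """Sum |a - b| over two flat pixel lists, eight pixels per step."""
    assert len(block_a) == len(block_b) and len(block_a) % 8 == 0
    total = 0
    for k in range(0, len(block_a), 8):
        total += sum(absolute_differences(block_a[k:k + 8], block_b[k:k + 8]))
    return total
```

The OR works as a selector because the two saturating subtractions can never both be nonzero in the same element.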

Thus, by providing move, decomposition, grouping addition, grouping subtraction with saturation, and logical operations on processor 109, the invention offers a significant performance improvement over prior-art general-purpose processors, which must perform the additions and absolute differences of the motion estimation calculation one data element at a time.

6) Discrete cosine transform

The discrete cosine transform (DCT) is a well-known function used in many signal processing algorithms. Video and image compression algorithms in particular make extensive use of this transform.

In image and video compression algorithms, the DCT is used to transform a block of pixels from a spatial representation into a frequency representation. In the frequency representation, the image information is divided into frequency components, some of which are more important than others. The compression algorithm selectively quantizes or discards the frequency components that do not adversely affect the reconstructed image content. Compression is achieved in this way.

There are many implementations of the DCT, the most popular being a fast transform method modeled on the Fast Fourier Transform (FFT) computation flow. In the fast transform, a transform of order N is decomposed into a combination of transforms of order N/2, and the results are recombined. This decomposition can be carried out until the smallest, order-2 transform is reached. This elementary order-2 transform kernel is usually called the butterfly operation. The butterfly operation is expressed as follows:

X = a*x + b*y

Y = c*x - d*y

where a, b, c, and d are called the coefficients, x and y are the input data, and X and Y are the transform outputs.

By including move, multiply-add, and grouping shift instructions, the invention allows the DCT calculation to be performed using integrated data in the following manner:

1) use move and decomposition instructions to fetch the two 16-bit values representing x and y to generate Source1 (see Table 51 below);

2) generate Source2 as shown in Table 51 below; note that Source2 can be reused over several butterfly operations; and

3) perform a multiply-add instruction using Source1 and Source2 to generate Result (see Table 51 below).

Table 51

In some cases, the coefficients of the butterfly operation are 1. In these cases, the butterfly operation degenerates into just additions and subtractions, which can be performed using the grouping addition and grouping subtraction instructions.
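The butterfly operation fits one multiply-add when x and y are replicated in Source1 and the coefficients are arranged in Source2. The sketch below assumes the layout Source1 = [x, y, x, y] and Source2 = [a, b, c, -d]; Table 51 is not reproduced here, so this layout and the function names are illustrative assumptions.

```python
def multiply_add(source1, source2):
    """Multiply four pairs of signed 16-bit values and add adjacent products."""
    assert len(source1) == len(source2) == 4
    for value in list(source1) + list(source2):
        assert -32768 <= value <= 32767, "elements must fit in 16 signed bits"
    products = [p * q for p, q in zip(source1, source2)]
    return [products[0] + products[1], products[2] + products[3]]


def butterfly(x, y, a, b, c, d):
    """Return (X, Y) with X = a*x + b*y and Y = c*x - d*y."""
    source1 = [x, y, x, y]    # inputs replicated by move and decomposition
    source2 = [a, b, c, -d]   # coefficient vector, reusable across butterflies
    big_x, big_y = multiply_add(source1, source2)
    return big_x, big_y
```

Because 16-bit inputs produce full 32-bit sums, no precision is lost inside the butterfly itself. When all coefficients are 1 the same result follows from one addition and one subtraction.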

An IEEE document defines the accuracy with which the inverse DCT must be performed for video conferencing. (See IEEE Circuits and Systems Society, "IEEE Standard Specifications for the Implementations of 8 × 8 Inverse Discrete Cosine Transform," IEEE Std. 1180-1990, IEEE Inc., 345 East 47th St., NY, NY 10017, USA, March 18, 1991.) The disclosed multiply-add instruction meets the required accuracy because it uses 16-bit inputs to generate 32-bit outputs.

Thus, by providing move, multiply-add, and grouping shift operations on processor 109, the invention offers a significant performance improvement over prior-art general-purpose processors, which must perform the additions and multiplications of the DCT calculation one data element at a time.

Alternate embodiment

Although the invention has been described with separate circuitry for each of the different operations, alternative embodiments can be implemented in which different operations share some circuitry. For example, the following circuitry is used in one embodiment: 1) a single arithmetic logic unit (ALU) performs grouping addition, grouping subtraction, grouped comparison, and grouping logical operations; 2) a circuit unit performs the assembling, decomposition, and grouping shift operations; 3) a circuit unit performs the grouping multiplication and multiply-add operations; and 4) a circuit unit performs the population count operation.

The terms "corresponding" and "respective" are used herein to refer to a predetermined relationship between data elements stored in two or more integrated data. In one embodiment, this relationship is based on the bit positions of the data elements within the integrated data. For example, data element 0 of a first integrated data (for example, stored in bit positions 0-7 in blocked byte format) corresponds to data element 0 of a second integrated data (for example, stored in bit positions 0-7 in blocked byte format). However, this relationship can differ in alternative embodiments. For example, corresponding data elements in the first and second integrated data may have different lengths. As another example, rather than the least significant data element of the first integrated data corresponding to the least significant data element of the second integrated data (and so on), the data elements in the first and second integrated data may correspond to each other in some other order. As another example, rather than having a one-to-one correspondence, the data elements in the first and second integrated data may correspond at a different ratio (for example, one or more data elements of the first integrated data may correspond to two or more different data elements in the second integrated data).

Although the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The method and apparatus of the invention can be practiced with modification and alteration within the spirit and scope of the appended claims. This description is therefore to be regarded as illustrative rather than as limiting the invention.

Claims

1. A computing method, comprising:

receiving a first instruction of a grouping instruction set, the grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, the grouping instruction set comprising at least a grouping addition instruction, a grouping multiplication instruction, a grouping shift instruction, a grouped comparison instruction, and a grouping logical operation instruction, the first instruction comprising: an opcode field; a first field to indicate a first operand, the first operand having a first plurality of data elements including a first operand first data element and a first operand second data element; and a second field to indicate a second operand, the second operand having a second plurality of data elements including a second operand first data element and a second operand second data element; each of the first operand first data element, the first operand second data element, the second operand first data element, and the second operand second data element having a length of N/2 bits; and

in response to the first instruction, storing a first result data element having a length of N bits, the first result data element comprising the first operand first data element and the second operand first data element.

2. the process of claim 1 wherein, described first instruction is disassembly instruction and wherein said first result data element is the first decomposition data element.

3. the method for claim 2, comprises further: utilize the first decomposition data element to rewrite more than first data element of first operand.

4. the method for claim 2, the wherein low order data element of first operand first data element to be the low order data element of first operand and second operand first data element be second operand, and the opcode field of disassembly instruction comprises one of opcode set to specify the operation splitting of the staggered low order byte elements from more than first and second data elements, Character table or double word element.

5. the method for claim 4, wherein stores the first decomposition data element and comprises:

Store described first operand first data element and described second operand first data element is stored in adjacent storage unit together with described first operand first data element.

6. the method for claim 2, wherein said first operand first data element is the high level data element of first operand, second operand first data element is the high level data element of second operand, and the opcode field of disassembly instruction comprises one of opcode set to specify the operation splitting of the staggered high-order byte element from more than first and second data elements, Character table or double word element.

7. the method for claim 2, comprises further:

Have the second decomposition data element of N bit length in response to described disassembly instruction storage, described second decomposition data element comprises first operand second data element, and the second operation book second data element.

8. the method for claim 7, the data element wherein storing the first decomposition data element and storage second decomposition comprises:

Described first operand first data element is placed in the first storage unit;

Be placed in the second storage unit by described second operand first data element, described second storage unit is adjacent with described first storage unit;

Be placed in the 3rd storage unit by described first operand second data element, described 3rd storage unit is adjacent with described second storage unit;

Be placed on by second operand second data element in the 4th storage unit, described 4th storage unit is adjacent with described 3rd storage unit.

9. the method for claim 8, wherein said first storage unit, described second storage unit, described 3rd storage unit, and described 4th storage unit is the part in the destination operand of described disassembly instruction instruction.

10. the method for claim 9, wherein said destination operand is represented by the first field of disassembly instruction.

11. The method of claim 10, further comprising:

overwriting at least part of the first operand with said destination operand.

12. The method of claim 11, wherein the first field consists of bits 3-5 of the disassembly instruction.

13. The method of claim 9, wherein a second field comprises bits 0-2 of the third byte of the disassembly instruction.

14. The method of claim 13, wherein the destination operand is indicated by the second field of the disassembly instruction.

15. The method of claim 13, further comprising:

overwriting at least part of the second operand with the destination operand.
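Illustrative note (not part of the claims): claims 12 and 13 place operand fields in bits 3-5 and bits 0-2. A minimal C sketch of extracting two such three-bit fields from a single byte follows. It assumes, for illustration only, that both fields sit in the same byte; the function names are hypothetical.

```c
#include <stdint.h>

/* Extract the three-bit field held in bits 3-5 of a byte. */
static unsigned field_bits_3_5(uint8_t byte) {
    return (byte >> 3) & 0x7u;
}

/* Extract the three-bit field held in bits 0-2 of a byte. */
static unsigned field_bits_0_2(uint8_t byte) {
    return byte & 0x7u;
}
```

For example, the byte 0xD1 (binary 1101 0001) yields 2 for bits 3-5 and 1 for bits 0-2, so each field can address one of eight registers.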

16. A computing device, comprising:

a decoder to decode a first instruction of a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction, said first instruction indicating: a first operand having a first plurality of data elements including a first operand first data element and a first operand second data element; and a second operand having a second plurality of data elements including a second operand first data element and a second operand second data element; each data element having a length of N/2 bits; and

a functional unit to store, in response to said decoder decoding said first instruction, a first result data element having a length of N bits, said first result data element comprising the first operand first data element and the second operand first data element.

17. The device of claim 16, wherein said first instruction is a disassembly instruction and said first result data element is a first decomposition data element.

18. The device of claim 17, wherein said disassembly instruction has an integer opcode format comprising three or more bytes, a third byte of the three bytes permitting a first three-bit source register address and a second three-bit source-destination register address.

19. The device of claim 18, wherein the first operand corresponds to the first three-bit source register address.

20. The device of claim 18, wherein the first operand corresponds to an addressable storage location in memory.

21. The device of claim 18, wherein the second operand corresponds to the second three-bit source-destination register address, and wherein the functional unit is to store the first decomposition data element in a destination corresponding to the second three-bit source-destination register address, overwriting the second operand.

22. The device of claim 21, wherein said functional unit is to store said first operand first data element in a first part of the destination and to store said second operand first data element in a second part of the destination adjacent to said first part, and wherein said functional unit is further to store, in response to said first instruction, said first operand second data element in a third part of said destination adjacent to said second part, and said second operand second data element in a fourth part of said destination adjacent to said third part.

23. The device of claim 17, further comprising:

a memory to hold the disassembly instruction, said disassembly instruction having an integer opcode format comprising three or more bytes, a byte of said three bytes permitting a first three-bit source register address and a second three-bit source-destination register address; and

a storage device to store software, said software being configured to supply the disassembly instruction to the memory for execution.

24. The device of claim 23, wherein said decoder is to receive the disassembly instruction from the memory and to decode the disassembly instruction, the first operand corresponds to the first three-bit source register address, and the second operand corresponds to the second three-bit source-destination register address.

25. The device of claim 24, wherein the functional unit is to store the first decomposition data element in a destination corresponding to the second three-bit source-destination register address.

26. A microprocessor, comprising:

a first source register to hold first integrated data, the first integrated data having a first plurality of integrated data elements including a first integrated data element and a third integrated data element, each of the first integrated data element and the third integrated data element having a length of N/2 bits;

a second source register to hold second integrated data, the second integrated data having a second plurality of integrated data elements including a second integrated data element and a fourth integrated data element, each of the second integrated data element and the fourth integrated data element having a length of N/2 bits; and

a circuit coupled to receive the first integrated data from the first source register and the second integrated data from the second source register and, in response to a disassembly instruction, to decompose the first integrated data and the second integrated data by copying the first integrated data element and the second integrated data element into a first decomposition data element of a destination register and by copying the third integrated data element and the fourth integrated data element into a second decomposition data element of the destination register, each of the first and second decomposition data elements having a length of N bits,

wherein said disassembly instruction is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction.

27. The microprocessor of claim 26, wherein the first integrated data element is a low-order data element of the first integrated data and the second integrated data element is a low-order data element of the second integrated data, and the disassembly instruction comprises an opcode field to contain one of at least a first set of opcodes to specify a decomposition operation that interleaves low-order data elements from the first and second pluralities of data elements, said first set of opcodes specifying a data element selected from the group consisting of byte elements, word elements, and doubleword elements.

28. The microprocessor of claim 26, wherein the first integrated data element is a high-order data element of the first integrated data and the second integrated data element is a high-order data element of the second integrated data, and the disassembly instruction comprises an opcode field, said opcode field selectively containing one of a second set of opcodes to specify a decomposition operation that interleaves high-order data elements from the first and second pluralities of data elements, said second set of opcodes specifying a data element selected from the group consisting of byte elements, word elements, and doubleword elements.

29. The microprocessor of claim 26, wherein the destination register is the first source register and at least part of the first source register is overwritten with the first and second decomposition data elements.

30. The microprocessor of claim 26, wherein the disassembly instruction has an integer opcode format comprising three or more bytes, a third byte of the three or more bytes permitting a first three-bit register address that identifies the first source register.

31. The microprocessor of claim 30, wherein the first three-bit register address also identifies the destination register.

32. A computing device, comprising:

means for receiving a first instruction of a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction, said first instruction comprising: an opcode field; a first field to indicate a first operand, the first operand having a first plurality of data elements including a first operand first data element and a first operand second data element; and a second field to indicate a second operand, the second operand having a second plurality of data elements including a second operand first data element and a second operand second data element; each of the first operand first data element, the first operand second data element, the second operand first data element, and the second operand second data element having a length of N/2 bits; and

means for storing, in response to said first instruction, a first result data element having a length of N bits, said first result data element comprising the first operand first data element and the second operand first data element.

33. A processor, comprising:

a register file to store first and second integrated data, the first integrated data comprising first and second data elements and the second integrated data comprising third and fourth data elements, wherein each data element in the first plurality of data elements corresponds, in its respective position, to a different data element in the second plurality of data elements;

a decoder to decode an integrated data instruction of a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to the register file and the decoder, wherein said execution unit, in response to the integrated data instruction, combines a first data element from the first integrated data with the corresponding data element from the second integrated data in the register file as third integrated data.

34. A processor, comprising:

a first register storing a first source data element in bits [15:0] of the first register and a second source data element in bits [31:16] of the first register;

a second register storing a third source data element in bits [15:0] of the second register and a fourth source data element in bits [31:16] of the second register; and

an execution unit coupled to the registers and a decoder, said execution unit being configured to produce an integrated data result in response to an integrated data instruction, said integrated data result storing the first source data element in bits [15:0] of a result register and the third source data element in bits [31:16] of the result register.
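Illustrative note (not part of the claims): the bit layout recited in claim 34 can be sketched in C as follows. The function name `interleave_low16` and the use of plain 32-bit integers to stand in for the registers are assumptions made for illustration.

```c
#include <stdint.h>

/* reg1 holds the first source element in bits [15:0] and the second in [31:16];
   reg2 holds the third source element in bits [15:0] and the fourth in [31:16].
   The result keeps the first element in bits [15:0] and places the third
   element in bits [31:16]. */
static uint32_t interleave_low16(uint32_t reg1, uint32_t reg2) {
    uint32_t first = reg1 & 0xFFFFu;   /* first source data element */
    uint32_t third = reg2 & 0xFFFFu;   /* third source data element */
    return first | (third << 16);
}
```

With reg1 = 0xAAAA1111 and reg2 = 0xBBBB2222, the sketch returns 0x22221111: the second and fourth source elements do not appear in the result.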

35. A processing system configured to support 2D/3D graphics, image processing, video compression/decompression, and audio operations, said system comprising:

a bus configured to transmit information; and

a processor coupled to said bus to process information, said processor comprising:

an execution unit coupled to a register file and a decoder, wherein said execution unit, in response to an integrated data instruction, combines a first data element from first integrated data with the corresponding data element from second integrated data in the register file as third integrated data;

wherein said system is configured to be coupled to a display device for displaying information to a user and to a user input device for receiving information from the user.

36. A multimedia processing system configured to support 2D/3D graphics, image processing, video compression/decompression, and audio operations, said system comprising:

a communication bus to transmit information; and

a processor coupled to said communication bus to process information, said processor comprising:

an execution unit coupled to a first register, a second register, and a decoder, said execution unit being configured to produce an integrated data result in response to an integrated data instruction, said integrated data result storing a first source data element in bits [15:0] of the first register and a third source data element in bits [31:16] of the first register, wherein said integrated data instruction is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction;

wherein said system is configured to be coupled to a display device for displaying information to a user and to a user input device for receiving information from the user.

37. A processor, comprising:

a register file configured to store first integrated data and second integrated data, said first integrated data and second integrated data comprising a first plurality of data elements and a second plurality of data elements, respectively, wherein each data element in the first plurality of data elements corresponds to a data element in the second plurality of data elements;

a decoder configured to decode an instruction for a multiply operation, the instruction specifying a data element length, said instruction for the multiply operation being included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and said grouping instruction set comprising at least a grouping add instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder, said execution unit being configured, in response to said decoder decoding said instruction for the multiply operation, to multiply each data element from the first plurality of data elements by the corresponding data element from the second plurality of data elements and to generate a plurality of result data elements of third integrated data, wherein the length of each result data element equals the data element length specified by said instruction.
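Illustrative note (not part of the claims): claim 37 requires each result element to have the same length as the source elements. A minimal C sketch of one way to satisfy that constraint follows, assuming 64-bit integrated data with four 16-bit elements and assuming the low-order 16 bits of each product are the bits retained. The claim does not state which product bits are kept, so that choice and the name `packed_mul_low16` are assumptions.

```c
#include <stdint.h>

/* Multiply corresponding 16-bit elements of a and b.
   Each 32-bit product is truncated to its low 16 bits (an assumption),
   so every result element has the same length as its source elements. */
static uint64_t packed_mul_low16(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16u * i));
        uint16_t y = (uint16_t)(b >> (16u * i));
        /* Widen before multiplying to avoid signed overflow on promotion. */
        uint16_t p = (uint16_t)((uint32_t)x * (uint32_t)y);
        result |= (uint64_t)p << (16u * i);
    }
    return result;
}
```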

38. A processor, comprising:

a register file configured to store first 64-bit integrated data and second 64-bit integrated data, said first 64-bit integrated data and second 64-bit integrated data comprising a first plurality of data elements and a second plurality of data elements, respectively, wherein each data element in the first plurality of data elements corresponds to a data element in the second plurality of data elements, and the length of each data element is 8, 16, or 32 bits;

a decoder configured to decode a 32-bit grouping multiply instruction, said instruction specifying a data element length, said 32-bit grouping multiply instruction being included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and said grouping instruction set comprising at least a grouping add instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder and configured, in response to said decoder decoding said multiply instruction, to multiply each data element from the first plurality of data elements by the corresponding data element from the second plurality of data elements and to generate a plurality of result data elements of third integrated data, wherein the length of each result data element equals the data element length specified by said instruction.

39. A processing system configured to support 2D/3D graphics, image processing, video compression/decompression, and audio operations, said system comprising:

a bus configured to transmit information; and

a decoder configured to decode an instruction for a multiply operation, the instruction specifying a data element length, wherein said instruction for the multiply operation is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and said grouping instruction set comprising at least a grouping add instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder, said execution unit being configured, in response to said decoder decoding said instruction for the multiply operation, to multiply each data element from the first plurality of data elements by the corresponding data element from the second plurality of data elements and to generate a plurality of result data elements of third integrated data, wherein the length of each result data element equals the data element length specified by said instruction;

40. A multimedia processing system configured to support 2D/3D graphics, image processing, video compression/decompression, and audio operations, said system comprising:

a communication bus to transmit information; and

a decoder configured to decode a 32-bit grouping multiply instruction, said instruction specifying a data element length, wherein said 32-bit grouping multiply instruction is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and said grouping instruction set comprising at least a grouping add instruction, a grouping shift instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder and configured, in response to said decoder decoding said multiply instruction, to multiply each data element from the first plurality of data elements by the corresponding data element from the second plurality of data elements and to generate a plurality of result data elements of third integrated data, wherein the length of each result data element equals the data element length specified by said instruction;

41. A processor, comprising:

an instruction cache configured to store instructions;

an instruction pointer register configured to store the address of an instruction to be executed;

a register file comprising a plurality of registers and operable to store floating-point data or integrated data in a single register according to the instruction being executed, said integrated data comprising a plurality of data elements;

a decoder coupled to the instruction pointer register and the instruction cache, the decoder being configured to decode an integrated data instruction, the integrated data instruction specifying a shift operation to be performed on each data element stored in said register and specifying the size of the data elements to be stored in the register, said decoder being further configured to decode a load instruction, said load instruction specifying a load operation that loads data from a data cache into the register file, wherein said integrated data instruction specifying said shift operation is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder, said execution unit being configured to shift each data element independently in response to the integrated data instruction, and to perform the load operation in response to said load instruction.
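Illustrative note (not part of the claims): shifting each data element independently, as recited in claim 41, differs from shifting the whole register because bits never cross element boundaries. A minimal C sketch follows, assuming 64-bit integrated data with four 16-bit elements and a logical left shift. Returning zero for shift counts of 16 or more is an assumption of this sketch, and `packed_shift_left16` is a hypothetical name.

```c
#include <stdint.h>

/* Shift each 16-bit element of src left by count, independently.
   Bits shifted out of an element are discarded rather than carried
   into the neighbouring element. */
static uint64_t packed_shift_left16(uint64_t src, unsigned count) {
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t element = (uint16_t)(src >> (16u * i));
        /* Assumption: counts of 16 or more clear the element. */
        uint16_t shifted = (count < 16u) ? (uint16_t)(element << count) : 0;
        result |= (uint64_t)shifted << (16u * i);
    }
    return result;
}
```

For the input 0x0001F0000001F000 and a count of 4, the sketch yields 0x0010000000100000, whereas an ordinary 64-bit shift would give 0x001F0000001F0000.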

42. A processor, comprising:

an instruction cache configured to store instructions;

a register file comprising a plurality of registers and operable to store floating-point data or integrated data in a single register according to the instruction being executed, said integrated data comprising one 64-bit data element, two 32-bit data elements, four 16-bit data elements, or eight 8-bit data elements;

a decoder coupled to an instruction pointer register and the instruction cache, the decoder being configured to decode a 32-bit integrated data instruction, the 32-bit integrated data instruction specifying a shift operation to be performed on each data element stored in said register and specifying the size of the data elements to be stored in the register, said decoder being further configured to decode a load instruction, said load instruction specifying a load operation that loads data from a data cache into the register file, wherein said 32-bit integrated data instruction specifying said shift operation is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder, said execution unit being configured, in response to the integrated data instruction, to shift each of the data elements independently in a number of clock cycles equal to the number of clock cycles said processor requires to perform a single shift operation on non-grouped data, said execution unit being further configured to perform the load operation in response to said load instruction.

43. A processing system configured to support 2D/3D graphics, image processing, video compression/decompression, and audio operations, said system comprising:

a bus configured to transmit information; and

an instruction cache configured to store instructions;

a decoder coupled to the instruction cache, the decoder being configured to decode an integrated data instruction, the integrated data instruction specifying a shift operation to be performed on each data element stored in said register and specifying the size of the data elements to be stored in the register, wherein said integrated data instruction specifying said shift operation is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder, said execution unit being configured to shift each data element independently in response to the integrated data instruction,

44. A multimedia processing system configured to support 2D/3D graphics, image processing, video compression/decompression, and audio operations, said system comprising:

a communication bus to transmit information; and

an instruction cache configured to store instructions;

a register file comprising a plurality of registers and operable to store floating-point data or integrated data in a single register according to the instruction to be executed, said integrated data comprising two 32-bit data elements, four 16-bit data elements, or eight 8-bit data elements;

a decoder coupled to the instruction cache, the decoder being configured to decode a 32-bit integrated data instruction, the 32-bit integrated data instruction specifying a shift operation to be performed on each data element stored in said register and specifying the size of the data elements stored in the register, said decoder being further configured to decode a load instruction, said load instruction specifying a load operation that loads data from a data cache into the register file, wherein said 32-bit integrated data instruction specifying said shift operation is included in a grouping instruction set, said grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, said grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping compare instruction, and a grouping logical operation instruction; and

an execution unit coupled to said register file and the decoder, said execution unit being configured, in response to the integrated data instruction, to shift each of the data elements independently in a number of clock cycles equal to the number of clock cycles said processor requires to perform a single shift operation on non-grouped data, said execution unit being further configured to perform the load operation in response to said load instruction;

45. A computer system, comprising:

a processor, comprising:

an instruction cache to store instructions;

an instruction pointer register to store the address of an instruction to be executed;

a register file comprising a plurality of registers, said plurality of registers being operable to store floating-point data and integrated data in a single register, said integrated data comprising first integrated data having a first plurality of data elements and second integrated data having a second plurality of data elements, each of the first and second pluralities of data elements comprising two 32-bit data elements, four 16-bit data elements, and eight 8-bit data elements;

a decoder coupled to the instruction pointer register and the instruction cache, said decoder to decode a plurality of grouping instructions of a grouping instruction set for performing operations on integrated data, said grouping instruction set comprising at least:

a grouping shift instruction to perform:

a shift operation that shifts each of the first plurality of data elements by a shift count given by the corresponding one of the second plurality of data elements, to produce a grouping result;

a saturation operation that clamps the grouping result to a maximum value in case of overflow and to a minimum value in case of underflow; and

the grouping shift instruction further specifying the size of the first plurality of data elements; and

a grouping load instruction to perform a load operation that loads integrated data from a data cache into a grouping register;

a grouping add instruction to perform an addition operation; and

a grouping subtract instruction to perform a subtraction operation;

a grouping multiply instruction;

a grouping compare instruction; and

a grouping logical operation instruction; and

an execution unit coupled to the register file and the decoder, said execution unit to:

shift each of the data elements independently in response to the integrated data instruction;

perform the load operation in response to the load instruction;

perform the addition operation in response to the add instruction; and

perform the subtraction operation in response to the subtract instruction; and

a bus coupled to said processor and adapted to be coupled to a device for sound recording and/or playback;

a ROM coupled to said bus to store instructions for said processor;

a RAM coupled to said bus; and

a video digitizing device coupled to said bus;

said computer system being coupled to a user input device and a display device, and further serving as a terminal in a network and supporting 2D/3D graphics.

46. A computer system, comprising:

a processor, comprising:

an instruction cache to store instructions;

a register file comprising a plurality of registers, said plurality of registers being operable to store floating-point data and integrated data in a single register, said integrated data comprising a plurality of data elements, each of said plurality of data elements comprising two 32-bit data elements, four 16-bit data elements, and eight 8-bit data elements;

a grouping shift instruction to perform:

a shift operation that shifts each of said plurality of data elements independently by an immediate value, to produce a grouping result;

the grouping shift instruction further specifying the size of the plurality of data elements; and

a grouping add instruction to perform an addition operation; and

a grouping subtract instruction to perform a subtraction operation;

a grouping multiply instruction;

a grouping compare instruction; and

a grouping logical operation instruction; and

shift each of the data elements independently in response to the integrated data instruction;

perform a load operation in response to a load instruction;

perform the addition operation in response to the add instruction;

perform the subtraction operation in response to the subtract instruction; and

a RAM coupled to said bus; and

a video digitizing device coupled to said bus;

said computer system being coupled to a user input device and a display device, and further serving as a terminal in a network and supporting 2D/3D graphics.

47. The computer system of claim 45 or 46, wherein:

the first plurality of data elements and the second plurality of data elements comprise unsigned integer elements 8 bits in length; and

said execution unit clamps the grouping result to a minimum value of 0 in case of underflow.

48. The computer system of claim 45 or 46, wherein:

the first plurality of data elements and the second plurality of data elements comprise signed integer elements 8 bits in length; and

said execution unit clamps the grouping result to a minimum value of -128 in case of underflow.

The computer system of 49. claims 45 or 46, wherein:

More than first data element and more than second data element comprise the non-signed integer element that length is 16; And

The computer system of 50. claims 45 or 46, wherein:

More than first data element and more than second data element comprise the signed integer element that length is 16; And

Described performance element in case of underflow by group result saturation clamping to minimum value-32768.
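The minimum values recited in claims 47, 48 and 50 follow directly from the element width and signedness. As a minimal sketch (the helper names `saturation_bounds` and `saturate` are hypothetical, and two's-complement representation for signed elements is assumed):

```python
def saturation_bounds(bits: int, signed: bool) -> tuple:
    """Return (minimum, maximum) representable in an element of the given width."""
    if signed:
        # Assumes two's-complement signed elements.
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def saturate(value: int, bits: int, signed: bool) -> int:
    """Clamp an exact arithmetic result into the element's representable range."""
    lo, hi = saturation_bounds(bits, signed)
    return max(lo, min(hi, value))


print(saturate(-5, 8, signed=False))      # unsigned 8-bit underflow
print(saturate(-200, 8, signed=True))     # signed 8-bit underflow
print(saturate(-40000, 16, signed=True))  # signed 16-bit underflow
```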

51. The computer system of claim 45 or 46, wherein:

The first plurality of data elements and the second plurality of data elements comprise unsigned integer elements 32 bits in length; and

52. The computer system of claim 45 or 46, wherein the decoder is to decode a 32-bit grouping shift instruction.

53. The computer system of claim 45 or 46, wherein the shift operation is to store the result in a corresponding destination register.

54. The computer system of claim 45 or 46, wherein the grouping register is in the register file.

55. The computer system of claim 45 or 46, wherein the execution unit, in response to an integrated data instruction, shifts each of the data elements in a number of clock cycles equal to the number of clock cycles the processor requires to perform a single shift operation on non-integrated data.

56. A processor, comprising:

An instruction cache for storing instructions;

A decoder coupled to the instruction cache, the decoder to decode a first instruction to be performed on a first integrated data sequence having a first set of integrated data elements, each of the first set of integrated data elements having the same n-bit length, the first instruction specifying a source of the first integrated data sequence and a destination location of a result integrated data sequence, the result integrated data sequence having a set of result integrated data elements corresponding to the first set of integrated data elements, each of the set of result integrated data elements having the same n-bit length;

An execution unit to generate the result integrated data sequence in response to the decoder decoding the first instruction, each of the set of result integrated data elements respectively representing a count of the number of bits that are set in the corresponding integrated data element of the first integrated data sequence; and

A register file coupled with the execution unit to store, in the destination location specified by the first instruction, the result integrated data sequence generated by the execution unit,

Wherein the first instruction is included in a grouping instruction set having a plurality of grouping instructions for performing operations on integrated data, and wherein the grouping instruction set further comprises at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction.
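The first instruction of claim 56 produces, for each n-bit element, a count of the bits set in that element. A minimal software sketch of that behaviour follows; the function name `packed_popcount` and the 64-bit sequence width are assumptions for illustration, and the claim itself does not prescribe how the count is computed.

```python
def packed_popcount(packed: int, elem_bits: int, total_bits: int = 64) -> int:
    """Replace every elem_bits-wide element with the number of bits set in it."""
    mask = (1 << elem_bits) - 1
    result = 0
    for lane in range(total_bits // elem_bits):
        elem = (packed >> (lane * elem_bits)) & mask
        count = bin(elem).count("1")  # count of set bits in this element
        # The count is at most elem_bits, so it always fits in the same element.
        result |= count << (lane * elem_bits)
    return result


# Eight 8-bit elements.
print(hex(packed_popcount(0xFF0F0301008055AA, 8)))
```

Each result element has the same width as its source element, matching the "same n-bit length" limitation of the claim.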

57. The processor of claim 56, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 8-bit length.

58. The processor of claim 56, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 16-bit length.

59. The processor of claim 56, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 32-bit length.

60. The processor of claim 56, wherein the first integrated data sequence and the result integrated data sequence each have a 64-bit length.

61. The processor of claim 56, wherein the first instruction has a control signal format 24 or more bits in length.

62. The processor of claim 56, wherein the first instruction has a control signal format 32 bits in length.

63. The processor of claim 62, wherein all instructions stored in the instruction cache have a control signal format 32 bits in length.

64. A computer-implemented method, comprising:

Decoding an instruction of a grouping instruction set, the grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, the grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction, the instruction indicating a storage location of a first integrated data sequence having a first set of integrated data elements, each of the first set of integrated data elements having the same n-bit length, and the instruction specifying an operation to be performed on the integrated data elements; and

Generating a result integrated data sequence in response to executing the instruction, the result integrated data sequence having a set of result integrated data elements corresponding to the first set of integrated data elements of the first integrated data sequence, each of the set of result integrated data elements respectively representing a count of the number of bits that are set in the corresponding integrated data element of the first integrated data sequence.

65. The computer-implemented method of claim 64, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 8-bit length.

66. The computer-implemented method of claim 65, wherein the storage location of the first integrated data sequence has a 64-bit length.

67. A computer system, comprising:

A memory device for storing instructions and data; and

A processor, comprising:

An instruction cache for storing copies of instructions from the memory device;

A decoder coupled to the instruction cache, the decoder to decode a first instruction to be performed on a first integrated data sequence having a first set of integrated data elements, each of the first set of integrated data elements having the same n-bit length, the first instruction specifying a source of the first integrated data sequence and a destination location of a result integrated data sequence, the result integrated data sequence having a set of result integrated data elements corresponding to the first set of integrated data elements, each of the set of result integrated data elements having the same n-bit length;

Wherein the first instruction is included in a grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, and wherein the grouping instruction set comprises at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction.

68. The computer system of claim 67, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 8-bit length.

69. The computer system of claim 67, wherein the storage location of the first integrated data sequence has a 64-bit length.

70. A processor that additionally supports an instruction set compatible with the x86 instruction set, the processor comprising:

A plurality of registers comprising integer registers, a status register, and registers used for both floating-point data and integrated data, the status register to indicate a state of the processor;

A mechanism for switching between operating the registers used for both floating-point data and integrated data as stack-located floating-point registers and operating them as non-stack-located integrated data registers;

A decoder to decode a 32-bit control signal format to indicate the use of integrated data and to provide a first source register, a second source register, and a destination register for storing result integrated data, the 32-bit control signal format representing an instruction of a grouping instruction set, the grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction;

An execution unit, responsive to the decoder decoding the 32-bit control signal, to perform a grouping operation on the first source register and the second source register of the registers used for both floating-point data and integrated data, operated as non-stack-located integrated data registers, and to store the result in the destination register.

71. The processor of claim 70, wherein the processor supports the instruction set compatible with the x86 instruction set in response to an instruction from a second control signal format, and wherein the execution unit is further to treat the registers used for both floating-point data and integrated data as stack-located floating-point registers.

72. The processor of claim 71, wherein the execution unit is further to operate simultaneously on the registers used for both floating-point data and integrated data as non-stack-located floating-point and integrated data registers.

73. The processor of claim 72, wherein the processor that additionally supports an instruction set compatible with the x86 instruction set is included in a 64-bit processor.

74. A computer system, comprising:

A memory device for storing instructions and data;

A processor that additionally supports an instruction set compatible with the x86 instruction set, the processor comprising:

An instruction cache for storing copies of instructions from the memory device;

A plurality of registers comprising integer registers, a status register, and registers used for both floating-point data and integrated data;

A decoder to decode a 32-bit control signal format to indicate the use of integrated data and to provide a first source register, a second source register, and a destination register for storing result integrated data, the 32-bit control signal format representing an instruction of a grouping instruction set, the grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction;

An execution unit, responsive to the decoder decoding the 32-bit control signal, to perform a grouping operation on the first source register and the second source register of the registers used for both floating-point data and integrated data, operated as non-stack-located integrated data registers, and to store the result in the destination register.

75. The computer system of claim 74, wherein the processor supports the instruction set compatible with the x86 instruction set in response to an instruction from a second control signal format, and wherein the execution unit is further to treat the registers used for both floating-point data and integrated data as stack-located floating-point registers.

76. The computer system of claim 75, wherein the status register is to indicate a state of the processor.

77. The computer system of claim 74, wherein the execution unit is further to operate simultaneously on the registers used for both floating-point data and integrated data as non-stack-located floating-point and integrated data registers.

78. The computer system of claim 74, wherein the processor that additionally supports an instruction set compatible with the x86 instruction set is included in a 64-bit processor.

79. A processor, comprising:

An instruction cache for storing instructions;

A register file comprising a plurality of registers operable to store floating-point data and integrated data, the integrated data comprising first integrated data having a first plurality of data elements and second integrated data having a second plurality of data elements, each of the first and second pluralities of data elements having the same n-bit size;

A decoder coupled to the instruction cache, the decoder to decode a first instruction of a grouping instruction set, the grouping instruction set comprising a plurality of grouping instructions for performing operations on integrated data, the grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction, the first instruction specifying a first source of the first integrated data, a second source of the second integrated data, and a destination location of result integrated data having a third plurality of result integrated data elements, each of the third plurality of result integrated data elements having the same size, which is twice the n-bit size of each of the first and second pluralities of data elements, the first instruction further specifying the size of each of the first and second pluralities of data elements;

An execution unit coupled to the register file and, in response to the decoder decoding the first instruction, to multiply each of the first plurality of data elements by the corresponding element of the second plurality of data elements to produce respective products, and then to add together the products of corresponding data elements of the first and second integrated data to produce the third plurality of result integrated data elements; and

The register file is coupled to the execution unit to store, in the destination location specified by the first instruction, the result integrated data produced by the execution unit.
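The multiply-then-add behaviour of claim 79, with result elements twice as wide as the source elements, can be sketched as follows. This is an illustration under stated assumptions rather than the claimed hardware: the names `to_signed` and `packed_multiply_add` are hypothetical, the elements are treated as signed two's-complement values, and the products are summed in adjacent pairs (the claim only states that products are added together).

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def packed_multiply_add(a: int, b: int, elem_bits: int = 16, total_bits: int = 64) -> int:
    """Multiply corresponding elements, then sum each adjacent pair of products
    into one result element of width 2 * elem_bits (assumed pairing)."""
    mask = (1 << elem_bits) - 1
    wide_mask = (1 << (2 * elem_bits)) - 1
    result = 0
    for pair in range(total_bits // elem_bits // 2):
        total = 0
        for lane in (2 * pair, 2 * pair + 1):
            x = to_signed((a >> (lane * elem_bits)) & mask, elem_bits)
            y = to_signed((b >> (lane * elem_bits)) & mask, elem_bits)
            total += x * y
        result |= (total & wide_mask) << (pair * 2 * elem_bits)
    return result


# Elements (low to high) 1, 2, 3, 4 times 5, 6, 7, 8.
print(hex(packed_multiply_add(0x0004000300020001, 0x0008000700060005)))
```

With 16-bit sources, each 32-bit result element holds the sum of two products, which is the size doubling the claim describes.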

80. The processor of claim 79, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 8-bit size.

81. The processor of claim 79, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 16-bit size.

82. The processor of claim 79, wherein each of the first set of integrated data elements and each of the set of result integrated data elements has the same 32-bit size.

83. A computer system, comprising:

A processor, comprising:

An instruction cache for storing instructions;

A grouping multiply-add instruction for performing:

A multiplication operation to multiply each of a first plurality of data elements by a corresponding one of a second plurality of data elements to produce respective products;

An addition operation to sum the products of corresponding data elements in the first and second integrated data to produce a group result; and

The grouping multiply-add instruction further specifying the size of the plurality of data elements; and

A grouping add instruction for performing an addition operation;

A grouping subtraction instruction for performing a subtraction operation;

A grouping multiply instruction;

A grouping compare instruction;

A grouping logical operation instruction; and

A grouping shift instruction for performing a shift operation; and

An execution unit coupled to a register file and a decoder, the execution unit to: perform a multiply-add operation independently on the data elements in response to the grouping multiply-add instruction;

Perform a load operation in response to a load instruction;

Perform an addition operation in response to an add instruction; and

Perform a subtraction operation in response to a subtraction instruction; and

Perform a shift operation in response to a shift instruction; and

A RAM coupled to the bus; and

A video digitizing device coupled to the bus;

84. The computer system of claim 83, wherein the decoder is further to decode a saturation operation so that, in the event of overflow, each element of the group result is saturated and clamped to a maximum value and, in the event of underflow, is saturated and clamped to a minimum value.

85. The computer system of claim 84, wherein:

The first plurality of integrated data elements and the second plurality of integrated data elements comprise unsigned integer elements 8 bits in length; and

The execution unit, in the event of underflow, saturates and clamps each element of the group result to a minimum value of zero.

86. The computer system of claim 84, wherein:

The first plurality of integrated data elements and the second plurality of integrated data elements comprise signed integer elements 8 bits in length; and

The execution unit, in the event of underflow, saturates and clamps each element of the group result to a minimum value of -128.

87. The computer system of claim 84, wherein:

The first plurality of integrated data elements and the second plurality of integrated data elements comprise unsigned integer elements 16 bits in length; and

88. The computer system of claim 84, wherein:

The first plurality of integrated data elements and the second plurality of integrated data elements comprise signed integer elements 16 bits in length; and

The execution unit, in the event of underflow, saturates and clamps each element of the group result to a minimum value of -32768.

89. The computer system of claim 84, wherein:

The first plurality of integrated data elements and the second plurality of integrated data elements comprise unsigned integer elements 32 bits in length; and

90. The computer system of claim 83, wherein the decoder is to decode a 32-bit grouping shift instruction.

91. The computer system of claim 83, wherein the multiply-add operation is to store the result in a corresponding destination register.

92. The computer system of claim 83, wherein the grouping register is in the register file.

93. A processor, comprising:

A cache memory for storing one or more instructions;

A decoder for decoding instructions of a grouping instruction set, the instructions for performing operations on integrated data, the grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction and a grouping logical operation instruction;

A register set for storing first integrated data 64 bits in size and second integrated data 64 bits in size, the first integrated data comprising first, second, third, fourth, fifth, sixth, seventh and eighth bytes, and the second integrated data comprising ninth, tenth, eleventh, twelfth, thirteenth, fourteenth, fifteenth and sixteenth bytes; and

An execution unit coupled with the cache memory, the decoder and the register set, and

Operable to execute a decoded instruction of the grouping instruction set, comprising a grouping addition operation corresponding to the grouping add instruction, to generate a first result by adding the first byte and the ninth byte, a second result by adding the second byte and the tenth byte, a third result by adding the third byte and the eleventh byte, a fourth result by adding the fourth byte and the twelfth byte, a fifth result by adding the fifth byte and the thirteenth byte, a sixth result by adding the sixth byte and the fourteenth byte, a seventh result by adding the seventh byte and the fifteenth byte, and an eighth result by adding the eighth byte and the sixteenth byte,

Operable, when the grouping operation causes overflow in the first, second, third, fourth, fifth, sixth, seventh and eighth results, to clamp the first, second, third, fourth, fifth, sixth, seventh and eighth results to a maximum value, or, when the grouping operation causes underflow in the first, second, third, fourth, fifth, sixth, seventh and eighth results, to clamp the first, second, third, fourth, fifth, sixth, seventh and eighth results to a minimum value, and

Operable to store the first, second, third, fourth, fifth, sixth, seventh and eighth results in a destination register, wherein the maximum value equals 255, the minimum value equals 0, and the first, second, third, fourth, fifth, sixth, seventh and eighth results represent unsigned values.
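The byte-wise saturating addition of claim 93 can be modelled as below. This is a minimal sketch only: the function name `packed_add_unsigned_saturate_bytes` is hypothetical, and which byte of the 64-bit value counts as the "first" byte is an assumption (here, the least significant). For unsigned addition only the upper clamp at 255 can trigger; the lower clamp at 0 would matter for a subtraction.

```python
def packed_add_unsigned_saturate_bytes(a: int, b: int) -> int:
    """Add eight unsigned bytes of `a` to the corresponding bytes of `b`,
    clamping each sum to the range 0..255 instead of wrapping around."""
    result = 0
    for lane in range(8):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        total = max(0, min(255, x + y))  # saturate; no carry into the next byte
        result |= total << (8 * lane)
    return result


print(hex(packed_add_unsigned_saturate_bytes(0xFF8001001020F07F, 0x0180010010E00F80)))
```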

94. A processor, comprising:

A cache memory for storing one or more instructions;

A register set; and

An execution unit;

Wherein the register set is for storing first integrated data 64 bits in size and second integrated data 64 bits in size, the first integrated data comprising first, second, third, fourth, fifth, sixth, seventh and eighth bytes, and the second integrated data comprising ninth, tenth, eleventh, twelfth, thirteenth, fourteenth, fifteenth and sixteenth bytes;

Wherein the execution unit is coupled with the cache memory, a decoder and the register set, and

Operable to execute a decoded instruction of a grouping instruction set, comprising a grouping compare operation corresponding to a grouping compare instruction, to generate a first result by comparing the first byte with the ninth byte, a second result by comparing the second byte with the tenth byte, a third result by comparing the third byte with the eleventh byte, a fourth result by comparing the fourth byte with the twelfth byte, a fifth result by comparing the fifth byte with the thirteenth byte, a sixth result by comparing the sixth byte with the fourteenth byte, a seventh result by comparing the seventh byte with the fifteenth byte, and an eighth result by comparing the eighth byte with the sixteenth byte, wherein if the first byte is greater than the ninth byte, the first result is set to a first value, and if the first byte is not greater than the ninth byte, the first result is set to a second value; if the second byte is greater than the tenth byte, the second result is set to the first value, and if the second byte is not greater than the tenth byte, the second result is set to the second value; if the third byte is greater than the eleventh byte, the third result is set to the first value, and if the third byte is not greater than the eleventh byte, the third result is set to the second value; if the fourth byte is greater than the twelfth byte, the fourth result is set to the first value, and if the fourth byte is not greater than the twelfth byte, the fourth result is set to the second value; if the fifth byte is greater than the thirteenth byte, the fifth result is set to the first value, and if the fifth byte is not greater than the thirteenth byte, the fifth result is set to the second value; if the sixth byte is greater than the fourteenth byte, the sixth result is set to the first value, and if the sixth byte is not greater than the fourteenth byte, the sixth result is set to the second value; if the seventh byte is greater than the fifteenth byte, the seventh result is set to the first value, and if the seventh byte is not greater than the fifteenth byte, the seventh result is set to the second value; and if the eighth byte is greater than the sixteenth byte, the eighth result is set to the first value, and if the eighth byte is not greater than the sixteenth byte, the eighth result is set to the second value,

Wherein the grouping compare operation is an unsigned grouping compare operation.
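The eight byte-wise comparisons of claim 94 reduce to one loop. The sketch below is illustrative only: the function name is hypothetical, and because the claim leaves the "first value" and "second value" unspecified, the defaults of all ones (0xFF) and all zeros (0x00) are assumptions exposed as parameters.

```python
def packed_compare_gt_unsigned_bytes(a: int, b: int,
                                     first_value: int = 0xFF,
                                     second_value: int = 0x00) -> int:
    """For each of eight unsigned bytes, write first_value where a > b,
    and second_value where a is not greater than b."""
    result = 0
    for lane in range(8):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        result |= (first_value if x > y else second_value) << (8 * lane)
    return result


print(hex(packed_compare_gt_unsigned_bytes(0x8001FF00107F0205, 0x7F0100FF0F800304)))
```

Because the comparison is unsigned, a byte such as 0x80 is treated as 128 and compares greater than 0x7F.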

95. A processor, comprising:

A cache memory for storing one or more instructions;

A decoder operable to decode instructions of a grouping instruction set, the instructions for performing operations on integrated data, the grouping instruction set comprising at least a grouping add instruction, a grouping multiply instruction, a grouping shift instruction, a grouping compare instruction, a grouping logical operation instruction, an assembling instruction, a disassembly instruction, a grouping subtraction instruction and a multiply-add instruction;

A register file operable to store first integrated data comprising a first plurality of data elements and second integrated data comprising a second plurality of data elements, wherein the first integrated data comprises a first complex number, the second integrated data comprises a second complex number, the first complex number comprises a first real value (r1) and a first imaginary value (i1), and the second complex number comprises a second real value (r2) and a second imaginary value (i2); and

An execution unit coupled with the cache memory, the decoder and the register file, and operable to execute decoded instructions of the grouping instruction set, the operations performed comprising:

An assembling operation corresponding to the assembling instruction, for assembling a portion of the bits from data elements in at least two integrated data to form third integrated data;

A disassembly operation corresponding to the disassembly instruction, for generating fourth integrated data containing at least one data element from the first integrated data and at least one corresponding data element from the second integrated data;

A grouping addition operation corresponding to the grouping add instruction, for adding corresponding data elements from at least two integrated data in parallel;

A grouping subtraction operation corresponding to the grouping subtraction instruction, for subtracting corresponding data elements from at least two integrated data in parallel;

A grouping multiplication operation corresponding to the grouping multiply instruction, for multiplying corresponding data elements from at least two integrated data;

A multiply-add operation corresponding to the multiply-add instruction, for multiplying the first complex number by the second complex number to generate a first result data element (R) and a second result data element (I), wherein the first result data element (R) equals the difference between the product of the first real value (r1) and the second real value (r2) and the product of the first imaginary value (i1) and the second imaginary value (i2), and the second result data element (I) equals the sum of the product of the first real value (r1) and the second imaginary value (i2) and the product of the second real value (r2) and the first imaginary value (i1).
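The complex multiplication at the end of claim 95, R = r1·r2 − i1·i2 and I = r1·i2 + r2·i1, maps onto a single multiply-add once the operands are arranged suitably. The sketch below shows one such arrangement; the operand layout, the pairwise summing of products, and the names `multiply_add_pairs` and `complex_multiply` are assumptions for illustration and are not stated in the claim.

```python
def multiply_add_pairs(a, b):
    """Multiply corresponding elements, then sum each adjacent pair of products."""
    products = [x * y for x, y in zip(a, b)]
    return [products[k] + products[k + 1] for k in range(0, len(products), 2)]


def complex_multiply(r1, i1, r2, i2):
    """Compute (r1 + i1*j) * (r2 + i2*j) with one multiply-add step."""
    first = [r1, i1, r1, i1]     # first complex number, duplicated (assumed layout)
    second = [r2, -i2, i2, r2]   # second complex number, rearranged with i2 negated
    real, imag = multiply_add_pairs(first, second)
    # real = r1*r2 - i1*i2 and imag = r1*i2 + i1*r2, matching R and I in the claim.
    return real, imag


print(complex_multiply(1, 2, 3, 4))
```

Negating i2 in the second operand turns the required subtraction into an addition, so the same add stage produces both R and I.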