CN104137053B - Systems, apparatuses, and methods for performing a butterfly horizontal and cross add or subtract in response to a single instruction - Google Patents
- Publication number
- CN104137053B (CN201180076420.2A)
- Authority
- CN
- China
- Prior art keywords
- data element
- register
- source
- result
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30014—Arithmetic instructions with variable precision
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30018—Bit or string instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/3016—Decoding the operand specifier, e.g. specifier format
- G06F9/30167—Decoding the operand specifier, e.g. specifier format of immediate specifier, e.g. constants
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30181—Instruction operation extension or modification
- G06F9/30185—Instruction operation extension or modification according to one or more bits in the instruction, e.g. prefix, sub-opcode
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Physics (AREA)
- Executing Machine-Instructions (AREA)
- Advance Control (AREA)
Abstract
Embodiments of systems, apparatuses, and methods are described for performing, in a computer processor, a vector packed butterfly horizontal and cross add or subtract of packed data elements in response to a single vector packed butterfly horizontal and cross add or subtract instruction that includes a destination vector register operand, a source vector register operand, an immediate, and an opcode.
Description
Technical field
The field of the invention relates generally to computer processor architecture, and more specifically to instructions that cause a particular result when executed.
Background technology
An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, and may include the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). The term instruction generally refers herein to macro-instructions, that is, instructions that are provided to the processor for execution (or to an instruction converter that translates (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morphs, emulates, or otherwise converts an instruction into one or more other instructions to be processed by the processor). By contrast, micro-instructions or micro-operations (micro-ops) are the result of a processor's decoder decoding macro-instructions.
The ISA is distinguished from the microarchitecture, which is the internal design of the processor implementing the instruction set. Processors with different microarchitectures can share a common instruction set. For example, Intel Pentium 4 processors, Intel Core™ processors, and many processors from Advanced Micro Devices, Inc. of Sunnyvale, California implement nearly identical versions of the x86 instruction set (with some extensions added in newer versions) but have different internal designs. For example, the same register architecture of the ISA may be implemented in different ways in different microarchitectures using well-known techniques, including dedicated physical registers, one or more dynamically allocated physical registers using a register renaming mechanism (e.g., using a register alias table (RAT), a reorder buffer (ROB), and a retirement register file; or using multiple maps and a pool of registers), and so on. Unless otherwise specified, the phrases register architecture, register file, and register are used herein to refer to that which is visible to the software/programmer and the manner in which instructions specify registers. Where specificity is required, the adjective logical, architectural, or software-visible will be used to indicate registers/register files in the register architecture, while different adjectives will be used to designate registers in a given microarchitecture (e.g., physical register, reorder buffer, retirement register, register pool).
An instruction set includes one or more instruction formats. A given instruction format defines various fields (number of bits, location of bits) to specify, among other things, the operation to be performed (the opcode) and the operand(s) on which that operation is to be performed. Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because fewer fields are included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands.
Scientific, financial, auto-vectorized general-purpose, RMS (recognition, mining, and synthesis), and visual and multimedia applications (e.g., 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms, and audio manipulation) often require the same operation to be performed on a large number of data items (referred to as "data parallelism"). Single instruction multiple data (SIMD) refers to a type of instruction that causes a processor to perform one operation on multiple data items. SIMD technology is especially suited to processors that can logically divide the bits in a register into a number of fixed-size data elements, each of which represents a separate value. For example, the bits in a 256-bit register may be specified as a source operand to be operated on as four separate 64-bit packed data elements (quadword (Q) size data elements), eight separate 32-bit packed data elements (doubleword (D) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). This type of data is referred to as the packed data type or vector data type, and operands of this data type are referred to as packed data operands or vector operands. In other words, a packed data item or vector refers to a sequence of packed data elements, and a packed data operand or vector operand is a source or destination operand of a SIMD instruction (also known as a packed data instruction or a vector instruction).
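The division of a register into packed data elements described above can be sketched in a few lines of Python. This is only an illustration: the register is modeled as a plain integer, and the names `REGISTER_BITS`, `ELEMENT_BITS`, `element_count`, and `unpack` are hypothetical helpers, not part of any instruction set.

```python
# Illustrative sketch only: a register is modeled as a Python integer and
# split into fixed-size packed data elements. All names are hypothetical.

REGISTER_BITS = 256  # register width used in the example in the text

# Element-size letters used in the text, mapped to their widths in bits.
ELEMENT_BITS = {"B": 8, "W": 16, "D": 32, "Q": 64}


def element_count(register_bits, element_bits):
    """Return how many packed data elements of the given width fit."""
    return register_bits // element_bits


def unpack(register_value, register_bits, element_bits):
    """Split a register value into elements; index 0 is least significant."""
    mask = (1 << element_bits) - 1
    return [
        (register_value >> (i * element_bits)) & mask
        for i in range(element_count(register_bits, element_bits))
    ]


# Element counts for a 256-bit register, one entry per element size.
counts = {name: element_count(REGISTER_BITS, bits)
          for name, bits in ELEMENT_BITS.items()}
```

Under these assumptions, `counts` holds the same element counts the text gives for a 256-bit register (32 bytes, 16 words, 8 doublewords, 4 quadwords).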
By way of example, one type of SIMD instruction specifies a single vector operation to be performed on two source vector operands in a vertical fashion to generate a destination vector operand (also referred to as a result vector operand) of the same size, with the same number of data elements, and in the same data element order. The data elements in the source vector operands are referred to as source data elements, while the data elements in the destination vector operand are referred to as destination or result data elements. These source vector operands are of the same size and contain data elements of the same width, and thus they contain the same number of data elements. The source data elements in the same bit positions in the two source vector operands form pairs of data elements (also referred to as corresponding data elements; that is, the data elements in data element position 0 of each source operand correspond, the data elements in data element position 1 of each source operand correspond, and so on). The operation specified by that SIMD instruction is performed separately on each of these pairs of source data elements to generate a matching number of result data elements, and thus each pair of source data elements has a corresponding result data element. Since the operation is vertical, and since the result vector operand is the same size, has the same number of data elements, and stores the result data elements in the same data element order as the source vector operands, the result data elements are in the same bit positions of the result vector operand as their corresponding pairs of source data elements in the source vector operands. In addition to this exemplary type of SIMD instruction, there are a variety of other types of SIMD instructions (e.g., ones that have only one or more than two source vector operands, that operate in a horizontal fashion, that generate a result vector operand of a different size, that have data elements of a different size, and/or that have a different data element order). It should be understood that the term destination vector operand (or destination operand) is defined as the direct result of performing the operation specified by an instruction, including the storage of that destination operand at a location (be it a register or a memory address specified by that instruction) so that it may be accessed as a source operand by another instruction (by specification of that same location by the other instruction).
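A minimal sketch of the vertical operation described above, assuming vectors are modeled as Python lists whose index is the data element position. The function name `vertical_op` and the wrap-around to the element width are assumptions made for illustration; the text does not specify overflow behavior.

```python
import operator


def vertical_op(src1, src2, op, element_bits):
    """Apply op to corresponding element pairs of two source vectors.

    Element i of the result comes only from element i of each source, so the
    result has the same element count and order as the sources. Wrapping the
    result to element_bits is an assumption made for this sketch.
    """
    if len(src1) != len(src2):
        raise ValueError("source vectors must have the same number of elements")
    mask = (1 << element_bits) - 1
    return [op(a, b) & mask for a, b in zip(src1, src2)]


# Four 32-bit elements per source; position 0 is listed first.
result = vertical_op([1, 2, 3, 4], [10, 20, 30, 40], operator.add, 32)
```

Each result element lands in the same position as the pair that produced it, which is the property that distinguishes a vertical operation from the horizontal ones discussed later.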
SIMD technology, such as that employed by Intel Core™ processors having an instruction set including x86, MMX™, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSE4.1, and SSE4.2 instructions, has enabled a significant improvement in application performance. An additional set of SIMD extensions, referred to as Advanced Vector Extensions (AVX) (AVX1 and AVX2) and using the Vector Extensions (VEX) coding scheme, has been released and/or published (see, e.g., the Intel 64 and IA-32 Architectures Software Developer's Manual, October 2011; and see the Intel Advanced Vector Extensions Programming Reference, June 2011).
Brief description of the drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
Fig. 1 shows an exemplary illustration of the operation of an exemplary PHXADDSUB instruction.
Fig. 2 shows an embodiment of the use of a PHXADDSUB instruction in a processor.
Fig. 3 shows an embodiment of a method for processing a PHXADDSUB instruction.
Fig. 4 shows exemplary pseudo-code for a horizontal add or subtract with a destination of four packed data elements.
Fig. 5 shows a correlation between the number of one-active-bit vector writemask elements and the vector size and data element size according to one embodiment of the invention.
Fig. 6A illustrates an exemplary AVX instruction format;
Fig. 6B shows which fields from Fig. 6A make up a full opcode field and a base operation field;
Fig. 6C shows which fields from Fig. 6A make up a register index field;
Figs. 7A-7B are block diagrams illustrating a generic vector friendly instruction format and instruction templates thereof according to embodiments of the invention;
Figs. 8A-D are block diagrams illustrating an exemplary specific vector friendly instruction format according to embodiments of the invention;
Fig. 9 is a block diagram of a register architecture according to one embodiment of the invention;
Fig. 10A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention;
Fig. 10B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention;
Figs. 11A-B show a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip;
Fig. 12 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the invention;
Fig. 13 is a block diagram of an exemplary system according to an embodiment of the invention;
Fig. 14 is a block diagram of a first more specific exemplary system according to an embodiment of the invention;
Fig. 15 is a block diagram of a second more specific exemplary system according to an embodiment of the invention;
Fig. 16 is a block diagram of an SoC according to an embodiment of the invention;
Fig. 17 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention.
Detailed description
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
References in the specification to "one embodiment," "an embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.
Overview
In the description below, there are some items that may need explanation prior to describing the operations of this particular instruction in the instruction set architecture. One such item is called a "writemask register," which is generally used to predicate an operand to conditionally control per-element computational operations (below, the term mask register may also be used, and it refers to a writemask register such as the "k" registers discussed below). As used below, a writemask register stores a plurality of bits (16, 32, 64, etc.), wherein each active bit of the writemask register governs the operation/update of a packed data element of a vector register during SIMD processing. Typically, there is more than one writemask register available for use by a processor core.
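The per-element control that a writemask provides can be illustrated with a short sketch. It assumes a "merging" behavior, in which elements whose mask bit is 0 keep the destination's previous value; the passage above does not state what happens to masked-off elements, so this behavior and the name `masked_add` are assumptions made purely for illustration.

```python
def masked_add(dest, src1, src2, writemask):
    """Add element pairs only where the matching writemask bit is 1.

    Bit i of writemask governs data element position i. Positions whose bit
    is 0 keep the old destination value (merging; an assumption here).
    """
    return [
        a + b if (writemask >> i) & 1 else old
        for i, (old, a, b) in enumerate(zip(dest, src1, src2))
    ]


# Mask 0b0101 updates positions 0 and 2 and leaves positions 1 and 3 alone.
updated = masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101)
```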
The instruction set architecture includes at least some SIMD instructions that specify vector operations and that have fields to select source registers and/or destination registers from these vector registers (an exemplary SIMD instruction may specify a vector operation to be performed on the contents of one or more of the vector registers, and the result of that vector operation to be stored in one of the vector registers). Different embodiments of the invention may have different sized vector registers and support more/less/different sized data elements.
The size of the multi-bit data elements specified by a SIMD instruction (e.g., byte, word, doubleword, quadword) determines the bit locations of the "data element positions" within a vector register, and the size of the vector operand determines the number of data elements. A packed data element refers to the data stored in a particular position. In other words, depending on the size of the data elements in the destination operand and the size of the destination operand (the total number of bits in the destination operand) (or, put another way, depending on the size of the destination operand and the number of data elements within the destination operand), the bit locations of the multi-bit data element positions within the resulting vector operand change (e.g., if the destination for the resulting vector operand is a vector register, then the bit locations of the multi-bit data element positions within the destination vector register change). For example, the bit locations of the multi-bit data elements differ between a vector operation that operates on 32-bit data elements (data element position 0 occupies bit locations 31:0, data element position 1 occupies bit locations 63:32, and so on) and a vector operation that operates on 64-bit data elements (data element position 0 occupies bit locations 63:0, data element position 1 occupies bit locations 127:64, and so on).
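The relationship between data element position, element size, and bit locations can be written as a one-line calculation. The helper name `element_bit_range` is hypothetical; the sketch simply restates the 31:0, 63:32 and 63:0, 127:64 examples from the text.

```python
def element_bit_range(position, element_bits):
    """Return (high_bit, low_bit) occupied by a data element position."""
    low = position * element_bits
    high = low + element_bits - 1
    return high, low


# 32-bit elements: position 0 -> 31:0, position 1 -> 63:32.
ranges_32 = [element_bit_range(i, 32) for i in range(2)]
# 64-bit elements: position 0 -> 63:0, position 1 -> 127:64.
ranges_64 = [element_bit_range(i, 64) for i in range(2)]
```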
Additionally, there is a correlation between the number of one-active-bit vector writemask elements and the vector size and the data element size according to one embodiment of the invention, as shown in Fig. 5. Vector sizes of 128 bits, 256 bits, and 512 bits are shown, although other widths are also possible. Data element sizes of 8-bit bytes (B), 16-bit words (W), 32-bit doublewords (D) or single-precision floating point, and 64-bit quadwords (Q) or double-precision floating point are considered, although other widths are also possible. As shown, when the vector size is 128 bits, 16 bits may be used for masking when the vector's data element size is 8 bits, 8 bits when the data element size is 16 bits, 4 bits when the data element size is 32 bits, and 2 bits when the data element size is 64 bits. When the vector size is 256 bits, 32 bits may be used for masking when the packed data element width is 8 bits, 16 bits when the data element size is 16 bits, 8 bits when the data element size is 32 bits, and 4 bits when the data element size is 64 bits. When the vector size is 512 bits, 64 bits may be used for masking when the data element size is 8 bits, 32 bits when the data element size is 16 bits, 16 bits when the data element size is 32 bits, and 8 bits when the data element size is 64 bits.
Depending on the combination of the vector size and the data element size, either all 64 bits or only a subset of the 64 bits may be used as a writemask. Generally, when a single per-element mask control bit is used, the number of bits in the vector writemask register used for masking (active bits) equals the vector size in bits divided by the vector's data element size in bits.
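The rule stated above (active mask bits equal the vector size divided by the data element size) reproduces every combination listed for Fig. 5. The following sketch tabulates it; `active_mask_bits` and `mask_table` are hypothetical names used only for this illustration.

```python
def active_mask_bits(vector_bits, element_bits):
    """Number of writemask bits used when one bit controls one element."""
    if vector_bits % element_bits != 0:
        raise ValueError("vector size must be a multiple of the element size")
    return vector_bits // element_bits


# Rows: vector size in bits. Columns: data element size in bits.
mask_table = {
    vector_bits: {element_bits: active_mask_bits(vector_bits, element_bits)
                  for element_bits in (8, 16, 32, 64)}
    for vector_bits in (128, 256, 512)
}
```

Only the 512-bit vector with 8-bit elements uses all 64 mask bits; every other combination uses a subset.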
Described below are vector instructions with the flexibility to perform several subtractions and additions on pairs of vector elements. These are the basic building blocks for butterfly implementations of transforms such as Haar and Hadamard, which are useful in media codecs and in analysis. Below are embodiments of an instruction generically called a packed butterfly horizontal and cross add or subtract ("PHXADDSUB") instruction, and embodiments of systems, architectures, instruction formats, etc. that may be used to execute such an instruction, which is beneficial in several different areas. The execution of a PHXADDSUB instruction causes horizontal and cross addition or subtraction between packed data elements of a source vector register, wherein the results of the additions and/or subtractions are stored in a destination vector register, and whether each operation is an addition or a subtraction depends on bits in the immediate value of the instruction. More specifically, the source vector register comprises one or more data lanes, and the horizontal and cross addition or subtraction occurs on a per-data-lane basis.
The data stored in the destination register is as follows (note that ? : denotes the conditional operator, which returns the first result if the condition is true and the second result if it is false):
For the least significant data element position, the data stored is the result of adding the least significant data element of the source register to the second least significant data element of the source register, or of subtracting the least significant data element of the source register from the second least significant data element of the source register, wherein the choice of addition or subtraction is based on the least significant bit of the immediate (that is, DEST[0] = SRC[0] * (imm8[0] ? -1 : 1) + SRC[1]);
For the second least significant data element position, the data stored is the result of adding the third least significant data element of the source register to the fourth least significant data element of the source register, or of subtracting the third least significant data element of the source register from the fourth least significant data element of the source register, wherein the choice of addition or subtraction is based on the third least significant bit of the immediate (DEST[1] = SRC[2] * (imm8[2] ? -1 : 1) + SRC[3]);
For the third least significant data element position, the data stored is the result of adding the second least significant data element of the source register to the least significant data element of the source register, or of subtracting the second least significant data element of the source register from the least significant data element of the source register, wherein the choice of addition or subtraction is based on the second least significant bit of the immediate (DEST[2] = SRC[1] * (imm8[1] ? -1 : 1) + SRC[0]); and
For the fourth least significant data element position, the data stored is the result of adding the fourth least significant data element of the source register to the third least significant data element of the source register, or of subtracting the fourth least significant data element of the source register from the third least significant data element of the source register, wherein the choice of addition or subtraction is based on the fourth least significant bit of the immediate (DEST[3] = SRC[3] * (imm8[3] ? -1 : 1) + SRC[2]).
Fig. 1 shows an exemplary illustration of the operation of an exemplary PHXADDSUB instruction. More specifically, it shows the operation of the instruction from a source data channel 101 made up of four packed data elements to a destination data channel 105. The calculation performed in this data channel is carried out for all data channels of the source register to complete execution of the instruction.
The exemplary source data channel 101 has four packed data elements, the least significant of which (SRC[0]) is the leftmost one. The packed data elements may be of many different sizes, including 8, 16, 32, and 64 bits. The number of data channels evaluated depends on the data element size and the source register size (e.g., 128, 256, or 512 bits). For example, a 128-bit source register with 16-bit data elements has two 64-bit data channels that are evaluated as shown in the figure.
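The arithmetic behind this example can be sketched in a few lines of Python. This is a minimal sketch, assuming four data elements per data channel as in Fig. 1; the helper name `data_channel_count` is hypothetical and does not come from the patent.

```python
def data_channel_count(register_bits: int, element_bits: int,
                       elements_per_channel: int = 4) -> int:
    """Return how many data channels fit in a source register.

    Assumption: each data channel holds `elements_per_channel` packed
    data elements (four in the Fig. 1 example).
    """
    channel_bits = element_bits * elements_per_channel
    if register_bits % channel_bits != 0:
        raise ValueError("register size must be a multiple of the channel size")
    return register_bits // channel_bits


# The example from the text: a 128-bit register with 16-bit elements.
print(data_channel_count(128, 16))  # 2 (two 64-bit data channels)
```

The same helper gives the channel count for other register and element sizes mentioned in the text, for example 256-bit registers with 32-bit elements.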
As discussed above, the values stored in the destination are as follows:
- DEST[0] = SRC[0] * (imm8[0] ? -1 : 1) + SRC[1];
- DEST[1] = SRC[2] * (imm8[2] ? -1 : 1) + SRC[3];
- DEST[2] = SRC[1] * (imm8[1] ? -1 : 1) + SRC[0]; and
- DEST[3] = SRC[3] * (imm8[3] ? -1 : 1) + SRC[2].
In this particular example, only the lower four bits of the immediate 107 are used to determine addition or subtraction using NOT logic 103. However, if there are more data elements, the number of immediate bits used will change. Additionally, in some embodiments, the upper four bits of the immediate are used.
As an example, in the figure, if source data element 0 has the value 3, source data element 1 has the value 2, and the least significant bit of the immediate is 0, then the value stored in destination data element position 0 is 3 + 2 = 5.
The figure also shows adder logic 109, which may be hardware (such as an ALU) or software that performs the addition of the source values (negated or not negated).
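The per-channel behavior described above can be modeled in software. The following is a minimal sketch, not the patented hardware: the function name `phxaddsub_channel` is hypothetical, element width and overflow are ignored, and the values of SRC[2] and SRC[3] in the usage line are arbitrary.

```python
def phxaddsub_channel(src, imm8):
    """Model one four-element data channel.

    A set immediate bit negates the first operand of the corresponding
    addition, which turns that addition into a subtraction.
    """
    def sign(bit):
        return -1 if (imm8 >> bit) & 1 else 1

    return [
        src[0] * sign(0) + src[1],  # DEST[0]
        src[2] * sign(2) + src[3],  # DEST[1]
        src[1] * sign(1) + src[0],  # DEST[2]
        src[3] * sign(3) + src[2],  # DEST[3]
    ]


# Worked example from the text: SRC[0] = 3, SRC[1] = 2, imm8[0] = 0.
print(phxaddsub_channel([3, 2, 7, 4], 0b0000)[0])  # 5
```

Multiplying by -1 or 1 mirrors the formulas in the text; a hardware implementation would more likely negate the operand with the NOT logic 103 before it reaches the adder logic 109.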
Example format
An example format of this instruction is "PHXADDSUB{B/W/D/Q} YMM1, YMM2, IMM8", where the operand YMM1 is the destination vector register, YMM2 is the source vector register (such as a 128-, 256-, or 512-bit register), IMM8 is an 8-bit immediate, and PHXADDSUB is the opcode of the instruction. The data element size may be defined in the "prefix" of the instruction, for example through the use of an indication of a data granularity bit. In most embodiments, this bit will indicate that each data element is either 32 or 64 bits, but other variations may be used. In some embodiments, however, the opcode itself determines the data element size. For example, as shown above, the additional letters {B/W/D/Q} may be used to indicate data element sizes of byte, word, doubleword, and quadword, respectively.
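As an illustration of the opcode-suffix variant, the mapping from suffix letter to element size can be sketched as below. This is a hedged sketch: the helper name and the idea of passing the mnemonic as a string such as "PHXADDSUBW" are assumptions for illustration only; the bit widths follow the usual x86 meaning of byte, word, doubleword, and quadword.

```python
# Suffix letter -> element size in bits (byte, word, doubleword, quadword).
ELEMENT_BITS = {"B": 8, "W": 16, "D": 32, "Q": 64}


def element_bits_from_mnemonic(mnemonic: str) -> int:
    """Return the data element size implied by a PHXADDSUB{B/W/D/Q} mnemonic."""
    base = "PHXADDSUB"
    suffix = mnemonic[len(base):]
    if not mnemonic.startswith(base) or suffix not in ELEMENT_BITS:
        raise ValueError(f"not a recognized PHXADDSUB mnemonic: {mnemonic!r}")
    return ELEMENT_BITS[suffix]


print(element_bits_from_mnemonic("PHXADDSUBW"))  # 16
```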
Exemplary methods of execution
Fig. 2 shows an embodiment of the use of a PHXADDSUB instruction in a processor. At 201, a PHXADDSUB instruction with a source register operand, a destination register operand, and an immediate is fetched.
At 203, the PHXADDSUB instruction is decoded by decode logic. Depending on the instruction's format, a variety of data may be interpreted at this stage, such as whether there is to be a data transformation, which registers to write to and retrieve, what memory address to access, etc.
At 205, the source operand values are retrieved/read. For example, the source register is read. If the source operand is a memory operand, the data elements associated with that operand are retrieved. In some embodiments, data elements from memory are stored into a temporary register prior to the execution stage. This stage may also include logically organizing the source register into multiple data channels, where the size of each data channel is the size of a data element of the destination register.
At 207, the PHXADDSUB instruction (or operations comprising such an instruction, such as microoperations) is executed by execution resources, such as one or more functional units, to calculate, on a per-data-channel basis, the horizontal and cross additions or subtractions between the packed data elements of the source vector register, where the choice of addition or subtraction depends on bits in the instruction's immediate value.
In one embodiment, for the least significant data element position of each data channel, the calculation is either the sum of the least significant data element of the source register and the second least significant data element of the source register, or the result of subtracting the least significant data element of the source register from the second least significant data element of the source register, where the choice of addition or subtraction is based on the least significant bit of the immediate. For the second least significant data element position of each data channel, the calculation is either the sum of the third least significant data element of the source register and the fourth least significant data element of the source register, or the result of subtracting the third least significant data element of the source register from the fourth least significant data element of the source register, where the choice of addition or subtraction is based on the third least significant bit of the immediate. For the third least significant data element position of each data channel, the calculation is either the sum of the second least significant data element of the source register and the least significant data element of the source register, or the result of subtracting the second least significant data element of the source register from the least significant data element of the source register, where the choice of addition or subtraction is based on the second least significant bit of the immediate. Finally, for the fourth least significant data element position of each data channel, the calculation is either the sum of the fourth least significant data element of the source register and the third least significant data element of the source register, or the result of subtracting the fourth least significant data element of the source register from the third least significant data element of the source register, where the choice of addition or subtraction is based on the fourth least significant bit of the immediate. These calculations may be performed serially or in parallel with one another. In all of the above, subtraction is performed when the immediate bit is 1, and addition is performed when the immediate bit is 0. Of course, in some embodiments the opposite convention is used.
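The execution step at 207 can be sketched as a software model that walks every data channel of the source register and applies the same four immediate bits to each one. This is a minimal sketch under stated assumptions: the register is represented as a Python list of integers with element 0 least significant, element width and overflow are ignored, and the name `phxaddsub` and the `subtract_when_set` flag (used to model the opposite convention) are hypothetical.

```python
def phxaddsub(src, imm8, subtract_when_set=True):
    """Model PHXADDSUB over a whole register of four-element data channels."""
    if len(src) % 4 != 0:
        raise ValueError("source must hold a whole number of four-element channels")

    def sign(bit):
        is_set = bool((imm8 >> bit) & 1)
        # Default convention: bit = 1 subtracts, bit = 0 adds.
        return -1 if is_set == subtract_when_set else 1

    dest = []
    for base in range(0, len(src), 4):      # one iteration per data channel
        s = src[base:base + 4]
        dest += [
            s[0] * sign(0) + s[1],          # least significant position
            s[2] * sign(2) + s[3],          # second least significant position
            s[1] * sign(1) + s[0],          # third least significant position
            s[3] * sign(3) + s[2],          # fourth least significant position
        ]
    return dest


# Two data channels, subtraction selected for positions 2 and 3.
print(phxaddsub([3, 2, 7, 4, 1, 1, 1, 1], 0b1010))  # [5, 11, 1, 3, 2, 2, 0, 0]
```

The loop processes the channels serially; since each channel depends only on its own four source elements, the same computation could equally be done in parallel, as the text notes.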
At 209, each result is stored into the corresponding packed data element of the destination register. Although 207 and 209 are illustrated separately, in some embodiments they are performed together as part of the execution of the instruction.
Fig. 3 shows an embodiment of a method for processing a PHXADDSUB instruction. More specifically, this illustration details the steps for processing one data channel. For each additional data channel, the same steps may be performed serially or in parallel with the other data channels. In this embodiment, it is assumed that some, if not all, of operations 201-205 have been performed earlier; they are not shown, however, so as not to obscure the details presented below. For example, the fetching and decoding are not shown, nor is the operand retrieval.
At 301, the source register is organized into data channels based on the data element size and the destination register size. As noted above, the number of data elements varies with their size and the size of the register they are in or will go into. However, a typical implementation has four data elements per data channel, with a variable number of data channels.
At 303, for each data element position of a data channel of the source register, a determination is made as to whether the immediate bit associated with that data element position indicates that the element should be negated for some of the additions.
At 305, the cross additions between neighboring data element positions are performed. The details of these additions are described above.
At 307, the results are stored into the data element positions of the destination register as discussed above.
Fig. 4 shows exemplary pseudo-code for the horizontal add or subtract for a destination with four packed data elements. Of course, changes may be made to the code to accommodate elements of different sizes, etc. The conditional operator "?:" means: if the condition is true, the first result (-1) is returned, and if it is false, the second result (1) is returned.
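To connect the operation to the butterfly transforms mentioned at the start of this description, the sketch below applies the four-element formulas twice with the immediate 0b1010. This choice of immediate and the two-pass composition are our own illustration rather than something stated in the patent; the function names are hypothetical, and the result matches the product of the 4x4 Hadamard matrix in natural order with the input.

```python
def butterfly(x, imm8=0b1010):
    """One PHXADDSUB-style pass over a single four-element data channel."""
    def sign(bit):
        return -1 if (imm8 >> bit) & 1 else 1

    return [
        x[0] * sign(0) + x[1],
        x[2] * sign(2) + x[3],
        x[1] * sign(1) + x[0],
        x[3] * sign(3) + x[2],
    ]


def hadamard4(x):
    """Two passes with imm8 = 0b1010 give the unnormalized 4-point Hadamard transform."""
    return butterfly(butterfly(x))


print(butterfly([1, 2, 3, 4]))  # [3, 7, -1, -1]  (a+b, c+d, a-b, c-d)
print(hadamard4([1, 2, 3, 4]))  # [10, -2, -4, 0]
```

With imm8 = 0b1010 a single pass yields the pairwise sums followed by the pairwise differences, which is the shape of one Haar-style butterfly stage.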
Exemplary instruction formats
Embodiments of the instruction(s) described herein may be embodied in different formats. For example, the instruction(s) described herein may be embodied in a VEX format, a generic vector friendly format, or another format. Details of the VEX and generic vector friendly formats are discussed below. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.
VEX instruction format
VEX encoding allows instructions to have more than two operands and allows SIMD vector registers to be longer than 128 bits. The use of a VEX prefix provides for a three-operand (or more) syntax. For example, previous two-operand instructions performed operations that overwrite a source operand (such as A = A + B). The use of a VEX prefix enables operands to perform non-destructive operations such as A = B + C.
Fig. 6A shows an exemplary AVX instruction format including a VEX prefix 602, a real opcode field 630, a MOD R/M byte 640, a SIB byte 650, a displacement field 662, and IMM8 672. Fig. 6B shows which fields from Fig. 6A make up a full opcode field 674 and a base operation field 642. Fig. 6C shows which fields from Fig. 6A make up a register index field 644.
The VEX prefix (bytes 0-2) 602 is encoded in a three-byte form. The first byte is the format field 640 (VEX byte 0, bits [7:0]), which contains an explicit C4 byte value (the unique value used to distinguish the C4 instruction format). The second and third bytes (VEX bytes 1-2) include a number of bit fields that provide specific capabilities. Specifically, the REX field 605 (VEX byte 1, bits [7-5]) consists of a VEX.R bit field (VEX byte 1, bit [7] - R), a VEX.X bit field (VEX byte 1, bit [6] - X), and a VEX.B bit field (VEX byte 1, bit [5] - B). Other fields of the instructions encode the lower three bits of the register indexes as is known in the art (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb may be formed by adding VEX.R, VEX.X, and VEX.B. The opcode map field 615 (VEX byte 1, bits [4:0] - mmmmm) includes content to encode an implied leading opcode byte. The W field 664 (VEX byte 2, bit [7] - W) is represented by the notation VEX.W and provides different functions depending on the instruction. The role of VEX.vvvv 620 (VEX byte 2, bits [6:3] - vvvv) may include the following: 1) VEX.vvvv encodes the first source register operand, specified in inverted (1's complement) form, and is valid for instructions with two or more source operands; 2) VEX.vvvv encodes the destination register operand, specified in 1's complement form, for certain vector shifts; or 3) VEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. If the VEX.L 668 size field (VEX byte 2, bit [2] - L) = 0, it indicates a 128-bit vector; if VEX.L = 1, it indicates a 256-bit vector. The prefix encoding field 625 (VEX byte 2, bits [1:0] - pp) provides additional bits for the base operation field.
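The bit layout described above can be made concrete with a small field extractor. This is a minimal sketch, assuming exactly the bit positions given in the text; the function name `decode_vex3` and the dictionary keys are hypothetical, the R, X, and B bits are returned as stored without further interpretation, and the byte values in the usage line are arbitrary test bytes rather than an encoding of any particular instruction.

```python
def decode_vex3(byte0: int, byte1: int, byte2: int) -> dict:
    """Split a three-byte VEX prefix into the bit fields named in the text."""
    if byte0 != 0xC4:
        raise ValueError("three-byte VEX prefix must start with the C4 byte")
    vvvv = (byte2 >> 3) & 0xF
    return {
        "R": (byte1 >> 7) & 1,               # VEX byte 1, bit [7]
        "X": (byte1 >> 6) & 1,               # VEX byte 1, bit [6]
        "B": (byte1 >> 5) & 1,               # VEX byte 1, bit [5]
        "mmmmm": byte1 & 0x1F,               # opcode map, bits [4:0]
        "W": (byte2 >> 7) & 1,               # VEX byte 2, bit [7]
        "vvvv": vvvv,                        # raw (inverted) value, bits [6:3]
        "vvvv_register": (~vvvv) & 0xF,      # 1's complement undone
        "vector_bits": 256 if (byte2 >> 2) & 1 else 128,  # VEX.L, bit [2]
        "pp": byte2 & 0x3,                   # prefix encoding, bits [1:0]
    }


fields = decode_vex3(0xC4, 0xE1, 0x7D)
print(fields["mmmmm"], fields["vvvv_register"], fields["vector_bits"], fields["pp"])  # 1 0 256 1
```

A raw vvvv of 1111b decodes to register index 0 here, which is also the value the text says the field should hold when it encodes no operand; a real decoder would need the opcode to tell those cases apart.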
The real opcode field 630 (byte 3) is also known as the opcode byte. Part of the opcode is specified in this field.
The MOD R/M field 640 (byte 4) includes a MOD field 642 (bits [7-6]), a Reg field 644 (bits [5-3]), and an R/M field 646 (bits [2-0]). The role of the Reg field 644 may include the following: encoding either the destination register operand or a source register operand (the rrr of Rrrr), or being treated as an opcode extension and not used to encode any instruction operand. The role of the R/M field 646 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.
Scale, index, base (SIB): the content of the scale field 650 (byte 5) includes SS 652 (bits [7-6]), which is used for memory address generation. The contents of SIB.xxx 654 (bits [5-3]) and SIB.bbb 656 (bits [2-0]) have been referred to previously with regard to the register indexes Xxxx and Bbbb.
The displacement field 662 and the immediate field (IMM8) 672 contain address data.
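Both the MOD R/M byte and the SIB byte use the same 2-3-3 bit split described above, which the following sketch makes explicit. The function names and dictionary keys are hypothetical, and the byte values in the usage lines are arbitrary.

```python
def split_2_3_3(byte: int):
    """Split a byte into bits [7-6], [5-3], and [2-0]."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def decode_modrm(byte: int) -> dict:
    mod, reg, rm = split_2_3_3(byte)
    return {"mod": mod, "reg": reg, "rm": rm}


def decode_sib(byte: int) -> dict:
    ss, xxx, bbb = split_2_3_3(byte)
    return {"ss": ss, "xxx": xxx, "bbb": bbb}


print(decode_modrm(0xC1))  # {'mod': 3, 'reg': 0, 'rm': 1}
print(decode_sib(0x88))    # {'ss': 2, 'xxx': 1, 'bbb': 0}
```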
Generic vector friendly instruction format
A vector friendly instruction format is an instruction format that is suited for vector instructions (e.g., there are certain fields specific to vector operations). While embodiments are described in which both vector and scalar operations are supported through the vector friendly instruction format, alternative embodiments use only vector operations through the vector friendly instruction format.
Figs. 7A-7B are block diagrams illustrating a generic vector friendly instruction format and instruction templates thereof according to embodiments of the invention. Fig. 7A is a block diagram illustrating the generic vector friendly instruction format and class A instruction templates thereof according to embodiments of the invention, while Fig. 7B is a block diagram illustrating the generic vector friendly instruction format and class B instruction templates thereof according to embodiments of the invention. Specifically, class A and class B instruction templates are defined for the generic vector friendly instruction format 700, both of which include no memory access 705 instruction templates and memory access 720 instruction templates. The term "generic" in the context of the vector friendly instruction format refers to the instruction format not being tied to any specific instruction set.
While embodiments of the invention will be described in which the vector friendly instruction format supports the following: a 64-byte vector operand length (or size) with 32-bit (4-byte) or 64-bit (8-byte) data element widths (or sizes) (and thus a 64-byte vector consists of either 16 doubleword-size elements or, alternatively, 8 quadword-size elements); a 64-byte vector operand length (or size) with 16-bit (2-byte) or 8-bit (1-byte) data element widths (or sizes); a 32-byte vector operand length (or size) with 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte), or 8-bit (1-byte) data element widths (or sizes); and a 16-byte vector operand length (or size) with 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte), or 8-bit (1-byte) data element widths (or sizes); alternative embodiments may support larger, smaller, and/or different vector operand sizes (e.g., 256-byte vector operands) with larger, smaller, or different data element widths (e.g., 128-bit (16-byte) data element widths).
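The relationship between vector operand length, data element width, and element count in the combinations listed above is simple division, sketched here. The helper name `element_count` is hypothetical.

```python
def element_count(vector_bytes: int, element_bits: int) -> int:
    """Number of data elements of the given width in a vector operand."""
    element_bytes = element_bits // 8
    if element_bytes == 0 or vector_bytes % element_bytes != 0:
        raise ValueError("vector length must be a whole number of elements")
    return vector_bytes // element_bytes


# A 64-byte vector holds 16 doubleword (32-bit) or 8 quadword (64-bit) elements.
print(element_count(64, 32), element_count(64, 64))  # 16 8
```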
The class A instruction templates in Fig. 7A include: 1) within the no memory access 705 instruction templates, a no memory access, full round control type operation 710 instruction template and a no memory access, data transform type operation 715 instruction template; and 2) within the memory access 720 instruction templates, a memory access, temporal 725 instruction template and a memory access, non-temporal 730 instruction template. The class B instruction templates in Fig. 7B include: 1) within the no memory access 705 instruction templates, a no memory access, write mask control, partial round control type operation 712 instruction template and a no memory access, write mask control, vsize type operation 717 instruction template; and 2) within the memory access 720 instruction templates, a memory access, write mask control 727 instruction template.
The generic vector friendly instruction format 700 includes the following fields, listed below in the order illustrated in Figs. 7A-7B.
Format field 740 - a specific value (an instruction format identifier value) in this field uniquely identifies the vector friendly instruction format, and thus occurrences of instructions in the vector friendly instruction format in instruction streams. As such, this field is optional in the sense that it is not needed for an instruction set that has only the generic vector friendly instruction format.
Base operation field 742 - its content distinguishes different base operations.
Register index field 744 - its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory. These fields include a sufficient number of bits to select N registers from a PxQ (e.g., 32x512, 16x128, 32x1024, 64x1024) register file. While in one embodiment N may be up to three sources and one destination register, alternative embodiments may support more or fewer source and destination registers (e.g., they may support up to two sources where one of these sources also acts as the destination, may support up to three sources where one of these sources also acts as the destination, or may support up to two sources and one destination).
Modifier field 746 - its content distinguishes occurrences of instructions in the generic vector instruction format that specify memory access from those that do not; that is, between no memory access 705 instruction templates and memory access 720 instruction templates. Memory access operations read and/or write to the memory hierarchy (in some cases specifying the source and/or destination addresses using values in registers), while non-memory access operations do not (e.g., the source and destinations are registers). While in one embodiment this field also selects between three different ways to perform memory address calculations, alternative embodiments may support more, fewer, or different ways to perform memory address calculations.
Extended operation field 750 - its content distinguishes which one of a variety of different operations is to be performed in addition to the base operation. This field is context specific. In one embodiment of the invention, this field is divided into a class field 768, an α field 752, and a β field 754. The extended operation field 750 allows common groups of operations to be performed in a single instruction rather than 2, 3, or 4 instructions.
Scale field 760 - its content allows for the scaling of the index field's content for memory address generation (e.g., for address generation that uses 2^scale * index + base).
Displacement field 762A - its content is used as part of memory address generation (e.g., for address generation that uses 2^scale * index + base + displacement).
Displacement factor field 762B (note that the juxtaposition of displacement field 762A directly over displacement factor field 762B indicates that one or the other is used) - its content is used as part of address generation; it specifies a displacement factor that is to be scaled by the size of a memory access (N), where N is the number of bytes in the memory access (e.g., for address generation that uses 2^scale * index + base + scaled displacement). Redundant low-order bits are ignored, and hence the displacement factor field's content is multiplied by the memory operand's total size (N) to generate the final displacement used in calculating an effective address. The value of N is determined by the processor hardware at runtime based on the full opcode field 774 (described later herein) and the data manipulation field 754C. The displacement field 762A and the displacement factor field 762B are optional in the sense that they are not used for the no memory access 705 instruction templates and/or different embodiments may implement only one or neither of the two.
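The address generation formulas for the scale, displacement, and displacement factor fields can be written out as follows. This is a minimal sketch of the arithmetic only: the function names are hypothetical, and the base, index, and access size in the usage lines are arbitrary example values.

```python
def effective_address(base: int, index: int, scale: int, displacement: int = 0) -> int:
    """Compute 2^scale * index + base + displacement."""
    return (index << scale) + base + displacement


def scaled_displacement(displacement_factor: int, access_bytes: int) -> int:
    """Scale a displacement factor by the memory access size N (in bytes)."""
    return displacement_factor * access_bytes


# Plain displacement of 8 bytes.
print(hex(effective_address(0x1000, 3, 2, 8)))  # 0x1014

# Displacement factor 2 with a 64-byte memory access gives a 128-byte displacement.
disp = scaled_displacement(2, 64)
print(hex(effective_address(0x1000, 3, 2, disp)))  # 0x108c
```

Storing a factor rather than the full displacement lets a small field cover a larger address range, since the low-order bits that the factor omits would be redundant for accesses aligned to N.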
Data element width field 764 - its content distinguishes which one of a number of data element widths is to be used (in some embodiments for all instructions; in other embodiments for only some of the instructions). This field is optional in the sense that it is not needed if only one data element width is supported and/or data element widths are supported using some aspect of the opcodes.
Write mask field 770 - its content controls, on a per data element position basis, whether that data element position in the destination vector operand reflects the result of the base operation and extended operation. Class A instruction templates support merging-writemasking, while class B instruction templates support both merging- and zeroing-writemasking. When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the extended operation); in another embodiment, the old value of each element of the destination is preserved where the corresponding mask bit has a 0. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the extended operation); in one embodiment, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, the elements that are modified need not be consecutive. Thus, the write mask field 770 allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While embodiments of the invention are described in which the write mask field's 770 content selects one of a number of write mask registers that contains the write mask to be used (and thus the write mask field's 770 content indirectly identifies the masking to be performed), alternative embodiments instead or additionally allow the write mask field's 770 content to directly specify the masking to be performed.
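The difference between merging and zeroing write masking can be illustrated with a short model. This is a minimal sketch, assuming that mask bit i governs data element position i and that vectors are Python lists; the function name `apply_write_mask` is hypothetical.

```python
def apply_write_mask(old_dest, result, mask: int, zeroing: bool = False):
    """Combine an operation's result with the old destination under a write mask.

    Mask bit = 1: the element takes the new result.
    Mask bit = 0: the element keeps its old value (merging) or becomes 0 (zeroing).
    """
    out = []
    for i, (old, new) in enumerate(zip(old_dest, result)):
        if (mask >> i) & 1:
            out.append(new)
        elif zeroing:
            out.append(0)
        else:
            out.append(old)
    return out


old = [9, 9, 9, 9]
new = [1, 2, 3, 4]
print(apply_write_mask(old, new, 0b0101))                # [1, 9, 3, 9]  merging
print(apply_write_mask(old, new, 0b0101, zeroing=True))  # [1, 0, 3, 0]  zeroing
```

The example mask 0b0101 updates non-consecutive elements, matching the remark that the modified elements need not be consecutive.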
Immediate field 772 - its content allows for the specification of an immediate. This field is optional in the sense that it is not present in an implementation of the generic vector friendly format that does not support immediates and it is not present in instructions that do not use an immediate.
Class field 768 - its content distinguishes between different classes of instructions. With reference to Figs. 7A-B, the content of this field selects between class A and class B instructions. In Figs. 7A-B, rounded corner squares are used to indicate that a specific value is present in a field (e.g., class A 768A and class B 768B for the class field 768 in Figs. 7A-B, respectively).
Class A instruction templates
In the case of the non-memory access 705 instruction templates of class A, the α field 752 is interpreted as an RS field 752A, whose content distinguishes which one of the different extended operation types is to be performed (e.g., round 752A.1 and data transform 752A.2 are respectively specified for the no memory access, round type operation 710 and the no memory access, data transform type operation 715 instruction templates), while the β field 754 distinguishes which of the operations of the specified type is to be performed. In the no memory access 705 instruction templates, the scale field 760, the displacement field 762A, and the displacement scale field 762B are not present.
No memory access instruction templates - full round control type operation
In the no memory access, full round control type operation 710 instruction template, the β field 754 is interpreted as a round control field 754A, whose content provides static rounding. While in the described embodiments of the invention the round control field 754A includes a suppress all floating point exceptions (SAE) field 756 and a round operation control field 758, alternative embodiments may encode both of these concepts into the same field or have only one or the other of these concepts/fields (e.g., may have only the round operation control field 758).
SAE field 756 - its content distinguishes whether or not to disable exception event reporting; when the SAE field's 756 content indicates that suppression is enabled, a given instruction does not report any kind of floating-point exception flag and does not raise any floating point exception handler.
Round operation control field 758 - its content distinguishes which one of a group of rounding operations to perform (e.g., round-up, round-down, round-towards-zero, and round-to-nearest). Thus, the round operation control field 758 allows the rounding mode to be changed on a per instruction basis. In one embodiment of the invention in which a processor includes a control register for specifying rounding modes, the round operation control field's 750 content overrides that register value.
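The idea of a per-instruction rounding mode that overrides a control register can be sketched as below. This is a loose illustration under stated assumptions: it rounds Python floats to integers, whereas hardware rounds results to the destination floating-point precision; the mode names, the `round_with_mode` function, and the string-valued control register are all hypothetical. Python's built-in `round` resolves ties to the nearest even integer.

```python
import math

# Hypothetical mode names mapped to standard-library rounding functions.
ROUNDING = {
    "up": math.ceil,             # round-up
    "down": math.floor,          # round-down
    "toward_zero": math.trunc,   # round-towards-zero
    "nearest": round,            # round-to-nearest (ties to even)
}


def round_with_mode(value: float, instruction_mode=None,
                    control_register_mode: str = "nearest") -> int:
    """Use the instruction's rounding mode if given, else the control register's."""
    mode = instruction_mode if instruction_mode is not None else control_register_mode
    return ROUNDING[mode](value)


print(round_with_mode(2.7))                     # 3  (control register: nearest)
print(round_with_mode(2.7, "down"))             # 2  (instruction overrides)
print(round_with_mode(-2.7, "toward_zero"))     # -2
```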
No memory access instruction templates - data transform type operation
In the no memory access, data transform type operation 715 instruction template, the β field 754 is interpreted as a data transform field 754B, whose content distinguishes which one of a number of data transforms is to be performed (e.g., no data transform, swizzle, broadcast).
In the case of a memory access 720 instruction template of class A, the α field 752 is interpreted as an eviction hint field 752B, whose content distinguishes which one of the eviction hints is to be used (in Fig. 7A, temporal 752B.1 and non-temporal 752B.2 are respectively specified for the memory access, temporal 725 instruction template and the memory access, non-temporal 730 instruction template), while the β field 754 is interpreted as a data manipulation field 754C, whose content distinguishes which one of a number of data manipulation operations (also known as primitives) is to be performed (e.g., no manipulation, broadcast, up conversion of a source, and down conversion of a destination). The memory access 720 instruction templates include the scale field 760 and optionally the displacement field 762A or the displacement scale field 762B.
Vector memory instructions perform vector loads from and vector stores to memory, with conversion support. As with regular vector instructions, vector memory instructions transfer data from/to memory in a data element-wise fashion, with the elements that are actually transferred dictated by the contents of the vector mask that is selected as the write mask.
Memory access instruction templates - temporal
Temporal data is data likely to be reused soon enough to benefit from caching. This is, however, a hint, and different processors may implement it in different ways, including ignoring the hint entirely.
Memory access instruction templates - non-temporal
Non-temporal data is data unlikely to be reused soon enough to benefit from caching in the first-level cache and should be given priority for eviction. This is, however, a hint, and different processors may implement it in different ways, including ignoring the hint entirely.
Class B instruction templates
In the case of the instruction templates of class B, the α field 752 is interpreted as a write mask control (Z) field 752C, whose content distinguishes whether the write masking controlled by the write mask field 770 should be a merging or a zeroing.
In the case of the non-memory access 705 instruction templates of class B, part of the β field 754 is interpreted as an RL field 757A, whose content distinguishes which one of the different extended operation types is to be performed (e.g., round 757A.1 and vector length (VSIZE) 757A.2 are respectively specified for the no memory access, write mask control, partial round control type operation 712 instruction template and the no memory access, write mask control, VSIZE type operation 717 instruction template), while the rest of the β field 754 distinguishes which of the operations of the specified type is to be performed. In the no memory access 705 instruction templates, the scale field 760, the displacement field 762A, and the displacement scale field 762B are not present.
In the no memory access, write mask control, partial round control type operation 710 instruction template, the rest of the β field 754 is interpreted as a round operation field 759A and exception event reporting is disabled (a given instruction does not report any kind of floating-point exception flag and does not raise any floating point exception handler).
Round operation control field 759A - just as with the round operation control field 758, its content distinguishes which one of a group of rounding operations to perform (e.g., round-up, round-down, round-towards-zero, and round-to-nearest). Thus, the round operation control field 759A allows the rounding mode to be changed on a per instruction basis. In one embodiment of the invention in which a processor includes a control register for specifying rounding modes, the round operation control field's 750 content overrides that register value.
In the no memory access, write mask control, VSIZE type operation 717 instruction template, the rest of the β field 754 is interpreted as a vector length field 759B, whose content distinguishes which one of a number of data vector lengths is to be performed on (e.g., 128, 256, or 512 bytes).
In the case of a class B memory access 720 instruction template, part of the β field 754 is interpreted as a broadcast field 757B, whose content distinguishes whether or not the broadcast type data manipulation operation is to be performed, while the rest of the β field 754 is interpreted as the vector length field 759B. The memory access 720 instruction templates include the scale field 760 and, optionally, the displacement field 762A or the displacement scale field 762B.
With regard to the generic vector friendly instruction format 700, a full opcode field 774 is shown including the format field 740, the base operation field 742, and the data element width field 764. While one embodiment is shown in which the full opcode field 774 includes all of these fields, in embodiments that do not support all of them the full opcode field 774 includes fewer than all of these fields. The full opcode field 774 provides the operation code (opcode).
The augmentation operation field 750, the data element width field 764, and the write mask field 770 allow these features to be specified on a per-instruction basis in the generic vector friendly instruction format.
The combination of the write mask field and the data element width field creates typed instructions, in that they allow the mask to be applied based on different data element widths.
The various instruction templates found within class A and class B are beneficial in different situations. In some embodiments of the invention, different processors or different cores within a processor may support only class A, only class B, or both classes. For instance, a high-performance general-purpose out-of-order core intended for general-purpose computing may support only class B, a core intended primarily for graphics and/or scientific (throughput) computing may support only class A, and a core intended for both may support both (of course, a core that has some mix of templates and instructions from both classes, but not all templates and instructions from both classes, is within the scope of the invention). Also, a single processor may include multiple cores, all of which support the same class or in which different cores support different classes. For instance, in a processor with separate graphics and general-purpose cores, one of the graphics cores intended primarily for graphics and/or scientific computing may support only class A, while one or more of the general-purpose cores may be high-performance general-purpose cores with out-of-order execution and register renaming, intended for general-purpose computing, that support only class B. Another processor that does not have a separate graphics core may include one or more general-purpose in-order or out-of-order cores that support both class A and class B. Of course, features from one class may also be implemented in the other class in different embodiments of the invention. Programs written in a high-level language would be put (for example, just-in-time compiled or statically compiled) into a variety of different executable forms, including: 1) a form having only instructions of the class or classes supported by the target processor for execution; or 2) a form having alternative routines written using different combinations of the instructions of all classes and having control flow code that selects the routines to execute based on the instructions supported by the processor that is currently executing the code.
Exemplary specific vector friendly instruction format
Fig. 8 is a block diagram illustrating an exemplary specific vector friendly instruction format according to an embodiment of the invention. Fig. 8 shows a specific vector friendly instruction format 800 that is specific in the sense that it specifies the location, size, interpretation, and order of the fields, as well as values for some of those fields. The specific vector friendly instruction format 800 may be used to extend the x86 instruction set, and thus some of the fields are similar or identical to those used in the existing x86 instruction set and its extensions (for example, AVX). This format remains consistent with the prefix encoding field, real opcode byte field, MOD R/M field, SIB field, displacement field, and immediate field of the existing x86 instruction set with extensions. The fields from Fig. 7 into which the fields from Fig. 8 map are illustrated.
It should be understood that, although embodiments of the invention are described with reference to the specific vector friendly instruction format 800 in the context of the generic vector friendly instruction format 700 for illustrative purposes, the invention is not limited to the specific vector friendly instruction format 800 except where stated. For example, the generic vector friendly instruction format 700 contemplates a variety of possible sizes for the various fields, while the specific vector friendly instruction format 800 is shown as having fields of specific sizes. As a specific example, although the data element width field 764 is illustrated as a one-bit field in the specific vector friendly instruction format 800, the invention is not so limited (that is, the generic vector friendly instruction format 700 contemplates other sizes of the data element width field 764).
The generic vector friendly instruction format 700 includes the following fields, listed below in the order illustrated in Fig. 8A.
EVEX prefix (bytes 0-3) 802 - is encoded in a four-byte form.
Format field 740 (EVEX byte 0, bits [7:0]) - the first byte (EVEX byte 0) is the format field 740, and it contains 0x62 (the unique value used for distinguishing the vector friendly instruction format in one embodiment of the invention).
The second through fourth bytes (EVEX bytes 1-3) include a number of bit fields providing specific capability.
REX field 805 (EVEX byte 1, bits [7-5]) - consists of an EVEX.R bit field (EVEX byte 1, bit [7] - R), an EVEX.X bit field (EVEX byte 1, bit [6] - X), and an EVEX.B bit field (EVEX byte 1, bit [5] - B). The EVEX.R, EVEX.X, and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields, and are encoded using 1s complement form, i.e. ZMM0 is encoded as 1111B and ZMM15 is encoded as 0000B. Other fields of the instructions encode the lower three bits of the register indexes as is known in the art (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb may be formed by adding EVEX.R, EVEX.X, and EVEX.B.
REX' field 710 - this is the first part of the REX' field 710 and is the EVEX.R' bit field (EVEX byte 1, bit [4] - R') that is used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. In one embodiment of the invention, this bit, along with others as indicated below, is stored in bit-inverted format to distinguish it (in the well-known x86 32-bit mode) from the BOUND instruction, whose real opcode byte is 62 but which does not accept the value 11 in the MOD field of the MOD R/M field (described below); alternative embodiments of the invention do not store this bit and the other bits indicated below in inverted format. A value of 1 is used to encode the lower 16 registers. In other words, R'Rrrr is formed by combining EVEX.R', EVEX.R, and the other RRR from other fields.
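The way the five-bit index R'Rrrr is assembled from inverted prefix bits can be sketched as follows. This is a minimal illustration under the assumption that both EVEX.R' and EVEX.R are stored inverted, as the embodiment above describes; the function name and argument names are hypothetical and do not come from the patent.

```python
def evex_reg_index(r_prime_stored: int, r_stored: int, rrr: int) -> int:
    """Combine stored (bit-inverted) EVEX.R' and EVEX.R with the 3-bit rrr field.

    Assumption for illustration: a stored 1 means "lower half", so each
    stored bit is inverted before being used as an index bit.
    """
    hi = (r_prime_stored ^ 1) & 1   # selects upper or lower 16 registers
    mid = (r_stored ^ 1) & 1        # fourth index bit
    return (hi << 4) | (mid << 3) | (rrr & 0b111)
```

With stored bits of 1 and 1 the index stays within registers 0 to 7, while stored bits of 0 and 0 with rrr = 111 select register 31.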
Opcode map field 815 (EVEX byte 1, bits [3:0] - mmmm) - its content encodes an implied leading opcode byte (0F, 0F38 or 0F3).
Data element width field 764 (EVEX byte 2, bit [7] - W) - is represented by the notation EVEX.W. EVEX.W is used to define the granularity (size) of the data type (either 32-bit data elements or 64-bit data elements).
EVEX.vvvv 820 (EVEX byte 2, bits [6:3] - vvvv) - the role of EVEX.vvvv may include the following: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (1s complement) form, and is valid for instructions with two or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in 1s complement form, for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. Thus, the EVEX.vvvv field 820 encodes the 4 low-order bits of the first source register specifier, stored in inverted (1s complement) form. Depending on the instruction, an extra, different EVEX bit field is used to extend the specifier size to 32 registers.
EVEX.U 768 class field (EVEX byte 2, bit [2] - U) - if EVEX.U = 0, it indicates class A or EVEX.U0; if EVEX.U = 1, it indicates class B or EVEX.U1.
Prefix encoding field 825 (EVEX byte 2, bits [1:0] - pp) - provides additional bits for the base operation field. In addition to providing support for the legacy SSE instructions in the EVEX prefix format, this also has the benefit of compacting the SIMD prefix (rather than requiring a byte to express the SIMD prefix, the EVEX prefix requires only 2 bits). In one embodiment, to support legacy SSE instructions that use a SIMD prefix (66H, F2H, F3H) in both the legacy format and the EVEX prefix format, these legacy SIMD prefixes are encoded into the SIMD prefix encoding field, and at runtime they are expanded into the legacy SIMD prefix before being provided to the decoder's PLA (so the PLA can execute both the legacy and the EVEX format of these legacy instructions without modification). Although newer instructions could use the content of the EVEX prefix encoding field directly as an opcode extension, certain embodiments expand it in a similar fashion for consistency, but allow different meanings to be specified by these legacy SIMD prefixes. An alternative embodiment may redesign the PLA to support the 2-bit SIMD prefix encodings, and thus not require the expansion.
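The expansion of the 2-bit pp field back into a legacy SIMD prefix byte can be sketched as a simple table lookup. The passage above names the three legacy prefixes but does not give the bit assignment; the mapping used here (00 = none, 01 = 66H, 10 = F3H, 11 = F2H) is the one used by the published VEX/EVEX encodings and is an assumption with respect to this text. The function names are hypothetical.

```python
# Assumed pp assignment (taken from the published VEX/EVEX encodings,
# not stated in the passage above).
PP_TO_LEGACY_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}
LEGACY_PREFIX_TO_PP = {v: k for k, v in PP_TO_LEGACY_PREFIX.items()}


def expand_pp(pp: int):
    """Expand the 2-bit pp field into the legacy SIMD prefix byte (or None)."""
    return PP_TO_LEGACY_PREFIX[pp & 0b11]


def compress_prefix(prefix_byte):
    """Compress a legacy SIMD prefix byte (or None) into the 2-bit pp field."""
    return LEGACY_PREFIX_TO_PP[prefix_byte]
```

A decoder that expands pp before the PLA stage, as the embodiment describes, would call something like `expand_pp` once per instruction and then treat the result exactly as it treats a legacy prefix byte.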
α field 752 (EVEX byte 3, bit [7] - EH; also known as EVEX.EH, EVEX.rs, EVEX.RL, EVEX.write mask control, and EVEX.N; also illustrated with α) - as previously described, this field is context specific.
β field 754 (EVEX byte 3, bits [6:4] - SSS; also known as EVEX.s2-0, EVEX.r2-0, EVEX.rr1, EVEX.LL0, EVEX.LLB; also illustrated with βββ) - as previously described, this field is context specific.
REX' field 710 - this is the remainder of the REX' field and is the EVEX.V' bit field (EVEX byte 3, bit [3] - V') that may be used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. This bit is stored in bit-inverted format. A value of 1 is used to encode the lower 16 registers. In other words, V'VVVV is formed by combining EVEX.V' and EVEX.vvvv.
Write mask field 770 (EVEX byte 3, bits [2:0] - kkk) - its content specifies the index of a register in the write mask registers, as previously described. In one embodiment of the invention, the specific value EVEX.kkk = 000 has a special behavior implying that no write mask is used for the particular instruction (this may be implemented in a variety of ways, including the use of a write mask hardwired to all ones or hardware that bypasses the masking hardware).
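The bit positions listed for the four prefix bytes can be collected into one small parser. This is a sketch that follows only the positions given in the text (byte 1: R, X, B, R', mmmm; byte 2: W, vvvv, U, pp; byte 3: α, β, V', kkk) and assumes the inverted storage described for R, X, B, R', vvvv, and V'. The function name and the dictionary keys are hypothetical.

```python
def parse_evex_prefix(prefix: bytes) -> dict:
    """Split a four-byte prefix into the fields described in the text."""
    if len(prefix) != 4 or prefix[0] != 0x62:
        raise ValueError("expected four bytes starting with 0x62")
    b1, b2, b3 = prefix[1], prefix[2], prefix[3]

    def inverted(bit: int) -> int:
        # These bits are stored in 1s complement form.
        return (bit ^ 1) & 1

    vvvv = ~(b2 >> 3) & 0xF               # 4 low-order bits, stored inverted
    v_prime = inverted((b3 >> 3) & 1)     # extends vvvv to 32 registers
    return {
        "R": inverted((b1 >> 7) & 1),
        "X": inverted((b1 >> 6) & 1),
        "B": inverted((b1 >> 5) & 1),
        "R_prime": inverted((b1 >> 4) & 1),
        "mmmm": b1 & 0xF,
        "W": (b2 >> 7) & 1,
        "vvvv_index": (v_prime << 4) | vvvv,   # V'VVVV
        "U": (b2 >> 2) & 1,
        "pp": b2 & 0b11,
        "alpha": (b3 >> 7) & 1,
        "beta": (b3 >> 4) & 0b111,
        "kkk": b3 & 0b111,
    }
```

For example, `parse_evex_prefix(bytes([0x62, 0xF1, 0x7C, 0x48]))` yields all-zero extension bits, `mmmm` of 1, `U` of 1, and `beta` of 4.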
Real opcode field 830 (byte 4) is also known as the opcode byte. Part of the opcode is specified in this field.
MOD R/M field 840 (byte 5) includes the MOD field 842, the Reg field 844, and the R/M field 846. As previously described, the content of the MOD field 842 distinguishes between memory access and non-memory access operations. The role of the Reg field 844 can be summarized in two situations: encoding either the destination register operand or a source register operand, or being treated as an opcode extension and not used to encode any instruction operand. The role of the R/M field 846 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.
Scale, Index, Base (SIB) byte (byte 6) - as previously described, the content of the scale field 760 is used for memory address generation. SIB.xxx 854 and SIB.bbb 856 - the contents of these fields have been previously referred to with regard to the register indexes Xxxx and Bbbb.
Displacement field 762A (bytes 7-10) - when the MOD field 842 contains 10, bytes 7-10 are the displacement field 762A, which works the same as the legacy 32-bit displacement (disp32) and works at byte granularity.
Displacement factor field 762B (byte 7) - when the MOD field 842 contains 01, byte 7 is the displacement factor field 762B. The location of this field is the same as that of the legacy x86 instruction set 8-bit displacement (disp8), which works at byte granularity. Since disp8 is sign extended, it can only address offsets between -128 and 127 bytes; in terms of 64-byte cache lines, disp8 uses 8 bits that can be set to only four really useful values: -128, -64, 0, and 64. Since a greater range is often needed, disp32 is used; however, disp32 requires 4 bytes. In contrast to disp8 and disp32, the displacement factor field 762B is a reinterpretation of disp8; when the displacement factor field 762B is used, the actual displacement is determined by multiplying the content of the displacement factor field by the size of the memory operand access (N). This type of displacement is referred to as disp8*N. This reduces the average instruction length (a single byte is used for the displacement, but with a much greater range). Such a compressed displacement is based on the assumption that the effective displacement is a multiple of the granularity of the memory access, and hence the redundant low-order bits of the address offset do not need to be encoded. In other words, the displacement factor field 762B substitutes for the legacy x86 instruction set 8-bit displacement. Thus, the displacement factor field 762B is encoded in the same way as an x86 instruction set 8-bit displacement (so there are no changes in the ModRM/SIB encoding rules), with the only exception that disp8 is overloaded to disp8*N. In other words, there are no changes in the encoding rules or encoding lengths, but only in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size of the memory operand to obtain a byte-wise address offset).
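The disp8*N interpretation can be illustrated with a short sketch. It assumes only what the passage states: the stored byte is sign extended and then multiplied by the memory operand size N. The function names are hypothetical, and the encoder side is an illustrative addition showing when an offset fits in the single byte.

```python
def sign_extend8(byte: int) -> int:
    """Interpret the low 8 bits of `byte` as a signed two's complement value."""
    byte &= 0xFF
    return byte - 0x100 if byte & 0x80 else byte


def disp8n_to_offset(disp8_byte: int, n: int) -> int:
    """Scale the stored displacement factor by the operand size N (in bytes)."""
    return sign_extend8(disp8_byte) * n


def offset_to_disp8n(offset: int, n: int):
    """Return the one-byte factor for `offset`, or None if it cannot be
    expressed as disp8*N (not a multiple of N, or factor out of range)."""
    if offset % n != 0:
        return None
    factor = offset // n
    if not -128 <= factor <= 127:
        return None
    return factor & 0xFF
```

With N = 64, the single byte covers offsets from -8192 to 8128 in steps of 64, instead of the -128 to 127 range of a plain disp8; an offset such as 100 is not a multiple of 64 and would fall back to disp32.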
The immediate field 772 operates as previously described.
Full opcode field
Fig. 8B is a block diagram illustrating the fields of the specific vector friendly instruction format 800 that make up the full opcode field 774 according to one embodiment of the invention. Specifically, the full opcode field 774 includes the format field 740, the base operation field 742, and the data element width (W) field 764. The base operation field 742 includes the prefix encoding field 825, the opcode map field 815, and the real opcode field 830.
Register index field
Fig. 8C is a block diagram illustrating the fields of the specific vector friendly instruction format 800 that make up the register index field 744 according to one embodiment of the invention. Specifically, the register index field 744 includes the REX field 805, the REX' field 810, the MODR/M.reg field 844, the MODR/M.r/m field 846, the VVVV field 820, the xxx field 854, and the bbb field 856.
Augmentation operation field
Fig. 8D is a block diagram illustrating the fields of the specific vector friendly instruction format 800 that make up the augmentation operation field 750 according to one embodiment of the invention. When the class (U) field 768 contains 0, it signifies EVEX.U0 (class A 768A); when it contains 1, it signifies EVEX.U1 (class B 768B). When U = 0 and the MOD field 842 contains 11 (signifying a no memory access operation), the α field 752 (EVEX byte 3, bit [7] - EH) is interpreted as the rs field 752A. When the rs field 752A contains 1 (round 752A.1), the β field 754 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the round control field 754A. The round control field 754A includes a one-bit SAE field 756 and a two-bit round operation field 758. When the rs field 752A contains 0 (data transform 752A.2), the β field 754 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data transform field 754B. When U = 0 and the MOD field 842 contains 00, 01, or 10 (signifying a memory access operation), the α field 752 (EVEX byte 3, bit [7] - EH) is interpreted as the eviction hint (EH) field 752B and the β field 754 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data manipulation field 754C.
When U = 1, the α field 752 (EVEX byte 3, bit [7] - EH) is interpreted as the write mask control (Z) field 752C. When U = 1 and the MOD field 842 contains 11 (signifying a no memory access operation), part of the β field 754 (EVEX byte 3, bit [4] - S0) is interpreted as the RL field 757A; when it contains 1 (round 757A.1), the rest of the β field 754 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the round operation field 759A, while when the RL field 757A contains 0 (VSIZE 757A.2), the rest of the β field 754 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the vector length field 759B (EVEX byte 3, bits [6-5] - L1-0). When U = 1 and the MOD field 842 contains 00, 01, or 10 (signifying a memory access operation), the β field 754 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the vector length field 759B (EVEX byte 3, bits [6-5] - L1-0) and the broadcast field 757B (EVEX byte 3, bit [4] - B).
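The context-dependent reading of the α and β fields can be summarized as a small decision function. This sketch follows the cases described for Fig. 8D; the function name and the dictionary keys are hypothetical labels, and the class A round control field is returned whole because the text does not say which of its three bits is the SAE bit.

```python
def interpret_augmentation(u: int, mod: int, alpha: int, beta: int) -> dict:
    """Interpret alpha (1 bit) and beta (3 bits, S2 S1 S0) from U and MOD."""
    memory_access = mod in (0b00, 0b01, 0b10)   # MOD = 11 means no memory access
    s0 = beta & 0b1            # EVEX byte 3, bit [4]
    s2_1 = (beta >> 1) & 0b11  # EVEX byte 3, bits [6-5]

    if u == 0:  # class A
        if memory_access:
            return {"eviction_hint": alpha, "data_manipulation": beta}
        if alpha == 1:
            return {"rs": 1, "round_control": beta}
        return {"rs": 0, "data_transform": beta}

    # class B: alpha is always the write mask control (Z) field
    fields = {"write_mask_control": alpha}
    if memory_access:
        fields.update(vector_length=s2_1, broadcast=s0)
    elif s0 == 1:
        fields.update(RL=1, round_operation=s2_1)
    else:
        fields.update(RL=0, vector_length=s2_1)
    return fields
```

Keeping the class and MOD tests at the top level mirrors the order in which the passage introduces the cases, which makes it straightforward to check each branch against the description.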
Exemplary register architecture
Fig. 9 is a block diagram of a register architecture 900 according to one embodiment of the invention. In the embodiment illustrated, there are 32 vector registers 910 that are 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower-order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower-order 128 bits of the lower 16 zmm registers (the lower-order 128 bits of the ymm registers) are overlaid on registers xmm0-15. The specific vector friendly instruction format 800 operates on these overlaid register files as illustrated in the table below.
In other words, the vector length field 759B selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding length, and instruction templates without the vector length field 759B operate on the maximum vector length. Further, in one embodiment, the class B instruction templates of the specific vector friendly instruction format 800 operate on packed or scalar single/double-precision floating-point data and packed or scalar integer data. Scalar operations are operations performed on the lowest-order data element position in a zmm/ymm/xmm register; the higher-order data element positions are either left the same as they were prior to the instruction or zeroed, depending on the embodiment.
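The register overlay and the halving of vector lengths can be modeled with plain integers. This is a sketch that treats a zmm register as a 512-bit Python integer and a ymm or xmm register as a view of its low-order bits; the function names are hypothetical, and no particular mapping from vector length field values to lengths is assumed.

```python
ZMM_BITS = 512  # width of a vector register in the illustrated embodiment


def overlay_view(zmm_value: int, width_bits: int) -> int:
    """Return the low-order `width_bits` of a zmm value (the ymm or xmm view)."""
    if width_bits not in (128, 256, ZMM_BITS):
        raise ValueError("expected 128, 256, or 512")
    return zmm_value & ((1 << width_bits) - 1)


def selectable_lengths(max_bits: int = ZMM_BITS, shorter: int = 2) -> list:
    """Maximum length followed by `shorter` lengths, each half the previous."""
    return [max_bits >> i for i in range(shorter + 1)]
```

Because the shorter registers are views of the same storage, writing the low 128 bits of a zmm value changes what both the ymm and xmm views return, which is the overlay behavior the passage describes.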
Write mask registers 915 - in the embodiment illustrated, there are 8 write mask registers (k0 through k7), each 64 bits in size. In an alternative embodiment, the write mask registers 915 are 16 bits in size. As previously described, in one embodiment of the invention the vector mask register k0 cannot be used as a write mask; when the encoding that would normally indicate k0 is used for a write mask, it selects a hardwired write mask of 0xFFFF, effectively disabling write masking for that instruction.
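The selection of the effective write mask can be sketched as follows. The first function reflects the behavior described above (an encoding of k0 selects the hardwired mask 0xFFFF). The second function is an illustrative assumption about how a mask might then be applied per element, with merging and zeroing variants; it is not taken from this passage. All names are hypothetical.

```python
def effective_write_mask(kkk: int, k_regs: list, hardwired: int = 0xFFFF) -> int:
    """Return the mask selected by kkk; the k0 encoding selects the hardwired mask."""
    if not 0 <= kkk <= 7:
        raise ValueError("kkk is a 3-bit field")
    return hardwired if kkk == 0 else k_regs[kkk]


def apply_write_mask(dest: list, result: list, mask: int, zeroing: bool = False) -> list:
    """Write result elements whose mask bit is 1 (assumed behavior).

    Elements whose mask bit is 0 keep the old destination value (merging)
    or become 0 (zeroing).
    """
    out = []
    for i, (old, new) in enumerate(zip(dest, result)):
        if (mask >> i) & 1:
            out.append(new)
        else:
            out.append(0 if zeroing else old)
    return out
```

With a 16-element vector, the hardwired mask 0xFFFF has every bit set, so every element of the result is written and masking is effectively disabled.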
General-purpose registers 925 - in the embodiment illustrated, there are sixteen 64-bit general-purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
Scalar floating-point stack register file (x87 stack) 945, on which is aliased the MMX packed integer flat register file 950 - in the embodiment illustrated, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.
Exemplary core architectures, processors, and computer architectures
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; 3) a special-purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special-purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special-purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special-purpose cores); and 4) a system on a chip that may include, on the same die, the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
Exemplary core architectures
In-order and out-of-order core block diagram
Figure 10A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention. Figure 10B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid-lined boxes in Figures 10A-B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed-lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.
In Figure 10A, a processor pipeline 1000 includes a fetch stage 1002, a length decode stage 1004, a decode stage 1006, an allocation stage 1008, a renaming stage 1010, a scheduling (also known as dispatch or issue) stage 1012, a register read/memory read stage 1014, an execute stage 1016, a write back/memory write stage 1018, an exception handling stage 1022, and a commit stage 1024.
Figure 10B shows a processor core 1090 including a front end unit 1030 coupled to an execution engine unit 1050, both of which are coupled to a memory unit 1070. The core 1090 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1090 may be a special-purpose core, such as a network or communication core, a compression engine, a coprocessor core, a general-purpose computing graphics processing unit (GPGPU) core, a graphics core, or the like.
The front end unit 1030 includes a branch prediction unit 1032 coupled to an instruction cache unit 1034, which is coupled to an instruction translation lookaside buffer (TLB) 1036, which is coupled to an instruction fetch unit 1038, which is coupled to a decode unit 1040. The decode unit 1040 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1040 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), and so on. In one embodiment, the core 1090 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (for example, in the decode unit 1040 or otherwise within the front end unit 1030). The decode unit 1040 is coupled to a rename/allocator unit 1052 in the execution engine unit 1050.
The execution engine unit 1050 includes the rename/allocator unit 1052 coupled to a retirement unit 1054 and a set of one or more scheduler units 1056. The scheduler unit 1056 represents any number of different schedulers, including reservation stations, a central instruction window, and so on. The scheduler unit 1056 is coupled to the physical register file unit 1058. Each physical register file unit 1058 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (for example, an instruction pointer that is the address of the next instruction to be executed), and so on. In one embodiment, the physical register file unit 1058 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file unit 1058 is overlapped by the retirement unit 1054 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (for example, using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; using register maps and a pool of registers; and so on). The retirement unit 1054 and the physical register file unit 1058 are coupled to the execution cluster 1060. The execution cluster 1060 includes a set of one or more execution units 1062 and a set of one or more memory access units 1064. The execution units 1062 may perform various operations (for example, shifts, addition, subtraction, multiplication) on various types of data (for example, scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit 1056, the physical register file unit 1058, and the execution cluster 1060 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (for example, a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access units 1064). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order issue/execution.
The set of memory access units 1064 is coupled to the memory unit 1070, which includes a data TLB unit 1072 coupled to a data cache unit 1074, which is in turn coupled to a level 2 (L2) cache unit 1076. In one exemplary embodiment, the memory access units 1064 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1072 in the memory unit 1070. The instruction cache unit 1034 is further coupled to the level 2 (L2) cache unit 1076 in the memory unit 1070. The L2 cache unit 1076 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1000 as follows: 1) the instruction fetch unit 1038 performs the fetch and length decode stages 1002 and 1004; 2) the decode unit 1040 performs the decode stage 1006; 3) the rename/allocator unit 1052 performs the allocation stage 1008 and the renaming stage 1010; 4) the scheduler unit 1056 performs the scheduling stage 1012; 5) the physical register file unit 1058 and the memory unit 1070 perform the register read/memory read stage 1014, and the execution cluster 1060 performs the execute stage 1016; 6) the memory unit 1070 and the physical register file unit 1058 perform the write back/memory write stage 1018; 7) various units may be involved in the exception handling stage 1022; and 8) the retirement unit 1054 and the physical register file unit 1058 perform the commit stage 1024.
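The stage-to-unit assignment just listed can be written down as a lookup table, which makes it easy to ask which stages a given unit participates in. This is only a restatement of the list above as data; the table name and helper function are hypothetical, and the exception handling stage is left without specific units because the text says only that various units may be involved.

```python
# (stage reference numeral, stage name, units that perform it)
PIPELINE_1000 = [
    (1002, "fetch", ("instruction fetch unit 1038",)),
    (1004, "length decode", ("instruction fetch unit 1038",)),
    (1006, "decode", ("decode unit 1040",)),
    (1008, "allocation", ("rename/allocator unit 1052",)),
    (1010, "renaming", ("rename/allocator unit 1052",)),
    (1012, "scheduling", ("scheduler unit 1056",)),
    (1014, "register read/memory read",
     ("physical register file unit 1058", "memory unit 1070")),
    (1016, "execute", ("execution cluster 1060",)),
    (1018, "write back/memory write",
     ("memory unit 1070", "physical register file unit 1058")),
    (1022, "exception handling", ()),  # "various units" in the text
    (1024, "commit", ("retirement unit 1054", "physical register file unit 1058")),
]


def stages_using(unit: str) -> list:
    """Return the names of the pipeline stages that the given unit performs."""
    return [name for _, name, units in PIPELINE_1000 if unit in units]
```

For instance, `stages_using("memory unit 1070")` returns the register read/memory read and write back/memory write stages.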
The core 1090 may support one or more instruction sets (for example, the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California), including the instructions described herein. In one embodiment, the core 1090 includes logic to support a packed data instruction set extension (for example, AVX1, AVX2, and/or some form of the generic vector friendly instruction format (U = 0 and/or U = 1) previously described), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (for example, time-sliced fetching and decoding and simultaneous multithreading thereafter, such as with Hyper-Threading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1034/1074 and a shared L2 cache unit 1076, alternative embodiments may have a single internal cache for both instructions and data, such as a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Specific exemplary in-order core architecture
Figures 11A-B show a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. Depending on the application, the logic blocks communicate through a high-bandwidth interconnect network (for example, a ring network) with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic.
Figure 11A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1102 and its local subset of the level 2 (L2) cache 1104, according to embodiments of the invention. In one embodiment, an instruction decoder 1100 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 1106 allows low-latency accesses to cache memory by the scalar and vector units. While in one embodiment (to simplify the design) a scalar unit 1108 and a vector unit 1110 use separate register sets (scalar registers 1112 and vector registers 1114, respectively), and data transferred between them is written to memory and then read back in from the level 1 (L1) cache 1106, alternative embodiments of the invention may use a different approach (for example, a single register set, or a communication path that allows data to be transferred between the two register files without being written and read back).
The local subset of the L2 cache 1104 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1104. Data read by a processor core is stored in its L2 cache subset 1104 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1104 and is flushed from other subsets if necessary. The ring network ensures coherency for shared data. The ring network is bidirectional, allowing agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.
Figure 11B is an expanded view of part of the processor core in Figure 11A according to embodiments of the invention. Figure 11B includes an L1 data cache 1106A, part of the L1 cache 1104, as well as more detail regarding the vector unit 1110 and the vector registers 1114. Specifically, the vector unit 1110 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1128), which executes one or more of integer, single-precision floating-point, and double-precision floating-point instructions. The VPU supports swizzling of the register inputs with swizzle unit 1120, numeric conversion with numeric convert units 1122A-B, and replication of the memory input with replication unit 1124. Write mask registers 1126 allow predication of the resulting vector writes.
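By way of illustration only, the three VPU input and output features named above (swizzling a register input, replicating a memory input, and predicating the write with a mask) can be modeled in a few lines of Python. This is a minimal sketch and not the hardware described in Figure 11B: the function names, the index-list form of the swizzle pattern, and the choice that masked-off elements keep their old destination value are all assumptions made for the example.

```python
from typing import List, Sequence

WIDTH = 16  # the text describes a 16-wide VPU


def swizzle(reg: Sequence[int], pattern: Sequence[int]) -> List[int]:
    """Reorder a register input: output[i] = reg[pattern[i]] (assumed form)."""
    return [reg[p] for p in pattern]


def replicate(mem_values: Sequence[int], width: int = WIDTH) -> List[int]:
    """Repeat a short memory operand until it fills the vector width."""
    if width % len(mem_values):
        raise ValueError("operand length must divide the vector width")
    return list(mem_values) * (width // len(mem_values))


def masked_write(dest: Sequence[int], result: Sequence[int], mask: int) -> List[int]:
    """Write result[i] only where bit i of the mask is set.

    Elements whose mask bit is clear keep the old destination value
    (an assumption; other policies, such as zeroing, are possible).
    """
    return [r if (mask >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]


# Swap neighbouring elements of a register, add a replicated scalar,
# and commit only the low eight results.
reg = list(range(WIDTH))
pair_swap = [i ^ 1 for i in range(WIDTH)]
summed = [a + b for a, b in zip(swizzle(reg, pair_swap), replicate([100]))]
out = masked_write([0] * WIDTH, summed, 0x00FF)
```

In this sketch the upper eight elements of `out` remain zero because their mask bits are clear.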
Processor with integrated memory controller and graphics devices
Figure 12 is a block diagram of a processor 1200 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to embodiments of the invention. The solid-lined boxes in Figure 12 illustrate a processor 1200 with a single core 1202A, a system agent 1210, and a set of one or more bus controller units 1216, while the optional addition of the dashed-lined boxes illustrates an alternative processor 1200 with multiple cores 1202A-N, a set of one or more integrated memory controller units 1214 in the system agent unit 1210, and special-purpose logic 1208.
Thus, different implementations of the processor 1200 may include: 1) a CPU with the special-purpose logic 1208 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1202A-N being one or more general-purpose cores (for example, general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 1202A-N being a large number of special-purpose cores intended primarily for graphics and/or scientific (throughput) workloads; and 3) a coprocessor with the cores 1202A-N being a large number of general-purpose in-order cores. Thus, the processor 1200 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1200 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1206, and external memory (not shown) coupled to the set of integrated memory controller units 1214. The set of shared cache units 1206 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1212 interconnects the integrated graphics logic 1208, the set of shared cache units 1206, and the system agent unit 1210/integrated memory controller unit(s) 1214, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1206 and the cores 1202A-N.
In some embodiments, one or more of the cores 1202A-N are capable of multithreading. The system agent 1210 includes those components that coordinate and operate the cores 1202A-N. The system agent unit 1210 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or may include the logic and components needed to regulate the power state of the cores 1202A-N and the integrated graphics logic 1208. The display unit drives one or more externally connected displays.
The cores 1202A-N may be homogeneous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1202A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
Exemplary computer architecture
Figures 13-16 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems and electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.
Referring now to Figure 13, shown is a block diagram of a system 1300 in accordance with one embodiment of the present invention. The system 1300 may include one or more processors 1310, 1315, which are coupled to a controller hub 1320. In one embodiment, the controller hub 1320 includes a graphics memory controller hub (GMCH) 1390 and an input/output hub (IOH) 1350 (which may be on separate chips); the GMCH 1390 includes memory and graphics controllers to which memory 1340 and a coprocessor 1345 are coupled; the IOH 1350 couples input/output (I/O) devices 1360 to the GMCH 1390. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1340 and the coprocessor 1345 are coupled directly to the processor 1310, and the controller hub 1320 is in a single chip with the IOH 1350.
The optional nature of the additional processors 1315 is denoted in Figure 13 with broken lines. Each processor 1310, 1315 may include one or more of the processing cores described herein and may be some version of the processor 1200.
The memory 1340 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1320 communicates with the processors 1310, 1315 via a multi-drop bus such as a frontside bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1395.
In one embodiment, the coprocessor 1345 is a special-purpose processor, such as a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1320 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 1310, 1315 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics.
In one embodiment, the processor 1310 executes instructions that control data processing operations of a general type. Coprocessor instructions may be embedded within these instructions. The processor 1310 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1345. Accordingly, the processor 1310 issues these coprocessor instructions (or control signals representing the coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1345. The coprocessor 1345 accepts and executes the received coprocessor instructions.
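The division of work just described can be sketched, purely as an illustration, with two small Python classes. The sketch assumes that coprocessor instructions are recognizable by a textual `cp.` prefix and that the interconnect is a direct method call; both are hypothetical stand-ins, since the specification does not say how the instructions are encoded or transported.

```python
from dataclasses import dataclass, field
from typing import List

COPROCESSOR_PREFIX = "cp."  # hypothetical marker, not from the specification


@dataclass
class Coprocessor:
    """Stands in for coprocessor 1345: accepts and executes what it receives."""
    executed: List[str] = field(default_factory=list)

    def accept(self, instruction: str) -> None:
        self.executed.append(instruction)


@dataclass
class HostProcessor:
    """Stands in for processor 1310: runs general instructions itself."""
    coprocessor: Coprocessor
    executed: List[str] = field(default_factory=list)

    def run(self, stream: List[str]) -> None:
        for instruction in stream:
            if instruction.startswith(COPROCESSOR_PREFIX):
                # Recognized as a coprocessor-type instruction: issue it
                # over the (modeled) interconnect instead of executing it.
                self.coprocessor.accept(instruction)
            else:
                self.executed.append(instruction)


coprocessor = Coprocessor()
host = HostProcessor(coprocessor)
host.run(["add", "cp.matmul", "load", "cp.compress"])
```

After the call, `host.executed` holds the two general instructions and `coprocessor.executed` holds the two forwarded ones, each in program order.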
Referring now to Figure 14, shown is a block diagram of a first more specific exemplary system 1400 in accordance with an embodiment of the present invention. As shown in Figure 14, the multiprocessor system 1400 is a point-to-point interconnect system and includes a first processor 1470 and a second processor 1480 coupled via a point-to-point interconnect 1450. Each of the processors 1470 and 1480 may be some version of the processor 1200. In one embodiment of the invention, the processors 1470 and 1480 are respectively the processors 1310 and 1315, while the coprocessor 1438 is the coprocessor 1345. In another embodiment, the processors 1470 and 1480 are respectively the processor 1310 and the coprocessor 1345.
The processors 1470 and 1480 are shown including integrated memory controller (IMC) units 1472 and 1482, respectively. The processor 1470 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 1476 and 1478; similarly, the second processor 1480 includes P-P interfaces 1486 and 1488. The processors 1470, 1480 may exchange information via a point-to-point (P-P) interface 1450 using P-P interface circuits 1478, 1488. As shown in Figure 14, the IMCs 1472 and 1482 couple the processors to respective memories, namely a memory 1432 and a memory 1434, which may be portions of main memory locally attached to the respective processors.
The processors 1470, 1480 may each exchange information with a chipset 1490 via individual P-P interfaces 1452, 1454 using point-to-point interface circuits 1476, 1494, 1486, 1498. The chipset 1490 may optionally exchange information with the coprocessor 1438 via a high-performance interface 1439. In one embodiment, the coprocessor 1438 is a special-purpose processor, such as a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low-power mode.
The chipset 1490 may be coupled to a first bus 1416 via an interface 1496. In one embodiment, the first bus 1416 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in Figure 14, various I/O devices 1414 may be coupled to the first bus 1416, along with a bus bridge 1418 that couples the first bus 1416 to a second bus 1420. In one embodiment, one or more additional processors 1415, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (for example, graphics accelerators or digital signal processing (DSP) units), field-programmable gate arrays, or any other processor, are coupled to the first bus 1416. In one embodiment, the second bus 1420 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1420 including, in one embodiment, a keyboard and/or mouse 1422, communication devices 1427, and a storage unit 1428 such as a disk drive or other mass storage device that may include instructions/code and data 1430. Further, an audio I/O 1424 may be coupled to the second bus 1420. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 14, a system may implement a multi-drop bus or other such architecture.
Referring now to Figure 15, shown is a block diagram of a second more specific exemplary system 1500 in accordance with an embodiment of the present invention. Like elements in Figures 14 and 15 bear like reference numerals, and certain aspects of Figure 14 have been omitted from Figure 15 in order to avoid obscuring other aspects of Figure 15.
Figure 15 illustrates that the processors 1470, 1480 may include integrated memory and I/O control logic ("CL") 1472 and 1482, respectively. Thus, the CL 1472, 1482 include integrated memory controller units and include I/O control logic. Figure 15 illustrates that not only are the memories 1432, 1434 coupled to the CL 1472, 1482, but also that I/O devices 1514 are coupled to the control logic 1472, 1482. Legacy I/O devices 1515 are coupled to the chipset 1490.
Referring now to Figure 16, shown is a block diagram of a SoC 1600 in accordance with an embodiment of the present invention. Similar elements in Figure 12 bear like reference numerals. Also, dashed-lined boxes are optional features on more advanced SoCs. In Figure 16, an interconnect unit 1602 is coupled to: an application processor 1610, which includes a set of one or more cores 202A-N and shared cache units 1206; a system agent unit 1210; a bus controller unit 1216; an integrated memory controller unit 1214; a set of one or more coprocessors 1620, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1630; a direct memory access (DMA) unit 1632; and a display unit 1640 for coupling to one or more external displays. In one embodiment, the coprocessors 1620 include a special-purpose processor, such as a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as the code 1430 illustrated in Figure 14, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.
Emulation (including binary translation, code morphing, etc.)
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (for example, using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or part on and part off the processor.
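As a rough illustration of converting one source instruction into one or more target instructions, the following Python sketch uses a lookup table of expansion rules. It is a toy model of static translation only: the mnemonics in both instruction sets are invented for the example and do not correspond to x86, MIPS, ARM, or any other real instruction set, and a practical converter would also have to handle encodings, flags, memory models, and control flow.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Instruction = Tuple[str, ...]

# Hypothetical expansion rules: each source mnemonic maps to a function
# that returns the equivalent target-instruction sequence.
TRANSLATION_TABLE: Dict[str, Callable[[Sequence[str]], List[Instruction]]] = {
    # A source "inc r" becomes one target add-immediate.
    "inc": lambda ops: [("addi", ops[0], ops[0], "1")],
    # A source "push r" needs two target instructions.
    "push": lambda ops: [("addi", "sp", "sp", "-8"), ("store", ops[0], "sp")],
    # A source "mov dst, src" is expressed as an add of zero.
    "mov": lambda ops: [("addi", ops[0], ops[1], "0")],
}


def convert(program: Sequence[Instruction]) -> List[Instruction]:
    """Translate a source program into target instructions, in order."""
    converted: List[Instruction] = []
    for mnemonic, *operands in program:
        if mnemonic not in TRANSLATION_TABLE:
            raise ValueError(f"no translation rule for {mnemonic!r}")
        converted.extend(TRANSLATION_TABLE[mnemonic](operands))
    return converted


target = convert([("push", "r1"), ("inc", "r2")])
```

The table-driven form makes the one-to-many nature of the conversion visible: the two source instructions above expand to three target instructions.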
Figure 17 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 17 shows that a program in a high-level language 1702 may be compiled using an x86 compiler 1704 to generate x86 binary code 1706 that may be natively executed by a processor 1716 with at least one x86 instruction set core. The processor 1716 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing 1) a substantial portion of the instruction set of the Intel x86 instruction set core or 2) object code versions of applications or other programs targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1704 represents a compiler operable to generate x86 binary code 1706 (for example, object code) that can, with or without additional linkage processing, be executed on the processor 1716 with at least one x86 instruction set core. Similarly, Figure 17 shows that the program in the high-level language 1702 may be compiled using an alternative instruction set compiler 1708 to generate alternative instruction set binary code 1710 that may be natively executed by a processor 1714 without at least one x86 instruction set core (for example, a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California, and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1712 is used to convert the x86 binary code 1706 into code that may be natively executed by the processor 1714 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1710, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1712 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1706.
Claims (27)
1. A method of performing, in a computer processor, a vector packed butterfly horizontal cross add or subtract of packed data elements in response to a single vector packed butterfly horizontal cross add or subtract instruction, the single vector packed butterfly horizontal cross add or subtract instruction including a destination vector register operand, a source vector register operand, an immediate, and an opcode, the method comprising the steps of:
executing the single vector packed butterfly horizontal cross add or subtract instruction to calculate, for each data lane of the source vector register, horizontal and cross additions or subtractions between packed data elements of the source vector register, wherein whether an addition or a subtraction is performed depends on bits in the immediate of the instruction; and
storing each horizontal and cross addition or subtraction result between packed data elements into the destination register,
wherein, for the least significant data element position of the destination register, the stored result is the result of adding the least significant data element of the source register to the second least significant data element of the source register, or the result of subtracting the least significant data element of the source register from the second least significant data element of the source register, wherein the determination of addition or subtraction is based on the least significant bit of the immediate;
wherein, for the second least significant data element position of the destination register, the stored result is the result of adding the third least significant data element of the source register to the fourth least significant data element of the source register, or the result of subtracting the third least significant data element of the source register from the fourth least significant data element of the source register, wherein the determination of addition or subtraction is based on the third least significant bit of the immediate;
wherein, for the third least significant data element position of the destination register, the stored result is the result of adding the second least significant data element of the source register to the least significant data element of the source register, or the result of subtracting the second least significant data element of the source register from the least significant data element of the source register, wherein the determination of addition or subtraction is based on the second least significant bit of the immediate; and
wherein, for the fourth least significant data element position of the destination register, the stored result is the result of adding the fourth least significant data element of the source register to the third least significant data element of the source register, or the result of subtracting the fourth least significant data element of the source register from the third least significant data element of the source register, wherein the determination of addition or subtraction is based on the fourth least significant bit of the immediate.
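For illustration only, the element-by-element mapping recited in claim 1 can be modeled for a single data lane of four elements as follows. This is a reader's sketch, not the claimed hardware. Two points are assumptions: the claim does not state which bit value selects addition and which selects subtraction, so a clear bit is taken to mean add and a set bit to mean subtract, and the sketch uses unbounded Python integers rather than modeling overflow at a fixed element width. The function name `butterfly_lane` is hypothetical.

```python
from typing import List, Sequence


def butterfly_lane(src: Sequence[int], imm: int) -> List[int]:
    """Model one lane: src[0] is the least significant data element."""
    if len(src) != 4:
        raise ValueError("this sketch models exactly four elements per lane")

    def combine(a: int, b: int, bit: int) -> int:
        # Either a + b, or a subtracted from b, chosen by one immediate bit.
        # ASSUMPTION: bit clear selects addition, bit set selects subtraction.
        return b - a if (imm >> bit) & 1 else a + b

    return [
        combine(src[0], src[1], 0),  # dst[0]: s0 + s1 or s1 - s0, imm bit 0
        combine(src[2], src[3], 2),  # dst[1]: s2 + s3 or s3 - s2, imm bit 2
        combine(src[1], src[0], 1),  # dst[2]: s1 + s0 or s0 - s1, imm bit 1
        combine(src[3], src[2], 3),  # dst[3]: s3 + s2 or s2 - s3, imm bit 3
    ]


all_adds = butterfly_lane([1, 2, 3, 4], 0b0000)
all_subs = butterfly_lane([1, 2, 3, 4], 0b1111)
```

Note the crossed bit assignment taken from the claim: the second destination position is controlled by the third immediate bit, and the third destination position by the second immediate bit.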
2. The method of claim 1, wherein there are a plurality of data lanes.
3. The method of claim 1, wherein the number of data lanes to be processed depends on the size of the destination vector register.
4. The method of claim 1, wherein the size of the source vector register and the destination vector register is 128, 256, or 512 bits.
5. The method of claim 1, wherein the size of the packed data elements of the source register and the destination register is 8, 16, 32, or 64 bits.
6. The method of claim 5, wherein the size of the packed data elements of the source is defined by the opcode.
7. The method of claim 1, wherein the immediate is an 8-bit value.
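The relationship between register size, element size, and lane count in claims 3 to 5 can be sketched as simple arithmetic. The sketch below assumes a fixed lane width of 128 bits, which is not stated in these claims and is chosen here only because it yields four 32-bit elements per lane, matching the four element positions of claim 1. The function name `lane_layout` and its parameters are hypothetical.

```python
from typing import Tuple

VALID_REGISTER_BITS = (128, 256, 512)  # sizes listed in claim 4
VALID_ELEMENT_BITS = (8, 16, 32, 64)   # sizes listed in claim 5


def lane_layout(register_bits: int, element_bits: int,
                lane_bits: int = 128) -> Tuple[int, int]:
    """Return (number of lanes, elements per lane).

    ASSUMPTION: lanes are a fixed 128 bits wide; the claims only say the
    lane count depends on the destination register size.
    """
    if register_bits not in VALID_REGISTER_BITS:
        raise ValueError(f"unsupported register size: {register_bits}")
    if element_bits not in VALID_ELEMENT_BITS:
        raise ValueError(f"unsupported element size: {element_bits}")
    return register_bits // lane_bits, lane_bits // element_bits


lanes_512, per_lane_512 = lane_layout(512, 32)
```

Under this assumption a 512-bit register with 32-bit elements has four lanes of four elements each, while a 128-bit register has a single lane.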
8. A computer system, comprising:
a storage unit to store an instruction, wherein the format of the instruction specifies a vector register and an immediate as its source operands and specifies a single destination vector register as its destination, and wherein the instruction format includes an opcode; and
a processor coupled to the storage unit, the processor comprising:
a decode unit to decode the instruction; and
an execution unit to, in response to the decoded instruction, calculate, for each data lane of the source vector register, horizontal and cross additions or subtractions between packed data elements of the source vector register, wherein whether an addition or a subtraction is performed depends on bits in the immediate of the instruction, and to store each horizontal and cross addition or subtraction result between packed data elements into the destination register,
wherein, for the least significant data element position of the destination register, the stored result is the result of adding the least significant data element of the source register to the second least significant data element of the source register, or the result of subtracting the least significant data element of the source register from the second least significant data element of the source register, wherein the determination of addition or subtraction is based on the least significant bit of the immediate;
wherein, for the second least significant data element position of the destination register, the stored result is the result of adding the third least significant data element of the source register to the fourth least significant data element of the source register, or the result of subtracting the third least significant data element of the source register from the fourth least significant data element of the source register, wherein the determination of addition or subtraction is based on the third least significant bit of the immediate;
wherein, for the third least significant data element position of the destination register, the stored result is the result of adding the second least significant data element of the source register to the least significant data element of the source register, or the result of subtracting the second least significant data element of the source register from the least significant data element of the source register, wherein the determination of addition or subtraction is based on the second least significant bit of the immediate; and
wherein, for the fourth least significant data element position of the destination register, the stored result is the result of adding the fourth least significant data element of the source register to the third least significant data element of the source register, or the result of subtracting the fourth least significant data element of the source register from the third least significant data element of the source register, wherein the determination of addition or subtraction is based on the fourth least significant bit of the immediate.
9. The system of claim 8, wherein there are a plurality of data lanes.
10. The system of claim 8, wherein the number of data lanes to be processed depends on the size of the destination vector register.
11. The system of claim 8, wherein the size of the source vector register and the destination vector register is 128, 256, or 512 bits.
12. The system of claim 8, wherein the size of the packed data elements of the source register and the destination register is 8, 16, 32, or 64 bits.
13. The system of claim 8, wherein the size of the packed data elements of the source is defined by the opcode.
14. The system of claim 8, wherein the immediate is an 8-bit value.
15. a kind of instruction processing unit, including:
Hardware decoder, for decoding, single vector is packaged butterfly lateral cross addition or subtraction instruction, the single vector are beaten
Packet butterfly lateral cross addition or subtraction instruction include destination vector registor operand, source vector register operand, stand
That is number and command code;
Execution logic unit, in response to decoded instruction, for each data channel of source vector register, calculating source
Transverse direction between the packaged data element of vector registor and intersect addition or subtraction, wherein being that the judgement of addition or subtraction takes
Certainly the position in the immediate of described instruction and by each transverse direction between packaged data element and intersect addition or subtraction knot
In fruit storage to destination register,
Wherein for the least significant data element position of destination register, storage the result is that source register it is minimum effectively
Result that data element is added with the second least significant data element of source register or from the second of source register it is minimum effectively
Data element subtracts the least significant data element of source register as a result, wherein the judgement of addition or subtraction is based on immediate
Least significant bit;
Wherein for the second least significant data element position of destination register, storage the result is that the third of source register
The result that least significant data element is added with the 4th least significant data element of source register or the from source register the 4th
Least significant data element subtracts the third least significant data element as a result, the wherein judgement of addition or subtraction of source register
Third least significant bit based on immediate;
wherein, for the third least significant data element position of the destination register, the stored result is the result of adding the second least significant data element of the source register to the least significant data element of the source register, or of subtracting the second least significant data element of the source register from the least significant data element of the source register, wherein the determination of addition or subtraction is based on the second least significant bit of the immediate; and
wherein, for the fourth most significant data element position of the destination register, the stored result is the result of adding the fourth least significant data element of the source register to the third least significant data element of the source register, or of subtracting the fourth least significant data element of the source register from the third least significant data element of the source register, wherein the determination of addition or subtraction is based on the fourth least significant bit of the immediate.
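The four "wherein" clauses above can be read as a small per-lane mapping. The following is a minimal Python sketch of that mapping for one four-element data lane, not the patented implementation. The function name `butterfly_lane`, the `elem_bits` parameter, the convention that a set immediate bit selects subtraction (a clear bit selects addition), and the wrap-around to the element width are all assumptions made for illustration; the claim only states which immediate bit governs each destination position.

```python
def butterfly_lane(src, imm, elem_bits=32):
    """Sketch of the claimed per-lane operation on four packed elements.

    src       -- four integers, src[0] being the least significant element
    imm       -- immediate; bits 0..3 choose add or subtract per position
    elem_bits -- element width used for wrap-around (assumption)

    Assumption: immediate bit = 1 means subtract, 0 means add.
    """
    if len(src) != 4:
        raise ValueError("this sketch models exactly four elements per lane")
    mask = (1 << elem_bits) - 1
    s0, s1, s2, s3 = src
    # (minuend, subtrahend, immediate bit index) for destination positions 0..3,
    # following the order of the "wherein" clauses in the claim.
    plan = [
        (s1, s0, 0),  # dest[0]: s0 + s1, or s1 - s0; least significant bit
        (s3, s2, 2),  # dest[1]: s2 + s3, or s3 - s2; third least significant bit
        (s0, s1, 1),  # dest[2]: s1 + s0, or s0 - s1; second least significant bit
        (s2, s3, 3),  # dest[3]: s3 + s2, or s2 - s3; fourth least significant bit
    ]
    dest = []
    for a, b, bit in plan:
        subtract = (imm >> bit) & 1
        value = a - b if subtract else a + b
        dest.append(value & mask)  # wrap to the element width (assumption)
    return dest
```

Under these assumptions, `butterfly_lane([1, 2, 3, 4], 0b0100)` returns `[3, 1, 3, 7]`: only the third least significant immediate bit is set, so only the second destination position holds a difference (4 - 3) while the others hold sums.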
16. The device of claim 15, wherein the size of the source vector register and the destination vector register is 128, 256, or 512 bits.
17. The device of claim 15, wherein the size of the packed data elements of the source register and the destination register is 8, 16, 32, or 64 bits.
18. The device of claim 15, wherein the size of the packed data elements of the source is defined by the opcode.
19. The device of claim 15, wherein the immediate is an 8-bit value.
20. A machine-readable storage medium comprising code that, when executed, causes a machine to perform the method of any one of claims 1-7.
21. Equipment for performing, in a computer processor, a vector packed butterfly horizontal cross add or subtract of packed data elements in response to a single vector packed butterfly horizontal cross add or subtract instruction, the single vector packed butterfly horizontal cross add or subtract instruction including a destination vector register operand, a source vector register operand, an immediate, and an opcode, the equipment comprising:
means for executing the single vector packed butterfly horizontal cross add or subtract instruction to calculate, for each data lane of the source vector register, horizontal and cross additions or subtractions between the packed data elements of the source vector register, wherein whether an addition or a subtraction is performed depends on bits in the immediate of the instruction; and
means for storing the result of each horizontal and cross addition or subtraction between packed data elements in the destination register,
wherein, for the least significant data element position of the destination register, the stored result is the result of adding the least significant data element of the source register to the second least significant data element of the source register, or of subtracting the least significant data element of the source register from the second least significant data element of the source register, wherein the determination of addition or subtraction is based on the least significant bit of the immediate;
wherein, for the second least significant data element position of the destination register, the stored result is the result of adding the third least significant data element of the source register to the fourth least significant data element of the source register, or of subtracting the third least significant data element of the source register from the fourth least significant data element of the source register, wherein the determination of addition or subtraction is based on the third least significant bit of the immediate;
wherein, for the third least significant data element position of the destination register, the stored result is the result of adding the second least significant data element of the source register to the least significant data element of the source register, or of subtracting the second least significant data element of the source register from the least significant data element of the source register, wherein the determination of addition or subtraction is based on the second least significant bit of the immediate; and
wherein, for the fourth most significant data element position of the destination register, the stored result is the result of adding the fourth least significant data element of the source register to the third least significant data element of the source register, or of subtracting the fourth least significant data element of the source register from the third least significant data element of the source register, wherein the determination of addition or subtraction is based on the fourth least significant bit of the immediate.
22. The equipment of claim 21, wherein there are multiple data lanes.
23. The equipment of claim 21, wherein the number of data lanes to be processed depends on the size of the destination vector register.
24. The equipment of claim 21, wherein the size of the source vector register and the destination vector register is 128, 256, or 512 bits.
25. The equipment of claim 21, wherein the size of the packed data elements of the source register and the destination register is 8, 16, 32, or 64 bits.
26. The equipment of claim 25, wherein the size of the packed data elements of the source is defined by the opcode.
27. The equipment of claim 21, wherein the immediate is an 8-bit value.
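Claims 23 to 25 tie the number of data lanes to the register size and list the permitted register and element sizes. The sketch below shows one way that relationship could be computed. It assumes that a data lane holds exactly four packed data elements, which matches the four positions described in claim 21 but is not stated as a lane width in the claims; the names `ELEMENTS_PER_LANE` and `lane_count` are hypothetical.

```python
# Assumption: one data lane holds four packed data elements, matching the
# four destination positions described in the claims.
ELEMENTS_PER_LANE = 4

REGISTER_SIZES = (128, 256, 512)  # bits, as listed in claims 16 and 24
ELEMENT_SIZES = (8, 16, 32, 64)   # bits, as listed in claims 17 and 25


def lane_count(register_bits, elem_bits):
    """Return how many four-element lanes fit in a vector register."""
    if register_bits not in REGISTER_SIZES:
        raise ValueError("register size must be 128, 256, or 512 bits")
    if elem_bits not in ELEMENT_SIZES:
        raise ValueError("element size must be 8, 16, 32, or 64 bits")
    lane_bits = ELEMENTS_PER_LANE * elem_bits
    if lane_bits > register_bits:
        raise ValueError("register too small for one four-element lane")
    return register_bits // lane_bits
```

Under this assumption, a 128-bit register with 32-bit elements has one lane, and a 512-bit register with the same element size has four, which illustrates how the lane count grows with the destination register size.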
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2011/067183 WO2013095631A1 (en) | 2011-12-23 | 2011-12-23 | Systems, apparatuses, and methods for performing a butterfly horizontal and cross add or substract in response to a single instruction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104137053A (en) | 2014-11-05 |
CN104137053B (en) | 2018-06-26 |
Family
ID=48669269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201180076420.2A Active CN104137053B (en) | Systems, apparatuses, and methods for performing a butterfly horizontal and cross add or subtract in response to a single instruction | 2011-12-23 | 2011-12-23 |
Country Status (3)
Country | Link |
---|---|
US (1) | US9459865B2 (en) |
CN (1) | CN104137053B (en) |
WO (1) | WO2013095631A1 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150052330A1 (en) * | 2013-08-14 | 2015-02-19 | Qualcomm Incorporated | Vector arithmetic reduction |
US9851970B2 (en) * | 2014-12-23 | 2017-12-26 | Intel Corporation | Method and apparatus for performing reduction operations on a set of vector elements |
US20160283242A1 (en) * | 2014-12-23 | 2016-09-29 | Intel Corporation | Apparatus and method for vector horizontal logical instruction |
US20160188341A1 (en) * | 2014-12-24 | 2016-06-30 | Elmoustapha Ould-Ahmed-Vall | Apparatus and method for fused add-add instructions |
US20160188327A1 (en) * | 2014-12-24 | 2016-06-30 | Elmoustapha Ould-Ahmed-Vall | Apparatus and method for fused multiply-multiply instructions |
US10296342B2 (en) * | 2016-07-02 | 2019-05-21 | Intel Corporation | Systems, apparatuses, and methods for cumulative summation |
US10120680B2 (en) | 2016-12-30 | 2018-11-06 | Intel Corporation | Systems, apparatuses, and methods for arithmetic recurrence |
GB2564853B (en) * | 2017-07-20 | 2021-09-08 | Advanced Risc Mach Ltd | Vector interleaving in a data processing apparatus |
US10963247B2 (en) | 2019-05-24 | 2021-03-30 | Texas Instruments Incorporated | Vector floating-point classification |
US11334356B2 (en) * | 2019-06-29 | 2022-05-17 | Intel Corporation | Apparatuses, methods, and systems for a user defined formatting instruction to configure multicast Benes network circuitry |
US20230297371A1 (en) * | 2022-03-15 | 2023-09-21 | Intel Corporation | Fused multiple multiplication and addition-subtraction instruction set |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7095808B1 (en) * | 2000-08-16 | 2006-08-22 | Broadcom Corporation | Code puncturing method and apparatus |
CN101251791A (en) * | 2006-09-22 | 2008-08-27 | 英特尔公司 | Instruction and logic for processing text strings |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0377604B1 (en) * | 1987-08-21 | 1995-12-20 | Commonwealth Scientific And Industrial Research Organisation | A transform processing circuit |
US5204962A (en) * | 1989-11-30 | 1993-04-20 | Mitsubishi Denki Kabushiki Kaisha | Processor with preceding operation circuit connected to output of data register |
US6088782A (en) | 1997-07-10 | 2000-07-11 | Motorola Inc. | Method and apparatus for moving data in a parallel processor using source and destination vector registers |
US20030105945A1 (en) * | 2001-11-01 | 2003-06-05 | Bops, Inc. | Methods and apparatus for a bit rake instruction |
US7392368B2 (en) * | 2002-08-09 | 2008-06-24 | Marvell International Ltd. | Cross multiply and add instruction and multiply and subtract instruction SIMD execution on real and imaginary components of a plurality of complex data elements |
US20060149938A1 (en) * | 2004-12-29 | 2006-07-06 | Hong Jiang | Determining a register file region based at least in part on a value in an index register |
2011
- 2011-12-23 US US13/992,236 patent/US9459865B2/en active Active
- 2011-12-23 CN CN201180076420.2A patent/CN104137053B/en active Active
- 2011-12-23 WO PCT/US2011/067183 patent/WO2013095631A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
US9459865B2 (en) | 2016-10-04 |
WO2013095631A1 (en) | 2013-06-27 |
CN104137053A (en) | 2014-11-05 |
US20140201502A1 (en) | 2014-07-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104094218B (en) | Systems, devices and methods for performing the conversion for writing a series of index values of the mask register into vector registor | |
CN104040482B (en) | For performing the systems, devices and methods of increment decoding on packing data element | |
CN104137053B (en) | Systems, apparatuses, and methods for performing a butterfly horizontal and cross add or subtract in response to a single instruction | |
CN104335166B (en) | For performing the apparatus and method shuffled and operated | |
CN104081337B (en) | Systems, devices and methods for performing lateral part summation in response to single instruction | |
CN104350492B (en) | Cumulative vector multiplication is utilized in big register space | |
CN104040488B (en) | Complex conjugate vector instruction for providing corresponding plural number | |
CN104011649B (en) | Device and method for propagating estimated value of having ready conditions in the execution of SIMD/ vectors | |
CN104040487B (en) | Instruction for merging mask pattern | |
CN104081341B (en) | The instruction calculated for the element offset amount in Multidimensional numerical | |
CN104094182B (en) | The apparatus and method of mask displacement instruction | |
CN104011673B (en) | Vector frequency compression instruction | |
CN104137059B (en) | Multiregister dispersion instruction | |
CN104137060B (en) | Cache assists processing unit | |
CN104169867B (en) | For performing the systems, devices and methods of conversion of the mask register to vector registor | |
CN104115114B (en) | The device and method of improved extraction instruction | |
CN104011665B (en) | Super multiply-add (super MADD) is instructed | |
CN104350461B (en) | Instructed with different readings and the multielement for writing mask | |
CN104011671B (en) | Apparatus and method for performing replacement operator | |
CN104094221B (en) | Based on zero efficient decompression | |
CN104025019B (en) | For performing the systems, devices and methods of double block absolute difference summation | |
CN104204989B (en) | For the apparatus and method for the element for selecting vector calculating | |
CN104321740B (en) | Utilize the conversion of operand basic system and the vector multiplication of reconvert | |
CN104011661B (en) | Apparatus And Method For Vector Instructions For Large Integer Arithmetic | |
CN104011648B (en) | System, device and the method for being packaged compression for executing vector and repeating |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |