CN103793201B - Instruction and logic to provide vector compress and rotate functionality - Google Patents

Instruction and logic to provide vector compress and rotate functionality

Info

Publication number
CN103793201B
Authority
CN
China
Prior art keywords
vector
vectorial
value
units
size
Prior art date
Application number
CN201310524909.2A
Other languages
Chinese (zh)
Other versions
CN103793201A (en)
Inventor
T. Uliel
E. Ould-Ahmed-Vall
R. Valentine
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US13/664,401 (US9606961B2)
Application filed by Intel Corporation
Publication of CN103793201A
Application granted
Publication of CN103793201B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8076Details on data register access
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30018Bit or string instructions; instructions using a mask
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • G06F9/30112Register structure for variable length data, e.g. single or double registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/3016Decoding the operand specifier, e.g. specifier format
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181Instruction operation extension or modification
    • G06F9/30185Instruction operation extension or modification according to one or more bits in the instruction, e.g. prefix, sub-opcode
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction, e.g. SIMD
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/452Instruction code
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/62Details of cache specific to multiprocessor cache arrangements

Abstract

Instructions and logic provide vector compress and rotate functionality. Responsive to an instruction specifying a vector source, a mask, a vector destination, and a destination offset, some embodiments read the mask and copy corresponding unmasked vector elements from the vector source into adjacent sequential element locations in the vector destination, starting at the vector destination offset location. In some embodiments, the unmasked vector elements from the vector source are copied into adjacent sequential element locations modulo the total number of element locations in the vector destination. In some alternative embodiments, copying stops whenever the vector destination becomes full, and as unmasked vector elements are copied from the vector source into adjacent sequential element locations in the vector destination, the values of their corresponding fields in the mask are changed to a masked value. Alternative embodiments zero the elements of the vector destination into which no element is copied from the vector source.

Description

Instruction and logic to provide vector compress and rotate functionality

Technical field

The present disclosure pertains to the field of processing logic, microprocessors, and associated instruction set architectures that, when executed by the processor or other processing logic, perform logical, mathematical, or other functional operations. In particular, the disclosure relates to instructions and logic to provide vector compress and rotate functionality.

Background

Modern processors often include instructions to provide operations that are computationally intensive, but that offer a high level of data parallelism which can be exploited through an efficient implementation using various data storage devices, such as single-instruction multiple-data (SIMD) vector registers. The central processing unit (CPU) may then provide parallel hardware to support processing vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size M may contain N vector elements of size O, where N = M/O. For example, a 64-byte vector register may be partitioned into (a) 64 vector elements, each element holding a data item that occupies 1 byte, (b) 32 vector elements to hold data items that occupy 2 bytes (or one "word") each, (c) 16 vector elements to hold data items that occupy 4 bytes (or one "doubleword") each, or (d) 8 vector elements to hold data items that occupy 8 bytes (or one "quadword") each.
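
As a minimal illustrative sketch of the N = M/O relationship above (in Python, purely for exposition; the element sizes are the ones listed in the preceding paragraph):

    # Illustrative only: element count N for a vector register of M bytes
    # partitioned into elements of O bytes each, i.e. N = M / O.
    M = 64  # a 64-byte vector register
    for O, name in [(1, "byte"), (2, "word"), (4, "doubleword"), (8, "quadword")]:
        N = M // O
        print(N, "elements holding", O, "byte", name, "data items")  # 64, 32, 16, 8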

Allowing an application or software code to be vectorized may include making the application compile, install, and/or run on particular systems or instruction set architectures, such as a wide or large-width vector architecture.

Industry has developed various programming benchmarks to test architectures and the efficiency of computing techniques such as vectorization, simultaneous multithreading, and prediction. One such group of benchmarks comes from the Standard Performance Evaluation Corporation (SPEC). SPEC benchmarks are widely used to "probe" the performance of processor and platform architectures. The programs that make up the SPEC benchmarks are described and analyzed by industry professionals with the intent of discovering new compilation and computing techniques to improve computational performance. One of the SPEC benchmark suites, known as CPU2006, includes a set of integer and floating-point CPU-intensive benchmarks chosen to stress a system's processor, memory subsystem, and compiler. CPU2006 includes a program known as 444.NAMD, derived from the data layout and inner loop of NAMD, a parallel program for the simulation of large biomolecular systems developed by Jim Phillips of the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. Almost all of NAMD's run time is spent computing interactions between atoms in a group of functions. This group, together with a large amount of supporting code, forms the compact benchmark for CPU2006. The computational core was designed to achieve good performance across a wide range of machine architectures, but it contains no platform-specific optimizations.

The NAMD program was a winner of a 2002 Gordon Bell award for parallel scalability, but serial performance is equally important. For example, after most of the parallel portions of a benchmark have been vectorized, the serial portions that are not vectorizable typically represent an even more significant part of the benchmark's run time. This case is typical of the general situation for computationally intensive programs with high parallel scalability. After most of the parallel portions have been accelerated through vectorization, there remain performance-limiting issues and the difficult work of removing bottlenecks to improve the performance of the program's other non-vectorizable or serial portions.

To date, potential solutions to such performance-limiting issues and bottlenecks have not been fully explored.

Brief description of the drawings

The present invention is illustrated by way of example, and not limitation, in the figures of the accompanying drawings.

Figure 1A is a block diagram of one embodiment of a system that executes an instruction to provide vector compress and rotate functionality.

Figure 1B is a block diagram of another embodiment of a system that executes an instruction to provide vector compress and rotate functionality.

Figure 1C is a block diagram of another embodiment of a system that executes an instruction to provide vector compress and rotate functionality.

Figure 2 is a block diagram of one embodiment of a processor that executes an instruction to provide vector compress and rotate functionality.

Figure 3A illustrates packed data types according to one embodiment.

Figure 3B illustrates packed data types according to one embodiment.

Figure 3C illustrates packed data types according to one embodiment.

Figure 3D illustrates an instruction encoding to provide vector compress and rotate functionality according to one embodiment.

Figure 3E illustrates an instruction encoding to provide vector compress and rotate functionality according to another embodiment.

Figure 3F illustrates an instruction encoding to provide vector compress and rotate functionality according to another embodiment.

Figure 3G illustrates an instruction encoding to provide vector compress and rotate functionality according to another embodiment.

Figure 3H illustrates an instruction encoding to provide vector compress and rotate functionality according to another embodiment.

Figure 4A illustrates elements of one embodiment of a processor micro-architecture to execute instructions that provide vector compress and rotate functionality.

Figure 4B illustrates elements of another embodiment of a processor micro-architecture to execute instructions that provide vector compress and rotate functionality.

Figure 5 is a block diagram of one embodiment of a processor to execute instructions that provide vector compress and rotate functionality.

Figure 6 is a block diagram of one embodiment of a computer system to execute instructions that provide vector compress and rotate functionality.

Figure 7 is a block diagram of another embodiment of a computer system to execute instructions that provide vector compress and rotate functionality.

Figure 8 is a block diagram of another embodiment of a computer system to execute instructions that provide vector compress and rotate functionality.

Figure 9 is a block diagram of one embodiment of a system-on-chip to execute instructions that provide vector compress and rotate functionality.

Figure 10 is a block diagram of an embodiment of a processor to execute instructions that provide vector compress and rotate functionality.

Figure 11 is a block diagram of one embodiment of an IP core development system that provides vector compress and rotate functionality.

Figure 12 illustrates one embodiment of an architecture emulation system that provides vector compress and rotate functionality.

Figure 13 illustrates one embodiment of a system to translate instructions that provide vector compress and rotate functionality.

Figure 14A illustrates a flow diagram for one embodiment of an instruction to provide vector compress and rotate functionality.

Figure 14B illustrates a flow diagram for another embodiment of an instruction to provide vector compress and rotate functionality.

Figure 15A illustrates a flow diagram for one embodiment of a process to use an instruction that provides vector compress and rotate functionality.

Figure 15B illustrates a flow diagram for another embodiment of a process to use an instruction that provides vector compress and rotate functionality.

Figure 16A illustrates a flow diagram for one embodiment of a process to provide vector compress and rotate functionality.

Figure 16B illustrates a flow diagram for an alternative embodiment of a process to provide vector compress and rotate functionality.

Figure 17 illustrates a flow diagram for another embodiment of a process to provide vector compress and rotate functionality.

Figure 18 illustrates a flow diagram for an embodiment of a process to provide vector compress functionality in a benchmark application.

Figure 19A illustrates a flow diagram for an embodiment of a process to provide vector compress and rotate functionality in a benchmark application.

Figure 19B illustrates a flow diagram for an alternative embodiment of a process to provide vector compress and rotate functionality in a benchmark application.

Detailed description

The following description discloses instructions and processing logic to provide vector compress and rotate functionality within or in association with a processor, computer system, or other processing apparatus.

Disclosed herein are instructions and logic to provide vector compress and rotate functionality. Responsive to an instruction specifying a vector source, a mask, a vector destination, and a destination offset, some embodiments read the mask and copy corresponding unmasked vector elements from the vector source into adjacent sequential element locations of the vector destination, starting at the vector destination offset location. Alternative embodiments zero the elements of the vector destination into which no element is copied from the vector source. In some embodiments, the unmasked vector elements from the vector source are copied into adjacent sequential element locations modulo the total number of element locations in the vector destination. In some alternative embodiments, copying stops when the vector destination is full. As unmasked vector elements are copied from the vector source into adjacent sequential element locations of the vector destination, the values of their corresponding fields in the mask may also be changed to a masked value. The mask values can thus be used to track progress and/or completion, and after the destination that has become full is stored to memory, the instruction can be executed again. The instruction can then be re-executed using the modified mask and a zero vector destination offset, compressing only those elements for which the vector compress and rotate instruction still needs to be performed, thereby permitting improved instruction throughput.
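
The following Python sketch is an informal model of the behavior just described, not the patented hardware or its actual encoding; the function name, the use of Boolean lists for the mask, and the keyword arguments selecting the wrap-around, mask-update, and zeroing variants are assumptions made only for illustration.

    # Hedged model of a vector compress-and-rotate step (illustrative only).
    def vector_compress_rotate(src, mask, dest, offset, wrap=True, zero_unwritten=False):
        # src    : list of source elements
        # mask   : list of booleans, True = unmasked (element still to be compressed)
        # dest   : list of destination elements, modified in place
        # offset : first destination element position to write
        # Returns (updated mask, next destination offset).
        n = len(dest)
        new_mask = list(mask)
        written = set()
        pos = offset
        for i, unmasked in enumerate(mask):
            if not unmasked:
                continue                      # skip masked elements
            if not wrap and pos >= n:
                break                         # variant: stop when the destination is full
            dest[pos % n] = src[i]            # adjacent sequential locations, modulo n
            written.add(pos % n)
            new_mask[i] = False               # variant: mark the copied element as masked
            pos += 1
        if zero_unwritten:                    # variant: zero destination elements that
            for j in range(n):                # received no element from the source
                if j not in written:
                    dest[j] = 0
        return new_mask, pos % n

For example, with src = [10, 20, 30, 40], mask = [True, False, True, True], a four-element dest, and offset = 3, the unmasked elements 10, 30, and 40 would land in dest[3], dest[0], and dest[1] under the wrap-around variant, and the returned mask would be all False.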

It will be appreciated that SIMD compress and rotate instructions may be used to provide vector compress functionality in applications that are otherwise not easy to vectorize, such as in benchmark applications exemplified by the inner loop of 444.NAMD in the SPEC benchmark suite, thereby reducing the number of expensive sequential stores to external memory, increasing performance and instruction throughput, and reducing power use.
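
Below is a hedged sketch of how such an instruction might be used in a loop of the kind described above, reusing the illustrative vector_compress_rotate model from the previous sketch; the input data, predicate, and vector length are invented for the example, and the real benchmark code is not reproduced here.

    # Illustrative use: gather elements that pass a test into a vector-sized buffer
    # and write the buffer out only when it is full, instead of one store per element.
    import random

    VLEN = 8
    threshold = 0.5
    data_blocks = [[random.random() for _ in range(VLEN)] for _ in range(4)]  # made-up input

    dest = [0.0] * VLEN
    offset = 0
    memory = []                                    # stand-in for external memory

    for block in data_blocks:
        mask = [x > threshold for x in block]      # hypothetical per-element predicate
        while any(mask):
            mask, offset = vector_compress_rotate(block, mask, dest, offset, wrap=False)
            if offset == 0:                        # destination wrapped to the start:
                memory.extend(dest)                # it is full, so flush it in one store
    memory.extend(dest[:offset])                   # flush the final, partially filled buffer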

In the following description, numerous specific details such as processing logic, processor types, micro-architectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of embodiments of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring embodiments of the present invention.

Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. Similar techniques and teachings of embodiments of the present invention can be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance. The teachings of embodiments of the present invention are applicable to any processor or machine that performs data manipulations. However, the present invention is not limited to processors or machines that perform 512-bit, 256-bit, 128-bit, 64-bit, 32-bit, or 16-bit data operations, and can be applied to any processor and machine in which manipulation or management of data is performed. In addition, the following description provides examples, and the accompanying drawings show various examples for the purposes of illustration. However, these examples should not be construed in a limiting sense, as they are merely intended to provide examples of embodiments of the present invention rather than an exhaustive list of all possible implementations of embodiments of the present invention.

Although the examples below describe instruction handling and distribution in the context of execution units and logic circuits, other embodiments of the present invention can be accomplished by way of data or instructions stored on a machine-readable, tangible medium, which, when performed by a machine, cause the machine to perform functions consistent with at least one embodiment of the invention. In one embodiment, functions associated with embodiments of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the present invention. Embodiments of the present invention may be provided as a computer program product or software that may include a machine- or computer-readable medium having stored thereon instructions that may be used to program a computer (or other electronic devices) to perform one or more operations according to embodiments of the present invention. Alternatively, steps of embodiments of the present invention might be performed by specific hardware components that contain fixed-function logic for performing the steps, or by any combination of programmed computer components and fixed-function hardware components.

Instructions used to program logic to perform embodiments of the invention can be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer-readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memory (CD-ROM), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. In the case where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of a machine-readable medium. A memory or a magnetic or optical storage such as a disc may be the machine-readable medium that stores information transmitted via optical or electrical waves modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store, at least temporarily, on a tangible, machine-readable medium an article embodying techniques of embodiments of the present invention, such as information encoded into a carrier wave.

In modern processors, a number of different execution units are used to process and execute a variety of code and instructions. Not all instructions are created equal, as some are quicker to complete while others can take a number of clock cycles to complete. The faster the throughput of instructions, the better the overall performance of the processor. Thus it would be advantageous to have as many instructions execute as fast as possible. However, there are certain instructions that have greater complexity and require more in terms of execution time and processor resources. For example, there are floating point instructions, load/store operations, data moves, and so on.

As more computer systems are used in internet, text, and multimedia applications, additional processor support has been introduced over time. In one embodiment, an instruction set may be associated with one or more computer architectures, including data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O).

In one embodiment, the instruction set architecture (ISA) may be implemented by one or more micro-architectures, which include processor logic and circuits used to implement one or more instruction sets. Accordingly, processors with different micro-architectures can share at least a portion of a common instruction set. For example, Intel Pentium 4 processors, Intel Core™ processors, and processors from Advanced Micro Devices, Inc. of Sunnyvale, California implement nearly identical versions of the x86 instruction set (with some extensions that have been added in newer versions), but have different internal designs. Similarly, processors designed by other processor development companies, such as ARM Holdings, Ltd., MIPS, or their licensees or adopters, may share at least a portion of a common instruction set, but may include different processor designs. For example, the same register architecture of an ISA may be implemented in different ways in different micro-architectures using new or well-known techniques, including dedicated physical registers, one or more dynamically allocated physical registers using a register renaming mechanism (e.g., the use of a register alias table (RAT), a reorder buffer (ROB), and a retirement register file). In one embodiment, registers may include one or more registers, register architectures, register files, or other register sets that may or may not be addressable by a software programmer.

In one embodiment, an instruction may include one or more instruction formats. In one embodiment, an instruction format may indicate various fields (number of bits, location of bits, etc.) to specify, among other things, the operation to be performed and the operands on which that operation is to be performed. Some instruction formats may be further defined by instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields and/or defined to have a given field interpreted differently. In one embodiment, an instruction is expressed using an instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and specifies or indicates the operation and the operands upon which the operation will operate.

Scientific, financial, auto-vectorized general-purpose RMS (recognition, mining, and synthesis), and visual and multimedia applications (e.g., 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms, and audio manipulation) may require the same operation to be performed on a large number of data items. In one embodiment, single instruction multiple data (SIMD) refers to a type of instruction that causes a processor to perform an operation on multiple data elements. SIMD technology may be used in processors in which the bits in a register can be logically divided into a number of fixed-size or variable-size data elements, each of which represents a separate value. For example, in one embodiment, the bits in a 64-bit register may be organized as a source operand containing four separate 16-bit data elements, each of which represents a separate 16-bit value. This type of data may be referred to as a "packed" data type or a "vector" data type, and operands of this data type may be referred to as packed data operands or vector operands. In one embodiment, a packed data item or vector may be a sequence of packed data elements stored within a single register, and a packed data operand or vector operand may be a source or destination operand of a SIMD instruction (or a "packed data instruction" or a "vector instruction"). In one embodiment, a SIMD instruction specifies a single vector operation to be performed on two source vector operands to generate a destination vector operand (also referred to as a result vector operand) of the same or a different size, with the same or a different number of data elements, and in the same or a different data element order.
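
As a purely illustrative software sketch (not processor hardware), the following Python snippet views a 64-bit value as four packed 16-bit elements and applies the same addition to every element in one step, in the spirit of the packed operands described above; the helper names are invented.

    # Illustrative only: a 64-bit "register" viewed as four packed 16-bit elements.
    def unpack16(reg64):
        # Split a 64-bit integer into four 16-bit lanes, lowest lane first.
        return [(reg64 >> (16 * i)) & 0xFFFF for i in range(4)]

    def pack16(lanes):
        # Reassemble four 16-bit lanes into a single 64-bit integer.
        value = 0
        for i, lane in enumerate(lanes):
            value |= (lane & 0xFFFF) << (16 * i)
        return value

    def simd_add16(a64, b64):
        # One "vector" operation: a lane-wise 16-bit add applied to all four lanes.
        return pack16([(x + y) & 0xFFFF for x, y in zip(unpack16(a64), unpack16(b64))])

    src1 = pack16([1, 2, 3, 4])
    src2 = pack16([10, 20, 30, 40])
    assert unpack16(simd_add16(src1, src2)) == [11, 22, 33, 44]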

SIMD technology, such as that employed by Intel Core™ processors having an instruction set including x86, MMX™, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSE4.1, and SSE4.2 instructions, ARM processors such as the ARM Cortex family of processors having an instruction set including the Vector Floating Point (VFP) and/or NEON instructions, and MIPS processors such as the Loongson family of processors developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences, has enabled a significant improvement in application performance (Core™ and MMX™ are registered trademarks or trademarks of Intel Corporation of Santa Clara, California).

In one embodiment, destination and source registers/data are generic terms to represent the source and destination of the corresponding data or operation. In some embodiments, they may be implemented by registers, memory, or other storage areas having names or functions other than those depicted. For example, in one embodiment, "DEST1" may be a temporary storage register or other storage area, whereas "SRC1" and "SRC2" may be first and second source storage registers or other storage areas, and so forth. In other embodiments, two or more of the SRC and DEST storage areas may correspond to different data storage elements within the same storage area (e.g., a SIMD register). In one embodiment, one of the source registers may also act as a destination register by, for example, writing back the result of an operation performed on the first and second source data to one of the two source registers serving as a destination register.

Figure 1A is a block diagram of an exemplary computer system formed with a processor that includes execution units to execute an instruction in accordance with one embodiment of the present invention. System 100 includes a component, such as a processor 102, to employ execution units including logic to perform algorithms for processing data, in accordance with the present invention, such as in the embodiments described herein. System 100 is representative of processing systems based on the Pentium® III, Pentium® 4, Xeon™, XScale™, and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (UNIX and Linux, for example), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications can include a micro controller, a digital signal processor (DSP), a system on a chip, network computers (NetPCs), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform one or more instructions in accordance with at least one embodiment.

Figure 1A is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm to execute at least one instruction in accordance with one embodiment of the present invention. One embodiment may be described in the context of a single-processor desktop computer or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a "hub" system architecture. The computer system 100 includes a processor 102 to process data signals. The processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor, for example. The processor 102 is coupled to a processor bus 110 that can transmit data signals between the processor 102 and other components in the system 100. The elements of system 100 perform conventional functions that are well known to those familiar with the art.

In one embodiment, the processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside external to the processor 102. Other embodiments can also include a combination of both internal and external caches depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating point registers, status registers, and an instruction pointer register.

Execution unit 108, including logic to perform integer and floating point operations, also resides in the processor 102. The processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. For one embodiment, execution unit 108 includes logic to handle a packed instruction set 109. By including the packed instruction set 109 in the instruction set of a general-purpose processor 102, along with associated circuitry to execute the instructions, the operations used by many multimedia applications may be performed using packed data in a general-purpose processor 102. Thus, many multimedia applications can be accelerated and executed more efficiently by using the full width of a processor's data bus for performing operations on packed data. This can eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.

Alternate embodiments of an execution unit 108 can also be used in micro controllers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or another memory device. Memory 120 can store instructions and/or data represented by data signals that can be executed by the processor 102.

A system logic chip 116 is coupled to the processor bus 110 and memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). The processor 102 can communicate with the MCH 116 via a processor bus 110. The MCH 116 provides a high bandwidth memory path 118 to memory 120 for instruction and data storage and for storage of graphics commands, data, and textures. The MCH 116 directs data signals between the processor 102, memory 120, and other components in the system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O 122. In some embodiments, the system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. The MCH 116 is coupled to memory 120 through a memory interface 118. The graphics card 112 is coupled to the MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a proprietary hub interface bus 122 to couple the MCH 116 to the I/O controller hub (ICH) 130. The ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to the memory 120, chipset, and processor 102. Some examples are the audio controller, firmware hub (flash BIOS) 128, transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. The data storage device 124 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or another mass storage device.

For another embodiment of a system, an instruction in accordance with one embodiment can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or graphics controller, can also be located on a system on a chip.

Figure 1B illustrates a data processing system 140 that implements the principles of one embodiment of the present invention. It will be readily appreciated by one of skill in the art that the embodiments described herein can be used with alternative processing systems without departing from the scope of embodiments of the invention.

Computer system 140 comprises a processing core 159 capable of performing at least one instruction in accordance with one embodiment. For one embodiment, processing core 159 represents a processing unit of any type of architecture, including but not limited to a CISC, a RISC, or a VLIW type architecture. Processing core 159 may also be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate said manufacture.

Processing core 159 comprises an execution unit 142, a set of register files 145, and a decoder 144. Processing core 159 also includes additional circuitry (not shown) that is not necessary to the understanding of embodiments of the present invention. Execution unit 142 executes instructions received by processing core 159. In addition to performing typical processor instructions, execution unit 142 can perform instructions in packed instruction set 143 for performing operations on packed data formats. Packed instruction set 143 includes instructions for performing embodiments of the invention and other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. Register file 145 represents a storage area on processing core 159 for storing information, including data. As previously mentioned, it is understood that the storage area used for storing the packed data is not critical. Execution unit 142 is coupled to decoder 144. Decoder 144 decodes instructions received by processing core 159 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 142 performs the appropriate operations. In one embodiment, the decoder interprets the opcode of the instruction, which indicates what operation should be performed on the corresponding data indicated within the instruction.

Processing core 159 is coupled with bus 141 for communicating with various other system devices, which may include, but are not limited to, for example, a synchronous dynamic random access memory (SDRAM) control 146, a static random access memory (SRAM) control 147, a burst flash memory interface 148, a Personal Computer Memory Card International Association (PCMCIA)/Compact Flash (CF) card control 149, a liquid crystal display (LCD) control 150, a direct memory access (DMA) controller 151, and an alternative bus master interface 152. In one embodiment, data processing system 140 may also comprise an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices may include, but are not limited to, for example, a universal asynchronous receiver/transmitter (UART) 155, a universal serial bus (USB) 156, a Bluetooth wireless UART 157, and an I/O expansion interface 158.

One embodiment of data processing system 140 provides for mobile, network, and/or wireless communications and a processing core 159 capable of performing SIMD operations including a text string comparison operation. Processing core 159 may be programmed with various audio, video, imaging, and communications algorithms including discrete transformations such as a Walsh-Hadamard transform, a fast Fourier transform (FFT), a discrete cosine transform (DCT), and their respective inverse transforms; compression/decompression techniques such as color space transformation, video encode motion estimation, or video decode motion compensation; and modulation/demodulation (MODEM) functions such as pulse coded modulation (PCM).

Figure 1C illustrates yet another alternative embodiment of a data processing system capable of executing instructions to provide vector compress and rotate functionality. In accordance with one alternative embodiment, data processing system 160 may include a main processor 166, a SIMD coprocessor 161, a cache memory 167, and an input/output system 168. The input/output system 168 may optionally be coupled to a wireless interface 169. SIMD coprocessor 161 is capable of performing operations including instructions in accordance with one embodiment. Processing core 170 may be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate the manufacture of all or part of data processing system 160 including processing core 170.

For one embodiment, SIMD coprocessor 161 comprises an execution unit 162 and a set of register files 164. One embodiment of main processor 166 comprises a decoder 165 to recognize instructions of instruction set 163, including instructions in accordance with one embodiment, for execution by execution unit 162. For alternative embodiments, SIMD coprocessor 161 also comprises at least part of decoder 165B to decode instructions of instruction set 163. Processing core 170 also includes additional circuitry (not shown) that is not necessary to the understanding of embodiments of the present invention.

In operation, the main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with the cache memory 167 and the input/output system 168. Embedded within the stream of data processing instructions are SIMD coprocessor instructions. The decoder 165 of main processor 166 recognizes these SIMD coprocessor instructions as being of a type that should be executed by an attached SIMD coprocessor 161. Accordingly, the main processor 166 issues these SIMD coprocessor instructions (or control signals representing SIMD coprocessor instructions) on the coprocessor bus 171, from which they are received by any attached SIMD coprocessor. In this case, the SIMD coprocessor 161 will accept and execute any received SIMD coprocessor instructions intended for it.

Data may be received via wireless interface 169 for processing by the SIMD coprocessor instructions. For one example, voice communication may be received in the form of a digital signal, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples representative of the voice communication. For another example, compressed audio and/or video may be received in the form of a digital bit stream, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples and/or motion video frames. For one embodiment of processing core 170, main processor 166 and a SIMD coprocessor 161 are integrated into a single processing core 170 comprising an execution unit 162, a set of register files 164, and a decoder 165 to recognize instructions of instruction set 163, including instructions in accordance with one embodiment.

Figure 2 is a block diagram of the micro-architecture for a processor 200 that includes logic circuits to perform instructions in accordance with one embodiment of the present invention. In some embodiments, an instruction in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as data types such as single and double precision integer and floating point data types. In one embodiment, the in-order front end 201 is the part of the processor 200 that fetches instructions to be executed and prepares them to be used later in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches instructions from memory and feeds them to an instruction decoder 228, which in turn decodes or interprets them. For example, in one embodiment, the decoder decodes a received instruction into one or more operations called "micro-instructions" or "micro-operations" (also called micro ops or uops) that the machine can execute. In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations in accordance with one embodiment. In one embodiment, the trace cache 230 takes decoded uops and assembles them into program-ordered sequences or traces in the uop queue 234 for execution. When the trace cache 230 encounters a complex instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Some instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder 228 accesses the microcode ROM 232 to do the instruction. For one embodiment, an instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine the correct micro-instruction pointer for reading the micro-code sequences from the microcode ROM 232 to complete one or more instructions in accordance with one embodiment. After the microcode ROM 232 finishes sequencing micro-ops for an instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230.

The out-of-order execution engine 203 is where the instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of instructions to optimize performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute. The register renaming logic renames logic registers onto entries in a register file. The allocator also allocates an entry for each uop in one of the two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: the memory scheduler, fast scheduler 202, slow/general floating point scheduler 204, and simple floating point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of their dependent input register operand sources and the availability of the execution resources the uops need to complete their operation. The fast scheduler 202 of one embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.

Register files 208, 210 sit between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 in execution block 211. There is a separate register file 208, 210 for integer and floating point operations, respectively. Each register file 208, 210 of one embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating point register file 210 are also capable of communicating data with each other. For one embodiment, the integer register file 208 is split into two separate register files, one register file for the low-order thirty-two bits of data and a second register file for the high-order thirty-two bits of data. The floating point register file 210 of one embodiment has 128-bit wide entries because floating point instructions typically have operands from 64 to 128 bits in width.

Execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210 that store the integer and floating point data operand values that the micro-instructions need to execute. The processor 200 of one embodiment is comprised of a number of execution units: address generation unit (AGU) 212, AGU 214, fast ALU 216, fast ALU 218, slow ALU 220, floating point ALU 222, floating point move unit 224. For one embodiment, the floating point execution blocks 222, 224 execute floating point, MMX, SIMD, and SSE, or other operations. The floating point ALU 222 of one embodiment includes a 64-bit by 64-bit floating point divider to execute divide, square root, and remainder micro-ops. For embodiments of the invention, instructions involving a floating point value may be handled with the floating point hardware. In one embodiment, the ALU operations go to the high-speed ALU execution units 216, 218. The fast ALUs 216, 218 of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU 220, as the slow ALU 220 includes integer execution hardware for long-latency types of operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212, 214. For one embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data bit widths including 16, 32, 128, 256, etc. Similarly, the floating point units 222, 224 can be implemented to support a range of operands having bits of various widths. For one embodiment, the floating point units 222, 224 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

In one embodiment, the uop schedulers 202, 204, 206 dispatch dependent operations before the parent load has finished executing. As uops are speculatively scheduled and executed in processor 200, the processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed, and the independent operations are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instructions that provide vector compression and rotation functionality.

The term "register" may refer to the on-board processor storage locations that are used as part of instructions to identify operands. In other words, registers may be those that are usable from the outside of the processor (from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data, and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store thirty-two bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data. For the discussions below, the registers are understood to be data registers designed to hold packed data, such as 64-bit wide MMX™ registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit wide XMM registers relating to SSE2, SSE3, SSE4, or beyond (referred to generically as "SSEx") technology can also be used to hold such packed data operands. In one embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, integer and floating point data are either contained in the same register file or in different register files. Furthermore, in one embodiment, floating point and integer data may be stored in different registers or the same registers.

In the examples of the following figures, a number of data operands are described. Fig. 3A illustrates various packed data type representations in multimedia registers according to one embodiment of the present invention. Fig. 3A illustrates data types for a packed byte 310, a packed word 320, and a packed doubleword (dword) 330 for 128-bit wide operands. The packed byte format 310 of this example is 128 bits long and contains sixteen packed byte data elements. A byte is defined here as 8 bits of data. The information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 120 through bit 127 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. Also, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in parallel.

Generally, a data element is an individual piece of data that is stored in a single register or memory location with other data elements of the same length. In packed data sequences relating to SSEx technology, the number of data elements stored in an XMM register is 128 bits divided by the length in bits of an individual data element. Similarly, in packed data sequences relating to MMX and SSE technology, the number of data elements stored in an MMX register is 64 bits divided by the length in bits of an individual data element. Although the data types illustrated in Fig. 3A are 128 bits long, embodiments of the present invention can also operate with 64-bit wide, 256-bit wide, 512-bit wide, or other sized operands. The packed word format 320 of this example is 128 bits long and contains eight packed word data elements. Each packed word contains sixteen bits of information. The packed doubleword format 330 of Fig. 3A is 128 bits long and contains four packed doubleword data elements. Each packed doubleword data element contains thirty-two bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
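As a worked illustration of the relationship described above (the element count equals the register width in bits divided by the element width in bits), the following minimal C sketch computes the counts for a few assumed register and element sizes; the function name and the chosen sizes are illustrative only, not part of any instruction set definition.

#include <stdio.h>

/* Number of packed data elements that fit in a register:
 * register width in bits divided by element width in bits. */
static unsigned packed_element_count(unsigned reg_bits, unsigned elem_bits)
{
    return reg_bits / elem_bits;
}

int main(void)
{
    printf("128-bit register,  8-bit bytes : %u elements\n",
           packed_element_count(128, 8));   /* 16 packed bytes  */
    printf("128-bit register, 16-bit words : %u elements\n",
           packed_element_count(128, 16));  /*  8 packed words  */
    printf("128-bit register, 32-bit dwords: %u elements\n",
           packed_element_count(128, 32));  /*  4 packed dwords */
    printf("256-bit register, 32-bit dwords: %u elements\n",
           packed_element_count(256, 32));  /*  8 packed dwords */
    return 0;
}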

Fig. 3B illustrates alternative in-register data storage formats. Each packed data can include more than one independent data element. Three packed data formats are illustrated: packed half 341, packed single 342, and packed double 343. One embodiment of packed half 341, packed single 342, and packed double 343 contains fixed-point data elements. For an alternative embodiment, one or more of packed half 341, packed single 342, and packed double 343 may contain floating point data elements. One alternative embodiment of packed half 341 is one hundred twenty-eight bits long containing eight 16-bit data elements. One embodiment of packed single 342 is one hundred twenty-eight bits long and contains four 32-bit data elements. One embodiment of packed double 343 is one hundred twenty-eight bits long and contains two 64-bit data elements. It will be appreciated that such packed data formats may be further extended to other register lengths, for example, to 96 bits, 160 bits, 192 bits, 224 bits, 256 bits, 512 bits or more.

Fig. 3C illustrates various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention. Unsigned packed byte representation 344 illustrates the storage of an unsigned packed byte in a SIMD register. The information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and so on, and finally bit 120 through bit 127 for byte 15. Thus, all available bits are used in the register. This storage arrangement can increase the storage efficiency of the processor. Also, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in a parallel fashion. Signed packed byte representation 345 illustrates the storage of a signed packed byte. Note that the eighth bit of every byte data element is the sign indicator. Unsigned packed word representation 346 illustrates how word seven through word zero are stored in a SIMD register. Signed packed word representation 347 is similar to the unsigned packed word in-register representation 346. Note that the sixteenth bit of each word data element is the sign indicator. Unsigned packed doubleword representation 348 shows how doubleword data elements are stored. Signed packed doubleword representation 349 is similar to the unsigned packed doubleword in-register representation 348. Note that the necessary sign bit is the thirty-second bit of each doubleword data element.

Fig. 3D depicts one embodiment of an operation encoding (opcode) format 360, having thirty-two or more bits, and register/memory operand addressing modes corresponding with a type of opcode format described in the "Intel® 64 and IA-32 Intel Architecture Software Developer's Manual Combined Volumes 2A and 2B: Instruction Set Reference A-Z," which is available from Intel Corporation, Santa Clara, California on the world-wide-web (www) at intel.com/products/processor/manuals/. In one embodiment, an instruction may be encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. For one embodiment, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the results of the instruction, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. For one embodiment, operand identifiers 364 and 365 may be used to identify 32-bit or 64-bit source and destination operands.

Fig. 3E depicts another alternative operation encoding (opcode) format 370, having forty or more bits. Opcode format 370 corresponds with opcode format 360 and comprises an optional prefix byte 378. An instruction according to one embodiment may be encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction may be identified by source operand identifiers 374 and 375 and by prefix byte 378. For one embodiment, prefix byte 378 may be used to identify 32-bit or 64-bit source and destination operands. For one embodiment, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment, an instruction operates on one or more of the operands identified by operand identifiers 374 and 375, and one or more operands identified by operand identifiers 374 and 375 is overwritten by the results of the instruction, whereas in other embodiments, operands identified by identifiers 374 and 375 are written to another data element in another register. Opcode formats 360 and 370 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.

Turning next to Fig. 3F, in some alternative embodiments, 64-bit (or 128-bit, or 256-bit, or 512-bit or more) single instruction multiple data (SIMD) arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Operation encoding (opcode) format 380 depicts one such CDP instruction having CDP opcode fields 382 and 389. For alternative embodiments, the type of CDP instruction operation may be encoded by one or more of fields 383, 384, 387, and 388. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the coprocessor can operate on 8, 16, 32, and 64 bit values. For one embodiment, an instruction is performed on integer data elements. In some embodiments, an instruction may be executed conditionally, using condition field 381. For some embodiments, source data sizes may be encoded by field 383. In some embodiments, zero (Z), negative (N), carry (C), and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 384.

Turning next to Fig. 3G is a depiction of another alternative operation encoding (opcode) format 397, to provide vector compression and rotation functionality according to another embodiment, corresponding with a type of opcode format described in the "Intel® Advanced Vector Extensions Programming Reference," which is available from Intel Corporation, Santa Clara, California on the world-wide-web (www) at intel.com/products/processor/manuals/.

The original x86 instruction set provided for a 1-byte opcode with various formats of address syllable and immediate operand contained in additional bytes whose presence was known from the first "opcode" byte. Additionally, there were certain byte values that were reserved as modifiers to the opcode (called prefixes, as they had to be placed before the instruction). When the original palette of 256 opcode bytes (including these special prefix values) was exhausted, a single byte was dedicated as an escape to a new set of 256 opcodes. As vector instructions (e.g., SIMD) were added, a need for more opcodes was generated, and the "two byte" opcode map also was insufficient, even when expanded through the use of prefixes. To this end, new instructions were added in additional maps which use two bytes plus an optional prefix as an identifier.

Additionally, in order to facilitate additional registers in 64-bit mode, an additional prefix (referred to as "REX") may be used in between the prefixes and the opcode (and any escape bytes necessary to determine the opcode). In one embodiment, the REX may have four "payload" bits to indicate the use of additional registers in 64-bit mode. In other embodiments it may have fewer or more than four bits. The general format of at least one instruction set (which corresponds generally with format 360 and/or format 370) is illustrated generically by the following:

[prefixes] [rex] escape [escape 2] opcode modrm (etc.)

Opcode format 397 corresponds with opcode format 370 and comprises optional VEX prefix bytes 391 (beginning with C4 hex in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes. For example, the following illustrates an embodiment using two fields to encode an instruction, which may be used when a second escape code is present in the original instruction, or when extra bits (e.g., the XB and W fields) in the REX field need to be used. In the embodiment illustrated below, the legacy escape is represented by a new escape value, legacy prefixes are fully compressed as part of the "payload" bytes, legacy prefixes are reclaimed and available for future expansion, the second escape code is compressed in a "map" field, with future map or feature space available, and new features are added (e.g., increased vector length and an additional source register specifier).

[prefixes] [rex] escape [escape 2] opcode modrm [sib] [disp] [imm]

vex RXBmmmmm WvvvLpp opcode modrm [sib] [disp] [imm]

(new features)

An instruction according to one embodiment may be encoded by one or more of fields 391 and 392. Up to four operand locations per instruction may be identified by field 391 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, VEX prefix bytes 391 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit or 256-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 397 may be redundant with opcode format 370, whereas in other embodiments they are different. Opcode formats 370 and 397 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing specified in part by MOD field 373 and by the optional (SIB) identifier 393, the optional displacement identifier 394, and the optional immediate byte 395.

Turning next to Fig. 3H is a depiction of another alternative operation encoding (opcode) format 398, to provide vector compression and rotation functionality according to another embodiment. Opcode format 398 corresponds with opcode formats 370 and 397 and comprises optional EVEX prefix bytes 396 (beginning with 62 hex in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes and to provide additional functionality. An instruction according to one embodiment may be encoded by one or more of fields 396 and 392. Up to four operand locations per instruction and a mask may be identified by field 396 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, EVEX prefix bytes 396 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit, 256-bit, or 512-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 398 may be redundant with opcode formats 370 or 397, whereas in other embodiments they are different. Opcode format 398 allows register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, with masks, specified in part by MOD field 373 and by the optional (SIB) identifier 393, the optional displacement identifier 394, and the optional immediate byte 395. The general format of at least one instruction set (which corresponds generally with format 360 and/or format 370) is illustrated generically by the following:

evex1 RXBmmmmm WvvvLpp evex4 opcode modrm [sib] [disp] [imm]

For one embodiment, an instruction encoded according to the EVEX format 398 may have additional "payload" bits that may be used to provide vector compression and rotation functionality with additional new features such as, for example, a user-configurable mask register, an additional operand, selections from among 128-bit, 256-bit, or 512-bit vector registers, more registers from which to select, etc.

For example, where VEX format 397 may be used to provide vector compression and rotation functionality with an implicit mask, the EVEX format 398 may be used to provide vector compression and rotation functionality with an explicit user-configurable mask. Additionally, where VEX format 397 may be used to provide vector compression and rotation functionality on 128-bit or 256-bit vector registers, EVEX format 398 may be used to provide vector compression and rotation functionality on 128-bit, 256-bit, 512-bit, or larger (or smaller) vector registers.

Example instructions to provide vector compression and rotation functionality are illustrated by the following examples:

It will be appreciated that, as in the examples above, SIMD compress and rotate instructions may be used to provide vector compression functionality in applications that would not otherwise be easy to vectorize, for example in the inner loops of reference applications such as 444.NAMD of the SPEC benchmark suite, thereby reducing the number of expensive sequential stores to external memory, increasing performance and instruction throughput, and reducing power use.

Fig. 4A is a block diagram illustrating an in-order pipeline and a register renaming stage, out-of-order issue/execution pipeline according to at least one embodiment of the invention. Fig. 4B is a block diagram illustrating an in-order architecture core and register renaming logic, out-of-order issue/execution logic to be included in a processor according to at least one embodiment of the invention. The solid lined boxes in Fig. 4A illustrate the in-order pipeline, while the dashed lined boxes illustrate the register renaming, out-of-order issue/execution pipeline. Similarly, the solid lined boxes in Fig. 4B illustrate the in-order architecture logic, while the dashed lined boxes illustrate the register renaming logic and out-of-order issue/execution logic.

In Fig. 4A, a processor pipeline 400 includes a fetch stage 402, a length decode stage 404, a decode stage 406, an allocation stage 408, a renaming stage 410, a scheduling (also known as dispatch or issue) stage 412, a register read/memory read stage 414, an execute stage 416, a write back/memory write stage 418, an exception handling stage 422, and a commit stage 424.

In Fig. 4B, arrows denote a coupling between two or more units, and the direction of the arrow indicates the direction of data flow between those units. Fig. 4B shows processor core 490 including a front end unit 430 coupled to an execution engine unit 450, and both the front end unit 430 and the execution engine unit 450 are coupled to a memory unit 470.

The core 490 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 490 may be a special-purpose core, such as, for example, a network or communication core, compression engine, graphics core, or the like.

The front end unit 430 includes a branch prediction unit 432 coupled to an instruction cache unit 434, which is coupled to an instruction translation lookaside buffer (TLB) 436, which is coupled to an instruction fetch unit 438, which is coupled to a decode unit 440. The decode unit or decoder may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, micro-instructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decoder may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. The instruction cache unit 434 is further coupled to a level 2 (L2) cache unit 476 in the memory unit 470. The decode unit 440 is coupled to a rename/allocator unit 452 in the execution engine unit 450.

The execution engine unit 450 includes the rename/allocator unit 452 coupled to a retirement unit 454 and a set of one or more scheduler units 456. The scheduler units 456 represent any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler units 456 are coupled to the physical register file units 458. Each of the physical register file units 458 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc., status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. The physical register file units 458 are overlapped by the retirement unit 454 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using reorder buffers and retirement register files; using future files, history buffers, and retirement register files; using register maps and a pool of registers; etc.). Generally, the architectural registers are visible from the outside of the processor or from a programmer's perspective. The registers are not limited to any known particular type of circuit. Various different types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. The retirement unit 454 and the physical register file units 458 are coupled to the execution clusters 460. The execution cluster 460 includes a set of one or more execution units 462 and a set of one or more memory access units 464. The execution units 462 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler units 456, physical register file units 458, and execution clusters 460 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit, and/or execution cluster; and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access units 464). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 464 is coupled to the memory unit 470, which includes a data TLB unit 472 coupled to a data cache unit 474, which is coupled to a level 2 (L2) cache unit 476. In one exemplary embodiment, the memory access units 464 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 472 in the memory unit 470. The L2 cache unit 476 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 400 as follows: 1) the instruction fetch unit 438 performs the fetch and length decoding stages 402 and 404; 2) the decode unit 440 performs the decode stage 406; 3) the rename/allocator unit 452 performs the allocation stage 408 and the renaming stage 410; 4) the scheduler units 456 perform the schedule stage 412; 5) the physical register file units 458 and the memory unit 470 perform the register read/memory read stage 414; the execution cluster 460 performs the execute stage 416; 6) the memory unit 470 and the physical register file units 458 perform the write back/memory write stage 418; 7) various units may be involved in the exception handling stage 422; and 8) the retirement unit 454 and the physical register file units 458 perform the commit stage 424.

The core 490 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California).

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter, such as in Intel Hyper-Threading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 434/474 and a shared L2 cache unit 476, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

Fig. 5 is a block diagram of a single core processor and a multicore processor 500 with integrated memory controller and graphics according to embodiments of the invention. The solid lined boxes in Fig. 5 illustrate a processor 500 with a single core 502A, a system agent 510, and a set of one or more bus controller units 516, while the optional addition of the dashed lined boxes illustrates an alternative processor 500 with multiple cores 502A-N, a set of one or more integrated memory controller units 514 in the system agent unit 510, and integrated graphics logic 508.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 506, and external memory (not shown) coupled to the set of integrated memory controller units 514. The set of shared cache units 506 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 512 interconnects the integrated graphics logic 508, the set of shared cache units 506, and the system agent unit 510, alternative embodiments may use any number of well-known techniques for interconnecting such units.

In some embodiments, one or more of the cores 502A-N are capable of multithreading. The system agent 510 includes those components coordinating and operating the cores 502A-N. The system agent unit 510 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 502A-N and the integrated graphics logic 508. The display unit is for driving one or more externally connected displays.

The cores 502A-N may be homogeneous or heterogeneous in terms of architecture and/or instruction set. For example, some of the cores 502A-N may be in-order while others are out-of-order. As another example, two or more of the cores 502A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

The processor may be a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, XScale™ or StrongARM™ processor, which are available from Intel Corporation of Santa Clara, California. Alternatively, the processor may be from another company, such as ARM Holdings, Ltd., MIPS, etc. The processor may be a special-purpose processor, such as, for example, a network or communications processor, compression engine, graphics processor, co-processor, embedded processor, or the like. The processor may be implemented on one or more chips. The processor 500 may be a part of one or more substrates and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

Figs. 6-8 are exemplary systems suitable for including the processor 500, while Fig. 9 is an exemplary system on a chip (SoC) that may include one or more of the cores 502. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a huge variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.

Referring now to Fig. 6, shown is a block diagram of a system 600 in accordance with one embodiment of the present invention. The system 600 may include one or more processors 610, 615, which are coupled to a graphics memory controller hub (GMCH) 620. The optional nature of the additional processors 615 is denoted in Fig. 6 with broken lines.

Each processor 610, 615 may be some version of the processor 500. It should be noted, however, that it is unlikely that integrated graphics logic and integrated memory control units would exist in the processors 610, 615. Fig. 6 illustrates that the GMCH 620 may be coupled to a memory 640 that may be, for example, a dynamic random access memory (DRAM). The DRAM may, for at least one embodiment, be associated with a non-volatile cache.

The GMCH 620 may be a chipset, or a portion of a chipset. The GMCH 620 may communicate with the processors 610, 615 and control interaction between the processors 610, 615 and the memory 640. The GMCH 620 may also act as an accelerated bus interface between the processors 610, 615 and other elements of the system 600. For at least one embodiment, the GMCH 620 communicates with the processors 610, 615 via a multi-drop bus, such as a frontside bus (FSB) 695.

Furthermore, the GMCH 620 is coupled to a display 645 (such as a flat panel display). The GMCH 620 may include an integrated graphics accelerator. The GMCH 620 is further coupled to an input/output (I/O) controller hub (ICH) 650, which may be used to couple various peripheral devices to the system 600. Shown for example in the embodiment of Fig. 6 is an external graphics device 660, which may be a discrete graphics device coupled to the ICH 650, along with another peripheral device 670.

Alternatively, additional or different processors may also be present in the system 600. For example, the additional processors 615 may include additional processors that are the same as processor 610, additional processors that are heterogeneous or asymmetric to processor 610, accelerators (such as, for example, graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor. There can be a variety of differences between the physical resources 610, 615 in terms of a spectrum of metrics of merit including architectural, micro-architectural, thermal, and power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processors 610, 615. For at least one embodiment, the various processors 610, 615 may reside in the same die package.

Referring now to Fig. 7, shown is a block diagram of a second system 700 in accordance with an embodiment of the present invention. As shown in Fig. 7, the multiprocessor system 700 is a point-to-point interconnect system, and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750. Each of the processors 770 and 780 may be some version of the processor 500, as are one or more of the processors 610, 615.

While shown with only two processors 770, 780, it is to be understood that the scope of the present invention is not so limited. In other embodiments, one or more additional processors may be present in a given processor.

Processors 770 and 780 are shown including integrated memory controller units 772 and 782, respectively. Processor 770 also includes as part of its bus controller units point-to-point (P-P) interfaces 776 and 778; similarly, the second processor 780 includes P-P interfaces 786 and 788. Processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P interface circuits 778, 788. As shown in Fig. 7, the IMCs 772 and 782 couple the processors to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.

Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752, 754 using point to point interface circuits 776, 794, 786, 798. The chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

The chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, the first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in Fig. 7, various I/O devices 714 may be coupled to the first bus 716, along with a bus bridge 718 which couples the first bus 716 to a second bus 720. In one embodiment, the second bus 720 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 720 including, for example, a keyboard and/or mouse 722, communication devices 727, and a storage unit 728 such as a disk drive or other mass storage device which may include instructions/code and data 730. Further, an audio I/O 724 may be coupled to the second bus 720. Note that other architectures are possible. For example, instead of the point-to-point architecture of Fig. 7, a system may implement a multi-drop bus or other such architecture.

Referring now to Fig. 8, shown is a block diagram of a third system 800 in accordance with an embodiment of the present invention. Like elements in Figs. 7 and 8 bear like reference numerals, and certain aspects of Fig. 7 have been omitted from Fig. 8 in order to avoid obscuring other aspects of Fig. 8.

Fig. 8 illustrates that the processors 870, 880 may include integrated memory and I/O control logic ("CL") 872 and 882, respectively. For at least one embodiment, the CL 872, 882 may include integrated memory controller units such as those described above in connection with Figs. 5 and 7. In addition, CL 872, 882 may also include I/O control logic. Fig. 8 illustrates that not only are the memories 832, 834 coupled to the CL 872, 882, but also that I/O devices 814 are coupled to the control logic 872, 882. Legacy I/O devices 815 are coupled to the chipset 890.

Referring now to Fig. 9, shown is a block diagram of a SoC 900 in accordance with an embodiment of the present invention. Similar elements in Fig. 5 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In Fig. 9, an interconnect unit 902 is coupled to: an application processor 910 which includes a set of one or more cores 502A-N and shared cache units 506; a system agent unit 510; bus controller units 516; integrated memory controller units 514; a set of one or more media processors 920 which may include the integrated graphics logic 508, an image processor 924 for providing still and/or video camera functionality, an audio processor 926 for providing hardware audio acceleration, and a video processor 928 for providing video encode/decode acceleration; a static random access memory (SRAM) unit 930; a direct memory access (DMA) unit 932; and a display unit 940 for coupling to one or more external displays.

Fig. 10 illustrates a processor containing a central processing unit (CPU) and a graphics processing unit (GPU), which may perform at least one instruction according to one embodiment. In one embodiment, an instruction to perform operations according to at least one embodiment could be performed by the CPU. In another embodiment, the instruction could be performed by the GPU. In still another embodiment, the instruction may be performed through a combination of operations performed by the GPU and the CPU. For example, in one embodiment, an instruction in accordance with one embodiment may be received and decoded for execution on the GPU. However, one or more operations within the decoded instruction may be performed by the CPU and the result returned to the GPU for final retirement of the instruction. Conversely, in some embodiments, the CPU may act as the primary processor and the GPU as the co-processor.

In some embodiments, instructions that benefit from highly parallel, throughput-oriented processors may be performed by the GPU, while instructions that benefit from the performance of processors with deeply pipelined architectures may be performed by the CPU. For example, graphics, scientific applications, financial applications, and other parallel workloads may benefit from the performance of the GPU and be executed accordingly, whereas more sequential applications, such as operating system kernel or application code, may be better suited for the CPU.

In Fig. 10, processor 1000 includes a CPU 1005, GPU 1010, image processor 1015, video processor 1020, USB controller 1025, UART controller 1030, SPI/SDIO controller 1035, display device 1040, High-Definition Multimedia Interface (HDMI) controller 1045, MIPI controller 1050, flash memory controller 1055, dual data rate (DDR) controller 1060, security engine 1065, and I2S/I2C (Integrated Interchip Sound/Inter-Integrated Circuit) interface 1070. Other logic and circuits may be included in the processor of Fig. 10, including more CPUs or GPUs and other peripheral interface controllers.

One or more aspects of at least one embodiment may be implemented by representative data stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible, machine-readable medium ("tape") and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor. For example, IP cores, such as the Cortex™ family of processors developed by ARM Holdings, Ltd. and Loongson IP cores developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences, may be licensed or sold to various customers or licensees, such as Texas Instruments, Qualcomm, Apple, or Samsung, and implemented in processors produced by these customers or licensees.

Fig. 11 shows a block diagram illustrating the development of IP cores according to one embodiment. Storage 1130 includes simulation software 1120 and/or hardware or software model 1110. In one embodiment, the data representing the IP core design can be provided to the storage 1130 via memory 1140 (e.g., a hard disk), wired connection (e.g., the internet) 1150, or wireless connection 1160. The IP core information generated by the simulation tool and model can then be transmitted to a fabrication facility where it can be fabricated by a third party to perform at least one instruction in accordance with at least one embodiment.

In some embodiments, one or more instructions may correspond to a first type or architecture (e.g., x86) and be translated or emulated on a processor of a different type or architecture (e.g., ARM). An instruction according to one embodiment may therefore be performed on any processor or processor type, including ARM, x86, MIPS, a GPU, or another processor type or architecture.

Fig. 12 illustrates how an instruction of a first type is emulated by a processor of a different type, according to one embodiment. In Fig. 12, program 1205 contains some instructions that may perform the same or substantially the same function as an instruction according to one embodiment. However, the instructions of program 1205 may be of a type and/or format that is different from or incompatible with processor 1215, meaning that instructions of the type in program 1205 may not be executed natively by the processor 1215. However, with the help of emulation logic 1210, the instructions of program 1205 are translated into instructions that can be natively executed by the processor 1215. In one embodiment, the emulation logic is embodied in hardware. In another embodiment, the emulation logic is embodied in a tangible, machine-readable medium containing software to translate instructions of the type in program 1205 into the type natively executable by the processor 1215. In other embodiments, the emulation logic is a combination of fixed-function or programmable hardware and a program stored on a tangible, machine-readable medium. In one embodiment, the processor contains the emulation logic, whereas in other embodiments, the emulation logic exists outside of the processor and is provided by a third party. In one embodiment, the processor is capable of loading the emulation logic embodied in a tangible, machine-readable medium containing software by executing microcode or firmware contained in or associated with the processor.

Fig. 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Fig. 13 shows that a program in a high level language 1302 may be compiled using an x86 compiler 1304 to generate x86 binary code 1306 that may be natively executed by a processor with at least one x86 instruction set core 1316. The processor with at least one x86 instruction set core 1316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1304 represents a compiler operable to generate x86 binary code 1306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1316. Similarly, Fig. 13 shows that the program in the high level language 1302 may be compiled using an alternative instruction set compiler 1308 to generate alternative instruction set binary code 1310 that may be natively executed by a processor without at least one x86 instruction set core 1314 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1312 is used to convert the x86 binary code 1306 into code that may be natively executed by the processor without an x86 instruction set core 1314. This converted code is not likely to be the same as the alternative instruction set binary code 1310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1306.

Fig. 14A illustrates a flow diagram for one embodiment of an instruction 1401 to provide vector compression and rotation functionality. Embodiments of instruction 1401 may specify a vector source operand 1420, a mask register 1410, a vector destination operand 1440, and a vector destination offset 1430. The mask register 1410 may include a plurality of data fields, each of the plurality of data fields in the mask register 1410 corresponding to an element position of a vector, such as of the vector source operand 1420. In some embodiments, a decode stage, such as decode stage 406, may decode instruction 1401, and responsive to the decoded instruction 1401, one or more execution units, such as in execution engine unit 450, read the values of the plurality of data fields in the mask register 1410, and for each of the plurality of data fields in the mask register 1410 having an unmasked value (e.g., one), copy the corresponding vector element from the vector source operand 1420 into adjacent sequential element positions of the vector destination 1440, starting at the vector destination offset 1430 position (e.g., element position four). For some embodiments, the corresponding vector elements from the vector source operand 1420 are copied into adjacent sequential element positions modulo the total number of element positions (e.g., eight) in the vector destination 1440, for example eight 32-bit element positions of a 256-bit Ymm register of an x86 processor.
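A minimal scalar C sketch of the copy behaviour just described for Fig. 14A follows, assuming 32-bit elements, an eight-element destination, and a bitmask with one bit per source element; the function name, types, and widths are illustrative assumptions rather than part of the instruction definition.

#include <stdint.h>

#define NUM_ELEMS 8

/* Compress-with-rotation sketch (assumed Fig. 14A semantics):
 * every source element whose mask bit holds the unmasked value (one)
 * is copied into the next adjacent destination position, starting at
 * 'offset' and wrapping modulo the number of destination positions. */
static void compress_rotate(const uint32_t src[NUM_ELEMS],
                            uint8_t mask,
                            uint32_t dst[NUM_ELEMS],
                            unsigned offset)
{
    unsigned pos = offset % NUM_ELEMS;
    for (unsigned i = 0; i < NUM_ELEMS; i++) {
        if (mask & (1u << i)) {
            dst[pos] = src[i];               /* copy to adjacent position  */
            pos = (pos + 1) % NUM_ELEMS;     /* wrap modulo destination    */
        }
    }
}

Under these assumptions the wrap makes the destination behave like a circular buffer of element positions, which is what allows successive executions to keep appending selected elements without gaps.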

It will be appreciated that the vector destination 1440 could have only two 64-bit element positions, or alternatively sixteen 32-bit element positions, or thirty-two 16-bit element positions, etc.

Fig. 14B illustrates a flow diagram for another embodiment of an instruction 1402 to provide vector compression and rotation functionality. Embodiments of instruction 1402 may also specify a vector source operand 1420, a mask 1410, a vector destination operand 1440, and a vector destination offset 1430. The mask 1410 may also include a plurality of data fields, each of the plurality of data fields in the mask 1410 corresponding to an element position of a vector, such as of the vector source operand 1420. In some embodiments, a decode stage, such as decode stage 406, may decode instruction 1402, and responsive to the decoded instruction 1402, one or more execution units, such as in execution engine unit 450, read the values of the plurality of data fields in the mask 1410, and for each of the plurality of data fields in the mask 1410 having an unmasked value (e.g., one), the one or more execution units copy the corresponding vector element from the vector source operand 1420 into adjacent sequential element positions of the vector destination 1440, starting at the vector destination offset 1430 position (e.g., element position four). For some embodiments, the corresponding vector elements are copied from the vector source operand 1420 into adjacent sequential element positions starting at the vector destination offset 1430 position only up until the most significant element position of the vector destination 1440 is filled.

For some embodiments, as each corresponding vector element is copied from the vector source operand 1420 into an adjacent sequential element position of the vector destination 1440, the value of the corresponding data field in the mask register 1410 is changed from the unmasked value to a masked value, leaving, in this example, only the most significant bit position of the mask register 1410 unchanged. It will be appreciated that, in such embodiments, rotation functionality can be provided by executing the instruction again with the modified mask and an offset of zero.
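The following C sketch, under the same illustrative assumptions as the sketch after Fig. 14A, models this behaviour: copying stops once the most significant destination position is filled, the mask bit of every copied element is cleared, and a second execution with the updated mask and a zero offset completes the rotation. The names, widths, and helper structure are assumptions made for illustration only.

#include <stdint.h>

#define NUM_ELEMS 8

/* Compress-without-wrap sketch (assumed Fig. 14B semantics):
 * elements with a set mask bit are copied to adjacent destination
 * positions starting at 'offset'; copying stops once the most
 * significant destination position has been filled.  The mask bit of
 * each copied element is cleared, so the returned mask marks the
 * elements still waiting to be copied. */
static uint8_t compress_no_wrap(const uint32_t src[NUM_ELEMS],
                                uint8_t mask,
                                uint32_t dst[NUM_ELEMS],
                                unsigned offset)
{
    unsigned pos = offset;
    for (unsigned i = 0; i < NUM_ELEMS && pos < NUM_ELEMS; i++) {
        if (mask & (1u << i)) {
            dst[pos++] = src[i];
            mask &= (uint8_t)~(1u << i);   /* unmasked value -> masked value */
        }
    }
    return mask;
}

/* Rotation by re-execution: the leftover elements (mask bits still set)
 * are written at offset zero into a second destination buffer. */
static void compress_rotate_two_step(const uint32_t src[NUM_ELEMS],
                                     uint8_t mask, unsigned offset,
                                     uint32_t dst_hi[NUM_ELEMS],
                                     uint32_t dst_lo[NUM_ELEMS])
{
    uint8_t remaining = compress_no_wrap(src, mask, dst_hi, offset);
    (void)compress_no_wrap(src, remaining, dst_lo, 0);
}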

Fig. 15A illustrates a flow diagram for one embodiment of a process 1501 to use an instruction that provides vector compression and rotation functionality. Process 1501 and other processes herein disclosed are performed by processing blocks that may comprise dedicated hardware or software or firmware operation codes executable by general purpose machines or by special purpose machines or by a combination of both.

In process 1501, a top value (TopVal) v in each element of a vector 1510 is compared with each element of a vector B[3:0], for example in vector register 1515, to determine whether the elements of B are less than the top value v, and a mask, such as Mask0 1520, is generated to store the result. A count of the number of bits set to the unmasked value in the mask is stored to Count 1530. The elements of a vector A[3:0] are compressed according to the unmasked settings in Mask0 1520, and the vector elements are stored to, for example, a vector register 1575 starting at an initial offset R0 1535, which is initially zero. The value of Count 1530 is added to the value of offset R0 1535 to generate offset R1 1545.

Then similarly, the top value v in each element of the vector TopVal, for example in vector register 1550, is compared with each element of a vector B[7:4], for example in vector register 1555, to determine whether these elements of B are less than the top value v, and another mask, such as Mask1 1560, is generated to store the result. The elements of a vector A[7:4] are compressed according to the unmasked settings in Mask1 1560, and the vector elements are stored to a vector register 1585 starting at offset R1 1545.

For some embodiments, the vector elements A[7:4] are compressed from vector source 1565 into adjacent sequential element positions beginning at the position indicated by vector destination offset 1545, modulo the total number of element positions in vector destination 1585. Count 1530 is used to shift a mask 1540 of all ones to the left to produce mask 1570, and mask 1570 can be used, for example by a move-under-mask operation, to combine the compression results in vector register 1575 and vector register 1585 into vector register 1590.

It will be appreciated that vector register 1590 may then be stored to memory, and that another iteration (not shown) may be started with a new initial offset equal to the initial offset R1 1545 plus the number of bit positions in mask 1560 set to the unmasked value, minus the total number of element positions in vector destination 1585 (i.e., a new initial offset of one in this example).
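
As a rough illustration of the data flow of process 1501, the Python sketch below mimics it with small made-up vectors; the helper name compress, the sample values of v, A, and B, and the variable names are assumptions of this sketch rather than values taken from the figures.

def compress(src, mask, length, offset=0):
    """Compress the unmasked elements of src into a new vector of the given
    length, starting at 'offset' and wrapping modulo length."""
    out, j = [0] * length, 0
    for i, m in enumerate(mask):
        if m:
            out[(offset + j) % length] = src[i]
            j += 1
    return out, j

v, length = 50, 4                                     # top value and register width (assumed)
A = [1, 2, 3, 4, 5, 6, 7, 8]                          # elements to compress (assumed)
B = [10, 60, 20, 70, 80, 30, 90, 40]                  # elements compared against v (assumed)

mask0 = [1 if b < v else 0 for b in B[0:4]]           # mask0 1520
dst0, count = compress(A[0:4], mask0, length)         # register 1575, initial offset R0 = 0
r1 = 0 + count                                        # offset R1 1545

mask1 = [1 if b < v else 0 for b in B[4:8]]           # mask1 1560
dst1, _ = compress(A[4:8], mask1, length, offset=r1)  # register 1585, wraps modulo length

blend = [1 if k >= r1 else 0 for k in range(length)]  # mask 1570: all ones shifted left by count
reg1590 = [d1 if sel else d0 for d0, d1, sel in zip(dst0, dst1, blend)]
# With these sample values: mask0 = [1, 0, 1, 0], count = 2, dst0 = [1, 3, 0, 0],
# mask1 = [0, 1, 0, 1], dst1 = [0, 0, 6, 8], and reg1590 = [1, 3, 6, 8].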

Figure 15B illustrates a flow diagram for another embodiment of a process 1502 to use an instruction providing vector compress and rotate functionality. In process 1502, the top value v in each element of the vector TopVal, e.g., in vector register 1510, is compared with each element of vector B[3:0], e.g., in vector register 1515, to determine whether the elements of B are less than the top value v, and a mask, e.g., mask0 1520, is generated to store the results. A count of the number of bit positions in the mask set to the unmasked value is stored to count 1530. The elements of vector A[3:0] are compressed according to the unmasked bits set in mask0 1520, and the compressed vector elements are stored to vector register 1590 beginning at an initial offset R0 1535, which is, for example, initially zero. The value of count 1530 is added to the value of offset R0 1535 to generate offset R1 1545.

Then, similarly, the top value v in each element of the vector TopVal, e.g., in vector register 1550 or vector register 1510, is compared with each element of vector B[7:4], e.g., in vector register 1555, to determine whether those elements of B are less than the top value v, and another mask, e.g., mask1 1560, is generated to store the results. The elements of vector A[7:4] are compressed according to the unmasked bits set in mask1 1560, and the compressed vector elements are stored to vector register 1590 beginning at offset R1 1545.

For some embodiments, the vector elements A[7:4] are compressed from vector source 1565 into adjacent sequential element positions beginning at the position indicated by vector destination offset 1545, only until the most significant element position of vector destination 1590 has been filled. For some embodiments, as each corresponding vector element is copied from vector source 1565 into the adjacent sequential element positions of vector destination 1590, the value of the corresponding data field in mask register 1560 is changed from the unmasked value to the masked value, e.g., leaving only the most significant bit of mask register 1560 unchanged. It will be appreciated that, in such embodiments, rotate functionality may be provided by re-executing the instruction with the modified mask and another offset of zero.

Figure 16A illustrates a flow diagram for one embodiment of a process 1601 to provide vector compress and rotate functionality. Process 1601 and the other processes disclosed herein are performed by processing blocks that may comprise dedicated hardware, or software or firmware operation codes executable by a general-purpose machine, a special-purpose machine, or a combination of both.

In processing block 1610 of process 1601, a compress-rotate instruction is decoded. In processing block 1615, an internal variable i is set to zero (0) and an internal variable j is set to zero (0). In processing block 1630, the values of a first plurality of data fields in the mask register are read, and for each data field Mask[i] it is determined whether the value of that data field is set to one (1). It will be appreciated that any convenient value may be used to indicate the unmasked value in Mask[i], including zero (0), negative one (-1), and so on. If data field Mask[i] is determined not to be set to one (1), processing proceeds to processing block 1655, where the internal variable i is incremented. Otherwise, in processing block 1645, for each data field in the mask register having the value one (1), the corresponding i-th vector element from the vector source is copied to and stored in the adjacent sequential element positions of vector destination Dest beginning at the value of the vector destination offset, rotate, plus the internal variable j, modulo the total number, Length, of element positions in vector destination Dest. Then, in processing block 1650, the internal variable j is incremented, and in processing block 1655 the internal variable i is incremented. In processing block 1660, it is determined whether execution of the compress-rotate instruction is complete. If not, process 1601 begins another iteration at processing block 1630. Otherwise, processing ends in processing block 1665.
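
For readers who prefer code to flow charts, the sketch below is one possible software rendering of process 1601; the block numbers appear as comments, and the function and variable names are assumptions of this illustration, not part of the embodiments.

def process_1601(src, mask, dest, rotate):
    """Illustrative rendering of process 1601: compress the elements of src
    whose mask fields are set into dest, beginning at offset 'rotate' and
    wrapping modulo the destination length."""
    length = len(dest)
    i, j = 0, 0                                    # block 1615
    while i < len(mask):                           # block 1660: done when the mask is exhausted
        if mask[i] == 1:                           # block 1630
            dest[(rotate + j) % length] = src[i]   # block 1645: copy modulo Length
            j += 1                                 # block 1650
        i += 1                                     # block 1655
    return dest                                    # block 1665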

It will be appreciated that, while process 1601 and the other processes disclosed herein are illustrated as iterative processes, in various embodiments the processing blocks may also, wherever possible, be performed in a different order, concurrently, or in parallel with the order illustrated.

In some alternative embodiments, copying simply stops when the vector destination is full. When an unmasked vector element is copied from the vector source into the adjacent sequential element positions of vector destination Dest, the value of the corresponding field in the mask may also be changed to the masked value. Thus the mask values can be used to track progress and/or completion, and the instruction can be re-executed after the full destination has been stored to memory. The instruction may then be re-executed with the modified mask and a vector destination offset of zero, so that only the elements still requiring compression by the vector compress and rotate instruction are compressed, thereby permitting improved instruction throughput.

Figure 16B illustrates a flow diagram for another embodiment of a process 1602 to provide vector compress and rotate functionality. In processing block 1610 of process 1602, a compress-rotate instruction is decoded. In processing block 1615, an internal variable i is set to zero (0) and an internal variable j is set to zero (0). In processing block 1630, the values of a first plurality of data fields in the mask register are read, and for each data field Mask[i] it is determined whether the value of that data field is set to one (1). Again, it will be appreciated that any selectable value may be used to indicate the unmasked value in Mask[i], including zero (0), negative one (-1), and so on. If data field Mask[i] is determined not to be set to one (1), processing proceeds to processing block 1655, where the internal variable i is incremented. Otherwise, in processing block 1635, it is determined whether the offset, rotate, plus the value of the internal variable j is less than the total number, Length, of element positions in vector destination Dest. If not, processing proceeds to processing block 1655, where the internal variable i is incremented.

Otherwise, in processing block 1640, the data field Mask[i] is set to zero (0). In processing block 1646, for each data field in the mask register having the value one (1), the corresponding i-th vector element from the vector source is copied to and stored in the adjacent sequential element positions of vector destination Dest beginning at the vector destination offset, rotate, plus the internal variable j, until the most significant element position of vector destination Dest has been filled. In processing block 1650, the internal variable j is incremented, and in processing block 1655 the internal variable i is incremented. In processing block 1660, it is determined whether execution of the compress-rotate instruction is complete. If not, process 1602 begins another iteration at processing block 1630. Otherwise, processing ends in processing block 1665.
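
A corresponding sketch for process 1602 is given below; as before, it is only an illustration under assumed names, not the hardware itself.

def process_1602(src, mask, dest, rotate):
    """Illustrative rendering of process 1602: like process 1601, but copying
    stops once the most significant element position of dest is filled, and the
    mask field of every copied element is cleared so that progress can be
    tracked and the instruction re-executed with an offset of zero."""
    length = len(dest)
    i, j = 0, 0                                 # block 1615
    while i < len(mask):                        # block 1660
        if mask[i] == 1:                        # block 1630
            if rotate + j < length:             # block 1635: room left in dest?
                mask[i] = 0                     # block 1640: mark the element as copied
                dest[rotate + j] = src[i]       # block 1646
                j += 1                          # block 1650
        i += 1                                  # block 1655
    return dest                                 # block 1665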

Figure 17 illustrates a flow diagram for another embodiment of a process 1701 to provide vector compress and rotate functionality. In processing block 1710 of process 1701, a compress-rotate instruction is decoded. In processing block 1715, an internal variable i is set to zero (0) and an internal variable j is set to zero (0). In processing block 1720, it is determined whether zeroing of the vector destination Dest applies. If so, zeroes are stored to all of the element positions in vector destination Dest. In some alternative embodiments, zero elements are stored only to those vector destination positions to which no element is copied from the vector source. Otherwise, if zeroing of vector destination Dest does not apply, processing proceeds directly to processing block 1730.

In processing block 1730, the values of a first plurality of data fields in the mask register are read, and for each data field Mask[i] it is determined whether the value of that data field is set to one (1). Again, it will be appreciated that any convenient value may be used to indicate the unmasked value in Mask[i], including zero (0), negative one (-1), and so on. If data field Mask[i] is determined not to be set to one (1), processing proceeds to processing block 1745, where the internal variable i is incremented. Otherwise, in processing block 1735, for each data field in the mask register having the value one (1), the corresponding i-th vector element from the vector source is copied to and stored in the adjacent sequential element positions of vector destination Dest beginning at the vector destination offset, rotate, plus the internal variable j, modulo the total number, Length, of element positions in vector destination Dest. Then, in processing block 1740, the internal variable j is incremented, and in processing block 1745 the internal variable i is incremented. In processing block 1750, it is determined whether execution of the compress-rotate instruction is complete. If not, process 1701 begins another iteration at processing block 1730. Otherwise, processing ends in processing block 1755.
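
The sketch below adds the optional destination zeroing of process 1701 to the same kind of illustrative model; the zeroing flag and the names are assumptions of this sketch.

def process_1701(src, mask, dest, rotate, zeroing=False):
    """Illustrative rendering of process 1701: optionally zero the destination,
    then compress the elements of src whose mask fields are set into dest,
    beginning at 'rotate' and wrapping modulo the destination length."""
    length = len(dest)
    if zeroing:                                    # block 1720
        for k in range(length):
            dest[k] = 0                            # store zero to every element position
    i, j = 0, 0                                    # block 1715
    while i < len(mask):                           # block 1750
        if mask[i] == 1:                           # block 1730
            dest[(rotate + j) % length] = src[i]   # block 1735: copy modulo Length
            j += 1                                 # block 1740
        i += 1                                     # block 1745
    return dest                                    # block 1755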

It will be appreciated that processes 1601 and 1701 can be used to provide vector compress functionality in applications that are otherwise not easily vectorized, such as in the inner loop of 444.NAMD in the SPEC benchmark suite, thereby reducing the number of expensive sequential stores to external memory, increasing performance, and reducing power use.

Figure 18 illustrates a flow diagram for an embodiment of a process 1801 to provide vector compress functionality in a benchmark application. In processing block 1810 of process 1801, a variable i is set to zero (0), and last-i is set to last minus the length of the vector register. In processing block 1815, the top value v in each element of the vector TopVal[Length:0], e.g., of vector register 1510, is compared with each element of the vector B[Length+i:i] to determine whether the elements of B are less than the top value v, and a mask is generated, e.g., in mask register 1520, to store the results. In processing block 1820, a count of the number of bit positions in the mask set to the unmasked value is stored to count. In processing block 1830, it is determined whether count is greater than zero. If not, processing proceeds to processing block 1870, where the value Length is added to i.

Otherwise, processing proceeds to processing block 1835, where the elements of vector A[Length+i:i] are loaded into vector register DestA[Length:0]. Processing then proceeds to processing block 1845, where DestA[Length:0] is compressed according to the unmasked fields set in the mask and stored to adjacent sequential element positions in memory beginning at the memory location indicated by a memory pointer operand. In processing block 1860, the memory pointer is incremented by count, i.e., if the vector elements are eight bytes long, the value of the memory pointer is increased by eight times the value of count. Next, processing proceeds to processing block 1870, where the value Length is added to i. Then, in processing block 1875, it is determined whether i is greater than last-i. If so, only a few remaining elements are left to consider in order to complete the process, which occurs in processing block 1880. Otherwise, processing begins another iteration at processing block 1815.
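
To make the loop of process 1801 concrete, here is one way it could look in Python; a list stands in for external memory, and the function name, variable names, and chunk handling are assumptions of this sketch.

def process_1801(A, B, v, length, memory):
    """Illustrative rendering of process 1801: for each chunk of 'length'
    elements, compare B against the top value v, compress the matching elements
    of A, and append them to 'memory' as one contiguous stream."""
    i = 0
    last_i = len(A) - length                                   # block 1810: last minus register length
    while i <= last_i:
        mask = [1 if b < v else 0 for b in B[i:i + length]]    # block 1815
        count = sum(mask)                                      # block 1820
        if count > 0:                                          # block 1830
            dest_a = A[i:i + length]                           # block 1835
            packed = [a for a, m in zip(dest_a, mask) if m]    # block 1845: compress under mask
            memory.extend(packed)                              # sequential store; pointer advances by count
        i += length                                            # block 1870
    # block 1880: any remaining tail of fewer than 'length' elements is handled here
    return memory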

Figure 19A illustrates a flow diagram for an embodiment of a process 1901 to provide vector compress and rotate functionality in a benchmark application. In processing block 1910 of process 1901, a variable i is set to zero (0), an offset, rotate, is set to zero (0), last-i is set to last minus the length of the vector register, and mask2 is set to all ones. In processing block 1915, the top value v in each element of the vector TopVal[Length:0], e.g., of vector register 1510, is compared with each element of the vector B[Length+i:i] to determine whether the elements of B are less than the top value v, and a mask, e.g., mask 1520, is generated to store the results. In processing block 1920, a count of the number of bit positions in the mask set to the unmasked value is stored to count. In processing block 1925, the elements of vector A[Length+i:i] are compressed and filled according to the unmasked bits set in the mask and stored to DestA[Length:0] beginning at rotate. In processing block 1930, the previously compressed elements in DestA[Length:0] are blended, using mask2, with the compressed elements in DestR. Then, in processing block 1935, the value of count is added to the value of rotate.

In processing block 1940, it is determined whether rotate has become greater than the length, Length, i.e., the number of elements held in vector register DestA. If not, processing proceeds to processing block 1970, where the value Length is added to i. Otherwise, processing proceeds to processing block 1945, where DestA[Length:0] is stored to the location indicated by the memory pointer. In processing block 1950, the value of Length is subtracted from the value of rotate. In processing block 1955, DestR[Length:0] is copied to DestA[Length:0]. In processing block 1960, the memory pointer is incremented by Length, i.e., if the vector elements are four bytes long, the value of the memory pointer is increased by four times the value of Length. In processing block 1965, a mask of all ones is shifted left by rotate and stored in mask2. Next, processing proceeds to processing block 1970, where the value Length is added to i. Then, in processing block 1975, it is determined whether i is greater than last-i. If so, only a few remaining elements are left to consider in order to complete the process, which occurs in processing block 1980. Otherwise, processing begins another iteration at processing block 1915.
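
One way to picture the net effect of process 1901 is the simplified sketch below, which gathers compressed elements into register-sized chunks, stores each full chunk to memory, and carries the overflow forward; it models the visible result rather than the per-register blending, and every name in it is an assumption of this illustration.

def compress_and_spill(A, B, v, length, memory):
    """Illustrative sketch in the spirit of process 1901: collect the elements
    of A whose paired B element is below v, store a full register's worth to
    memory whenever one overflows, and keep the wrapped remainder pending."""
    pending = []                                                  # compressed elements not yet stored
    i = 0
    while i + length <= len(A):
        mask = [1 if b < v else 0 for b in B[i:i + length]]       # block 1915
        pending += [a for a, m in zip(A[i:i + length], mask) if m]  # blocks 1920-1935
        if len(pending) > length:                                 # block 1940: register overflows
            memory.extend(pending[:length])                       # block 1945: store a full register
            pending = pending[length:]                            # blocks 1950-1965: keep the wrapped part
        i += length                                               # block 1970
    return memory, pending                                        # tail handled after the loop (block 1980)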

As discussed above, in some alternative embodiments copying may simply stop when the vector destination is full. When an unmasked vector element is copied from the vector source into the adjacent sequential element positions of vector destination Dest, the value of the corresponding field in the mask may also be changed to the masked value. Thus the mask values can be used to track progress and/or completion, and the instruction can be re-executed after the full destination has been stored to memory. The instruction may then be re-executed with the modified mask and a vector destination offset of zero, so that only the elements still requiring compression by the vector compress and rotate instruction are compressed.

Figure 19B illustrates a flow diagram for an alternative embodiment of a process 1902 to provide vector compress and rotate functionality in a benchmark application. In processing block 1911 of process 1902, a variable i is set to zero (0), offset is set to zero (0), and last-i is set to last minus the length of the vector register. In processing block 1915, the top value v in each element of the vector TopVal[Length:0], e.g., of vector register 1510, is compared with each element of the vector B[Length+i:i] to determine whether the elements of B are less than the top value v, and a mask, e.g., mask 1520, is generated to store the results. In processing block 1920, a count of the number of bit positions in the mask set to the unmasked value is stored to count. In processing block 1926, the elements of vector A[Length+i:i] are compressed and filled according to the unmasked bits set in the mask, and are stored to DestA[Length:offset]. Then, in processing block 1931, the value of count is added to the value of offset.

In processing block 1941, it is determined whether offset has become greater than the length, Length, i.e., the number of elements held in vector register DestA. If not, processing proceeds to processing block 1970, where the value Length is added to i. Otherwise, processing proceeds to processing block 1945, where DestA[Length:0] is stored to the location indicated by the memory pointer. In processing block 1951, the value of Length is subtracted from the value of offset. In processing block 1956, the elements of vector A[Length+i:i] are compressed and filled according to the unmasked bits set in the updated mask, and are stored to DestA[Length:0]. In processing block 1960, the memory pointer is incremented by Length, i.e., if the vector elements are four bytes long, the value of the memory pointer is increased by four times the value of Length. Next, processing proceeds to processing block 1970, where the value Length is added to i. Then, in processing block 1975, it is determined whether i is greater than last-i. If so, only a few remaining elements are left to consider in order to complete the process, which occurs in processing block 1980. Otherwise, processing begins another iteration at processing block 1915.
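
A similarly simplified sketch for process 1902 follows, reusing the compress_fill helper sketched after Figure 14B above; the offset bookkeeping mirrors blocks 1926 through 1956, and all names remain assumptions of this illustration.

def process_1902(A, B, v, length, memory):
    """Illustrative sketch in the spirit of process 1902: compress each chunk
    with a fill-style compress that stops at the top of the register; when the
    register overflows, store it and re-compress the leftover elements, whose
    mask bits were not yet cleared, at an offset of zero."""
    dest_a = [0] * length
    offset, i = 0, 0
    while i + length <= len(A):
        mask = [1 if b < v else 0 for b in B[i:i + length]]   # block 1915
        count = sum(mask)                                     # block 1920
        chunk = A[i:i + length]
        compress_fill(chunk, mask, dest_a, offset)            # block 1926: fill, clearing copied mask bits
        offset += count                                       # block 1931
        if offset > length:                                   # block 1941
            memory.extend(dest_a)                             # block 1945: store the full register
            offset -= length                                  # block 1951
            dest_a = [0] * length
            compress_fill(chunk, mask, dest_a, 0)             # block 1956: updated mask, offset zero
        i += length                                           # block 1970
    return memory, dest_a, offset                             # tail handled at block 1980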

Embodiments of the invention relate to SIMD vector compress and rotate instructions that can be used to provide vector compress functionality in applications that are otherwise not easily vectorized, such as in the inner loop of the 444.NAMD benchmark application of the SPEC benchmark suite, thereby reducing the number of expensive sequential stores to external memory, increasing performance, and reducing power use. In some embodiments, mask values may be used to track progress and/or completion, and the instruction may be re-executed after a full destination has been stored to memory; the re-executed instruction uses the modified mask and an offset of zero to compress only those elements still requiring compression by the vector compress and rotate instruction. Alternative embodiments zero the vector destination elements to which no element is copied from the vector source.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represent various logic within the processor and which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or part on and part off the processor.

Thus, techniques for performing one or more instructions according to at least one embodiment are disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of, and not restrictive on, the broad invention, and that this invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail as facilitated by enabling technological advancements, without departing from the principles of the present disclosure or the scope of the accompanying claims.

Claims (36)

1. A processor, comprising:
a mask register including a first plurality of data fields, wherein each of the first plurality of data fields in the mask register corresponds to an element position in a vector;
a decode stage to decode a first instruction specifying a vector source operand, the mask register, a vector destination operand, and a vector destination offset; and
one or more execution units, in response to the decoded first instruction, to:
read a plurality of values of the first plurality of data fields in the mask register;
for a first value in the first plurality of data fields in the mask register, copy a corresponding first vector element from the vector source operand into first adjacent sequential element positions in the vector destination operand, the first vector element being at the vector destination offset position;
after the corresponding first vector element has been copied from the vector source operand into the first adjacent sequential element positions of the vector destination operand, change the first value in the mask register from a first unmasked value to a first masked value;
for a second value in the first plurality of data fields in the mask register, copy a corresponding second vector element from the vector source operand into second adjacent sequential element positions in the vector destination operand; and
after the corresponding second vector element has been copied from the vector source operand into the second adjacent sequential element positions of the vector destination operand:
change the second value in the mask register from a second unmasked value to a second masked value, the first masked value and the second masked value to track completion progress of the decoded first instruction;
determine that the vector destination operand is full and store the vector destination operand to memory;
set the vector destination offset to zero; and
re-execute the first instruction using the first masked value, the second masked value, and the vector destination offset to compress a third vector element.
2. The processor of claim 1, wherein the corresponding first and second vector elements from the vector source operand are copied into the adjacent sequential element positions modulo the total number of element positions in the vector destination operand.
3. The processor of claim 2, wherein the first instruction is a vector compress and rotate instruction.
4. The processor of claim 1, wherein the corresponding first and second vector elements from the vector source operand are copied into the adjacent sequential element positions beginning at the vector destination offset position, only until the most significant vector destination element position has been filled.
5. The processor of claim 4, wherein the first instruction is a vector compress, fill and rotate instruction.
6. The processor of claim 1, wherein the first unmasked value is one.
7. The processor of claim 5, wherein the second masked value is zero.
8. The processor of claim 1, wherein the first vector element and the second vector element copied into the vector destination operand are 32-bit data elements.
9. The processor of claim 1, wherein the first vector element and the second vector element copied into the vector destination operand are 64-bit data elements.
10. The processor of claim 1, wherein the vector destination operand is a 128-bit vector register.
11. The processor of claim 1, wherein the vector destination operand is a 256-bit vector register.
12. The processor of claim 1, wherein the vector destination operand is a 512-bit vector register.
13. A method of executing a first instruction, comprising:
reading a plurality of values of a first plurality of data fields in a mask register;
for a first value in the first plurality of data fields in the mask register, copying a corresponding first vector element from a vector source operand into first adjacent sequential element positions in a vector destination operand, the first vector element being at a vector destination offset position;
after the corresponding first vector element has been copied from the vector source operand into the first adjacent sequential element positions of the vector destination operand, changing the first value in the mask register from a first unmasked value to a first masked value;
for a second value in the first plurality of data fields in the mask register, copying a corresponding second vector element from the vector source operand into second adjacent sequential element positions in the vector destination operand; and
after the corresponding second vector element has been copied from the vector source operand into the second adjacent sequential element positions of the vector destination operand:
changing the second value in the mask register from a second unmasked value to a second masked value, the first masked value and the second masked value being used to track completion progress of the first instruction;
determining that the vector destination operand is full and storing the vector destination operand to memory;
setting the vector destination offset to zero; and
re-executing the first instruction using the first masked value, the second masked value, and the vector destination offset to compress a third vector element.
14. The method of claim 13, wherein the corresponding first and second vector elements from the vector source operand are copied into the adjacent sequential element positions modulo the total number of element positions in the vector destination operand.
15. The method of claim 13, wherein the corresponding first and second vector elements from the vector source operand are copied into the adjacent sequential element positions beginning at the vector destination offset position, only until the most significant vector destination element position has been filled.
16. The method of claim 13, wherein the first vector element and the second vector element stored to the vector destination operand are 32-bit data elements.
17. The method of claim 13, wherein the first vector element and the second vector element stored to the vector destination operand are 64-bit data elements.
18. The method of claim 13, wherein the vector destination operand is a 128-bit vector register.
19. The method of claim 13, wherein the vector destination operand is a 256-bit vector register.
20. The method of claim 13, wherein the vector destination operand is a 512-bit vector register.
21. A processor, comprising:
a decode stage to decode a first single-instruction multiple-data (SIMD) instruction specifying a vector source operand, a mask register, a vector destination operand, and a vector destination offset; and
one or more execution units, in response to the decoded first SIMD instruction, to:
read a plurality of values of a first plurality of data fields in the mask register;
for a first value in the first plurality of data fields in the mask register, copy a corresponding first vector element from the vector source operand into first adjacent sequential element positions in the vector destination operand beginning at the vector destination offset position, modulo the total number of element positions in the vector destination operand;
after the corresponding first vector element has been copied from the vector source operand into the first adjacent sequential element positions of the vector destination operand, change the first value in the mask register from a first unmasked value to a first masked value;
for a second value in the first plurality of data fields in the mask register, copy a corresponding second vector element from the vector source operand into second adjacent sequential element positions in the vector destination operand, modulo the total number of element positions in the vector destination operand; and
after the corresponding second vector element has been copied from the vector source operand into the second adjacent sequential element positions of the vector destination operand:
change the second value in the mask register from a second unmasked value to a second masked value, the first masked value and the second masked value to track completion progress of the decoded first SIMD instruction;
determine that the vector destination operand is full and store the vector destination operand to memory;
set the vector destination offset to zero; and
re-execute the first SIMD instruction using the first masked value, the second masked value, and the vector destination offset to compress a third vector element.
22. The processor of claim 21, wherein the vector destination operand is a 128-bit vector register.
23. The processor of claim 21, wherein the vector destination operand is a 256-bit vector register.
24. The processor of claim 21, wherein the vector destination operand is a 512-bit vector register.
25. A processor, comprising:
a decode stage to decode a first single-instruction multiple-data (SIMD) instruction specifying a vector source operand, a mask register, a vector destination operand, and a vector destination offset; and
one or more execution units, in response to the decoded first SIMD instruction, to:
read a plurality of values of a first plurality of data fields in the mask register;
for a first value in the first plurality of data fields in the mask register, copy a corresponding first vector element from the vector source operand into first adjacent sequential element positions in the vector destination operand, the first vector element being at the vector destination offset position;
after the corresponding first vector element has been copied from the vector source operand into the first adjacent sequential element positions of the vector destination operand, change the first value in the mask register from a first unmasked value to a first masked value;
for a second value in the first plurality of data fields in the mask register, copy a corresponding second vector element from the vector source operand into second adjacent sequential element positions in the vector destination operand; and
after the corresponding second vector element has been copied from the vector source operand into the second adjacent sequential element positions of the vector destination operand:
change the second value in the mask register from a second unmasked value to a second masked value, the first masked value and the second masked value to track completion progress of the decoded first SIMD instruction;
determine that the vector destination operand is full and store the vector destination operand to memory;
set the vector destination offset to zero; and
re-execute the first SIMD instruction using the first masked value, the second masked value, and the vector destination offset to compress a third vector element.
26. The processor of claim 25, wherein the first masked value and the second masked value are zero.
27. A processing system, comprising:
a memory; and
a plurality of processors, each processor comprising:
a decode stage to decode a first SIMD instruction specifying a vector source operand, a mask operand, a vector destination operand, and a vector destination offset; and
one or more execution units, in response to the decoded first SIMD instruction, to:
read a plurality of values of a first plurality of data fields in a mask register;
for a first value in the first plurality of data fields in the mask register, copy a corresponding first vector element from the vector source operand into first adjacent sequential element positions in the vector destination operand, the first vector element being at the vector destination offset position;
after the corresponding first vector element has been copied from the vector source operand into the first adjacent sequential element positions of the vector destination operand, change the first value in the mask register from a first unmasked value to a first masked value;
for a second value in the first plurality of data fields in the mask register, copy a corresponding second vector element from the vector source operand into second adjacent sequential element positions in the vector destination operand; and
after the corresponding second vector element has been copied from the vector source operand into the second adjacent sequential element positions of the vector destination operand:
change the second value in the mask register from a second unmasked value to a second masked value, the first masked value and the second masked value to track completion progress of the decoded first SIMD instruction;
determine that the vector destination operand is full and store the vector destination operand to memory;
set the vector destination offset to zero; and
re-execute the first SIMD instruction using the first masked value, the second masked value, and the vector destination offset to compress a third vector element.
28. The processing system of claim 27, wherein the corresponding first and second vector elements from the vector source operand are copied into the adjacent sequential element positions modulo the total number of element positions in the vector destination operand.
29. The processing system of claim 27, wherein the one or more execution units, further in response to the first SIMD instruction, are to:
zero the value of each vector destination element that does not correspond to a vector element copied from the vector source operand.
30. The processing system of claim 27, wherein the corresponding first and second vector elements from the vector source operand are copied into the adjacent sequential element positions beginning at the vector destination offset position, only until the most significant vector destination element positions have been filled.
31. The processing system of claim 27, wherein the first masked value and the second masked value are zero.
32. The processing system of claim 27, wherein the first vector element and the second vector element stored to the vector destination operand are 32-bit data elements.
33. The processing system of claim 27, wherein the first vector element and the second vector element stored to the vector destination operand are 64-bit data elements.
34. The processing system of claim 27, wherein the vector destination operand is a 128-bit vector register.
35. The processing system of claim 27, wherein the vector destination operand is a 256-bit vector register.
36. The processing system of claim 27, wherein the vector destination operand is a 512-bit vector register.
CN201310524909.2A 2012-10-30 2013-10-30 Instruction and the logic of vector compression and spinfunction are provided CN103793201B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/664,401 US9606961B2 (en) 2012-10-30 2012-10-30 Instruction and logic to provide vector compress and rotate functionality
US13/664,401 2012-10-30

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710568910.3A CN107729048A (en) 2012-10-30 2013-10-30 Instruction and the logic of vector compression and spinfunction are provided

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201710568910.3A Division CN107729048A (en) 2012-10-30 2013-10-30 Instruction and the logic of vector compression and spinfunction are provided

Publications (2)

Publication Number Publication Date
CN103793201A CN103793201A (en) 2014-05-14
CN103793201B true CN103793201B (en) 2017-08-11

Family

ID=49680020

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201310524909.2A CN103793201B (en) 2012-10-30 2013-10-30 Instruction and the logic of vector compression and spinfunction are provided
CN201710568910.3A CN107729048A (en) 2012-10-30 2013-10-30 Instruction and the logic of vector compression and spinfunction are provided

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710568910.3A CN107729048A (en) 2012-10-30 2013-10-30 Instruction and the logic of vector compression and spinfunction are provided

Country Status (7)

Country Link
US (2) US9606961B2 (en)
JP (1) JP5739961B2 (en)
KR (1) KR101555412B1 (en)
CN (2) CN103793201B (en)
DE (1) DE102013018238A1 (en)
GB (1) GB2507655B (en)
TW (1) TWI610236B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378560B2 (en) * 2011-06-17 2016-06-28 Advanced Micro Devices, Inc. Real time on-chip texture decompression using shader processors
US9715385B2 (en) 2013-01-23 2017-07-25 International Business Machines Corporation Vector exception code
US9823924B2 (en) 2013-01-23 2017-11-21 International Business Machines Corporation Vector element rotate and insert under mask instruction
US9804840B2 (en) 2013-01-23 2017-10-31 International Business Machines Corporation Vector Galois Field Multiply Sum and Accumulate instruction
US9778932B2 (en) 2013-01-23 2017-10-03 International Business Machines Corporation Vector generate mask instruction
US9471308B2 (en) 2013-01-23 2016-10-18 International Business Machines Corporation Vector floating point test data class immediate instruction
US9513906B2 (en) 2013-01-23 2016-12-06 International Business Machines Corporation Vector checksum instruction
US10133570B2 (en) * 2014-09-19 2018-11-20 Intel Corporation Processors, methods, systems, and instructions to select and consolidate active data elements in a register under mask into a least significant portion of result, and to indicate a number of data elements consolidated
KR102106889B1 (en) * 2014-12-11 2020-05-07 한화디펜스 주식회사 Mini Integrated-control device
US20160188333A1 (en) * 2014-12-27 2016-06-30 Intel Coporation Method and apparatus for compressing a mask value
US20170177348A1 (en) * 2015-12-21 2017-06-22 Intel Corporation Instruction and Logic for Compression and Rotation
US10162752B2 (en) * 2016-09-22 2018-12-25 Qualcomm Incorporated Data storage at contiguous memory addresses
EP3336692B1 (en) * 2016-12-13 2020-04-29 Arm Ltd Replicate partition instruction
US20190369992A1 (en) * 2017-02-17 2019-12-05 Intel Corporation Vector instruction for accumulating and compressing values based on input mask
CN107748674A (en) * 2017-09-07 2018-03-02 中国科学院微电子研究所 The information processing system of Bit Oriented granularity
US10831497B2 (en) * 2019-01-31 2020-11-10 International Business Machines Corporation Compression/decompression instruction specifying a history buffer to be used in the compression/decompression of data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4490786A (en) * 1981-06-19 1984-12-25 Fujitsu Limited Vector processing unit
CN101482810A (en) * 2007-12-26 2009-07-15 英特尔公司 Methods, apparatus, and instructions for processing vector data

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0731669B2 (en) * 1986-04-04 1995-04-10 株式会社日立製作所 Vector processor
JPH01284972A (en) 1988-05-11 1989-11-16 Koufu Nippon Denki Kk Vector processor
JPH02190968A (en) 1989-01-19 1990-07-26 Nec Corp Vector processor
JP2665111B2 (en) * 1992-06-18 1997-10-22 日本電気株式会社 Vector processing equipment
US5832290A (en) * 1994-06-13 1998-11-03 Hewlett-Packard Co. Apparatus, systems and method for improving memory bandwidth utilization in vector processing systems
US6091768A (en) * 1996-02-21 2000-07-18 Bru; Bernard Device for decoding signals of the MPEG2 type
US5935198A (en) 1996-11-22 1999-08-10 S3 Incorporated Multiplier with selectable booth encoders for performing 3D graphics interpolations with two multiplies in a single pass through the multiplier
US6052769A (en) * 1998-03-31 2000-04-18 Intel Corporation Method and apparatus for moving select non-contiguous bytes of packed data in a single instruction
US6621428B1 (en) * 2000-05-04 2003-09-16 Hewlett-Packard Development Company, L.P. Entropy codec for fast data compression and decompression
US7054330B1 (en) * 2001-09-07 2006-05-30 Chou Norman C Mask-based round robin arbitration
JP2004302647A (en) * 2003-03-28 2004-10-28 Seiko Epson Corp Vector processor and address designation method for register
US8595394B1 (en) * 2003-06-26 2013-11-26 Nvidia Corporation Method and system for dynamic buffering of disk I/O command chains
US20050289329A1 (en) * 2004-06-29 2005-12-29 Dwyer Michael K Conditional instruction for a single instruction, multiple data execution engine
US7984273B2 (en) * 2007-12-31 2011-07-19 Intel Corporation System and method for using a mask register to track progress of gathering elements from memory
US9940138B2 (en) * 2009-04-08 2018-04-10 Intel Corporation Utilization of register checkpointing mechanism with pointer swapping to resolve multithreading mis-speculations
US20120216011A1 (en) * 2011-02-18 2012-08-23 Darryl Gove Apparatus and method of single-instruction, multiple-data vector operation masking
US20120254592A1 (en) 2011-04-01 2012-10-04 Jesus Corbal San Adrian Systems, apparatuses, and methods for expanding a memory source into a destination register and compressing a source register into a destination memory location
US20120254589A1 (en) * 2011-04-01 2012-10-04 Jesus Corbal San Adrian System, apparatus, and method for aligning registers
US20130151822A1 (en) * 2011-12-09 2013-06-13 International Business Machines Corporation Efficient Enqueuing of Values in SIMD Engines with Permute Unit
WO2013095598A1 (en) * 2011-12-22 2013-06-27 Intel Corporation Apparatus and method for mask register expand operation
CN104025020B (en) * 2011-12-23 2017-06-13 英特尔公司 System, device and method for performing masked bits compression
US9244687B2 (en) * 2011-12-29 2016-01-26 Intel Corporation Packed data operation mask comparison processors, methods, systems, and instructions
US8972697B2 (en) * 2012-06-02 2015-03-03 Intel Corporation Gather using index array and finite state machine
US8959275B2 (en) * 2012-10-08 2015-02-17 International Business Machines Corporation Byte selection and steering logic for combined byte shift and byte permute vector unit
US9411593B2 (en) * 2013-03-15 2016-08-09 Intel Corporation Processors, methods, systems, and instructions to consolidate unmasked elements of operation masks
US20150186136A1 (en) * 2013-12-27 2015-07-02 Tal Uliel Systems, apparatuses, and methods for expand and compress
US10133570B2 (en) * 2014-09-19 2018-11-20 Intel Corporation Processors, methods, systems, and instructions to select and consolidate active data elements in a register under mask into a least significant portion of result, and to indicate a number of data elements consolidated

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4490786A (en) * 1981-06-19 1984-12-25 Fujitsu Limited Vector processing unit
CN101482810A (en) * 2007-12-26 2009-07-15 英特尔公司 Methods, apparatus, and instructions for processing vector data

Also Published As

Publication number Publication date
US20170192785A1 (en) 2017-07-06
TW201435733A (en) 2014-09-16
GB2507655A (en) 2014-05-07
GB2507655B (en) 2015-06-24
GB201318167D0 (en) 2013-11-27
JP2014089699A (en) 2014-05-15
US9606961B2 (en) 2017-03-28
CN103793201A (en) 2014-05-14
KR20140056082A (en) 2014-05-09
CN107729048A (en) 2018-02-23
DE102013018238A1 (en) 2014-04-30
KR101555412B1 (en) 2015-10-06
US20140122831A1 (en) 2014-05-01
US10459877B2 (en) 2019-10-29
JP5739961B2 (en) 2015-06-24
TWI610236B (en) 2018-01-01

Similar Documents

Publication Publication Date Title
JP6567285B2 (en) Instruction and logic circuit for processing character strings
JP6227621B2 (en) Method and apparatus for fusing instructions to provide OR test and AND test functions for multiple test sources
US9696993B2 (en) Instructions and logic to vectorize conditional loops
US10162637B2 (en) Methods, apparatus, instructions and logic to provide permute controls with leading zero count functionality
US10152325B2 (en) Instruction and logic to provide pushing buffer copy and store functionality
JP6351682B2 (en) Apparatus and method
JP6344614B2 (en) Instructions and logic to provide advanced paging capabilities for secure enclave page caches
JP6126162B2 (en) Method and apparatus for performing shift-and-exclusive OR operation with a single instruction
CN103562856B (en) The pattern that strides for data element is assembled and the scattered system of the pattern that strides of data element, device and method
TWI476695B (en) Instruction and logic to provide vector horizontal compare functionality
CN104781803B (en) It is supported for the thread migration of framework different IPs
US10459877B2 (en) Instruction and logic to provide vector compress and rotate functionality
CN104204990B (en) Accelerate the apparatus and method of operation in the processor using shared virtual memory
CN105453071B (en) For providing method, equipment, instruction and the logic of vector group tally function
JP4697639B2 (en) Instructions and logic for performing dot product operations
JP6207575B2 (en) Processor, method and processing system
CN103827813B (en) For providing vector scatter operation and the instruction of aggregation operator function and logic
US9411592B2 (en) Vector address conflict resolution with vector population count functionality
TWI584192B (en) Instruction and logic to provide vector blend and permute functionality
KR101572770B1 (en) Instruction and logic to provide vector load-op/store-op with stride functionality
KR101767025B1 (en) Methods, apparatus, instructions and logic to provide vector address conflict detection functionality
CN104813277B (en) The vector mask of power efficiency for processor drives Clock gating
CN105359129B (en) For providing the method, apparatus, instruction and the logic that are used for group's tally function of gene order-checking and comparison
CN103959236B (en) For providing the vector laterally processor of majority voting function, equipment and processing system
CN104350492B (en) Cumulative vector multiplication is utilized in big register space

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant