CN102103570A - Synchronizing SIMD vectors - Google Patents

Synchronizing SIMD vectors

Info

Publication number: CN102103570A
Authority: CN
Grant status: Application
Prior art keywords: location, storage, elements, data, execution
Application number: CN 201010619577
Other languages: Chinese (zh)
Other versions: CN102103570B (en)
Inventors: A·T·福西思, R·拉瓦尔
Original Assignee: 英特尔公司 (Intel Corporation)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector operations
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30087 Synchronisation or serialisation instructions

Abstract

A vector compare-and-exchange operation is performed by: decoding by a decoder in a processing device, a single instruction specifying a vector compare-and-exchange operation for a plurality of data elements between a first storage location, a second storage location, and a third storage location; issuing the single instruction for execution by an execution unit in the processing device; and responsive to the execution of the single instruction, comparing data elements from the first storage location to corresponding data elements in the second storage location; and responsive to determining a match exists, replacing the data elements from the first storage location with corresponding data elements from the third storage location.

Description

Synchronizing SIMD Vectors

Technical Field

[0001] The present disclosure relates to microprocessors and other processing devices and, more particularly, to the synchronization of SIMD vectors.

Background

[0002] Multiple threads and/or processing units (hereinafter referred to as agents) in systems that include, for example, multithreaded processors, multiple processing devices, and/or multi-core processors often need to share resources and data stored in the system. Care is taken to ensure that an agent accesses the most recently updated data and that an agent does not access and modify data currently associated with another agent. Further complicating this sharing of data and resources, most modern processing devices include one or more dedicated cache memories. Within multiprocessor and multi-core systems, these on-chip caches will often, and in practice generally do, contain multiple copies of a data item. Accordingly, when an agent accesses a copy of a data item, care is taken to ensure that an updated or valid data value is read.

[0003] "Cache coherency" is therefore maintained in these systems. Cache coherency refers to the synchronization of data written to, or read from, cache memories, such that any data item stored in a cache and accessed by a thread or processor is the most recent copy of that data item. Further, any data value written back from a cache to main memory should be the most current data.

[0004] One method of maintaining cache coherency, and of ensuring that an agent that needs a data item accesses the most recent value of that data item, is to implement a semaphore (e.g., a flag or lock). A lock, for example, comprises a process that is performed in response to an agent's request (e.g., in a load operation) for a particular data item from memory, to ensure synchronization between processors and/or threads. Generally, a lock is associated with a set of instructions that includes a read/load instruction, an instruction to modify the data item, and a write/store instruction. The lock, also referred to herein as a "lock sequence" or "lock operation", may include, for example: acquiring ownership of the memory location that stores the data; performing an atomic operation on the data while preventing other processes from operating on that data; and releasing ownership of the memory location after the atomic operation is performed. An atomic operation is one that is performed sequentially and in an uninterrupted manner and that, furthermore, is guaranteed either to complete or not to complete at all (i.e., the operation is indivisible).

Summary

[0005] The present invention relates to a method, comprising:

[0006] decoding, by a decoder in a processing device, a single instruction that specifies a vector compare-and-exchange operation for a plurality of data elements between a first storage location, a second storage location, and a third storage location;

[0007] issuing the single instruction for execution by an execution unit in the processing device; and, responsive to the execution of the single instruction,

[0008] comparing data elements from the first storage location to corresponding data elements in the second storage location; and, responsive to determining that a match exists,

[0009] replacing the data elements from the first storage location with corresponding data elements from the third storage location.

[0010] The present invention relates to a processor, comprising:

[0011] a storage location configured to store a plurality of first data elements, a plurality of second data elements, and a plurality of third data elements, each of the second and third data elements corresponding to one of the first data elements;

[0012] a decoder configured to decode a single instruction that specifies a vector compare-and-exchange operation for the pluralities of first, second, and third data elements; and

[0013] an execution unit coupled to the decoder to receive the decoded instruction and coupled to the storage location to perform the vector compare-and-exchange operation;

[0014] wherein, responsive to execution of the vector compare-and-exchange operation, the execution unit is configured to:

[0015] compare corresponding data elements from the pluralities of first and second data elements; and, responsive to determining that a match exists,

[0016] replace the data elements from the plurality of first data elements with corresponding data elements from the plurality of third data elements.

[0017] The present invention relates to a system, comprising:

[0018] a memory controller coupled to a first storage location configured to store a plurality of first data elements; and

[0019] a processor coupled to the memory controller, the processor comprising:

[0020] a register file configured to store a plurality of second data elements and a plurality of third data elements, each of the second and third data elements corresponding to one of the first data elements;

[0021] a decoder configured to decode a single instruction that specifies a vector compare-and-exchange operation for the pluralities of first, second, and third data elements; and

[0022] an execution unit coupled to the decoder to receive the decoded instruction and coupled to the first storage location and the register file to perform the vector compare-and-exchange operation;

[0023] wherein, responsive to execution of the vector compare-and-exchange operation, the execution unit is configured to:

[0024] compare corresponding data elements from the pluralities of first and second data elements; and, responsive to determining that a match exists,

[0025] replace the data elements from the plurality of first data elements with corresponding data elements from the plurality of third data elements; and, responsive to determining that no match exists, replace the data elements from the plurality of second data elements with corresponding data elements from the plurality of first data elements.
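The behaviour recited for the system (exchange on a match, write-back of the current values on a mismatch) can be sketched in software. The following is a minimal Python model, not the hardware implementation: the function name and list-based operands are illustrative, the match is assumed to be judged across the whole vector, and the model does not itself provide the atomicity that a single instruction would.

```python
def vector_compare_exchange(dest, test, repl):
    """Illustrative model of a vector compare-and-exchange.

    dest: first data elements (first storage location), updated on a match
    test: second data elements (expected values), updated on a mismatch
    repl: third data elements (replacement values)
    Returns True if the exchange took place.
    """
    if not (len(dest) == len(test) == len(repl)):
        raise ValueError("all three operands must have the same element count")
    # Assumption: a match means every corresponding pair of elements is equal.
    if all(d == t for d, t in zip(dest, test)):
        dest[:] = repl   # match: first elements take the third elements
        return True
    test[:] = dest       # no match: second elements take the first elements
    return False
```

On a failed exchange the caller's expected values are refreshed with what is actually stored, so a later attempt can start from the current state without a separate load.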

[0026] The present invention relates to a computer-readable medium having instructions stored thereon, the instructions operable to cause a processor device to:

[0027] decode a single instruction that specifies a vector compare-and-exchange operation for a plurality of data elements, each data element having a corresponding test element, replacement element, and mask element;

[0028] compare the data elements to the corresponding test elements if the corresponding mask elements are active; and, responsive to determining that all comparisons indicate a match,

[0029] set a flag and replace the compared data elements with the corresponding replacement elements; and, responsive to determining that all comparisons indicate a mismatch,

[0030] clear the flag and replace the compared test elements with the corresponding data elements.

Brief Description of the Drawings

[0031] FIG. 1 is a block diagram of a computing system.

[0032] FIG. 2 is a schematic diagram of a processing device as shown in FIG. 1.

[0033] FIG. 3 illustrates an encoding scheme for a single instruction multiple data (SIMD) vector compare-and-exchange instruction.

[0034] FIG. 4 is a block diagram of a first exemplary computer system for implementing the instruction format shown in FIG. 3.

[0035] FIG. 5 is a block diagram of a second exemplary computer system for implementing the instruction format shown in FIG. 3.

[0036] FIG. 6 is a block diagram of a third exemplary computer system for implementing the instruction format shown in FIG. 3.

[0037] FIG. 7 is a block diagram of a fourth exemplary computer system for implementing the instruction format shown in FIG. 3.

[0038] Other features and advantages will be apparent from the description and drawings, and from the claims.

Detailed Description

[0039] In the following description, numerous specific details are set forth, such as particular instructions, instruction formats, and devices such as registers and memory, in order to provide a thorough understanding of the examples provided herein. However, those skilled in the art will appreciate that the invention may be practiced without these specific details.

[0040] One method for determining whether a semaphore is locked (and/or for locking it) is to use a read-modify-write sequence (or operation). One concern with a read-modify-write implementation, however, is the acquisition and release of the semaphore mechanism itself. That is, when a process attempts to gain control of a shared memory space, it first reads the lock value, checks and modifies the value (if permitted), and writes the modified value back to the lock. It is generally desirable to perform the read-modify-write operation as an atomic operation (i.e., one that, once started, completes without interruption) to prevent other processes from modifying the lock value. By using an atomic operation, a process can acquire (read) the semaphore, modify the value (if permitted), and release the semaphore by initiating the write, completing the operation before another process attempts to acquire the lock.
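The read-modify-write sequence on a lock value can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class name, the `UNLOCKED`/`LOCKED` constants, and the use of `threading.Lock` to stand in for hardware atomicity are all illustrative and do not come from the text.

```python
import threading

UNLOCKED, LOCKED = 0, 1  # illustrative lock values


class LockVariable:
    """Hypothetical model of a lock value updated by an atomic read-modify-write."""

    def __init__(self):
        self.value = UNLOCKED
        # Stands in for the hardware guarantee that the sequence is indivisible.
        self._atomic = threading.Lock()

    def try_acquire(self):
        """Read the lock value, check it, and write LOCKED if it was free."""
        with self._atomic:
            old = self.value         # read
            if old == UNLOCKED:      # check (modify only if permitted)
                self.value = LOCKED  # write the modified value back
            return old == UNLOCKED

    def release(self):
        """Write the unlocked value back to the lock."""
        with self._atomic:
            self.value = UNLOCKED
```

Because the read, check, and write happen inside one indivisible step, a second agent calling `try_acquire` between them cannot observe the lock as free.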

[0041] Referring now to FIG. 1, a computer system 10 is shown having a plurality of processing units 11 (e.g., processors, cores, execution units, etc.) coupled to a memory 12 (e.g., registers, cache, RAM, etc.) by a bus 13. One or more of the processing units 11 are associated with one or more threads. Accordingly, computer system 10 includes any suitable number of processing units 11, each having any suitable number of threads. The processing units 11 may each form part of a separate integrated circuit device or, alternatively, all of the processing units 11 (or a portion thereof) may be formed on a single die. In this particular computer system, four processing units 11 (designated P1, P2, P3, and P4) are shown as part of system 10. All four processing units 11 are coupled to memory 12 and, specifically, to a shared memory space 15 within memory 12.

[0042] It will be appreciated that memory 12 can be configured in a variety of ways. Although shown as a single memory, memory 12 may comprise multiple internal and/or external memories. In the particular example, all four processing units 11 access memory 12, and a portion of memory 12, designated shared space 15, is accessed by more than one processing unit 11. There may be other shared areas within memory 12 that two or more processing units 11 are able to access. Non-shared areas of memory 12 are generally relegated to access by a single processing unit 11 only.

[0043] Computer system 10 shown in FIG. 1 is intended to be an exemplary computer system and may include many additional components, which have been omitted for clarity. By way of example, computer system 10 may include a DMA (direct memory access) controller, a network interface (e.g., a network card), a chipset associated with one or more of the processing units 11, as well as additional signal lines and buses. It should also be understood that computer system 10 may not include all of the components shown in FIG. 1.

[0044] In FIG. 1, the semaphores employed are locks (or lock variables) 16, which are assigned to control access to one or more respective shared spaces 15 (as shown by dashed line 14). A lock 16 is a particular location in memory that is assigned to contain a value associated with obtaining access to shared space 15. Thus, for one of the processing units 11 to access shared space 15, it first accesses the corresponding lock 16 and tests the state (value) of the data stored in lock location 16. In the simplest form, two values may be assigned to lock 16. The first value indicates that the shared space is available for access, and the second value indicates that the shared space is currently in use and is therefore not available for access. Again, in the simplest embodiment, bit states 1 and 0 may be used for the locked and unlocked states of lock 16.

[0045] It will be appreciated that the actual lock values and lock states of lock 16 are a design choice, and many variations are conceivable. Also, the location of lock 16 need not be within memory 12 itself. Furthermore, with reference to FIG. 1, it will be appreciated that memory 12 can be one of a variety of memory devices. It is also possible for one or more of the processing units 11 to be replaced by one or more memory-accessing devices (such as a direct memory access controller) that also access memory. In those instances, the devices would function similarly to the processing units 11 described herein for gaining access to shared space 15. Finally, although only a single bus 13 is shown, there may be multiple buses, at the same or a different hierarchical level than bus 13, for coupling the various devices.

[0046] Accesses to memory 12 by the processing units 11 for data transfers typically involve the use of load and store operations. A load operation transfers memory contents from the memory location accessed, while a store operation transfers data to the memory location accessed. Thus, load/store operations are used to access memory 12 and locks 16 for data transfer between the processing units 11 and memory 12. Load and store accesses are also referred to as read and write accesses, respectively.

[0047] Referring to FIGS. 1 and 2, computer system 10 includes a read-only memory (ROM) 31 and a main memory 18 coupled with processing unit 11 via a system bus 22; main memory 18 includes, for example, any suitable type of random access memory (RAM). Processing unit 11 also has a data storage device 30 coupled with it by system bus 22. Data storage device 30 includes any suitable non-volatile memory, such as a hard disk drive. Computer system 10 further includes removable storage media 32, such as a floppy disk drive, a CD-ROM drive, and/or a USB drive.

[0048] In FIG. 2, processing unit 11 includes a number of components interconnected by one or more buses, which are illustrated symbolically in FIG. 2 by a local bus 19. Local bus 19, and hence the components of processing unit 11, is coupled with a bus interface unit 23. Bus interface unit 23 couples processing unit 11 with system bus 22, enabling communication between processing unit 11 and main memory 18, as well as between processing unit 11 and an external cache 20.

[0049] Processing unit 11 includes an instruction decoder 21 coupled with local bus 19. Instruction decoder 21 receives an instruction (or instructions) associated with a program or piece of code executing on processing unit 11 and breaks the instruction down into one or more machine-level instructions/operations (uops). It should be understood that processing unit 11 may receive one or more instructions associated with a program while another processing unit 11 of computer system 10 receives one or more instructions associated with the same program. Accordingly, a program may execute on multiple processing units 11.

[0050] Processing unit 11 further includes multiple execution units, including, for example, a data access control unit (DAC) 24, a memory order buffer (MOB) 26, a register file unit 29, and a functional unit 27.

[0051] Register file unit 29 includes a plurality of registers, each having 16, 32, 64, 128, 256, or 512 bits of storage. Further, register file 29 may include one or more register files, each having one or more registers. Functional unit 27 comprises one or more functional units, such as arithmetic, logic, and/or floating-point units. MOB 26 ensures the proper ordering of load and store instructions and also provides for the proper sequencing of these transactions within the memory hierarchy (i.e., the various levels of memory within computer system 10, including L0 cache 25, L1 cache 28, external cache 20, main memory 18, and data storage device 30). Each of L0 cache 25 and L1 cache 28 can store data recently accessed, or expected to be accessed, by functional unit 27. If a data item requested by functional unit 27 is resident in one of cache memories 25, 28, a cache "hit" has occurred; if, however, the requested data is not present in a cache, a cache "miss" has occurred. One or more of the cache memories (e.g., L0 cache 25) may be coupled with DAC 24. DAC 24 controls all transactions that result in a cache miss, as well as other transactions that require special handling. A lock, as described above, is one type of transaction that requires special handling by DAC 24 and by other components of processing unit 11. If a uop corresponds to, for example, an arithmetic operation, that uop is dispatched to functional unit 27, which then performs the arithmetic operation. If a uop corresponds to a memory-referencing instruction (e.g., a load or a store), that uop is dispatched to MOB 26.
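The hit/miss behaviour of the memory hierarchy and the routing of uops described above can be mimicked with a toy Python model. Everything here is a hypothetical illustration: the uop kind names, the dictionary-based memory levels, and the returned unit labels are assumptions chosen for readability, not details from the text.

```python
MEMORY_UOPS = {"load", "store"}          # illustrative memory-referencing uops
ARITHMETIC_UOPS = {"add", "sub", "mul"}  # illustrative arithmetic uops


def dispatch(uop_kind):
    """Return the unit that would receive a uop of the given kind."""
    if uop_kind in ARITHMETIC_UOPS:
        return "functional unit 27"
    if uop_kind in MEMORY_UOPS:
        return "MOB 26"
    raise ValueError(f"unhandled uop kind: {uop_kind}")


def lookup(address, levels):
    """Search memory levels nearest-first; return (level_name, value).

    levels is an ordered list of (name, dict) pairs, e.g. L0, L1, main memory.
    A level that lacks the address is a miss, and the search moves outward.
    """
    for name, store in levels:
        if address in store:
            return name, store[address]  # hit at this level
    raise KeyError(address)              # not present at any level
```

A call such as `lookup(0x20, [("L0", {}), ("L1", {0x20: 9})])` misses in the first level and hits in the second.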

[0052] It should be understood that processing unit 11 shown in FIG. 2 is intended to represent an exemplary processing device and that the processing unit may include many additional components not shown in these figures. These components have been omitted for ease of understanding. For example, processing unit 11 may include an address generation unit, a reservation station, a reorder buffer, a scheduler, a segmentation and address translation unit, a translation lookaside buffer, a page miss handler, and/or internal clock circuitry. Also, although illustrated as discrete elements, it should be understood that many of the components shown in FIG. 2 may be combined and/or share circuitry. Most importantly, the embodiments described herein are not limited to any particular architecture or arrangement, nor to any particular terminology used to describe such an architecture or arrangement; the disclosed embodiments may be practiced on any type of processing device, irrespective of its architecture or the terminology ascribed to it.

[0053] Any one or more of the uops scheduled for execution may comprise a locked uop. As noted above, a lock corresponds to a sequence of operations (e.g., load, modify, and store) that is performed in a manner that ensures synchronization between processors and/or threads.

[0054] FIG. 3 illustrates an instruction for performing a read-modify-write operation. Instruction 40 is a single atomic instruction that includes five operands 41-45. Opcode operand 41 identifies this as a VCMPXCHG instruction. Operands 42-44 correspond to the source and destination operands associated with SRC1/DEST, SRC2, and SRC3, and, in some implementations, the instruction also includes a mask storage location (MSK) and/or offset (or "immediate") operand 45. The offset or immediate is used to provide an offset from a base address (e.g., SRC1) when addressing memory 12. The instructions shown below may also have such an offset, although it is not illustrated. Implementations that specify mask storage location 45 reference a register or memory location that stores mask elements corresponding to the respective data elements stored at the storage location referenced by the SRC1/DEST operand.
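As a reading aid, the operand layout just described can be written down as a small Python record. This is a sketch only: the class and field names are hypothetical, the mask is assumed to be an integer with one bit per data element (bit i for element i), and the actual encoding is whatever FIG. 3 defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VcmpxchgOperands:
    """Hypothetical software view of the operands of instruction 40."""
    base: int                   # base address named by SRC1/DEST
    src2: str                   # register holding the comparison values
    src3: str                   # register holding the replacement values
    mask: Optional[int] = None  # assumed bit-per-element mask; None if absent
    offset: int = 0             # optional offset (or immediate)

    def effective_address(self) -> int:
        """Apply the offset to the base address."""
        return self.base + self.offset

    def active_lanes(self, lanes: int) -> list:
        """List the element indices whose mask element is active."""
        if self.mask is None:
            return list(range(lanes))  # no mask operand: every element takes part
        return [i for i in range(lanes) if (self.mask >> i) & 1]
```

With `mask=0b0101` and four elements, only elements 0 and 2 would take part in the comparison under this assumed encoding.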

[0055] In response to instruction 40, processing unit 11 reads first source data, compares it to other source data, and, if the comparison meets a predetermined criterion (e.g., a true or match condition), writes a modified value to a location, which may be the original location of the first source data. If the predetermined criterion is not met, the original data in the location is left unchanged. The instruction uses three source operands (e.g., SRC1, SRC2, and SRC3, as used below) and one destination operand (e.g., DEST, as used below) to provide the locations of the various information used in executing the instruction. Operation-specific registers may be used to provide one or more of the source data and/or to store the destination data when the instruction is executed, removing the need to specify an operand explicitly in the actual instruction format. Furthermore, in this example, the SRC1 operand and the DEST operand refer to the same storage location (SRC1/DEST).

[0056] Prior to the execution of instruction 40, SRC1, SRC2, and SRC3 are loaded into registers of register file unit 29. For example, to safely update a value stored at the location specified by the SRC1/DEST operand, the value is first read into the location specified by the SRC2 operand, and a replacement value is read into the location specified by the SRC3 operand. An atomic compare-and-exchange operation is then performed to compare the current value associated with the SRC1/DEST operand with the value associated with the SRC2 operand (i.e., the current value may differ from the initially copied value because it was modified by another agent). If the value has not changed, it is replaced with the value associated with the SRC3 operand, and the zero flag is set to indicate a successful update. If, however, another agent modified the value between the initial copy and the compare-and-exchange operation, the current value is not replaced, and the zero flag is cleared to indicate that the update failed.
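The safe-update sequence above (copy the current value, prepare a replacement, compare-and-exchange, inspect the zero flag) can be sketched for a single value in Python. The names `Cell` and `safe_update` are hypothetical, `threading.Lock` only models the atomicity of the instruction, and retrying after a failed update is a common usage pattern assumed here rather than something the text prescribes.

```python
import threading


class Cell:
    """Hypothetical shared value with a compare-and-exchange primitive."""

    def __init__(self, value):
        self.value = value
        self._atomic = threading.Lock()  # models instruction atomicity only

    def compare_exchange(self, expected, replacement):
        """Return the modelled zero flag: True if the value was replaced."""
        with self._atomic:
            if self.value == expected:
                self.value = replacement  # unchanged since the copy: replace it
                return True
            return False                  # changed by another agent: leave it


def safe_update(cell, fn):
    """Apply fn to the shared value, retrying until the exchange succeeds."""
    while True:
        expected = cell.value        # initial copy (role of SRC2)
        replacement = fn(expected)   # replacement value (role of SRC3)
        if cell.compare_exchange(expected, replacement):
            return replacement       # zero flag set: update succeeded
        # zero flag cleared: the value changed in between, so try again
```

The loop never overwrites a value it has not seen, which is what makes the update safe without holding a lock across the computation of `fn`.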

[0057] The block diagram of FIG. 4 shows the flow of information when instruction 40 is executed. Processing unit 11 includes execution unit 46 (the DAC of FIG. 2), register file 29, BIU 23, and decoder 21, all coupled together by local bus 19. Register file 29 includes a plurality of registers that execution unit 46 accesses to perform various operations. As shown in FIG. 4, VCMPXCHG instruction 40 is shown residing in execution unit 46, with dashed lines running from the operands of the instruction to the corresponding registers associated with SRC1, SRC2, SRC3, and DEST. The registers reside in register file 29. Decoder 21 decodes the various instructions (including VCMPXCHG instruction 40) so that execution unit 46 can perform the operations.

[0058] As described earlier with reference to FIGS. 1 and 2, memory 12 is shown coupled to BIU 23 by bus 19 and/or bus 22. Data transfers between processing unit 11 and memory 12 can therefore take place through BIU 23 or local bus 19. It will be appreciated that a program routine using VCMPXCHG instruction 40 may reside in some memory, which may be, or may include, memory 12.

[0059] The following pseudocode shows an example of how VCMPXCHG instruction 40 operates. Other pseudocode, languages, operations, orders of operations, and/or numbers may also be used.

[0060]

Figure CN102103570AD00101

[0061] In the particular VEX.128 and VEX.256 examples shown above, the lock values may be stored in bits [127:0] and bits [255:0], respectively, of a 512-bit storage location (e.g., a 64-byte cache line or register) referenced by SRC1/DEST. In one embodiment, there is a one-to-one correspondence between the lock values referenced by SRC1/DEST and the shared storage locations 15 to which they correspond. For example, SRC1/DEST may reference sixteen 8-bit lock values (128 bits), each corresponding to a respective one of sixteen storage locations of a cache line or SIMD register. Alternatively, SRC1/DEST may reference thirty-two 8-bit lock values (256 bits), each corresponding to a respective one of thirty-two storage locations of a cache line or SIMD register.

[0062] Referring again to the example above, the result of the comparison between SRC1/DEST and SRC2 indicates whether the lock values have been modified. A true condition indicates that the locks have not been modified and are in the unlocked state. When this condition is met, the value referenced by SRC3 is written to SRC1/DEST, changing the lock values to the locked state and so preventing other agents from accessing the shared space (or spaces). The zero flag (ZF) is then set to indicate that the operation succeeded.

[0063] A false condition indicates that one or more locks have been modified (locked) and that another agent has already taken ownership of the shared space. When the condition is false, the value referenced by SRC1/DEST (the current lock value) is stored to SRC2, and the zero flag is cleared to indicate that the operation was unsuccessful. The upper bytes of SRC2 are then cleared before the operation returns.
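The true and false outcomes described in paragraphs [0062] and [0063] can be illustrated with a short Python sketch. It is only a model under stated assumptions: the name `vcmpxchg`, the list representation of the operands, and the encodings `UNLOCKED = 0` and `LOCKED = 1` are hypothetical, the clearing of the upper bytes of SRC2 is omitted, and atomicity is not modeled.

```python
UNLOCKED, LOCKED = 0, 1   # assumed lock encodings, for illustration only


def vcmpxchg(dest, src2, src3):
    """Unmasked vector compare-exchange model; returns the zero flag (ZF)."""
    if dest == src2:          # true condition: no lock value was modified
        dest[:] = src3        # write the SRC3 values into SRC1/DEST
        return True           # ZF set
    src2[:] = dest            # false condition: hand back the current lock values
    return False              # ZF cleared


locks = [UNLOCKED] * 16       # sixteen lock values, as in the 128-bit example

expected = [UNLOCKED] * 16
zf_first = vcmpxchg(locks, expected, [LOCKED] * 16)    # takes all sixteen locks

expected = [UNLOCKED] * 16
zf_second = vcmpxchg(locks, expected, [LOCKED] * 16)   # fails: already locked
```

After the failed second attempt, `expected` holds the current lock values, which is what lets the caller see which locks are held.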

[0064] Typically, if access is initially denied, the requesting agent keeps retrying until it obtains access. In some implementations, the outer loop includes a non-atomic load and test before VCMPXCHG instruction 40 is re-executed. Once the processor has finished accessing shared memory space 15, it typically releases control of shared memory space 15 by performing a write cycle to lock 16 to unlock it, so that other agents can now enter shared memory space 15. It will be appreciated, however, that how the processor releases the shared memory space is merely a design choice that may be dictated by the system architecture.
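The retry loop of paragraph [0064], with its non-atomic load and test ahead of each new attempt, might look like the following Python sketch. The helper names `acquire_all` and `release_all`, the `max_attempts` bound (added so the single-threaded model always terminates), and the 0/1 lock encodings are assumptions for illustration and are not taken from the patent.

```python
UNLOCKED, LOCKED = 0, 1   # assumed lock encodings, for illustration only


def vcmpxchg(dest, src2, src3):
    """Unmasked vector compare-exchange model; returns the zero flag (ZF)."""
    if dest == src2:
        dest[:] = src3
        return True
    src2[:] = dest
    return False


def acquire_all(locks, max_attempts=1000):
    """Try to take every lock in the vector, retrying up to max_attempts times."""
    for _ in range(max_attempts):
        # Non-atomic load and test: skip the compare-exchange while any lock
        # still looks held.
        if any(value != UNLOCKED for value in locks):
            continue
        expected = [UNLOCKED] * len(locks)
        if vcmpxchg(locks, expected, [LOCKED] * len(locks)):
            return True
    return False


def release_all(locks):
    """Release by a plain write of the unlocked value, as in a write cycle."""
    locks[:] = [UNLOCKED] * len(locks)


locks = [UNLOCKED] * 4
got_first = acquire_all(locks)                    # succeeds immediately
got_second = acquire_all(locks, max_attempts=3)   # fails: locks are still held
release_all(locks)
got_third = acquire_all(locks)                    # succeeds after the release
```

Testing with an ordinary load first keeps the more expensive atomic operation off the bus while the lock is visibly held, which is the usual motivation for this pattern.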

[0065] In some implementations, VCMPXCHG instruction 40 includes a mask vector having a plurality of mask elements, each corresponding to one of the plurality of data elements referenced by SRC1/DEST. The mask vector storage location may be a register in register file unit 29, such as a shadow register, control register, flag register, general-purpose register, SIMD register, or other suitable register. In one embodiment, there is a one-to-one correspondence between the data elements referenced by SRC1/DEST and the corresponding mask elements stored in the mask register. A mask element or value may include a flag, marker, tag, indicator, and/or other number, bit, and/or code indicating whether the corresponding data element (e.g., in a corresponding or referenced register location) is to be compared and/or modified. For example, a mask element with the value "1" may indicate that the corresponding data element is to be modified; otherwise "0" may be used. Other numbers or flags may also be used.

[0066] The following pseudocode shows examples of the masked VCMPXCHGD and VCMPXCHGQ instructions for a 16-wide, 512-byte vector and an 8-wide, 512-byte vector, respectively. In the masked compare implementation, only active elements are compared and updated.

[0067]

Figure CN102103570AD00111

[0068] In the particular VCMPXCHGD and VCMPXCHGQ examples shown above, the variable ALL_CMPS_SUCCEED is first preset to 1 (i.e., a true condition). Once it is set, for each active mask element (e.g., a mask element in which a particular value is stored, such as binary 1 or a hexadecimal value of 0x01, 0xFF, or 0x80), the corresponding storage location referenced by SRC1/DEST is compared with the value referenced by the corresponding bits in SRC2. If no mask is used, every storage location referenced by SRC1/DEST is compared with the value referenced by the corresponding bits in SRC2.

[0069] Again, the result of the comparison between corresponding values of SRC1/DEST and SRC2 indicates whether that particular lock value has been modified. In these examples, however, a true condition (i.e., a mismatch condition) indicates that the lock has been modified and that another agent has taken ownership of the shared storage location. When this condition is met for any one of the referenced storage locations, ALL_CMPS_SUCCEED is cleared to indicate that not all comparisons succeeded. The zero flag is then cleared, and, for each active mask element, the value stored in the corresponding storage location referenced by SRC1/DEST is loaded into the corresponding bits of SRC2.

[0070] When the result of the comparison is false (i.e., for every active mask element, the corresponding value referenced by SRC1/DEST matches the corresponding value in SRC2), ALL_CMPS_SUCCEED remains set. The zero flag (ZF) is then set, and, for each active mask element, the value stored in the corresponding storage location of SRC3 is loaded into the corresponding bits of SRC1/DEST, changing the lock values to the locked state and so preventing other agents from accessing the shared space.
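Because the pseudocode figure is not reproduced in this text, the masked behavior described in paragraphs [0068] to [0070] is sketched below in Python. It is a reconstruction from the prose only, under stated assumptions: `masked_vcmpxchg` is a hypothetical name, operands are modeled as equal-length lists, any nonzero mask element counts as active, and atomicity is not modeled.

```python
def masked_vcmpxchg(dest, src2, src3, mask):
    """Masked vector compare-exchange model; returns the zero flag (ZF)."""
    active = [i for i, m in enumerate(mask) if m]   # indices of active elements

    # ALL_CMPS_SUCCEED starts true and is cleared by any active mismatch.
    all_cmps_succeed = True
    for i in active:
        if dest[i] != src2[i]:
            all_cmps_succeed = False

    if all_cmps_succeed:
        for i in active:
            dest[i] = src3[i]     # exchange: SRC3 element into SRC1/DEST
    else:
        for i in active:
            src2[i] = dest[i]     # report current values back through SRC2
    return all_cmps_succeed


# Element 1 is held by another agent but is masked off, so the attempt succeeds.
locks = [0, 1, 0, 0]
zf_ok = masked_vcmpxchg(locks, [0, 0, 0, 0], [1, 1, 1, 1], mask=[1, 0, 1, 0])

# Here element 1 is active and mismatches, so nothing is written to the locks.
busy = [0, 1, 0, 0]
seen = [0, 0, 0, 0]
zf_fail = masked_vcmpxchg(busy, seen, [1, 1, 1, 1], mask=[1, 1, 0, 0])
```

Note that the update is all-or-nothing across the active elements: a single active mismatch prevents every write to SRC1/DEST.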

[0071] The block diagram of FIG. 5 shows another example of the flow of information when instruction 40 is executed. As shown in FIG. 5, VCMPXCHG instruction 40 is shown residing in execution unit 46, with dashed lines running from the operands of the instruction to the corresponding registers associated with SRC2, SRC3, and MSK. In this example, the mask storage location (MSK) is a mask register, and the storage location associated with SRC1/DEST is the L1 cache. The registers reside in register file unit 29.

[0072] Before instruction 40 is executed, SRC1 is prefetched into the L1 cache, and the SRC2, SRC3, and MSK data are loaded into registers of register file unit 29. The mask register stores a plurality of mask elements corresponding to respective data elements in the storage location associated with the SRC1/DEST operand. In addition, the comparison values are first read into the location specified by the SRC2 operand, and the replacement values are read into the location specified by the SRC3 operand. Instruction 40 is then executed, causing execution unit 46 to compare corresponding data elements associated with the SRC1/DEST and SRC2 operands and, if there is a match, to replace the data elements from SRC1/DEST with the corresponding data elements from SRC3. If there is no match, execution of instruction 40 causes execution unit 46 to replace the SRC2 data elements with the corresponding SRC1/DEST data elements.

[0073] In some implementations, the comparison between each pair of SRC1/DEST and SRC2 data elements is performed only when the corresponding mask element is active. In certain implementations, execution unit 46 is further configured to set a flag if there is a match between every pair of corresponding data elements whose mask element is active, and to clear the flag if there is a mismatch between any pair whose mask element is active. Furthermore, in some implementations, an SRC1/DEST data element is replaced with the corresponding SRC3 data element only when the mask element corresponding to that SRC1/DEST data element is active. Likewise, in some implementations, an SRC2 data element is replaced with the corresponding SRC1/DEST data element only when the mask element corresponding to that SRC1/DEST data element is active.

[0074] In some embodiments, the lock value that indicates the locked condition is the same as the mask value that indicates an active mask element (e.g., binary 1). In that case, SRC3 can serve as both the mask vector and the vector of replacement lock values.
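The dual use of SRC3 described in paragraph [0074] can be shown with a small Python sketch. The helper `lock_selected` and the modeled `masked_vcmpxchg` are hypothetical names, and the sketch assumes that the locked value and the active mask value are both 1 and that the unlocked value is 0.

```python
def masked_vcmpxchg(dest, src2, src3, mask):
    """Masked vector compare-exchange model; returns the zero flag (ZF)."""
    active = [i for i, m in enumerate(mask) if m]
    if all(dest[i] == src2[i] for i in active):
        for i in active:
            dest[i] = src3[i]
        return True
    for i in active:
        src2[i] = dest[i]
    return False


def lock_selected(locks, src3):
    """Take the locks selected by src3, which is also passed as the mask."""
    expected = [0] * len(locks)          # every selected lock must be unlocked
    return masked_vcmpxchg(locks, expected, src3, mask=src3)


locks = [0, 1, 0, 0]                     # element 1 is held by another agent
wanted = [1, 0, 1, 0]                    # 1 means both "active" and "locked"
zf = lock_selected(locks, wanted)
```

Passing one vector for both roles saves a register: the elements to lock are exactly the elements whose replacement value is 1.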

[0075] In some implementations, the compare-exchange operation completes without updating the value associated with the SRC2 operand. A flag (e.g., the zero flag) is then tested, and, if it indicates that the update of the value associated with the SRC1/DEST operand failed, the steps immediately preceding the compare-exchange operation are repeated so that the values associated with SRC2 and SRC3 are refreshed before the compare-exchange operation is repeated.
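The variant of paragraph [0075], in which SRC2 is not written back and the caller instead repeats the preparatory steps, can be sketched as follows. The names `vcmpxchg_no_writeback` and `update`, the `compute` callback, and the `max_attempts` bound are assumptions for illustration, and the model is single-threaded.

```python
def vcmpxchg_no_writeback(dest, src2, src3):
    """Compare-exchange model that leaves SRC2 untouched; returns ZF."""
    if dest == src2:
        dest[:] = src3
        return True
    return False


def update(dest, compute, max_attempts=100):
    """Apply compute to dest, retrying whenever the exchange reports failure."""
    for _ in range(max_attempts):
        src2 = list(dest)        # repeat the step before the exchange: re-read
        src3 = compute(src2)     # ...and recompute the replacement values
        if vcmpxchg_no_writeback(dest, src2, src3):
            return True          # flag set: the update went through
    return False                 # gave up after max_attempts failures


values = [1, 2, 3]
committed = update(values, lambda v: [x + 1 for x in v])
```

Since SRC2 is not refreshed by the instruction in this variant, the fresh copy of SRC1/DEST must come from an explicit reload on each retry.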

[0076] One or more embodiments include an article of manufacture comprising a tangible machine-accessible and/or machine-readable medium on which is stored a SIMD instruction specifying a vector compare-and-exchange operation for a plurality of data elements, each data element having a corresponding test element, replacement element, and mask element. When executed by a machine (e.g., an execution unit), the instruction causes the machine to: compare a data element with the corresponding test element if the respective mask element is active; in response to determining that all comparisons indicate a match, set a flag and replace the compared data elements with the corresponding replacement elements; and, in response to determining that not all comparisons indicate a match, clear the flag and replace the compared test elements with the corresponding data elements. The tangible medium may include one or more solid materials. The medium may include a mechanism that provides (e.g., stores) information in a form accessible by the machine. For example, the medium may optionally include recordable media such as a floppy disk, optical storage medium, optical disk, CD-ROM, magnetic disk, magneto-optical disk, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), flash memory, and combinations thereof.

[0077] Suitable machines include, but are not limited to, execution units, general-purpose processors, special-purpose processors (e.g., graphics processors and cryptographic processors), cryptographic accelerators, network communications processors, computer systems, network devices, modems, personal digital assistants (PDAs), cellular phones, and a wide variety of other electronic devices having one or more execution units, to name just a few examples. Still other embodiments pertain to a computer system, embedded system, or other electronic device having an execution unit and/or performing a method as disclosed herein.

[0078] FIG. 6 shows an example of a suitable computer system 50 that includes a processor 51. The processor includes at least one execution unit 52 capable of executing at least one vector compare-and-exchange instruction 53.

[0079] The processor is coupled to a chipset 54 via a bus (e.g., a front-side bus) or other interconnect 55. The interconnect may be used to carry data signals between the processor and other components in the system via the chipset.

[0080] The chipset includes a system logic chip known as a memory controller hub (MCH) 56. The MCH is coupled to the front-side bus or other interconnect 55.

[0081] A memory 58 is coupled to the MCH. In various embodiments, the memory may include random access memory (RAM). DRAM is an example of a type of RAM used in some, but not all, computer systems. As shown, the memory may be used to store instructions 59, such as one or more multiply instructions, and data 60.

[0082] A component interconnect 61 is also coupled with the MCH. In one or more embodiments, the component interconnect may include one or more Peripheral Component Interconnect Express (PCIe) interfaces. The component interconnect may allow other components to be coupled to the rest of the system through the chipset. One example of such a component is a graphics chip or other graphics device, although this is optional and not required.

[0083] The chipset also includes an input/output (I/O) controller hub (ICH) 62. The ICH is coupled to the MCH through a hub interface bus or other interconnect 63. In one or more embodiments, the bus or other interconnect 63 may include a Direct Media Interface (DMI).

[0084] A data storage device 64 is coupled to the ICH. In various embodiments, the data storage device may include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or the like, or a combination thereof.

[0085] A second component interconnect 65 is also coupled with the ICH. In one or more embodiments, the second component interconnect may include one or more Peripheral Component Interconnect Express (PCIe) interfaces. The second component interconnect may allow various types of components to be coupled to the rest of the system through the chipset.

[0086] A serial expansion port 66 is also coupled with the ICH. In one or more embodiments, the serial expansion port may include one or more Universal Serial Bus (USB) ports. The serial expansion port may allow various other types of input/output devices to be coupled to the rest of the system through the chipset.

[0087] A few illustrative examples of other components that may optionally be coupled with the ICH include, but are not limited to, an audio controller, a wireless transceiver, and a user input device (e.g., a keyboard or mouse).

[0088] A network controller 67 is also coupled to the ICH. The network controller may allow the system to be coupled with a network.

[0089] In one or more embodiments, the computer system may execute a version of the WINDOWS™ operating system, available from Microsoft Corporation of Redmond, Washington. Alternatively, other operating systems, such as UNIX, Linux, or embedded systems, may be used.

[0090] This is just one particular example of a suitable computer system. For example, in one or more alternative embodiments, the processor may have multiple cores. As another example, in one or more alternative embodiments, MCH 56 may be physically integrated on-die with processor 51, and the processor may be directly coupled with memory 58 through the integrated MCH. As a further example, in one or more alternative embodiments, other components may be integrated on-die with the processor, for example to provide a system-on-chip (SoC) design. As yet another example, in one or more alternative embodiments, the computer system may have multiple processors.

[0091] FIG. 7 is another example of a suitable computer system 70. This second example embodiment has certain similarities to computer system 50 described above. For clarity, the discussion will tend to emphasize the differences without repeating all of the similarities.

[0092] Like computer system 50, computer system 70 includes a processor 71 and a chipset 74 having an I/O controller hub (ICH) 72. Computer system 70 also includes a first component interconnect 81 coupled with chipset 74, a second component interconnect 85 coupled with the ICH, a serial expansion port 86 coupled with the ICH, a network controller 87 coupled with the ICH, and a data storage device 84 coupled with the ICH.

[0093] Processor 71 is a multi-core processor and includes processor cores 72-1 through 72-M, where M is an integer equal to or greater than 2 (e.g., 2, 4, 7, or more). Each core may include at least one execution unit capable of executing at least one embodiment of the instructions disclosed herein. As shown, core 1 includes a cache 88 (e.g., an L1 cache). Each of the other cores may similarly include a dedicated cache. The processor cores may be implemented on a single integrated circuit (IC) chip.

[0094] The processor also includes at least one shared cache 89. The shared cache may store data (e.g., instructions) used by one or more components of the processor (e.g., the cores). For example, the shared cache may locally cache data stored in memory 78 for faster access by components of the processor. In one or more embodiments, the shared cache may include one or more mid-level caches (e.g., a level 2 (L2), level 3 (L3), level 4 (L4), or other level of cache), a last-level cache (LLC), and/or combinations thereof.

[0095] The processor cores and the shared cache are each coupled with a bus or other interconnect 90. The bus or other interconnect may couple the cores and the shared cache and allow them to communicate.

[0096] The processor also includes a memory controller hub (MCH) 76. As shown in this example embodiment, the MCH is integrated with processor 71. For example, the MCH may be on-die with the processor cores. The processor is coupled with memory 78 through the MCH. In one or more embodiments, the memory may include DRAM, although this is not required.

[0097] The chipset includes an input/output (I/O) hub 91. The I/O hub is coupled with the processor through a bus (e.g., a QuickPath Interconnect (QPI)) or other interconnect 75. The first component interconnect 81 is coupled with I/O hub 91.

[0098] This is just one particular example of a suitable system. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or an execution unit as disclosed herein are suitable.

[0099] In the description above, for purposes of explanation, numerous specific details have been set forth to provide a thorough understanding of embodiments of the invention. It will be apparent to one skilled in the art, however, that one or more other embodiments may be practiced without some of these specific details. The particular embodiments described are provided not to limit the invention but to illustrate embodiments of it. The scope of the invention is determined not by the specific examples provided above but only by the appended claims. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form or without detail to avoid obscuring the description. Where considered appropriate, reference numerals or terminal portions of reference numerals have been repeated among the figures to indicate corresponding or analogous elements, which may optionally have similar characteristics.

[0100] Certain operations may be performed by hardware components, or may be embodied in machine-executable instructions that may be used to cause, or at least result in, a circuit or hardware programmed with the instructions performing the operations. The circuit may include a general-purpose or special-purpose processor, or a logic circuit, to name just a few examples. The operations may also optionally be performed by a combination of hardware and software. An execution unit and/or a processor may include specific or particular circuitry or other logic that is responsive to a machine instruction, or to one or more control signals derived from the machine instruction, to store an instruction-specified result operand.

[0101] It should also be appreciated that reference throughout this specification to "one embodiment", "an embodiment", or "one or more embodiments", for example, means that a particular feature may be included in the practice of embodiments of the invention. Similarly, it should be appreciated that in the description various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects may lie in fewer than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

[0102] A number of embodiments of the invention have been described above. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, the computer system need not be limited to one having multiple processors or memory access devices. The invention can readily be used in a single-processor system in which read-modify-write instructions are implemented.

[0103] It will also be appreciated that access control for a shared region of memory may be implemented in ways other than the test-and-set sequence described in the examples above. For example, a simple counter may be used, in which each access increments a specified count.

[0104] It should also be appreciated that the VCMPXCHG instruction of the preferred embodiment performs a read-modify-write operation, but that the modify and write phases are implemented essentially as a single step. Rather than computing the modified value after reading the original data and then writing it, the modified value is preset for use by the VCMPXCHG instruction. Although its use depends on the decision reached when the comparison is made, this preset modified value (SRC3) can be written to the destination immediately to modify the destination value.

[0105] Thus, a technique for implementing a vector compare-and-exchange operation with masking has been described. It should be appreciated that the VCMPXCHG instruction and implementations described herein may also be used in other capacities and need not be limited to controlling access to a shared memory space. For example, the VCMPXCHG instruction may be used for speculative execution, in which a SIMD operation is performed on a plurality of data elements and the results are written to the shared memory space only if those data elements have not been modified by another agent during the course of the operation. Accordingly, other embodiments are within the scope of the appended claims.

Claims (22)

  1. A method, comprising: decoding, by a decoder in a processing device, a single instruction that specifies a vector compare-and-exchange operation for a plurality of data elements among a first storage location, a second storage location, and a third storage location; issuing the single instruction for execution by an execution unit in the processing device; and, in response to execution of the single instruction, comparing data elements from the first storage location with corresponding data elements in the second storage location; and, in response to determining that a match exists, replacing the data elements from the first storage location with corresponding data elements from the third storage location.
  2. The method of claim 1, wherein the single instruction further specifies a mask storage location for storing a plurality of mask elements corresponding to respective data elements in the first storage location.
  3. The method of claim 2, wherein comparing data elements from the first storage location with corresponding data elements in the second storage location comprises: comparing a data element from the first storage location with the corresponding data element in the second storage location when the mask element corresponding to the data element from the first storage location is active.
  4. The method of claim 2, wherein replacing the data elements from the first storage location with corresponding data elements from the third storage location comprises: replacing a data element from the first storage location with the corresponding data element from the third storage location when the mask element corresponding to the data element from the first storage location is active.
  5. The method of claim 1, further comprising: when no match exists, replacing data elements from a plurality of second data elements with corresponding data elements from the first storage location.
  6. The method of claim 5, wherein the single instruction further specifies a mask storage location for storing a plurality of mask elements corresponding to respective data elements in the first storage location.
  7. The method of claim 6, wherein comparing data elements from the first storage location with corresponding data elements in the second storage location comprises: comparing a data element from the first storage location with the corresponding data element in the second storage location when the mask element corresponding to the data element from the first storage location is active.
  8. The method of claim 6, wherein replacing the data elements from the first storage location with corresponding data elements from the third storage location comprises: replacing a data element from the first storage location with the corresponding data element from the third storage location when the mask element corresponding to the data element from the first storage location is active.
  9. The method of claim 6, wherein replacing the data elements from the second storage location with corresponding data elements from the first storage location comprises: replacing a data element from the second storage location with the corresponding data element from the first storage location when the mask element corresponding to the data element from the first storage location is active.
  10. A processor, comprising: storage locations configured to store a plurality of first data elements, a plurality of second data elements, and a plurality of third data elements, each of the second and third data elements corresponding to one of the first data elements; a decoder configured to decode a single instruction that specifies a vector compare-and-exchange operation on the pluralities of first, second, and third data elements; and an execution unit coupled to the decoder to receive the decoded instruction and coupled to the storage locations to perform the vector compare-and-exchange operation; wherein, in response to execution of the vector compare-and-exchange operation, the execution unit is configured to: compare corresponding data elements from the pluralities of first and second data elements; and, in response to determining that a match exists, replace data elements from the plurality of first data elements with corresponding data elements from the plurality of third data elements.
  11. The processor of claim 10, wherein, in response to execution of the vector compare-and-exchange operation, the execution unit is further configured to: if no match exists, replace data elements from the plurality of second data elements with corresponding data elements from the plurality of first data elements.
  12. The processor of claim 11, wherein the single instruction further specifies a mask storage location for storing a plurality of mask elements corresponding to respective data elements of the plurality of first data elements.
  13. The processor of claim 12, wherein the execution unit is further configured to compare corresponding data elements from the pluralities of first and second data elements when the respective mask element is active.
  14. The processor of claim 12, wherein the execution unit is configured to replace data elements from the plurality of first data elements with corresponding data elements from the plurality of third data elements when the respective mask element is active.
  15. The processor of claim 12, wherein the execution unit is configured to replace data elements from the plurality of second data elements with corresponding data elements from the plurality of first data elements when the respective mask element is active.
  16. The processor of claim 12, wherein the execution unit performs the vector compare-and-exchange operation as an atomic operation.
  17. The processor of claim 12, wherein, in response to execution of the vector compare-and-exchange operation, the execution unit is further configured to: set a flag if a match exists between each pair of corresponding data elements whose corresponding mask element is active; and clear the flag if no match exists.
  18. A system, comprising: a memory controller coupled to a first storage location configured to store a plurality of first data elements; and a processor coupled to the memory controller, the processor comprising: a register file configured to store a plurality of second data elements and a plurality of third data elements, each of the second and third data elements corresponding to one of the first data elements; a decoder configured to decode a single instruction that specifies a vector compare-and-exchange operation on the pluralities of first, second, and third data elements; and an execution unit coupled to the decoder to receive the decoded instruction and coupled to the first storage location and the register file to perform the vector compare-and-exchange operation; wherein, in response to execution of the vector compare-and-exchange operation, the execution unit is configured to: compare corresponding data elements from the pluralities of first and second data elements; in response to determining that a match exists, replace data elements from the plurality of first data elements with corresponding data elements from the plurality of third data elements; and, in response to determining that no match exists, replace data elements from the plurality of second data elements with corresponding data elements from the plurality of first data elements.
  19. The system of claim 18, wherein the single instruction further specifies a mask register for storing a plurality of mask elements corresponding to respective data elements of the plurality of first data elements.
  20. The system of claim 19, wherein the execution unit is configured to: compare pairs of corresponding data elements from the pluralities of first and second data elements when the respective mask element is active; set a flag if every comparison results in a match; and clear the flag if every comparison results in a mismatch.
  21. The system of claim 20, wherein the execution unit performs the vector compare-and-exchange operation as an atomic operation.
  22. A computer-readable medium having instructions stored thereon, the instructions being operable to cause a processor device to: decode a single instruction that specifies a vector compare-and-exchange operation on a plurality of data elements, each data element having a corresponding test element, replacement element, and mask element; compare a data element with the corresponding test element if the respective mask element is active; in response to determining that all comparisons indicate a match, set a flag and replace the compared data elements with the corresponding replacement elements; and, in response to determining that all comparisons indicate a mismatch, clear the flag and replace the compared test elements with the corresponding data elements.
CN 201010619577 2009-12-22 2010-12-21 Synchronizing SIMD vectors CN102103570B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US12/644529 2009-12-22
US12644529 US8996845B2 (en) 2009-12-22 2009-12-22 Vector compare-and-exchange operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201510434397 CN105094749A (en) 2009-12-22 2010-12-21 Synchronizing simd vectors

Publications (2)

Publication Number Publication Date
CN102103570A (en) 2011-06-22
CN102103570B (en) 2015-08-12

Family

ID=44152784

Family Applications (2)

Application Number Title Priority Date Filing Date
CN 201510434397 CN105094749A (en) 2009-12-22 2010-12-21 Synchronizing simd vectors
CN 201010619577 CN102103570B (en) 2009-12-22 2010-12-21 Synchronizing SIMD vectors

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN 201510434397 CN105094749A (en) 2009-12-22 2010-12-21 Synchronizing simd vectors

Country Status (7)

Country Link
US (1) US8996845B2 (en)
JP (2) JP5421458B2 (en)
KR (1) KR101461378B1 (en)
CN (2) CN105094749A (en)
DE (1) DE112010004963T5 (en)
GB (1) GB2488619B (en)
WO (1) WO2011087590A3 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309813A (en) * 2012-03-15 2013-09-18 国际商业机器公司 Data processing method and device
CN104011649A (en) * 2011-12-23 2014-08-27 英特尔公司 Apparatus and method for propagating conditionally evaluated values in simd/vector execution
CN104011616A (en) * 2011-12-23 2014-08-27 英特尔公司 Apparatus and method of improved permute instructions
CN104040487A (en) * 2011-12-23 2014-09-10 英特尔公司 Instruction for merging mask patterns
CN104169867A (en) * 2011-12-23 2014-11-26 英特尔公司 Systems, apparatuses, and methods for performing conversion of a mask register into a vector register
CN105051679A (en) * 2012-12-28 2015-11-11 英特尔公司 Functional unit having tree structure to support vector sorting algorithm and other algorithms
CN105359129A (en) * 2013-08-06 2016-02-24 英特尔公司 Methods, apparatus, instructions and logic to provide population count functionality for genome sequencing and alignment
US9588764B2 (en) 2011-12-23 2017-03-07 Intel Corporation Apparatus and method of improved extract instructions
US9619236B2 (en) 2011-12-23 2017-04-11 Intel Corporation Apparatus and method of improved insert instructions
US9632980B2 (en) 2011-12-23 2017-04-25 Intel Corporation Apparatus and method of mask permute instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8996845B2 (en) 2009-12-22 2015-03-31 Intel Corporation Vector compare-and-exchange operation
CN105955704A (en) 2011-11-30 2016-09-21 英特尔公司 Instruction and logic for providing vector horizontal comparison function
WO2013101229A1 (en) * 2011-12-30 2013-07-04 Intel Corporation Structure access processors, methods, systems, and instructions
WO2014137327A1 (en) * 2013-03-05 2014-09-12 Intel Corporation Analyzing potential benefits of vectorization
US9411593B2 (en) * 2013-03-15 2016-08-09 Intel Corporation Processors, methods, systems, and instructions to consolidate unmasked elements of operation masks
GB2520603B (en) * 2013-09-26 2016-04-06 Imagination Tech Ltd Atomic memory update unit and methods
US9466091B2 (en) 2013-09-26 2016-10-11 Imagination Technologies Limited Atomic memory update unit and methods
US9390023B2 (en) * 2013-10-03 2016-07-12 Cavium, Inc. Method and apparatus for conditional storing of data using a compare-and-swap based approach
US20160283237A1 (en) * 2015-03-27 2016-09-29 Ilan Pardo Instructions and logic to provide atomic range operations
WO2018022528A1 (en) * 2016-07-27 2018-02-01 Intel Corporation System and method for multiplexing vector compare
WO2018022525A1 (en) * 2016-07-27 2018-02-01 Intel Corporation System and method for multiplexing vector mask matches

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460121B1 (en) * 1998-09-14 2002-10-01 Compaq Information Technologies Group, L.P. Method for providing an atomic memory read using a compare-exchange instruction primitive
CN1633637A (en) * 2001-10-05 2005-06-29 英特尔公司 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions
CN1662904A (en) * 2002-06-26 2005-08-31 国际商业机器公司 Digital signal processor with cascaded SIMD organization
CN1790310A (en) * 2004-12-17 2006-06-21 英特尔公司 Evaluation unit for single instruction, multiple data execution engine flag registers
US20070143551A1 (en) * 2005-12-01 2007-06-21 Sony Computer Entertainment Inc. Cell processor atomic compare and swap using dedicated SPE
US20070260634A1 (en) * 2006-05-04 2007-11-08 Nokia Corporation Apparatus, system, method, and computer program product for synchronizing the presentation of media content

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4482956A (en) * 1982-11-04 1984-11-13 International Business Machines Corporation Parallel queueing method
JPS61288243A (en) * 1985-06-17 1986-12-18 Fujitsu Ltd Processing system for compare and swap instruction
JPS6285372A (en) 1985-10-09 1987-04-18 Nec Corp Comparing and swapping system in multi-processor system
US6880071B2 (en) 2001-04-09 2005-04-12 Sun Microsystems, Inc. Selective signalling of later reserve location memory fault in compound compare and swap
CN100545804C (en) * 2003-08-18 2009-09-30 上海海尔集成电路有限公司;海尔集团公司 Microprocessor frame based on CISC structure and instruction realizing style
US8607241B2 (en) * 2004-06-30 2013-12-10 Intel Corporation Compare and exchange operation using sleep-wakeup mechanism
US7627723B1 (en) 2006-09-21 2009-12-01 Nvidia Corporation Atomic memory operators in a parallel processor
US7908255B2 (en) * 2007-04-11 2011-03-15 Microsoft Corporation Transactional memory using buffered writes and enforced serialization order
US8996845B2 (en) 2009-12-22 2015-03-31 Intel Corporation Vector compare-and-exchange operation

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9632980B2 (en) 2011-12-23 2017-04-25 Intel Corporation Apparatus and method of mask permute instructions
CN104011649A (en) * 2011-12-23 2014-08-27 英特尔公司 Apparatus and method for propagating conditionally evaluated values in simd/vector execution
CN104011616A (en) * 2011-12-23 2014-08-27 英特尔公司 Apparatus and method of improved permute instructions
CN104040487A (en) * 2011-12-23 2014-09-10 英特尔公司 Instruction for merging mask patterns
CN104169867A (en) * 2011-12-23 2014-11-26 英特尔公司 Systems, apparatuses, and methods for performing conversion of a mask register into a vector register
CN104169867B (en) * 2011-12-23 2018-04-13 英特尔公司 Systems, apparatuses, and methods for performing conversion of a mask register into a vector register
US9798541B2 (en) 2011-12-23 2017-10-24 Intel Corporation Apparatus and method for propagating conditionally evaluated values in SIMD/vector execution using an input mask register
US9658850B2 (en) 2011-12-23 2017-05-23 Intel Corporation Apparatus and method of improved permute instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
US9588764B2 (en) 2011-12-23 2017-03-07 Intel Corporation Apparatus and method of improved extract instructions
US9619236B2 (en) 2011-12-23 2017-04-11 Intel Corporation Apparatus and method of improved insert instructions
US9575753B2 (en) 2012-03-15 2017-02-21 International Business Machines Corporation SIMD compare instruction using permute logic for distributed register files
CN103309813B (en) * 2012-03-15 2016-06-29 国际商业机器公司 Data processing method and apparatus
CN103309813A (en) * 2012-03-15 2013-09-18 国际商业机器公司 Data processing method and device
US9760373B2 (en) 2012-12-28 2017-09-12 Intel Corporation Functional unit having tree structure to support vector sorting algorithm and other algorithms
CN105051679A (en) * 2012-12-28 2015-11-11 英特尔公司 Functional unit having tree structure to support vector sorting algorithm and other algorithms
CN105359129A (en) * 2013-08-06 2016-02-24 英特尔公司 Methods, apparatus, instructions and logic to provide population count functionality for genome sequencing and alignment

Also Published As

Publication number Publication date Type
GB2488619B (en) 2017-10-18 grant
JP5421458B2 (en) 2014-02-19 grant
WO2011087590A2 (en) 2011-07-21 application
KR20120096588A (en) 2012-08-30 application
CN102103570B (en) 2015-08-12 grant
DE112010004963T5 (en) 2012-11-22 application
JP2012531682A (en) 2012-12-10 application
US20110153989A1 (en) 2011-06-23 application
GB201119083D0 (en) 2011-12-21 grant
WO2011087590A3 (en) 2011-10-27 application
US8996845B2 (en) 2015-03-31 grant
GB2488619A (en) 2012-09-05 application
JP2014059902A (en) 2014-04-03 application
JP5876458B2 (en) 2016-03-02 grant
CN105094749A (en) 2015-11-25 application
KR101461378B1 (en) 2014-11-20 grant

Similar Documents

Publication Publication Date Title
US7600097B1 (en) Detecting raw hazards in an object-addressed memory hierarchy by comparing an object identifier and offset for a load instruction to object identifiers and offsets in a store queue
US20100229043A1 (en) Hardware acceleration for a software transactional memory system
US20120144120A1 (en) Programmable atomic memory using hardware validation agent
US20080005504A1 (en) Global overflow method for virtualized transactional memory
US20090172317A1 (en) Mechanisms for strong atomicity in a transactional memory system
US20140059333A1 (en) Method, apparatus, and system for speculative abort control mechanisms
US20110252203A1 (en) Transaction based shared data operations in a multiprocessor environment
US20100107243A1 (en) Permissions checking for data processing instructions
US20110138126A1 (en) Atomic Commit Predicated on Consistency of Watches
US20130339673A1 (en) Intra-instructional transaction abort handling
US20070156994A1 (en) Unbounded transactional memory systems
US8180977B2 (en) Transactional memory in out-of-order processors
US20130339327A1 (en) Facilitating transaction completion subsequent to repeated aborts of the transaction
US20110208921A1 (en) Inverted default semantics for in-speculative-region memory accesses
US20130339703A1 (en) Restricting processing within a processor to facilitate transaction completion
US20060026371A1 (en) Method and apparatus for implementing memory order models with order vectors
US20110202748A1 (en) Load pair disjoint facility and instruction therefore
US20020087925A1 (en) Computer processor read/alter/rewrite optimization cache invalidate signals
US20110153960A1 (en) Transactional memory in out-of-order processors with xabort having immediate argument
US20080115042A1 (en) Critical section detection and prediction mechanism for hardware lock elision
US7254678B2 (en) Enhanced STCX design to improve subsequent load efficiency
US20100169580A1 (en) Memory model for hardware attributes within a transactional memory system
US20130339615A1 (en) Managing transactional and non-transactional store observability
US20100106872A1 (en) Data processor for processing a decorated storage notify
US20100169894A1 (en) Registering a user-handler in hardware for transactional memory event handling

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model