US20110066813A1 - Method And System For Local Data Sharing - Google Patents

Method And System For Local Data Sharing

Info

Publication number
US20110066813A1
Authority
US
United States
Prior art keywords
memory
address
processor
addresses
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/877,587
Other versions
US8478946B2 (en)
Inventor
Michael Mantor
Michael Mang
Karl Mann
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Micro Devices Inc
Original Assignee
Advanced Micro Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Micro Devices Inc filed Critical Advanced Micro Devices Inc
Priority to US12/877,587 priority Critical patent/US8478946B2/en
Assigned to ADVANCED MICRO DEVICES, INC. reassignment ADVANCED MICRO DEVICES, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MANN, KARL, MANG, MICHAEL, MANTOR, MICHAEL
Publication of US20110066813A1 publication Critical patent/US20110066813A1/en
Assigned to ADVANCED MICRO DEVICES, INC. reassignment ADVANCED MICRO DEVICES, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MANN, KARL, MANG, MICHAEL, MANTOR, MICHAEL
Application granted granted Critical
Publication of US8478946B2 publication Critical patent/US8478946B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3824: Operand accessing
    • G06F 9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3851: Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3888: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/06: Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Multi Processors (AREA)

Abstract

Embodiments for a local data share (LDS) unit are described herein. Embodiments include a co-operative set of threads to load data into shared memory so that the threads can have repeated memory access allowing higher memory bandwidth. In this way, data can be shared between related threads in a cooperative manner by providing a re-use of a locality of data from shared registers. Furthermore, embodiments of the invention allow a cooperative set of threads to fetch data in a partitioned manner so that it is only fetched once into a shared memory that can be repeatedly accessed via a separate low latency path.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This patent application claims the benefit of U.S. Provisional Patent Application No. 61/240,475 (Attorney Docket No. 1972.1040000), filed Sep. 8, 2009, entitled “Method and System for Local Data Sharing,” which is incorporated herein by reference in its entirety.
  • BACKGROUND
  • 1. Field of the Invention
  • The present invention relates generally to sharing of data in data processing units.
  • 2. Background Art
  • Although some processors may have shared memory capabilities, they do not provide an architecture that allows the number of banks to be easily changed. Rather, the entire architecture of these existing products would need to be revised in order to change the number of memory banks. Additionally, these existing products do not have conflict resolution, full accessibility (addressability), or atomics.
  • What is needed, therefore, is a flexible shared memory architecture that allows a designer to trade off performance versus cost without changing the architecture of the shared memory.
  • BRIEF SUMMARY OF EMBODIMENTS OF THE INVENTION
  • An embodiment includes a local data share (LDS) unit that allows a plurality of threads to share data. Embodiments include a co-operative set of threads that load data into shared memory so that the threads can have repeated memory access, allowing higher memory bandwidth.
  • In this way, data can be shared between related threads in a cooperative manner to realize increased performance and a reduction in required power for some jobs. This technique of data sharing will also enable a new class of algorithms that can be processed on the processor by providing re-use of a locality of data from shared registers. Furthermore, embodiments of the present invention allow a cooperative set of threads to fetch data in a partitioned manner so that it is fetched only once into a shared memory. The shared memory can then be repeatedly accessed via a separate low-latency path.
  • Embodiments of the present invention can be used in any computer system, computing device, entertainment system, media system, game systems, communication device, personal digital assistant, or any system using one or more processors.
  • Embodiments of the present invention, for example, may be used in processing systems having multi-core central processing units (CPUs), GPUs, and/or general purpose GPUs (GPGPUs), along with other types of processors, because code developed for one type of processor may be deployed on another type of processor with little or no additional effort. For example, code developed for execution on a GPU, also known as GPU kernels, can be deployed to be executed on a CPU using embodiments of the present invention.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the general description given above and the detailed description of the embodiment given below, serve to explain the principles of the present invention. In the drawings:
  • FIG. 1A is a block diagram illustration of an exemplary local data share (LDS) unit constructed in accordance with embodiments of present invention;
  • FIG. 1B is an illustration of an exemplary local memory;
  • FIG. 1C is an exemplary basic data flow arrangement for a read, read/modify/write or write operations;
  • FIG. 1D is a flowchart of an exemplary method of operation of the local memory illustrated in FIG. 1B in accordance with embodiments of the present invention;
  • FIG. 2A is an illustration of an exemplary input queue constructed in accordance with embodiments of the present invention;
  • FIG. 2B is an illustration of an exemplary output queue constructed in accordance with embodiments of the present invention;
  • FIG. 3A is an illustration of an exemplary atomic logic and bypass unit constructed in accordance with embodiments of the present invention.
  • FIG. 3B is a flowchart of an exemplary method of operating the atomic logic and bypass unit illustrated in FIG. 3A; and
  • FIG. 4 is a flowchart of an exemplary method of operating a direct read address module in accordance with embodiments of the present invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
  • An embodiment of the present invention includes a local data share (LDS) unit that allows a plurality of threads to share data. Embodiments include a co-operative set of threads that load data into shared memory so that the threads can have repeated memory access, allowing higher memory bandwidth.
  • The LDS is configurable and can have any number of GPU or CPU threads being processed in parallel. By way of example, the present invention allows processing of a plurality of threads (e.g. 32 threads) in parallel. Included is a conflict state machine that analyzes memory addresses for each of the plurality of threads. The conflict state machine may then check the lower bits (or any other bit groups) of each of the plurality of addresses to determine which bank of memory each address maps to. The conflict state machine subsequently schedules access to one or more banks of memory. In this way, data can be shared between related threads in a cooperative manner to realize increased performance.
  • System
  • FIG. 1A is an illustration of an exemplary local data share unit (LDS) 100, according to an embodiment of the invention. As shown in FIG. 1A, LDS 100 includes address generation unit 102, output queue 104, atomic logic and bypass unit 106, input queue 108, and conflict state machine 110. LDS 100 also includes direct read address module 112, multiplexers 182-186 and local memory 120. (It is to be appreciated that the structure illustrated in FIG. 1A is for the purposes of illustration and not limitation.)
  • Local Memory 120
  • In an embodiment, not intended to limit the invention, local memory 120 is 32 kilobytes in size and can be constructed from 32 memories, each 256 deep by 32 bits wide, with one write and one read port. FIG. 1B is an illustration of an exemplary local memory 120, according to an embodiment of the invention. As illustrated in FIG. 1B, local memory 120 can have an interleaved bank address such that the lower 5 bits of a 'DWORD' address will be bank select bits and the upper 8 bits will be an address within the bank. Local memory 120 allows all banks to be read, written, or both in one clock.
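  • As a purely illustrative aid (not part of the patent text), the following C++ sketch models the interleaved bank addressing described above, assuming the 32-bank, 32-kilobyte configuration with 5 bank-select bits and 8 in-bank address bits; the type and function names are hypothetical:

```cpp
#include <cstdint>

// Decompose a DWORD (32-bit word) address for a 32-bank local memory:
// the lower 5 bits select one of 32 banks, the upper 8 bits select one
// of 256 rows within the bank (32 banks x 256 rows x 4 bytes = 32 KB).
struct BankAddress {
    uint32_t bank;  // bits [4:0], 0..31
    uint32_t row;   // bits [12:5], 0..255
};

inline BankAddress decodeDwordAddress(uint32_t dwordAddr) {
    return { dwordAddr & 0x1Fu, (dwordAddr >> 5) & 0xFFu };
}
```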
  • As an example, local memory 120 can enable up to 32 DWORD read and write ports accessible per clock when no bank conflicts exist. Since there will likely be latency associated with reading and writing, LDS 100 can prevent conflicts during the exposed latency of read/write. The basic structure of local memory 120 includes 32 banks that each pick up to one unique address to service. On a read only operation, multiple addresses that are the same can have read data broadcasted, but otherwise the same address may be serialized through atomic logic and bypass unit 106 by conflict state machine 110. The operation of atomic logic and bypass unit 106 is described in additional detail below.
  • FIG. 1C illustrates an exemplary basic data flow arrangement for read, read/modify/write, or write operations. As shown in FIG. 1C, each bank of local memory 120 detects the first address that is valid for its bank and pipes the word address to the read and/or write memory. Both the read data (if read enabled) and the address of the selected thread (pixel) are forwarded to atomic logic and bypass unit 106 for conflict and/or broadcast determination. Thus, if the operation is a read and the word address matches, the request will be serviced; if the word address does not match, the request will not be serviced and conflict state machine 110 will be notified.
  • FIG. 1D is a flowchart of an exemplary method 190 illustrating operation of local memory 120, according to an embodiment of the invention. In step 192, local memory 120 detects a first address that is valid for its bank and forwards the word address to read and/or write memory. In step 194, local memory 120 forwards the data read and the address with a thread (pixel) to the atomic logic and bypass 106 for conflict and/or broadcast determination.
  • In an embodiment, memory space will need to be allocated per set of cooperative threads. A set of cooperative threads will be referenced as a thread group, and can include up to 1024 threads (per shader) and be machine independent. A subset of threads in a thread group will form a wavefront and that size will be dependent on the width of a single instruction-multiple data unit (SIMD). A thread block can include a partial wavefront, one wavefront, or multiple wavefronts.
  • In the present invention, the shared memory base address will be available to both instruction based addressing and shader math to enable desired addressing modes. Each and every thread of a thread group will have both read and write access to any location in the allocated space up to 32 kilo-bytes. Individual wavefronts or thread groups that do not require any shared memory can co-exist on a SIMD with thread blocks that use all the shared memory.
  • In an embodiment, not intended to limit the invention, write clients of LDS 100 include:
  • 1. Store operations of ALU results
  • 2. Input attribute data
  • 3. Atomic read/write operations
  • 4. Direct texture return data
  • In an embodiment, not intended to limit the invention, read clients of LDS 100 include:
  • 1. Direct reads for ALU Instructions
  • 2. Index ALU load Operations
  • 3. Atomic read/write operations
  • Input Queue 108
  • In an embodiment, input queue 108 stores the received data so that a plurality of adjacent threads or pixels can perform an operation together, and so that bank conflicts are minimized due to the adjacency of addresses for cooperative threads. In an embodiment, input queue 108 provides enough storage to hide both the latency of acquiring GPU or CPU data and the local data share look-up for read data.
  • FIG. 2A illustrates an exemplary input queue 108, according to an embodiment of the invention.
  • As shown in FIG. 2A, one indexed request can be stored in input queue 108. The data is stored in a manner so that it can be accessed in one clock for cooperative operations. For example, each thread will have storage for one A, B, and C operand. Each opcode specifies the use of A, B, and C.
  • In an embodiment, input queue 108 is partitioned for 16 pixels to enable writing for each of 4 pixels independently. LDS 100 may process an index command once enough data has been received by input queue 108.
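  • A hypothetical C++ sketch of this storage layout follows; the structure and names are illustrative assumptions, not taken from the patent, and assume the 16-pixel entry is partitioned into four independently writable 4-pixel groups:

```cpp
#include <cstdint>

// One A, B, C operand per thread, as specified per opcode.
struct Operands {
    uint32_t a, b, c;
};

// One indexed request held by the input queue: 16 pixels, partitioned so
// each group of 4 pixels can be written independently, and the whole entry
// can be read in one clock for cooperative operations.
struct InputQueueEntry {
    Operands pixelGroup[4][4];  // 4 groups x 4 pixels = 16 pixels
};
```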
  • Output Queue 104
  • In an embodiment, output queue 104 will hold the results of an operation until the subsequent instruction reads the data corresponding to each indexed operation.
  • FIG. 2B illustrates an exemplary output queue 104, according to an embodiment of the invention. In an embodiment, output queue 104 will accept data read from local memory 120. For example, output queue 104 can accept 32 DWORDS per clock.
  • Atomic Logic and Bypass Unit 106
  • FIG. 3A illustrates a diagram of atomic logic and bypass unit 106 according to an embodiment of the invention. Atomic logic and bypass unit 106 provides a set of operations in a sequencer to LDS 100.
  • In an embodiment, atomic logic and bypass unit 106 includes a read-modify-write path. In an embodiment, there is a plurality of atomic modules (e.g., thirty-two, one per lane) that can each accomplish an atomic operation.
  • Atomic logic and bypass unit 106 reads a memory location from local memory 120 and takes the data that came with the address to perform a compare-and-replace or an atomic add. An atomic add, for example, means that no other access to that memory address can happen during the atomic operation. Thus, atomic logic and bypass unit 106 takes the data at an address, modifies it, and stores it back at that address before any other processor can access that same address.
  • As a purely illustrative example, if 32 lanes of data from local memory 120 were received and all of them have the same destination address, then these operations would be completely serialized by atomic logic and bypass unit 106. Thus, atomic logic and bypass unit 106 would read the data from that address, perform the first atomic operation, write it back, return the pre-op value, and then service the second request or operation.
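  • The following C++ sketch is a software model of this serialization, not the hardware itself: same-address atomic adds are applied one lane at a time, and each lane receives the pre-operation value of the location (names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Serialize atomic adds from several lanes to one shared-memory address.
// Each lane's operation reads the current value, writes back the sum, and
// returns the pre-op value, with no other access interleaved.
std::vector<uint32_t> serializeAtomicAdds(std::vector<uint32_t>& memory,
                                          uint32_t addr,
                                          const std::vector<uint32_t>& operands) {
    std::vector<uint32_t> preOpValues;
    for (uint32_t operand : operands) {   // one lane at a time
        uint32_t preOp = memory[addr];    // read
        memory[addr] = preOp + operand;   // modify and write back
        preOpValues.push_back(preOp);     // pre-op value returned to the lane
    }
    return preOpValues;
}
```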
  • FIG. 3B is flowchart 320 illustrating an exemplary operation of atomic logic and bypass unit 106, according to an embodiment of the invention. In step 322, atomic logic and bypass unit 106 reads a memory location from local memory 120 and receives data associated with the address. In step 324, atomic logic and bypass unit 106 performs a compare and replace operation or performs an atomic add operation.
  • One mode of operation of atomic logic and bypass unit 106 is 'direct read', in which the unit has the ability to read local memory 120 directly, bypassing input queue 108 and output queue 104. In this mode, the memory address is passed to local memory 120 directly and the data is read bypassing output queue 104.
  • Another embodiment, called the 'interpolation read' mode, includes performing a read operation in which LDS 100's data arriving at multiplexer 182 becomes the return data. If a write operation is being performed, LDS 100's data can be selected by multiplexer 182 and sent back to the LDS 100 location.
  • Address Generation Unit 102
  • As an illustrative example of the operation of address generation unit 102, consider that 32 addresses are received from 32 lanes on the input data. Part of this data includes command information and part of it is input data. The command portion can have an offset that is common to all the address indices. Thus, when 32 lanes of data are received and part of the command data is a modifier to the address that came with each lane, address generation unit 102 modifies the 32 addresses into offset addresses. In this way, when the addresses are sent to local memory 120, indexed operations can be performed without re-calculating base addresses. In another embodiment, a direct write from another processor is also allowed. In this mode, inputs from a shader processor (not shown) may be stalled, and the shader processor provides an address from which the dependent addresses are determined. This also allows LDS 100 to write multiple lanes of data into local memory 120.
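  • A minimal C++ sketch of this offsetting step, assuming 32 lanes and a single command-level offset (the function and parameter names are hypothetical):

```cpp
#include <array>
#include <cstdint>

// Fold a command-level offset, common to all address indices, into each of
// the 32 per-lane addresses, so indexed operations against local memory can
// proceed without re-calculating base addresses.
std::array<uint32_t, 32> applyCommonOffset(std::array<uint32_t, 32> laneAddress,
                                           uint32_t commandOffset) {
    for (uint32_t& addr : laneAddress)
        addr += commandOffset;
    return laneAddress;
}
```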
  • Direct Read Address Module 112
  • In an embodiment, the direct read address module may be used in the direct read mode and the interpolation mode described above. Direct read address module 112 receives a start address, a base address, and a stride value. The direct read address module then uses the stride to find relative read addresses. In this way, a direct read is a compressed request that has a base address, a stride, and a number of bit-masks. In the direct read mode, for example, requests are serviced in one clock, so there is no means to provide any kind of stalling. However, it is to be appreciated that addresses can be requested with strides chosen so as not to generate any memory bank conflicts. In an embodiment, interpolation read mode logic is included in direct read address module 112. Direct read address module 112 derives addresses for the different pixels of a wavefront for interpolation data. In this way, in accordance with the interpolation process and the organization of shared memory, there are no bank conflicts or collisions during interpolation direct reads.
  • FIG. 4 is a flowchart 420 illustrating an exemplary operation of direct read address module 112. In step 422, direct read address module 112 receives a start address, a base address, and a stride value. In step 424, direct read address module 112 uses the stride to find relative read addresses.
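  • A C++ sketch of this expansion, assuming a 32-lane request (names are illustrative; the bit-mask handling mentioned above is omitted). Note that with 32 banks selected by the low 5 address bits, any odd stride maps 32 consecutive lane addresses onto 32 distinct banks, which is one way such requests can avoid bank conflicts:

```cpp
#include <array>
#include <cstdint>

// Expand a compressed direct-read request (base address plus stride) into
// one relative read address per lane. An odd stride guarantees the 32 lane
// addresses fall in 32 distinct banks (bank = address mod 32).
std::array<uint32_t, 32> expandDirectRead(uint32_t baseAddr, uint32_t stride) {
    std::array<uint32_t, 32> laneAddr{};
    for (uint32_t lane = 0; lane < 32; ++lane)
        laneAddr[lane] = baseAddr + lane * stride;
    return laneAddr;
}
```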
  • Conflict State Machine 110
  • As discussed above, embodiments of the invention include conflict state machine 110 to schedule work sent down to memory and atomic blocks based on memory accesses to avoid bank conflicts. In an embodiment, conflict state machine 110 analyzes memory addresses for each of the plurality of threads. Conflict state machine 110 may then check the lower bits (or any other bit groups) of each of the plurality of addresses to determine which bank of memory each address maps to. Conflict state machine 110 subsequently schedules access to one or more banks of memory. In this way, data can be shared between related threads in a cooperative manner to realize increased performance.
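  • The following C++ model is illustrative only, a sketch of the scheduling idea rather than the hardware: in each pass, every bank services at most one pending lane, and lanes that lose arbitration are deferred to a later pass (broadcast of identical read addresses is omitted for brevity):

```cpp
#include <array>
#include <cstdint>
#include <utility>
#include <vector>

// Group 32 lane addresses into conflict-free passes. Per pass, each of the
// 32 banks (selected by the low 5 address bits) services at most one lane.
std::vector<std::vector<int>> scheduleBankAccesses(
        const std::array<uint32_t, 32>& laneAddr) {
    std::vector<int> pending;
    for (int lane = 0; lane < 32; ++lane)
        pending.push_back(lane);

    std::vector<std::vector<int>> passes;
    while (!pending.empty()) {
        std::array<bool, 32> bankBusy{};
        std::vector<int> granted, deferred;
        for (int lane : pending) {
            uint32_t bank = laneAddr[lane] & 0x1Fu;  // bank select bits
            if (!bankBusy[bank]) {
                bankBusy[bank] = true;
                granted.push_back(lane);
            } else {
                deferred.push_back(lane);  // bank conflict: retry next pass
            }
        }
        passes.push_back(std::move(granted));
        pending = std::move(deferred);
    }
    return passes;
}
```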
  • Configuring Coupled Processors/ALUs
  • In an embodiment, LDS 100 is configurable and can, for example, interface with 64 processors or 32 processors. The shared memory width of LDS 100 is independent of the width of the pipe that is providing requests to LDS 100. Thus, by adjusting the width of the input and output units, input queue 108 can receive data at the width of the attached computational unit. This width can differ from the width of the shared memory. As a purely illustrative example, if 64 processors are interfaced with LDS 100 over 4 clocks of 16-wide data, LDS 100 operates in a manner that takes 2 of the 16-wide transfers and couples them into one clock of operation against local memory 120.
  • In this way, LDS 100 can be configured either by the width of the machine that is attached to LDS 100 or the width of the shared memory that is applied.
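  • As a hypothetical illustration of this coupling (assuming the 16-wide transfers are 16 lanes of DWORD data; the function name is not from the patent), two consecutive 16-lane transfers can be packed into one 32-wide memory operation:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Couple two consecutive 16-lane input transfers into a single 32-wide
// operation against local memory, as when a 64-lane machine delivers data
// 16 lanes per clock over 4 clocks.
std::array<uint32_t, 32> coupleTransfers(const std::array<uint32_t, 16>& first,
                                         const std::array<uint32_t, 16>& second) {
    std::array<uint32_t, 32> coupled{};
    std::copy(first.begin(), first.end(), coupled.begin());
    std::copy(second.begin(), second.end(), coupled.begin() + 16);
    return coupled;
}
```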
  • Various aspects of the present invention can be implemented by software, firmware, hardware (or hardware represented by software such as, for example, Verilog or hardware description language instructions), or a combination thereof. FIG. 1 is an illustration of an example computer system in which the present invention, or portions thereof, can be implemented as computer-readable code. It should be noted that the simulation, synthesis and/or manufacture of the various embodiments of this invention may be accomplished, in part, through the use of computer readable code, including general programming languages (such as C or C++), hardware description languages (HDL) such as, for example, Verilog HDL, VHDL, Altera HDL (AHDL), or other available programming and/or schematic capture tools (such as circuit capture tools). This computer readable code can be disposed in any known computer usable medium including a semiconductor, magnetic disk, optical disk (such as CDROM, DVD-ROM) and as a computer data signal embodied in a computer usable (e.g., readable) transmission medium (such as a carrier wave or any other medium such as, for example, digital, optical, or analog-based medium). As such, the code can be transmitted over communication networks including the Internet and internets. It is understood that the functions accomplished and/or structure provided by the systems and techniques described above can be represented in a core (such as a GPU/CPU core) that is embodied in program code and may be transformed to hardware as part of the production of integrated circuits.
  • CONCLUSION
  • The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventor(s), and thus, are not intended to limit the present invention and the appended claims in any way.
  • The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
  • The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
  • The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (20)

What is claimed is:
1. A device, comprising:
an input buffer coupled to a conflict state machine; and
an output buffer;
wherein a number of ALUs coupled to a local data share unit is configurable based on a width of the input buffer and the output buffer.
2. The system of claim 1, further comprising:
an address generation unit to generate a plurality of memory addresses;
an atomic logic and bypass unit associated with the address generation unit to perform atomic operations; and
a conflict state machine associated with the input queue to resolve memory addressing conflicts based on the memory addresses.
3. The system of claim 2, further comprising:
a direct read address module to determine a relative address based on a stride value; and a local memory.
4. The system of claim 3, further comprising:
one or more multiplexers associated with the local memory and the atomic logic and bypass unit to control data provided to the local memory.
5. A device, comprising:
an input buffer;
an output buffer; and
shared memory banks coupled to the input and output buffers, wherein a number of the shared memory banks is configurable to control the cost of providing a shared memory pool.
6. A method for local data sharing between a plurality of threads, comprising:
analyzing memory addresses for each of the plurality of threads;
checking one or more bits of each of the memory addresses to determine mapping to one or more memory banks; and
scheduling access to the memory banks.
7. The method of claim 6, further comprising:
determining mapping to the memory banks by comparing lower order bits of the addresses.
8. The method of claim 6, further comprising:
modifying the memory addresses to offset addresses.
9. The method of claim 8, further comprising:
receiving a start address, a base address and a stride value; and
computing a relative read address using the stride value.
10. The method of claim 6, wherein the analyzing step comprises:
determining a portion having an offset common to all address indices.
11. The method of claim 6, further comprising:
reading the one or more memory banks in a single cycle.
12. The method of claim 6, further comprising:
stalling input from a current shader processor to receive a memory address from another shader processor.
13. A computer-readable medium that stores instructions adapted to be executed by a processor to:
analyze memory addresses for each of the plurality of threads;
check one or more bits of each of the memory addresses to determine mapping to one or more memory banks; and
schedule access to the memory banks.
14. The computer readable medium of claim 13, further comprising instructions adapted to be executed by a processor to:
determine mapping to the memory banks by comparing lower order bits of the addresses.
15. The computer readable medium of claim 13, further comprising instructions adapted to be executed by a processor to:
modify the memory addresses to offset addresses.
16. The computer readable medium of claim 14, further comprising instructions adapted to be executed by a processor to:
receive a start address, a base address and a stride value; and
compute a relative read address using the stride value.
17. The computer readable medium of claim 13, further comprising instructions adapted to be executed by a processor to:
determine a portion having an offset common to all address indices.
18. The computer readable medium of claim 13, further comprising instructions adapted to be executed by a processor to:
read the one or more memory banks in a single cycle.
19. The computer readable medium of claim 13, further comprising instructions adapted to be executed by a processor to:
stall input from a current shader processor to receive a memory address from another shader processor.
20. The computer readable medium of claim 13, wherein the instructions stored and executed by the processor are adapted to manufacture an apparatus configured to perform said analyzing, said checking and said scheduling.
US12/877,587 2009-09-08 2010-09-08 Method and system for local data sharing Active 2031-08-16 US8478946B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/877,587 US8478946B2 (en) 2009-09-08 2010-09-08 Method and system for local data sharing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US24047509P 2009-09-08 2009-09-08
US12/877,587 US8478946B2 (en) 2009-09-08 2010-09-08 Method and system for local data sharing

Publications (2)

Publication Number Publication Date
US20110066813A1 true US20110066813A1 (en) 2011-03-17
US8478946B2 US8478946B2 (en) 2013-07-02

Family

ID=43731601

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/877,587 Active 2031-08-16 US8478946B2 (en) 2009-09-08 2010-09-08 Method and system for local data sharing

Country Status (1)

Country Link
US (1) US8478946B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140115301A1 (en) * 2012-10-23 2014-04-24 Analog Devices Technology Processor architecture and method for simplifying programming single instruction, multiple data within a register
US20190171274A1 (en) * 2012-08-31 2019-06-06 Intel Corporation Configuring Power Management Functionality In A Processor
CN111899149A (en) * 2020-07-09 2020-11-06 浙江大华技术股份有限公司 Image processing method and device based on operator fusion and storage medium
CN112732461A (en) * 2021-01-06 2021-04-30 浙江智慧视频安防创新中心有限公司 Inter-algorithm data transmission method and device in system
US20220197824A1 (en) * 2020-12-15 2022-06-23 Xsight Labs Ltd. Elastic resource management in a network switch
US12007909B2 (en) * 2021-12-15 2024-06-11 Xsight Labs Ltd. Elastic resource management in a network switch

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156946A1 (en) * 2005-12-29 2007-07-05 Intel Corporation Memory controller with bank sorting and scheduling
US20080140994A1 (en) * 2006-10-06 2008-06-12 Brucek Khailany Data-Parallel processing unit
US20080320476A1 (en) * 2007-06-25 2008-12-25 Sonics, Inc. Various methods and apparatus to support outstanding requests to multiple targets while maintaining transaction ordering
US7657724B1 (en) * 2006-12-13 2010-02-02 Intel Corporation Addressing device resources in variable page size environments
US7805589B2 (en) * 2006-08-31 2010-09-28 Qualcomm Incorporated Relative address generation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156946A1 (en) * 2005-12-29 2007-07-05 Intel Corporation Memory controller with bank sorting and scheduling
US7805589B2 (en) * 2006-08-31 2010-09-28 Qualcomm Incorporated Relative address generation
US20080140994A1 (en) * 2006-10-06 2008-06-12 Brucek Khailany Data-Parallel processing unit
US7657724B1 (en) * 2006-12-13 2010-02-02 Intel Corporation Addressing device resources in variable page size environments
US20080320476A1 (en) * 2007-06-25 2008-12-25 Sonics, Inc. Various methods and apparatus to support outstanding requests to multiple targets while maintaining transaction ordering

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190171274A1 (en) * 2012-08-31 2019-06-06 Intel Corporation Configuring Power Management Functionality In A Processor
US20190317585A1 (en) * 2012-08-31 2019-10-17 Intel Corporation Configuring Power Management Functionality In A Processor
US10877549B2 (en) * 2012-08-31 2020-12-29 Intel Corporation Configuring power management functionality in a processor
US11237614B2 (en) * 2012-08-31 2022-02-01 Intel Corporation Multicore processor with a control register storing an indicator that two or more cores are to operate at independent performance states
US20140115301A1 (en) * 2012-10-23 2014-04-24 Analog Devices Technology Processor architecture and method for simplifying programming single instruction, multiple data within a register
CN103777924A (en) * 2012-10-23 2014-05-07 亚德诺半导体技术公司 Processor architecture and method for simplifying programmable single instruction, multiple data within a register
US9557993B2 (en) * 2012-10-23 2017-01-31 Analog Devices Global Processor architecture and method for simplifying programming single instruction, multiple data within a register
CN111899149A (en) * 2020-07-09 2020-11-06 浙江大华技术股份有限公司 Image processing method and device based on operator fusion and storage medium
US20220197824A1 (en) * 2020-12-15 2022-06-23 Xsight Labs Ltd. Elastic resource management in a network switch
CN112732461A (en) * 2021-01-06 2021-04-30 浙江智慧视频安防创新中心有限公司 Inter-algorithm data transmission method and device in system
US12007909B2 (en) * 2021-12-15 2024-06-11 Xsight Labs Ltd. Elastic resource management in a network switch

Also Published As

Publication number Publication date
US8478946B2 (en) 2013-07-02

Similar Documents

Publication Publication Date Title
US10120728B2 (en) Graphical processing unit (GPU) implementing a plurality of virtual GPUs
US9965392B2 (en) Managing coherent memory between an accelerated processing device and a central processing unit
US7634621B1 (en) Register file allocation
US11379713B2 (en) Neural network processing
US8619087B2 (en) Inter-shader attribute buffer optimization
US9798543B2 (en) Fast mapping table register file allocation algorithm for SIMT processors
US8539130B2 (en) Virtual channels for effective packet transfer
TW201337751A (en) System and method for performing shaped memory access operations
US10761851B2 (en) Memory apparatus and method for controlling the same
US20120079200A1 (en) Unified streaming multiprocessor memory
US8180998B1 (en) System of lanes of processing units receiving instructions via shared memory units for data-parallel or task-parallel operations
US8195858B1 (en) Managing conflicts on shared L2 bus
CN101561754B (en) Partition-free multi-slot memory system architecture
US8478946B2 (en) Method and system for local data sharing
US20120198458A1 (en) Methods and Systems for Synchronous Operation of a Processing Device
US20100088489A1 (en) data transfer network and control apparatus for a system with an array of processing elements each either self-or common controlled
US10990445B2 (en) Hardware resource allocation system for allocating resources to threads
US9330432B2 (en) Queuing system for register file access
US8321618B1 (en) Managing conflicts on shared L2 bus
US9286256B2 (en) Sharing data crossbar for reads and writes in a data cache
US8570916B1 (en) Just in time distributed transaction crediting
US11093276B2 (en) System and method for batch accessing
US11755331B2 (en) Writeback hazard elimination using a plurality of temporary result-storage elements
US10620958B1 (en) Crossbar between clients and a cache
US10452401B2 (en) Hints for shared store pipeline and multi-rate targets

Legal Events

Date Code Title Description
AS Assignment

Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MANTOR, MICHAEL;MANG, MICHAEL;MANN, KARL;SIGNING DATES FROM 20100913 TO 20100916;REEL/FRAME:025149/0089

AS Assignment

Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MANTOR, MICHAEL;MANG, MICHAEL;MANN, KARL;SIGNING DATES FROM 20100913 TO 20100916;REEL/FRAME:026019/0921

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8