CN113568851A - Method for accessing a memory and corresponding circuit - Google Patents

Method for accessing a memory and corresponding circuit

Info

Publication number
CN113568851A
Authority
CN
China
Prior art keywords
memory
data
local
burst
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110461211.5A
Other languages
Chinese (zh)
Inventor
G. Borgonovo
L. Re Fiorentin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
STMicroelectronics SRL
Original Assignee
STMicroelectronics SRL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from IT102020000009364A external-priority patent/IT202000009364A1/en
Application filed by STMicroelectronics SRL filed Critical STMicroelectronics SRL
Publication of CN113568851A publication Critical patent/CN113568851A/en
Pending legal-status Critical Current

Classifications

    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F15/7814 Specially adapted for real time processing, e.g. comprising hardware timers
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G06F15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)

Abstract

Methods of accessing a memory and corresponding circuits are disclosed. An embodiment method of accessing a memory to read and/or write data includes generating a memory transaction request comprising a burst of memory access requests for a set of memory locations in the memory, the memory locations having respective memory addresses. The method also includes sending, via an interconnect bus, first and second signals to a memory controller circuit coupled to the memory, the first signal conveying the memory transaction request and the second signal conveying information for mapping the burst of memory access requests onto the respective memory addresses of the memory locations in the memory. The method further includes calculating, based on the information conveyed by the second signal, the respective memory addresses of the memory locations, and accessing the memory locations to read data from and/or write data to them.

Description

Method for accessing a memory and corresponding circuit
Cross Reference to Related Applications
This application claims the benefit of Italian Application No. 102020000009364, filed on April 29, 2020, which application is hereby incorporated herein by reference.
Technical Field
This specification relates to digital signal processing circuits, such as hardware accelerators, and related methods, apparatus and systems.
Background
Various real-time digital signal processing systems (e.g., for processing video and/or image data, radar data, or wireless communication data, as increasingly required in the automotive field) may involve processing a significant amount of data per unit of time.
In this regard, various digital signal processors (e.g., coprocessors for computing algorithms such as Fast Fourier Transforms (FFTs), beamforming, Finite Impulse Response (FIR) filters, neural networks, etc.) are known in the art. Among these, pipeline architectures and memory-based architectures are two known solutions.
To efficiently handle resource demanding processing (e.g., computation of FFT algorithms over large data sets and/or different sizes), a memory-based architecture may be preferred.
However, digital signal processors known in the art may not provide a memory access scheme suitable for efficient computation of certain algorithms.
Disclosure of Invention
It is an object of one or more embodiments to provide a method of accessing memory in a digital signal processor that addresses the above disadvantages.
One or more embodiments may be directed to providing a communication bus controller (e.g., for an Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) bus) suitable for high-performance digital signal processing applications. This may be achieved by extending the allowed incremental/wrapping burst transactions and by specifying the memory bank access scheme to be used via an optional user-defined signal.
Such object may be achieved, according to one or more embodiments, by a method having the features set forth in the appended claims.
One or more embodiments may relate to a corresponding circuit.
The claims are an integral part of the technical teaching provided herein with respect to the examples.
In accordance with one or more embodiments, a method of accessing a memory to read and/or write data is provided. The method may include generating a memory transaction request comprising a burst of memory access requests for a set of memory locations in the memory, wherein the memory locations have respective memory addresses. The method may include sending first and second signals to a memory controller circuit coupled to the memory via an interconnect bus, the first signal conveying the memory transaction request and the second signal conveying information for mapping the burst of memory access requests onto the respective memory addresses of the memory locations in the memory. The method may further include calculating, based on the information conveyed by the second signal, the respective memory addresses of the memory locations, and accessing the memory locations to read data from and/or write data to them.
Thus, one or more embodiments may advantageously provide the possibility to group different single memory accesses on a bus into a single burst transaction and/or to encode the bank access scheme to be used within a transaction burst. One or more embodiments may be compatible with existing bus standards (e.g., AXI4, AXI 3).
Drawings
One or more embodiments will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is an exemplary circuit block diagram of an electronic system, such as a system on a chip, in accordance with one or more embodiments; and
FIG. 2 is an exemplary data flow diagram of a radix-2, 16-point Fast Fourier Transform (FFT) algorithm.
Detailed Description
In the following description, one or more specific details are set forth in order to provide a thorough understanding of examples of embodiments of the present description. Embodiments may be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail so as not to obscure certain aspects of the embodiments.
Reference to "an embodiment" or "one embodiment" within the framework of the specification is intended to indicate that a particular configuration, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, phrases such as "in an embodiment" or "in one embodiment" that may be present in one or more points of the specification do not necessarily refer to one and the same embodiment. Furthermore, particular conformations, structures, or features may be combined in any suitable manner in one or more embodiments.
Throughout the drawings attached hereto, like parts or elements are denoted with like reference numerals/numerals and, for the sake of brevity, the corresponding description will not be repeated.
The references/headings used herein are provided for convenience only and thus do not define the scope or range of the embodiments.
As an introduction to the detailed description of the exemplary embodiments, reference may first be made to FIG. 1. FIG. 1 is an exemplary circuit block diagram of an electronic system 1, such as a system on chip (SoC) designed for digital signal processing, in accordance with one or more embodiments. The electronic system 1 may comprise electronic circuits such as: a central processing unit 10 (CPU, e.g., a microprocessor), a main system memory 12 (e.g., system RAM, random access memory), a direct memory access controller 14, and a digital signal processor 16 (e.g., hardware accelerator circuitry, such as a memory-based FFT co-processor).
It should be understood that in this description, any specific reference to the FFT co-processor is made purely by way of non-limiting example when designating the digital signal processor 16. As will be apparent from the following description, the digital signal processor 16 according to one or more embodiments may be configured to perform a variety of different algorithms.
As shown in fig. 1, the electronic circuits in the electronic system 1 may be connected by means of a system interconnection network 18, such as a SoC interconnection, a network on chip, a network interconnection or a crossbar (crossbar).
As illustrated in FIG. 1, in one or more embodiments the digital signal processor 16 may include at least one processing element 160, and preferably a number P of processing elements 160_0, 160_1, …, 160_(P-1), as well as a set of local data memory banks M_0, …, M_(Q-1).
In one or more embodiments, the digital signal processor 16 may also include a local control unit 161, a local interconnect network 162, a local data memory controller 163, a local ROM controller 164 coupled to a set of local read-only memories 165 (preferably, a number P of local read-only memories 165_0, 165_1, …, 165_(P-1)), and a local configuration memory controller 166 coupled to a set of local configurable coefficient memories 167 (preferably, a number P of local configurable coefficient memories 167_0, 167_1, …, 167_(P-1)).
In one or more embodiments, the processing elements 160_0, 160_1, …, 160_(P-1) may include mathematical operators, such as radix-2 butterfly units and/or multiply-accumulate (MAC) units. In various embodiments, higher-radix arithmetic processing units, commonly referred to as radix-S butterfly units, may be used. In various embodiments, the processing elements 160 may be reconfigurable to perform different operations at different times.
The processing elements 160 may include respective low-complexity internal direct memory access controllers 168_0, 168_1, …, 168_(P-1). In particular, the processing elements 160 may be configured to retrieve input data from the local data memory banks M_0, …, M_(Q-1) and/or from the main system memory 12 through the respective direct memory access controllers. The processing elements 160 may then process the retrieved input data to generate processed output data, and may be configured to store the processed output data in the local data memory banks M_0, …, M_(Q-1) and/or in the main system memory 12 through the respective direct memory access controllers.
In one or more embodiments, a number Q = S × P of local data memory banks M_0, …, M_(Q-1) may be provided to avoid memory access conflicts during the parallel computations performed by the P processing elements 160_0, 160_1, …, 160_(P-1). Thus, in a preferred embodiment including radix-2 butterfly units, a number Q = 2 × P of local data memory banks M_0, …, M_(Q-1) may be provided.
Preferably, the local data memory banks M_0, …, M_(Q-1) may be provided with buffering (e.g., double buffering), which may facilitate hiding memory upload time (write operations) and/or download time (read operations). In particular, each local data memory bank may be replicated, so that data may be read from one of the two banks (e.g., for being processed) while (new) data is stored in the other bank (e.g., for being processed later). Thus, moving data may not negatively impact computational performance, since it may be masked.
In one or more embodiments, double buffering of the local data memory banks M_0, …, M_(Q-1) may be advantageous in combination with data processing in streaming or back-to-back mode (e.g., as applied to an N-point FFT processor configured to process a contiguous sequence of N data inputs).
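By way of illustration, the following minimal C sketch shows the ping-pong mechanism described above for one replicated bank; the sizes, types and names are illustrative assumptions and are not taken from the embodiments.

/* Minimal double-buffering (ping-pong) sketch: while one copy of a local data
 * memory bank is read for processing, new input data is written into the other
 * copy; the roles are swapped at the end of each processing round. */
typedef struct {
    float bank[2][256];   /* two copies of one local data memory bank */
    int   active;         /* index of the copy currently being processed */
} ping_pong_bank;

static const float *processing_view(const ping_pong_bank *b)
{
    return b->bank[b->active];        /* read side: data being processed */
}

static float *loading_view(ping_pong_bank *b)
{
    return b->bank[b->active ^ 1];    /* write side: next data being loaded */
}

static void swap_buffers(ping_pong_bank *b)
{
    b->active ^= 1;                   /* swap roles at the end of a round */
}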
In one or more embodiments, the local data memory banks M_0, …, M_(Q-1) may have limited storage capacity (and therefore a limited silicon footprint). In the exemplary case of an FFT processor, each local data memory bank may have a storage capacity of at least maxN/Q data points, where maxN is the longest FFT the hardware can handle.
Typical values in applications involving such a hardware accelerator may be as follows:
maxN = 4096 points, each point being, for example, a floating-point single-precision complex number (real, imaginary) having a size of 64 bits (or 8 bytes), and
P = 8, yielding Q = 16,
so that the storage capacity of each bank may be equal to (4096 × 8 bytes)/16 = 2 KB (KB = kilobyte).
In one or more embodiments, local interconnect network 162 may comprise a low complexity interconnect system, for example, based on a known type of bus network, such as an AXI-based interconnect. For example, the data parallelism of the local interconnect network 162 may be 64 bits and the address width may be 32 bits.
The local interconnect network 162 may be configured to connect the processing elements 160 to the local data memory banks M_0, …, M_(Q-1) and/or to the main system memory 12. In addition, the local interconnect network 162 may be configured to connect the local control unit 161 and the local configuration memory controller 166 to the system interconnection network 18.
In particular, the interconnection network 162 may include: a set of P master ports MP_0, MP_1, …, MP_(P-1), each coupled to a respective processing element 160; a set of P slave ports SP_0, SP_1, …, SP_(P-1), each of which may be coupled to the local data memory banks M_0, …, M_(Q-1) via the local data memory controller 163; another pair of ports, including a system master port MP_P and a system slave port SP_P, for coupling to the system interconnection network 18 (e.g., to receive instructions from the central processing unit 10 and/or to access data stored in the system memory 12); and a further slave port SP_(P+1) coupled to the local control unit 161 and the local configuration memory controller 166.
In one or more embodiments, the interconnection network 162 may be fixed (i.e., non-reconfigurable).
In an exemplary embodiment (see, e.g., Table I-1 provided below, where the "X" symbol indicates an existing connection between two ports), the interconnection network 162 may implement the following connections: the P master ports MP_0, MP_1, …, MP_(P-1) coupled to the processing elements 160 can be connected to the respective slave ports SP_0, SP_1, …, SP_(P-1) coupled to the local data memory controller 163; and the system master port MP_P coupled to the system interconnection network 18 can be connected to the slave port SP_(P+1) coupled to the local control unit 161 and the local configuration memory controller 166.
Table I-1 provided below summarizes such exemplary connections made through the interconnection network 162.
TABLE I-1
            SP_0    SP_1    …    SP_(P-1)    SP_P    SP_(P+1)
MP_0         X
MP_1                 X
…
MP_(P-1)                          X
MP_P                                                  X
In another exemplary embodiment (see, e.g., Table I-2 provided below, where the "X" symbol indicates an existing connection between two ports), the interconnection network 162 may further implement the following connections: the P master ports MP_0, MP_1, …, MP_(P-1) can also be connected to the system slave port SP_P, which is coupled to the system interconnection network 18. In this manner, a connection may be provided between any processing element 160 and the SoC via the system interconnection network 18.
Table I-2 provided below summarizes such exemplary connections made through the interconnection network 162.
TABLE I-2
            SP_0    SP_1    …    SP_(P-1)    SP_P    SP_(P+1)
MP_0         X                                X
MP_1                 X                        X
…
MP_(P-1)                          X           X
MP_P                                                  X
In another exemplary embodiment (see, e.g., Table I-3 provided below, where the "X" symbol indicates an existing connection between two ports and an "X" in parentheses indicates an optional connection), the interconnection network 162 may further implement the following connections: the system master port MP_P coupled to the system interconnection network 18 can be connected to at least one of the slave ports SP_0, SP_1, …, SP_(P-1) (here, the first slave port SP_0 of the set of P slave ports SP_0, SP_1, …, SP_(P-1)). Thus, data transfers may take place between the master port MP_P and (any of) the slave ports. Depending on the particular application of the system 1, the connectivity of the master port MP_P can be extended to a plurality of (e.g., all) the slave ports SP_0, SP_1, …, SP_(P-1). The connection between the master port MP_P and the slave ports SP_0, SP_1, …, SP_(P-1) may be used (only) for loading input data to be processed into the local data memory banks M_0, …, M_(Q-1), insofar as all data banks can be accessed via a single slave port. Loading input data may thus be accomplished using only one slave port, while processing data by parallel computation may utilize multiple (e.g., all) slave ports SP_0, SP_1, …, SP_(P-1).
Table I-3 provided below summarizes such exemplary connections made through the interconnection network 162.
TABLE I-3
[Table I-3 is available only as an image in the original publication.]
Further, the processing elements 160 may be configured to retrieve input data from the local read-only memories 165 and/or from the local configurable coefficient memories 167 to perform such processing (elaboration).
In one or more embodiments, the local read-only memories 165_0, 165_1, …, 165_(P-1), accessible by the processing elements 160 via the local ROM controller 164, may be configured to store digital factors and/or coefficients (e.g., twiddle factors or other complex coefficients for FFT computation) for implementing a particular algorithm or operation. The local ROM controller 164 may implement a particular address scheme.
In one or more embodiments, the local configurable coefficient memories 167_0, 167_1, …, 167_(P-1), accessible by the processing elements 160 via the local configuration memory controller 166, may be configured to store application-dependent digital factors and/or coefficients (e.g., coefficients for implementing FIR filters or beamforming operations, weights of neural networks, etc.) that may be configured by software. The local configuration memory controller 166 may implement a particular address scheme.
In one or more embodiments, the local read-only memories 165_0, 165_1, …, 165_(P-1) and/or the local configurable coefficient memories 167_0, 167_1, …, 167_(P-1) may advantageously be partitioned into a number P of memory banks equal to the number of processing elements 160 included in the hardware accelerator circuit 16. This may help avoid conflicts in parallel computing processes.
Note that known (e.g., standard) buses used in system-on-chip design, such as the AMBA AXI bus or other buses, may allow access only to consecutive words (or doublewords, halfwords, or bytes) during memory accesses, by means of incremental or wrapping bursts (addressing schemes). Thus, known bus-based parallel architectures for digital signal processors (e.g., FFT processors) may perform single data transfers from the local memory banks to the processing elements by means of internal DMAs or by means of address generators, insofar as strided arrangements of data may not be supported by known types of interconnect (e.g., a standard AXI bus).
Furthermore, known buses may lack dedicated signals to specify a particular bank access scheme to be used for a burst transaction.
The above-described limitations on the operation of known buses may result in limitations in bandwidth, latency, and/or processing time for many different types of digital signal processors (e.g., for the computation of FFT algorithms).
Note that the processing of the algorithm may involve fetching the data vectors from memory and/or storing the data vectors into memory, where the data vectors are separated by programmable steps. In addition, depending on the algorithm calculated, the data may be arranged in memory according to different access patterns, e.g., to avoid or reduce memory access conflicts.
For example, considering the exemplary case of FFT computation, in each FFT stage the data processed by the internal processing elements 160_0, 160_1, …, 160_(P-1) may not be contiguous, but may be separated by steps (in words) whose values are powers of 2 (i.e., of the form 2^n). Thus, the entire data transfer cannot be grouped into a single typical incremental burst. In known solutions, this may lead to increased complexity of the DMA control unit and higher total computational latency.
Thus, the processing of various algorithms in a digital signal processor (e.g., FFT, beamforming, FIR filters, neural networks, etc.) may benefit from a way to access (in read mode and/or write mode) data stored in memory by means of incremental bursts with a programmable step between successive beats.
Moreover, such processing may benefit from a way to specify different bank access schemes (e.g., incremental, low-order interleaved, FFT-specific, etc.) within a transaction.
In one or more embodiments, the local control unit 161 may comprise controller circuitry of the digital signal processor 16. Such controller circuitry may configure (e.g., dynamically) each internal direct memory access controller 168 with a particular memory access scheme and cycle period.
In one or more embodiments, the local data memory controller 163 may be configured to arbitrate access (e.g., by the processing elements 160) to the local data memory banks M_0, …, M_(Q-1). For example, the local data memory controller 163 may use a memory access scheme (e.g., for the computation of a particular algorithm) that may be selected based on signals received from the central processing unit 10.
In one or more embodiments, the local data memory controller 163 may convert an incoming read/write transaction burst (e.g., an AXI burst) generated by a read/write direct memory access controller into a read/write memory access sequence according to a specified burst type, burst length, and memory access scheme.
Thus, one or more embodiments of the digital signal processor 16 as illustrated in FIG. 1 may reduce the complexity of the local interconnect network 162 by delegating the implementation of the (reconfigurable) connections between the processing elements and the local data memory banks M_0, …, M_(Q-1) to the local data memory controller 163.
In particular, one or more embodiments may provide a standard-compliant extension to data transmissions that can be sent over the local interconnect 162 using an optional user signal.
By way of example, where the local interconnect 162 is an AXI-based interconnect, the AWUSER and/or ARUSER signals may be used in order to improve data transmission between the processing elements 160 and the local data memory banks M_0, …, M_(Q-1).
Again, it will be understood that references to AXI-based interconnects are made purely by way of example: one or more embodiments may be applied to any bus-based digital signal processor 16 in which vector accesses with variable steps into memory are performed and user-specific bus signals are available. Furthermore, it will be understood that reference to a radix-2 butterfly unit as a possible processing element 160 is made purely by way of example: one or more embodiments may be applied to a digital signal processor 16 that includes any type of processing element or mathematical operator 160, including, for example, a standard "single instruction, multiple data" (SIMD) vector processor.
As described with reference to FIG. 1, each input terminal of a processing element 160 may have associated therewith (within 168) a read direct memory access controller, which allows a read burst request (e.g., an AXI read burst) to be issued to the local data memory banks M_0, …, M_(Q-1) via the interconnect 162 in order to fetch the input data to be processed.
In addition, each output terminal of a processing element 160 may have associated therewith (within 168) a write direct memory access controller, which allows a write burst request (e.g., an AXI write burst) to be issued to the local data memory banks M_0, …, M_(Q-1) via the interconnect 162 in order to store the processed output data.
In one or more embodiments, the local data memory controller 163 may receive incoming (AXI) read bursts and/or (AXI) write bursts generated by the direct memory access controllers 168 over the interconnect 162 (e.g., via an AXI4 interface). The local data memory controller 163 may convert such read and/or write bursts into corresponding sequences of read memory accesses and/or write memory accesses according to the specified burst type, burst length, and memory access scheme.
In particular, one or more embodiments may rely on the use of user-available signals (such as the signals AWUSER and ARUSER in the AXI standard) to issue non-standard incremental burst transactions (in addition to standard linear incremental bursts) at the DMAs 168. For example, the user-available signal may encode information on the different stride permutations used to perform the different computation stages of the FFT algorithm.
In addition, the user-available signal may encode information regarding the memory access scheme that the local data memory controller 163 should use. In fact, the local data memory controller 163 may also implement a memory access scheme that helps avoid memory conflicts during algorithm computations (e.g., FFT computations) when different processing elements 160 are used.
For example, a first sub-portion of the user available signal may be used to carry information for each beat of the burst about the stride to be added at each memory transaction, starting at the address AxADDR. A second subsection of the user-available signal may be used to define an access scheme (e.g., FFT-specific, low-order interleaved, linear, etc.) to be used by the local data memory controller 163 to map addresses onto physical memory locations.
By way of example only, the information may be encoded as follows, with reference to 16-bit AWUSER and ARUSER signals of the AMBA AXI bus.
For example, a first sub-portion (AWUSER_STRIDE or ARUSER_STRIDE), comprising the eleven (11) least significant bits of the user-available signal (AWUSER[10:0] or ARUSER[10:0]), may be used to specify the increment of the incremental burst in words (8 bytes).
For example, in the case of a write burst, the addresses following the starting address may be calculated according to the following formula:
ADDR_next = ADDR_previous + (AWUSER[10:0] + 1) * 8
Similarly, in the case of a read burst, the addresses following the starting address may be calculated according to the following formula:
ADDR_next = ADDR_previous + (ARUSER[10:0] + 1) * 8
thus, for example, where extension bits (AWUSER [10:0], ARUSER [10:0]) are bound to 0, one or more embodiments may maintain backward compatibility with (classical) incremental burst schemes (doublewords).
Still referring to the present example, a second sub-portion (AWUSER_SCHEME or ARUSER_SCHEME), comprising, e.g., the five (5) most significant bits of the user-available signal (AWUSER[15:11] or ARUSER[15:11]), may be used to specify the address mapping scheme used by the local data memory controller 163 to map addresses onto the corresponding physical memory locations.
For example, one bit (e.g., the most significant bit) of the second sub-portion may be used to encode whether the transaction is FFT-related (e.g., AWUSER[15] = 1 or ARUSER[15] = 1) or not (e.g., AWUSER[15] = 0 or ARUSER[15] = 0). The remaining bits of the second sub-portion may be used to encode which access scheme should be used. Table II provided below summarizes a possible encoding of the information in this second sub-portion of the user-available signal.
TABLE II [available only as an image in the original publication]
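A corresponding sketch for the second sub-portion is given below; it assumes only the split described above (bit 15 as the FFT-related flag, bits 14:11 selecting the scheme), while the concrete scheme codes of Table II are not reproduced here. The names are illustrative.

#include <stdbool.h>

/* Hypothetical decode of the second sub-portion AxUSER[15:11]. */
struct axuser_scheme {
    bool     fft_related;  /* AxUSER[15]    */
    unsigned scheme_code;  /* AxUSER[14:11] */
};

static struct axuser_scheme decode_scheme(unsigned axuser)
{
    struct axuser_scheme s;
    s.fft_related = (axuser >> 15) & 0x1u;
    s.scheme_code = (axuser >> 11) & 0xFu;
    return s;
}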
In one or more embodiments, by means of the local control unit 161, it is also possible to program and start the execution of the entire FFT algorithm (for example, with a maximum length of 4096 complex points) in addition to or instead of a single DMA transfer. The programming of the read DMA and the write DMA 168 in the processing element 160 may be done by registers inside the local control unit 161 (e.g. via the APB interface). The computation of the entire FFT algorithm may be controlled by a Finite State Machine (FSM) inside the local control unit 161, which schedules the different computation stages and programs the DMA control registers accordingly. The DMA control registers may not be programmed through the APB interface during the computation of the FFT algorithm.
In addition, the local control unit 161 may run a loop finite state machine (loop FSM) for each DMA 168. Such a loop FSM, when activated, may cause the DMA 168 to issue burst transfers in a cyclic manner at a determined loop depth by programming the registers of the DMA 168. The loop depth may be programmable and may have a maximum, statically configurable value (e.g., equal to three). Such a loop FSM may facilitate fetching and/or storing data, for example, when performing processing on data arranged in a 3D matrix.
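The following sketch illustrates, under stated assumptions, how a loop FSM of depth three could issue one burst per innermost iteration when traversing data arranged as a 3D matrix; issue_burst() and the loop-count/step values are hypothetical placeholders, not registers or an API of the DMA 168.

#include <stdio.h>

/* Placeholder standing in for programming a DMA and firing one burst transfer. */
static void issue_burst(unsigned start_addr)
{
    printf("burst @ 0x%08X\n", start_addr);
}

int main(void)
{
    /* illustrative loop counts and address increments for a 3D traversal */
    const unsigned count[3] = { 2, 3, 4 };
    const unsigned step[3]  = { 0x1000, 0x100, 0x10 };
    const unsigned base = 0x0;

    for (unsigned i = 0; i < count[0]; i++)
        for (unsigned j = 0; j < count[1]; j++)
            for (unsigned k = 0; k < count[2]; k++)
                issue_burst(base + i * step[0] + j * step[1] + k * step[2]);
    return 0;
}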
According to a first example, the operation of one or more embodiments of the present disclosure will now be described with reference to an AXI4-compliant bus, in the particular case of the computation of an FFT algorithm. Memory-based parallel FFT algorithms may be of interest, for example, for automotive radar applications or ultra-wideband (UWB) communication systems, such as OFDM-based systems.
A radix-2 FFT algorithm over 2^n points can be divided into n different stages, each stage being computed over 2^n/(2P) cycles. At each clock cycle, a single radix-2 processing element 160 may take two inputs and provide two results according to the following equations:
1) X1(k) = x(k) + x(k + N/2)
2) X2(k) = [x(k) - x(k + N/2)] * W_N^k
where the factor
W_N^k = e^(-j*2*pi*k/N)
is referred to as the twiddle factor, and the index difference of N/2 between the processed points holds for the first stage. For the following stages, a right-shifted version of the initial N/2 difference value may be used, as shown in the data flow diagram of FIG. 2.
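The butterfly of equations 1) and 2), as reconstructed above in decimation-in-frequency form, can be sketched in C as follows; the function name and the use of single-precision complex arithmetic are illustrative assumptions.

#include <complex.h>
#include <math.h>

/* Radix-2 (decimation-in-frequency) butterfly as in equations 1) and 2):
 * out0 = a + b, out1 = (a - b) * W_N^k, with W_N^k = exp(-j*2*pi*k/N). */
static void radix2_butterfly(float complex a, float complex b,
                             unsigned k, unsigned N,
                             float complex *out0, float complex *out1)
{
    const float pi = 3.14159265358979323846f;
    const float complex w = cexpf(-I * 2.0f * pi * (float)k / (float)N);

    *out0 = a + b;
    *out1 = (a - b) * w;
}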
For example, where a number of local data memory banks equal to 2 × P is provided within the processor 16 (for the radix-2 algorithm), the inputs and outputs may be read from and stored into the local data memory banks in parallel.
For each input of a processing element 160, a data read operation may be performed by the internal read DMA. The internal write DMA may perform a write operation for each output, writing it to the same local memory location as the corresponding input operand. Such an in-place strategy may help reduce local memory consumption, which facilitates the computation of long fast Fourier transforms.
Thus, in the digital signal processor 16 according to the present example, it may be desirable to provide a conflict-free bank access scheme implemented within the local data memory controller 163. Indeed, operands to be accessed simultaneously by the processing elements 160 may otherwise be located in the same memory module, e.g., as a result of the FFT algorithm data reordering between the stages illustrated in FIG. 2.
Certain solutions are known in the art that provide ways to allocate data over the memory modules so that conflicts can be avoided. For example, the document by Takala et al., "Conflict-Free Parallel Memory Access Scheme For FFT Processors", Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), Bangkok, 25-28 May 2003, doi: 10.1109/ISCAS.2003.1205957, provides a general solution for radix-2^S FFTs comprising a variable number of parallel radix-2^S processing elements.
In one or more embodiments according to this first example, the local interconnect 162 may be an AXI4-compliant bus. This may facilitate the use of the automated flow tools available for such an open standard.
The user-defined signals ARUSER and AWUSER may be used for read and write transactions, respectively, to encode information for the local data memory controller 163, where such information may include the stride between two consecutive beats of the burst when an incremental/wrapping burst is issued, and the memory bank access scheme to be used during the incremental/wrapping burst.
Thus, by extending the supported burst types in a standard-compliant manner in accordance with one or more embodiments, the memory transactions (e.g., all memory transactions) within an FFT stage for the input/output ports of the processing elements 160, which form a stride permutation of the input/output data, may be grouped together into a single burst transaction.
Thus, the local control unit 161 of the digital signal processor 16 (in this example, an FFT co-processor), which is configured to control and program the local DMAs 168 according to the selected FFT algorithm to be computed, may program the execution of only one burst per FFT stage for each DMA 168.
In addition, the overall latency of the FFT algorithm can be reduced by improving memory access.
Table III provided below summarizes a possible encoding of the information in the user-available signals (e.g., AWUSER and ARUSER) according to the present example.
TABLE III [available only as an image in the original publication]
Table IV, provided at the end of the description, illustrates a possible use of the bus extension according to an embodiment to compute an N-point fast Fourier transform, where N = 2^n, with a number P of radix-2 processing elements 160 provided within the digital signal processor 16.
In Table IV, the operator (>>n) represents a right shift by n positions of the initial value stored in the circular right shift register.
As illustrated in table IV, the input and/or output data transfers for each processing element 160 at each FFT stage may be grouped into a single burst transaction. Accordingly, internal DMA processing can be simplified, and memory access latency can be reduced.
In the presently considered example, the programming of the DMAs 168 during FFT computation may be accomplished by using a simple circular right shift register, which may be initialized according to the selected FFT length only at the beginning of the FFT computation and then updated at the beginning of each stage.
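A minimal sketch of this stage-by-stage stride update is shown below; it assumes the register simply holds the operand index distance, initialized to N/2 and shifted right by one position at each stage, and the names and the printf reporting are illustrative only.

#include <stdio.h>

int main(void)
{
    unsigned N = 16;           /* selected FFT length (cf. FIG. 2)        */
    unsigned stride = N / 2;   /* index distance between butterfly inputs */

    /* the register is initialized once, then shifted right at the
     * beginning of each of the log2(N) stages */
    for (unsigned stage = 0; stride >= 1; stage++, stride >>= 1) {
        printf("stage %u: operand index distance = %u points\n", stage, stride);
        /* here the DMA stride field (e.g., AxUSER[10:0]) would be programmed
         * before issuing the single burst for this stage */
    }
    return 0;
}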
Tables V-1 to V-4, provided at the end of the description, are further examples of possible uses of the bus extension according to embodiments to compute a 16-point fast Fourier transform by means of four radix-2 processing elements 160 in the digital signal processor 16.
Thus, in the presently contemplated example of computing an FFT in accordance with one or more embodiments, a single FFT stage may be performed by a single extended burst rather than multiple single accesses, with advantages in terms of throughput, latency, traffic (data transfer efficiency and performance).
As a second example, the operation of one or more embodiments of the present disclosure will now be described with reference to an AXI4-compliant bus, in the particular case of the computation of matrix products. Matrix products (or scalar products of vectors) may find application in processing related to, for example, FIR filters, beamforming, and neural networks.
A general product of matrices A(M,N) × B(N,P) = C(M,P) may be computed using a digital signal processor 16 that includes R processing elements (e.g., a SIMD vector processor or a processor including reconfigurable processing elements 160), with the system memory accessed using the bus extension in accordance with one or more embodiments.
In this example, each processing element 160 may include a multiply-accumulate (MAC) unit and may be used to calculate the product between a row of matrix A and a column of matrix B. To compute the matrix product, it may prove more efficient to use a low-order interleaved access scheme as the mapping method between virtual addresses and physical memory locations, as this may result in a reduction of memory conflicts. Interleaved memory structures are typically employed by vector processors to efficiently process large data structures, as exemplified in the document G. S. Sohi, "High-Bandwidth Interleaved Memories for Vector Processors - A Simulation Study", IEEE Transactions on Computers, vol. 42, no. 1, 1993, pp. 34-44, doi: 10.1109/12.192212.
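As a generic illustration of the low-order interleaved scheme named here (not necessarily the exact mapping implemented by the local data memory controller 163), the following sketch maps a byte address onto one of Q banks using the low-order bits of the word index; the names and the choice Q = 16 are illustrative.

#include <stdio.h>

#define Q 16u   /* number of local data memory banks, e.g. Q = 2 * P */

/* Low-order interleaved mapping from a byte address to a bank and an offset
 * inside that bank. */
static void map_low_order_interleaved(unsigned byte_addr,
                                      unsigned *bank, unsigned *offset)
{
    unsigned word = byte_addr / 8u;  /* 64-bit (8-byte) data words */
    *bank   = word % Q;              /* low-order bits select the bank */
    *offset = word / Q;              /* remaining bits index within the bank */
}

int main(void)
{
    for (unsigned w = 0; w < 4; w++) {
        unsigned bank, off;
        map_low_order_interleaved(w * 8u, &bank, &off);
        printf("word %u -> bank %u, offset %u\n", w, bank, off);
    }
    return 0;
}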
Table VI, provided below, summarizes a possible encoding of the information in the user-available signals (e.g., AWUSER and ARUSER) according to the present example of the matrix product.
TABLE VI
[Table VI is available only as an image in the original publication.]
Table VII, provided at the end of the description, illustrates a possible use of the bus extension according to an embodiment to compute the product of matrices A(M,N) × B(N,P) = C(M,P).
Tables VIII-1 to VIII-4, provided at the end of the description, are further examples of possible uses of the bus extension according to embodiments to compute the product of two 4 × 4 data matrices by means of four processing elements 160 in the digital signal processor 16.
Thus, in the presently considered example of computing a matrix product in accordance with one or more embodiments, each row-by-column computation (i.e., a scalar product of vectors) may be performed by a single extended burst rather than multiple single accesses (with no specific data organization in memory required), with advantages in terms of throughput, latency, and traffic (data transfer efficiency and performance).
Accordingly, one or more embodiments may facilitate the implementation of a digital signal processing system having one or more of the following advantages: the possibility of grouping together, into a single burst transaction on the bus, different single memory accesses performed by a processing element during a digital signal processing algorithm such as matrix multiplication or FFT; the possibility of encoding the bank access scheme to be used within a transaction burst; compatibility with existing bus standards (e.g., AXI4), as a result of using the optional user signals made available by the bus; reduced silicon complexity of a typical bus-based digital signal processor (e.g., an FFT processor), due to improved handling of internal data transfers; reduced latency associated with internal data transfers from memory, thereby improving the processing time of different algorithms (e.g., FFT, FIR filters, beamforming, neural networks, etc.); and suitability for data processing accelerators or SIMD vector processors.
In one or more embodiments, the electronic system 1 may be implemented as an integrated circuit in a single silicon chip or die (e.g., as a system on a chip). Alternatively, the electronic system 1 may be a distributed system comprising a plurality of integrated circuits, interconnected together for example by a Printed Circuit Board (PCB).
As illustrated herein, a method of accessing a memory (e.g., M_0, …, M_(Q-1)) to read and/or write data may include, for example: generating (e.g., 168), at a processing circuit (e.g., 160), a memory transaction request comprising a burst of memory access requests toward a set of memory locations in the memory, the memory locations having respective memory addresses; sending first and second signals to a memory controller circuit (e.g., 163) coupled to the memory via an interconnect bus (e.g., 162), the first signal conveying the memory transaction request and the second signal conveying information for mapping the burst of memory access requests to respective memory addresses of the memory locations in the memory; and calculating (e.g., at 163), as a function of the information conveyed by the second signal, the respective memory addresses of the memory locations and accessing the memory locations to read data from and/or write data to the memory locations.
The read data may be intended to be processed by the processing circuitry and the write data may be generated by the processing circuitry.
As illustrated herein, the interconnect bus may comprise an Advanced eXtensible Interface (AXI) bus, and the method may comprise: encoding the first and second signals according to the AXI protocol for transmission via the interconnect bus, and transmitting the second signal over an AWUSER channel and/or an ARUSER channel of the AXI bus.
As illustrated herein, a method may include generating a memory transaction request, the memory transaction request including an incremental burst of memory access requests or a wrapped burst of memory access requests.
As illustrated herein, a method may include including burst type data and burst length data into a memory transaction request transmitted by a first signal.
As exemplified herein, a method can include including, in information conveyed by the second signal, a stride value indicating a number of data units (e.g., a number of data words, each word equal to, for example, 8 bytes) between two consecutive memory locations (e.g., two consecutive beats of a burst) in a burst of memory access requests, and calculating a respective memory address of the memory location from the stride value.
As illustrated herein, a method may include including data indicative of the determined memory access scheme into information conveyed by the second signal, and accessing a memory location to read data from and/or write data to the memory location in accordance with the data indicative of the determined memory access scheme.
As illustrated herein, a method may include including data indicative of a memory access scheme selected from an incremental access scheme, a low-order interleaved access scheme, and an access scheme used to compute a fast fourier transform algorithm (e.g., a Takala access scheme) in the information conveyed by the second signal.
As illustrated herein, a method may include programming a processing circuit to process data in a plurality of subsequent processing stages, and generating at least one memory transaction request at each processing stage to read data from and/or write data to a memory location.
As exemplified herein, the circuitry (e.g., 16) may include memory for storing data, processing circuitry for processing data, and memory controller circuitry coupled to the memory and to the processing circuitry via an interconnect bus.
As exemplified herein, the processing circuitry may be configured to generate a memory transaction request comprising respective bursts of memory access requests for a set of memory locations in the memory, the memory locations having respective memory addresses, and to send first and second signals to the memory controller circuitry via the interconnect bus, the first signal conveying the memory transaction request and the second signal conveying information for mapping the bursts of memory access requests to the respective memory addresses of the memory locations in the memory.
As exemplified herein, the memory controller circuitry may be configured to calculate respective memory addresses of the memory locations from the information conveyed by the second signal, and to access the memory locations to read data from the memory locations for processing by the processing circuitry and/or to write data processed by the processing circuitry to the memory locations.
Without prejudice to the underlying principles, the details and the embodiments may vary, even significantly, with respect to what has been described by way of example only, without departing from the scope of protection.
The scope of protection is determined by the appended claims.
While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or embodiments.
TABLE IV [available only as an image in the original publication]
TABLE V-1 [available only as an image in the original publication]
TABLE V-2 [available only as an image in the original publication]
TABLE V-3 [available only as an image in the original publication]
TABLE V-4 [available only as an image in the original publication]
TABLE VII [available only as an image in the original publication]
TABLE VIII-1 [available only as an image in the original publication]
TABLE VIII-2 [available only as an image in the original publication]
TABLE VIII-3 [available only as an image in the original publication]
TABLE VIII-4 [available only as an image in the original publication]

Claims (22)

1. A method of accessing a memory to read and/or write data, the method comprising:
generating a memory transaction request comprising a burst of memory access requests for a set of memory locations in the memory, the memory locations having respective memory addresses;
sending, via an interconnect bus, first and second signals to memory controller circuitry coupled to the memory, the first signal conveying the memory transaction request and the second signal conveying information for mapping a burst of the memory access request to a respective memory address of the memory location in the memory; and
from the information conveyed by the second signal, a respective memory address of the memory location is calculated and the memory location is accessed to read data from and/or write data to the memory location.
2. The method of claim 1, wherein the interconnect bus comprises an advanced extensible interface (AXI) bus, the method comprising:
encoding the first and second signals according to an AXI protocol for transmission over the interconnect bus; and
transmitting the second signal over an AWUSER channel and/or an ARUSER channel of the AXI bus.
3. The method of claim 1, wherein generating the memory transaction request comprises: an incremental burst of memory access requests, or a wrapped burst of memory access requests.
4. The method of claim 1, further comprising: including burst type data and burst length data in the memory transaction request transmitted by the first signal.
5. The method of claim 1, further comprising:
including a stride value in the information conveyed by the second signal, the stride value indicating a number of data units between two consecutive memory locations in a burst of the memory access request; and
calculating the corresponding memory address of the memory location according to the stride value.
6. The method of claim 1, further comprising:
including data indicative of a memory access scheme in the information conveyed by the second signal; and
accessing the memory location to read data from and/or write data to the memory location in accordance with the data indicative of the memory access scheme.
7. The method of claim 6, wherein the memory access scheme is selected from the group consisting of: an incremental access scheme, a low-order interleaved access scheme, or an access scheme for computation of a fast fourier transform algorithm.
8. The method of claim 1, further comprising:
programming the processing circuitry to process the data in a plurality of subsequent processing stages; and
at least one memory transaction request is generated at each of the subsequent processing stages to read data from and/or write data to the memory location.
9. A circuit, comprising:
a memory for storing data;
processing circuitry for processing data;
a memory controller circuit; and
an interconnect bus coupling the memory controller circuitry to the memory and the processing circuitry;
wherein the processing circuitry is configured to:
generating a memory transaction request comprising respective bursts of memory access requests for a set of memory locations in the memory, the memory locations having respective memory addresses; and
sending first and second signals to the memory controller circuitry via the interconnect bus, the first signal conveying the memory transaction request, the second signal conveying information for mapping a burst of the memory access request onto a respective memory address of the memory location in the memory; and
wherein the memory controller circuitry is configured to:
calculating a respective memory address of the memory location from the information conveyed by the second signal; and
the memory location is accessed to read data from the memory location for processing by the processing circuitry and/or to write data processed by the processing circuitry to the memory location.
10. The circuit of claim 9, wherein the interconnect bus comprises an advanced extensible interface (AXI) bus, and wherein the processing circuit is further configured to:
encoding the first and second signals in accordance with the AXI protocol for transmission over the interconnect bus; and
transmitting the second signal over an AWUSER channel and/or an ARUSER channel of the AXI bus.
11. The circuitry of claim 9, wherein the memory transaction request comprises: an incremental burst of memory access requests, or a wrapped burst of memory access requests.
12. The circuit of claim 9, wherein the memory transaction request transmitted by the first signal includes burst type data and burst length data.
13. The circuit according to claim 9,
wherein the processing circuitry is configured to include a stride value in the information conveyed by the second signal, the stride value indicating a number of data units between two consecutive memory locations in a burst of the memory access request; and
wherein the memory controller circuitry is configured to calculate the respective memory address of the memory location as a function of the stride value.
14. The circuit according to claim 9,
wherein the processing circuitry is configured to include data indicative of a memory access scheme in the information conveyed by the second signal; and
wherein the memory controller circuitry is configured to access the memory locations in accordance with the data indicative of the memory access scheme to read data from and/or write data to the memory locations.
15. The circuit of claim 14, wherein the memory access scheme is selected from the group consisting of: an incremental access scheme, a low-order interleaved access scheme, or an access scheme for computation of a fast fourier transform algorithm.
16. The circuit according to claim 9,
wherein the processing circuitry is programmed to process data in a plurality of subsequent processing stages; and
wherein the processing circuitry is configured to generate at least one memory transaction request in each of the subsequent processing stages to read data from and/or write data to the memory location.
17. An electronic system, comprising:
a system interconnection network;
a central processing unit coupled to the system interconnection network;
a main system memory coupled to the system interconnect network;
a direct memory access controller coupled to the system interconnect network;
a digital signal processor coupled to the system interconnection network, the digital signal processor comprising:
a local memory for storing data;
local processing circuitry for processing data;
a local memory controller circuit; and
a local interconnect bus coupling the local memory controller circuitry to the local memory and the local processing circuitry;
wherein the local processing circuitry is configured to:
generating a memory transaction request comprising respective bursts of memory access requests for a set of memory locations in the local memory, the memory locations having respective memory addresses; and
sending first and second signals to the local memory controller circuitry via the local interconnect bus, the first signal conveying the memory transaction request, the second signal conveying information for mapping a burst of the memory access request onto a respective memory address of the memory location in the local memory; and
wherein the local memory controller circuitry is configured to:
calculating a respective memory address of the memory location from the information conveyed by the second signal; and
accessing the memory location to read data from the memory location for processing by the local processing circuitry and/or to write data processed by the local processing circuitry to the memory location.
18. The electronic system of claim 17, wherein the local interconnect bus comprises an advanced extensible interface (AXI) bus, and wherein the local processing circuitry is further configured to:
encoding the first and second signals in accordance with an AXI protocol for transmission via the local interconnect bus; and
transmitting the second signal over an AWUSER channel and/or an ARUSER channel of the AXI bus.
19. The electronic system of claim 17, wherein the memory transaction request comprises: an incremental burst of memory access requests, or a wrapped burst of memory access requests.
20. The electronic system of claim 17, wherein the memory transaction request transmitted by the first signal comprises burst type data and burst length data.
21. The electronic system of claim 17,
wherein the local processing circuitry is configured to include a stride value in the information conveyed by the second signal, the stride value indicating a number of data units between two consecutive memory locations in a burst of the memory access request; and
wherein the local memory controller circuitry is configured to calculate the respective memory address of the memory location as a function of the stride value.
22. The electronic system of claim 17,
wherein the local processing circuitry is configured to include data indicative of a memory access scheme in the information conveyed by the second signal; and
wherein the local memory controller circuitry is configured to access the memory locations in accordance with the data indicative of the memory access scheme to read data from and/or write data to the memory locations.
CN202110461211.5A 2020-04-29 2021-04-27 Method for accessing a memory and corresponding circuit Pending CN113568851A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IT102020000009364 2020-04-29
IT102020000009364A IT202000009364A1 (en) 2020-04-29 2020-04-29 PROCEDURE FOR ACCESSING A MEMORY AND CORRESPONDING CIRCUIT
US17/224,747 2021-04-07
US17/224,747 US11620077B2 (en) 2020-04-29 2021-04-07 Method of accessing a memory, and corresponding circuit

Publications (1)

Publication Number Publication Date
CN113568851A true CN113568851A (en) 2021-10-29

Family

ID=78161440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110461211.5A Pending CN113568851A (en) 2020-04-29 2021-04-27 Method for accessing a memory and corresponding circuit

Country Status (1)

Country Link
CN (1) CN113568851A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055813A1 (en) * 2005-09-08 2007-03-08 Arm Limited Accessing external memory from an integrated circuit
CN110611561A (en) * 2018-06-15 2019-12-24 意法半导体股份有限公司 Cryptographic method and circuit, corresponding device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114006994A (en) * 2021-11-16 2022-02-01 同济大学 Transmission system based on configurable wireless video processor
CN115712505A (en) * 2022-11-25 2023-02-24 湖南胜云光电科技有限公司 Data processing system for distributing power signals in register
CN115712505B (en) * 2022-11-25 2023-06-30 湖南胜云光电科技有限公司 Data processing system for distributing electric signals in register

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination