NZ716954B2 - Computing architecture with peripherals - Google Patents

Computing architecture with peripherals

Info

Publication number
NZ716954B2
NZ716954B2
Authority
NZ
New Zealand
Prior art keywords
interconnect
memory transfer
memory
port
master
Prior art date
Application number
NZ716954A
Other versions
NZ716954A (en)
Inventor
Benjamin Aaron Gittins
Original Assignee
Benjamin Aaron Gittins
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Benjamin Aaron Gittins
Priority claimed from PCT/IB2014/063189 (WO2015008251A2)
Publication of NZ716954A
Publication of NZ716954B2

Classifications

    • G06F12/0811 — Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G06F12/0815 — Cache consistency protocols
    • G06F12/082 — Cache consistency protocols using directory methods; associative directories
    • G06F12/0831 — Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0864 — Associative cache addressing using pseudo-associative means, e.g. set-associative or hashing
    • G06F13/1663 — Handling requests for access to memory bus based on arbitration in a multiprocessor architecture; access to shared memory
    • G06F13/20 — Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/364 — Access to common bus or bus system with centralised access control using independent requests or grants, e.g. using separated request and grant lines
    • G06F13/372 — Access to common bus or bus system with decentralised access control using a time-dependent priority, e.g. individually loaded time counters or time slots
    • G06F13/4068 — Device-to-bus coupling; electrical coupling
    • G06F13/4282 — Bus transfer protocol, e.g. handshake, synchronisation, on a serial bus, e.g. I2C bus, SPI bus
    • G06F2212/283 — Plural cache memories
    • G06F2212/6032 — Way prediction in set-associative cache
    • G06F2212/621 — Coherency control relating to peripheral accessing, e.g. from DMA or I/O device
    • G06F3/0619 — Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/065 — Replication mechanisms (horizontal data movement between storage devices or systems)
    • G06F3/067 — Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G11C7/1072 — Input/output [I/O] data interface arrangements for memories with random access ports synchronised on clock signal pulse trains, e.g. synchronous memories, self timed memories

Abstract

A shared memory computing architecture (300) has M interconnect masters (350, 351, 352, 353, 354), one interconnect target (370), and a timeslot based interconnect (319). The interconnect (319) has a unidirectional timeslot based interconnect (320) to transport memory transfer requests with T timeslots and a unidirectional timeslot based interconnect (340) to transport memory transfer responses with R timeslots. For each of the R timeslots, that timeslot: corresponds to one memory transfer request timeslot and starts at least L clock cycles after the start time of that corresponding memory request timeslot. The value of L is >= 3 and < T. Interconnect target (370) is connected to interconnect (319). Each interconnect master (350, 351, 352, 353, 354) is connected to interconnect (319).

Description

NZ IP No. 716954 — signed B. Gittins, 14 June 2020

COMPUTING ARCHITECTURE WITH PERIPHERALS

Field of the invention

The present invention relates to multi-interconnect-master computing architectures and is particularly applicable to real-time and mixed-criticality computing involving peripherals.
Background of the invention

Throughout this specification, including the claims: a bus master is a type of interconnect master; a bus target / slave is a type of interconnect target; a memory store coupled with a memory controller may be described at a higher level of abstraction as a memory store; a peripheral may or may not have I/O pins; a peripheral is connected to an interconnect that transports memory transfer requests; a peripheral may be memory mapped, such that a memory transfer request to the interconnect target port of a peripheral is used to control that peripheral; a processor core may be remotely connected to an interconnect over a bridge; and a definition and description of domino timing effects can be found in [1].
Many shared memory computing devices with multiple bus-masters / interconnect-masters, such as the European Space Agency's Next Generation Microprocessor (NGMP) architecture [3], experience severe real-time problems [4]. For example, the memory transfer requests of software running on one core of the NGMP architecture experience unwanted timing interference from unrelated memory transfer requests issued by other bus masters [4] over the shared ARM AMBA AHB [2] interconnect. In particular, unwanted timing interference can be caused by memory transfer requests issued by other cores and bus master peripherals to the level 2 cache module and SDRAM. Even though most memory transfer requests are in practice at most 32 bytes in length, a single memory transfer request can block the bus from servicing other memory transfer requests for more than 10 clock cycles.
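The scale of this interference can be illustrated with back-of-envelope arithmetic. The sketch below is my simplification, not from the specification; the 5-master and 10-cycle figures merely echo the discussion above. It contrasts the worst-case wait on a blocking shared bus with that on a timeslot (TDMA) interconnect:

```python
# Back-of-envelope worst-case wait bounds (illustrative formulas, not from the
# patent). On a blocking shared bus a master may have to wait for every other
# master to complete its longest request; on a TDMA (timeslot) interconnect the
# wait is bounded by the round length regardless of what other masters do.

def blocking_bus_worst_wait(n_masters: int, max_request_cycles: int) -> int:
    # All other masters are serviced first, each holding the bus for its
    # longest possible request.
    return (n_masters - 1) * max_request_cycles

def tdma_worst_wait(n_timeslots: int, slot_cycles: int = 1) -> int:
    # At worst the master just missed its slot and waits one full round.
    return (n_timeslots - 1) * slot_cycles

# 5 masters, each request able to hold a blocking bus for 10+ cycles
# (cf. the AHB example above):
print(blocking_bus_worst_wait(5, 10))  # 40 cycles of interference
print(tdma_worst_wait(5))              # 4 cycles, independent of request mix
```

The blocking-bus bound grows with the worst request any competitor may issue; the TDMA bound depends only on the schedule, which is what makes it attractive for worst-case execution time analysis.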
Summary of the invention

In contrast, in one aspect, embodiments of the present invention provide a shared memory computing device comprising: a first clock; at least M interconnect masters, where the value of M is 4; at least 1 interconnect target; a first timeslot based interconnect for transporting memory transfer requests and their corresponding responses, comprising: an input clock port that is connected to the first clock; a unidirectional timeslot based interconnect to transport memory transfer requests with T timeslots, where the value of T is at least 4; a unidirectional timeslot based interconnect to transport memory transfer responses with R timeslots, in which: for each of the R timeslots, that timeslot: corresponds to one memory transfer request timeslot; and starts at least L clock cycles after the start time of that corresponding memory request timeslot, where the value of L is at least 3 and less than the value of T; in which: at least one interconnect target is connected to the first timeslot based interconnect; and for each interconnect master I of the M interconnect masters: each interconnect master I is connected to the first timeslot based interconnect; and each of the T timeslots is mappable to a different one of the M interconnect masters.
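The request/response timing relation in this aspect can be sketched numerically. In the sketch below, the zero-based slot numbering, the 1-clock-cycle slot duration, and the function name are illustrative assumptions; only the constraint 3 <= L < T and the rule that each response timeslot starts at least L cycles after its request timeslot come from the text above.

```python
# Sketch (assumed model) of the claimed timeslot relationship: T request
# timeslots, each response timeslot starting at least L cycles after its
# corresponding request timeslot, with 3 <= L < T.

def response_slot_starts(T: int, L: int, slot_cycles: int = 1):
    if not (3 <= L < T):
        raise ValueError("claim requires 3 <= L < T")
    starts = []
    for slot in range(T):
        request_start = slot * slot_cycles
        # Earliest cycle at which the corresponding response slot may begin.
        starts.append(request_start + L)
    return starts

# T = 4 request slots, L = 3: responses may begin at cycles 3, 4, 5, 6 --
# the response stream is the request schedule delayed by L cycles.
print(response_slot_starts(4, 3))
```

Because L < T, a master's response can arrive before its next request slot in the following round, so the pipeline stays full without per-master buffering of more than one round.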
A shared memory computing device optimised for upper-bound worst case execution time analysis comprising: an on-chip random access memory store comprising at least two interconnect target ports, in which: the first target port: has a data path of D-bits in width, the value of D being larger than or equal to 2; is adapted to sustain a throughput of one D-bit wide memory transfer request per clock cycle; and is adapted to sustain a throughput of one D-bit wide memory transfer response per clock cycle; and the second target port: has a data path of E-bits in width, the value of E being larger than or equal to 1; is adapted to sustain a throughput of one E-bit wide memory transfer request per clock cycle; and is adapted to sustain a throughput of one E-bit wide memory transfer response per clock cycle; a first on-chip shared memory interconnect which: has a data path of D-bits in width; is exclusively connected to the first port of the at least two interconnect target ports of the on-chip random access memory; is adapted to sustain a throughput of one D-bit wide memory transfer request per clock cycle to the on-chip random access memory; is adapted to sustain a throughput of one D-bit wide memory transfer response per clock cycle; and has at least two cache modules connected to it, each cache module comprising: a master port with a D-bit wide data path which is connected to this interconnect; and a target port; and a second on-chip shared memory interconnect which: has a data path of E-bits in width; is exclusively connected to the second port of the at least two interconnect target ports of the on-chip random access memory; is adapted to sustain a peak throughput of one E-bit wide memory transfer request per clock cycle to the on-chip random access memory; is adapted to sustain a peak throughput of one E-bit wide memory transfer response per clock cycle; and has at least two interconnect masters connected to it.
A shared memory computing device comprising: a first system interconnect; an on-chip random access memory store comprising at least one interconnect target port, in which the first interconnect target port is connected to the first system interconnect; at least one sub-computing device, each sub-computing device comprising: a first local interconnect; a first interconnect master connected to a local interconnect of the sub-computing device; an interconnect bridge comprising two ports, in which: the first port is connected to the first system interconnect; and the second port is connected to a local interconnect of the sub-computing device; and in which the first interconnect master is adapted to issue memory transfer requests to the on-chip random access memory store; and a first peripheral, comprising: a first interconnect target port which is connected to the first local interconnect of the first of the at least one sub-computing devices; a first interconnect master port which is adapted to issue memory transfer requests to the on-chip random access memory store; in which: the first interconnect master of the first of the at least one sub-computing devices is adapted to issue memory transfer requests to the first peripheral.
A shared memory computing device comprising: M interconnect-masters, where the value of M is at least 2, each interconnect-master comprising: an egress port; and an ingress port; and a first timeslot based interconnect for transporting memory transfer requests and their corresponding responses, comprising: an arbiter and decoder module; a M-to-1 multiplexer, comprising: a select port; M data input ports; and 1 data output port; and a 1-to-M demultiplexer, comprising: a select port; 1 data input port; and M data output ports; in which: for each interconnect master I: the egress port of interconnect master I is connected to the data input port I of the M-to-1 multiplexer; and the ingress port of interconnect master I is connected to the data output port I of the 1-to-M demultiplexer; the arbiter and decoder module of the interconnect controls the value supplied to the select port of the M-to-1 multiplexer; and the value supplied to the select port of the 1-to-M demultiplexer is the value supplied to the select port of the M-to-1 multiplexer delayed by L clock cycles, where the value of L is larger or equal to 3.
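A minimal cycle-level simulation of this mux/demux arrangement is sketched below. The round-robin timeslot schedule, the target modelled as replying exactly L cycles later, and all names are illustrative assumptions, not fixed by the claim; the point is that delaying the multiplexer's select value by L cycles is sufficient to steer each response back to the master that issued the matching request.

```python
from collections import deque

# Assumed cycle-level model: an M-to-1 mux forwards one master's request per
# cycle; the single target replies L cycles later; a 1-to-M demux routes each
# response using the select value delayed by L cycles.

M, L = 4, 3
select_pipe = deque([None] * L, maxlen=L)    # L-stage delay on the select line
response_pipe = deque([None] * L, maxlen=L)  # models the target's L-cycle latency
delivered = []                               # (cycle, master_index, response)

def step(t, select, requests):
    # Values emerging from the L-stage pipes this cycle (oldest entries):
    sel_out = select_pipe[0]
    resp_out = response_pipe[0]
    # Mux: forward the selected master's request; target replies L cycles later.
    select_pipe.append(select)
    response_pipe.append("resp:" + requests[select])
    # Demux: the delayed select steers the response to its originating master.
    if sel_out is not None:
        delivered.append((t, sel_out, resp_out))

# Round-robin timeslots: master t % M owns cycle t.
for t in range(8):
    reqs = [f"m{i}@t{t}" for i in range(M)]
    step(t, t % M, reqs)

print(delivered[0])  # at cycle 3, master 0 receives the response to its cycle-0 request
```

No per-request tags travel with the data in this sketch; the delayed select alone establishes the request/response correspondence, which is the claimed mechanism.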
A shared memory computing device comprising: M interconnect-nodes, where the value of M is at least 2, each interconnect-node comprising: an egress port; and an ingress port; a singular interconnect node comprising: an egress port; and an ingress port; a first Mx1 interconnect for transporting memory transfer requests and their corresponding responses, comprising: M bidirectional ports, each comprising: an ingress port which is connected to the egress port of a different one of the M interconnect-nodes; and an egress port, which is connected to the ingress port of a different one of the M interconnect-nodes; a singular bidirectional port comprising: an egress port which is connected to the ingress port of the singular interconnect node; and an ingress port which is connected to the egress port of the singular interconnect node; a parallel-in, serial-out (PISO) M input port x 1 output port shift register with M stages, in which: for each stage I of the M stages: that stage is connected to the egress port of the interconnect node I of M interconnect nodes; and the output of stage 1 is connected to the egress port of the singular port of the interconnect; a serial-in, parallel-out (SIPO) 1 input port x M output port module, in which the input is connected to the ingress port of the singular port of the interconnect; and an arbiter and decoder module which is adapted to control the PISO Mx1 shift register and the SIPO 1xM module.
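The request path of this aspect can be sketched as a small software model. Everything below (the class name, the three-node example, zero-based stage numbering, one shift per cycle) is an illustrative assumption; it shows only the PISO idea of loading M requests in parallel and emitting them serially, one per cycle, toward the interconnect's singular port.

```python
# Rough model (assumed) of the claimed PISO arrangement: M interconnect nodes
# load their requests into an M-stage shift register in parallel; the register
# then shifts one stage per cycle toward the interconnect's singular egress
# port. (The SIPO module would perform the inverse fan-out for responses.)

class PisoMx1:
    def __init__(self, m: int):
        self.stages = [None] * m  # stage 0 feeds the singular egress port

    def load(self, requests):
        # Parallel load: stage i takes the request from interconnect node i.
        self.stages = list(requests)

    def shift(self):
        # Serial out: emit stage 0, move every other stage down one place.
        out = self.stages[0]
        self.stages = self.stages[1:] + [None]
        return out

piso = PisoMx1(3)
piso.load(["req-a", "req-b", "req-c"])
serialized = [piso.shift() for _ in range(3)]
print(serialized)  # requests leave one per cycle: ['req-a', 'req-b', 'req-c']
```

A shift register gives a fixed, data-independent drain order, which again suits worst-case timing analysis: each node knows exactly which cycle its loaded request reaches the singular port.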
A shared memory computing device optimised for worst case execution time analysis comprising: N fully associative cache modules, where the value of N is at least 1, each fully associative cache module comprising: a master port; a target port; a means to track dirty cache-lines; a finite state machine with one or more policies, in which at least one policy: employs an allocate on read strategy; employs an allocate on write strategy; and employs a least recently used eviction strategy; and N processor cores, in which each core is assigned a different one of the N fully associative cache modules as its private cache.
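The policy combination named above (fully associative lookup, allocate on read, allocate on write, least recently used eviction, dirty-line tracking) can be sketched as a toy model. The class below is an illustrative assumption, not the patent's implementation; it uses a single ordered map as both the associative lookup and the LRU queue.

```python
from collections import OrderedDict

# Toy model (assumed) of one claimed cache policy: fully associative,
# allocate-on-read, allocate-on-write, LRU eviction, with per-line dirty
# tracking so write-backs are limited to modified lines.

class FullyAssociativeLRU:
    def __init__(self, n_lines, backing):
        self.n_lines = n_lines
        self.backing = backing      # models the shared memory behind the cache
        self.lines = OrderedDict()  # addr -> (value, dirty); order = LRU order

    def _touch(self, addr):
        self.lines.move_to_end(addr)  # most recently used at the end

    def _allocate(self, addr, value, dirty):
        if len(self.lines) >= self.n_lines:
            victim, (v, d) = self.lines.popitem(last=False)  # evict LRU line
            if d:
                self.backing[victim] = v  # write back only dirty victims
        self.lines[addr] = (value, dirty)

    def read(self, addr):
        if addr in self.lines:
            self._touch(addr)
            return self.lines[addr][0]
        value = self.backing[addr]
        self._allocate(addr, value, dirty=False)  # allocate on read
        return value

    def write(self, addr, value):
        if addr in self.lines:
            self.lines[addr] = (value, True)
            self._touch(addr)
        else:
            self._allocate(addr, value, dirty=True)  # allocate on write

mem = {0: 10, 1: 11, 2: 12}
cache = FullyAssociativeLRU(2, mem)
cache.read(0); cache.write(1, 99)  # fills both lines; line 1 is dirty
cache.read(2)                      # evicts line 0 (clean, LRU): no write-back
print(mem[1])                      # still 11: the dirty line has not been evicted
```

Full associativity removes set-conflict misses, so a static analysis only has to bound capacity misses, which is one reason such a cache suits worst-case execution time analysis.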
A shared memory computing device optimised for worst case execution time analysis comprising: at least one interconnect master; N cache modules, where the value of N is at least 1, each cache module comprising: a master port; a target port; and a finite state machine that employs an update-type cache coherency policy; N processor cores, in which each core is assigned a different one of the N cache modules as its private cache; and in which the execution time of memory transfer requests issued by each of the N processor cores is not modified by: the unrelated memory transfer requests issued by any of the other N processor cores; or the unrelated memory transfer requests issued by at least one other interconnect master.

A bidirectional interconnect for transporting memory transfer requests and their corresponding memory transfer responses, comprising: a unidirectional interconnect to transport memory transfer requests; and a unidirectional interconnect to transport memory transfer responses, adapted to transport memory transfer responses that include a copy of the corresponding memory transfer request.
Further inventive aspects of the present invention are set out in the claims appearing at the end of this specification.
Brief description of the drawings

For a better understanding of the invention, and to show how it may be carried into effect, embodiments of it are shown, by way of non-limiting example only, in the accompanying drawings. In the drawings: figure 1 is a block schematic diagram illustrating preferred embodiments of the present invention; figure 2 is a flow-chart illustrating processes according to the embodiments of figure 1; figure 3 is a block schematic diagram illustrating preferred embodiments of the present invention; figure 4 is a flow-chart illustrating processes according to the embodiments of figure 3; figure 5 is a timing diagram illustrating timing according to the embodiments of figure 3; figure 6 is a block schematic diagram illustrating preferred embodiments of the present invention; figures 7 and 8 are timeslot scheduling diagrams according to embodiments of the type of figure 3; figure 9 is an access control list diagram according to embodiments of the type of figure 3; figure 10 is a hybrid block schematic diagram illustrating the allocation of memory, and the timing of interconnect masters accessing that memory, according to embodiments of the type of figure 3 and figure 6; figure 11 is a block schematic diagram illustrating portions of the embodiments of figures 1 and 3; figure 12 is a block schematic diagram illustrating preferred embodiments of the present invention; figure 13 is a flow-chart illustrating processes according to the embodiments of figure 12; figure 14 is a block schematic diagram illustrating portions of the embodiments of figures 3 and 12; figure 15 is a high-level block schematic diagram illustrating a preferred embodiment of the present invention; figures 16 to 19 are flow-charts illustrating processes according to the embodiments of figure 15; and figure 20 is a diagram illustrating two sets of fields according to preferred embodiments of the present invention.
Description of preferred embodiments of the invention

Figure 1 is a block schematic diagram illustrating portions of a shared memory computing architecture (300) for preferred embodiments of the present invention. Shared memory computing architecture (300) comprises 5 unidirectional interconnect bridges (350, 351, 352, 353, 354). Each unidirectional interconnect bridge (350, 351, 352, 353, 354) comprises: an interconnect target port ({350.ti, 350.te}, {351.ti, 351.te}, {352.ti, 352.te}, {353.ti, 353.te}, {354.ti, 354.te}) comprising: an ingress port (350.ti, 351.ti, 352.ti, 353.ti, 354.ti); and an egress port (350.te, 351.te, 352.te, 353.te, 354.te); an interconnect master port ({350.mi, 350.me}, {351.mi, 351.me}, {352.mi, 352.me}, {353.mi, 353.me}, {354.mi, 354.me}) comprising: an ingress port (350.mi, 351.mi, 352.mi, 353.mi, 354.mi); and an egress port (350.me, 351.me, 352.me, 353.me, 354.me); a memory transfer request module (330, 332, 334, 336, 338) comprising: an ingress port (350.ti, 351.ti, 352.ti, 353.ti, 354.ti); and an egress port (350.me, 351.me, 352.me, 353.me, 354.me); and a memory transfer response module (331, 333, 335, 337, 339) comprising: an ingress port (350.ti, 351.ti, 352.ti, 353.ti, 354.ti); and an egress port (350.me, 351.me, 352.me, 353.me, 354.me).
The shared memory computing architecture (300) further comprises: M interconnect masters (350, 351, 352, 353, 354), where the value of M is 5, in which each interconnect master comprises: an egress port (350.me, 351.me, 352.me, 353.me, 354.me); and an ingress port (350.mi, 351.mi, 352.mi, 353.mi, 354.mi); and a first timeslot based interconnect (319) for transporting memory transfer requests and their corresponding responses, comprising: an arbiter and decoder module (360); a M-to-1 multiplexer (321), comprising: a select port; M data input ports (320.a, 320.b, 320.c, 320.d, 320.e); and 1 data output port (320.f); and a 1-to-M demultiplexer (341), comprising: a select port; 1 data input port (340.f); and M data output ports (340.a, 340.b, 340.c, 340.d, 340.e); in which: for each interconnect master I: the egress port of interconnect master I is connected to the data input port I of the M-to-1 multiplexer ({350.me, 320.a}, {351.me, 320.b}, {352.me, 320.c}, {353.me, 320.d}, {354.me, 320.e}); and the ingress port of interconnect master I is connected to the data output port I of the 1-to-M demultiplexer ({350.mi, 340.a}, {351.mi, 340.b}, {352.mi, 340.c}, {353.mi, 340.d}, {354.mi, 340.e}); the arbiter and decoder module (360) of the interconnect (319) controls the value supplied on wire (361) to the select port of the M-to-1 multiplexer (321); and the value supplied (on wire 342) to the select port of the 1-to-M demultiplexer (341) is the value supplied to the select port of the M-to-1 multiplexer delayed by the first in first out module (329) for L clock cycles, where the value of L is larger or equal to 3.
The interconnect arbiter and decoder module (360) receives as inputs the control signals, e.g. on wire (362), generated by the 5 interconnect masters (350, 351, 352, 353, 354) that are received on ports (320.a, 320.b, 320.c, 320.d, 320.e) respectively, and the control signals on wire (363) generated by the 1 interconnect target (370) and received on port (340.f). Preferably the scheduling scheme of the interconnect arbiter and decoder module (360) is adapted to consider the state of those control signals (such as the values received on wires (362) and (363)).
The interconnect arbiter and decoder module (360) generates one or more control signals released as output on ports (340.a, 340.b, 340.c, 340.d, 340.e) that are supplied to the 5 interconnect masters' ingress ports (350.mi, 351.mi, 352.mi, 353.mi, 354.mi). The interconnect arbiter and decoder module (360) also generates one or more control signals as outputs (not illustrated) which are released over port (320.f) to the interconnect target's (370) ingress port.
Preferably the arbiter and decoder module (360) of the first timeslot based interconnect (319) employs at least one scheduling scheme selected from the group comprising: a least recently granted interconnect master scheme (see figure 8); a least recently granted interconnect master scheme with rate throttling on at least one interconnect master (see figure 8); a static timeslot scheme (see figure 5); a dynamic timeslot scheme (see figure 2); and a time triggered protocol scheme (see figure 7).

Preferably the shared memory computing architecture (300) is adapted such that: the arbiter of the first timeslot based interconnect (319) is adapted to: grant a first timeslot to one of the M interconnect masters (350, 351, 352, 353, 354); not grant the next timeslot to that interconnect master; and grant one of the later timeslots to that interconnect master; the first interconnect master is adapted to: issue a memory transfer request to a first interconnect target during the first timeslot; and the first interconnect target is adapted to: transmit at least part of its response to the first interconnect master during the later timeslot granted to the first interconnect master.
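The first listed scheme can be sketched as follows. The class and its scan-from-oldest implementation are illustrative assumptions, not figure 8 itself; the property of interest is that granting the least recently granted requester bounds how many slots any requesting master can be starved.

```python
# Sketch (assumed) of a least-recently-granted arbiter: among the masters
# currently requesting, grant the one whose last grant is oldest.

class LeastRecentlyGrantedArbiter:
    def __init__(self, n_masters: int):
        # Lower position = granted longer ago; the initial order is arbitrary.
        self.order = list(range(n_masters))

    def grant(self, requesting):
        for master in self.order:          # scan from least recently granted
            if master in requesting:
                self.order.remove(master)  # move the winner to the
                self.order.append(master)  # most-recently-granted end
                return master
        return None                        # no master is requesting this slot

arb = LeastRecentlyGrantedArbiter(3)
print(arb.grant({0, 1, 2}))  # -> 0 (all equal; initial order decides)
print(arb.grant({0, 2}))     # -> 2 (master 1 is not requesting and is skipped)
print(arb.grant({0, 2}))     # -> 0 (0 is now less recently granted than 2)
```

With N masters, a continuously requesting master is granted at least once every N slots under this policy, which is the kind of bound the rate-throttled variant then tightens per master.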
Preferably at least one interconnect target (370) can receive two or more outstanding memory transfer requests before releasing a memory transfer response related to the first memory transfer request. Preferably at least one interconnect master (350, 351, 352, 353, 354) is adapted to be able to issue two or more outstanding memory transfer requests to that interconnect target (370) before receiving the memory transfer response corresponding to the first memory transfer request to that interconnect target. For example, a processor core may be adapted to concurrently issue a first memory transfer request to retrieve executable code and a second memory transfer request to access data.
Preferably the duration of at least one timeslot of the interconnect (319) is 1 clock cycle in length.
For example, a first timeslot is 1 clock cycle in length, and the second timeslot is 1 clock cycle in length. In an alternate preferred embodiment of the present invention, each timeslot of the interconnect (319) has a variable duration of length that is upper-bound for that timeslot. For example, the duration of the first timeslot is 1 clock cycle and the duration of the second timeslot ranges from 1 to 2 clock cycles in length.
For the remainder of the text describing figure 1, each timeslot of interconnect (319) has a duration of 1 clock cycle in length, the FIFO module (329) releases the value of each input as output 3 clock cycles later, and the sub modules (371), (373) and (372) of module (370) each take 1 clock cycle to process their inputs and generate a corresponding output.
The shared memory computing architecture (300) further comprises an additional 5 interconnect masters (310, 311, 312, 313, 314), each comprising an egress port (310.e, 311.e, 312.e, 313.e, 314.e) and an ingress port (310.i, 311.i, 312.i, 313.i, 314.i). Each of the additional 5 interconnect masters (310, 311, 312, 313, 314) is connected to the interconnect target ports of the 5 interconnect bridges (350, 351, 352, 353, 354) respectively.
The interconnect target (370) is an on-chip shared memory comprising one interconnect target port, in which that target port: is adapted to sustain a peak throughput of one memory transfer request per clock cycle; and is adapted to sustain a peak throughput of one memory transfer response per clock cycle. Preferably at least one memory transfer request can be buffered by one or more of the M unidirectional interconnect bridges. Preferably at least one of the M unidirectional interconnect bridges is adapted to support read pre-fetching and write combining.
In some preferred embodiments, one or more of the M unidirectional interconnect bridges (350, 351, 352, 353, 354) are interconnect protocol transcoding bridges in which the protocol to transcode is a bus interconnect protocol such as ARM AMBA AHB [2].
In some preferred embodiments, at least two of the M unidirectional interconnect bridges (350, 351, 352, 353, 354) are cache modules, in which each of those cache modules is adapted to complete at least one memory transfer request from a cache-line stored in its cache-line store without waiting for that cache module’s time-slot on the timeslot based interconnect (319). In this way, each cache module has the capability to complete memory transfer requests at a rate faster than the worst-case rate that timeslots are granted to that cache module on the timeslot based interconnect (319).
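The hit/miss behaviour described above can be illustrated with a small sketch, assuming a direct 512-bit line store and 32-bit word addressing (the class name, method names, and line geometry here are hypothetical illustrations, not part of the specification): hits complete immediately from the cache-line store, while misses are queued to await a granted timeslot on the shared interconnect.

```python
class CacheBridge:
    """Illustrative cache module on an interconnect bridge.

    Read hits are served at once from the cache-line store; misses are
    queued and must wait for this module's timeslot on the shared
    interconnect, so only misses incur interconnect latency.
    """

    LINE_WORDS = 16  # one 512-bit line = sixteen 32-bit words (assumed)

    def __init__(self):
        self.lines = {}     # line number -> list of 16 words
        self.pending = []   # miss queue, serviced in granted timeslots

    def read(self, word_addr):
        line, offset = divmod(word_addr, self.LINE_WORDS)
        if line in self.lines:
            return ('hit', self.lines[line][offset])
        self.pending.append(line)  # fetched later over the interconnect
        return ('miss', None)

    def fill(self, line, words):
        """Install a 512-bit line fetched over the timeslot interconnect."""
        self.lines[line] = list(words)
```

A filled line then serves any word within it without touching the interconnect, which is the decoupling effect the paragraph describes.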
In some cases the data-path width of the 5 interconnect masters (310, 311, 312, 313, 314) will be less than the data-path width of the 5 cache modules’ interconnect master ports ({350.mi, 350.me}, {351.mi, 351.me}, {352.mi, 352.me}, {353.mi, 353.me}, {354.mi, 354.me}). For example, as illustrated in the block diagram 300 of figure 1, the data-path width of the 5 interconnect masters (310, 311, 312, 313, 314) is 32-bits (301), the data-path width of the timeslot based interconnect (319) is 512-bits (302), and the data-path width of the on-chip memory store (370) is 512-bits (302).
The use of N cache modules (350, 351, 352, 353, 354) connected to the same timeslot based interconnect (319) is highly desirable when performing upper-bound worst case execution time analysis of one or more tasks running in an N processor core (310, 311, 312, 313, 314) architecture. Benefits include improved decoupling of the execution time of N concurrently outstanding memory transfer requests issued by N different cores (310, 311, 312, 313, 314), and masking some of the access time latencies of memory transfer requests addressed to the shared on-chip memory (370) over that timeslot based interconnect (319). Preferably each of those N cache modules (350, 351, 352, 353, 354) has a means for maintaining cache coherency with the N-1 other cache modules (350, 351, 352, 353, 354) with zero unwanted timing interference incurred against the memory transfer requests received on that cache’s interconnect target port.
Figure 1 also illustrates embodiments of the invention in which a shared memory computing architecture (300) comprises: a first clock (not illustrated); M interconnect masters (350, 351, 352, 353, 354), where the value of M is 5; 1 interconnect target (370); a first timeslot based interconnect (319) for transporting memory transfer requests and their corresponding responses, comprising: an input clock port (318) that is connected to the first clock; a unidirectional timeslot based interconnect (320) to transport memory transfer requests with T timeslots, where the value of T is 5; a unidirectional timeslot based interconnect (340) to transport memory transfer responses with R timeslots, where the value of R is 5, in which: for each of the R timeslots, that timeslot: corresponds to one memory transfer request timeslot; and starts at least L clock cycles after the start time of that corresponding memory request timeslot, where the value of L is 3; in which: interconnect target (370) is connected to the first timeslot based interconnect (319); for each interconnect master I of the M interconnect masters (350, 351, 352, 353, 354): each interconnect master I is connected to the first timeslot based interconnect (319); and each of the T timeslots is mappable to a different one of the M interconnect masters.
The shared memory computing architecture (300) further comprises an on-chip random access memory store (370), comprising: an input clock port that is connected to the first clock (not illustrated); and at least one interconnect target port which is connected to the first timeslot based interconnect (319), and in which: each memory transfer request takes at most K clock cycles to complete under fault-free operation, where the value of K is 3; and that target port can sustain a throughput of 1 memory transfer request per clock cycle.
In a preferred embodiment of the present invention the interconnect target (370) comprises: a first delay buffer (371) to delay memory transfer requests; an inner interconnect target (373); a second delay buffer (372) to delay memory transfer responses; in which: the input of the interconnect target (370) is supplied as input to the first delay buffer (371); the output of the first delay buffer (371) is supplied as input to the module (373); the output of the module (373) is supplied as input to the second delay buffer (372); and the output of the second delay buffer (372) is supplied as the output of the interconnect target (370).
In this way, it is possible to transform any interconnect target into an interconnect target that delays its memory transfer requests and memory transfer responses. The same type of approach can be adapted to transform any interconnect master into an interconnect master that delays its memory transfer requests to the interconnect and delays their corresponding responses received from that interconnect.
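The delay-buffer wrapping of modules (371), (373) and (372) can be sketched as a cycle-accurate model (class names and the callable inner-target interface are hypothetical illustrations): each buffer holds a value for a fixed number of cycles, so a target whose inner module completes in 1 cycle presents a total latency of K = 3 cycles, matching the figure 1 numbers.

```python
from collections import deque

class DelayBuffer:
    """Fixed-latency stage: a value fed in appears at the output `delay` cycles later."""

    def __init__(self, delay):
        self.stages = deque([None] * delay)

    def step(self, value):
        out = self.stages.popleft()
        self.stages.append(value)
        return out

class DelayedTarget:
    """Wraps any interconnect target so its requests and responses are each delayed."""

    def __init__(self, inner, req_delay=1, rsp_delay=1):
        self.req_buf = DelayBuffer(req_delay)   # models module (371)
        self.inner = inner                      # models module (373): request -> response
        self.rsp_buf = DelayBuffer(rsp_delay)   # models module (372)

    def step(self, request):
        """Advance one clock cycle; returns a response or None."""
        delayed_req = self.req_buf.step(request)
        response = self.inner(delayed_req) if delayed_req is not None else None
        return self.rsp_buf.step(response)
```

Feeding a request at cycle 1 yields its response at cycle 3, i.e. 1 cycle in each delay buffer plus 1 cycle in the inner target.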
Figure 2 is a flow-chart illustrating the steps in a memory transfer request process (400) from an interconnect master (310) to a memory store (370) of figure 1 according to preferred embodiments of the present invention. In figure 2 the value of L is 3. Each of the interconnect bridges (350) to (354) is adapted to: buffer a single contiguous region of memory that is 512-bits wide; perform 512-bit wide read and 512-bit wide write operations over its master port to the interconnect (319); support write combining of 32-bit write memory transfer requests received over its target port to its 512-bit wide buffer; and support 32-bit wide read memory transfer requests received over its target port to the contents of that 512-bit wide buffer.
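The write-combining behaviour listed above can be sketched as follows. This is an illustrative model under stated assumptions (the class name, the flush policy, and the callback interface are hypothetical, not from the specification): 32-bit writes landing in the same 512-bit-aligned region are merged into one local buffer, and a single wide write is issued over the master port when the bridge must switch regions or is flushed.

```python
class WriteCombiningBridge:
    """Illustrative bridge combining 32-bit writes into one 512-bit buffer."""

    WORDS = 16  # 512 bits / 32-bit words

    def __init__(self, issue_512bit_write):
        self.issue = issue_512bit_write  # callback: (region base, 16 words) -> None
        self.base = None                 # currently buffered 512-bit region
        self.words = [None] * self.WORDS

    def write32(self, word_addr, value):
        base, offset = divmod(word_addr, self.WORDS)
        if self.base is not None and base != self.base:
            self.flush()                 # switching regions forces a wide write
        self.base = base
        self.words[offset] = value       # combine into the local buffer

    def flush(self):
        """Issue the buffered region as one 512-bit write over the master port."""
        if self.base is not None:
            self.issue(self.base, list(self.words))
        self.base = None
        self.words = [None] * self.WORDS
```

Two writes into the same region produce no interconnect traffic until a third write targets a different region, at which point one combined 512-bit write is issued.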
In step 410, start the interconnect master (310) read memory transfer request process.
In step 411, the interconnect master (310) issues a read memory transfer request of 32-bits over the egress port (310.e) to the target port {350.ti, 350.te} of the interconnect bridge (350).
In step 412, the interconnect master (310) waits for and receives the memory transfer response from the interconnect bridge (350) on the ingress port (310.i). This completes the 32-bit read memory transfer request issued in step 411.
In step 413, end the interconnect master (310) read memory transfer request process.
In step 420, start the interconnect bridge (350) memory transfer relay process.
In step 421, the interconnect bridge (350) receives the 32-bit read memory transfer request issued in step 411 on its interconnect target port {350.ti, 350.te}.
In step 422, the interconnect bridge (350) requests a timeslot on the timeslot based interconnect over its interconnect master port {350.mi, 350.me}. This interconnect request signal is transported over wire (362) and received by the interconnect arbiter (360).
In step 423, the interconnect bridge (350) waits one or more clock cycles until it is granted a timeslot on the timeslot based interconnect (319).
In step 424, the interconnect bridge (350) is allotted an upper-bound duration of time within the timeslot to issue its memory transfer request and any associated data. The interconnect bridge (350) issues a 512-bit read memory transfer request over its interconnect master port to the timeslot based interconnect (319).
In step 425, the interconnect bridge (350) waits for the memory transfer request to be processed.
In this particular example, the interconnect bridge (350) does not issue any additional memory transfer requests onto the timeslot based interconnect (319) while waiting for the currently outstanding memory transfer request to be processed.
In step 426, the interconnect bridge (350) is notified by the timeslot based interconnect (319) when the 512-bit wide read memory transfer request response is available. The interconnect bridge is allotted an upper-bound duration of timeslot to receive the response of that memory transfer request. The interconnect bridge (350) receives the response to its memory transfer request and buffers it locally.
In step 427, the interconnect bridge relays the requested 32 bits of data from the 512-bit read memory transfer response over its interconnect target port back to the interconnect master (310).
In step 428, end the interconnect bridge (350) memory transfer relay process.
In step 430, start the timeslot based interconnect (319) memory transfer request cycle.
In step 431, the timeslot based interconnect arbiter and decoder module (360) receives the value on each interconnect request signal of the 5 interconnect bridges (350, 351, 352, 353, 354) connected to the timeslot based interconnect (319).
In step 432, the timeslot based interconnect arbiter and decoder module (360) evaluates the received value from each interconnect request signal according to the policy, configuration and execution history of the currently active arbitration scheme. For example, if the timeslot based interconnect arbiter is currently employing a least recently granted interconnect master scheme, then the least recently granted interconnect master is selected from the set of interconnect masters currently requesting a timeslot on the interconnect (see figure 8). Alternatively, if the timeslot based interconnect arbiter and decoder module (360) is currently using a cyclic timeslot scheduling scheme, then the value on the interconnect request signals does not influence the scheduling of timeslots.
In step 433, the timeslot based interconnect arbiter and decoder module (360) is illustrated as having selected the interconnect bridge (350) for the next timeslot. The timeslot based interconnect arbiter and decoder module (360) signals the interconnect bridge (350) that it has been granted the next timeslot on the interconnect (319). In the next clock cycle, the timeslot based interconnect arbiter adjusts the value of the index to the multiplexer (321) to select the data-path of port (320.a).
In step 434, a copy of the read memory transfer request and associated data is transmitted over the interconnect master port of the interconnect bridge (350) and is received on the data-path of port (320.a).
In step 435, a copy of the read memory transfer request received by the timeslot based interconnect (319) is forwarded to the memory store (370) which is connected to the interconnect target port (320.f) of the timeslot based interconnect (319). For example, the multiplexer (321) forwards the selected information received on its data-path to the target port (320.f).
In step 436, the value supplied to the select input of the multiplexer (321) is delayed (329) for L clock cycles.
In step 437, the value received on the data-path of the target port (340.f) is supplied as input to the data input port of the demultiplexer (341). The select port of the demultiplexer receives the value supplied to the select port of the multiplexer (321) L clock cycles earlier.
In step 438, the value received on target port (340.f) is forwarded to the interconnect bridge (350) and received in step 426.
In step 439, end the timeslot based interconnect (319) memory transfer request cycle.
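Steps 433 to 438 can be condensed into a short simulation of the select-delay path (the function name and the `grants` mapping are hypothetical illustrations): the value driven onto the request multiplexer's select port is delayed L clock cycles through the FIFO and reused as the response demultiplexer's select, so each response is steered back to the master that issued the request L cycles earlier.

```python
from collections import deque

def simulate_routing(grants, L=3, cycles=10):
    """Model the select-delay path of figure 1.

    `grants[c]` is the master index granted the timeslot at cycle c
    (absent = no grant).  Returns (cycle, master) pairs recording when
    each response select fires and which master receives the response.
    """
    fifo = deque([None] * L)   # models the FIFO module (329)
    delivered = []
    for c in range(cycles):
        select = grants.get(c)         # arbiter drives the mux (321) select
        fifo.append(select)
        delayed = fifo.popleft()       # demux (341) select, L cycles later
        if delayed is not None:
            delivered.append((c, delayed))
    return delivered
```

With L = 3, a request granted to master 2 in cycle 0 has its response routed back to master 2 in cycle 3, which is the fixed request-to-response offset the text describes.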
In step 440, start the memory store (370) memory transfer request cycle.
In step 441, memory store (370) receives a 512-bit wide read memory transfer request and delays it in the buffer (371) for 1 clock cycle.
In step 442, the memory store (370) processes the read memory transfer request (373) in 1 clock cycle and delays the memory transfer response output for 1 clock cycle in the buffer (372).
In step 443, the memory store (370) transmits the read memory transfer response.
In step 445, end the memory store (370) memory transfer request cycle.
In a preferred embodiment of the present invention, a snarfing cache module (354) snoops every memory transfer response released as output by the de-multiplexer (341) over wire (343). Preferably each memory transfer response incorporates a copy of its corresponding memory transfer request.
In a preferred embodiment of the present invention, each of the 5 interconnect master ports of the interconnect (319) are connected to a different memory management unit (MMU) (380, 381, 382, 383, 384) respectively. In this way, the 5 MMU (380, 381, 382, 383, 384) provide a means to enforce an access control policy between interconnect masters and the interconnect target from within the interconnect (319).
In an alternate preferred embodiment of the present invention, interconnect node (370) is an interconnect master, interconnect nodes (350) to (354) are protocol transcoding bridges, interconnect nodes (310) to (314) are interconnect targets, and modules (380) to (384) are not used.
Figure 3 is a block schematic diagram illustrating portions of a shared memory computing architecture (500) according to preferred embodiments of the present invention. The shared memory computing architecture (500) comprises: M interconnect masters (540, 541, 542, 543, 544), where the value of M is 5, in which each interconnect master comprises: an egress port (540.me, 541.me, 542.me, 543.me, 544.me); and an ingress port (540.mi, 541.mi, 542.mi, 543.mi, 544.mi); and a first timeslot based interconnect (501) for transporting memory transfer requests and their corresponding responses, comprising: an arbiter and decoder module (510); a M-to-1 multiplexer (521), comprising: a select port; M data input ports (520.a, 520.b, 520.c, 520.d, 520.e); and 1 data output port; and a 1-to-M demultiplexer (531), comprising: a select port; 1 data input port; and M data output ports (531.a, 531.b, 531.c, 531.d, 531.e); in which: for each interconnect master I: the egress port of interconnect master I is connected to the data input port I of the M-to-1 multiplexer ({540.me, 520.a}, {541.me, 520.b}, {542.me, 520.c}, {543.me, 520.d}, {544.me, 520.e}); and the ingress port of interconnect master I is connected to the data output port I of the 1-to-M demultiplexer ({540.mi, 531.a}, {541.mi, 531.b}, {542.mi, 531.c}, {543.mi, 531.d}, {544.mi, 531.e}); the arbiter and decoder module (510) of the interconnect (501) controls the value supplied on wire (511) to the select port of the M-to-1 multiplexer (521); and the value supplied on wire (513) to the select port of the 1-to-M demultiplexer (531) is the value supplied to the select port of the M-to-1 multiplexer delayed by the first in first out module (515) by L clock cycles, where the value of L is 3.
The shared memory computing architecture (500) further comprises: S interconnect targets (560, 561, 562, 563, 564), where the value of S is 5, each interconnect target comprising: an egress port (560.e, 561.e, 562.e, 563.e, 564.e); and an ingress port (560.i, 561.i, 562.i, 563.i, 564.i); in which the first timeslot based interconnect for transporting memory transfer requests and their corresponding responses further comprises: a 1-to-S demultiplexer (522), comprising: a select port; 1 data input port; and S data output ports (520.f, 520.g, 520.h, 520.i, 520.j); and an S-to-1 multiplexer (532), comprising: a select port; S data input ports (530.f, 530.g, 530.h, 530.i, 530.j); and 1 data output port; in which: the data input port of the 1-to-S demultiplexer (522) receives as input the output of the M-to-1 multiplexer (521); the data input port of the 1-to-M demultiplexer (531) receives as input the output of the S-to-1 multiplexer (532); for each interconnect target J: the ingress port of interconnect target J is connected to the data output port J of the 1-to-S demultiplexer ({560.i, 520.f}, {561.i, 520.g}, {562.i, 520.h}, {563.i, 520.i}, {564.i, 520.j}); and the egress port of interconnect target J is connected to the data input port J of the S-to-1 multiplexer ({560.e, 530.f}, {561.e, 530.g}, {562.e, 530.h}, {563.e, 530.i}, {564.e, 530.j}); and the arbiter and decoder module (510) of the interconnect controls the value supplied on wire (512) to the select port of the 1-to-S demultiplexer (522); and the value supplied on wire (514) to the select port of the S-to-1 multiplexer is the value supplied to the select port of the 1-to-S demultiplexer (522) delayed by the first in first out module (516) by L clock cycles.
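The decoder half of module (510) must map each memory transfer request's address to a select value for the 1-to-S demultiplexer. A minimal decode sketch follows (the function name, the address-map tuple layout, and the example ranges are hypothetical illustrations; real address ranges are implementation-specific):

```python
def decode_target(addr, address_map):
    """Address decoder sketch for the 1-to-S demultiplexer (522) select.

    `address_map` is a list of (base, size, target_index) tuples, one
    entry per interconnect target; returns the select value steering the
    request to the matching target, or raises if the address is unmapped.
    """
    for base, size, target in address_map:
        if base <= addr < base + size:
            return target
    raise ValueError("unmapped address 0x%x" % addr)
```

For instance, with two targets mapped at 0x0000 and 0x1000, an access to 0x1800 decodes to the second target's select index.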
In figure 3, the data-path width of the interconnect 501 is 32-bits (599).
The interconnect arbiter and decoder module (510) receives as inputs the control signals (not illustrated) generated by the 5 interconnect masters (540, 541, 542, 543, 544) that are received on ports (520.a, 520.b, 520.c, 520.d, 520.e) respectively and the control signals (not illustrated) generated by the 5 interconnect targets (560, 561, 562, 563, 564) and received on ports (530.f, 530.g, 530.h, 530.i, 530.j). Preferably one or more of the scheduling schemes of the arbiter and decoder module (510) is adapted to consider the state of those control signals.
The interconnect arbiter and decoder module (510) generates one or more control signals as output on ports (530.a, 530.b, 530.c, 530.d, 530.e) that are supplied to the 5 interconnect masters’ ingress ports (540.mi, 541.mi, 542.mi, 543.mi, 544.mi) respectively. The interconnect arbiter and decoder module (510) also generates one or more control signals as outputs (not illustrated) which are supplied over ports (520.f, 520.g, 520.h, 520.i, 520.j) to the ingress ports (560.i, 561.i, 562.i, 563.i, 564.i) of the interconnect targets (560, 561, 562, 563, 564) respectively.
Preferably the arbiter and decoder module (510) of the timeslot based interconnect (501) employs at least one scheduling scheme selected from the group comprising: a least recently granted interconnect master scheme (see figure 8); a least recently granted interconnect master scheme with rate throttling on at least one interconnect master (see figure 8); a static timeslot scheme (see figure 5); a dynamic timeslot scheme; and a time triggered protocol scheme (see figure 7).
Preferably the shared memory computing architecture (500) is adapted such that: the arbiter and decoder module (510) of the first timeslot based interconnect (501) is adapted to: grant a first timeslot to one of the M interconnect masters (540, 541, 542, 543, 544); not grant the next timeslot to that interconnect master; and grant one of the later timeslots to that interconnect master; the first interconnect master is adapted to: issue a memory transfer request to a first interconnect target during the first timeslot; and the first interconnect target is adapted to: transmit at least part of its response to the first interconnect master during the later timeslot granted to the first interconnect master.
Preferably at least one interconnect target (560, 561, 562, 563, 564) can receive two or more outstanding memory transfer requests before releasing a memory transfer response related to the first memory transfer request. Preferably at least one interconnect master (540, 541, 542, 543, 544) can issue two or more outstanding memory transfer requests to that interconnect target before receiving the memory transfer response corresponding to the first memory transfer request from that interconnect target. For example a processor core (540) may concurrently issue a memory transfer request to retrieve executable code and a memory transfer request to access data.
Preferably the duration of at least one timeslot of the first timeslot based interconnect (501) is 1 clock cycle in length. For example, a first timeslot is 1 clock cycle in length, and the second timeslot is 1 clock cycle in length. In an alternate preferred embodiment, each timeslot of the first timeslot based interconnect has a variable duration of length that is upper-bound for that timeslot. For example, the duration of the first timeslot is 1 clock cycle and the duration of the second timeslot ranges from 1 to 2 clock cycles in length.
For the remainder of the text describing figure 3, each timeslot of the interconnect (501) has a duration of 1 clock cycle in length, the FIFO (515) releases the value of each input as output 3 clock cycles later, the FIFO (516) releases the value of each input as output 3 clock cycles later, and the on-chip memory store (560) releases its output after 3 clock cycles. The interconnect target peripherals (561) to (564) take a variable amount of time to generate memory transfer responses to the memory transfer requests they receive.
Figure 4 is a flow-chart illustrating (600) the steps of two different memory transfer requests issued from 2 interconnect masters in the same clock-cycle to two different interconnect targets of figure 3. The value of L is 3, and the interconnect arbiter and decoder module (510) is employing a static round-robin timeslot schedule in which each timeslot has a fixed duration of 1 clock cycle in length according to a preferred embodiment of the present invention. In this pedagogical example, the interconnect masters (540) to (544) are adapted to issue memory transfer requests in the same clock cycle they receive notification of being granted the current timeslot. Furthermore, the interconnect arbiter and decoder module (510) is assumed to already be started and operating.
In clock cycle 1 (601): In step 631, the interconnect arbiter and decoder module (510) grants the current timeslot of the timeslot based interconnect (501) to interconnect master (543). Interconnect master (543) does not issue a memory transfer request.
In step 610, start the memory transfer request process for interconnect master (540).
In step 611, interconnect master (540) requests a timeslot on the timeslot based interconnect (501).
In step 620, start the memory transfer request process for interconnect master (541).
In step 621, interconnect master (541) requests a timeslot on the timeslot based interconnect (501).
In clock cycle 2 (602): In step 632, the interconnect arbiter and decoder module (510) grants the current timeslot of the interconnect (501) to interconnect master (544); that interconnect master does not issue a memory transfer request.
In clock cycle 3 (603): In step 633, the interconnect arbiter and decoder module (510) signals to interconnect master (540) that it has been granted the current timeslot on the interconnect (501). The interconnect arbiter and decoder module sets the value of the select input of the multiplexer (521) to select interconnect master (540). That value is also forwarded to the delay module (515) and is delayed for 3 clock cycles before being forwarded to the select input of demultiplexer (531).
In step 612, the interconnect master (540) issues a memory transfer request addressed to peripheral (562) along with all associated data to the timeslot based interconnect (501) in one clock cycle.
In step 633, the interconnect arbiter and decoder module (510) decodes the address of that memory transfer request, identifies that the memory address corresponds to the address range of the peripheral (562) and sets the value of the select input on the demultiplexer (522) to select peripheral (562). That value is also forwarded to the delay module (516) and is delayed for 3 clock cycles before being forwarded to the select input of multiplexer (532).
In clock cycle 4 (604): In step 634, the interconnect arbiter and decoder module (510) signals to interconnect master (541) that it has been granted the current timeslot on the interconnect (501). The interconnect arbiter and decoder module (510) sets the value of the select input of the multiplexer (521) to select interconnect master (541). That value is also forwarded to the delay module (515) and is delayed for 3 clock cycles before being forwarded to the select input of demultiplexer (531).
In step 622, the interconnect master (541) issues a memory transfer request addressed to peripheral (563) along with all associated data in one clock cycle to the timeslot based interconnect (501).
In step 634, the interconnect arbiter and decoder module (510) decodes the address of that memory transfer request, identifies that the memory address corresponds to the address range of the peripheral (563) and sets the value of the select input on the demultiplexer (522) to select peripheral (563).
In clock cycle 5 (605): In step 635, the interconnect arbiter and decoder module (510) grants the current timeslot of the interconnect to interconnect master (542). Interconnect master (542) does not issue a memory transfer request.
In clock cycle 6 (606): The peripheral (562) generates its memory transfer response to the memory transfer request issued in step 612.
In step 636, the interconnect arbiter and decoder module (510) grants the current timeslot of the interconnect to interconnect master (543). Interconnect master (543) does not issue a memory transfer request. The index to the multiplexer (532) selects peripheral (562), and the demultiplexer (531) selects interconnect master (540), forwarding the entire memory transfer response from the peripheral (562) to interconnect master (540) in one clock cycle.
In step 613 the interconnect master (540) receives the response.
In clock cycle 7 (607): The peripheral (563) generates its response to the memory transfer request issued in step 622.
In step 637, the interconnect arbiter and decoder module (510) grants the current timeslot of the interconnect (501) to interconnect master (544). Interconnect master (544) does not issue a memory transfer request. The index to the multiplexer (532) selects peripheral (563), and the demultiplexer (531) selects interconnect master (541), forwarding the entire memory transfer response from the peripheral (563) to interconnect master (541) in one clock cycle.
In step 623, the interconnect master (541) receives the response.
End of the memory transfer request process for interconnect master (540).
In clock cycle 8 (608): In step 638, the interconnect arbiter and decoder module (510) grants the current timeslot of the interconnect to interconnect master (540). Interconnect master (540) does not issue a memory transfer request.
End of the memory transfer request process for interconnect master (541).
In a preferred embodiment of the present invention, a snarfing cache module (544) snoops every memory transfer response released as output by the de-multiplexer (531) over wire (534).
Preferably each memory transfer response incorporates a copy of its corresponding memory transfer request.
In a preferred embodiment of the present invention, each of the 5 interconnect master ports of the interconnect (501) are connected to a different memory management unit (MMU) (not illustrated) respectively. In this way, the 5 MMU provide a means to enforce an access control policy between interconnect masters and the interconnect target from within the interconnect (501).
It is further preferred that the means to enforce an access control policy is adapted to ensure that no more than one interconnect master (540 to 544) can issue memory transfer requests to a given interconnect target (560 to 564). In this way the access control policy guarantees that a memory transfer request to that interconnect target (560 to 564) will not be delayed by any other interconnect master (540 to 544).
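The exclusivity property above can be checked mechanically. The following sketch (function name and data layout are hypothetical illustrations) takes a master-to-targets access map of the kind an MMU configuration would encode and reports any target reachable from more than one master, i.e. any target for which the no-delay guarantee would not hold:

```python
def check_exclusive_assignment(allowed):
    """Verify the preferred access-control policy: each interconnect
    target is reachable from at most one interconnect master.

    `allowed` maps a master id to the set of target ids that master may
    access.  Returns the set of targets violating the policy.
    """
    seen, violations = {}, set()
    for master, targets in allowed.items():
        for t in targets:
            if t in seen and seen[t] != master:
                violations.add(t)   # shared target: requests could be delayed
            seen[t] = master
    return violations
```

A configuration giving masters (540) and (541) disjoint targets passes; letting both reach target (560) flags that target as a potential source of inter-master timing interference.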
In some cases, for the purpose of increasing the clock-speed of the circuitry, it may be desirable to increase the pipeline depth of the interconnect (501) by adding registers (523) and (533).
In a preferred embodiment of the present invention, each of the M interconnect masters (540, 541, 542, 543, 544) are interconnect bridges.
Figure 5 is a timing diagram illustrating 3 rows of timing events (200) for memory transfer requests (220), their completion times (230) and their response times (240) on a timeslot based interconnect for transporting memory transfer requests generated by a shared memory computing architecture of the type illustrated in figure 3 according to a preferred embodiment of the present invention.
Timeline 210 illustrates 13 timeslots, the duration of each timeslot being 1 clock cycle in length.
Row 220 illustrates the consecutive mapping of 7 interconnect masters (not illustrated) labelled (A) to (G) to 13 timeslots in a statically scheduled round-robin scheme with a period of 7 clock cycles (201). In this illustration each interconnect master continually issues back-to-back blocking read memory transfer requests. By blocking, it is meant that each interconnect master waits for the response of any of its outstanding memory transfer requests before issuing its next memory transfer request. In this illustration, each interconnect master is issuing a memory transfer request to a different interconnect target (not illustrated).
Specifically, row (220) illustrates the timing of memory transfer requests issued on a unidirectional timeslot based interconnect with 7 timeslots as follows: the first memory transfer request is issued by interconnect master (A) at timeslot (220.1); the first memory transfer request is issued by interconnect master (B) at timeslot (220.2); the first memory transfer request is issued by interconnect master (C) at timeslot (220.3); the first memory transfer request is issued by interconnect master (D) at timeslot (220.4); the first memory transfer request is issued by interconnect master (E) at timeslot (220.5); the first memory transfer request is issued by interconnect master (F) at timeslot (220.6); the first memory transfer request is issued by interconnect master (G) at timeslot (220.7); the second memory transfer request is issued by interconnect master (A) at timeslot (220.8); no memory transfer request is issued by interconnect master (B) at timeslot (220.9); the second memory transfer request is issued by interconnect master (C) at timeslot (220.10); the second memory transfer request is issued by interconnect master (D) at timeslot (220.11); the second memory transfer request is issued by interconnect master (E) at timeslot (220.12); and the second memory transfer request is issued by interconnect master (F) at timeslot (220.13).
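The static round-robin ownership in row (220) reduces to a simple modular mapping, sketched below (the function name is a hypothetical illustration): 1-based timeslot n is owned by master (n - 1) mod M, giving each of the M masters exactly one request opportunity per period of M timeslots.

```python
def round_robin_master(timeslot, masters):
    """Static round-robin timeslot ownership as in row 220 of figure 5.

    `timeslot` is 1-based; `masters` is the ordered list of masters
    (here labelled 'A'..'G'), and the schedule repeats every len(masters)
    timeslots.
    """
    return masters[(timeslot - 1) % len(masters)]
```

With masters (A) to (G) this reproduces the row: timeslot 1 belongs to (A), timeslot 7 to (G), timeslot 8 wraps back to (A), and timeslot 9 belongs to (B), which in the illustration issues nothing because it is still blocked on its first request.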
Row 230 illustrates the time at which each memory transfer request completes: no memory transfer requests are completed on timeslots (230.1), (230.2), (230.3) and (230.5); the memory transfer request (220.1) completes at timeslot (230.4); the memory transfer request (220.2) completes at timeslot (230.8); the memory transfer request (220.3) completes at timeslot (230.6); the memory transfer request (220.4) completes at timeslot (230.7); the memory transfer request (220.5) completes at timeslot (230.8); the memory transfer request (220.6) completes at timeslot (230.9); the memory transfer request (220.7) completes at timeslot (230.10); the memory transfer request (220.8) completes at timeslot (230.11); the memory transfer request (220.9) completes at timeslot (230.12); and the memory transfer request (220.10) completes at timeslot (230.13).
Row 240 illustrates the timing of memory transfer responses on a second unidirectional timeslot based interconnect with 7 timeslots: the memory transfer request (220.1) receives its completion response at timeslot (240.4); the memory transfer request (220.2) receives a completion pending response at timeslot (240.5); the memory transfer request (220.2) receives its completion response at timeslot (240.11); the memory transfer request (220.3) receives its completion response at timeslot (240.6); the memory transfer request (220.4) receives its completion response at timeslot (240.7); the memory transfer request (220.5) receives its completion response at timeslot (240.8); the memory transfer request (220.6) receives its completion response at timeslot (240.9); the memory transfer request (220.7) receives its completion response at timeslot (240.10); the memory transfer request (220.8) receives its completion response at timeslot (240.11); there is no memory transfer request issued at (220.9); and the memory transfer request (220.10) receives its completion response at timeslot (240.13).
In this illustration (200), the interconnect targets of interconnect masters (A) and (C) to (G) are guaranteed to complete their memory transfer requests within 3 timeslots (254), whereas the interconnect target of interconnect master (B) is guaranteed to complete its memory transfer request within 6 timeslots (253).
Figure 5 illustrates that the alignment of the memory transfer request timeslots (120) and the memory transfer response timeslots ({220.1, 240.4}, {220.2, 240.5}, {220.3, 240.6}, …) are phase shifted by 3 clock cycles to the right (241). In this case, 9 out of 10 memory transfer responses (240.4, 240.6, 240.7, 240.8, 240.9, 240.10, 240.11, 240.12, 240.13) were not delayed (254) longer than necessary (258), resulting in significantly improved performance when compared to not phase shifting the time between the request timeslot and response timeslots.
Only one (230.B1) of the 13 memory transfer responses (230) was delayed. In this case, it was delayed by 4 clock cycles (257). Advantageously, the idle timeslot (240.5) and the delay of the memory transfer response (230.8) had no impact on the timing of memory transfer requests/responses of any other interconnect masters. Ideally the phase shifting is selected to optimise the round-trip time for the majority of memory transfer requests at the cost of a relatively small increase in latency for the minority.
In this way we have described the timing behaviour of a shared memory computing architecture that comprises: M interconnect masters (A, B, C, D, E, F, G), where the value of M is 7; 7 interconnect targets; a first timeslot based interconnect for transporting memory transfer requests and their corresponding responses, comprising: a unidirectional timeslot based interconnect to transport memory transfer requests (220) with T timeslots, where the value of T is 7 (201); a unidirectional timeslot based interconnect to transport memory transfer responses (240) with R timeslots, in which: for each of the R timeslots, that timeslot: corresponds to one memory transfer request timeslot ({240.4, 220.1}, {240.5, 220.2}, …); and starts at least L clock cycles (241) after the start time of that corresponding memory request timeslot ({220.1, 240.4} through to {220.10, 240.13}), where the value of L is at least 3 and less than the value of T; all 7 interconnect targets are connected to the first timeslot based interconnect; for each interconnect master I of the M interconnect masters (A, B, C, D, E, F, G): each interconnect master I is connected to the first timeslot based interconnect; in which each of the T timeslots (220.1, 220.2, 220.3, 220.4, 220.5, 220.6, 220.7) is mappable to a different one of the M interconnect masters (A, B, C, D, E, F, G).
Furthermore, figure 5 illustrates that the value of R (which is 7) equals the value of T (which is 7), and each of the T memory transfer request timeslots (220.1, 220.2, 220.3, 220.4, 220.5, 220.6, 220.7) on the first timeslot based interconnect has a corresponding memory transfer response timeslot (240.4, 240.5, 240.6, 240.7, 240.8, 240.9, 240.10) of the same length (1 clock cycle) on that interconnect.
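The request/response timeslot correspondence of figure 5 reduces to simple arithmetic. The sketch below is my illustration (function name `response_slot` is an assumption, not the patent's terminology): with T = R = 7 and a phase shift of L = 3 clock cycles (241), the response timeslot for a request issued in 1-indexed slot t begins at cycle t + L.

```python
# Illustrative model of the phase-shifted response timeslots of figure 5.
T = 7  # request timeslots per period
L = 3  # phase shift (241) in clock cycles; at least 3 and less than T

def response_slot(request_slot: int) -> int:
    """1-indexed clock cycle at which the response timeslot corresponding
    to `request_slot` begins, shifted right by L cycles."""
    return request_slot + L

# Request timeslot 220.1 pairs with response timeslot 240.4, and so on:
pairs = [(t, response_slot(t)) for t in range(1, T + 1)]
```

This matches the pairs listed above ({220.1, 240.4}, {220.2, 240.5}, …) and makes explicit why an undelayed response arrives exactly 3 cycles after its request slot.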
Figure 6 is a block schematic diagram illustrating portions of a shared memory computing architecture (700), employing embodiments of figure 3 according to a preferred embodiment of the present invention. The shared memory computing architecture (700) comprises: a first system interconnect (720) of the type described in figure 3; an on-chip random access memory store (761) comprising at least one interconnect target port ({761.i1, 761.e1}, {761.i2, 761.e2}), in which the first interconnect target port {761.i1, 761.e1} is connected to the first system (720) interconnect; at least two sub-computing devices (730, 740), in which: the first (730) of the at least two sub-computing devices (730, 740) comprises: a first local interconnect (710) comprising: a unidirectional interconnect (711) for transporting memory transfer requests; and a unidirectional interconnect (712) for transporting the corresponding memory transfer responses; a first interconnect master (731) connected to a local interconnect (710) of the sub-computing device; a unidirectional interconnect bridge {733.a, 733.b} comprising two ports, in which: the first port is connected to the first system interconnect (720); the second port is connected to a local interconnect (710) of the sub-computing device; and in which the first interconnect master (731) is adapted to issue memory transfer requests to the on-chip random access memory store (761) over the unidirectional interconnect bridge {733.a, 733.b}; and the second (740) of the at least two sub-computing devices (730, 740) comprises: a first local interconnect (715) comprising: a unidirectional interconnect (716) for transporting memory transfer requests; and a unidirectional interconnect (717) for transporting the corresponding memory transfer responses; a first interconnect master (741) connected to a local interconnect (715) of the sub-computing device; and a unidirectional interconnect bridge {743.a, 743.b} comprising two ports, in which: the first port is connected to the first system interconnect (720); and the second port is connected to a local interconnect of the sub-computing device (715); and in which the first interconnect master (741) is adapted to issue memory transfer requests to the on-chip random access memory store (761) over the unidirectional interconnect bridge {743.a, 743.b}; and a first peripheral (751), comprising: a first interconnect target port (751.t1) which is connected to the first local interconnect (710) of the first (730) of the at least two sub-computing devices (730, 740); and a first interconnect master port (751.m1) which is adapted to issue memory transfer requests to the on-chip random access memory store (761); in which: the first interconnect master (731) of the first (730) of the at least two sub-computing devices (730, 740) is adapted to issue memory transfer requests to the first peripheral (751).
The first peripheral (751) of the shared memory computing architecture (700) further comprises: a second interconnect target port (751.t2) which is connected to the first local interconnect (715) of a second (740) of the at least two sub-computing devices (730, 740); and the first interconnect master (741) of the second (740) of the at least two sub-computing devices (730, 740) is adapted to issue memory transfer requests to the first peripheral (751).
The shared memory computing architecture (700) further comprises: a second peripheral (752), comprising a first interconnect target port (752.t1) which is connected to the first system interconnect (720); in which the first interconnect master (731, 741) of at least two (730, 740) of the at least two sub-computing devices (730, 740) is adapted to issue memory transfer requests to the second peripheral (752).
The first peripheral (751) of the shared memory computing architecture (700) further comprises a first interconnect master (751.m1) which is adapted to issue memory transfer requests to the on-chip random access memory (761) over the interconnect (720).
The multiprocessor interrupt controller (771) with software mappable interrupt lines is adapted to map one or more interrupt lines between each peripheral (751, 752) and one or more interconnect masters (731, 741). The multiprocessor interrupt controller has a dedicated interconnect target port (772, 773) for each of the at least two sub-computing devices (730, 740).
Preferably, the private memory store (732) is connected as an interconnect target to the local interconnect (710) of the sub-computing device (731). Preferably, each port of the dual-port time-analysable memory controller and off-chip memory store (762) is connected as an interconnect target to the timeslot based interconnect (720).
Preferably, the timer module (742) has an interconnect target port which is connected to interconnect (715) of the sub-computing device (740) and can generate an interrupt which is exclusively received (not illustrated) by interconnect master (741).
In figure 6 the interconnect master (731) can issue memory transfer requests over interconnect (710) to interconnect target (732) and to the interconnect target port of the interconnect bridge {733.a, 733.b} leading to the timeslot based interconnect (720). This capability permits scaling of the number of interconnect target devices accessible by the interconnect master (731) in a statically time-analysable manner without increasing the number of time-slots on one or more timeslot based interconnects (720). This also permits frequent, latency sensitive, memory transfer requests from (731) to be serviced by an interconnect target device (732) without incurring the multi-master arbitration latency penalties that are present on the timeslot based interconnect (720).
Preferably the first system interconnect (720) is a timeslot based interconnect. A desirable property of connecting the interconnect master peripherals (751, 752) directly to the timeslot based interconnect (720) is that it becomes trivially easy to calculate the upper-bound latency of their memory transfer requests and the peak bandwidth that can be sustained to the on-chip memory (761).
Preferably, the shared memory computing device (700) of figure 6 comprises a means, such as the use of memory management units (not illustrated), to enforce an access control policy that limits which interconnect masters ({733.a, 733.b}, {743.a, 743.b}, 751, 752) can issue memory transfer requests to which interconnect targets ({752.t1}, 761).
In an alternate preferred embodiment, the shared memory computing architecture (700) further comprises a second system interconnect (799) in which: the on-chip random access memory store (761) has at least two interconnect target ports ({761.i1, 761.e1}, {761.i2, 761.e2}); the second interconnect target port {761.i2, 761.e2} of the random access memory store (761) is connected to the second system interconnect (799); the first interconnect master port of the first peripheral is disconnected from the first system interconnect (720) and connected to the second system interconnect (799); and the first interconnect master port of the second peripheral is disconnected from the first system interconnect (720) and connected (not illustrated) to the second system interconnect (799).
Figure 7 is a block diagram illustrating a static timeslot schedule (810) with a cycle of 24 fixed timeslots (801 to 824) of 1 clock cycle each that rotate logically left (850) by 1 entry every clock cycle for preferred embodiments of the present invention. The 4 interconnect masters (1, 2, 3, 4) are scheduled once every second timeslot (801, 803, 805, …), such that each interconnect master is scheduled once every eight timeslots. For example, interconnect master 1 is scheduled in timeslots (801, 809, 817). The value (illustrated as corresponding to interconnect master 1) in element (825) is used by the arbiter and decoder module to control which interconnect master is granted access to a given timeslot based interconnect. The 12 interconnect master peripherals (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) are scheduled once every second timeslot, such that each of those 12 interconnect master peripherals is scheduled once every 24 timeslots. In this way, the 4 interconnect masters (1, 2, 3, 4) are granted higher-frequency access, and thus proportionally more bandwidth, than the other 12 interconnect master peripherals. This particular scheduling scheme is well suited to managing 4 processor cores along with 12 interconnect master peripherals on one timeslot based interconnect, such as interconnect (720) of figure 6. Clearly each interconnect master peripheral (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) must be able to buffer data to write without loss for up to 24 clock cycles.
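The 24-entry schedule described above can be sketched as follows. This is my own construction rather than a verbatim description of figure 7 (the helper names `build_schedule` and `granted_master` are illustrative): the 4 processor-core masters occupy every second entry, cycling so that each recurs every 8 entries, and the 12 peripherals fill the remaining entries, each recurring every 24 entries.

```python
# Sketch of a 24-entry static timeslot schedule: cores 1-4 in every second
# entry (each every 8 slots), peripherals 5-16 in the remaining entries
# (each every 24 slots). The table rotates left by one entry per cycle.
CORES = [1, 2, 3, 4]
PERIPHERALS = list(range(5, 17))

def build_schedule() -> list[int]:
    schedule = []
    for i in range(12):
        schedule.append(CORES[i % 4])    # entries 801, 803, 805, ...
        schedule.append(PERIPHERALS[i])  # entries 802, 804, 806, ...
    return schedule

def granted_master(schedule: list[int], cycle: int) -> int:
    """Master granted the interconnect at `cycle`, modelling the
    left rotation (850) of one entry per clock cycle."""
    return schedule[cycle % len(schedule)]
```

With this construction master 1 appears at indices 0, 8 and 16 (corresponding to timeslots 801, 809 and 817 above), and each peripheral appears exactly once per 24-slot cycle.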
Figure 8 is a block diagram that illustrates a least recently granted (LRG) interconnect master scheme with 16 time-slots of 1 clock cycle each according to a preferred embodiment of the present invention. Region (860) illustrates the value of the 16 timeslots (861 to 876) in the first clock cycle and region (880) illustrates the value of the same 16 timeslots (881 to 896) in the second clock cycle. The LRG scheme ensures that if all 16 interconnect masters are concurrently issuing memory transfer requests at an equal rate, then each interconnect master is granted an equal number of timeslots on the interconnect. On the other hand, if fewer than 16 interconnect masters are concurrently issuing memory transfer requests, then the available bandwidth is opportunistically allocated to the active interconnect masters. In figure 8, at the start of the first clock-cycle (860) interconnect masters 4 (864), 9 (869), 10 (870), and 12 (872) have issued memory transfer requests and are waiting to be granted on the timeslot based interconnect. In this case, the least recently granted interconnect master was interconnect master 12 (872), and that interconnect master is granted access to the current timeslot on the timeslot based interconnect. At the start of the next clock cycle (880), interconnect master 12 (881) is placed at the start of the queue, interconnect masters 1 to 11 (861, …, 871) age by one clock cycle (882, …, 892), and interconnect master 6 (886) issues a memory transfer request to the timeslot based interconnect. In this clock cycle (880), the least recently granted interconnect master with a pending memory transfer request is 10 (891), and it is granted access to the current timeslot of the timeslot based interconnect.
In a further preferred embodiment of the present invention, a rate limiting counter is associated with each of the 16 interconnect masters, for example counter (897) for interconnect master 12 (881). The rate limiting counter decreases by one each clock cycle, stopping at zero. When the timeslot based interconnect is reset, each interconnect master is assigned a value indicating how many clock cycles must pass before that interconnect master may be granted the timeslot based interconnect by the arbiter after having completed a memory transfer request. This rate-limiting capability can be used to reduce power consumption (by reducing the number of reads and/or writes to the shared memory store) and to ensure higher-bandwidth or higher-frequency devices have greater opportunity to be granted the timeslot based interconnect.
Figure 9 is a table illustrating an access control list (ACL) (900) for 8 interconnect masters (902) and 16 interconnect targets (901) connected to a timeslot based interconnect, for preferred embodiments of the present invention. The label ‘X’ indicates that a specific interconnect master may access a specific interconnect target, while its absence indicates that access is prohibited.
Figure 9 illustrates an access control list policy (900) that has been configured in such a way that no more than one interconnect master can issue memory transfer requests to a given interconnect target on that timeslot based interconnect. For example the first interconnect target (910) may only be accessed by the third interconnect master (922), as illustrated by the label ‘X’ (in column 1, row 3) and its absence in every other column of row 3.
An interconnect master may be permitted by the ACL policy to access more than one interconnect target. For example the first interconnect master (920) is permitted to issue memory transfer requests to the second (911) and fourth (913) interconnect targets. Furthermore, an interconnect master may not be permitted to issue memory transfer requests to any interconnect target peripherals on that interconnect, as illustrated by row five of the table for the fifth interconnect master (924).
Figure 9 can be encoded as a 1 dimensional array of 64 bits in length, partitioned into 16 elements (one for each interconnect target), each element being 4 bits in length and indicating which one of the up to 16 interconnect masters may access it.
Preferably, the ACL policy is adapted to be dynamically adjusted at runtime by supervisor software, such as a hypervisor or operating system, in response to the set of currently active tasks. Preferably there are two levels of ACL policy: a first ACL policy specifying which set of interconnect masters are permitted to be mapped to any given interconnect target, and a second ACL policy that selects which (if any) one of those interconnect masters is currently assigned to any given interconnect target. This permits system-level supervisory software to set system level ACL constraints, while permitting each sub-computing device to independently select a valid ACL configuration from the permissible sub-set of all possible configurations for that sub-computing device.
Figure 10 is a hybrid block schematic diagram illustrating the allocation/partitioning (1100) of memory ((761) of figure 6), and the timing of interconnect masters, specifically software tasks (1120) running on processor cores and peripherals (1130), accessing that memory ((761) of figure 6) according to embodiments of the type of figure 3 and figure 6. In this illustration the width of the timeslot based interconnect ((720) of figure 6) is 1024 bits, and each of the elements of memory in memory store (761) is also 1024 bits in length.
Logical partition 1101 illustrates two elements of memory store (761) allocated to store the content of a network packet for a peripheral that performs operations on 2048-bit long packets.
Logical partition (1102) shows 6 elements of memory store (761) allocated for use by memory transfer requests issued by at least one interconnect master port of that peripheral. Logical partitions (1103) and (1104) are allocated 2 elements of memory store (761) which are used as end-of-queue buffers, so that while one packet is being written into one of the two logical partitions, the other packet in the other logical partition is being transferred to an independent (possibly off-chip) memory. This permits the head-of-queue packets to be stored in SRAM store (761) while still having buffers allocated for receiving and offloading packets as they arrive from that peripheral to an independent memory.
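The end-of-queue double-buffering scheme of partitions (1103) and (1104) can be sketched abstractly. The class below is a minimal illustrative model (names are mine, not the patent's): one buffer receives the incoming packet while the other is drained to an independent memory, and the roles swap.

```python
# Minimal model of double-buffered end-of-queue partitions: while one
# buffer receives a packet, the other buffer's packet is drained to an
# independent (possibly off-chip) memory.
class DoubleBuffer:
    def __init__(self):
        self.buffers = [[], []]  # modelling partitions (1103) and (1104)
        self.write_idx = 0       # buffer currently receiving a packet

    def receive(self, packet: bytes) -> None:
        """Write an arriving packet into the currently active buffer."""
        self.buffers[self.write_idx] = [packet]

    def swap_and_drain(self) -> list:
        """Swap buffer roles; return the packet(s) to transfer off-chip."""
        drain_idx = self.write_idx
        self.write_idx ^= 1
        drained, self.buffers[drain_idx] = self.buffers[drain_idx], []
        return drained
```

Because reception and draining target different partitions, neither activity stalls the other, which is the property the text relies on.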
Logical partition (1105) illustrates 12 elements of memory assigned to 12 time-slots of a time-triggered protocol with variable length timeslots of up to 1024 bits in length.
Logical partitions (1107, 1108, 1109, 1110, 1111) are assigned to a single network peripheral that has 5 virtual ports. Each of those 5 logical partitions may be assigned exclusively to a different processor core and/or operating system instance and/or communications session. In preferred embodiments of the present invention the number of virtual queues, and the length of each virtual queue assigned to a peripheral, is dynamically set at boot up, and those preferences are communicated to the peripheral over its interconnect target port, or a region in (1100) storing configuration data.
Logical partition (1112) is left unallocated.
Logical partition (1113) is allocated for sending and receiving messages between two RTOS instances running on a first processor core and a second processor core. Preferably, the two RTOS instances are configured to further sub-partition that space.
Timeline 1119 illustrates four ({1121, 1123}, {1123, 1125}, {1125, 1127}, {1127, 1129}) time and space (T&S) partitions for software tasks (1122, 1124, 1126) illustrated in region (1120). A first task (1122) operates in the first T&S partition {1121, 1123} on processor core (731), a second task (1124) operates in a second T&S partition on processor core (731), and a third task (1126) operates in a third T&S partition on processor core (731). With regard to peripheral activity (1130), a peripheral ((752) of figure 6) receives a packet transmitted to it over a public wide-area network, and writes that packet into its allocated partition. Due to unknown latencies introduced at run time by competing traffic over the public wide-area network, it is not possible to accurately predict at what time that packet will arrive. That packet is processed by the task (1126) in the third T&S partition, and a new packet of data is generated by that task (1126) and written into partition (1105). The interconnect master port of that peripheral (752) accesses the partition (1105) to retrieve that new packet so that it can be transmitted over the wide area network. The tasks (1122, 1124, 1126) all access memory (1100) during their allocated timeslots.
Advantageously, when the timeslot based interconnect (720) is running a fixed time-slot scheduling scheme, the reception (1131) and transmission (1132) of packets results in no unwanted/uncontrolled timing interference for the memory transfer requests issued by processor core (731) to (732). As there is no uncontrolled timing interference, static worst case execution time analysis of tasks running on core (731) can be achieved with tighter bounds than with conventional multi-core architectures in which multiple processor cores and interconnect master peripherals are permitted work-preserving access to SDRAM. When the timeslot based interconnect is running in a least recently granted interconnect master mode without rate limiters, the timing interference is upper bound to the equivalent of a static timeslot scheduling scheme with one timeslot per interconnect master.
Advantageously, the 1024-bit wide SRAM (720) offers exceptionally high bandwidth when compared to a 64-bit wide double-data-rate off-chip SDRAM channel operating at comparable clock-speeds. It is possible to use the relatively high aggregate bandwidth of the SRAM (720) to ensure that every peripheral has sufficient bandwidth to operate at its (off-chip I/O) wire-speed, even in a static timeslot scheduled environment servicing multiple interconnect masters. This approach tends to significantly increase the total effective usable memory bandwidth within a computing device. For example, in many cases, a packet sent or received by a peripheral may not ever have to be written to the relatively low-bandwidth off-chip memory store.
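The bandwidth comparison above can be made concrete with back-of-envelope arithmetic. This is my own calculation under stated assumptions (single-data-rate SRAM port, ideal DDR channel, identical clock frequency), not a figure from the patent.

```python
# Peak bandwidth comparison at equal clock frequency:
# 1024-bit wide single-data-rate on-chip SRAM port vs a 64-bit wide
# double-data-rate (2 transfers/cycle) off-chip SDRAM channel.
def peak_bits_per_cycle(width_bits: int, transfers_per_cycle: int) -> int:
    return width_bits * transfers_per_cycle

sram = peak_bits_per_cycle(1024, 1)  # 1024 bits per clock cycle
sdram = peak_bits_per_cycle(64, 2)   # 128 bits per clock cycle (DDR)
ratio = sram // sdram                # on-chip port has 8x the peak bandwidth
```

Even under these idealised assumptions the wide on-chip port carries 8 times the peak bandwidth, which is what allows every peripheral a wire-speed share in a static timeslot schedule.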
Figure 11 is a block schematic diagram illustrating portions of a shared memory computing architecture (1300) optimised for bounded worst case execution time, employing embodiments of figures 1 and 3 according to a preferred embodiment of the present invention.
The shared memory computing architecture (1300) comprises: a first system interconnect (1350) of the type described in figure 1; an on-chip random access memory store (1370) comprising two interconnect target ports, in which the first interconnect target port is connected to the first system (1350) interconnect; at least two sub-computing devices (1330, 1340), in which: the first (1330) of the at least two sub-computing devices (1330, 1340) comprises: a first local interconnect (1310) comprising: a unidirectional interconnect (1311) for transporting memory transfer requests; and a unidirectional interconnect (1312) for transporting the corresponding memory transfer responses; a first interconnect master (1331) connected to a local interconnect (1310) of the sub-computing device; a unidirectional interconnect bridge {1351.a, 1352.a} comprising two ports, in which: the first port is connected to the first system interconnect (1350); the second port is connected to a local interconnect (1310) of the sub-computing device; and in which the first interconnect master (1331) is adapted to issue memory transfer requests to the on-chip random access memory store (1370) over the unidirectional interconnect bridge {1351.a, 1352.a}; the second (1340) of the at least two sub-computing devices (1330, 1340) comprises: a first local interconnect (1315) comprising: a unidirectional interconnect (1316) for transporting memory transfer requests; and a unidirectional interconnect (1317) for transporting the corresponding memory transfer responses; a first interconnect master (1341) connected to a local interconnect (1315) of the sub-computing device; and a unidirectional interconnect bridge {1351.b, 1352.b} comprising two ports, in which: the first port is connected to the first system interconnect (1350); the second port is connected to a local interconnect of the sub-computing device (1315); and in which the first interconnect master (1341) is adapted to issue memory transfer requests to the on-chip random access memory store (1370) over the unidirectional interconnect bridge {1351.b, 1352.b}.
The shared memory computing architecture (1300) further comprises: an on-chip random access memory store (1370) comprising at least two interconnect target ports, in which: the first port: has a data path of D-bits in width, the value of D being equal to 128; is adapted to sustain a throughput of one D-bit wide memory transfer request per clock cycle; and is adapted to sustain a throughput of one D-bit wide memory transfer response per clock cycle; and the second port: has a data path of E-bits in width, the value of E being equal to 16; is adapted to sustain a throughput of one E-bit wide memory transfer request per clock cycle; and is adapted to sustain a throughput of one E-bit wide memory transfer response per clock cycle; a first on-chip shared memory interconnect (1350) of the type described in figure 1 which: has a data path of D-bits in width; is exclusively connected to the first port of the at least two interconnect target ports of the on-chip random access memory (1370); is adapted to sustain a throughput of one D-bit wide memory transfer request per clock cycle to the on-chip random access memory (1370); is adapted to sustain a throughput of one D-bit wide memory transfer response per clock cycle; and has at least two cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) connected to it, each cache module comprising: a master port with a D-bit wide data path which is connected to this interconnect (1350); and a target port; and a second on-chip shared memory interconnect (1360) of the type described in figure 1 which: has a data path of E-bits in width; is exclusively connected to the second port of the at least two interconnect target ports of the on-chip random access memory (1370); is adapted to sustain a peak throughput of one E-bit wide memory transfer request per clock cycle to the on-chip random access memory (1370); is adapted to sustain a peak throughput of one E-bit wide memory transfer response per clock cycle; and has at least two interconnect masters (1381, 1382) connected to it.
Preferably the dual-port on-chip random access store (1370) is internally comprised of 8 dual-port 16-bit wide on-chip random access stores arranged in parallel. The first port is adapted to receive memory transfer requests with data lengths ranging from 16 to 128 bits, in multiples of 16 bits. The second port is adapted to receive 16-bit memory transfer requests. This configuration is well suited to cost effectively creating a memory store that can sustain the wire-speed bandwidth requirements of a relatively large number of lower bandwidth peripherals while permitting interconnect masters (1331) and (1341) relatively high-bandwidth, low-latency access to that data.
In an alternate preferred embodiment of the present invention, the value of D is equal to 256, the value of E is equal to 256, and the dual-port on-chip random access store (1370) is internally comprised of 16 dual-port 32-bit wide on-chip random access stores arranged in parallel. This configuration is well suited to supporting the wire speed of higher bandwidth peripherals.
Preferably both the first (1350) and second (1360) on-chip shared memory interconnects employ timeslot based arbitration schemes; and at least two timeslots of the first on-chip shared memory interconnect each have a timeslot length of one clock cycle.
It is further preferred that both interconnects (1350) and (1360) only employ timeslots that have a duration of 1 clock cycle, and in which the data-path width is adapted so that it is sufficiently wide to transmit an entire memory transfer request and/or its corresponding memory transfer response in 1 clock cycle. This latter configuration is particularly desirable when compared against a configuration in which both interconnects employ timeslots of 2 clock cycles, a configuration which would double the worst case access latency for an interconnect master directly connected to the interconnect seeking to gain access to a timeslot. To place this result in context, several commercial off-the-shelf average case execution time optimised multi-core computer architectures employ bus protocols, such as AMBA AHB 2, which permit memory transfer requests to block the bus for well over 10 clock cycles.
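The doubling effect can be illustrated with simple arithmetic. This is my own worked model under the assumption of a pure round-robin timeslot schedule (the formula and names are illustrative, not from the patent): a master may have to wait out every other master's slot plus the remainder of the current slot before its own slot arrives.

```python
# Worst case wait for a master's next timeslot on a round-robin
# timeslot interconnect: every other master's slot plus the remainder
# of the slot in progress, i.e. n * slot_len - 1 cycles.
def worst_case_wait_cycles(n_masters: int, slot_len_cycles: int) -> int:
    return n_masters * slot_len_cycles - 1

one_cycle_slots = worst_case_wait_cycles(16, 1)  # 15 cycles
two_cycle_slots = worst_case_wait_cycles(16, 2)  # 31 cycles: roughly double
```

With 16 masters, moving from 1-cycle to 2-cycle timeslots raises the worst case wait from 15 to 31 cycles, which is the doubling the text refers to.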
This latter configuration, in which each timeslot is 1 clock cycle in length, is extremely desirable even if one or more of the interconnect masters cannot sustain high rates of memory transfer requests. This is because this configuration achieves the lowest worst case access latencies at the point of contention between interconnect masters.
The computing architecture (1300) further comprises: at least one processor core (1331, 1341); a peripheral (1383), comprising: a first interconnect target port (1381.t1) which is connected by wires (1384, 1385) to the first on-chip shared memory interconnect (1350); and a first interconnect master port (1381.m1) which is connected to the second on-chip shared memory interconnect (1360); in which: at least one (1331, 1341) of the at least one processor cores (1331, 1341) can issue a memory transfer request over the first on-chip shared memory interconnect (1350) to the peripheral (1383); the peripheral (1383) can store data in the on-chip random access memory over the second system interconnect (1360); and the at least one (1331, 1341) of the at least one processor cores (1331, 1341) can read that data.
The computing architecture (1300) further comprises: a first peripheral interconnect (1355) of the type described in figure 3 for transporting memory transfer requests and their corresponding responses; a peripheral (1381), comprising: a first interconnect target port (1381.t1) which is connected to the first peripheral interconnect (1355); a second interconnect target port (1381.t2) which is connected to the first peripheral interconnect (1355); and a first interconnect master port (1381.m1) which is connected to one (1360) of the at least two on-chip shared memory interconnects (1350, 1360); in which: at least one of the at least one processor cores (1331, 1341) can issue a memory transfer request over the first peripheral interconnect (1355) to the peripheral (1381); the peripheral (1381) can store data in the on-chip random access memory (1370) over the second system interconnect (1360); and the at least one of the at least one processor cores (1331, 1341) can read that data.
Preferably the peripheral interconnect is adapted to transport each memory transfer request in 1 clock cycle and each corresponding memory transfer response in 1 clock cycle. Preferably the data-path width of the peripheral interconnect (1355) is less than the data-path width of the on-chip shared memory interconnects (1350, 1360).
Preferably there is a second peripheral interconnect (not illustrated) adapted to enable the processor cores (1331, 1341) to communicate with peripherals that do not have an interconnect master interface. The use of a second peripheral interconnect for peripherals that do not have interconnect master interfaces is particularly advantageous because it permits many relatively low bandwidth peripherals to be placed and routed on the chip some distance away from the memory store (1370) which is used by relatively high bandwidth interconnect-master peripherals.
The computing architecture (1300) further comprises: a peripheral (1382), comprising: a first interconnect target port (1382.t1) which is connected to the first peripheral interconnect (1355); a first interconnect master port (1382.m1) which is connected to one (1360) of the at least two on-chip shared memory interconnects; in which: at least one of the at least one processor cores (1331, 1341) can issue a memory transfer request over the first peripheral interconnect (1355) to the peripheral (1382); the peripheral (1382) can store data in the on-chip random access memory (1370) over the second system interconnect (1360); and the at least one of the at least one processor cores (1331, 1341) can read that data.
Preferably the two interconnect bridges ({1351.a, 1352.a}, {1351.b, 1352.b}) are cache modules.
The use of cache modules is highly desirable as it permits interconnect masters with relatively narrow data path widths, such as 32-bit processor cores (1331, 1341), to take better advantage of interconnects (1350) and shared on-chip memories (1370) with relatively wide data paths (e.g. 128-bit). For example, if there are sixteen 32-bit processor cores, in which each core has a private cache module that is attached to the same interconnect (1350), increasing the data-path width of that interconnect (1350) from 128-bit to 512-bit or higher increases the amount of data prefetched by read memory transfer requests issued by each cache module to that interconnect (1350). This in turn tends to result in improved masking of the worst case 16 clock cycle access latencies between 2 consecutive memory transfer requests issued by a cache module to that shared memory (1370) over that interconnect (1350) for that cache's processor core.
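The relationship described above can be shown with simple arithmetic. This is an illustrative sketch only; the widths below follow the 128-bit example in the text, and the wider values are the kind of widening the paragraph contemplates, not figures from the claims.

```python
# Each read memory transfer request returns one interconnect-width beat,
# so widening the data path increases the data prefetched per request.
for width_bits in (128, 256, 512):
    bytes_per_fill = width_bits // 8
    print(f"{width_bits}-bit data path -> {bytes_per_fill} bytes per read request")
```

With a 16-cycle worst-case gap between two consecutive grants for one cache module, each widening step doubles the data fetched per grant, which is the masking effect the paragraph describes.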
Preferably at least 2 of the at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) which are connected to the first on-chip shared memory interconnect (1350) maintain cache-coherency with each other ({1351.a, 1352.a}, {1351.b, 1352.b}) with zero timing interference to unrelated memory transfer requests received on the target port of those at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}). These properties simplify the worst case execution time analysis of tasks running on cores (1331, 1341) that access their private cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}).
Preferably at least 2 of the at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) which are connected to the first on-chip shared memory interconnect (1350) operate in a cache-coherency group that maintains cache-coherency between each other and also maintains cache coherency against the write memory transfer requests (1399) issued to at least one of the other ports of the on-chip random access memory (1370). For example, in a 16 core system (1331, 1341, …) with 64 interconnect-master peripherals (1381, 1382, 1383, …), a cache-coherency group could include 2 out of 16 processor cores, and 10 out of 64 interconnect-master peripherals. This reduces the upper-bound rate of cache coherency traffic that must be processed by the cache modules for those 2 cores, resulting in significant power savings and lower-cost look-up mechanisms in the cache modules. E.g. this cache coherency group would only need to sustain looking up to 12 memory transfer requests every 16 clock cycles instead of looking up to 32 memory transfer requests every 16 clock cycles.
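The traffic figures in the example above can be checked with a back-of-envelope calculation. The interpretation below (one request per member per 16-cycle rotation, and 32 as two memory ports each carrying one request per cycle) is an assumption made for this sketch.

```python
# Coherency group: 2 cores + 10 interconnect-master peripherals, each
# owning one slot in a 16-clock-cycle rotation.
GROUP_CORES, GROUP_PERIPHERALS = 2, 10
ROTATION_CYCLES = 16

group_lookups = GROUP_CORES + GROUP_PERIPHERALS
print(group_lookups)  # 12 snoop look-ups per 16-cycle rotation

# Without grouping: both ports of the dual-port memory (1370) can carry
# one request per cycle, so the full stream is 2 * 16 per rotation.
full_lookups = 2 * ROTATION_CYCLES
print(full_lookups)  # 32
```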
Preferably at least 2 of the at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) which are connected to the first on-chip shared memory interconnect (1350) operate in a cache-coherency group that maintains cache-coherency between each other and are update type of caches that snarf each other's write requests. This is particularly advantageous when performing worst case execution time (WCET) analysis of tightly coupled tasks in shared memory architectures. Let us consider the situation in which the first core (1341) releases a resource lock and the second core (1331) requests that same resource lock. The cache snarfing mechanisms can be adapted to guarantee that all write requests issued by the core (1341) before that core (1341) released the resource lock are processed by the snarfing cache of core (1331) before that core (1331) is granted that shared resource lock. This ensures that each cache-line that was present in the cache of core (1331) before that core (1331) requested a shared memory resource lock is coherent with the write memory transfer requests issued by core (1341). This then avoids the need to consider which cache-lines, if any, were updated by other tasks running on other cores in the cache coherency group that are sharing a common region of memory. This can result in a very significant reduction in upper-bound WCET analysis complexity. It can also result in tighter upper-bound WCET analysis times for those tasks. By way of comparison, the use of an eviction type of cache would result in some cache-lines that were present in the cache of core (1331) before the resource lock was requested being evicted so as to maintain coherency with the write memory transfer requests of core (1341). This would require the upper-bound WCET analysis tools to identify which cache-lines could potentially have been evicted so as to make pessimistic timing assumptions about access to those cache-lines.
The use of on-chip dual port memory (1370) is particularly well suited for supporting a relatively low number of high-bandwidth bus masters such as processor cores (1331, 1341) connected to the first interconnect (1350), and a larger number of peripherals (for example, 64 peripherals) operating at their wire speed which are connected to the second interconnect (1360). In particular, increasing the number of peripherals, say from 64 to 128, does not reduce the bandwidth, or increase the access latencies, of processor cores (1331), (1341) to the shared memory (1370). Furthermore, one or more timeslots of the second interconnect (1360) can be allocated to high bandwidth peripherals (say 1 gigabit/s Ethernet peripherals) over lower bandwidth peripherals (say 10 megabit/s Ethernet peripherals) which need only be allocated one timeslot to meet their wire speed bandwidth requirements.
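The timeslot-budgeting idea above can be sketched as follows. The rotation length and per-slot bandwidth are illustrative assumptions (a 16-slot rotation whose full bandwidth is 16 Gbit/s, so one slot sustains 1 Gbit/s), not parameters from the specification.

```python
import math

SLOT_BW_MBPS = 1000  # assumed bandwidth one timeslot sustains, in Mbit/s

def slots_needed(wire_speed_mbps: int) -> int:
    """Timeslots a peripheral needs to run at its wire speed."""
    return max(1, math.ceil(wire_speed_mbps / SLOT_BW_MBPS))

print(slots_needed(10))    # 10 Mbit/s Ethernet -> 1 slot
print(slots_needed(1000))  # 1 Gbit/s Ethernet  -> 1 slot
print(slots_needed(2500))  # a faster peripheral -> 3 slots
```

Adding more one-slot peripherals lengthens the rotation but leaves the processor cores' port on the dual-port memory untouched, which is the independence property the paragraph claims.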
In some situations, it will be desirable for one or more of the M interconnect bridges ({1351.a, 1352.a}, {1351.b, 1352.b}) to operate as an interconnect protocol transcoding bridge in which the protocol to transcode is a bus interconnect protocol such as ARM AMBA AHB [2].
The time-analysable multiprocessor interrupt controller (1392) with software maskable interrupt lines is adapted to map one or more interrupt lines between the peripherals (1381, 1382) and one or more interconnect masters (1331, 1341).
The shared memory computing device (1300) further comprises: N cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}), where the value of N is 2, each cache module comprising: a master port; a target port; and a finite state machine that employs an update-type cache coherency policy; N processor cores (1331, 1341), in which each core: is assigned a different one of the N fully associative cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) as its private cache; and in which: the execution time of memory transfer requests issued by each of the N processor cores (1331, 1341) is not modified by the: unrelated memory transfer requests issued by any of the other N processor cores (1331, 1341); and unrelated memory transfer requests issued by at least one other interconnect master (1381, 1382, 1383); one {1351.a, 1352.a} of the N cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) can maintain cache coherency against a different one of the N cache modules {1351.b, 1352.b}; and that cache module {1351.a, 1352.a} can maintain cache coherency against memory transfer requests issued by the at least one interconnect master (1381, 1382, 1383) by monitoring wire (1399).
Figures 12 to 14 illustrate alternative interconnect designs according to preferred embodiments of the present invention. These alternative interconnect designs can be employed to implement the interconnect (720) of figure 6 and the 3 interconnects (1350), (1360) and (1355) of figure 11.
Figure 12 is a block schematic diagram illustrating portions of a shared memory computing architecture (1700) for preferred embodiments of the present invention. The shared memory computing architecture (1700) comprises: M interconnect nodes (1701, 1702, 1703, 1704), where the value of M is 4, each interconnect node comprising: an egress port; and an ingress port; a singular interconnect node (1705) comprising: an egress port; and an ingress port; a first Mx1 interconnect (1706) for transporting memory transfer requests and their corresponding responses, comprising: M bidirectional ports ({1711.i, 1711.e}, {1712.i, 1712.e}, {1713.i, 1713.e}, {1714.i, 1714.e}), each comprising: an ingress port (1711.i, 1712.i, 1713.i, 1714.i) which is connected to the egress port of a different one of the M interconnect nodes (1701, 1702, 1703, 1704); and an egress port (1711.e, 1712.e, 1713.e, 1714.e), which is connected to the ingress port of a different one of the M interconnect nodes (1701, 1702, 1703, 1704); a singular bidirectional port ({1715.i, 1715.e}) comprising: an egress port (1715.e) which is connected to the ingress port of the singular interconnect node (1705); and an ingress port (1715.i) which is connected to the egress port of the singular interconnect node (1705); a parallel-in, serial-out (PISO) M input port x 1 output port shift register (1707) with M stages (1751, 1752, 1753, 1754), in which: for each stage I of the M stages: that stage is connected to the egress port of the interconnect node I of M interconnect nodes ({1751, 1711.i, 1701}, {1752, 1712.i, 1702}, {1753, 1713.i, 1703}, {1754, 1714.i, 1704}); and the output of stage 1 (1751) is connected to the egress port (1715.e) of the singular port of the interconnect; a serial-in, parallel-out (SIPO) 1 input port x M output port module (1708), in which the input is connected to the ingress port of the singular port of the interconnect (1715.i); and an arbiter and decoder module (1716) which is adapted to control the PISO Mx1 shift register (1707) and the SIPO 1xM module (1708).
In this pedagogical description, the value of W is set as the number of bits to transport a memory transfer request of the maximum length for that interconnect and its corresponding response in one clock cycle. An idle memory transfer request is encoded as W bits with the binary value of zero. The arbiter and decoder module (1716) controls: the select input of each of the 2 data input, 1 data output multiplexers (1720, 1721, 1722, 1723, 1725, 1726, 1727, 1728), each multiplexer having a data-path of W bits; the select input of the optional 2 data input, 1 data output multiplexer (1729) which has a data-path of W bits; the enable input of each of the registers (1730, 1731, 1732), each register having a data-path of W bits; the enable input of each of the optional registers (1740, 1741, 1742, 1743, 1744), each register having a data-path of W bits; the enable input of register (1746) which has a data-path of W bits; and the enable input of each of the optional registers (1745, 1747), each register having a data-path of W bits.
The interconnect arbiter and decoder module (1716) receives as inputs the control signals (not illustrated) received on ports (1711.i, 1712.i, 1713.i, 1714.i, 1715.i). Preferably the arbiter and decoder module (1716) implements at least one scheduling policy that considers the state of those input control signals.
The interconnect arbiter and decoder module (1716) generates one or more control signals as outputs (not illustrated) that are supplied as output on ports (1711.e, 1712.e, 1713.e, 1714.e, 1715.e). One or more of these control signals released as output on ports (1711.e, 1712.e, 1713.e, 1714.e, 1715.e) are used to inform each interconnect node (1701, 1702, 1703, 1704, 1705) whether it has been granted a timeslot on the interconnect to issue a memory transfer request (if it is an interconnect master); and to provide relevant meta-data associated with a memory transfer request sent to that interconnect node (if it is an interconnect target).
The following text describes the use of the optional registers (1740, 1741, 1742) and the optional registers (1745, 1747).
This paragraph describes the parallel-in, serial-out (PISO) M input port x 1 output port shift register module (1707) in greater detail. The data-path of each of the ingress ports (1711.i, 1712.i, 1713.i, 1714.i) is gated by the multiplexers (1720, 1721, 1722, 1723) respectively. The data path of each of the egress ports (1711.e, 1712.e, 1713.e, 1714.e, 1714.s) is gated by the multiplexers (1725, 1726, 1727, 1728, 1729) respectively. In the fourth stage (1754) of the parallel-in, serial-out (PISO) M input port x 1 output port shift register (1707), the binary value 0 is supplied as input to the first data port of multiplexer (1737). The output of multiplexer (1723) is supplied as input to the second data port of multiplexer (1737). The output of multiplexer (1737) is supplied as data input to the register (1732). In the third stage (1753), the output of register (1732) is supplied as input to the first data port of multiplexer (1736). The output of multiplexer (1722) is supplied as input to the second data port of multiplexer (1736). The output of multiplexer (1736) is supplied as data input to the register (1731). In the second stage (1752), the output of register (1731) is supplied as input to the first data port of multiplexer (1735). The output of multiplexer (1721) is supplied as input to the second data port of multiplexer (1735).
The output of multiplexer (1735) is supplied as data input to the register (1730). In the first stage (1751), the output of register (1730) is supplied as input to the first data port of multiplexer (1717). The output of multiplexer (1720) is supplied as input to the second data port of multiplexer (1717). The output of multiplexer (1717) is released as the egress output of port (1715.e).
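The PISO data path described in the two preceding paragraphs can be sketched behaviourally as follows. This is a minimal Python model written for this description: the register numbering (1730, 1731, 1732) follows the figure, but the `Piso4` class, its method names, and the sample request value are illustrative assumptions, not part of the specification.

```python
IDLE = 0  # an idle request is the all-zero W-bit word

class Piso4:
    """Behavioural model of the 4-input PISO shift register (1707)."""
    def __init__(self):
        # r[0], r[1], r[2] model registers 1730, 1731, 1732.
        self.r = [IDLE, IDLE, IDLE]

    def clock(self, inputs, load):
        """inputs[i] is port i+1's parallel value; load[i] selects it over
        the shifted value.  Returns the singular egress output this cycle."""
        out = inputs[0] if load[0] else self.r[0]       # stage 1 is combinational
        self.r = [
            inputs[1] if load[1] else self.r[1],        # stage 2 (reg 1730)
            inputs[2] if load[2] else self.r[2],        # stage 3 (reg 1731)
            inputs[3] if load[3] else IDLE,             # stage 4 (reg 1732)
        ]
        return out

piso = Piso4()
# Node 2 loads request 0xA2 in one cycle; it reaches the egress the next cycle.
out1 = piso.clock([IDLE, 0xA2, IDLE, IDLE], [False, True, False, False])
out2 = piso.clock([IDLE, IDLE, IDLE, IDLE], [False, False, False, False])
print(hex(out1), hex(out2))  # 0x0 0xa2
```

This matches the flow-chart of figure 13, where a request stored in register (1730) in one cycle is released to the target in the following cycle.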
This paragraph describes the serial-in, parallel-out (SIPO) 1 input port x M output port module (1708) in greater detail. The output of interconnect node (1705) is received on ingress port (1715.i) and is supplied to the data input of registers (1740) and (1745). The output of the W-bit wide register (1740) is gated by multiplexer (1725). The output of W-bit wide register (1745) is supplied to the data input of registers (1741) and (1746). The output of the W-bit wide register (1741) is gated by multiplexer (1726). The output of W-bit wide register (1746) is supplied to the data input of registers (1742) and (1747). The output of the W-bit wide register (1742) is gated by multiplexer (1727). The output of W-bit wide register (1747) is gated by multiplexer (1728). Preferably the arbiter and decoder module (1716) is adapted to employ the ingress and egress gating to selectively block the outputs and inputs of interconnect nodes (1701, 1702, 1703, 1704) respectively. Furthermore, the gating multiplexers can be used by the arbiter and decoder module (1716) to enforce access controls. The gating multiplexers can be implemented using AND gates without loss of generality.
In a preferred embodiment of the present invention, the interconnect node (1705) is an interconnect master, and the interconnect nodes (1701, 1702, 1703, 1704) are interconnect targets. In this embodiment, memory transfer requests are transported over the first serial-in, parallel-out (SIPO) 1 input port x M output port module (1708) and memory transfer responses are transported over the parallel-in, serial-out (PISO) M input port x 1 output port shift register module (1707). Preferably each timeslot has a length of 1 clock cycle, the interconnect master (1705) is adapted to issue a new memory transfer request every clock cycle and each interconnect target (1701, 1702, 1703, 1704) is adapted to issue a memory transfer response once every 4 clock cycles.
Preferably each interconnect target (1701, 1702, 1703, 1704) is assigned one timeslot, and the interconnect master issues memory transfer requests in a round-robin fashion to each of the interconnect targets (1701, 1702, 1703, 1704). In a preferred embodiment of the present invention, the register (1740) is replaced with a 2 stage FIFO, the register (1741) is replaced with a 1 stage FIFO, the optional registers (1742) and (1743) are both replaced with a 1 stage FIFO, and the optional registers (1745) and (1747) are not used. In this case, the memory transfer request for each timeslot (for 1701, 1702, 1703, 1704) is loaded into its corresponding FIFO (1740, 1741, 1742, 1743). The concurrent output of each FIFO (1740, 1741, 1742, 1743) is delayed by 1 clock cycle for each delay register (1745, 1746, 1747) that is employed. In this illustration, only one delay register (1746) is employed, and so the output of each FIFO (1740, 1741, 1742, 1743) is released in parallel in the second timeslot. In this way a new memory transfer request can be issued every clock cycle in a round robin scheme with 4 timeslots, although it takes 5 clock cycles to transport each of those memory transfer requests to the 4 interconnect targets (1701, 1702, 1703, 1704).
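The round-robin issue order above can be sketched in a few lines. This is an illustrative model only: it shows the one-target-per-1-cycle-timeslot rotation, under the assumption that the rotation starts at target (1701).

```python
# Single master (1705) addresses targets 1701..1704, one per 1-cycle timeslot,
# so each target sees a new request exactly once every 4 clock cycles.
targets = [1701, 1702, 1703, 1704]
schedule = [targets[cycle % 4] for cycle in range(8)]
print(schedule)  # [1701, 1702, 1703, 1704, 1701, 1702, 1703, 1704]
```

The FIFO-and-delay-register arrangement then adds a fixed transport latency (5 cycles in the illustration) without reducing the one-request-per-cycle issue rate.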
In an alternate preferred embodiment of the present invention, the interconnect node (1705) is an interconnect target, and the interconnect nodes (1701, 1702, 1703, 1704) are interconnect masters. In this embodiment memory transfer requests are transported over the parallel-in, serial-out (PISO) M input port x 1 output port shift register module (1707) and memory transfer responses are transported over the first serial-in, parallel-out (SIPO) 1 input port x M output port module (1708). Preferably each timeslot is 1 clock cycle in length, the interconnect masters (1701, 1702, 1703, 1704) are adapted to issue a memory transfer request once every 4 clock cycles and the interconnect target (1705) is adapted to receive a memory transfer request each clock cycle and issue a memory transfer response each clock cycle.
Preferably module (1707) is adapted to transport just memory transfer requests and module (1708) is adapted to transport memory transfer responses along with a copy of their corresponding memory transfer requests to facilitate cache coherency for update-type snooping caches (1705, 1715, 1744, 1729, 1704).
Figure 13 is a flow-chart (1800) illustrating the steps of interconnect master (1702) issuing a single memory transfer request over interconnect (1706) to interconnect target (1705) according to a preferred embodiment of the present invention. The process described in flow chart (1800) will not use the optional registers (1740, 1741, 1742, 1743, 1744, 1745, 1747), and the 4 memory transfer responses within a statically scheduled round-robin period of 4 clock cycles will not be buffered and released in parallel. In this way, only PISO module (1707) is implementing a timeslot based scheme, while the SIPO module (1708) employs a best-effort scheduling scheme.
In clock cycle 1 (1801): In step 1820, the interconnect target (1705) receives the output of PISO module (1707) which contains an idle memory transfer request. The interconnect target (1705) generates an idle memory transfer response incorporating a copy of its corresponding idle memory transfer request. The value of that memory transfer response is supplied to interconnect (1708). In step 1830, the value of the memory transfer response generated in step 1820 is received as input on port 1715.i and supplied to the input of the SIPO module (1708) and will be relayed across the 2 stages of that SIPO module. The first stage includes the modules (1725), (1726) and (1746). The second stage includes the modules (1727) and (1728). The interconnect arbiter and decoder module (1716) generates control signals on ports (1711.e), (1712.e), (1713.e), and (1714.e) granting the next ingress timeslot of the interconnect (1706) simultaneously to each of the interconnect masters (1701), (1702), (1703) and (1704) respectively.
In step 1810, the value of the control signal generated in step 1830 is received as input by the interconnect master (1702).
In clock cycle 2 (1802): In step 1821, the interconnect target (1705) receives the output of PISO module (1707) which contains an idle memory transfer request. The interconnect target (1705) generates an idle memory transfer response incorporating a copy of its corresponding idle memory transfer request which was received in step 1820. The value of that memory transfer response is supplied to the interconnect (1708).
In step 1811, the interconnect master (1702) generates a memory transfer request addressed to interconnect target (1705), the value of which is supplied to interconnect (1707).
In step 1831, the value of the memory transfer response generated in step 1821 is received as input to the SIPO module (1708) and will be relayed across the 2 stages of the SIPO module. The value of the memory transfer request generated in step 1811 is received as input to the second stage (1752) of the PISO module (1707) and stored in register (1730). Each of the other 3 interconnect nodes (1701), (1703), and (1704) generates an idle memory transfer request which is received as input to the first stage (1751), third stage (1753) and fourth stage (1754) respectively.
In clock cycle 3 (1803): In step 1832, the value of the memory transfer request stored in register (1730) is released as output of the PISO module (1707) and supplied as input to the interconnect target (1705).
In step 1822, the interconnect target (1705) receives the output of PISO module (1707) which contains the value of the memory transfer request generated as output by the interconnect master (1702) in step 1811 and begins to process that request. The interconnect target (1705) generates an idle memory transfer response incorporating a copy of its corresponding idle memory transfer request which was received in step 1821.
The value of that memory transfer response is supplied to the interconnect (1708).
In step 1832, the value of the memory transfer response generated in step 1822 is received as input to the SIPO module (1708) and will be relayed across the 2 stages of the SIPO module.
In clock cycle 4 (1804): In step 1823, the interconnect target (1705) receives the output of PISO module (1707) which contains an idle memory transfer request. The interconnect target (1705) generates a memory transfer response incorporating a copy of its corresponding memory transfer request which was received in step 1822. The value of that memory transfer response is supplied to the interconnect (1708).
In step 1833, the value of the memory transfer response generated in step 1823 is received as input to the SIPO module (1708) and will be relayed across the 2 stages of the SIPO module. The value of that memory transfer response received as input to the SIPO module (1708) is directly released as output over port (1712.e) to interconnect master (1702).
In step 1812, the interconnect master (1702) receives the value of the memory transfer response sent in step 1832 corresponding to the interconnect master’s (1702) memory transfer request issued in step 1811.
In this way we have illustrated an interconnect master (1702) issuing a memory transfer request to interconnect target (1705) and receiving its corresponding memory transfer response over interconnect (1706).
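The transaction walked through above can be condensed into a short timeline. The event descriptions are paraphrases of the flow-chart steps; representing them as a Python dictionary keyed by clock cycle is purely an illustrative device.

```python
# Condensed timeline of flow-chart (1800): master 1702 issues in cycle 2,
# target 1705 sees the request in cycle 3, and the response reaches the
# master in cycle 4.
timeline = {
    2: "master 1702 drives its request into PISO stage register 1730 (step 1811)",
    3: "target 1705 receives the request from the PISO egress (step 1822)",
    4: "response is released on port 1712.e to master 1702 (step 1812)",
}
round_trip_cycles = max(timeline) - min(timeline)
print(round_trip_cycles)  # 2
```

The fixed 2-cycle issue-to-response distance is what makes the scheme time-analysable: the latency is a property of the pipeline, not of contention.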
Preferably, the shared memory computing architecture (1700) further comprises a second serial-in, parallel-out (SIPO) 1 input port x M output port module (1709) (only port (1714.s) is illustrated) for transporting cache coherency traffic, in which: the input is connected to the ingress port (1715.i) of the singular port {1715.i, 1715.e} of the interconnect (1706); and the arbiter and decoder module (1716) controls the second SIPO 1xM module.
Preferably the first SIPO (1708) and second SIPO (1709) employ different routing policies. Let us consider an example where interconnect nodes (1701, 1702, 1703, 1704) are interconnect masters. In this example, the arbiter and decoder module (1716) selectively routes the value of each memory transfer response back to the interconnect master that issued the corresponding memory transfer request on the first SIPO (1708). However, for the second SIPO (1709), the arbiter and decoder module (1716) forwards the value of each and every memory transfer response (and its corresponding memory transfer request data) to the snoop port (only 1704.s illustrated) of all interconnect masters. See the description of figure 20 for an example encoding a memory transfer response with its corresponding memory transfer request. In this way the snooping of write memory transfer requests can be performed when monitoring just the interconnect transporting memory transfer responses. Preferably cache coherency groups are configured so that memory transfer responses (and their corresponding memory transfer request data) are selectively forwarded according to the cache coherency group policies in force on that interconnect (1706).
So in this way we have illustrated a bidirectional interconnect (1706) for transporting memory transfer requests and their corresponding memory transfer responses, comprising: a unidirectional interconnect to transport memory transfer requests (1707); and a unidirectional interconnect to transport memory transfer responses (1708, 1709) which is adapted to transport memory transfer responses that include a copy of the corresponding memory transfer request.
In an alternate preferred embodiment, the interconnect node (1705) is an interconnect bridge. In some situations, it will be desirable for the interconnect bridge (1705) to operate as an interconnect protocol transcoding bridge in which the protocol to transcode is a bus interconnect protocol such as ARM AMBA AHB [2].
Figure 14 is a block schematic diagram illustrating portions of a shared memory computing architecture (1900), employing embodiments of figures 3 and 12 for preferred embodiments of the present invention. Shared memory computing architecture (1900) comprises: 16 interconnect masters (1901 to 1916); 1 interconnect target (1917); a composite interconnect {1960, 1961, 1962, 1963, 1964} comprising: four sub-interconnects (1960, 1961, 1962, 1963) of the type described in figure 12, each sub-interconnect having 4 interconnect master ports ({1921 to 1924}, {1925 to 1928}, {1929 to 1932}, {1933 to 1936}) and 1 output port (1941, 1942, 1943, 1944); one sub-interconnect (1964) having 4 input ports (1951 to 1954) and 1 interconnect target port (1955); in which: the 4 interconnect masters (1901) to (1904) are connected to sub-interconnect (1960) on ports (1921) to (1924) respectively; the 4 interconnect masters (1905) to (1908) are connected to sub-interconnect (1961) on ports (1925) to (1928) respectively; the 4 interconnect masters (1909) to (1912) are connected to sub-interconnect (1962) on ports (1929) to (1932) respectively; the 4 interconnect masters (1913) to (1916) are connected to sub-interconnect (1963) on ports (1933) to (1936) respectively; the 4 output ports (1941, 1942, 1943, 1944) of the 4 sub-interconnects (1960, 1961, 1962, 1963) are connected to the 4 input ports (1951, 1952, 1953, 1954) of the sub-interconnect (1964) respectively; and the interconnect target (1917) is connected to sub-interconnect (1964) on port (1955). Preferably, the composite interconnect {1960, 1961, 1962, 1963, 1964} employs a statically scheduled timeslot scheme with 16 timeslots, one for each of the interconnect masters (1901 to 1916).
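One plausible static 16-timeslot schedule for the two-level composite interconnect above can be sketched as follows. The mapping (global timeslot t grants the master on port t % 4 of sub-interconnect t // 4) is an assumption made for illustration; the specification requires only that each of the 16 masters owns one slot per rotation.

```python
masters = list(range(1901, 1917))  # interconnect masters 1901..1916

def master_for_slot(t: int) -> int:
    """Map global timeslot t (0..15) to the granted interconnect master."""
    sub = t // 4    # which first-level sub-interconnect (1960..1963)
    port = t % 4    # which master port on that sub-interconnect
    return masters[sub * 4 + port]

granted = [master_for_slot(t) for t in range(16)]
assert sorted(granted) == masters  # every master owns exactly one slot
print(granted[0], granted[15])  # 1901 1916
```

Because the schedule is static, the worst-case wait for any master is a fixed 16-cycle rotation, independent of what the other masters are doing.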
In one preferred embodiment of the present invention, the arbiter and decoder modules of the five sub-interconnects (1960, 1961, 1962, 1963, 1964) are trivially substituted with a single arbiter and decoder module controlling the composite interconnect {1960, 1961, 1962, 1963, 1964}. In an alternate preferred embodiment of the present invention, the five arbiter and decoder modules in sub-interconnects (1960, 1961, 1962, 1963, 1964) are adapted to co-ordinate their activities to create a single logical finite state machine (not illustrated) controlling the composite interconnect {1960, 1961, 1962, 1963, 1964}.
Figure 14 illustrates that different types of interconnects can be combined together to create a ite interconnect without a loss of generality.
In an alternate embodiment of the present invention, the interconnect nodes (1901 to 1916) are interconnect targets and the interconnect node (1917) is an interconnect bridge which permits one or more interconnect masters (not illustrated) to issue memory transfer requests over that interconnect bridge (1917) to the interconnect targets (1901 to 1916). Preferably the composite interconnect {1960, 1961, 1962, 1963, 1964} further comprises a means to enforce an access control policy between interconnect masters and interconnect targets. It is further preferred that the means to enforce an access control policy is adapted to ensure that no more than one interconnect master can issue memory transfer requests to a given interconnect target (1901 to 1916). In this way the access control policy guarantees that a memory transfer request to that interconnect target will not be delayed by other interconnect masters.
Figure 15 is a high-level block schematic diagram illustrating a cache module (1200) for preferred embodiments of the present invention. Cache module (1200) comprises: an interconnect target port (1210); an interconnect master port (1215); two snoop ports (1212) and (1213); a first-in, first-out (FIFO) queue (1214) to store cache coherency traffic, being adapted to store snoop traffic received on the two snoop ports (1212) and (1213); a FIFO queue (1211) to store memory transfer requests received on the interconnect target port (1210) being adapted to store: at least one outstanding write memory transfer request; and at least one outstanding read memory transfer request; a dual-port cache-line store (1230) being adapted to store at least two cache-lines; a FIFO queue (1235) being adapted to queue write memory transfer events; a FIFO queue (1236) being adapted to queue read memory transfer events; a queue (1237) being adapted to queue the order to process read and write memory transfer events queued in the FIFO queues (1235) and (1236); a FIFO queue (1238) called a write buffer (1238) being adapted to store the data of cache-lines that have been evicted from the cache-line store (1230) and are to be written over the interconnect master port (1215); a dual port address tag finite state machine (1231) comprising: a first target port; a second target port; a means to store tags that associate cache-lines stored in the cache-line store (1230) with their respective (virtual and/or physical) addresses; a means to search for tags by their (virtual and/or physical) address; and a means to search for tags by their index within the cache-line store (1230); a triple port status tag finite state machine (1232) comprising: a first target port; a second target port; a third target port; a means to store tags that associate the cache-lines stored in the cache-line store (1230) with their status and other related information, including: which cache-lines are allocated; which cache-lines are in the
process of being evicted; optionally which cache-lines are in the process of being cleaned; which portions of the cache-lines are valid; and which portions of the cache-lines are dirty; and a means to process commands received on the first, second and third target ports in a way that ensures internal consistency of the content of the tags and of the responses to the concurrently issued commands; an interconnect (1239) that is work-preserving comprising: a high priority master port; a low priority master port; and a target port connected to the second port of the dual-port cache-line store (1230); a front-side FSM (1220) comprising: a master port connected to the low priority master port of the interconnect (1239); a bidirectional communications channel with the FIFO queue (1211); a bidirectional communications channel with the interconnect target port (1210); a unidirectional communications channel with the queuing FSM (1221); a bidirectional communications channel with the back-side FSM (1222); a master port connected to the second target port of the dual port address tag finite state machine (1231); and a master port connected to the second target port of the triple port status tag finite state machine (1232); a queuing FSM (1221) comprising: a bidirectional communications channel with the front-side FSM (1220); a bidirectional communications channel with the back-side FSM (1222); two master ports connected to the FIFO queue (1235) being adapted to queue write memory transfer events; two master ports connected to the FIFO queue (1236) being adapted to queue read memory transfer events; and two master ports connected to the queue (1237) being adapted to queue the order in which to process read and write memory transfer events;
a back-side FSM (1222) comprising: a master port connected to the high priority master port of the interconnect (1239); a bidirectional communications channel with the queuing FSM (1221); a bidirectional communications channel with the front-side FSM (1220); a master port connected to the third target port of the triple port status tag finite state machine (1232); two master ports connected to the write buffer (1238); and a bidirectional communications channel with the interconnect master port (1215); and a snoop FSM (1223) comprising: a bidirectional communications channel with the FIFO queue (1214); a bidirectional communications channel with the back-side FSM (1222); a master port connected to the first target port of the dual port address tag finite state machine (1231); a master port connected to the first target port of the triple port status tag finite state machine (1232); and a master port connected to the first port of the dual-port cache-line store (1230).
Figure 16 is a flow-chart (1400) illustrating the steps of the front-side FSM (1220) of figure 15 according to a preferred embodiment of the present invention. The process described in flow chart (1400) is a functional description which executes over 1 or more clock cycles.
In step 1401, start the front-side FSM process.
In step 1402, perform a blocking read to fetch the next memory transfer request from the ingress FIFO queue (1211). By blocking, it is meant that the read request will wait until a memory transfer request is retrieved, even if the FIFO queue (1211) is initially empty when the read request is issued.
In step 1403, issue a blocking command to the address tag finite state machine (1231) to search for a cache-line by the address encoded in the memory transfer request received in step 1402. If the cache-line is present, then issue a blocking command to the status tag finite state machine (1232) to: (a) retrieve the status details including which portions of that cache-line are valid, (b) request the status details of the least recently used cache-line, and (c) ask if there are any currently unallocated cache-lines.
In step 1404, if the memory transfer request received in step 1402 is a read request go to step 1405 otherwise go to step 1415.
In step 1405, if the memory transfer request received in step 1402 corresponds to a cache-line that is present in the cache-line store (1230) and the requested content is present in that cache-line, then go to step 1413, otherwise go to step 1406.
In step 1406, if the read memory transfer request received in step 1402 corresponds to a cache-line that is present in the cache-line store (1230) but a requested portion of that cache-line is not present/valid, then go to step 1412, otherwise go to step 1407.
In step 1407, if there is at least one unallocated cache-line available in the cache-line store (1230), then go to step 1411, otherwise go to step 1408.
In step 1408, issue a non-blocking command to the status tag finite state machine (1232) marking the least recently used cache-line as being in the process of being evicted.
In step 1409, if the least recently used cache-line to be evicted is dirty and therefore must be written out of the cache module (1200) then go to step 1410, otherwise go to step 1411.
In step 1410, issue a non-blocking command to the queuing FSM (1221) requesting an eviction of the dirty cache-line. Wait for a notification from the back-side FSM (1222) indicating a write transaction has completed.
In step 1411, issue a blocking command to the status tag finite state machine (1232) requesting the allocation of an unallocated cache-line and receive the index for that newly allocated cache-line.
In step 1412, issue a non-blocking command to the queuing FSM (1221) requesting a read memory transfer request, passing the index of the cache-line to store the retrieved data. Wait for the back-side FSM (1222): (a) to indicate that the cache-line has been read and stored in the cache-line store (1230), and (b) to forward a copy of the requested data to the front-side FSM.
In step 1413, issue a blocking command to the cache-line store (1230) to read a copy of the requested data and forward a copy of the requested data to the front-side FSM.
In step 1414, issue a memory transfer response containing the requested read data to the interconnect target port.
In step 1415, if the memory transfer request received in step 1402 corresponds to a cache-line that is present in the cache-line store (1230), then go to step 1421, otherwise go to step 1416.
In step 1416, if there is at least one unallocated cache-line available in the cache-line store (1230) then go to step 1420, otherwise go to step 1417.
In step 1417, issue a non-blocking command to the status tag finite state machine (1232) marking the least recently used cache-line as being in the process of being evicted.
In step 1418, if the least recently used cache-line to be evicted is dirty and therefore must be written out of the cache module (1200) then go to step 1419, otherwise go to step 1420.
In step 1419, issue a non-blocking command to the queuing FSM (1221) requesting an eviction of the dirty cache-line. Wait for a notification from the back-side FSM (1222) indicating that a write transaction has completed.
In step 1420, issue a blocking command to the status tag finite state machine (1232) requesting the allocation of an unallocated cache-line and receive the index to that newly allocated cache-line.
In step 1421, issue a non-blocking command to the cache-line store (1230) to write a copy of the data received in the write memory transfer request to the location in the cache-line store (1230) indicated by the index received in step 1420.
In step 1422, issue a non-blocking command to the status tag finite state machine (1232) marking that cache-line as being dirty.
In step 1423, if this cache-line was previously clean, issue a non-blocking command to the queuing FSM (1221) to inform it this cache-line is now dirty.
In step 1424, end the front-side FSM process.
In this way, we have demonstrated that the front-side FSM: employs an allocate-on-read strategy; employs an allocate-on-write strategy; employs a least recently used eviction strategy; and permits writes to be performed to any dirty cache-line which has been queued for eviction, but not yet evicted.
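By way of illustration only, the allocate-on-read, allocate-on-write and least-recently-used behaviour summarised above can be sketched as a small Python model. The class name, its data model, and the `evictions` list (standing in for the write buffer (1238)) are illustrative assumptions and do not appear in the specification:

```python
from collections import OrderedDict

class FullyAssociativeCache:
    """Sketch of a fully-associative write-back cache with allocate-on-read,
    allocate-on-write and least-recently-used eviction, as in flow-chart 1400.
    Dirty victims are queued in `evictions` so a caller can write them back."""

    def __init__(self, num_lines, backing):
        self.num_lines = num_lines
        self.backing = backing      # models the memory behind port (1215)
        self.lines = OrderedDict()  # addr -> (data, dirty); order tracks LRU
        self.evictions = []         # queued write-backs (write buffer 1238)

    def _allocate(self, addr, data, dirty):
        if len(self.lines) >= self.num_lines:
            victim, (vdata, vdirty) = self.lines.popitem(last=False)  # LRU
            if vdirty:
                self.evictions.append((victim, vdata))  # queue the eviction
        self.lines[addr] = (data, dirty)

    def read(self, addr):
        if addr in self.lines:              # hit: refresh LRU order
            self.lines.move_to_end(addr)
            return self.lines[addr][0]
        data = self.backing[addr]           # miss: allocate on read
        self._allocate(addr, data, dirty=False)
        return data

    def write(self, addr, data):
        if addr in self.lines:              # hit: update and mark dirty
            self.lines.move_to_end(addr)
            self.lines[addr] = (data, True)
        else:                               # miss: allocate on write
            self._allocate(addr, data, dirty=True)
```

Note that, as in the flow-chart, only a dirty victim generates a write-back; a clean victim is silently dropped.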
Figure 17 is a flow-chart (1500) illustrating the steps of the queuing FSM (1221) of figure 15 according to a preferred embodiment of the present invention. The process described in flow chart (1500) is a functional description which executes every clock cycle that the cache module (1200) is enabled. At least one of the 4 policies is selected at power on, and the currently active policy can be changed at run time.
In step 1501, start the queuing FSM (1221) process.
In step 1502, receive any commands issued by the front FSM (1220); In step 1503, receive any notifications issued by the back FSM (1222); In step 1504, if there are no commands issued by the front FSM (1220) this clock cycle then go to step 1514, otherwise go to step 1505.
In step 1505, if a read command is received in step 1502, go to step 1506. If an eviction command is received in step 1502, go to step 1507. Otherwise a dirty cache-line notification command has been received in step 1502, therefore go to step 1508.
In step 1506, store the read command in FIFO queue (1236); go to step 1508.
In step 1507, store the write command in FIFO queue (1235); go to step 1508.
In step 1508, if the currently active policy is policy 1, go to step 1509. If the currently active policy is policy 2, go to step 1510. If the currently active policy is policy 3, go to step 1511.
Otherwise the currently active policy is policy 4 therefore go to step 1512.
In step 1509, policy 1 employs a policy in which a cache-line is solely evicted in response to receiving a memory transfer request which either: flushes at least one specific cache-line; or requires the allocation of at least one cache-line.
Policy 1 ignores all dirty cache-line notification commands received in step 1502. In a preferred embodiment of the present invention, read and write operations will be queued in (1237) in the order they are received. In an alternate preferred embodiment of the present invention, read operations will take priority over queued write operations. Go to step 1513.
In step 1510, policy 2 employs a policy in which each cache-line is queued for eviction as soon as it becomes dirty and a read-miss is serviced after all the currently outstanding dirty cache-lines have been evicted.
If a dirty cache-line notification command was received in step 1502 then generate a write command and store it in the FIFO queue (1235) to queue writing this dirty cache-line out of the cache-module (1200). Go to step 1513.
In step 1511, policy 3 employs a policy in which each cache-line is queued for eviction as soon as it becomes dirty and a read-miss is serviced before all the currently outstanding dirty cache-lines have been evicted.
If a dirty cache-line notification command was received in step 1502 then generate a write command and store it in the FIFO queue (1235) to queue writing this dirty cache-line out of the cache-module (1200). Go to step 1513.
In step 1512, policy 4 employs a policy in which each cache-line is queued for eviction as soon as it becomes dirty; and in which a read-miss is serviced before the eviction of the currently outstanding dirty cache-lines queued for eviction on the condition that the execution time of each of the outstanding dirty cache-line evictions is not modified as a result of executing the read-miss operation first, otherwise the read-miss operation is delayed.
If a dirty cache-line notification command was received in step 1502 then generate a write command and store it in the FIFO queue (1235) to queue writing this dirty cache-line out of the cache-module (1200). Go to step 1513.
In step 1513, the content of the queue (1237) is ordered according to the currently active policy.
In step 1514, if there are no transaction-completed notifications issued by the back FSM (1222) this clock cycle then go to step 1519, otherwise go to step 1515.
In step 1515, if the back FSM (1222) issued a read transaction completed notification go to step 1516, otherwise a write transaction completed notification has been issued and therefore go to step 1517.
In step 1516, remove one element from the FIFO queue (1236). Go to step 1518.
In step 1517, remove one element from the FIFO queue (1235). Go to step 1518.
In step 1518, remove one element from the queue (1237).
In step 1519, release a copy of the head-of-line values for queues (1236), (1235), (1237) as input to the back FSM (1222).
In step 1520, end the queuing FSM (1221) process.
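As a rough illustration of how the order queue (1237) differs between policies 2 and 3 described above, the following Python sketch orders queued events. The `schedule` helper and its event encoding are hypothetical; policy 1 (flush-driven eviction) and policy 4 (eviction-timing-preserving reordering) depend on flush requests and execution-time constraints that are not modelled here:

```python
def schedule(policy, reads, writes):
    """Sketch of the ordering applied to the queue (1237) in flow-chart 1500.

    Policy 2: a read-miss is serviced only after all currently outstanding
    dirty cache-line evictions (writes) have been performed.
    Policy 3: a read-miss is serviced before the outstanding evictions."""
    if policy == 2:
        return [('write', w) for w in writes] + [('read', r) for r in reads]
    if policy == 3:
        return [('read', r) for r in reads] + [('write', w) for w in writes]
    raise NotImplementedError("policies 1 and 4 depend on flush and timing state")
```

The trade-off sketched here is the one the specification describes: policy 2 keeps memory clean at the cost of read latency, while policy 3 favours read latency over write-back progress.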
Figure 18 is a flow-chart (1600) illustrating the steps of the back-side FSM (1222) of figure 15 according to a preferred embodiment of the present invention. The process described in flow chart (1600) is a functional description which executes over 1 or more clock cycles. This process assumes the interconnect connected to the cache module's master interconnect port (1215) issues memory transfer responses to write memory transfer requests indicating if the transaction completed or needs to be resent because the transaction was corrupted before it could be completed.
In step 1601, start the back-side FSM (1222) process.
In step 1602, receive any commands issued by the front FSM (1220); In step 1603, receive a copy of the head-of-line values for queues (1236), (1235), (1237) and store them in variables R, W, and T respectively.
In step 1604, if there is no outstanding read memory transfer event R and no outstanding write memory transfer event W, then go to step 1620, otherwise go to step 1605.
In step 1605, issue a blocking request to the interconnect master interface requesting a timeslot on the interconnect (not illustrated). Preferably the interconnect (not illustrated) notifies the interconnect master port (1215) that it will be granted a timeslot on the interconnect at least one clock cycle before its allotted timeslot starts. The rest of this process assumes this is the case.
In step 1606 if the value of T indicates the read operation should be serviced go to step 1608 otherwise the write operation should be serviced therefore go to step 1607.
In step 1607, issue a blocking command to the cache-line store (1230) to read a copy of the requested data to write as per write memory transfer event W.
In step 1608, issue a non-blocking command to the status tag finite state machine (1232) updating the status of the cache-line as clean. Go to step 1609.
In step 1609, wait 1 clock cycle for the start of the memory transfer t timeslot on the interconnect (not illustrated).
In step 1610, if the value of T indicates the read operation should be serviced go to step 1611 otherwise the write operation should be serviced therefore go to step 1615.
In step 1611, create a read memory transfer request in response to the read memory transfer event R and issue that memory er request over the onnect master port (1215).
In step 1612, wait until the memory transfer response to the read memory transfer request issued in step 1611 is received on interconnect master port (1215).
In step 1613, issue a non-blocking command to the cache-line store (1230) to write a copy of the data received in step 1612 using the cache-line index stored in the read memory transfer event R.
In step 1614, issue a non-blocking command to the status tag finite state machine (1232) updating the status of the portions of cache-line that are now valid. Go to step 1618.
In step 1615, create a write memory transfer request in response to the write memory transfer event W and issue that memory transfer request over the interconnect master port (1215).
In step 1616, wait until the memory transfer response to the write memory transfer request issued in step 1615 is received on interconnect master port (1215).
In step 1617, if the memory transfer response received in step 1616 indicates that the write memory transfer request must be resent, go to step 1615, otherwise go to step 1618.
In step 1618, issue a transaction complete notification to the front FSM (1220) and a full copy of the memory transfer response.
In step 1619, issue a transaction complete notification to the queuing FSM (1221).
In step 1620, end the back-side FSM (1222) process.
In an alternate preferred embodiment of the present invention, the notification to the front-side FSM (1220) and queuing FSM (1221) of the completion of a write memory transfer request, which is currently performed in steps 1618 and 1619, can instead be performed in step 1608.
This may permit the front-side FSM (1220) to continue processing its current memory transfer request.
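The resend handling of steps 1615 to 1617 can be sketched as a retry loop. The `port_send` callback and the 'resend' status encoding are illustrative assumptions; the specification only requires that the memory transfer response indicate whether the transaction completed or must be resent:

```python
def issue_write_with_retry(port_send, max_attempts=4):
    """Sketch of steps 1615-1617 of flow-chart 1600: a write memory transfer
    request is reissued while the memory transfer response reports that the
    transaction was corrupted and must be resent.

    `port_send` models issuing the request over the interconnect master port
    (1215) and returns the response status (hypothetical string encoding).
    Returns the number of attempts taken for the transaction to complete."""
    for attempt in range(1, max_attempts + 1):
        status = port_send()
        if status != 'resend':
            return attempt          # transaction completed: go to step 1618
    raise RuntimeError('write memory transfer request not acknowledged')
```

The bounded attempt count is an assumption added for the sketch; the flow-chart itself loops unconditionally back to step 1615.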
Figure 19 is a flow-chart (1000) illustrating the steps of the snoop FSM (1223) of figure 15 according to a preferred embodiment of the present invention. The process described in flow chart (1000) is a functional description which executes over 1 or more clock cycles.
In step 1001, start the snoop FSM process.
In step 1002, perform a blocking read to fetch the next element of snoop traffic received on the two snoop ports (1212, 1213) from the FIFO queue (1214). In this embodiment snoop traffic is encoded as a copy of the memory transfer request and its corresponding memory transfer response. Preferably all snoop traffic is transported and stored using forward error correcting techniques. For example, the use of triple modular replication of all signals and registers, the use of error correcting codes, or the use of double modular redundancy on communications paths with time-shifted redundant transmission of messages with error checking codes.
In step 1003, if a read memory transfer request is received in step 1002, go to step 1008. If a successful write memory transfer request has been received, go to step 1004. Otherwise go to step 1008. Preferably read memory transfer requests are not issued to the snoop ports (1212) and (1213).
In step 1004, issue a blocking command to the address tag finite state machine (1231) to search for the index of a cache-line by the address encoded in the memory transfer request received in step 1002.
In step 1005, if the cache-line is not present in the cache-line store (1230) then go to step 1008, otherwise go to step 1006.
In step 1006, issue a blocking command to the cache-line store (1230) to write a copy of the data stored in the memory transfer request into the corresponding cache-line in the cache-line store (1230). In this embodiment we have avoided adjusting the valid status flags to avoid introducing a modification of the execution time for memory transfer requests issued on the interconnect target port (1210). This is the preferred mode of operation when the processor core is not fully timing compositional and suffers from timing anomalies.
In an alternate preferred embodiment of the present invention, a non-blocking command is issued to the status tag finite state machine (1232) to update which portions of the cache-lines are valid.
This may accelerate the execution time of memory transfer requests issued on the interconnect target port (1210) but may introduce additional complexity when performing worst case execution time analysis of software running on the core associated with this cache.
In step 1008, end the snoop FSM (1223) process.
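The write-snoop path of steps 1003 to 1006 can be sketched as follows. The dictionary data model and flag names are illustrative assumptions; the specification works in terms of the cache-line store (1230) and the status tag finite state machine (1232):

```python
def snoop_write(cache_lines, addr, data, update_valid_flags=False):
    """Sketch of the snoop FSM write path: a successful write observed on a
    snoop port updates a cache-line that is already present in the store.

    In the preferred mode the valid flags are left untouched, so the timing
    of memory transfer requests on the interconnect target port (1210) is
    unchanged; the alternate embodiment marks the written portions valid.
    `cache_lines` maps addresses to dicts with 'data' and 'valid' keys
    (a hypothetical data model). Returns True if a line was updated."""
    line = cache_lines.get(addr)
    if line is None:
        return False            # step 1005: line not present, nothing to do
    line['data'] = data         # step 1006: refresh the stored copy
    if update_valid_flags:      # alternate embodiment only
        line['valid'] = True
    return True
```

The default argument corresponds to the preferred, timing-neutral mode of operation described above.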
The cache module of figure 15 is employed as: the cache modules {733.a, 733.b}, {743.a, 743.b} of figure 6; and the cache modules {1351.a, 1351.b}, {1352.a, 1352.b} of figure 11.
In this way we have now described how the shared memory computing devices of figures 6 and 11 comprise: N fully associative cache modules, where the value of N is at least 1, each fully associative cache module comprising: a master port; a target port; a means to track dirty cache-lines; and a finite state machine with one or more policies, in which at least one policy: employs an allocate-on-read strategy; employs an allocate-on-write strategy; and employs a least recently used eviction strategy; and N processor cores, in which each core is assigned a different one of the N fully associative cache modules as its private cache.
The combined use of fully-associative write-back cache modules with a least recently used eviction scheme as thus described is particularly well suited for upper-bound WCET analysis.
In contrast, set-associative write-back caches with any type of eviction scheme (a mode of operation found in a very large number of commercial computer architectures) are highly undesirable for upper-bound WCET analysis due to the interaction between: unknown effective addresses, the set-associative cache architecture, and the eviction of dirty cache-lines as a result of unknown effective addresses.
With unknown effective addresses, for example as may occur as a result of a data-dependent look-up to an array that occupies more than one cache-line, it is not possible to statically determine exactly which set of the set-associative cache is accessed. As a result, upper-bound WCET analysis tools must make conservative assumptions about any one of the sets of the cache that could have been accessed by that unknown effective address. In a 4-way set-associative cache, this can lead to the pessimistic assumption by an upper-bound WCET analysis tool that a full 25% of the cache-lines in the cache store may not be present. In both write-through and write-back modes of operation, upper-bound WCET analysis tools work on the worst case assumption that none of those potentially evicted cache-lines will now be present and that a read memory transfer request to a cache-line that was present must be re-read. However in write-back mode of operation, upper-bound WCET analysis tools must also make pessimistic assumptions about the write-back operations that may occur as a result of cache-lines that were dirty before the unknown effective address lookup. Furthermore, if the cache-lines are backed in SDRAM using an open-page mode of operation, those write-back operations may adjust which rows are open in that SDRAM and thus the timing of transactions to that SDRAM. Consequently this combination of write-back mode of operation with set-associative caches can result in quite pessimistic upper-bound WCET results when compared to write-through mode of operation with set-associative caches. The latter is the most popular mode of operation for performing upper-bound WCET analysis today.
In contrast, a fully-associative cache with a least recently used eviction scheme does not introduce any ambiguity as to which cache-line would be evicted on an unknown effective address. Using fully-associative caches with least recently used eviction schemes and write-back mode of operation as described above will tend to result in better upper-bound WCET analysis results when compared to set-associative caches with write-through mode of operation, and fully-associative caches with least recently used eviction schemes and write-through mode of operation.
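The determinism argument above can be illustrated with a small sketch: on a miss, a fully-associative LRU cache has exactly one possible victim regardless of the effective address, while in a set-associative cache an unknown effective address forces WCET analysis to consider one candidate victim per set the address might map to. The helper names and the modulo set-indexing are illustrative assumptions:

```python
def lru_victim(lru_order):
    """In a fully-associative LRU cache, the eviction victim on any miss is
    uniquely the least recently used line, independent of the (possibly
    statically unknown) effective address."""
    return lru_order[0]             # head of the LRU order is always evicted

def set_associative_victims(addr_candidates, num_sets, lru_per_set):
    """In a set-associative cache, an unknown effective address may map to
    any of several sets, so analysis must assume any of those sets' LRU
    lines could be the victim (modulo indexing is an assumed set function)."""
    return {lru_per_set[a % num_sets] for a in addr_candidates}
```

With a fully unknown address every set is a candidate, which is exactly the pessimistic 1-in-4 assumption described for the 4-way case above.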
This technique can be used with some processor cores that do exhibit timing effects (such as the Freescale MPC755), although it is preferred that those cores do not exhibit timing effects.
Figure 20 is a diagram illustrating the fields of a memory transfer request (2000) and the fields of its corresponding memory transfer response (2010), which includes a copy of the corresponding memory transfer request (2000), according to a preferred embodiment of the present invention. In figure 20, the memory transfer request (2000) comprises: an 8-bit field (2001) uniquely identifying an interconnect-master within the computing architecture; an 8-bit field (2002) indicating the transaction ID for that interconnect-master; a 4-bit field (2003) indicating the transaction type, for example, a read or write memory transfer request type; a 5-bit field (2004) used to indicate the size of the memory transfer request in bytes; a 32-bit field (2005) used to indicate the address of the memory transfer request in bytes; and a 256-bit field (2006) used to store the data to write for write memory transfer requests.
In figure 20, the memory transfer response (2010) comprises: a copy of the memory transfer request, which comprises: an 8-bit field (2001) uniquely identifying an interconnect-master within the computing architecture; an 8-bit field (2002) indicating the transaction ID for that interconnect-master; a 4-bit field (2003) indicating the transaction type, for example, a read or write memory transfer request type; a 5-bit field (2004) used to indicate the size of the memory transfer request in bytes; a 32-bit field (2005) used to indicate the address of the memory transfer request in bytes; and a 256-bit field (2011) used to store the data to write for write memory transfer requests; and a 4-bit response status field (2012).
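As a rough illustration of the field widths of figure 20, the following sketch packs a memory transfer request into a single integer. The least-significant-field-first layout is an assumption made for the sketch; figure 20 does not fix a wire encoding:

```python
def pack_request(master_id, txn_id, txn_type, size, addr, data):
    """Sketch packing the memory transfer request fields of figure 20:
    8-bit master id (2001), 8-bit transaction ID (2002), 4-bit transaction
    type (2003), 5-bit size (2004), 32-bit address (2005) and 256-bit write
    data (2006), i.e. 313 bits in total."""
    assert master_id < 2**8 and txn_id < 2**8 and txn_type < 2**4
    assert size < 2**5 and addr < 2**32 and data < 2**256
    word = master_id                # bits 0..7   : field (2001)
    word |= txn_id << 8             # bits 8..15  : field (2002)
    word |= txn_type << 16          # bits 16..19 : field (2003)
    word |= size << 20              # bits 20..24 : field (2004)
    word |= addr << 25              # bits 25..56 : field (2005)
    word |= data << 57              # bits 57..312: field (2006)
    return word
```

The corresponding response of figure 20 would append the 4-bit status field (2012) and carry read data in field (2011) in the same position as the write data.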
The field (2011) is used to store the data read for read memory transfer requests. Figure 20 illustrates that the memory transfer response has all the essential meta-data used in the original memory transfer request. In preferred embodiments, bus protocols do not use the transaction ID field (2002) if they do not employ transaction IDs. Various embodiments of the invention may be implemented in many different forms, including computer program logic for use with a processor (e.g., a microprocessor, microcontroller, digital signal processor, or general purpose computer), programmable logic for use with a programmable logic device (e.g., a field programmable gate array (FPGA) or other PLD), discrete components, integrated circuitry (e.g., an application specific integrated circuit (ASIC)), or any other means including any combination thereof. In an exemplary embodiment of the present invention, predominantly all of the communication between users and the server is implemented as a set of computer program instructions that is converted into a computer executable form, stored as such in a computer readable medium, and executed by a microprocessor under the control of an operating system.
Computer program logic implementing all or part of the functionality described herein may be embodied in various forms, including a source code form, a computer executable form, and various intermediate forms (e.g., forms generated by an assembler, compiler, linker, or locater). Source code may include a series of computer program instructions implemented in any of various programming languages (e.g., an object code, an assembly language, or a high-level language such as Ada SPARK, Fortran, C, C++, JAVA, Ruby, or HTML) for use with various operating systems or operating environments. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form.
The computer program may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM or DVD-ROM), a PC card (e.g., a PCMCIA card), or other memory device. The computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and inter-networking technologies. The computer program may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the internet or world wide web).
Hardware logic (including programmable logic for use with a programmable logic device) implementing all or part of the functionality described herein may be designed using traditional manual methods, or may be designed, captured, simulated, or documented electronically using various tools, such as computer aided design (CAD), a hardware description language (e.g., VHDL or AHDL), or a PLD programming language (e.g., PALASM, ABEL, or CUPL).
Programmable logic may be fixed either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM or DVD-ROM), or other memory device. The programmable logic may be fixed in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and internetworking technologies. The programmable logic may be distributed as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the internet or world wide web).
Throughout this specification, the words "comprise", "comprised", "comprising" and "comprises" are to be taken to specify the presence of stated features, integers, steps or components but do not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.


Claims
1. A shared memory computing device comprising: a shared memory; at least one interconnect master, in which each interconnect master is adapted to issue memory transfer requests that can be received by the shared memory; N cache modules, where the value of N is at least 1, each cache module comprising: a master port; a target port that is adapted to issue memory transfer requests that can be received by the shared memory; and means to implement an update-type cache coherency policy; M processor cores, where the value of M is equal to the value of N, in which each processor core: is assigned a different one of the N cache modules as that processor core's private cache; and in which the memory access latency of non-atomic memory transfer requests issued by each of the M processor cores is not modified by: the memory transfer requests issued by any of the at least one interconnect masters.
2. A shared memory computing device as claimed in claim 1, in which the value of N is at least 2 and in which the memory access latency of non-atomic memory transfer requests issued by each of the M processor cores is not modified by the memory transfer requests issued by any of the other M processor cores.
3. A shared memory computing device as claimed in claim 2, in which at least one of the N cache modules is adapted to maintain coherency with regard to the data of the write memory transfer requests received on the target port of a different one of the N cache modules.
4. A shared memory computing device as claimed in any one of claims 1 to 3, in which at least one of the N cache modules is adapted to maintain coherency with regard to the data of the write memory transfer requests issued by one of the at least one interconnect masters to a memory address located in the shared memory.
5. A shared memory computing device as claimed in any one of claims 1 to 4, in which at least one of the N cache modules is a fully associative cache.

This is the last page of the specification and claims. Pages 65 to 80 are intentionally left blank.
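The update-type cache coherency policy recited in claim 1 can be illustrated with a minimal sketch: on a write, peer caches that hold a copy of the line receive the new value in place, rather than being invalidated. All class and method names below are illustrative assumptions, not the patent's terminology, and the sketch omits the port structure, interconnect arbitration, and the latency-isolation guarantees the claims recite.

```python
# Minimal sketch (illustrative only) of an update-type cache coherency
# policy: N private caches over one shared memory; a write pushes the
# new value to peer caches holding the line instead of invalidating it.

class SharedMemory:
    def __init__(self, size):
        self.data = [0] * size

class CacheModule:
    """Private cache for one processor core (fully associative here)."""
    def __init__(self, shared_mem):
        self.lines = {}          # addr -> cached value
        self.mem = shared_mem
        self.peers = []          # other cache modules kept coherent

    def read(self, addr):
        if addr not in self.lines:       # miss: fill from shared memory
            self.lines[addr] = self.mem.data[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value         # update the local copy
        self.mem.data[addr] = value      # write through to shared memory
        for peer in self.peers:          # update-type policy: push the new
            if addr in peer.lines:       # value to peers holding the line
                peer.lines[addr] = value # (no invalidation)

# Two cores, each with its own private cache over one shared memory.
mem = SharedMemory(16)
c0, c1 = CacheModule(mem), CacheModule(mem)
c0.peers, c1.peers = [c1], [c0]

c1.read(4)       # core 1 caches address 4
c0.write(4, 99)  # core 0 writes; core 1's cached copy is updated in place
assert c1.lines[4] == 99 and mem.data[4] == 99
```

In contrast to an invalidate-type policy, the peer's subsequent read hits in its own cache with the fresh value, which is one way the claimed device can keep a core's read latency independent of other masters' writes.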
NZ716954A 2013-07-18 2014-07-17 Computing architecture with peripherals NZ716954B2 (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
AU2013902678 2013-07-18
AU2013902678A AU2013902678A0 (en) 2013-07-18 Computing architecture with peripherals
AU2013904532A AU2013904532A0 (en) 2013-11-25 Computing architecture with peripherals
AU2013904532 2013-11-25
PCT/IB2014/063189 WO2015008251A2 (en) 2013-07-18 2014-07-17 Computing architecture with peripherals

Publications (2)

Publication Number Publication Date
NZ716954A NZ716954A (en) 2021-02-26
NZ716954B2 true NZ716954B2 (en) 2021-05-27


Similar Documents

Publication Publication Date Title
US10210117B2 (en) Computing architecture with peripherals
Starke et al. The cache and memory subsystems of the IBM POWER8 processor
US7406086B2 (en) Multiprocessor node controller circuit and method
US8407432B2 (en) Cache coherency sequencing implementation and adaptive LLC access priority control for CMP
US7818388B2 (en) Data processing system, method and interconnect fabric supporting multiple planes of processing nodes
US8732398B2 (en) Enhanced pipelining and multi-buffer architecture for level two cache controller to minimize hazard stalls and optimize performance
US6279084B1 (en) Shadow commands to optimize sequencing of requests in a switch-based multi-processor system
US6249520B1 (en) High-performance non-blocking switch with multiple channel ordering constraints
US7380102B2 (en) Communication link control among inter-coupled multiple processing units in a node to respective units in another node for request broadcasting and combined response
US20010055277A1 (en) Initiate flow control mechanism of a modular multiprocessor system
JP2000231536A (en) Circuit having transaction scheduling of state base and its method
CN102375800A (en) Multiprocessor system-on-a-chip for machine vision algorithms
JP2016503934A (en) Context switching cache system and context switching method
US6877056B2 (en) System with arbitration scheme supporting virtual address networks and having split ownership and access right coherence mechanism
US7680971B2 (en) Method and apparatus for granting processors access to a resource
Ang et al. StarT-Voyager: A flexible platform for exploring scalable SMP issues
US7882309B2 (en) Method and apparatus for handling excess data during memory access
US7415030B2 (en) Data processing system, method and interconnect fabric having an address-based launch governor
US6145032A (en) System for recirculation of communication transactions in data processing in the event of communication stall
NZ716954B2 (en) Computing architecture with peripherals
JP2002198987A (en) Active port of transfer controller with hub and port
Lyberis et al. The 512-core Formic Hardware Prototype