CN113435153B - Method for designing a digital circuit interconnected by GPU (Graphics Processing Unit) cache subsystems


Info

Publication number
CN113435153B
Authority
CN
China
Prior art keywords
cache
region
slice
request
cache slice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110626551.9A
Other languages
Chinese (zh)
Other versions
CN113435153A (en)
Inventor
Wang Jun (王俊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhirui Electronic Technology Co ltd
Original Assignee
Shanghai Tiantian Smart Core Semiconductor Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tiantian Smart Core Semiconductor Co ltd filed Critical Shanghai Tiantian Smart Core Semiconductor Co ltd
Priority to CN202110626551.9A
Publication of CN113435153A
Application granted
Publication of CN113435153B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/30 Circuit design
    • G06F30/32 Circuit design at the digital level
    • G06F30/337 Design optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for designing a digital circuit that interconnects GPU cache subsystems. The method comprises: dividing the device memory address space of the GPU; setting up an independent crossbar within each region, which connects the compute cores (containing L1 cache slices) of the region as request initiators with the L2 cache slices of the region as request receivers; setting the caching mode of the L2 cache slices within the region; connecting the L2 cache slices in the region with the device memory and its memory controller; setting an original-owner cache within the region, responsible for the initial read and final write-back of the device memory addresses of the region; and combining each L2 cache slice of the region with one L2 cache slice of the other half region into a mirror L2 cache slice group. The invention saves chip routing area, shortens the average memory-access latency, improves the efficiency with which compute cores access cached data, and supports completing data sharing and synchronization between different cache levels at the hardware level.

Description

Method for designing a digital circuit interconnected by GPU (Graphics Processing Unit) cache subsystems
Technical Field
The invention belongs to the technical field of digital circuits, and particularly relates to a method for designing a digital circuit interconnected by GPU cache subsystems.
Background
FIG. 1 shows the interconnection structure of a conventional on-chip GPU cache subsystem. Each group of compute cores shares an independent private L1 cache, and all compute cores share a larger second-level (L2) cache. To provide sufficient memory-access bandwidth to all compute cores at the same time, the L2 cache is generally organized as a distributed cache: it is divided into a plurality of L2 cache slices according to the address space each access belongs to, each L2 cache slice provides a share of the access bandwidth, and all compute cores and all L2 cache slices are interconnected by a crossbar bus. A conventional 4x4 crossbar interconnect is shown in FIG. 2. When the crossbar routes requests to the L2 cache slices by address, the conventional method usually configures functions such as address hashing of the access requests to balance the traffic, so that the bandwidth of all L2 cache slices is fully utilized.
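To make the distributed-cache organization concrete, the following minimal Python sketch models how a crossbar might hash a request address onto one of the L2 cache slices; the slice count, cache-line size, and hash function are illustrative assumptions rather than values taken from the patent.

```python
# Illustrative model of the conventional distributed L2 organization: the
# crossbar hashes the request address to pick one of N L2 cache slices, so
# that traffic (and therefore bandwidth) is spread over all slices.
NUM_L2_SLICES = 4        # e.g. the 4x4 crossbar of FIG. 2 (assumed)
CACHE_LINE_BYTES = 128   # assumed cache-line size

def route_to_l2_slice(address: int) -> int:
    """Return the index of the L2 cache slice serving this address."""
    line_index = address // CACHE_LINE_BYTES
    return line_index % NUM_L2_SLICES   # simple interleaving hash (assumed)

if __name__ == "__main__":
    # Consecutive cache lines are spread round-robin over the four slices,
    # which is what balances the bandwidth across them.
    for addr in (0x0000, 0x0080, 0x0100, 0x0180):
        print(f"address {addr:#06x} -> L2 slice {route_to_l2_slice(addr)}")
```

Because every request initiator must be able to reach every slice, such an organization needs a full M x N crossbar, which is exactly the scaling problem described next.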
As the internal scale of GPUs keeps expanding, organizing GPU cache access in this distributed-cache manner mainly faces the following problems:
1. The more compute cores there are, the larger the total memory-access bandwidth requirement, and the larger the total L2 cache access bandwidth that must be provided. This requires splitting the L2 cache into more L2 cache slices and providing more crossbar lanes to meet the demand, and the routing area of the crossbar can become a bottleneck.
2. More compute cores and more L2 cache slices make the longest trace of the crossbar ever longer. When the task on a single compute core wants to use the whole address space, i.e. its access requests touch all L2 cache slices, the transfer of a data block ultimately depends on the delay of the longest-trace path from the compute core to some L2 cache slice.
Therefore, the digital design of the GPU cache subsystem and its interconnection needs to solve the above problems. Furthermore, early GPUs neither synchronized frequently nor shared data; instead, by exposing thread information and the cache structure to programmers, synchronization and data sharing were achieved through software, without hardware coherence.
GPGPU is now becoming more and more popular: engineers are beginning to use GPU architectures for general-purpose tasks, which requires the GPU to implement hardware cache coherence. LRCC (Lazy Release Consistency-directed Coherence) is a cache coherence protocol suited to GPUs for data synchronization and sharing. The protocol is based on a producer-consumer memory consistency model and realizes data sharing and synchronization between on-chip L1 cache slices through an acquire-release mechanism. Synchronization occurs when a consumer tries to acquire a flag that the producer has released: the L1 cache slice on the producer side writes the updated shared data in its cache back to a point visible to the consumer (generally a publicly visible L2 cache slice), and the L1 cache slice on the consumer side invalidates the old data in its cache when synchronization completes, so that subsequent access requests do not read stale cached data but fetch the latest data from the synchronization point. The L2 cache slice acts as the synchronization point for the shared data of the L1 cache slices and is responsible for recording the state of each cache line, i.e. its ownership; when synchronization occurs, it instructs the producer and consumer L1 caches through request interaction to start and complete the data synchronization. The logical behavior of L1 in the LRCC protocol is shown in FIG. 3, and the logical behavior of the L2 cache slice in the LRCC protocol is shown in FIG. 4. GetV and GetO in the figures are two types of read requests defined by LRCC: GetV does not require ownership of the address, whereas GetO acquires ownership of the address. The concrete LRCC cache-coherence data-synchronization flow is shown in FIG. 5.
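As an informal illustration of the producer-consumer idea behind LRCC (the authoritative behaviors are those of FIG. 3 to FIG. 5), the following Python sketch models an L2 cache slice acting as synchronization point: it records flag ownership on a GetO, and on a GetV it asks the owning producer L1 to write its updated shared data back to the publicly visible point. All class names, data structures, and addresses here are assumptions made for demonstration only.

```python
class L2SyncPoint:
    """L2 cache slice as synchronization point: records which L1 owns a
    flag address and holds the data that is publicly visible."""
    def __init__(self):
        self.owner = {}   # flag address -> id of the producer L1 that released it
        self.data = {}    # address -> value visible to all consumers

    def get_o(self, addr, l1_id):
        # GetO: the releasing producer takes ownership of the flag address.
        self.owner[addr] = l1_id

    def get_v(self, addr, producers):
        # GetV: a consumer asks for the flag without taking ownership; if a
        # producer owns it, the producer's dirty shared data is written back first.
        owner = self.owner.get(addr)
        if owner is not None:
            self.data.update(producers[owner].write_back())
        return self.data.get(addr)

class ProducerL1:
    def __init__(self, dirty):
        self.dirty = dirty   # updated shared data still private to this L1

    def write_back(self):
        written, self.dirty = self.dirty, {}
        return written

producer = ProducerL1({0x100: 42, 0x200: 1})     # 0x200 plays the role of the flag (assumed)
l2 = L2SyncPoint()
l2.get_o(0x200, l1_id=0)                         # producer releases the flag
flag = l2.get_v(0x200, producers={0: producer})  # consumer acquires it
print("consumer sees flag =", flag, "and shared data =", l2.data[0x100])
# After this point the consumer L1 invalidates its stale copies (not modelled here).
```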
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method for designing digital circuits interconnected by GPU cache subsystems, aiming at the deficiencies of the prior art.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
a method for designing digital circuits interconnected by GPU cache subsystems comprises the following steps:
Step 1: partition the device memory address space of the GPU, and set up the compute cores, L1 cache slices, L2 cache slices, and the device memory with its memory controller;
Step 2: set up an independent crossbar within each region, which connects the compute cores (containing L1 cache slices) of the region as request initiators with the L2 cache slices of the region as request receivers;
Step 3: set the caching mode of the L2 cache slices within the region;
Step 4: connect the L2 cache slices in the region with the device memory and its memory controller;
Step 5: set an original-owner cache within the region, responsible for the initial read and final write-back of the device memory addresses of the region;
Step 6: combine each L2 cache slice of the region with one L2 cache slice of the other half region into a mirror L2 cache slice group.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the step 1 is specifically as follows:
The device memory address space of the GPU is divided into two regions, and each region is provided with: half of the GPU compute cores with their private L1 cache slices, a shared L2 cache occupying half of the total L2 capacity, and device memory with its memory controller occupying half of the total capacity.
The step 2 is specifically:
An independent crossbar is set up within each region, connecting the compute cores of the region (containing L1 cache slices) as request initiators with the L2 cache slices of the region as request receivers.
The step 3 is specifically:
setting a cache mode of an L2 cache slice in the area:
Adopting the conventional distributed-cache organization, the L2 cache of each region is divided into a plurality of L2 cache slices according to the actual bandwidth requirement of the system, and these slices are connected to the crossbar.
The step 4 is specifically:
the connections between the L2 cache tiles and the device memory and its memory controller are indicated by separate lines, with several L2 cache tiles corresponding to one device memory path.
In step 5 above, the L2 cache slice directly connected to a device memory is defined as the original-owner cache of the corresponding device memory access addresses.
In step 6, the two L2 cache slices in a mirror L2 cache slice group are connected by a single direct path.
The routing logic of the digital circuits interconnected by the designed GPU cache subsystem is as follows:
When a compute core (containing an L1 cache slice) issues a request to access the L2 cache, the request is first routed through the crossbar of its region, and the crossbar routes it to a certain L2 cache slice of the region according to the request address;
half of the addresses routed to that L2 cache slice belong to the address range for which the slice itself is the original owner;
an L2 cache slice may receive access requests from the crossbar or from its mirror L2 cache slice. For a cache-miss request, the slice judges from the address whether the request belongs to the local region: a request belonging to the local region is read from the corresponding device memory channel through the path between the L2 cache slice and the device memory, while a request not belonging to the local region is forwarded over the direct path of the mirror L2 cache slice group and read by the mirror L2 cache slice.
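The routing decision described above can be summarized by the following behavioral sketch in Python; the region size, slice count, and address hash are illustrative assumptions, and the returned strings merely name the two data paths of FIG. 7.

```python
NUM_REGIONS = 2
SLICES_PER_REGION = 2
REGION_SIZE = 1 << 30    # assumed: each region owns half of the device memory space

def region_of(address: int) -> int:
    """Region (0 or 1) that is the original owner of this address."""
    return (address // REGION_SIZE) % NUM_REGIONS

def crossbar_route(address: int) -> int:
    """The crossbar of the requesting region picks a local L2 slice by address."""
    return (address >> 7) % SLICES_PER_REGION   # assumed interleaving hash

def handle_miss(slice_region: int, address: int) -> str:
    """What an L2 cache slice does with a cache-miss request (compare FIG. 7)."""
    if region_of(address) == slice_region:
        return "read from the local device-memory channel (original owner)"
    return "forward over the mirror L2 cache slice group to the other region"

addr_local = 0x0000_1000
addr_remote = REGION_SIZE + 0x0000_1000
print("crossbar picks local slice", crossbar_route(addr_local))
print(handle_miss(0, addr_local))    # local address: goes to device memory
print(handle_miss(0, addr_remote))   # remote address: goes via the mirror slice
```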
LRCC-based cache coherence at the GPU L2 cache slice level is achieved by improving the LRCC protocol: extending it to the L2 cache slice level and defining its logical behavior.
The invention has the following beneficial effects:
1. compared with the L2 cache slice connection mode of the conventional GPU, the connection mode of the invention has the advantages that M represents the total number of GPU request initiators, N represents the total number of GPU L2 cache slices, the connection paths required by the conventional method for achieving the same connection effect (all L2 cache slices can be accessed) are M x N groups in total, and the connection paths of the method are M (N/2) + N/2 groups in total. That is, the method of the present invention has less (MN-N)/2 sets of connection paths than the conventional method, as shown by comparing fig. 2 with 4X4 connection of fig. 10, 6 sets of connection paths can be saved, and chip routing area is saved.
2. The region-partitioning method and the connection scheme based on mirror L2 cache slice groups halve the number of routing targets of each crossbar, which helps shorten the routing path length within a region. Data whose address belongs to the far-away other region, which a compute core would otherwise have to fetch over a long path, can now be cached via the mirror L2 cache slice group in the nearer L2 cache slice of its own region, shortening the average access latency and improving the efficiency with which compute cores access cached data.
3. The mirror L2 cache slice group connection scheme essentially means that non-unique L2 cache slices exist in the system, so a cache coherence problem arises between L2 cache slices. By extending the LRCC protocol to the L2 cache slices and following the cache logical behavior defined by the invention, the method supports completing data sharing and synchronization between different cache levels at the hardware level.
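The connection-path counting of advantage 1 can be reproduced with the following few lines; M, N and the 4x4 example are those quoted above, and the two helper functions are only a convenience for the arithmetic.

```python
def conventional_paths(m: int, n: int) -> int:
    # Full M x N crossbar: every initiator reaches every L2 cache slice directly.
    return m * n

def proposed_paths(m: int, n: int) -> int:
    # Each region's crossbar reaches only its own N/2 slices, plus the N/2
    # direct links inside the mirror L2 cache slice groups.
    return m * (n // 2) + n // 2

m, n = 4, 4
print(conventional_paths(m, n))   # 16
print(proposed_paths(m, n))       # 10
print((m * n - n) // 2)           # 6 groups saved, as stated above
```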
Drawings
FIG. 1 is an interconnection structure of conventional on-chip cache subsystems of a GPU;
FIG. 2 is a conventional 4X4 cross-bus interconnect;
FIG. 3 is the logical behavior of the L1 cache slice in the LRCC protocol;
FIG. 4 is the logical behavior of the L2 cache slice in the LRCC protocol;
FIG. 5 is a LRCC protocol cache coherency data synchronization flow;
FIG. 6 is a schematic diagram of the memory access subsystem connection relationship of the present invention;
FIG. 7 is a schematic diagram of the data path taken when a read request from compute core 1 of region 1 accesses L2 cache slice 1 and a cache miss occurs, according to an embodiment of the present invention;
fig. 8 is the logical behavior of the L1 cache slice in the extended LRCC protocol of the present invention;
FIG. 9 is the logical behavior of the L2 cache slice in the extended LRCC protocol of the present invention;
FIG. 10 is a schematic diagram of a crossbar connection of the present invention with 4 request initiators and 4 L2 cache slices;
FIG. 11 is a schematic diagram illustrating a process of accessing a non-local area address space by a normal access request based on the digital circuit design method of the present invention and finally writing back to the device memory in a software synchronization manner;
FIG. 12 is a flow chart of the present invention for achieving hardware consistency by using the extended LRCC algorithm to complete data synchronization;
FIG. 13 is a schematic diagram of the synchronization flow for scenario 2, in which the producer and the consumer belong to region 1 and the data to be synchronized belongs to region 2, in the embodiment of the present invention;
FIG. 14 is a schematic diagram of the synchronization flow for scenario 3, in which the producer and the data to be synchronized belong to region 1 and the consumer belongs to region 2, in the embodiment of the present invention;
FIG. 15 is a schematic diagram of the synchronization flow for scenario 4, in which the producer belongs to region 1 and the consumer and the data belong to region 2, in the embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 6, a method for designing a digital circuit interconnected by GPU cache subsystems includes:
Step 1: partition the device memory address space of the GPU, and set up the compute cores, L1 cache slices, L2 cache slices, and the device memory with its memory controller;
In the embodiment, the device memory address space of the GPU is divided into two regions, and each region is provided with: half of the GPU compute cores with their private L1 cache slices, a shared L2 cache occupying half of the total L2 capacity, and device memory with its memory controller occupying half of the total capacity;
Step 2: set up an independent crossbar within each region, which connects the compute cores (containing L1 cache slices) of the region as request initiators with the L2 cache slices of the region as request receivers;
In the embodiment, each region is provided with an independent crossbar that connects the compute cores containing the L1 cache slices of the region as request initiators with the L2 cache slices of the region as request receivers;
Step 3: set the caching mode of the L2 cache slices within the region;
In the embodiment, the L2 cache of each region adopts the conventional distributed-cache organization: it is divided into a plurality of L2 cache slices according to the actual bandwidth requirement of the system, and these slices are connected to the crossbar;
Step 4: connect the L2 cache slices in the region with the device memory and its memory controller;
The specific connection mode between the L2 cache slices and the device memory with its memory controller is not the focus of the invention;
In the embodiment, the L2 cache slices are connected with the device memory and its memory controller, several L2 cache slices corresponding to one device memory channel, each such correspondence being represented by a single link;
Step 5: set an original-owner cache within the region, responsible for the initial read and final write-back of the device memory addresses of the region;
In the embodiment, the L2 cache slice directly connected to a device memory is defined as the original-owner cache of the corresponding device memory access addresses;
Step 6: combine each L2 cache slice of the region with one L2 cache slice of the other half region into a mirror L2 cache slice group.
In the embodiment, the two L2 cache slices in a mirror L2 cache slice group are connected by a single direct path.
The routing logic of the digital circuits interconnected by the designed GPU cache subsystem is as follows:
When a compute core (containing an L1 cache slice) issues a request to access the L2 cache, the request is first routed through the crossbar of its region, and the crossbar routes it to a certain L2 cache slice of the region according to the request address;
half of the addresses routed to that L2 cache slice belong to the address range for which the slice itself is the original owner (a region contains only half of the address space, while the request address may belong to the other region);
an L2 cache slice may receive access requests from the crossbar or from its mirror L2 cache slice. For a cache-miss request, the slice judges from the address whether the request belongs to the local region: a request belonging to the local region is read from the corresponding device memory channel through the path between the L2 cache slice and the device memory, while a request not belonging to the local region is forwarded over the direct path of the mirror L2 cache slice group and read by the mirror L2 cache slice, as shown in FIG. 7.
In the embodiment, LRCC-based cache coherence at the GPU L2 cache slice level is achieved by improving the LRCC protocol: extending it to the L2 cache slice level and defining its logical behavior.
That is, the extended LRCC protocol is used between the L2 cache slices inside the GPU to complete hardware cache coherence. In the extended protocol, the logical behavior of the L1 cache slice is shown in FIG. 8, and the logical behavior of the L2 cache slice is shown in FIG. 9.
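As a rough behavioral sketch of this extension (the authoritative definitions are those of FIG. 8 and FIG. 9), the following Python fragment shows the core decision an L2 cache slice makes for a coherence request: serve it as original owner, or forward it over the direct path of the mirror L2 cache slice group. The request names follow GetO/GetV, while the class, its fields, and the address split are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class L2Slice:
    region: int
    mirror: "L2Slice | None" = None
    flag_owner: dict = field(default_factory=dict)   # flag address -> producer id

    def handle(self, req: str, addr: int, requester: int, region_of):
        if region_of(addr) != self.region:
            # Not the original owner of this address: forward over the mirror path.
            return self.mirror.handle(req, addr, requester, region_of)
        if req == "GetO":    # producer releases: record ownership of the flag
            self.flag_owner[addr] = requester
            return "ownership recorded"
        if req == "GetV":    # consumer acquires: trigger the producer's write-back
            owner = self.flag_owner.get(addr)
            return (f"request write-back from producer {owner}"
                    if owner is not None else "respond with current value")

region_of = lambda addr: 1 if addr < 0x1000 else 2   # assumed two-region address split
l2_region1, l2_region2 = L2Slice(region=1), L2Slice(region=2)
l2_region1.mirror, l2_region2.mirror = l2_region2, l2_region1

# Producer in region 1 releases a flag whose address belongs to region 2:
print(l2_region1.handle("GetO", 0x2000, requester=7, region_of=region_of))
# Consumer in region 2 acquires the same flag; region 2's slice is the original owner:
print(l2_region2.handle("GetV", 0x2000, requester=9, region_of=region_of))
```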
Examples
FIG. 11 is a schematic diagram of the flow in which, based on the digital circuit design method of the invention, an ordinary access request accesses the address space of the non-local region and is finally written back to device memory under software synchronization (software indicates that the request is to be written through and invalidated). (In the following figures the memory access and data synchronization processes are emphasized; to keep the figures simple, the multiple L2 cache slices of a region are not drawn according to the actual distributed-cache organization, but the L2 cache of one region is abstracted as a single logical whole. The same applies below.) The flow is:
region 1 computing core generates access instruction, address A belongs to region 2, L1 cache slice is not accessed and missed, read request is sent to region 1 cross bus, region 1 cross bus routes request to L2 cache slice of region 1
Secondly, the cache slice of the region 1L2 also misses the access, the request address A belongs to the non-local region space, and the request is sent to the L2 cache slice of the region 2 through the mirror image L2 cache slice group direct access
Region 2L2 cache plate also has access miss, request address A belongs to local region space, and sends request to corresponding equipment memory
And fourthly, the device memory returns data to the L2 cache slice in the area 2, the L2 cache slice in the area 2 returns data to the L2 cache slice in the area 1, and the L2 cache slice in the area 1 returns data to the L1 cache slice in the area 1 and caches the data. The region 1L1 cache slice receives the data and caches it.
Region 1 computing core writes updated A to region 1L1 cache slice by indicating write-through and invalidation (software control), L1 cache slice access hit, and invalidates its own A address cache data for write-through invalid request, and sends the request and data to region 1L2 cache slice
Region 1L2 cache slice also has access hit, and for write-through invalid request, invalidates its own A address cache data, and sends the request and data to region 2L2 cache slice
And the memory access of the cache piece of the region 2L2 is not hit, and the request and the data are directly written through and written into the corresponding equipment memory for the write-through invalid request.
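The write-through-and-invalidate chain of steps ⑤ to ⑦ can be pictured with the following small sketch, in which each cache level drops its own copy and the write finally lands in the device memory of the region that owns the address; the dictionaries standing in for the caches and the device memory are, of course, only an illustration.

```python
def write_through_invalidate(levels, addr, data):
    """levels: ordered caches from the L1 down to the owning region's L2."""
    for cache in levels:
        cache.pop(addr, None)        # each level invalidates its own copy
    device_memory[addr] = data       # the write lands in device memory

device_memory = {}
l1_region1 = {0x2000: "old A"}       # copies cached during the earlier read
l2_region1 = {0x2000: "old A"}
l2_region2 = {}

write_through_invalidate([l1_region1, l2_region1, l2_region2], 0x2000, "new A")
print(device_memory[0x2000])         # 'new A'
print(0x2000 in l1_region1)          # False: the stale copy was invalidated
```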
FIG. 12 shows the flow of achieving hardware coherence and completing data synchronization with the extended LRCC algorithm according to the invention. This example is scenario 1: the producer, the consumer, and the data to be synchronized all belong to the same region. The basic synchronization flow is:
① The producer (corresponding to an L1 cache slice; likewise in the steps below) releases the flag and, according to the behavior logic of FIG. 8, sends a GetO flag request to the crossbar, which routes the request to the L2 cache slice.
② The consumer (corresponding to an L1 cache slice; likewise in the steps below) acquires the flag and, according to the behavior logic of FIG. 8, sends a GetV flag request to the crossbar, which routes the request to the L2 cache slice.
③ The L2 cache slice sends a "request write-back" to the producer according to the behavior logic of FIG. 9.
④ The producer writes back the synchronization data and the flag according to the behavior logic of FIG. 8.
⑤ The L2 cache slice responds to the consumer's GetV flag request according to the behavior logic of FIG. 9. The consumer receives the response, invalidates its non-owned data according to the behavior logic of FIG. 8, and the data coherence synchronization is complete.
FIG. 13 shows the synchronization flow for scenario 2, in which the producer and the consumer belong to region 1 and the data to be synchronized belongs to region 2. The basic synchronization flow is:
the producer release flag, according to the behavioral logic of FIG. 8, sends a GetO flag request to the zone 1 crossbar, which routes the request to the L2 cache slice of zone 1.
According to the behavior logic of the figure 9, the L2 cache slice of the region 1 sends GetO to the L2 cache slice of the region 2 serving as the original owner of the flag through a mirror image L2 cache slice group direct connection path, and the L2 cache slice of the region 2 updates the occupancy state of the flag according to the behavior logic of the figure 9.
And thirdly, the consumer acquire flag, according to the behavior logic of fig. 8, sends a GetV flag request to the zone 1 cross bus, and the cross bus routes the request to the L2 cache slice of zone 1.
Region 1's L2 cache slice sends a "request write back" to the producer according to the behavior logic of fig. 9.
Producer writes back sync data and flag to L2 cache slice of region 1 according to the behavior logic of fig. 8.
Sixthly, the L2 cache slice of the region 1 responds to the GetV flag request of the consumer according to the behavior logic of FIG. 9. The consumer receives the response, invalidates its own non-owned data according to the behavior logic of fig. 8, and completes the data consistency synchronization.
FIG. 14 shows the synchronization flow for scenario 3, in which the producer and the data to be synchronized belong to region 1 and the consumer belongs to region 2. The basic synchronization flow is:
the producer release flag, according to the behavior logic of fig. 8, sends a GetO flag request to the zone 1 cross bus, which routes the request to the L2 cache slice of zone 1.
(vii) Consumer acquire flag, according to the behavioral logic of FIG. 8, sends a GetV flag request to zone 2 crossbar, which routes the request to the L2 cache slice of zone 2.
③ the flag of the L2 cache slice in the region 2 is in an invalid state, and according to the behavior logic of fig. 9, GetV is directly sent to the original owner of the flag through the mirror image L2 cache slice group, i.e., the L2 cache slice in the region 1.
Region 1's L2 cache slice sends a ' request writeback ' to producer according to the behavior logic of FIG. 9
Producer writes back synchronization data and flag according to behavior logic of FIG. 8
Sixthly, the L2 cache slice of the region 1 responds to the GetV flag request according to the behavior logic of FIG. 9. The L2 cache tile of region 2 receives the response and invalidates its non-owned data according to the behavior logic of fig. 9.
⑦ After completing the invalidation, the L2 cache slice of region 2 responds to the consumer's GetV flag request. The consumer receives the response and invalidates its non-owned data according to the behavior logic of FIG. 8. The whole data synchronization is complete. (The producer writes back; the consumer and the region 2 L2 cache slice invalidate their old data.)
FIG. 15 shows the synchronization flow for scenario 4, in which the producer belongs to region 1 and the consumer and the data belong to region 2. The basic synchronization flow is:
the producer release flag, according to the behavioral logic of FIG. 8, sends a GetO flag request to the zone 1 crossbar, which routes the request to the L2 cache slice of zone 1.
Secondly, according to the behavior logic of the FIG. 9, the L2 cache slice of the region 1 directly sends the GetO flag request to the original owner of the flag through the mirror image L2 cache slice group, and the L2 cache slice of the region 2. The L2 cache slice for region 2 updates the flag occupancy state according to the behavior logic of FIG. 9.
And the consumer acquire flag sends a GetV flag request to the zone 2 cross bus according to the behavior logic of FIG. 8, and the cross bus routes the request to the L2 cache slice of the zone 2.
Flag original owner, according to the behavior logic of fig. 9, sending "request write back" to the L2 cache slice in region 2, directly connecting the request write back through the mirror image L2 cache slice group, and sending the request write back to the L2 cache slice in region 1.
Region 1's L2 cache slice sends a "request writeback" to the producer according to the behavior logic of FIG. 9.
Sixthly, the producer writes back the synchronization data and flag to the L2 cache slice of region 1 according to the behavior logic of FIG. 8.
The L2 cache slice in the region 1 directly writes back all the data which is not owned by the L2 cache slice group through the mirror image L2 after receiving the flag written back by the producer according to the behavior logic of the graph 9. And then the flag is written back to the original owner L2 cache slice.
And after receiving the flag written back by the cache slice L2 in the area 2 in the cache slice L2 in the area 1, responding to the GetV flag request of the consumer according to the behavior logic shown in the figure 9. The consumer receives the response, invalidates its own non-owned data according to the behavior logic of fig. 8, and completes the data consistency synchronization.
According to the invention, forming directly connected mirror L2 cache slice groups effectively reduces the routing area of the crossbar, solving the problem of routing area becoming a bottleneck in large GPU designs; it also shortens the crossbar paths, reduces the latency of accessing the cache, and improves the memory-access efficiency of large GPU designs. However, this approach breaks the premise of conventional GPU designs that all cores share a single L2 cache, i.e. non-unique copies may now exist among the L2 cache slices, which introduces a possible cache coherence problem at the L2 cache slice level.
In addition to the conventional solution to the possible cache coherence problem of the private L1 cache slices of the GPU, namely exposing thread information and the new cache structure to programmers so that they can realize synchronization and data sharing through software without hardware coherence, the invention also provides an LRCC-based cache coherence solution at the GPU L2 cache slice level by improving the LRCC protocol, extending it to the L2 cache slice level and defining its logical behavior, and thereby supports handling data sharing and synchronization between different cache levels in the presence of non-unique L2 cache slices.
The above are only preferred embodiments of the present invention, and the scope of the present invention is not limited to the above examples, and all technical solutions that fall under the spirit of the present invention belong to the scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims (7)

1. A method for designing a digital circuit interconnected by GPU cache subsystems, characterized by comprising the following steps:
Step 1: partition the device memory address space of the GPU, and set up the compute cores, L1 cache slices, L2 cache slices, and the device memory with its memory controller;
Step 2: set up an independent crossbar within each region, which connects the compute cores (containing L1 cache slices) of the region as request initiators with the L2 cache slices of the region as request receivers;
Step 3: set the caching mode of the L2 cache slices within the region;
Step 4: connect the L2 cache slices in the region with the device memory and its memory controller;
Step 5: set an original-owner cache within the region, responsible for the initial read and final write-back of the device memory addresses of the region;
Step 6: combine each L2 cache slice of the region with one L2 cache slice of the other half region into a mirror L2 cache slice group;
the step 1 specifically comprises the following steps:
the device memory address space of the GPU is divided into two regions, and each region is provided with: half of the GPU compute cores with their private L1 cache slices, a shared L2 cache occupying half of the total L2 capacity, and device memory with its memory controller occupying half of the total capacity;
the routing logic of the digital circuits interconnected by the designed GPU cache subsystem is as follows:
when a compute core (containing an L1 cache slice) issues a request to access the L2 cache, the request is first routed through the crossbar of its region, and the crossbar routes it to a certain L2 cache slice of the region according to the request address;
half of the addresses routed to that L2 cache slice belong to the address range for which the slice itself is the original owner;
an L2 cache slice may receive access requests from the crossbar or from its mirror L2 cache slice; for a cache-miss request, the slice judges from the address whether the request belongs to the local region: a request belonging to the local region is read from the corresponding device memory through the path between the L2 cache slice and the device memory, while a request not belonging to the local region is forwarded over the direct path of the mirror L2 cache slice group and read by the mirror L2 cache slice.
2. The method according to claim 1, wherein the step 2 specifically comprises:
an independent crossbar is provided within each region, connecting the compute cores of the region that contain the L1 cache slices as request initiators and the L2 cache slices of the region as request receivers.
3. The method according to claim 2, wherein the step 3 is specifically:
setting a cache mode of an L2 cache slice in the area:
adopting the conventional distributed-cache organization, the L2 cache of each region is divided into a plurality of L2 cache slices according to the actual bandwidth requirement of the system, and these slices are connected to the crossbar.
4. The method according to claim 3, wherein the step 4 is specifically:
the connections between the L2 cache tiles and the device memory and its memory controller are indicated by separate lines, with several L2 cache tiles corresponding to one device memory path.
5. The method as claimed in claim 4, wherein in step 5, the L2 cache slice directly connected to the device memory is defined as the original-owner cache of the corresponding device memory access addresses.
6. The method as claimed in claim 5, wherein in step 6, the L2 cache slices in the mirror L2 cache slice group are connected by a single path.
7. The method of claim 1, wherein LRCC-based cache coherence at the GPU L2 cache slice level is achieved by improving the LRCC protocol, extending it to the L2 cache slice level and defining its logical behavior.
CN202110626551.9A 2021-06-04 2021-06-04 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems Active CN113435153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110626551.9A CN113435153B (en) 2021-06-04 2021-06-04 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110626551.9A CN113435153B (en) 2021-06-04 2021-06-04 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems

Publications (2)

Publication Number Publication Date
CN113435153A CN113435153A (en) 2021-09-24
CN113435153B true CN113435153B (en) 2022-07-22

Family

ID=77804020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110626551.9A Active CN113435153B (en) 2021-06-04 2021-06-04 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems

Country Status (1)

Country Link
CN (1) CN113435153B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615576A (en) * 2015-03-02 2015-05-13 中国人民解放军国防科学技术大学 CPU+GPU processor-oriented hybrid granularity consistency maintenance method
CN105446935A (en) * 2014-09-30 2016-03-30 深圳市中兴微电子技术有限公司 Shared storage concurrent access processing method and apparatus
CN106201939A (en) * 2016-06-30 2016-12-07 中国人民解放军国防科学技术大学 Multinuclear catalogue concordance device towards GPDSP framework

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995796A (en) * 2014-05-29 2014-08-20 河南中医学院 Symphony orchestra type multi-core CPU and multi-internal-storage computer system
CN112368686A (en) * 2018-11-06 2021-02-12 华为技术有限公司 Heterogeneous computing system and memory management method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446935A (en) * 2014-09-30 2016-03-30 深圳市中兴微电子技术有限公司 Shared storage concurrent access processing method and apparatus
CN104615576A (en) * 2015-03-02 2015-05-13 中国人民解放军国防科学技术大学 CPU+GPU processor-oriented hybrid granularity consistency maintenance method
CN106201939A (en) * 2016-06-30 2016-12-07 中国人民解放军国防科学技术大学 Multinuclear catalogue concordance device towards GPDSP framework

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Arian Maghazeh et al., "Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications", 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019-05-16, full text *
Du Huimin et al., "Design of a Configurable L2 Cache in a Unified Rendering Architecture GPU" (统一渲染架构GPU中可配置二级Cache设计), Journal of Xi'an University of Posts and Telecommunications (西安邮电大学学报), 2020-11-30, full text *

Also Published As

Publication number Publication date
CN113435153A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN107038123B (en) Snoop filter for cache coherency in a data processing system
JP5153172B2 (en) Method and system for maintaining low cost cache coherency for accelerators
JP2540517B2 (en) Hierarchical cache memory device and method
KR101497002B1 (en) Snoop filtering mechanism
US7581068B2 (en) Exclusive ownership snoop filter
US6289420B1 (en) System and method for increasing the snoop bandwidth to cache tags in a multiport cache memory subsystem
JP4960989B2 (en) Delete invalidation transaction from snoop filter
US5692149A (en) Block replacement method in cache only memory architecture multiprocessor
US9208091B2 (en) Coherent attached processor proxy having hybrid directory
US6463510B1 (en) Apparatus for identifying memory requests originating on remote I/O devices as noncacheable
EP0817072A2 (en) A multiprocessing system configured to store coherency state within multiple subnodes of a processing node
JP2001503889A (en) System and method for maintaining memory coherence in a computer system having multiple system buses
JP2014089760A (en) Resolving cache conflicts
CN100380346C (en) Method and apparatus for the utilization of distributed caches
CN111143244B (en) Memory access method of computer equipment and computer equipment
JP2005539282A (en) Method and apparatus for using global snoop to provide cache coherence to distributed computer nodes in a single coherent system
JP2000067024A (en) Divided non-dense directory for distributed shared memory multi-processor system
US7325102B1 (en) Mechanism and method for cache snoop filtering
JP4577729B2 (en) System and method for canceling write back processing when snoop push processing and snoop kill processing occur simultaneously in write back cache
US20140229678A1 (en) Method and apparatus for accelerated shared data migration
US5987544A (en) System interface protocol with optional module cache
US6226718B1 (en) Method and system for avoiding livelocks due to stale exclusive/modified directory entries within a non-uniform access system
JPH10501914A (en) Shared cache memory device
CN113435153B (en) Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems
CN105488012B (en) Consistency protocol design method based on exclusive data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221214

Address after: No. 1628, Suzhao Road, Minhang District, Shanghai 200100

Patentee after: SHANGHAI ZHIRUI ELECTRONIC TECHNOLOGY Co.,Ltd.

Address before: 201114 room 101-5, building 3, No. 2388, Chenhang highway, Minhang District, Shanghai

Patentee before: Shanghai Tiantian smart core semiconductor Co.,Ltd.

TR01 Transfer of patent right