CN115168247B - Method for dynamically sharing memory space in parallel processor and corresponding processor


Info

Publication number: CN115168247B
Application number: CN202211068433.1A
Authority: CN (China)
Prior art keywords: memory, cache, size, processor, label
Legal status: Active (granted)
Other versions: CN115168247A
Other languages: Chinese (zh)
Inventor: 苏叶华
Assignees: Shanghai Denglin Technology Co ltd; Beijing Denglin Technology Co ltd
Application filed by Shanghai Denglin Technology Co ltd and Beijing Denglin Technology Co ltd
Related PCT application: PCT/CN2023/083990 (WO2024045585A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877 Cache access modes
    • G06F12/0884 Parallel mode, e.g. in parallel with main memory or CPU
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/167 Interprocessor communication using a common memory, e.g. mailbox

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present application provides a method for dynamically sharing memory space in a parallel processor and a corresponding processor, in which one part of the storage space of a memory in the processor is used as local memory and the other part is used as the data storage of a cache memory. The memory access control unit of the processor updates the starting positions of the local memory and the cache memory in the processor's memory according to received settings of the local memory size and the cache memory size. The cache memory determines the new data storage location of each cache block in the memory according to the cache memory size and the starting position of the cache memory in the processor's memory, and establishes a mapping between each tag storage location and the new data storage location of each cache block. The scheme allows a user to dynamically adjust the storage space of the local memory and the cache memory in the processor, and improves the execution performance of the processor for an application program without increasing chip area or hardware cost.

Description

Method for dynamically sharing memory space in parallel processor and corresponding processor
Technical Field
The present application relates to memory management techniques in high-performance parallel processors, and more particularly to a method and a corresponding processor for sharing storage space between a processor's local memory and cache memory.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art to the present disclosure.
High-performance parallel processors, such as general-purpose graphics processing units (GPGPUs), generally support programming models based on both local memory and global memory, which gives the user the flexibility to choose local memory or global memory according to the actual application. Local memory and global memory each have advantages and disadvantages: local memory is generally closer to the compute core and faster to access, but smaller in capacity; global memory has a large capacity, but its access speed and access latency are worse. To address the access latency of global memory, GPGPUs adopt cache technology to relieve the global memory access bottleneck; for example, a fast but small-capacity cache (Cache) is placed between the processor and global memory to cache global memory data, such as an L1 cache. To further improve performance, multiple levels of cache can be introduced; for example, an L2 cache or even an L3 cache may be added after the L1 cache to reduce access latency.
Thus, in a parallel processor architecture that includes multiple compute cores, such as a GPGPU, each compute core often contains a local memory and an L1 cache. Generally, the larger the capacities of each compute core's local memory and L1 cache, the better the performance, but this correspondingly increases the chip area and hardware cost of the processor.
The above-mentioned contents are only for assisting understanding of the technical solutions of the present application, and are not taken as a basis for evaluating the prior art of the present application.
Disclosure of Invention
The inventor has found through research that, when writing an application program for a GPGPU processor in a programming language such as CUDA or OpenCL, a user selects a programming model based primarily on either local memory or global memory according to the actual application requirements. In other words, a user-written application will not have a large capacity requirement for both the local memory and the L1 cache at the same time. The inventor therefore designed a solution for dynamically sharing storage space in a parallel processor, so that the local memory and the L1 cache can share storage space and the user can dynamically allocate the size of the local memory or the L1 cache according to the actual application. With this scheme, the local memory and the L1 cache can share the same random access memory (RAM) space in a time-shared manner, thereby reducing chip area and hardware cost. Since the storage space of the local memory or L1 cache usually accounts for about 80% of their area cost, about 40% of that area cost can be saved by adopting the dynamic storage space sharing scheme.
According to a first aspect of the embodiments of the present application, there is provided a method for dynamically sharing storage space in a parallel processor, comprising: updating, by a memory access control unit of the processor, the starting positions of a local memory and a cache memory in a memory of the processor according to received settings of the local memory size and the cache memory size, wherein one part of the storage space of the memory is used as the local memory and another part is used as the data storage unit of the cache memory; updating, by the memory access control unit of the processor, the setting of the size of the index field in the access address of the cache memory according to the received setting of the cache memory size; and determining, by the cache memory, the new data storage location of each cache block in the memory according to the cache memory size provided by the memory access control unit and the starting position of the cache memory in the memory of the processor, and establishing a mapping between each tag storage location and the new data storage location of each cache block.
In such an embodiment, the local memory and the L1 cache in the processor share the same memory, but the sizes of the storage spaces they occupy are not fixed and can be changed dynamically according to a configuration provided by the user. This allows a user who writes an application for the GPGPU processor in a programming language such as CUDA or OpenCL to re-size the storage space of the local memory and the L1 cache in the processor according to the selected local-memory-based or global-memory-based programming model, so as to better improve the execution performance of the processor for the application. If the user currently selects local-memory-based programming, the storage space of the local memory in the processor may be expanded appropriately; conversely, if the user currently selects global-memory-based programming, the storage space of the L1 cache in the processor may be expanded appropriately. The execution performance of the processor for the application is thus improved without increasing chip area or hardware cost.
In some embodiments, the method may further include: dividing, by the memory access control unit of the processor, according to the received settings of the local memory size and the cache memory size, a storage space of the corresponding size starting from a preset address in the memory of the processor as the local memory, and then allocating storage space for the cache memory, wherein the starting position of the local memory is the preset address.
In such an embodiment, the starting position of the local memory does not need to be changed each time the storage space sizes are adjusted, and the new starting position of the updated cache memory in the memory of the processor can be determined from the newly set local memory size alone. This not only simplifies storage space management, but also prevents the data in the part of the local memory shared before and after the update from being lost when the local memory size is updated.
In some embodiments, the cache memory may be a set-associative cache, and the size of the index field in the access address of the cache memory is determined by dividing the cache memory size by the preset cache block size of the cache memory and by the number of tags contained in each set.
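As an illustration of how the index field width follows from the rule above, the following sketch (a hypothetical helper, not taken from the patent; names are assumptions) computes the number of sets and the index field size from the cache size, the preset cache block size, and the number of tags per set:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: derive the number of sets and the index field width
// of a set-associative cache from its size, block size and tags per set.
struct IndexFieldConfig {
    uint32_t num_sets;
    uint32_t index_bits;
};

static uint32_t log2u(uint32_t v) {            // v is assumed to be a power of two
    uint32_t bits = 0;
    while (v > 1) { v >>= 1; ++bits; }
    return bits;
}

IndexFieldConfig compute_index_field(uint32_t cache_size_bytes,
                                     uint32_t block_size_bytes,
                                     uint32_t tags_per_set) {
    // cache size divided by (block size * tags per set) gives the number of sets
    uint32_t num_sets = cache_size_bytes / (block_size_bytes * tags_per_set);
    return {num_sets, log2u(num_sets)};
}

int main() {
    // Example values used later in the description: 128-byte blocks, 8 tags per set.
    IndexFieldConfig a = compute_index_field(32 * 1024, 128, 8);
    IndexFieldConfig b = compute_index_field(64 * 1024, 128, 8);
    std::printf("32KB: %u sets, %u index bits\n", a.num_sets, a.index_bits); // 32 sets, 5 bits
    std::printf("64KB: %u sets, %u index bits\n", b.num_sets, b.index_bits); // 64 sets, 6 bits
    return 0;
}
```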
In some embodiments, the method may further include: in response to receiving a memory access request for the local memory, locating, by the memory access control unit of the processor, the data to be accessed by the request using the updated starting position of the local memory.
In some embodiments, the method may further include: in response to receiving a memory access request for the global memory, mapping, by the memory access control unit of the processor, the address in the request to an access address of the cache memory that uses the updated index field size, and sending the access address to the cache memory; and, in response to receiving the memory access request from the memory access control unit, locating, by the cache memory, the data to be accessed according to the established mapping between tag storage locations and new data storage locations.
In some embodiments, locating, by the cache memory in response to receiving a memory access request from the memory access control unit, data to be accessed by the memory access request according to the established mapping between tag storage locations and new data storage locations may include:
on a cache hit, determining the cache block corresponding to the hit tag according to the established mapping between tag storage locations and new data storage locations, and extracting the data to be accessed by the access request from that cache block as the response to the request;
performing the following operations on a cache miss:
allocating a tag storage location for the access request to store the tag field of its access address, and selecting, from the data storage unit of the cache memory, one of the cache blocks that is not bound to any tag and allocating it to the access request;
setting the tag binding bit of the cache block originally corresponding to the allocated tag storage location to indicate that it is no longer bound to a tag, then establishing a mapping between the tag storage location and the cache block allocated to the access request, and setting the tag binding bit of the cache block allocated to the access request to indicate that it is bound to a tag; and
acquiring the data to be accessed by the access request from the next-level memory and storing it in the cache block allocated to the access request.
In such embodiments, each tag storage location in the cache memory is no longer fixedly bound to a cache block but can be dynamically mapped or bound to any cache block, and the tag and the data in the cache blocks need not be updated synchronously. When the storage space of the L1 cache changes, for example when its data storage unit 103 is moved or allocated to another location of the shared storage space, the cache blocks can be reallocated starting from the given starting position of the L1 cache in the shared storage space, as long as the mapping between the tag storage locations and the new data storage locations of the cache blocks is re-established accordingly. Thus, when the storage space of the L1 cache is changed or adjusted according to the user's configuration, the data in the newly allocated cache blocks can be located based on the re-established mapping.
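A minimal sketch of the miss-path bookkeeping listed above, under the assumption that the mapping unit is a simple per-tag table of cache block numbers and that each cache block carries a tag binding bit (all names below are illustrative, not taken from the patent):

```cpp
#include <cstdint>
#include <vector>

// Illustrative structures; field and function names are assumptions.
struct CacheBlock {
    bool bound_to_tag = false;   // tag binding bit
    // data payload omitted
};

struct MissPathState {
    std::vector<uint64_t> tags;         // tag storage unit
    std::vector<uint32_t> tag_to_block; // mapping unit: tag slot -> cache block number
    std::vector<CacheBlock> blocks;     // data storage unit
};

// Performs the three operations listed above for a cache miss.
uint32_t handle_miss(MissPathState& s, uint32_t victim_tag_slot, uint64_t new_tag,
                     uint32_t free_block /* a block whose binding bit is clear */) {
    // 1) store the tag field of the access address in the allocated tag slot
    s.tags[victim_tag_slot] = new_tag;
    // 2) unbind the block previously mapped to this tag slot, remap the slot to the
    //    newly allocated block, and mark that block as bound to a tag
    s.blocks[s.tag_to_block[victim_tag_slot]].bound_to_tag = false;
    s.tag_to_block[victim_tag_slot] = free_block;
    s.blocks[free_block].bound_to_tag = true;
    // 3) the data is then fetched from the next-level memory into 'free_block'
    //    (the fetch itself is omitted here)
    return free_block;
}

int main() {
    MissPathState s{std::vector<uint64_t>(4, 0), {0, 1, 2, 3},
                    std::vector<CacheBlock>(8)};
    for (uint32_t i = 0; i < 4; ++i) s.blocks[i].bound_to_tag = true;
    handle_miss(s, /*victim_tag_slot=*/1, /*new_tag=*/0xABC, /*free_block=*/5);
    return 0;
}
```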
According to a second aspect of the embodiments of the present application, there is provided a processor supporting dynamically shared storage space, comprising a memory access control unit, a memory, and a cache memory, wherein the cache memory comprises a controller, a tag storage unit for storing tags, a data storage unit composed of a plurality of cache blocks, and a mapping unit; one part of the storage space of the memory is used as a local memory and another part is used as the data storage unit of the cache memory, wherein:
the memory access control unit is configured to: update the starting positions of the local memory and the cache memory in the memory of the processor according to the received settings of the local memory size and the cache memory size; and update the setting of the size of the index field in the access address of the cache memory according to the received setting of the cache memory size;
the controller of the cache memory is configured to: determine the new data storage location of each cache block in the memory according to the cache memory size provided by the memory access control unit and the starting position of the cache memory in the memory of the processor, and establish, in the mapping unit, a mapping between each tag storage location in the tag storage unit and the new data storage location of each cache block.
In some embodiments, the memory access control unit may be further configured to: according to the received settings of the local memory size and the cache memory size, first allocate a storage space of the corresponding size starting from a preset address in the memory as the local memory, and then allocate storage space for the cache memory, the starting position of the local memory being the preset address. In some embodiments, the memory may be implemented in the form of random access memory, and the mapping unit in the cache memory may be implemented in the form of registers. A register-based mapping unit further reduces the cost and area occupied by storing the mapping relationship in the cache memory and speeds up resolving the mapping between a tag and its cache block.
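For orientation only, a highly simplified structural sketch of the apparatus of this aspect; all type and member names below are hypothetical and chosen for illustration:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the processor of the second aspect.
struct MappingUnit {                       // register-style mapping unit
    std::vector<uint32_t> tag_to_block;    // one entry per tag storage location
};

struct CacheMemory {
    std::vector<uint64_t> tag_storage;     // tag storage unit
    MappingUnit mapping;                   // mapping unit
    uint64_t data_base = 0;                // start of its data storage unit in the shared RAM
    uint32_t size_bytes = 0;               // current cache size
};

struct MemoryAccessControlUnit {
    uint64_t local_mem_base = 0;           // starting position of the local memory
    uint64_t cache_base = 0;               // starting position of the cache data store
    uint32_t index_bits = 0;               // current index field width
};

struct Processor {
    std::vector<uint8_t> shared_ram;       // the memory shared by local memory and cache
    MemoryAccessControlUnit lsu;
    CacheMemory l1;
};

int main() { Processor p; (void)p; return 0; }
```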
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a block diagram of a cache memory according to one embodiment of the present application.
FIG. 2 is a diagram illustrating a mapping relationship between tags and cache blocks according to an embodiment of the present application.
FIG. 3 is a flowchart illustrating a method for dynamically sharing memory in a parallel processor according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below by means of specific embodiments with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In a parallel processor such as a GPGPU, memory access requests from multiple threads of different compute kernels are sent to a memory access control unit (LSU) of the processor, and the LSU accesses the data requested by these requests from local memory or global memory. In a programming model using local memory, the local memory is usually accessed as an array, so the memory access control unit can locate the specific storage location in the local memory that each thread's request wants to access based on the starting address of the local memory. In a programming model using global memory, when the memory access control unit receives a request for a global memory address, the global memory address is first mapped to an access address of a cache memory (e.g., the L1 cache), and the L1 cache is then searched to see whether the corresponding data is cached there; if the data to be accessed is cached in the L1 cache (a "hit"), it is returned to the processor directly from the L1 cache, otherwise the controller of the L1 cache fetches the data from the next-level memory (e.g., the global memory), caches it, and returns it to the processor.
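A minimal sketch of this request routing in the memory access control unit, under simplifying assumptions (a single address-space flag per request; names are illustrative and not taken from the patent):

```cpp
#include <cstdint>

// Illustrative request routing in a memory access control unit (LSU).
enum class Space { Local, Global };

struct MemRequest {
    Space space;       // which programming-model space the thread addressed
    uint64_t address;  // offset within local memory, or a global memory address
};

struct LSU {
    uint64_t local_mem_base;   // starting address of the local memory

    // For local memory: locate the data directly from the local memory base.
    uint64_t resolve_local(const MemRequest& r) const {
        return local_mem_base + r.address;
    }

    // For global memory: the global address is forwarded as the cache access
    // address (the tag/index/offset split happens inside the L1 cache).
    uint64_t to_cache_access_address(const MemRequest& r) const {
        return r.address;
    }
};

int main() {
    LSU lsu{0x0};
    MemRequest local_req{Space::Local, 0x40};
    MemRequest global_req{Space::Global, 0xDEADBEEF};
    (void)lsu.resolve_local(local_req);
    (void)lsu.to_cache_access_address(global_req);
    return 0;
}
```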
The inventor has found through research that, when writing an application for a GPGPU processor in a programming language such as CUDA or OpenCL, a user selects a programming model based primarily on either local memory or global memory according to the actual application requirements. In other words, a user-written application does not have a large capacity requirement for both the local memory and the L1 cache at the same time. Therefore, the embodiments of the present application provide a solution in which the local memory and the L1 cache in a parallel processor share the storage space of the same random access memory (RAM). The user can dynamically allocate the space sizes of the local memory and the L1 cache in the processor through upper-layer software according to the actual application requirements. Once the space allocation is set, the local memory only accesses the space of the specified size, and likewise the L1 cache only accesses the space specified for it. The spaces of the L1 cache and the local memory may also change with the user configuration. For the local memory, as long as its starting address and storage space size are given, the memory access control unit can locate any storage position in it. For the L1 cache, by contrast, the access address and the index space change once its space size changes, and the existing L1 cache control mechanism, which is designed for an L1 cache of fixed size, cannot support a dynamically sized storage space.
More specifically, a cache memory located between the processor and global memory (hereinafter collectively referred to as the L1 cache) typically includes a controller, a tag storage unit for holding tags, and a data storage unit for holding data. The data storage space of the L1 cache is divided into cache blocks (also referred to as cache lines) of equal size; the cache block size is the minimum unit of data transfer between the L1 cache and global memory. In an existing cache, each cache block in the data storage unit has a uniquely corresponding tag storage location in the tag storage unit, and the tag corresponding to the data stored in a cache block is held in that cache block's tag storage location. When the data in a cache block is updated, the tag in its corresponding tag storage location is updated as well; the two are replaced synchronously. Although the tags are also part of the L1 cache, the size of the L1 cache refers only to the maximum amount of data it can hold, i.e., the size of the storage space of its data storage unit.
Existing caches are generally divided into three types: direct-mapped caches, fully associative caches, and set-associative caches. In a direct-mapped cache, each main memory block can only be mapped to a cache block at a fixed position in the L1 cache; even if many other positions are free they cannot be used, so the storage space of the cache is not fully utilized, and if a program happens to repeatedly access different main memory blocks that correspond to the same cache position, block conflicts occur frequently and constant replacement is needed, which reduces the hit rate. In a fully associative cache, each main memory block can be mapped to any cache block in the L1 cache; when accessing data, the address in the access request must be compared with the tag of every cache block to determine whether there is a hit, and a miss can only be declared after all comparisons are done, followed by replacement, which directly affects caching and access efficiency. In a set-associative cache, the storage space is divided into several sets; direct mapping is used between sets, and fully associative mapping is used among the cache blocks within a set. Correspondingly, the main memory is also partitioned according to the size of the L1 cache, and each partition is further divided into a number of groups each containing several main memory blocks. Thus each group in main memory can only be mapped to a specified set in the cache, but each main memory block within that group can be mapped to any cache block in the specified set; that is, mapping is fully associative within a set and direct between sets. Compared with the fully associative cache, the set-associative cache is simpler and more efficient in hit determination and replacement; its probability of block conflict is lower than that of the direct-mapped cache, and its hit rate lies between the two. In such an L1 cache, the speed of searching and comparing tags directly affects the efficiency with which the processor accesses data, so the number of tags in a set is usually not too large, typically 4, 8 or 16 and usually less than 32.
In parallel computing processors, set-associative caches are most commonly employed as L1 caches. The following description uses a set-associative L1 cache as an example, but it should be understood that, with appropriate adjustment or modification, the embodiments described in this application can be applied to other types of L1 cache.
When the controller of the L1 cache receives a memory access request, it determines from the access address in the request whether the corresponding data is cached in the L1 cache. The access address generally consists of three parts: a tag, an index, and an offset. The offset addresses a particular piece of data within a cache block, the index locates a particular set in the L1 cache, and the tag is compared with the tag of each cache block in the set selected by the index to determine whether there is a hit. Taking an L1 cache with a size of 64 bytes and a cache block size of 8 bytes as an example, assume the L1 cache is divided into two sets, each with a storage space of 32 bytes, i.e., 4 cache blocks. With an 8-byte cache block, the addressing range within a block can be represented by 3 bits, so the lowest 3 bits of the access address form the offset field; with 2 sets, 1 bit is enough to cover all sets, so the bit adjacent to the offset forms the index field, and the remaining bits of the access address form the tag field, which is compared with the tags of the cache blocks in the set selected by the index. The tag field holds part of the global memory address to be accessed by the request. Therefore, when this example L1 cache receives a memory access request, the index field of the address locates the corresponding set, the tag field is compared with the tags of the cache blocks in that set, and if a matching tag exists the cache hits, and the corresponding data can be extracted from the cache block of the matching tag according to the offset field and returned to the processor. If no tag matches, the data to be accessed has not yet been cached in the L1 cache (a "miss"). On a cache miss, the controller of the L1 cache allocates a cache block and a corresponding tag for the data to be accessed, loads the data from main memory into the allocated cache block, and stores the tag field of the access address in the tag corresponding to that cache block.
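To make the example concrete, a small sketch (hypothetical code mirroring the 64-byte cache with 8-byte blocks and 2 sets described above) that splits an access address into offset, index and tag fields:

```cpp
#include <cstdint>
#include <cstdio>

// Field split for the example above: 8-byte blocks -> 3 offset bits,
// 2 sets -> 1 index bit, remaining bits form the tag.
struct AddressFields { uint64_t tag; uint32_t index; uint32_t offset; };

AddressFields split_address(uint64_t addr, uint32_t offset_bits, uint32_t index_bits) {
    uint32_t offset = static_cast<uint32_t>(addr & ((1u << offset_bits) - 1));
    uint32_t index  = static_cast<uint32_t>((addr >> offset_bits) & ((1u << index_bits) - 1));
    uint64_t tag    = addr >> (offset_bits + index_bits);
    return {tag, index, offset};
}

int main() {
    AddressFields f = split_address(0x1A5, /*offset_bits=*/3, /*index_bits=*/1);
    // 0x1A5 = 0b1_1010_0101 -> offset = 0b101, index = 0b0, tag = 0b11010
    std::printf("tag=0x%llx index=%u offset=%u\n",
                static_cast<unsigned long long>(f.tag), f.index, f.offset);
    return 0;
}
```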
In an existing L1 cache, each tag storage location in the tag storage unit is fixedly bound to a cache block (i.e., a data storage location) in the data storage unit; when the data storage space of the L1 cache changes (e.g., the location of a cache block is moved or changed), the L1 cache can no longer address the corresponding data by tag and index. Therefore, a conventional L1 cache cannot support dynamic changes in storage space size.
In addition, as the above analysis shows, the number of bits occupied by the offset field in the access address of the L1 cache is determined by the cache block size, and the tag field, which corresponds to part of the global memory address to be accessed, is also preset; only the number of bits occupied by the index field changes with the size of the L1 cache, and a change of the index field does not affect external elements such as the global memory but only the internal addressing of the L1 cache. For example, for an 8-way set-associative cache with 128-byte cache blocks (i.e., each set contains 8 cache blocks), the offset determined by the cache block size occupies the lowest 7 bits of the access address (bits [6:0]); when the cache size is 32 KB (i.e., 128 × 8 × 32), there are 32 sets in total and the index field occupies 5 bits, i.e., bits [11:7]; if the cache size becomes 64 KB (128 × 8 × 64), there are 64 sets and the index field occupies 6 bits, i.e., bits [12:7]. That is, when the size of the data storage space of the L1 cache changes, the size of the index field in the access address must change accordingly for correct addressing.
FIG. 1 is a block diagram of a cache memory 100 according to an embodiment of the present application. The cache memory 100 includes a controller 101, a tag storage unit 102 for storing tags, a data storage unit 103 made up of a plurality of cache blocks, and a mapping unit 104. Unlike existing caches, in which tag storage locations are fixedly bound to cache blocks, there is a dynamic mapping between tag storage locations and data storage locations (i.e., cache blocks) in the cache memory 100: each tag storage location is no longer fixedly bound to a cache block but can be dynamically mapped or bound to any cache block. In this embodiment, the mapping relationship between tag storage locations and cache blocks is maintained in the mapping unit 104, which may, for example, maintain a one-to-one mapping between tag sequence numbers and cache block sequence numbers in the form of a table. The tag sequence number indicates the position of each tag held in the tag storage unit 102, and the cache block sequence number indicates the location of each cache block in the data storage unit 103.
FIG. 2 is a diagram illustrating a mapping relationship between tag storage locations and cache blocks according to an example of the present application. As shown in FIG. 2, there are k+1 tag storage locations in the tag storage unit and n+1 cache blocks in the data storage unit, where n and k are both natural numbers and n is greater than or equal to k. For example, the 1st tag t0 is currently mapped to the 6th cache block d5, the 2nd tag t1 is currently mapped to the 9th cache block d8, …, and the (k+1)th tag is currently mapped to the 24th cache block d23. It can be seen that the mapping relationship stored in the mapping unit 104 actually represents the mapping between the tag currently stored in each storage location of the tag storage unit 102 and the data block currently stored in the corresponding cache block of the data storage unit 103. When the location and size of the storage space of the data storage unit 103 of the L1 cache change, the mapping established by the mapping unit 104 between tag storage locations and the corresponding data storage locations is adjusted, so that each tag in the tag storage unit 102 is mapped to a cache block of the data storage unit 103 at its new location. With this dynamic mapping, the number of tags and cache blocks that can be indexed can be changed or adjusted, so dynamic changes of the L1 cache storage space can be supported. The L1 cache may also be moved or allocated to any part of the shared storage space: as long as the starting position of the L1 cache in the shared storage space is given, the data storage location corresponding to each tag storage location can be re-established by adjusting the mapping unit 104.
In some embodiments, the number of cache blocks in the data storage unit 103 may be greater than the number of tags contained in the tag storage unit 102, and each cache block may or may not be bound to a tag. Each cache block is provided with a tag binding bit, which indicates whether the cache block is bound to a tag; for example, the tag binding bit may be set to 1, Y, or T when the cache block is bound to a tag, and to 0, N, or F when it is no longer bound to any tag. In some embodiments, each cache block may also be provided with a status bit to indicate whether the operations on that cache block have been completed; for example, when the controller 101 determines that the read/write operations on the data currently stored in the cache block have been fully completed, the status bit may be set to 1, Y, or T, and otherwise to 0, N, or F. A cache block is allowed to release its data resources, i.e., to participate in data replacement and be used to store new data, only when it is not bound to any tag and its read operations have been fully completed.
In this embodiment, dynamic mapping, or dynamic binding, of tags to the data in cache blocks is achieved by introducing the mapping unit 104, so the tag and the data in a cache block need not be updated synchronously. For example, when the tag stored in a certain storage location of the tag storage unit 102 is replaced with a new tag, a new cache block may be allocated in the data storage unit 103 for the data corresponding to the new tag and a mapping between the new tag and the newly allocated cache block established in the mapping unit 104, while the data in the cache block corresponding to the old tag originally stored at that location remains in the data storage unit 103. Accordingly, when the storage space of the L1 cache is changed, for example when its data storage unit 103 is moved or allocated to another location of the shared storage space, the cache blocks can be reallocated starting from the given starting position of the L1 cache in the shared storage space, as long as the data storage location corresponding to each tag storage location is re-established in the mapping unit 104 accordingly. It should be understood, however, that when the storage space of the L1 cache is changed or adjusted according to the user's configuration, the data in the newly allocated cache blocks is located based on the mapping re-established by the mapping unit 104; the data held in the L1 cache before the adjustment is not retained, and the space it occupied can be replaced by other data.
In some embodiments, the mapping unit 104 may be implemented using random access memory such as SRAM or DRAM, storing the one-to-one mapping between tag sequence numbers and cache block sequence numbers in a data structure such as an array or linked list. Taking an array as an example, the number of elements in the array equals the number of tags that can be stored in the tag storage unit 102; the first element stores the sequence number of the cache block currently corresponding to the first tag in the tag storage unit 102, and so on. In still other embodiments, the mapping unit 104 may be implemented in the form of registers, for example as a set of registers in which each register corresponds to the storage location of one tag in the tag storage unit 102 and holds the sequence number of the cache block corresponding to that tag. A register-based mapping unit further reduces the cost and area occupied by storing the mapping relationship in the L1 cache and speeds up resolving the mapping between a tag and its cache block.
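A sketch of the mapping unit along these lines, using a plain array in place of a register file (the register-per-tag organization is the patent's; the C++ rendering and all names are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Mapping unit: one entry per tag storage location; each entry holds the
// sequence number of the cache block currently bound to that tag.
class MappingUnit {
public:
    explicit MappingUnit(uint32_t num_tags) : tag_to_block_(num_tags, 0) {}

    // Rebind a tag storage location to a (possibly new) cache block.
    void bind(uint32_t tag_slot, uint32_t block_no) { tag_to_block_[tag_slot] = block_no; }

    // Resolve which cache block holds the data for a given tag storage location.
    uint32_t block_of(uint32_t tag_slot) const { return tag_to_block_[tag_slot]; }

private:
    std::vector<uint32_t> tag_to_block_;  // in hardware: one register per tag slot
};

int main() {
    MappingUnit m(/*num_tags=*/4);
    m.bind(0, 5);   // tag t0 -> cache block d5, as in the FIG. 2 example
    m.bind(1, 8);   // tag t1 -> cache block d8
    return (m.block_of(0) == 5 && m.block_of(1) == 8) ? 0 : 1;
}
```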
With continued reference to FIG. 1, when the cache memory 100 receives a memory access request sent by the memory access control unit (LSU), the controller 101 parses the access address contained in the request, locates the corresponding set according to the index field of the access address, and then compares the tag field of the request with the tags contained in that set. If a matching tag is found, the cache hits, meaning the data to be accessed by the request is already cached in the cache memory. If no matching tag is found after all comparisons, the data to be accessed is not cached, and the controller 101 must read it from the next-level memory (e.g., an L2 cache or main memory) into the cache memory 100.
On a cache hit, the controller 101 determines the data storage location (i.e., a particular cache block in the data storage unit 103) corresponding to the hit tag according to the mapping relationship stored in the mapping unit 104, extracts the data to be accessed from that cache block according to the offset field of the access address, and returns it to the memory access control unit (LSU) of the processor as the response to the request.
On a cache miss, the controller 101 allocates a tag for the access request, for example using the tag portion of the access address contained in the request as the newly allocated tag, and stores it in the tag storage unit 102; the newly allocated tag replaces the tag originally held in one of the storage locations of the tag storage unit 102, which effectively allocates a tag storage location for the request. Meanwhile, the controller 101 allocates a cache block in the data storage unit 103 for the request, to hold the data read from the next-level memory. To establish the correspondence between the tag allocated to the request and the cache block, the controller 101 also updates the mapping in the mapping unit 104, so that a mapping is established between the tag storage location allocated to the request in the tag storage unit 102 and the cache block allocated to the request in the data storage unit 103. For example, the mapping unit 104 looks up the cache block sequence number corresponding to the sequence number of the tag's storage location in the tag storage unit 102, sets the tag binding bit of the cache block with that sequence number in the data storage unit 103 to indicate that it is no longer bound to a tag, and replaces the found cache block sequence number with the sequence number of the cache block allocated to the request. After the corresponding mapping is established in the mapping unit 104, the tag binding bit of the cache block allocated to the request is set to indicate that it is bound to a tag. The data to be accessed by the request can then be read from the next-level memory and stored in the cache block allocated to the request.
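Putting the pieces together, a simplified lookup sketch: the set is located from the index field, tags are compared, and on a hit the mapping unit (rather than a fixed tag-to-block binding) supplies the cache block to read; a miss simply reports "not found" here, with the replacement bookkeeping as sketched earlier. All structures and names below are illustrative assumptions, not the patent's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative set-associative lookup with a tag -> block mapping indirection.
struct Set {
    std::vector<uint64_t> tags;         // tags of this set
    std::vector<uint32_t> tag_to_block; // mapping unit entries for this set's tag slots
};

struct Cache {
    uint32_t offset_bits;
    uint32_t index_bits;
    std::vector<Set> sets;

    // Returns the cache block number holding the data on a hit, nothing on a miss.
    std::optional<uint32_t> lookup(uint64_t access_address) const {
        uint32_t index = static_cast<uint32_t>(
            (access_address >> offset_bits) & ((1u << index_bits) - 1));
        uint64_t tag = access_address >> (offset_bits + index_bits);
        const Set& set = sets[index];
        for (std::size_t way = 0; way < set.tags.size(); ++way) {
            if (set.tags[way] == tag) {
                // hit: the mapping unit tells us which cache block to read;
                // the offset field then selects the data inside that block
                return set.tag_to_block[way];
            }
        }
        return std::nullopt;  // miss: the controller fetches from the next-level memory
    }
};

int main() {
    Cache c{3, 1, {Set{{0x1, 0x2}, {4, 7}}, Set{{0x3, 0x4}, {1, 6}}}};
    auto hit = c.lookup((0x2ull << 4) | (0u << 3) | 5u);  // tag 0x2, set 0, offset 5
    return (hit && *hit == 7) ? 0 : 1;
}
```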
FIG. 3 illustrates a method for dynamically sharing memory in a parallel processor according to an embodiment of the present application, which mainly comprises the following steps:
In step S1), the memory access control unit of the processor updates the starting positions of the local memory and the cache memory in the memory of the processor according to the received settings of the local memory size and the cache memory size. In this embodiment, a user writing an application for the GPGPU processor in a programming language such as CUDA or OpenCL is allowed to re-size the storage space of the local memory and the L1 cache in the processor according to the selected local-memory-based or global-memory-based programming model, so as to better improve the execution performance of the processor for the application. If the user currently selects local-memory-based programming, the storage space of the local memory may be expanded appropriately; conversely, if the user selects global-memory-based programming, the storage space of the L1 cache may be expanded appropriately. To improve execution performance without increasing chip area or hardware cost, in this embodiment the storage space of the local memory and of the L1 cache both come from the same random access memory (RAM) in the processor, and the sizes of the storage spaces they occupy are not fixed but change dynamically with the configuration provided by the user. For example, when the user selects local-memory-based programming, the proportion of the memory given to the local memory can be enlarged; when the user selects global-memory-based programming, the proportion given to the L1 cache can be enlarged. The user may provide the current settings of the local memory size and cache size to the memory access control unit of the processor by, for example, invoking a configuration interface provided by the processor, providing a configuration file, selecting a corresponding configuration option, or sending a configuration command. According to the received settings and the resulting reallocation of the shared memory, the memory access control unit divides storage spaces of the corresponding sizes from the memory of the processor to serve as the local memory and the cache memory respectively. In fact, for a memory shared by the local memory and the L1 cache, each re-partition of the storage space can be achieved simply by determining the new starting positions of the local memory and the cache memory in the memory of the processor.
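As a usage illustration only: the patent does not define a specific configuration API, so the call below is a hypothetical host-side interface showing how a user-chosen split of the shared RAM might be handed to the memory access control unit.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical configuration call; the real interface (driver call, config
// register write, config file, etc.) is left open by the description.
struct SharedMemConfig {
    uint32_t local_mem_bytes;  // requested local memory size
    uint32_t l1_cache_bytes;   // requested L1 cache size
};

bool configure_shared_memory(const SharedMemConfig& cfg, uint32_t total_shared_bytes) {
    if (cfg.local_mem_bytes + cfg.l1_cache_bytes > total_shared_bytes) {
        std::fprintf(stderr, "requested split exceeds the shared RAM capacity\n");
        return false;
    }
    // In a real system this would be forwarded to the memory access control unit,
    // which then updates the start positions and the index field width.
    return true;
}

int main() {
    // e.g. a local-memory-heavy kernel: give 96KB to local memory, 32KB to the L1 cache
    return configure_shared_memory({96 * 1024, 32 * 1024}, 128 * 1024) ? 0 : 1;
}
```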
In some embodiments, when the memory access control unit divides storage spaces of the corresponding sizes from the memory of the processor as the local memory and the cache memory according to the received settings, the storage space of the local memory is allocated first and the storage space of the L1 cache is then reallocated. The local memory is always allocated starting from a preset address in the memory of the processor (for example, the starting address of the memory), so that it occupies the low-address part of the shared storage space. In this way, the starting position of the local memory does not need to change on each update, and the new starting position of the updated cache memory in the memory of the processor can be determined from the newly set local memory size alone. This not only simplifies storage space management, but also prevents the data in the part of the local memory shared before and after the update from being lost when the local memory size is updated. As mentioned above, when the size of the data storage space of the L1 cache is updated, the index field of the access address changes because the number of cache blocks involved changes, so the data held in the L1 cache before the update is released or cleared; the updated L1 cache cannot locate data cached before the update. For the L1 cache after the storage space adjustment, accessing a specific storage location in the L1 cache based on its starting position in the RAM amounts to adding an offset to the original address, the offset being the size of the space allocated to the local memory.
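A sketch of the repartitioning just described, assuming the local memory always starts at the preset address (here taken to be the base of the shared RAM) and the cache data store follows immediately after it; names are illustrative:

```cpp
#include <cstdint>

// Illustrative start-position update in the memory access control unit.
struct Partition {
    uint64_t local_mem_start;  // always the preset address
    uint64_t cache_start;      // new starting position of the L1 cache data store
};

Partition repartition(uint64_t preset_address,
                      uint64_t new_local_mem_size,
                      uint64_t /*new_cache_size*/) {
    Partition p;
    p.local_mem_start = preset_address;                  // unchanged across updates
    p.cache_start = preset_address + new_local_mem_size; // cache placed right after
    return p;
}

// Accessing a cache location within the shared RAM then amounts to adding the
// local memory size as an offset to the cache-internal address.
uint64_t cache_ram_address(const Partition& p, uint64_t cache_internal_addr) {
    return p.cache_start + cache_internal_addr;
}

int main() {
    Partition p = repartition(/*preset_address=*/0x0, /*local=*/96 * 1024, /*cache=*/32 * 1024);
    return (cache_ram_address(p, 0x100) == 96 * 1024 + 0x100) ? 0 : 1;
}
```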
In step S2), the memory access control unit of the processor updates the setting of the size of the index field in the access address of the cache memory according to the received setting of the cache memory size. As mentioned above, the number of bits occupied by the offset field in the access address of the L1 cache is determined by the cache block size, and the tag field, which corresponds to part of the global memory address to be accessed, is also preset; only the number of bits occupied by the index field changes with the size of the L1 cache, and that change affects only the internal addressing of the L1 cache. With the cache block size and the number of tags per set unchanged, when the size of the data storage space of the L1 cache changes, the size of the index field in the access address changes correspondingly so that addressing remains correct. For example, for an 8-way set-associative cache with 128-byte cache blocks (i.e., each set contains 8 cache blocks), when the cache size is 32 KB (i.e., 128 × 8 × 32) there are 32 sets in total and the index field occupies 5 bits; if the cache size becomes 64 KB (128 × 8 × 64) there are 64 sets and the index field occupies 6 bits. That is, given the cache block size and the number of tags per set, the number of bits of the index field is set according to the number of sets of the L1 cache, which is obtained by dividing the cache size by the cache block size and by the number of tags per set. When the memory access control unit determines the size of the index field of the L1 cache access address, the new number of sets of the L1 cache is also determined. In one example, the memory access control unit may include the starting position of the cache memory in the memory of the processor determined in step S1), the setting of the cache memory size, the new number of sets determined in step S2), and other parameters together in a notification message, sent to the cache memory, that indicates the space adjustment, so that the cache memory can readjust and configure its data storage space.
With continued reference to FIG. 3, in step S3), when the cache memory receives the notification message indicating the space adjustment sent by the memory access control unit of the processor, it may determine, in turn, the new data storage location in the memory of the processor of each set and of each cache block contained in it, starting from the starting position of the cache memory in the memory of the processor. As mentioned above, each cache block in the data storage unit of the cache memory needs a corresponding tag storage location in order to be addressed correctly. Therefore, in step S3) a new mapping relationship is established between these re-determined cache block locations and the tag storage locations of the tag storage unit of the cache memory; that is, a new mapping is established between each tag storage location in the tag storage unit and each new data storage location in the data storage unit after the storage space adjustment, so that the cache block corresponding to each tag can be located based on the newly established mapping.
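A sketch of step S3): given the new starting position, the new cache size, the block size and the number of tags per set, the cache blocks are laid out again from the start position and each tag storage location is remapped. The layout shown (sets in order, blocks in order within a set, one tag slot per block right after the remap) is an assumption made for illustration, not the patent's required organization.

```cpp
#include <cstdint>
#include <vector>

// Illustrative re-establishment of the tag-to-block mapping after a resize.
struct RemapResult {
    std::vector<uint64_t> block_start;   // new data storage location of each cache block
    std::vector<uint32_t> tag_to_block;  // mapping unit: tag slot -> cache block number
};

RemapResult remap(uint64_t cache_start, uint32_t cache_size,
                  uint32_t block_size, uint32_t tags_per_set) {
    uint32_t num_blocks = cache_size / block_size;
    RemapResult r;
    r.block_start.resize(num_blocks);
    r.tag_to_block.resize(num_blocks);   // assume one tag slot per block after the remap
    for (uint32_t b = 0; b < num_blocks; ++b) {
        r.block_start[b] = cache_start + static_cast<uint64_t>(b) * block_size;
        r.tag_to_block[b] = b;           // initial one-to-one mapping, rebound later on misses
    }
    (void)tags_per_set;                  // sets = num_blocks / tags_per_set, used for indexing
    return r;
}

int main() {
    RemapResult r = remap(/*cache_start=*/96 * 1024, /*cache_size=*/32 * 1024,
                          /*block_size=*/128, /*tags_per_set=*/8);
    return (r.block_start[1] == 96 * 1024 + 128) ? 0 : 1;
}
```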
Through the above steps, a user can dynamically adjust, at any time and according to the actual application, how much of the processor's memory is allocated to the local memory or to the L1 cache. When the memory access control unit of the processor receives a memory access request for the local memory, it can locate the data to be accessed using the updated starting position of the local memory. When the memory access control unit responds to a memory access request for the global memory, it first maps the address in the request to an access address of the cache memory, using the updated index field size, and sends it to the cache memory. When the cache memory receives the request from the memory access control unit, it can locate the data to be accessed according to the newly established mapping between tag storage locations and new data storage locations. With this scheme, the local memory and the L1 cache share the same random access memory (RAM) space in the processor in a time-shared manner, thereby reducing chip area and hardware cost. Since the storage space of the local memory or L1 cache usually accounts for about 80% of their area cost, about 40% of that area cost can be saved by adopting the dynamic storage space sharing scheme. Moreover, the scheme allows the user to re-size the storage space of the local memory and the L1 cache in the processor according to the selected local-memory-based or global-memory-based programming model when writing an application, thereby better improving the execution performance of the processor for the application.
In some embodiments, after the storage space of the L1 cache has been adjusted through the above steps, when the L1 cache receives a memory access request from the memory access control unit, its controller parses the access address contained in the request, extracts the index field from the access address according to the updated index field size, locates the corresponding set according to the extracted index field, and then compares the tag field of the request with the tags contained in that set. If a matching tag is found, the cache hits, meaning the data to be accessed has already been cached in the L1 cache. If no matching tag is found after all comparisons, the data is not cached in the L1 cache, and the L1 cache must read it from the next-level memory (e.g., an L2 cache or main memory). On a cache hit, the cache block corresponding to the hit tag is determined according to the mapping between tags and cache blocks newly established in step S3), and the data to be accessed is extracted from that cache block according to the offset field of the access address and returned to the memory access control unit as the response to the request. On a cache miss, the controller of the L1 cache replaces the tag held in one of the tag storage locations of the tag storage unit with the tag field of the request's access address, determines the cache block corresponding to the selected tag storage location according to the mapping newly established in step S3), and then obtains the data to be accessed from the next-level memory and stores it in that cache block.
In some embodiments in which each cache block of the L1 cache is provided with a tag binding bit, in the case of a cache miss the controller of the L1 cache allocates a tag storage location in the tag storage unit for the memory access request, to store the tag field of the request's memory access address, and selects from the data storage unit one of the cache blocks that is not bound to any tag and allocates it to the request. In the mapping unit, the tag binding bit of the cache block that originally corresponded to the allocated tag storage location is set to indicate that this block is no longer bound to a tag; a mapping relation is then established between the tag storage location and the cache block allocated to the request, and the tag binding bit of that cache block is set to indicate that it is bound to a tag. Finally, the data to be accessed by the request is obtained from the next-level memory and stored in the cache block allocated to the request.
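The following C sketch illustrates one way this tag-binding-bit bookkeeping could look. The choice of the victim tag storage location (the replacement policy) and the refill from the next-level memory are left to the caller, and all structure and function names are assumptions, not the patent's actual hardware.

```c
/* Minimal sketch (illustrative only) of miss handling with per-block tag binding bits. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS 1024u

typedef struct {
    uint32_t block_of_tag[MAX_BLOCKS]; /* mapping: tag storage location -> cache block index */
    bool     bound[MAX_BLOCKS];        /* per-cache-block tag binding bit                    */
    uint32_t num_blocks;
} mapping_unit_t;

/* Returns the index of the cache block newly allocated to the missing request,
 * or UINT32_MAX if no unbound block is currently available. */
uint32_t handle_miss(mapping_unit_t *map, uint32_t tag_store[],
                     uint32_t victim_tag_slot, uint32_t new_tag)
{
    /* 1. Store the request's tag field in the allocated tag storage location. */
    tag_store[victim_tag_slot] = new_tag;

    /* 2. Select a cache block whose binding bit says it is not bound to any tag. */
    uint32_t free_blk = UINT32_MAX;
    for (uint32_t b = 0; b < map->num_blocks; b++) {
        if (!map->bound[b]) { free_blk = b; break; }
    }
    if (free_blk == UINT32_MAX)
        return UINT32_MAX;                 /* nothing unbound: caller falls back to another policy */

    /* 3. Clear the binding bit of the block that previously backed this tag slot. */
    uint32_t old_blk = map->block_of_tag[victim_tag_slot];
    map->bound[old_blk] = false;

    /* 4. Map the tag storage location to the newly allocated block and mark it bound. */
    map->block_of_tag[victim_tag_slot] = free_blk;
    map->bound[free_blk] = true;

    /* 5. The caller then fetches the requested data from the next-level memory into free_blk. */
    return free_blk;
}
```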
In still other embodiments of the present application, a processor supporting a dynamically shared memory space is further provided; apart from the memory access control unit, the memory and the cache memory, its remaining components are the same as those of an existing processor and are not described here again. In these embodiments the cache memory is the cache described above in connection with Figs. 1 and 2, comprising a controller, a tag storage unit for holding tags, a data storage unit consisting of a plurality of cache blocks, and a mapping unit. One part of the storage space of the processor's memory is used as the local memory and another part is used as the data storage unit of the cache memory. The memory may be implemented as random access memory (RAM), such as SRAM or DRAM. The memory access control unit is configured to: update the start positions of the local memory and the cache memory in the processor's memory respectively according to the received settings of the size of the local memory and the size of the cache memory; and update the setting of the size of the index field in the memory access address of the cache memory according to the received setting of the size of the cache memory. For details, reference may be made to what was described above in connection with steps S1) and S2). The controller of the cache memory is configured to: determine, according to the size of the cache memory provided by the memory access control unit and the start position of the cache memory in the processor's memory, the new data storage location in the memory corresponding to each cache block, and establish in the mapping unit the mapping between each tag storage location in the tag storage unit and the new data storage location of each cache block. For details, reference may be made to what was described above in connection with step S3).
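To illustrate what the controller's part of this reconfiguration could amount to, the sketch below recomputes each cache block's new data storage location in the shared RAM and re-establishes a default mapping from tag storage locations to blocks in a register-based mapping unit. The 1:1 default mapping, the discarding of previously cached contents, and all names are assumptions for illustration only.

```c
/* Minimal sketch (illustrative only): rebuild the mapping unit after the cache's
 * start position and size have been updated by the memory access control unit. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS        1024u
#define CACHE_BLOCK_BYTES   64u

typedef struct {
    uint32_t block_addr[MAX_BLOCKS];   /* new data storage location of each cache block */
    uint32_t block_of_tag[MAX_BLOCKS]; /* mapping: tag storage location -> cache block  */
    bool     bound[MAX_BLOCKS];        /* per-block tag binding bit                      */
    uint32_t num_blocks;
} mapping_unit_t;

void rebuild_mapping(mapping_unit_t *map, uint32_t cache_base, uint32_t cache_size)
{
    map->num_blocks = cache_size / CACHE_BLOCK_BYTES;
    if (map->num_blocks > MAX_BLOCKS)
        map->num_blocks = MAX_BLOCKS;

    for (uint32_t i = 0; i < map->num_blocks; i++) {
        map->block_addr[i]   = cache_base + i * CACHE_BLOCK_BYTES; /* block i's new location in RAM */
        map->block_of_tag[i] = i;      /* start from a default 1:1 tag-slot-to-block mapping        */
        map->bound[i]        = false;  /* contents cached before the resize are discarded           */
    }
}
```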
It should be understood that the modules of the processor and the cache memory referred to herein, such as the memory access control unit and the controller, and the method steps they perform, may be implemented not only in pure computer-readable program code; the same functionality can also be achieved by logically programming the corresponding functional modules, processes or steps so that they take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, a controller or memory access control unit implemented in this way may be regarded as a hardware component, and the means it contains for implementing the various functions may also be regarded as structures within that hardware component. Indeed, means for performing a given function may be regarded either as a software module for performing the relevant process or method step or as a structure within a hardware component.
References in the specification to "various embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment," and the like, in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, structure, or characteristic illustrated or described in connection with one embodiment may be combined, in whole or in part, with features, structures, or characteristics of one or more other embodiments without limitation, as long as the combination is not illogical or inoperative.
The terms "comprises," "comprising," and "having," and similar referents, in the context of this specification, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The word "a" or "an" does not exclude a plurality. Additionally, the various elements of the drawings of the present application are merely schematic illustrations and are not drawn to scale.
Although the present application has been described through the above-described embodiments, the present application is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present application.

Claims (10)

1. A method for dynamically sharing memory space in a parallel processor, comprising:
updating, by a memory access control unit of the processor, the start positions of a local memory and a cache memory in a memory of the processor respectively, according to received settings of the size of the local memory and the size of the cache memory, wherein one part of the storage space of the memory is used as the local memory and another part is used as a data storage unit of the cache memory, the data storage unit of the cache memory consists of a plurality of cache blocks, and the cache memory further comprises a tag storage unit for storing tags and a mapping unit for storing a mapping relation between each tag in the tag storage unit and each cache block in the data storage unit;
updating, by the memory access control unit of the processor, the setting of the size of an index field in a memory access address of the cache memory according to the received setting of the size of the cache memory; and
re-determining, by the cache memory, according to the size of the cache memory provided by the memory access control unit and the start position of the cache memory in the memory of the processor, the new data storage location corresponding to each cache block in the data storage unit of the cache memory, and establishing, by updating the mapping unit, a mapping between each tag storage location and the new data storage location of each cache block.
2. The method of claim 1, further comprising: according to the received settings of the size of the local memory and the size of the cache memory, first dividing, by the memory access control unit of the processor, a storage space of corresponding size starting from a preset address in the memory of the processor to serve as the local memory, and then allocating storage space for the cache memory, wherein the start position of the local memory is the preset address.
3. The method of claim 1, wherein the cache memory is a set-associative cache, and wherein the size of the index field in the memory access address of the cache memory is determined based on the result of dividing the size of the cache memory by the size of a preset cache block of the cache memory and by the number of tags contained in each set.
4. The method of claim 1, further comprising: in response to receiving a memory access request for the local memory, locating, by the memory access control unit of the processor, the data to be accessed by the memory access request using the updated start position of the local memory.
5. The method of any one of claims 1-4, further comprising: in response to receiving a memory access request for the global memory, mapping, by the memory access control unit of the processor, the address in the memory access request to a memory access address of the cache memory and sending the memory access address to the cache memory, wherein the memory access address uses the updated index field size; and
in response to receiving the memory access request from the memory access control unit, locating, by the cache memory, the data to be accessed by the memory access request according to the established mapping between each tag storage location and the new data storage location of each cache block.
6. The method of claim 5, wherein locating, by the cache memory, in response to receiving the memory access request from the memory access control unit, the data to be accessed by the memory access request according to the established mapping between each tag storage location and the new data storage location comprises:
in the case of a cache hit, determining the cache block corresponding to the hit tag according to the established mapping between each tag storage location and the new data storage location, and extracting the data to be accessed by the memory access request from that cache block as a response to the memory access request; and
in the case of a cache miss, performing the following operations:
allocating a tag storage location for the memory access request to store the tag field in the memory access address of the memory access request, and selecting, from the data storage unit of the cache memory, one of the cache blocks that is not bound to a tag and allocating it to the memory access request;
setting the tag binding bit of the cache block originally corresponding to the allocated tag storage location to indicate that the cache block is not bound to a tag, then establishing a mapping relation between the tag storage location and the cache block allocated to the memory access request, and setting the tag binding bit of the cache block allocated to the memory access request to indicate that it is bound to a tag; and
acquiring the data to be accessed by the memory access request from the next-level memory and storing the data in the cache block allocated to the memory access request.
7. A processor supporting dynamic sharing of memory space, comprising a memory access control unit, a memory and a cache memory, wherein the cache memory comprises a controller, a tag storage unit for storing tags, a data storage unit consisting of a plurality of cache blocks, and a mapping unit, and wherein one part of the storage space of the memory is used as a local memory and another part is used as the data storage unit of the cache memory, wherein:
the memory access control unit is configured to: update the start positions of the local memory and the cache memory in the memory of the processor respectively according to received settings of the size of the local memory and the size of the cache memory; and update the setting of the size of the index field in the memory access address of the cache memory according to the received setting of the size of the cache memory; and
the controller of the cache memory is configured to: re-determine, according to the size of the cache memory provided by the memory access control unit and the start position of the cache memory in the memory of the processor, the new data storage location in the memory corresponding to each cache block in the data storage unit, and establish, in the mapping unit, a mapping between each tag storage location in the tag storage unit and the new data storage location of each cache block in the data storage unit.
8. The processor of claim 7, wherein the memory access control unit is further configured to: according to the received settings of the size of the local memory and the size of the cache memory, first divide a storage space of corresponding size starting from a preset address in the memory to serve as the local memory, and then allocate storage space for the cache memory, wherein the start position of the local memory is the preset address.
9. The processor of claim 7, wherein the memory is implemented in the form of random access memory.
10. The processor of claim 7, wherein the mapping unit is implemented in the form of a register.
CN202211068433.1A 2022-09-02 2022-09-02 Method for dynamically sharing memory space in parallel processor and corresponding processor Active CN115168247B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211068433.1A CN115168247B (en) 2022-09-02 2022-09-02 Method for dynamically sharing memory space in parallel processor and corresponding processor
PCT/CN2023/083990 WO2024045585A1 (en) 2022-09-02 2023-03-27 Method for dynamically sharing storage space in parallel processor, and corresponding processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211068433.1A CN115168247B (en) 2022-09-02 2022-09-02 Method for dynamically sharing memory space in parallel processor and corresponding processor

Publications (2)

Publication Number Publication Date
CN115168247A CN115168247A (en) 2022-10-11
CN115168247B true CN115168247B (en) 2022-12-02

Family

ID=83481167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211068433.1A Active CN115168247B (en) 2022-09-02 2022-09-02 Method for dynamically sharing memory space in parallel processor and corresponding processor

Country Status (2)

Country Link
CN (1) CN115168247B (en)
WO (1) WO2024045585A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168247B (en) * 2022-09-02 2022-12-02 北京登临科技有限公司 Method for dynamically sharing memory space in parallel processor and corresponding processor
CN117093371B (en) * 2023-02-23 2024-05-17 摩尔线程智能科技(北京)有限责任公司 Cache resource allocation method and device, electronic equipment and storage medium
CN116319665B (en) * 2023-03-03 2024-07-05 上海繁易信息科技股份有限公司 Communication method, device, equipment and medium based on dynamic positioning PLC label address

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106155923A (en) * 2015-04-08 2016-11-23 华为技术有限公司 The method and apparatus that internal memory is shared
WO2016206421A1 (en) * 2015-06-26 2016-12-29 中兴通讯股份有限公司 Memory access processing method and device, and storage medium
WO2017031637A1 (en) * 2015-08-21 2017-03-02 华为技术有限公司 Memory access method, apparatus and system
CN109582214A (en) * 2017-09-29 2019-04-05 华为技术有限公司 Data access method and computer system
CN112749120A (en) * 2019-10-29 2021-05-04 辉达公司 Techniques for efficiently transferring data to a processor

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8707006B2 (en) * 2010-10-06 2014-04-22 Oracle International Corporation Cache index coloring for virtual-address dynamic allocators
US8886911B2 (en) * 2011-05-31 2014-11-11 Micron Technology, Inc. Dynamic memory cache size adjustment in a memory device
WO2020199061A1 (en) * 2019-03-30 2020-10-08 华为技术有限公司 Processing method and apparatus, and related device
US11656985B2 (en) * 2020-01-31 2023-05-23 Kove Ip, Llc External memory as an extension to local primary memory
CN113641596B (en) * 2021-10-18 2022-07-19 北京壁仞科技开发有限公司 Cache management method, cache management device and processor
CN115168247B (en) * 2022-09-02 2022-12-02 北京登临科技有限公司 Method for dynamically sharing memory space in parallel processor and corresponding processor


Also Published As

Publication number Publication date
WO2024045585A1 (en) 2024-03-07
CN115168247A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN115168247B (en) Method for dynamically sharing memory space in parallel processor and corresponding processor
US10169232B2 (en) Associative and atomic write-back caching system and method for storage subsystem
US5689679A (en) Memory system and method for selective multi-level caching using a cache level code
US6622219B2 (en) Shared write buffer for use by multiple processor units
US6704822B1 (en) Arbitration protocol for a shared data cache
CA2020275C (en) Apparatus and method for reading, writing, and refreshing memory with direct virtual or physical access
US20070094450A1 (en) Multi-level cache architecture having a selective victim cache
US7216201B2 (en) Parallel cachelets
CN105740164A (en) Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
CN105917319A (en) Memory unit and method
US11734015B2 (en) Cache systems and circuits for syncing caches or cache sets
US11372648B2 (en) Extended tags for speculative and normal executions
US11048636B2 (en) Cache with set associativity having data defined cache sets
CN115168248B (en) Cache memory supporting SIMT architecture and corresponding processor
US11010288B2 (en) Spare cache set to accelerate speculative execution, wherein the spare cache set, allocated when transitioning from non-speculative execution to speculative execution, is reserved during previous transitioning from the non-speculative execution to the speculative execution
US20050240731A1 (en) Managing a multi-way associative cache
US11194582B2 (en) Cache systems for main and speculative threads of processors
US5953747A (en) Apparatus and method for serialized set prediction
US20080209129A1 (en) Cache with High Access Store Bandwidth
CN115357196A (en) Dynamically expandable set-associative cache method, apparatus, device and medium
CN108694133A (en) Device, method and system for instant cache relevance
CN113138851B (en) Data management method, related device and system
US6901450B1 (en) Multiprocessor machine and cache control method for providing higher priority to shared cache that is accessed by multiprocessors
CN115454502A (en) Method for scheduling return data of SIMT architecture processor and corresponding processor
US11334488B2 (en) Cache management circuits for predictive adjustment of cache control policies based on persistent, history-based cache control information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant