CN115705300A - Method for cache memory and related product - Google Patents

Method for cache memory and related product Download PDF

Info

Publication number
CN115705300A
CN115705300A (Application CN202110926707.5A)
Authority
CN
China
Prior art keywords
latch, data, mode, chip, area
Prior art date
Legal status
Pending
Application number
CN202110926707.5A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202110926707.5A
Priority to PCT/CN2022/110740 (WO2023016383A1)
Publication of CN115705300A

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to a method for a cache memory, a system on chip, a board, and a computing device. The computing device of the present disclosure may be embodied as a computing processing device included in a combined processing device, which may further include a general interconnection interface and other processing devices. The computing processing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device connected to the computing processing device and the other processing devices, respectively, for storing data of the computing processing device and the other processing devices. The scheme of the present disclosure can improve the use efficiency of the cache memory.

Description

Method for cache memory and related product
Technical Field
The present disclosure relates generally to the field of chip technology. More particularly, the present disclosure relates to a method for a cache memory, a system on chip including the cache memory, a board including the system on chip, and a computing device including the board.
Background
The operating performance of a computing system depends to a considerable extent on the average access latency of memory. System performance can be significantly improved by increasing the hit rate of a cache memory (referred to as a "cache") and thereby effectively reducing the number of accesses to main memory. To this end, processors typically employ a caching mechanism and use caches to bridge the speed and performance mismatch between the processor and the slower main memory. Current processors commonly implement a multi-level caching mechanism, e.g., employing three levels of cache (L1, L2, and L3), in which the cache closest to main memory is called the Last Level Cache ("LLC"). In view of the frequent use of the cache in a system on chip and its important role, it is necessary to propose an effective management strategy to improve the utilization of the cache and reduce the number of accesses to main memory. In addition, how to extend the application of the LLC to different scenarios is also a problem to be solved.
Disclosure of Invention
In view of the technical problems mentioned in the background section above, the present disclosure provides a residency scheme for a cache memory. With the scheme of the present disclosure, a specific area in the cache memory can be configured as a locked area, and data that will be used multiple times can reside in the locked area, thereby increasing the cache hit rate and improving the overall performance of the system. Based on this, the present disclosure provides a scheme for a cache memory in the following aspects.
In a first aspect, the present disclosure provides a method for a cache memory, comprising: configuring a specific storage space in the cache memory as a latch region that supports a plurality of latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch region; receiving a latch-related request to perform a latch-related operation on data in the latch region; and performing, according to the latch-related request, the latch-related operation on the data in the latch region in the corresponding latch mode.
In a second aspect, the present disclosure provides a cache memory comprising: a configuration module for configuring a specific storage space in the cache memory as a latch region that supports a plurality of latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch region; and a latch execution module configured to: receive a latch-related request to perform a latch-related operation on data in the latch region; and perform, according to the latch-related request, the latch-related operation on the data in the latch region in the corresponding latch mode.
In a third aspect, the present disclosure provides a system on a chip comprising a cache memory as described above and in the following embodiments; and a processor for generating the latch related request; wherein the latch execution module of the cache memory is configured to execute a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
In a fourth aspect, the present disclosure provides a board comprising a system on chip as described above and in various embodiments below.
In a fifth aspect, the present disclosure provides a computing device comprising a board as described above and in various embodiments below.
According to the scheme provided in the aspects of the present disclosure, the latch area can be utilized to latch and unlock data for multiple uses, thereby significantly increasing the cache hit rate. Further, the latch area of the present disclosure supports a plurality of latch modes, and the latch modes can be selectively used according to the configuration, so that the application scenarios of the latch area are expanded. When used in the context of producer and consumer kernels, the latching regions of the present disclosure may serve as a medium for data transfer, thereby promoting data accessibility and usage. In addition, since the probability of cache hit is improved by the latch area, the overall performance of the computing system is also obviously improved by the scheme of the disclosure.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device, according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an internal architecture of a multi-core computing device according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating a method for caching in accordance with an embodiment of the present disclosure;
FIG. 7 is a simplified block diagram illustrating a cache memory according to an embodiment of the present disclosure;
FIG. 8 is a simplified block diagram illustrating a system on a chip according to an embodiment of the present disclosure;
FIG. 9 is a detailed block diagram illustrating a system on a chip according to an embodiment of the present disclosure;
FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram illustrating a hash operation in Window mode according to an embodiment of the present disclosure;
FIG. 12 is a simplified block diagram illustrating a system on a chip according to an embodiment of the present disclosure;
FIG. 13 is a flow chart illustrating a method for a system on a chip according to an embodiment of the present disclosure; and
FIG. 14 is a block diagram illustrating operation of a system on a chip according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is apparent that the described embodiments are some embodiments of the present disclosure, not all embodiments, and the described embodiments may be appropriately combined to implement different applications according to scenes. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," and "third," etc. as may be used in the claims, the description, and the drawings of the present disclosure are used to distinguish between different objects, and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. It should be understood that the configuration and composition shown in FIG. 1 is merely an example, and is not intended to limit the aspects of the present disclosure in any way.
As shown in fig. 1, the board 10 includes a chip 101, which may be a System on Chip (SoC), i.e., a system on chip as described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices. A combined processing device may be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining; deep learning technology in particular is applied extensively in the field of cloud intelligence. One significant characteristic of cloud-based intelligent applications is the large input data size, which places high requirements on the storage capacity and computing capability of the platform. The board 10 of this embodiment is suited to cloud-based intelligent applications, providing large off-chip storage, large on-chip storage, and strong computing capability.
As further shown, the chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like, according to different application scenarios. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board 10 may also include a storage device 104 for storing data, which comprises one or more storage units 105. The storage device 104 is connected to and exchanges data with the control device 106 and the chip 101 through a bus. The control device 106 in the board 10 may be configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a structural diagram showing the combined processing device in the chip 101 of the above-described embodiment. As shown in fig. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a Dynamic Random Access Memory (DRAM) 204.
The computing device 201 may be configured to perform user-specified operations, primarily implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it may be used to perform calculations in terms of deep learning or machine learning, and may also interact with the processing means 203 through the interface means 202 to collectively complete the user-specified operations.
The interface device 202 may be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on-chip with the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor among a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and the like, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The DRAM 204 is used to store the data to be processed. It is typically a DDR memory of 16 GB or more and stores data of the computing device 201 and/or the processing device 203.
Fig. 3 shows the internal structure of the computing device 201 as a single core. The single-core computing device 301 is used to process input data for computer vision, speech, natural language, data mining, and the like. The single-core computing device 301 includes three modules: a control module 31, an operation module 32, and a storage module 33. The control module 31 is used to coordinate and control the operation of the operation module 32 and the storage module 33 to complete deep learning tasks, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used to obtain instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoded results to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, i.e., matrix multiplication and convolution. The storage module 33 is used to store or transport related data, and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a Direct Memory Access (DMA) unit 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep learning network; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 is designed in a hierarchical structure, the multi-core computing device 41 being a system on a chip comprising at least one cluster (cluster) according to the present disclosure, each cluster in turn comprising a plurality of processor cores. In other words, the multi-core computing device 41 is constructed in a system-on-chip-cluster-processor core hierarchy. Looking at the system-on-chip hierarchy, as shown in fig. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be multiple (2 as exemplarily shown) external memory controllers 401 for accessing an external memory device, i.e., an off-chip memory (e.g., DRAM204 in fig. 2) in the context of the present disclosure, in response to an access request issued by the processor core, so as to read data from or write data to the off-chip memory. The peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 and starting the computing device 201 to execute a task. The on-chip interconnect module 403 connects the external memory controller 401, the peripheral communication module 402 and the plurality of clusters 405 for transmitting data and control signals between the respective modules. The synchronization module 404 is a Global synchronization Barrier Controller (GBC) for coordinating the operation progress of the clusters and ensuring the synchronization of the information. The plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41. Although 4 clusters are exemplarily shown in fig. 4, as hardware evolves, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In one application scenario, the cluster 405 may be used to efficiently execute a deep learning algorithm.
Looking at the cluster hierarchy, as shown in fig. 4, each cluster 405 may include a plurality of processor cores (IPU core) 406 and a memory core (MEM core) 407, which may include, for example, a cache memory (e.g., LLC) as described in the context of the present disclosure.
The processor cores 406 are exemplarily shown as four in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise may include three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again here. It should be particularly noted that the storage module 53 may include an Input/Output Direct Memory Access (IODMA) module 533 and a transport Direct Memory Access (MVDMA) module 534. The IODMA 533 controls access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control access between the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
Returning to FIG. 4, the storage core 407 is primarily used to store and communicate, i.e., store shared data or intermediate results among the processor cores 406, as well as perform communications between the cluster 405 and the DRAM204, communications among each other cluster 405, communications among each other processor cores 406, and the like. In other embodiments, the memory core 407 may have the capability of scalar operations to perform scalar operations.
The memory core 407 may include a Static Random-Access Memory (SRAM) 408, a broadcast bus 409, a Cluster Direct Memory Access (CDMA) unit 410, and a Global Direct Memory Access (GDMA) unit 411. In one implementation scenario, the SRAM 408 may assume the role of a high-performance data hub. Thus, data multiplexed between different processor cores 406 within the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed between the processor cores 406 via the SRAM 408. Further, the memory core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, making it possible to improve inter-core communication efficiency and significantly reduce off-chip input/output accesses.
The broadcast bus 409, CDMA410, and GDMA 411 are used to perform communication between the processor cores 406, communication between the cluster 405, and data transmission between the cluster 405 and the DRAM204, respectively. As will be described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM408 to all processor cores 406, which is a special case of multicast.
The CDMA 410 is used to control access to the SRAM 408 between different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be achieved in two ways. The first way is to communicate directly between the DRAM 204 and the NRAM 531 or WRAM 532 through the IODMA 533; the second way is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411, and between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second way may involve more components and a longer data path, in some embodiments the bandwidth of the second way is substantially greater than that of the first, and thus it may be more efficient to perform communication between the DRAM 204 and the NRAM 531 or WRAM 532 in the second way. It is understood that the data transmission schemes described herein are merely exemplary, and those skilled in the art can flexibly select and adapt various data transmission schemes according to the specific arrangement of hardware in light of the teachings of the present disclosure.
In other embodiments, the functionality of GDMA 411 and the functionality of IODMA 533 may be integrated in the same component. Although the present disclosure considers GDMA 411 and IODMA 533 as different components for convenience of description, it will be within the scope of protection of the present disclosure for a person skilled in the art as long as the achieved functions and technical effects are similar to the present disclosure. Further, the functions of GDMA 411, IODMA 533, CDMA410, and MVDMA 534 may be implemented by the same component.
The hardware architecture and its internal structure of the present disclosure are described in detail above in conjunction with figs. 1-5. It is to be understood that the above description is intended to be illustrative and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also modify the board and the internal structure of the present disclosure, and such modifications still fall within the protection scope of the present disclosure. For example, the hardware architecture corresponding to the solution described below may omit the CDMA 410, i.e., the mechanism for controlling access to the SRAM 408 between different clusters 405 within the same computing device 201. Instead, the underlying aspects of the present disclosure relate to improving and optimizing a cache memory disposed, e.g., between the SRAM 408 and the DRAM 204, so as to achieve efficient on-demand latching of data and communication between different clusters through the cache memory.
In order to efficiently use a cache (e.g., LLC) and improve hit rate of data accesses, the following scheme of the present disclosure proposes to configure a specific storage space in the cache as a latch region for a latch operation of data, especially for data that is to be frequently used. For example, the aforementioned frequently used data may be data to be reused between at least one task having a data dependency. It will be appreciated that when data is only used once, then the data may not be latched in the cache.
Further, on the basis of the aforementioned configuration of the latch area for data latching, the following aspect of the present disclosure also proposes to configure the cache memory to support a plurality of latch modes, so as to cause the cache memory to operate in a latch mode corresponding to the aforementioned latch-related request, when the latch-related request is received. The various latching modes of the present disclosure may have a particular priority order to satisfy different latch-related operations, depending on different application scenarios and requirements. In addition, in order to make the cache memory support multiple latch modes, the scheme of the present disclosure also proposes multiple different configuration methods, so that the cache memory can be used more flexibly and utilized to realize communication between clusters.
FIG. 6 is a flow diagram illustrating a method 600 for caching, according to an embodiment of the present disclosure. As shown in fig. 6, the method 600 includes configuring a particular storage space in the cache memory as a latch region that supports multiple latch modes at step S602. In one embodiment, the aforementioned plurality of latch modes may include, but are not limited to, an instruction mode to perform latch related operations based on hardware instructions, a window mode to perform latch related operations based on window attributes, a stream mode to perform latch related operations based on dataflow, and/or a page mode to perform latch related operations based on cache pages. In one embodiment, the aforementioned data streams may be instruction streams or data streams having different types. Taking the data stream as an example, in an application scenario of the neural network, the data stream may be a neural data stream of the neural network model, a weight data stream, an output result data stream, and the like. Additionally, in the context of the present disclosure, the data for which the latch related operation is directed is data to be used multiple times by a processor of the system-on-chip, which has a relatively high priority relative to data for which no latch operation is performed. By latching (or residing) such multiple-use data in the latch region of the present disclosure, the cache hit rate can be significantly increased, thereby improving the overall performance of the system. In addition, by storing the data used repeatedly in the latch area of the LLC, the read/write operation of data between the on-chip system and the off-chip memory (e.g., DDR or DRAM) can be reduced, thereby also improving the access efficiency.
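For illustration only, the following C sketch enumerates the four latch modes described above as they might be represented in driver-side code; the identifiers are assumptions chosen for this sketch and are not names defined by the present disclosure.

    /* Illustrative sketch only: the four latch modes described above. */
    typedef enum {
        LATCH_MODE_INSTRUCTION, /* latch-related operations driven by hardware (IO) instructions */
        LATCH_MODE_WINDOW,      /* latch-related operations driven by a configured lock window   */
        LATCH_MODE_STREAM,      /* latch-related operations driven by per-data-stream attributes */
        LATCH_MODE_PAGE         /* latch-related operations driven by cache-page (SMMU) settings */
    } latch_mode_t;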
In one application scenario, the various latching modes described above may be set to have different priorities according to user preferences or system preferences. For example, in one embodiment, the high-low order of priority may be instruction mode- > window mode- > stream mode- > page mode; in another embodiment, the high-low order of priority can be instruction mode- > page mode- > stream mode- > window mode. With such multi-mode and priority settings, the latch regions in the cache can be used in more ways, increasing the flexibility of latch region use to cope with different application scenarios and system requirements. Further, the above-described latch modes may be sequentially traversed in order of priority, and when the latch mode of high priority is disabled, the latch mode of low priority may be employed.
In one embodiment, a particular memory space may be configured to support a corresponding latch region of a latch mode according to a received configuration command of a plurality of configuration commands. In one scenario, the configuration instruction may include one or more configuration items to implement the configuration of the aforementioned latch region. For example, the plurality of configuration items may include configuration items that enable the latch region, disable the latch region, and/or the latch region size. Further, a corresponding latch strategy (e.g., the size of the latch data or the specific data to be latched) may be configured in the aforementioned instruction mode, window mode, stream mode or page mode for latching different types or specific instructions, data or data streams, etc. Configuring the corresponding latching strategies in the different modes may be found in particular in the description below. With such enabling, disabling, and various specific configurations, the disclosed scheme may enable flexible use of the cache memory such that it may operate in one of the various latching modes of the present disclosure, or in a conventional mode, as desired.
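As a minimal sketch of the configuration items mentioned above (enabling or disabling the latch region and setting its size), the following C fragment shows how such a configuration command might be assembled before being sent to the configuration module of the cache memory. The structure and function names are hypothetical and serve only to illustrate the idea.

    /* Hypothetical configuration command carrying the items described above. */
    struct llc_latch_config {
        int      lock_enable;   /* 1: enable the latch region, 0: disable it   */
        unsigned num_lock_ways; /* size of the latch region, expressed in ways */
    };

    /* Illustrative helper: a configurator could translate such a structure into a
     * configuration instruction for the configuration module (CLR) of the LLC. */
    static void build_latch_config(struct llc_latch_config *cfg, int enable, unsigned ways)
    {
        cfg->lock_enable   = enable;
        cfg->num_lock_ways = ways;  /* e.g., 6 ways latched, remaining ways kept as normal cache */
    }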
Returning to the flowchart in fig. 6, after the configuration operation at step S602 described above is completed, at step S604, a latch related request for performing a latch related operation on data in the latch area is received. According to embodiments of the present disclosure, the latch related request may be triggered by an operation intended to reside specific data in the latch region. Alternatively, the latch related request may also be triggered by an operation aimed at removing or releasing specific data from the latch area. As described in detail above, the latch related requests of the present disclosure may also have different expressions or contents when operating in different latch modes. For example, for instruction mode, window mode, or stream mode, the latch related request may include a configuration item or the like for indicating a behavior attribute of the cache memory.
In one embodiment, the configuration item for indicating the behavior attribute of the cache memory includes at least one of the following configuration attributes:
Transient attribute: the data is not cached in the LLC, i.e., read and write operations are performed directly with the off-chip memory (e.g., DDR); this attribute is used for data that is accessed only once, so that such data does not occupy LLC resources;
Lock attribute: specific data resides in the latch area, and data is read from and written to the hit cache line (cacheline). If the cache line belongs to the latch area, its attribute is configured as the persistent attribute; if the cache line does not belong to the latch area, its attribute is not changed, i.e., it keeps the normal attribute described below. It should be clear that cache lines in the latch area have one of two attributes, namely the persistent attribute and the normal attribute; a cache line with the persistent attribute in the latch area can only be accessed and replaced by latch-related requests carrying the Lock attribute;
Unlock attribute: after data is read from or written to the hit cache line, the storage space corresponding to the data in the latch area of the LLC is released, and the attribute of the corresponding cache line in the latch area is set to the normal attribute described below;
Normal attribute: an ordinary cache request in the LLC, which can directly read and write data with the off-chip memory;
Invalid attribute: the data is invalidated directly after being read, so that it is not written back to the off-chip memory upon replacement;
Clean attribute: in a write operation, data can be written into the hit cache line and the stored content is written back to the off-chip memory, with the attribute of the cache line kept unchanged; in a read operation, data is read from the hit cache line, and when the hit cache line is dirty, the cache line is written back to the off-chip memory;
Default attribute: the Default attribute may be used to indicate that the configuration for this latch mode is to be ignored.
By appending the above-described exemplary configurable attributes to the latch related request, aspects of the present disclosure may perform corresponding latch related operations in the instruction mode according to these appended attributes.
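The following C sketch illustrates one possible way to represent these behavior attributes and attach one of them to a latch-related request; the enumerators and the request structure are assumptions made for illustration and are not defined by the present disclosure.

    /* Illustrative sketch of the configurable behavior attributes listed above. */
    typedef enum {
        ATTR_TRANSIENT, /* bypass the LLC; read/write directly with off-chip memory    */
        ATTR_LOCK,      /* make the data resident in the latch area (persistent line)  */
        ATTR_UNLOCK,    /* release the latched line back to the normal attribute       */
        ATTR_NORMAL,    /* ordinary cache behavior                                      */
        ATTR_INVALID,   /* invalidate after read; not written back on replacement       */
        ATTR_CLEAN,     /* write the hit line and write its content back off-chip       */
        ATTR_DEFAULT    /* ignore the instruction-mode configuration for this request   */
    } llc_attr_t;

    struct latch_request {
        unsigned long addr;  /* access address of the data                    */
        llc_attr_t    attr;  /* behavior attribute accompanying the request   */
    };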
For another example, for page mode, the latch related request may indicate that data related to a particular page is to be latched in the latch region for subsequent multiple uses, or may indicate that data related to a particular page is to be unlatched from the latch region after multiple uses to free up more storage space for subsequent latching of data. It can be understood that the storage space of the latch region can be flexibly used by the release operation, thereby improving the use efficiency of the latch region of the present disclosure.
Returning to the flow of fig. 6, in response to the latch related request of step S604 described above, at step S606, a latch related operation may be performed on the data in the latch area in a corresponding latch mode according to the latch related request. According to an embodiment of the present disclosure, the aforementioned latch related operation may include a read operation and a write operation with respect to the latch area. In one embodiment, for a write operation to a latch region, the method 600 may further include latching data or selected portions of data in a designated area of the latch region for subsequent multiple reads in accordance with a latch-related request. In another embodiment, for a read operation at a latch region, the method 600 may further include releasing data or a selected portion of the data from a designated area of the latch region according to a latch related request after the read operation is performed.
With respect to the selected partial data, in one embodiment, a predetermined proportion of data may be selected from the data in a random manner to form the partial data to be latched in the latch area. In another embodiment, a predetermined proportion of data may be selected from the data using a hashing algorithm to be latched in the latch area as the aforementioned partial data. In a further embodiment, when the access address of the data to be subjected to the latch related operation is within the address range of the lock window, the foregoing hash algorithm may be used to select a portion of the data that can be latched in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with fig. 11.
With the method described above in conjunction with fig. 6, the scheme of the present disclosure enables the cache memory to support multiple latch modes, thereby expanding the application scenarios of the cache memory and significantly improving the cache hit rate. Furthermore, due to the introduction of multiple latch modes, the use of the latch area is more flexible and adaptive, so that different application scenes and user requirements are met. In addition, due to the fact that the data are effectively latched in the latching area, sharing of the data between the producer kernel and one or more consumer kernels is promoted, and the accessibility and the utilization rate of the data are improved. A producer core and a consumer core are herein understood to be two tasks with dependencies, wherein the output of the producer core will be the input passed to the consumer core, so that the consumer core uses the input to complete the corresponding task. At the moment, the output of the producer kernel is used as the input of the subsequent operation, so the output of the producer kernel can be used as the subsequent data needing to be used for many times, and the subsequent data needing to be used for many times can be temporarily stored in a latch area of a cache memory, so that a consumer kernel can directly obtain the input from the cache memory without accessing an off-chip memory, thereby reducing the access interaction between an artificial intelligent processor and the off-chip memory, reducing the IO access cost and further improving the processing efficiency and the performance of the artificial intelligent processor.
FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the present disclosure. It will be appreciated that the cache 700 shown in fig. 7 may be the cache described in connection with fig. 6, and thus the cache described with respect to fig. 6 is equally applicable to the description below with respect to fig. 7.
As shown in fig. 7, the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. The cache memory 700 further includes a storage space for performing cache operations, for example the eight ways (way 0 to way 7) shown in the figure, which evenly divide the storage space into eight parts, wherein each way includes a number of cache lines (cachelines).
In one embodiment, the configuration module described above may be used to configure a specific storage space in the cache memory as a latch region that supports multiple latch modes, where the size of the specific storage space is smaller than the total storage size of the cache memory. For example, way 0 to way 5 in fig. 7 may be configured as the specific storage space supporting latching. Correspondingly, ways 6-7 in fig. 7 may keep the normal attribute of a cache, i.e., they are used as an ordinary cache. As previously described, the latch mode may be an instruction mode, a window mode, a stream mode, and/or a page mode. Further, the latch execution module may be used to receive a latch-related request to perform a latch-related operation on data in the latch region. The latch execution module may then perform the latch-related operation on the data in the latch region in the corresponding latch mode according to the latch-related request. As described above, the latch-related operation here may include a write operation to the latch region (i.e., writing data into the latch region) or releasing data from the latch region. For example, when a consumer kernel has finished using data in the latch region and the data will no longer be used by other consumer kernels, the space occupied by that data in the latch region may be released to latch other data.
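As a rough sketch of the way-based partition shown in fig. 7, the following C fragment marks ways 0-5 as the latch region and leaves ways 6-7 as ordinary cache; the bitmask layout and names are assumptions for illustration only.

    /* Illustrative way partition: ways 0-5 form the latch region, ways 6-7 stay normal. */
    #define NUM_WAYS        8u
    #define LATCH_WAY_MASK  0x3Fu   /* bits 0-5 set: ways 0-5 belong to the latch region */

    static int way_in_latch_region(unsigned way)
    {
        return (way < NUM_WAYS) && ((LATCH_WAY_MASK >> way) & 1u);
    }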
FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the present disclosure. As shown in fig. 8, a system on a chip 800 of the present disclosure may include a cache memory 700 and a processor (or processor core) 802 as shown in fig. 7. In one embodiment, the latch execution module of the cache memory may be configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to a latch-related request. With respect to the cache memory 700, it is described above in connection with fig. 6 and 7 and will not be described again here. With respect to the processor 802, it may be various types of processors and may include one or more processor cores to generate latch related requests in accordance with aspects of the present disclosure. In operation, the latch execution module of the cache memory is configured to perform latch-related operations on data in the latch area in a corresponding latch mode according to the generated latch-related request. For example, when the latch mode is an instruction mode, then the processor may be configured to generate a latch related request based on the received hardware instruction. For another example, when the latch mode is page mode, then the processor may be configured to generate a latch related request according to a cache page configuration. For another example, when the latching mode is a window mode or a stream mode, then the process may be used to configure a lock window and generate a latch related request based on the lock window.
According to various embodiments, the processor 802 may also be an intelligent processor or intelligent Processing Unit ("IPU") including multiple computing cores, which may be configured to perform computations for various artificial Intelligence domains (e.g., neural network aspects).
FIG. 9 is a detailed block diagram illustrating a system-on-chip 900 according to an embodiment of the present disclosure. It is to be appreciated that the system-on-chip 900 shown herein may be a specific implementation of the system-on-chip shown in FIG. 8, and thus what is described with respect to FIG. 8 is equally applicable to FIG. 9. Further, for purposes of example only, the operation of the system-on-chip 900 will be described in a windowed mode (or streaming mode) of a plurality of latched modes.
As shown in FIG. 9, the system-on-chip 900 may include a task Scheduler ("Job Scheduler") 902, which includes a scheduling unit 903 and a configurator 904. In one embodiment, the configurator 904 may be configured to generate configuration instructions based on allocated configuration tasks (e.g., available from a task queue) for transmission to a configuration module (e.g., CLR) of the cache memory (i.e., "LLC" 906). In one embodiment, the scheduling unit 903 may be used to schedule a plurality of tasks (i.e., "kernel" to be executed on an artificial intelligence processor) in a task scheduler for transmission to an Intelligent Processor (IPU) 905 in a system-on-chip of the present disclosure. In the solution of the present disclosure, the intelligent processor 905 herein may include a plurality of processor cores, and the plurality of processor cores may constitute one cluster (cluster) as shown in fig. 4. In one implementation scenario, in a multi-processor core architecture as before, the scheduling unit may allocate tasks to appropriate processor cores according to the idleness (e.g., utilization) of the plurality of processor cores.
Further, the system on chip 900 includes a System Memory Management Unit ("SMMU"), which is configured to translate the virtual address of accessed data into a physical address, so that the relevant storage location can be accessed according to the physical address. In one embodiment, the system memory management unit includes a Translation Lookaside Buffer (TLB, also called a fast table). A page table is maintained in the TLB and includes at least one page table entry, each page table entry comprising a page and a page frame corresponding to that page. In operation, the system memory management unit may determine the page corresponding to a received virtual address, and then determine the physical address (PA) corresponding to the virtual address according to the mapping relationship between the page and the page frame, so as to access the associated storage location of the cache memory according to the physical address.
In one embodiment, access to the cache memory may be achieved through the window mode or the stream mode described above. At this time, the intelligent processor may retrieve the parameter table from the memory and configure a Lock window ("Lock window") associated with data to be subjected to a latch-related operation according to the parameter table, and generate a latch-related request (i.e., an IO access request with, for example, a Lock/unlock attribute attached) according to the configured Lock window. The SMMU may then perform a latch-related operation on the LLC in accordance with the IO access request. In particular, the SMMU may send the aforementioned IO access request to a cache policy module 907 of LLC 906 (which performs the same operations similar to lock execution module 702 in fig. 7) for execution. In one embodiment, the parameter table may include parameter entries for configuring a stream latch attribute in a lock window or stream mode. For example, the parameter items may include, but are not limited to, a Lock/unlock window ("Lock/unlock window"), a Lock/unlock per stream ("per stream Lock/unlock"), a latch Ratio ("Lock Ratio"), a Lock window identification ("Lock window flag"), and the like. In one implementation scenario, the parameters in the parameter table may be user-defined. Thus, the relevant parameters in the parameter table can be obtained during the running phase of the program, and the parameter table is stored in the memory (e.g. DDR) for the intelligent processor (such as IPU 905 in the figure) to use during the execution phase.
In one embodiment, the lock window described above is used to represent the memory space that a software user wishes to latch, and the size of the lock window may be larger than the size of the latch area on the cache memory. The lock window includes one or more of: a base address and a size of the window, wherein the base address of the window may be a Virtual Address ("VA") configured by upper-layer software and corresponds to the starting address of the data to be subjected to the latch-related operation, and the size of the window may correspond to the size of the data to be latched.
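A minimal C sketch of the lock window just described, and of the check of whether an access address falls within its address range, is given below; the field and function names are assumptions made for illustration.

    /* Hypothetical lock window: base virtual address plus size. */
    struct lock_window {
        unsigned long base_va; /* base virtual address configured by upper-layer software */
        unsigned long size;    /* size of the data to be latched                          */
    };

    /* Returns non-zero when the access address va lies inside the lock window. */
    static int hits_lock_window(const struct lock_window *w, unsigned long va)
    {
        return va >= w->base_va && va < w->base_va + w->size;
    }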
Specifically, in the window mode, the intelligent processor may determine, according to a task issued by the task scheduler, the memory access address of the data in the task (which may be a virtual address), and compare this access address with the address range defined by the lock window. If the access address of the data in the task is within the address range of the lock window, the lock window is hit and may be enabled ("Enabled"). Otherwise, if the access address of the data in the task is outside the address range of the lock window, the lock window is not hit; in this case the lock window may be ignored, indicating that the data in the task will not be latched in the cache. Further, when the access address of the data hits the lock window, a predetermined proportion of the data may be selected, using a hash algorithm, as the aforementioned partial data to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with fig. 11. The intelligent processor may then send a latch-related request carrying the Lock attribute to the cache LLC via the SMMU. The latch-related request with the Lock attribute may be used to indicate that specific data resides in the latch area, and the specific data may be the portion of data selected according to the hash algorithm.
The latching and releasing process of the LLC is described below in window mode in connection with FIG. 9.
LLC residency (or locking) process:
Step 1: the task scheduler configures the LLC (e.g., via the cache policy module) by means of the configurator to enable the locked area ("Lock enable"), disable the locked area ("Lock disable"), and set the size of the locked area, i.e., the number of ways ("Way") shown in the figure (Way 0 to Way 7, for example).
Step 2: the task scheduler issues a task kernel to the IPU.
Step 3: the IPU obtains a lock window flag from the parameter table, reads it, and configures the lock window. In one implementation scenario, the parameter table here may be configured by software and stored at a memory address of an off-chip Dynamic Random Access Memory ("DRAM"). The task scheduler may then communicate the address to the IPU, and the IPU may read the parameter table based on the address to complete the configuration of the lock window.
Step 4: the IPU generates a latch-related request through the memory management unit SMMU and, when sending the request to the cache policy module of the LLC, may attach a lock attribute to the request according to the lock window information.
Step 5: after receiving the latch-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of the cache line (i.e., the latch area), for example setting it to "persistent" as described above.
LLC de-residency (or release) process:
Step 6: the task scheduler issues a kernel to the IPU.
Step 7: the IPU obtains an unlock window identifier from the parameter table, reads it, and configures the unlock window.
Step 8: when the IPU issues a request, it attaches an unlock attribute according to the unlock window information.
Step 9: after receiving the request with the unlock attribute, the cache policy module of the LLC switches the hit cache line with the lock attribute to the Normal attribute, such as the Normal attribute described above in connection with the instruction mode.
Step 10: the task scheduler disables the latch area, i.e., LLC lock disable, using the configurator and through the CLR module. In one implementation scenario, the CLR module may clear the previous lock attribute configuration as directed by the configurator.
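For illustration only, the following C sketch mirrors steps 5 and 9 above on the cache policy module side: a request carrying the Lock attribute marks the hit line persistent, while one carrying the Unlock attribute returns it to the normal attribute. It reuses the hypothetical llc_attr_t type sketched earlier; the line states and function name are likewise assumptions.

    /* Illustrative handling of lock/unlock attributes, mirroring steps 5 and 9. */
    typedef enum { LINE_NORMAL, LINE_PERSISTENT } line_state_t;

    static void handle_latch_request(line_state_t *line_state, llc_attr_t attr)
    {
        if (attr == ATTR_LOCK)
            *line_state = LINE_PERSISTENT;  /* step 5: data stored and marked resident       */
        else if (attr == ATTR_UNLOCK)
            *line_state = LINE_NORMAL;      /* step 9: hit persistent line returns to normal */
    }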
The latching scheme of the system on chip of the present disclosure in the window mode is described in detail above in conjunction with fig. 9. By such a latch operation, the probability of cache hit can be significantly increased, the use efficiency of the cache memory can be improved, and the application scenario can be expanded.
The embodiments of the present disclosure further support latch-related operations in the stream mode. When the enable bit corresponding to a data stream in a task of the present disclosure is low, the task is treated as the default case, i.e., no latch-related operation is performed in the stream mode. Conversely, when the enable bit is high, the corresponding latch-related operation may be performed on the data stream in the stream mode. Specifically, the window mode and the stream mode of the present disclosure operate similarly: a predetermined proportion of data from the data stream may be selected, using a hash algorithm and the latch ratio of the data stream, as the aforementioned partial data to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with fig. 11.
As previously mentioned, in one embodiment, the disclosed embodiments also support latch-related operations in the page mode, which is described below in conjunction with FIG. 10.
FIG. 10 is a schematic block diagram illustrating the page mode according to an embodiment of the present disclosure. As shown in fig. 10, according to the solution of the present application, a cache page may be directly configured to have the locking attribute of the present disclosure, so that a cache page that forms a mapping relationship with storage (e.g., "memory") may be used to share accessed data among a plurality of kernels (e.g., kernel 0 to kernel 2 shown in the figure). In one embodiment, a programmer may mark a cache page with the lock attribute using an instruction (e.g., malloc). When a kernel accesses a cache page marked as locked, the SMMU may lock the data corresponding to the cache page in the latch region of the present disclosure. Then, when a subsequent kernel needs to access the aforementioned cache page again, it can read the previously locked data from the corresponding cache line of the latch region, thereby achieving a cache hit. Thus, through the page mode, the scheme of the present disclosure improves the sharing and accessibility of data among multiple kernels.
Specifically, in the page mode, the software driver can directly configure information in the System Memory Management Unit ("SMMU") page table by an instruction and determine, according to this information, whether to perform a page-based latch operation or a normal operation. When the information in the page table indicates that the SMMU is bypassed, no latching is required in the cache, and the attribute of the buffered cache line may be the Normal attribute. When the information indicates that the SMMU uses a linear mapping, the page-based latch operation may be set according to the SMMU linear mapping window configuration; for example, the data corresponding to cache pages within the linear mapping window is locked into the latch region of the present disclosure. The SMMU may generate a corresponding latch-related request based on the information in the page table and send it to the LLC, and the cache policy module of the LLC may configure the cache lines of the LLC according to the latch-related request to perform the corresponding cache-related operation.
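The following C sketch illustrates this page-mode decision: depending on the information configured in the SMMU page table, an access either keeps the normal attribute (bypass) or, when it falls in the linear mapping window, is issued with the lock attribute. It builds on the hypothetical llc_attr_t sketch above; the structure and constant names are assumptions for illustration.

    /* Illustrative page-mode decision based on assumed SMMU page-table information. */
    enum smmu_map_mode { SMMU_BYPASS, SMMU_LINEAR_MAP };

    struct smmu_page_info {
        enum smmu_map_mode mode;
        unsigned long      win_base;  /* linear mapping window base */
        unsigned long      win_size;  /* linear mapping window size */
    };

    static llc_attr_t page_mode_attr(const struct smmu_page_info *pi, unsigned long va)
    {
        if (pi->mode == SMMU_BYPASS)
            return ATTR_NORMAL;                                     /* no latching required          */
        if (va >= pi->win_base && va < pi->win_base + pi->win_size)
            return ATTR_LOCK;                                       /* page inside window is latched */
        return ATTR_NORMAL;
    }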
In one embodiment, the embodiment of the present disclosure further supports an instruction mode, in which the system on chip may configure a latch area in the LLC through an access instruction (IO instruction) in an instruction set.
For example, an IO instruction carries at least one configuration field of latch-related attributes, so that the LLC can be flexibly configured by means of these configuration fields. Here, the various configuration fields may represent the operational behavior that the LLC performs when data accesses are made to the off-chip memory (e.g., DDR space). In an implementation scenario, the instruction includes the following configuration attributes: the Transient attribute, Lock attribute, Unlock attribute, Normal attribute, Invalid attribute, Clean attribute, Default attribute, and the like. Since the instruction mode has the highest priority, when the IO access instruction indicates the Default attribute, this means that the latch-related operation can instead be performed by another mode (such as the window mode, stream mode, or page mode).
By appending the above-described exemplary configurable attributes to the latch related request, aspects of the present disclosure may perform corresponding latch related operations in the instruction mode according to these appended attributes.
When the task scheduler issues a task to the intelligent processor (IPU), the IPU may determine the latch-related request according to the IO instructions in the task. Specifically, when the configuration field of the Lock attribute in an IO instruction is enabled, the Lock attribute is appended to the latch-related request, so that the LLC stores the specified data in the locked area according to the latch-related request carrying the Lock attribute. When the configuration field of the Unlock attribute in an IO instruction is enabled, the Unlock attribute is appended to the latch-related request, so that the LLC releases the locked area according to the latch-related request carrying the Unlock attribute. Depending on the application scenario, other attributes may be attached to the latch-related request in a similar manner.
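As a non-limiting illustration, the following C sketch shows how the enabled configuration field of an IO instruction could be mapped onto the attribute carried by the latch-related request. The bit positions and names (io_attr_bits, decode_io_instruction) are assumptions made for the example and do not describe an actual instruction encoding.

```c
#include <stdint.h>

/* Hypothetical encoding of the latch-related configuration fields of an IO
 * access instruction (bit positions are illustrative only). */
enum io_attr_bits {
    IO_ATTR_TRANSIENT = 1u << 0,
    IO_ATTR_LOCK      = 1u << 1,
    IO_ATTR_UNLOCK    = 1u << 2,
    IO_ATTR_INVALID   = 1u << 3,
    IO_ATTR_CLEAN     = 1u << 4,
    IO_ATTR_DEFAULT   = 1u << 5,   /* defer to window, stream, or page mode */
};

typedef enum { REQ_NORMAL, REQ_LOCK, REQ_UNLOCK, REQ_INVALID, REQ_CLEAN, REQ_DEFER } req_kind_t;

/* Map the enabled configuration field onto the attribute that the IPU would
 * attach to the latch-related request sent to the LLC. */
static req_kind_t decode_io_instruction(uint32_t attr_fields)
{
    if (attr_fields & IO_ATTR_DEFAULT)
        return REQ_DEFER;        /* instruction mode yields to the other modes  */
    if (attr_fields & IO_ATTR_LOCK)
        return REQ_LOCK;         /* store the data in the locked (latch) area   */
    if (attr_fields & IO_ATTR_UNLOCK)
        return REQ_UNLOCK;       /* release the locked area                     */
    if (attr_fields & IO_ATTR_INVALID)
        return REQ_INVALID;
    if (attr_fields & IO_ATTR_CLEAN)
        return REQ_CLEAN;
    return REQ_NORMAL;           /* Normal attribute: ordinary caching behavior */
}
```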
Further, in some operation scenarios, the instruction also includes a specific configuration field for indicating the latch ratio. When this configuration field (e.g., a specific bit inst_ratio_en) is low, the latch operation is considered to depend on the instruction configuration, i.e., the latch-related request is determined according to the specific IO instruction in the task. When the bit is high, a predetermined proportion of the data in the data stream may be selected as the partial data to be stored in the latch area, by applying a hash algorithm and comparing the result with the latch ratio (lock ratio) indicated by the instruction. The specific use of the hash algorithm is described in detail below in conjunction with FIG. 11.
FIG. 11 is a flowchart illustrating a hash operation in the window mode or the stream mode according to an embodiment of the present disclosure. The scheme of the present disclosure uses a hash operation to perform residency (i.e., locking) at a certain proportion, because one of the key issues with LLC residency is the trade-off between bandwidth and capacity. The present disclosure therefore proposes to reside only a certain proportion of the data (i.e., the Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks. Assuming a preset Lock Ratio value of P (e.g., expressed as a percentage), the expected bandwidth is B = 6T × P + 2T × (1 - P), where 6T is the read rate for data resident in the LLC, 2T is the read rate for data stored in the memory (e.g., DRAM), and T = 1000 Gbit/s. As previously mentioned, the Lock Ratio may be configured in a lock/unlock window or may be configured for a specific data stream. In addition, although the hash operation in the window mode or the stream mode is described below, a similar operation is also applicable to the hash operation in the instruction mode.
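A minimal numeric check of the expected-bandwidth expression is shown below, using only the figures given above (6T for LLC-resident reads, 2T for DRAM reads, T = 1000 Gbit/s); it simply evaluates B = 6T × P + 2T × (1 - P) for a few lock ratios.

```c
#include <stdio.h>

int main(void)
{
    const double T = 1000.0;           /* Gbit/s, as stated above              */
    const double llc_rate  = 6.0 * T;  /* read rate for data resident in LLC   */
    const double dram_rate = 2.0 * T;  /* read rate for data served from DRAM  */

    /* Expected bandwidth B = 6T*P + 2T*(1-P) for several lock ratios P. */
    for (double P = 0.0; P <= 1.0; P += 0.25) {
        double B = llc_rate * P + dram_rate * (1.0 - P);
        printf("Lock Ratio %.2f -> expected bandwidth %.0f Gbit/s\n", P, B);
    }
    return 0;
}
```

For instance, a Lock Ratio of 0.25 yields an expected bandwidth of 3000 Gbit/s under these assumptions, while a ratio of 1.0 yields 6000 Gbit/s.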
Specifically, in the window mode or the stream mode, the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address falls within that range. When the requested address is within the address range of the lock window, a hash operation may be performed on the hit window address range, as shown in FIG. 11. Here, the access address of each piece of data may be a virtual address.
Specifically, with the help of a globally fixed hash rule, the virtual address (VA) of the access address can be mapped onto a hash space (i.e., the "Hash Map" in the figure), and the hash process may preferentially preserve the low-order address information. The hash value obtained at 1102 is then compared with the latch ratio (Lock Ratio) at 1104, so as to randomly select a corresponding proportion of the data. Specifically, when the hash value of the access address is less than the latch ratio, it is considered a hit, and that portion of data (i.e., the proportion-matched data) may be latched in the cache. Conversely, when the hash value of the access address is greater than or equal to the latch ratio, it is considered a miss, and that portion of data will not be latched in the cache.
For example, when the latch ratio (Lock Ratio) is set to 10%, the partial data corresponding to the first 10% of the hash values, i.e., the data whose access-address hash values are smaller than the latch ratio, is selected for the latch-related operation. In other examples, the latch ratio may take other values; it may be set by the user through software, and the selection described above is carried out according to the configuration of the hash algorithm. For example, the latch ratio may also be 20% to 30%, in which case the partial data corresponding to the first 20% to 30% of the hash values is selected for the latch-related operation. Thereafter, processing may be performed at 1106 according to the specified request type, i.e., locking or unlocking the selected portion of the data.
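The following C sketch illustrates the proportion-based selection described above. The hash function itself is only a placeholder: the disclosure requires a globally fixed rule that preferentially preserves the low-order address bits, and the particular mixing used here is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the globally fixed hash rule: fold the virtual address into
 * the range [0, 100), letting the low-order bits dominate the result. */
static unsigned hash_to_percent(uint64_t va)
{
    uint64_t h = va ^ (va >> 7) ^ (va >> 17);   /* illustrative mixing only */
    return (unsigned)(h % 100u);
}

/* Return true if this access should be latched, given a lock ratio expressed in
 * percent (e.g., 10 means roughly 10% of the window's addresses are locked). */
static bool should_latch(uint64_t va, unsigned lock_ratio_percent)
{
    return hash_to_percent(va) < lock_ratio_percent;   /* hit: latch the data */
}
```

Because the rule is fixed, every later access to the same address makes the same decision, so the latched subset of the window remains stable across reads.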
The cache latching scheme of the present disclosure has been described in detail above in connection with FIGS. 6-11. Building on the idea of the foregoing latching scheme, and in addition to it, another extended application of the present disclosure to the cache memory is described below in connection with FIGS. 12-14, namely how to implement communication between clusters in a system on chip through the cache.
FIG. 12 is a simplified block diagram illustrating a system on chip according to an embodiment of the present disclosure. In conjunction with the foregoing description, it can be understood that the system on chip here may be the system on chip included in the computing device 201 shown in FIG. 2, for example a system on chip formed by the multi-core computing device 41. As shown in FIG. 12, the system on chip 1200 exemplarily includes four clusters, cluster 0 to cluster 3. Since clusters have already been described in detail above, they are not described again here. Also shown is a cache memory 1201, which may be provided, for example, in the SRAM 408 shown previously in FIG. 5 and is used to perform inter-cluster data transfer operations. In one implementation scenario, the cache 1201 may also communicate bidirectionally with on-chip and off-chip DRAM (e.g., DDR), including the transfer of various types of data or instructions.
FIG. 13 is a flowchart illustrating a method 1300 for a system on chip according to an embodiment of the present disclosure. The system on chip here may be the system on chip shown in FIG. 12. Specifically, the system on chip includes at least a plurality of clusters for performing arithmetic operations and a cache memory interconnected with the plurality of clusters. In one implementation scenario, each cluster may include a plurality of processor cores to perform the arithmetic operations. In one implementation scenario, the latch area determined in the cache memory as described above may be used to carry out data communication between the clusters, so that the system on chip no longer needs to be provided with communication modules such as the CDMA 410 and the GDMA 411.
In one embodiment, the latch area may be used to pass data between tasks having dependencies; for example, it may be used to pass data between a producer kernel and a consumer kernel. Specifically, the processor may lock in the LLC, through a configured lock window, the data that the producer kernel needs to hand over to the consumer kernel. In one scenario, when the processor has finished executing the producer kernel, the data that needs to be passed to the consumer kernel (which may be input data or output data of the producer kernel) may be latched. To this end, the processor may perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU described above, thereby latching the data to be exchanged in the LLC in the window mode for later use by the consumer kernel. Correspondingly, the processor may release the latch area according to an unlock window configured for the consumer kernel; that is, after the processor completes execution of the consumer kernel by performing a read operation on the data latched in the LLC, it may release the storage space occupied by that data in the latch area of the LLC. A minimal sketch of this sequencing is given below.
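The sketch below outlines how a driver might sequence the lock and unlock windows around the producer and consumer kernels. The function names (configure_lock_window, configure_unlock_window) are hypothetical stand-ins for the window configuration interfaces described above; the stub bodies exist only to make the sketch self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical driver hooks standing in for programming the lock/unlock
 * windows (e.g., via the SMMU and the task scheduler); stubs for illustration. */
static void configure_lock_window(uint64_t base, size_t size)
{
    printf("lock window:   base=0x%llx size=%zu\n", (unsigned long long)base, size);
}

static void configure_unlock_window(uint64_t base, size_t size)
{
    printf("unlock window: base=0x%llx size=%zu\n", (unsigned long long)base, size);
}

/* Producer output written inside the lock window is latched in the LLC; the
 * consumer's reads hit the latch area, and the unlock window releases the
 * corresponding storage space once the data has been consumed. */
static void run_producer_consumer(void *shared_buf, size_t size,
                                  void (*producer)(void *), void (*consumer)(void *))
{
    uint64_t base = (uint64_t)(uintptr_t)shared_buf;

    configure_lock_window(base, size);    /* latch the producer's output         */
    producer(shared_buf);

    configure_unlock_window(base, size);  /* release after the consumer has read */
    consumer(shared_buf);
}
```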
Based on the above, the latch area can be configured for transferring data between tasks having dependency relationships, and it can also be used in application scenarios of inter-cluster communication. For example, one cluster or processor core transmits data (which may be the data that the producer kernel needs to hand over to the consumer kernel) to processors in other clusters via the latch area for merging processing. The processors in the other clusters then read the data from the latch area for processing, thereby realizing inter-cluster transfer of the data. The manner in which the latch area is used for inter-cluster communication is described in detail below.
As shown in fig. 13, the present disclosure also includes a method for inter-cluster communication using a latch region of a cache memory, the method comprising:
At step S1302, a specified storage space of the off-chip memory is mapped to a given storage area of the cache memory ("cache"), which has the same physical attributes as the lock area described above in connection with the figures, so as to use the given storage area as a cluster storage area for inter-cluster data communication. In one implementation scenario as illustrated in FIG. 8, the cache memory may include the LLC and the off-chip memory includes a DDR. On this basis, the specified storage space may be the storage space indicated at 1402 in FIG. 14, and the cluster storage area may correspondingly be the given storage area in the cache at 1404 in FIG. 14. In one implementation scenario, the specified storage space of the DDR may be designated by software configuration and mapped to the given space on the cache for communication between clusters (e.g., cluster 0 and cluster 1 as shown in FIG. 14). After the division and determination of the cluster storage area are completed, at step S1304 the operations of the clusters may be performed using the determined cluster storage area. One possible shape of such a software-configured mapping is sketched below.
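As a non-limiting illustration, the following C sketch shows a descriptor that software could use to record the mapping configured at step S1302. The structure name, the field set, and the constants are all hypothetical; the disclosure only requires that a specified DDR region be mapped onto a given area of the cache by software configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for the software-configured mapping between a
 * specified off-chip DDR region and the cluster storage area in the cache. */
typedef struct {
    uint64_t ddr_base;   /* base of the specified storage space in DDR       */
    size_t   size;       /* size of the mapped region                        */
    uint64_t cache_area; /* identifier/offset of the given area in the cache */
} cluster_area_cfg_t;

/* Example configuration: map a 4 MiB DDR region onto the cluster storage
 * area so that, e.g., cluster 0 and cluster 1 can exchange data through it.
 * The values below are illustrative only. */
static const cluster_area_cfg_t k_cluster_area = {
    .ddr_base   = 0x80000000ull,
    .size       = 4u * 1024u * 1024u,
    .cache_area = 0,
};
```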
In one embodiment, using the cluster storage area to perform the operations of the clusters may include using the cluster storage area for inter-cluster communication. In this case, using the cluster storage area for inter-cluster communication may specifically include enabling point-to-point communication between clusters by means of the cluster storage area. Additionally, the cluster storage area may be used to enable broadcast communication from one of the plurality of clusters to the remaining clusters. In a point-to-point communication scenario, the cluster storage area may be configured to receive a write operation of write data from a first cluster and, in response to a read operation of a second cluster, send the data previously written by the first cluster to the second cluster.
In an example implementation of the above-described write operation, the cluster storage area may also receive a lock indication, such as the write lock ("write lock") shown in FIG. 14, i.e., the latch-related request carrying the Lock attribute described above, indicating that the write data associated with the write operation is to reside in the cluster storage area. The write data then resides in the cluster storage area based on the lock indication, where the cluster storage area may be the latch area determined in the above embodiments. Through such residency, the hit rate in the cache memory of data that is to be read multiple times can be significantly improved.
In one implementation scenario, a producer kernel executing in one of the clusters may, through the write lock described above, lock the data that needs to be handed over to the consumer kernel in the LLC for later use by the consumer kernel; for example, the producer kernel transmits data via the LLC to processors in other clusters for merging processing. The processors in the other clusters can then read the data from the cluster storage area for processing, thereby realizing inter-cluster transfer of the data.
In one example implementation of the above-described read operation, the cluster storage area may also receive a read-invalidate indication, such as the read invalidate ("read invalidate") issued from cluster 1 in FIG. 14, which causes the write data not to be written back to the off-chip memory. The read-invalidate indication may be a latch-related request carrying the Invalid attribute, generated in the manner described above; in different latch modes, the corresponding latch-related requests may differ. After sending the write data to cluster 1, the cluster storage area may then invalidate the cache lines associated with the write data based on the read-invalidate indication.
To achieve synchronization of the above-described data transfer (or communication) between clusters, the cluster writing data to the cluster storage area (e.g., cluster 0) may send a synchronization instruction to the other cluster (e.g., cluster 1) after the write operation is completed, for example via the hsem ("hardware semaphore") shown in FIG. 14. Upon receiving the synchronization instruction, cluster 1 may issue the above-mentioned read-invalidate request to the cluster storage area, so that after it has read the data written by cluster 0 into the cluster storage area, the corresponding cache lines are invalidated and the write-back of that data is avoided.
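The sequence just described is summarized in the C sketch below. All of the primitives (llc_write_lock, llc_read_invalidate, hsem_post, hsem_wait) are hypothetical stand-ins, assumed to be provided by the runtime or driver, for the write-lock, read-invalidate, and hardware-semaphore mechanisms named above; they are not part of any documented API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical primitives assumed to be provided by the runtime/driver. */
extern void llc_write_lock(uint64_t area, const void *src, size_t n); /* write + Lock attribute   */
extern void llc_read_invalidate(void *dst, uint64_t area, size_t n);  /* read + Invalid attribute */
extern void hsem_post(int sem_id);                                    /* hardware semaphore post  */
extern void hsem_wait(int sem_id);                                    /* hardware semaphore wait  */

/* Cluster 0: write data into the cluster storage area, then signal cluster 1. */
void cluster0_send(const void *buf, uint64_t area, size_t n, int sem)
{
    llc_write_lock(area, buf, n);  /* the data resides in the cluster storage area */
    hsem_post(sem);                /* synchronization instruction to cluster 1     */
}

/* Cluster 1: wait for the signal, read the data, then invalidate the lines so
 * that the transferred data is not written back to the off-chip memory. */
void cluster1_receive(void *buf, uint64_t area, size_t n, int sem)
{
    hsem_wait(sem);
    llc_read_invalidate(buf, area, n);
}
```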
In the context of the present disclosure, the above actions of writing data to and reading data from the cluster storage area may also be collectively referred to as latch-related operations triggered by latch-related requests, which may be determined as described above. In particular, a latch-related request may be used to indicate a latching operation; through the latching operation, the data is latched in the cluster storage area for subsequent multiple uses. Further, a latch-related request may be used to indicate a release operation; through the release operation, the data can be unlocked from the cluster storage area so as to free up storage space for data to be latched subsequently. It can be understood that, through the release operation, the storage space of the cluster storage area can be used flexibly, thereby improving the usage efficiency of the cluster storage area of the present disclosure.
In one embodiment, for a read operation on the cluster storage area, the data or a selected part of the data may be released from the designated area of the cluster storage area according to the latch-related request after the read operation is performed. With respect to the selected partial data, in one embodiment a predetermined proportion of the data may be selected in a random manner to form the partial data to be latched in the latch area. In another embodiment, a hash algorithm may be used to select a predetermined proportion of the data to be latched in the cluster storage area as the aforementioned partial data, as described in detail above with reference to FIG. 11.
The aspects of the present disclosure are described in detail above with reference to the accompanying drawings. According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, this disclosure presents some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the disclosed aspects are not limited by the order of acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The aforementioned Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory ("Read Only Memory", abbreviated as ROM), a Random Access Memory ("Random Access Memory", abbreviated as RAM), a removable hard disk, a magnetic disk or an optical disk, and various media capable of storing program codes.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory ("Resistive Random Access Memory", abbreviated as RRAM), a Dynamic Random Access Memory ("Dynamic Random Access Memory", abbreviated as DRAM), a Static Random Access Memory ("Static Random Access Memory", abbreviated as SRAM), an Enhanced Dynamic Random Access Memory ("Enhanced Dynamic Random Access Memory", abbreviated as EDRAM), a High Bandwidth Memory ("High Bandwidth Memory", abbreviated as HBM), a Hybrid Memory Cube ("Hybrid Memory Cube", abbreviated as HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1. A method for a cache memory, comprising:
configuring a particular storage space in the cache memory as a latch region that supports a plurality of latch modes;
receiving a latch-related request for performing a latch-related operation on data in the latch region; and
performing, according to the latch-related request, the latch-related operation on the data in the latch region in the corresponding latch mode.
Clause a2. The method of clause A1, wherein the plurality of latching modes are performed in a predetermined order of priority.
Clause a3. The method of clause A1 or 2, wherein the plurality of latch modes includes an instruction mode to perform latch related operations based on hardware instructions, a window mode to perform latch related operations based on window attributes, a stream mode to perform latch related operations based on dataflow, and/or a page mode to perform latch related operations based on cache pages.
Clause a4. The method according to clause A3, wherein,
in the instruction mode, the latch related request is determined according to the hardware instruction;
in the page mode, the latch related request is determined according to a cache page configuration;
in the window mode or the stream mode, the latch related request is determined according to a lock window.
Clause a5. The method according to clause A4, wherein in the instruction mode, the window mode or the stream mode, the latch related request can be accompanied by a lock attribute for indicating that a specific data is resident in the latch area, the specific data being a partial data selected according to a hash algorithm.
Clause a6. The method of clause A3 or A4, wherein in page mode, the method comprises:
the cache page based latching operation is performed according to a linear mapping window of a system memory management unit.
Clause A7. the method of clause A3, wherein configuring the latching region that supports the plurality of latching modes comprises:
and configuring a specific storage space into a latch area supporting a corresponding one of the latch modes according to one of the received configuration instructions, wherein the configuration instruction comprises configuration items of enabling the latch area, disabling the latch area and/or the size of the latch area.
Clause A8. the method of clause A7, wherein for a write operation to a latch region, the method includes latching the data or a selected portion of the data within a designated area of the latch region according to the latch-related request for use in subsequent multiple reads.
Clause A9. the method of clause A7, wherein for a read operation of a latch region, the method includes releasing the data or a selected portion of the data from a designated area of the latch region according to the latch related request after the read operation is performed.
Clause a10. A cache memory comprising:
a configuration module for configuring a particular storage space in the cache memory into a latch region that supports a plurality of latch modes;
a latch execution module to:
receive a latch-related request for performing a latch-related operation on data in the latch region; and
perform, according to the latch-related request, the latch-related operation on the data in the latch region in the corresponding latch mode.
Clause a11. A system on a chip, comprising:
the cache memory according to clause a 10; and
a processor for generating the latch related request;
wherein the latch execution module of the cache memory is configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
Clause a12. The system-on-chip of clause a11, wherein the latching mode comprises an instruction mode, and in the instruction mode, the processor is to generate the latch-related request according to the received hardware instruction.
Clause a13. The system-on-chip of clause a11, wherein the latching mode comprises a page mode, and in the page mode the processor is to generate the latch-related request according to a cache page configuration.
Clause a14. The system-on-chip of clause a11, wherein the latching mode comprises the window mode or the streaming mode, and in the window mode or streaming mode, the system-on-chip further comprises: a task scheduler comprising a configurator and a scheduling unit, wherein:
the configurator is used for generating the configuration instructions according to the allocated configuration tasks so as to send the configuration instructions to the configuration module of the cache memory; and
the scheduling unit is used for scheduling a plurality of tasks in the task scheduler so as to send the tasks to the processor cores.
Clause a15. The system-on-chip of clause a14, wherein the configuration instructions include configuration items to enable latch regions, disable latch regions, and/or latch region sizes.
Clause a16. The system-on-chip according to clause a15, wherein the processor further comprises a system memory management unit for, in either the window mode or the stream mode:
configuring a lock window associated with data to be subjected to latch-related operations according to a parameter table; and
generating the latch related request according to the configured lock window.
Clause a17. The system-on-chip of clause a16, wherein the configuration items of the lock window include one or more of:
a base address and a size of a window, wherein the base address of the window corresponds to a starting address of data to be subjected to a latch related operation and the size of the window corresponds to the size of the data;
a latch indication to latch data in the latch region;
an unlock indication to unlock data from the latch region; and
a latch ratio indicating a ratio of data to be actually latched among data to be subjected to the latch-related operation.
Clause a18. The system-on-chip according to clause a17, wherein the processor is further configured to select, using a hash algorithm, a portion of data that can be latched in the latch area when an access address of the data to be subjected to the latch-related operation is within an address range of the lock window.
Clause a19. The system-on-chip according to clause a17, wherein the processor is configured to randomly select the partial data satisfying a predetermined latch ratio from the data to be latched according to a hash algorithm, and to generate a latch-related request accompanied by a lock attribute for latching in the latch area.
Clause a20. The system-on-chip of clause a14, wherein the processor is configured to perform a write operation on the data in the latch area, the latch execution module is configured to latch the data or a selected portion of the data in a designated area of the latch area in accordance with the latch-related request, and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data after performing the read operation from the designated area of the latch area in accordance with the latch-related request.
Clause a21. The system-on-chip according to any of clauses a16-a20, wherein the task comprises a producer kernel and a consumer kernel, wherein:
when a producer core is executed, the processor is used for latching data output by the producer core into the latch area through the latch-related request for use by the consumer core; and
the processor is configured to read data from the latch area when executing the consumer core, and unlock the data from the latch area by the latch related request after reading the data, so as to release the storage space for the data in the latch area.
Clause a22. A board comprising the system-on-chip of any one of clauses a11-a 21.
Clause a23. A computing device comprising the board of clause a22.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (23)

1. A method for a cache memory, comprising:
configuring a particular storage space in the cache memory as a latch region that supports a plurality of latch modes;
receiving a latch-related request for performing a latch-related operation on data in the latch region; and
performing, according to the latch-related request, the latch-related operation on the data in the latch region in the corresponding latch mode.
2. The method of claim 1, wherein the plurality of latching modes have a predetermined order of priority.
3. The method of claim 1 or 2, wherein the plurality of latch modes comprise an instruction mode to perform latch related operations based on hardware instructions, a window mode to perform latch related operations based on window attributes, a stream mode to perform latch related operations based on dataflow, and/or a page mode to perform latch related operations based on cache pages.
4. The method of claim 3, wherein,
in the instruction mode, the latch related request is determined according to the hardware instruction;
in the page mode, the latch related request is determined according to a cache page configuration;
in the window mode or the stream mode, the latch related request is determined according to a lock window.
5. The method of claim 4, wherein in the instruction mode, the window mode, or the stream mode, the latch related request can be accompanied by a lock attribute indicating that a particular piece of data is resident in the latch area, the particular piece of data being a selected portion of data according to a hashing algorithm.
6. The method of claim 3 or 4, wherein in page mode, the method comprises:
the cache page based latching operation is performed according to a linear mapping window of a system memory management unit.
7. The method of claim 3, wherein configuring the latching region that supports the plurality of latching modes comprises:
configuring a specific storage space into a latch area supporting a corresponding one of the latch modes according to one of the received configuration instructions;
wherein the configuration instructions include configuration items to enable a latch region, disable a latch region, and/or a latch region size.
8. The method of claim 7, wherein for a write operation to a latch region, the method comprises latching the data or selected portions of the data within a designated area of the latch region in accordance with the latch related request for use in subsequent multiple reads.
9. The method of claim 7, wherein for a read operation at a latch region, the method comprises releasing the data or selected portions of the data from a designated area of the latch region in accordance with the latch-related request after the read operation is performed.
10. A cache memory, comprising:
a configuration module for configuring a particular storage space in the cache memory into a latch region that supports a plurality of latch modes;
a latch execution module to:
receive a latch-related request for performing a latch-related operation on data in the latch region; and
perform, according to the latch-related request, the latch-related operation on the data in the latch region in the corresponding latch mode.
11. A system on a chip, comprising:
the cache memory of claim 10; and
a processor for generating the latch related request;
wherein the latch execution module of the cache memory is configured to execute a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
12. The system-on-chip of claim 11, wherein the latching mode comprises an instruction mode, and in the instruction mode, the processor is to generate the latch-related request according to a received hardware instruction.
13. The system-on-chip of claim 11, wherein the latching mode comprises a page mode, and in the page mode, the processor is to generate the latch-related request according to a cache page configuration.
14. The system-on-chip of claim 11, wherein the latching mode comprises the window mode or the streaming mode, and in the window mode or streaming mode, the system-on-chip further comprises: a task scheduler comprising a configurator and a scheduling unit, wherein:
the configurator is used for generating the configuration instructions according to the allocated configuration tasks so as to send the configuration instructions to the configuration module of the cache memory; and
the scheduling unit is used for scheduling a plurality of tasks in the task scheduler so as to send the tasks to the processor.
15. The system on a chip of claim 14, wherein the configuration instructions comprise configuration items to enable a latch region, disable a latch region, and/or a latch region size.
16. The system on chip of claim 15, wherein the processor further comprises a system memory management unit to, in a window mode or a stream mode:
configuring a lock window associated with data to be subjected to latch-related operations according to a parameter table; and
generating the latch related request according to the configured lock window.
17. The system on a chip of claim 16, wherein the configuration items of the lock window include one or more of:
a base address and a size of a window, wherein the base address of the window corresponds to a starting address of data to be subjected to a latch related operation and the size of the window corresponds to the size of the data;
a latch indication to latch data in the latch region;
an unlock indication to unlock the data from the latch region; and
a latch ratio indicating a ratio of data to be actually latched among data to be subjected to the latch-related operation.
18. The system on a chip of claim 17, wherein the processor is further configured to:
and when the access address of the data to be subjected to the latch related operation is in the address range of the locking window, selecting partial data capable of being latched in the latch area by adopting a Hash algorithm.
19. The system-on-chip as claimed in claim 17, wherein said processor is configured to randomly select said portion of data satisfying a predetermined said latch ratio from data to be latched according to a hashing algorithm, and to generate a latch-related request accompanied by a locking attribute for latching in said latch area.
20. The system-on-chip of claim 14, wherein the processor is configured to perform a write operation on the data in the latch area, the latch execution module is configured to latch the data or a selected portion of the data in a designated area of the latch area according to the latch-related request, and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data after performing the read operation from the designated area of the latch area according to the latch-related request.
21. The system on a chip of any of claims 16-20, wherein the tasks include a producer kernel and a consumer kernel, wherein:
when a producer core is executed, the processor is used for latching data output by the producer core into the latch area through the latch-related request for use by the consumer core; and
the processor is configured to read data from the latch area when executing the consumer core, and to unlock the data from the latch area by the latch related request after reading the data, so as to release a storage space for the data in the latch area.
22. A board comprising the system on chip of any of claims 11-21.
23. A computing device comprising the card of claim 22.
CN202110926707.5A 2021-08-12 2021-08-12 Method for cache memory and related product Pending CN115705300A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110926707.5A CN115705300A (en) 2021-08-12 2021-08-12 Method for cache memory and related product
PCT/CN2022/110740 WO2023016383A1 (en) 2021-08-12 2022-08-08 Method for cache memory and related products

Publications (1)

Publication Number Publication Date
CN115705300A 2023-02-17

Family

ID=85180996


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination