CN115878553A - Method for system on chip and related product - Google Patents

Method for system on chip and related product

Info

Publication number
CN115878553A
Authority
CN
China
Prior art keywords
cluster
data
chip
latch
storage area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110926703.7A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202110926703.7A priority Critical patent/CN115878553A/en
Priority to PCT/CN2022/110740 priority patent/WO2023016383A1/en
Publication of CN115878553A publication Critical patent/CN115878553A/en
Pending legal-status Critical Current

Abstract

The present disclosure relates to a method for a system on chip, a computing device, and a board, where the computing device is included in a combined processing device that may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The scheme of the present disclosure can expand the usage scenarios of the cache memory and improve its usage efficiency.

Description

Method for system on chip and related product
Technical Field
The present disclosure relates generally to the field of storage. More particularly, the present disclosure relates to a method for a system on chip, a corresponding system on chip, a computing device comprising the system on chip and a board comprising the computing device.
Background
The operating performance of a computing system depends to a considerable extent on the average access latency of the memory. System performance can be significantly improved by increasing the hit rate of a cache memory ("cache"), thereby effectively reducing the number of accesses to main memory. To this end, processors typically employ a caching mechanism and utilize caches to accommodate the speed and performance mismatch between the processor and the lower-speed main memory. Current processors implement a multi-level caching mechanism, e.g., employing three levels of cache (L1, L2, and L3), in which the cache closest to main memory is called the Last Level Cache ("LLC"). In view of the frequent use of the cache in a system on chip and its important role, it is necessary to propose an effective management strategy to improve the utilization rate of the cache and reduce the number of accesses to main memory. In addition, how to extend the use of the LLC to different scenarios is also a problem to be solved.
Disclosure of Invention
In view of the technical problems mentioned in the background section above, it is desirable to expand the usage scenarios of the cache memory. To this end, the present disclosure provides a scheme for a system on chip in the following aspects.
In a first aspect, the present disclosure provides a method for a system on a chip including at least a plurality of clusters for performing arithmetic operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the arithmetic operations, the method comprising: mapping a designated storage space of the off-chip memory to a given storage area of the cache memory to use the given storage area as a cluster storage area for inter-cluster data communication; and performing operations of the cluster using the cluster storage area.
In a second aspect, the present disclosure provides a system on a chip comprising: a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing arithmetic operations; and a cache memory interconnected with the plurality of clusters and configured to: using a given storage area as a cluster storage area for inter-cluster data communication, wherein the given storage area forms a mapping relation with a designated storage space of an off-chip memory; and performing operations of the cluster using the cluster storage area.
In a third aspect, the present disclosure provides a computing device comprising a system on chip as described above and in various embodiments below.
In a fourth aspect, the present disclosure provides a board comprising a computing device as described above and in various embodiments below.
In a fifth aspect, the present disclosure provides a computing device comprising a board as described above and in various embodiments below.
According to the scheme provided in the above aspects of the present disclosure, efficient communication between clusters of a system on chip can be achieved with a given storage area of a cache memory. Therefore, data which needs to be transferred through the off-chip memory can be directly transferred through the given memory area, so that the access and storage of the data are accelerated, and the cache hit rate is obviously improved. Further, since the probability of cache hit is improved by a given storage area, the overall performance of the system on chip is also significantly improved by the scheme of the present disclosure. In addition, the division of the given memory area simplifies the management of the cache memory and expands the usage scenarios of the cache memory. By means of the given storage area, multiple clusters of the system on chip can implement multiple flexible communication mechanisms, thereby also improving the operation performance of the clusters.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device, according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an internal architecture of a multi-core computing device according to an embodiment of the disclosure;
FIG. 5 is an internal block diagram illustrating a processor core according to an embodiment of the disclosure;
FIG. 6 is a flow diagram illustrating a method for caching memory according to an embodiment of the present disclosure;
FIG. 7 is a simplified block diagram illustrating a cache memory according to an embodiment of the present disclosure;
FIG. 8 is a simplified block diagram illustrating a system on a chip according to an embodiment of the present disclosure;
FIG. 9 is a detailed block diagram illustrating a system on a chip according to an embodiment of the present disclosure;
FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure;
FIG. 11 is a diagram illustrating a hash operation in window mode according to an embodiment of the disclosure;
FIG. 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the present disclosure;
FIG. 13 is a flowchart illustrating a method for a system-on-chip according to an embodiment of the present disclosure; and
FIG. 14 is a block diagram illustrating operation of a system on a chip according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," and "third," etc. as may be used in the claims, the description, and the drawings of the present disclosure are used to distinguish between different objects, and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. It should be appreciated that the configuration and composition illustrated in FIG. 1 is merely an example, and is not intended to limit the aspects of the present disclosure in any way.
As shown in fig. 1, the board card 10 includes a chip 101, which may be a System on Chip (SoC), i.e., a system on chip as described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices. A combined processing device may be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining, with deep learning technology being widely applied in the field of cloud intelligence. A significant characteristic of cloud-based intelligent applications is the large volume of input data, which places high requirements on the storage and computing capabilities of the platform; the board card 10 of this embodiment is suitable for cloud-based intelligent applications, providing large off-chip storage, large on-chip storage, and powerful computing capability.
As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like, according to different application scenarios. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 may also include a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 through a bus and transfers data with them. The control device 106 in the board card 10 may be configured to regulate the state of the chip 101. For this purpose, in an application scenario, the control device 106 may include a micro controller unit (MCU).
Fig. 2 is a structural diagram showing the combined processing device in the chip 101 of the above-described embodiment. As shown in fig. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.
The computing device 201 may be configured to perform user-specified operations, primarily implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations it may be used to perform calculations in terms of deep learning or machine learning, and may also interact with the processing means 203 through the interface means 202 to collectively perform user specified operations.
The interface device 202 may be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general purpose processing device, performs basic control including, but not limited to, data handling and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other general and/or special purpose processors, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure alone may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together as integrated, they are considered to form a heterogeneous multi-core structure.
The DRAM 204 is used for storing data to be processed and is a DDR memory, typically 16 GB or larger in size, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of the computing device 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, speech, natural language, and data mining data, and includes three modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core calculations of the deep learning algorithm, i.e., matrix multiplication and convolution. The storage module 33 is used for storing or transporting related data, and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (Direct Memory Access, DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used for storing the convolution kernels, i.e., the weights, of the deep learning network; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 is designed in a hierarchical structure, the multi-core computing device 41 being a system on a chip comprising at least one cluster (cluster) according to the present disclosure, each cluster in turn comprising a plurality of processor cores. In other words, the multi-core computing device 41 is constructed in a system-on-chip-cluster-processor core hierarchy. In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be a plurality (e.g., 2 as illustrated) of external memory controllers 401, which are configured to access an external memory device, i.e., an off-chip memory (e.g., DRAM204 in fig. 2) in the context of the present disclosure, in response to an access request issued by the processor core, to read data from or write data to the off-chip memory. The peripheral communication module 402 is used for receiving the control signal from the processing device 203 through the interface device 202 and starting the computing device 201 to execute the task. The on-chip interconnect module 403 connects the external memory controller 401, the peripheral communication module 402 and the plurality of clusters 405 for transmitting data and control signals between the respective modules. The synchronization module 404 is a Global synchronization Barrier Controller (GBC) for coordinating the operation progress of the clusters and ensuring the synchronization of the information. The plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41. Although 4 clusters are exemplarily shown in fig. 4, as hardware evolves, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In one application scenario, the cluster 405 may be used to efficiently execute a deep learning algorithm.
Looking at the cluster hierarchy, as shown in fig. 4, each cluster 405 may include a plurality of processor cores (IPU core) 406 and a memory core (MEM core) 407, which may include, for example, a cache memory (e.g., LLC) as described in the context of the present disclosure.
Four processor cores 406 are exemplarily shown in the figure, but the present disclosure does not limit the number of processor cores 406; their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise may include three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described here again. It should be noted that the storage module 53 may include an input/output direct memory access (IODMA) module 533 and a direct memory access (MVDMA) module 534. The IODMA 533 controls the access of the NRAM 531/WRAM 532 to the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control access between the NRAM 531/WRAM 532 and the memory unit (SRAM) 408.
Returning to FIG. 4, the storage core 407 is primarily used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, as well as performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the storage core 407 may have scalar operation capability to perform scalar operations.
The memory core 407 may include a static random-access memory (SRAM) 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. In one implementation scenario, the SRAM 408 may assume the role of a high-performance data transfer station. Thus, data multiplexed between different processor cores 406 within the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed between the processor cores 406 via the SRAM 408. Further, the memory core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, which improves inter-core communication efficiency and significantly reduces off-chip input/output accesses.
Broadcast bus 409, CDMA410, and GDMA 411 are used to perform communication among processor cores 406, communication among cluster 405, and data transfer between cluster 405 and DRAM204, respectively. As will be described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM408 to all processor cores 406, which is a special case of multicast.
The CDMA 410 is used to control access to the SRAM 408 between different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control access of the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be achieved in two ways. The first way is to communicate directly between the DRAM 204 and the NRAM 431 or WRAM 432 through the IODMA 433; the second way is to transmit data between the DRAM 204 and the SRAM 408 through the GDMA 411, and to transmit data between the SRAM 408 and the NRAM 431 or WRAM 432 through the MVDMA 534. Although the second way may require more components and longer data flows, in some embodiments the bandwidth of the second way is substantially greater than that of the first way, so it may be more efficient to perform communication between the DRAM 204 and the NRAM 431 or WRAM 432 in the second way. It is to be understood that the data transmission approaches described here are merely exemplary, and a variety of data transmission approaches may be flexibly selected and adapted by those skilled in the art in light of the teachings of the present disclosure, depending on the particular arrangement of the hardware.
In other embodiments, the functionality of GDMA 411 and the functionality of IODMA 533 may be integrated in the same component. Although the present disclosure considers GDMA 411 and IODMA 533 as different components for convenience of description, it will be within the scope of protection of the present disclosure for a person skilled in the art as long as the achieved functions and technical effects are similar to the present disclosure. Further, the functions of GDMA 411, IODMA 533, CDMA410, and MVDMA 534 may be implemented by the same component.
The hardware architecture and its internal structure of the present disclosure are described in detail above in conjunction with fig. 1-5. It is to be understood that the above description is intended to be illustrative and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also change the board card and the internal structure of the present disclosure, and such changes still fall within the protection scope of the present disclosure. For example, in the solution to be described below, the corresponding hardware architecture may omit the CDMA 410, i.e., the mechanism that controls access to the SRAM 408 between different clusters 405 within the same computing device 201. Instead, the underlying aspects of the present disclosure relate to improving and optimizing the cache memory disposed, for example, between the SRAM 408 and the DRAM 204, to enable efficient on-demand latching of data and communication between different clusters through the cache memory.
In order to efficiently use a cache (e.g., LLC) and improve hit rate of data accesses, the following scheme of the present disclosure proposes to configure a specific storage space in the cache as a latch region for a latch operation of data, particularly for data to be frequently used. For example, the aforementioned frequently used data may be data to be reused between at least one task having a data dependency relationship. It will be appreciated that when data is only used once, then the data may not be latched in the cache.
Further, on the basis of the aforementioned configuration of the latch area for data latching, the following aspect of the present disclosure also proposes to configure the cache memory to support a plurality of latch modes, so as to cause the cache memory to operate in a latch mode corresponding to the aforementioned latch-related request, when the latch-related request is received. The various latching modes of the present disclosure may have a particular priority order to satisfy different latch-related operations, depending on different application scenarios and requirements. In addition, in order to make the cache memory support multiple latch modes, the scheme of the present disclosure also proposes multiple different configuration methods, so that the cache memory can be used more flexibly and utilized to realize communication between clusters.
FIG. 6 is a flow diagram illustrating a method 600 for caching, according to an embodiment of the present disclosure. As shown in fig. 6, the method 600 includes configuring a particular storage space in the cache memory as a latch region that supports multiple latch modes at step S602. In one embodiment, the aforementioned plurality of latch modes may include, but are not limited to, an instruction mode to perform latch related operations based on hardware instructions, a window mode to perform latch related operations based on window attributes, a stream mode to perform latch related operations based on dataflow, and/or a page mode to perform latch related operations based on cache pages. In one embodiment, the aforementioned data streams may be instruction streams or data streams having different types. Taking the data stream as an example, in an application scenario of the neural network, the data stream may be a neural data stream of the neural network model, a weight data stream, an output result data stream, and the like. Additionally, in the context of the present disclosure, the data for which the latch related operation is directed is data to be used multiple times by a processor of the system-on-chip, which has a relatively high priority relative to data for which the latch operation is not performed. By latching (or residing) such multiple-use data in the latch region of the present disclosure, the cache hit rate can be significantly increased, thereby improving the overall performance of the system. In addition, by storing the data used repeatedly in the latch area of the LLC, the read-write operation of data between the on-chip system and the off-chip memory (e.g. DDR or DRAM) can be reduced, thereby also improving the access efficiency.
In one application scenario, the various latching modes described above may be set to have different priorities according to user preferences or system preferences. For example, in one embodiment, the high-low order of priority may be instruction mode- > window mode- > stream mode- > page mode; in another embodiment, the high-low order of priority may be instruction mode- > page mode- > stream mode- > window mode. With such multi-mode and priority settings, the latch regions in the cache can be used in more ways, increasing the flexibility of latch region use to cope with different application scenarios and system requirements. Further, the above-described latch modes may be sequentially traversed in order of priority, and when the latch mode of high priority is disabled, the latch mode of low priority may be employed.
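As an illustration of this priority traversal, the following C sketch (all names are hypothetical; the actual register or driver interface is not specified by this disclosure) walks the modes in one of the priority orders above and falls back to conventional caching when no enabled mode applies:

```c
/* Hypothetical sketch: pick the latch mode that handles a request by walking
 * the modes in priority order and skipping modes that are disabled or that
 * do not apply to the request (e.g. the lock window is missed). */
typedef enum { MODE_INSTRUCTION, MODE_WINDOW, MODE_STREAM, MODE_PAGE, MODE_NONE } latch_mode_t;

/* One of the priority orders given above: instruction > window > stream > page. */
static const latch_mode_t priority[] = { MODE_INSTRUCTION, MODE_WINDOW, MODE_STREAM, MODE_PAGE };

static latch_mode_t resolve_latch_mode(unsigned enabled_mask, unsigned applies_mask)
{
    for (unsigned i = 0; i < sizeof(priority) / sizeof(priority[0]); i++) {
        unsigned bit = 1u << priority[i];
        if ((enabled_mask & bit) && (applies_mask & bit))
            return priority[i];     /* the highest-priority usable mode wins */
    }
    return MODE_NONE;               /* fall back to conventional caching */
}
```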
In one embodiment, a particular memory space may be configured to support a corresponding latch region of a latch mode according to a received configuration command of a plurality of configuration commands. In one scenario, the configuration instruction may include one or more configuration items to implement the configuration of the aforementioned latch region. For example, the plurality of configuration items may include configuration items that enable a latch region, disable a latch region, and/or a latch region size. Further, a corresponding latch strategy (e.g., the size of the latch data or the specific data to be latched) may be configured in the aforementioned instruction mode, window mode, stream mode or page mode for latching different types or specific instructions, data or data streams, etc. Configuring the corresponding latching strategies in the different modes may be found in particular in the description below. With such enabling, disabling, and various specific configurations, the scheme of the present disclosure may enable flexible use of the cache memory such that it may operate in one of the various latching modes of the present disclosure, or in a conventional mode, as desired.
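As an illustration of such configuration items, the sketch below models a configuration command for the latch region; the field names and the way granularity are assumptions based on the way-partitioned organization shown later in FIG. 7, not a definitive register layout.

```c
/* Hypothetical configuration command for the latch region (field names assumed). */
struct latch_region_config {
    unsigned enable       : 1;  /* 1 = enable the latch region                    */
    unsigned disable      : 1;  /* 1 = disable the latch region and clear state   */
    unsigned size_in_ways : 4;  /* latch region size, e.g. way0..way5 -> 6 ways   */
};

/* Example: dedicate 6 of the 8 ways to the latch region, leaving 2 ways
 * for conventional caching (matching the FIG. 7 example described later). */
static const struct latch_region_config example_cfg = {
    .enable = 1, .disable = 0, .size_in_ways = 6
};
```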
Returning to the flowchart in fig. 6, after the configuration operation at step S602 described above is completed, at step S604, a latch-related request for a latch-related operation on data in the latch area is received. According to embodiments of the present disclosure, the latch related request may be triggered by an operation intended to reside specific data in the latch region. Alternatively, the latch related request may also be triggered by an operation aimed at removing or releasing specific data from the latch area. As described in detail above, the latch related requests of the present disclosure may also have different expressions or contents when operating in different latch modes. For example, for instruction mode, window mode, or stream mode, the latch related request may include a configuration item or the like for indicating a behavior attribute of the cache memory.
In one embodiment, the configuration item for indicating the behavior attribute of the cache memory at least comprises one of the following configuration attributes:
Transient (Transient) attribute: no caching is performed in the LLC, i.e., data is read from and written to the off-chip memory (such as DDR) directly; this attribute is used for data that is accessed only once, so as to avoid occupying LLC resources;
Lock (Lock) attribute: the specific data is made resident in the latch area, and data is read from and written to the cache line (cacheline) that it hits. If the hit cache line belongs to the latch area, the attribute of the cache line is configured as the persistent attribute; if the hit cache line does not belong to the latch area, the attribute of the cache line is not changed, i.e., it keeps the Normal attribute described below. It should be clear that the cache lines of the above-described latch area have two possible attributes, namely a persistent (persistent) attribute and a normal (normal) attribute; a cache line with the persistent attribute in the latch area can only be accessed and replaced by latch related requests accompanied by the Lock attribute;
Unlock (Unlock) attribute: after data is read from or written to the hit cache line, the storage space occupied by that data in the latch area of the LLC is released, and the corresponding cache line attribute in the latch area is set to the Normal attribute described below;
Normal (Normal) attribute: the request is cached normally in the LLC, and data can be read from and written to the off-chip memory directly;
Invalid (Invalid) attribute: the data is invalidated directly after being read, so as to avoid it being written back to the off-chip memory upon replacement;
Clean (Clean) attribute: when a write operation is executed, the data may be written into the hit cache line, the stored content of the cache memory (cache) is written back to the off-chip memory, and the attribute of the cache line is kept unchanged; in a read operation, data is read from the hit cache line, and when the hit cache line is dirty (dirty), the hit cache line is written back to the off-chip memory;
Default (Default) attribute: may be used to indicate that the configuration for the latch mode is to be ignored.
By appending the above-described exemplary configurable attributes to the latch related request, aspects of the present disclosure may perform corresponding latch related operations in the instruction mode according to these appended attributes.
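Purely as an illustrative summary, the following C sketch (with assumed names and encoding) shows how these attributes might be represented when attached to a latch related request:

```c
/* Hypothetical encoding of the behavior attributes carried by a latch related request. */
typedef enum {
    ATTR_TRANSIENT, /* bypass the LLC and access off-chip memory directly        */
    ATTR_LOCK,      /* make the data resident in the latch area                  */
    ATTR_UNLOCK,    /* release the data's space and restore the Normal attribute */
    ATTR_NORMAL,    /* conventional cache behavior                               */
    ATTR_INVALID,   /* invalidate after read so the line is never written back   */
    ATTR_CLEAN,     /* write back the stored content; keep the line attribute    */
    ATTR_DEFAULT    /* ignore this mode's configuration (defer to other modes)   */
} latch_attr_t;

struct latch_request {
    unsigned long addr;  /* access address (e.g. a virtual address) */
    unsigned long size;  /* access size in bytes                    */
    latch_attr_t  attr;  /* one of the attributes listed above      */
};
```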
For another example, for page mode, the latch related request may indicate that data associated with a particular page is to be latched in the latch region for subsequent multiple uses, or may indicate that data associated with a particular page, after multiple uses, is unlatched from the latch region to free up more storage space for subsequent latching of data. It can be understood that the storage space of the latch region can be flexibly used by the release operation, thereby improving the use efficiency of the latch region of the present disclosure.
Returning to the flow of fig. 6, in response to the latch related request of step S604 described above, at step S606, a latch related operation may be performed on the data in the latch area in a corresponding latch mode according to the latch related request. According to an embodiment of the present disclosure, the aforementioned latch-related operation may include a read operation and a write operation with respect to the latch area. In one embodiment, for a write operation to a latch region, the method 600 may further include latching data or selected portions of data in a designated area of the latch region for subsequent multiple reads in accordance with a latch-related request. In another embodiment, for a read operation at a latch region, the method 600 may further include releasing data or a selected portion of the data from a designated area of the latch region according to a latch related request after the read operation is performed.
With respect to the selected partial data, in one embodiment, a predetermined proportion of data may be selected from the data in a random manner to form the partial data to be latched in the latch area. In another embodiment, a predetermined proportion of data may be selected from the data by a hashing algorithm to be latched in the latch area as the aforementioned partial data. In a further embodiment, when the access address of the data to be subjected to the latch related operation is within the address range of the lock window, the foregoing hash algorithm may be used to select a portion of the data that can be latched in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with fig. 11.
With the method described above in conjunction with fig. 6, the scheme of the present disclosure enables the cache memory to support multiple latch modes, thereby expanding the application scenarios of the cache memory and significantly improving the cache hit rate. Furthermore, the introduction of multiple latch modes makes the use of the latch area more flexible and adaptive, so as to meet different application scenarios and user requirements. In addition, effectively latching data in the latch area promotes sharing of the data between a producer kernel and one or more consumer kernels, improving the accessibility and utilization of the data. A producer kernel and a consumer kernel are understood here to be two tasks with a dependency, where the output of the producer kernel is the input passed to the consumer kernel, which the consumer kernel uses to complete its corresponding task. Since the output of the producer kernel serves as the input of a subsequent operation, it constitutes data that needs to be used multiple times and can therefore be temporarily stored in the latch area of the cache memory. The consumer kernel can then obtain its input directly from the cache memory without accessing the off-chip memory, thereby reducing access interactions between the artificial intelligence processor and the off-chip memory, reducing IO access overhead, and further improving the processing efficiency and performance of the artificial intelligence processor.
FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the present disclosure. It will be appreciated that the cache memory 700 shown in fig. 7 may be the cache memory described in conjunction with fig. 6, and thus the cache memory described with respect to fig. 6 is equally applicable to the description below with respect to fig. 7.
As shown in fig. 7, the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. Further, the cache memory 700 includes a storage space for performing cache operations, for example the 8 ways (way0 to way7) shown in the figure, which divide the storage space evenly into 8 parts, where each way includes a number of cache lines (cachelines).
In one embodiment, the configuration module described above may be used to configure a specific storage space in the cache memory as a latch region supporting multiple latch modes, where the size of the specific storage space is less than the total storage size of the cache memory. For example, way0 to way5 in FIG. 7 may be configured as the specific storage space supporting latching. Correspondingly, way6 and way7 in FIG. 7 may keep the normal attributes of a cache, i.e., be used as a general cache. As previously described, the latch mode may be an instruction mode, a window mode, a stream mode, and/or a page mode. Further, the latch execution module may be configured to receive a latch related request for a latch related operation on data in the latch region. The latch execution module may then perform the latch related operation on the data in the latch region in the corresponding latch mode according to the latch related request. As described above, the latch related operation here may include a write operation to the latch region (i.e., writing data into the latch region) or releasing data in the latch region from the latch region. For example, when a consumer kernel has finished using data in the latch region and the data is no longer used by other consumer kernels, the space in the latch region where the data is stored may be freed up for latching other data.
FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the present disclosure. As shown in fig. 8, a system-on-chip 800 of the present disclosure may include a cache memory 700 as shown in fig. 7 and a processor (or processor core) 802. With respect to the cache memory 700, it has been described above in connection with fig. 6 and 7 and will not be described again here. The processor 802 may be any of various types of processors and may include one or more processor cores that generate latch related requests in accordance with aspects of the present disclosure. In operation, the latch execution module of the cache memory is configured to perform a latch related operation on data in the latch area in the corresponding latch mode according to the generated latch related request. For example, when the latch mode is the instruction mode, the processor may be configured to generate the latch related request based on a received hardware instruction. As another example, when the latch mode is the page mode, the processor may be configured to generate the latch related request according to a cache page configuration. As yet another example, when the latch mode is the window mode or the stream mode, the processor may be used to configure a lock window and generate the latch related request based on the lock window.
According to various embodiments, the processor 802 may also be an intelligent processor or intelligent Processing Unit ("IPU") including multiple computing cores, which may be configured to perform computations in various artificial Intelligence domains (e.g., neural network aspects).
FIG. 9 is a detailed block diagram illustrating a system on a chip 900 according to an embodiment of the disclosure. It is to be appreciated that the system-on-chip 900 shown herein may be a specific implementation of the system-on-chip shown in FIG. 8, and thus what is described with respect to FIG. 8 applies equally to FIG. 9. Further, for purposes of example only, the operation of the system-on-chip 900 will be described in a windowed mode (or streaming mode) of a plurality of latching modes.
As shown in FIG. 9, the system-on-chip 900 may include a task Scheduler ("Job Scheduler") 902, which includes a scheduling unit 903 and a configurator 904. In one embodiment, the configurator 904 may be configured to generate configuration instructions based on allocated configuration tasks (e.g., available from a task queue) for transmission to a configuration module (e.g., CLR) of the cache memory (i.e., "LLC" 906). In one embodiment, the scheduling unit 903 may be used to schedule a plurality of tasks (i.e., "kernel" to be executed on an artificial intelligence processor) in a task scheduler for transmission to an Intelligent Processor (IPU) 905 in a system-on-chip of the present disclosure. In the solution of the present disclosure, the intelligent processor 905 herein may include a plurality of processor cores, and the plurality of processor cores may constitute one cluster (cluster) as shown in fig. 4. In one implementation scenario, in a multi-processor core architecture as before, the scheduling unit may assign tasks to appropriate processor cores according to the idleness (e.g., utilization) of the multiple processor cores.
Further, the system on chip 900 includes a system memory management unit ("System Memory Management Unit", abbreviated as "SMMU"), which is configured to convert a virtual address of the accessed data into a physical address, so as to access the relevant storage location according to the physical address. In one embodiment, the system memory management unit includes a translation lookaside buffer (TLB, also called a fast table). A page table is maintained in the TLB; the page table includes at least one page table entry, and each page table entry includes a page (page) and a page frame (Frame) corresponding to that page. In operation, the system memory management unit may determine the page corresponding to a received virtual address, and then determine the physical address ("Physical Address", abbreviated as "PA") corresponding to the virtual address according to the mapping relationship between the page and the page frame, so as to access the associated storage location of the cache memory according to the physical address.
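A simplified sketch of the translation step described above, assuming 4 KiB pages and an illustrative TLB entry layout (the actual page size and entry format are not specified here):

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12u                      /* assumed 4 KiB pages */

struct tlb_entry { uint64_t page; uint64_t frame; int valid; };

/* Translate a virtual address using a small TLB-like page table. */
static int smmu_translate(const struct tlb_entry *tlb, size_t n,
                          uint64_t va, uint64_t *pa)
{
    uint64_t page = va >> PAGE_SHIFT;       /* page number of the virtual address */
    for (size_t i = 0; i < n; i++) {
        if (tlb[i].valid && tlb[i].page == page) {
            *pa = (tlb[i].frame << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return 0;                       /* hit: physical address produced */
        }
    }
    return -1;                              /* miss: a full page table walk would follow */
}
```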
In one embodiment, access to the cache memory may be achieved through the window mode or the stream mode described above. In this case, the intelligent processor may retrieve a parameter table from memory, configure a lock window ("Lock window") associated with the data to be subjected to a latch related operation according to the parameter table, and generate a latch related request (i.e., an IO access request with, for example, a lock/unlock attribute attached) according to the configured lock window. The SMMU may then perform a latch related operation on the LLC in accordance with the IO access request. In particular, the SMMU may send the aforementioned IO access request to a cache policy module 907 of the LLC 906 (which performs operations similar to the latch execution module 702 in fig. 7) for execution. In one embodiment, the parameter table may include parameter items for configuring the lock window or the stream latch attributes in the stream mode. For example, the parameter items may include, but are not limited to, a lock/unlock window ("Lock/unlock window"), per-stream locking/unlocking ("per stream Lock/unlock"), a latch ratio ("Lock Ratio"), a lock window identifier ("Lock window flag"), and the like. In one implementation scenario, the parameters in the parameter table may be user-defined. Thus, the relevant parameters in the parameter table can be obtained during the running phase of the program, and the parameter table is stored in memory (e.g., DDR) for the intelligent processor (such as the IPU 905 in the figure) to use during the execution phase.
In one embodiment, the lock window described above is used to represent the memory space that a software user wishes to latch, and the size of the lock window may be larger than the size of the latch area in the cache memory. The lock window includes one or more of: a base address and a size of the window, where the base address of the window may be a virtual address ("Virtual Address", abbreviated as "VA") configured by upper-layer software and corresponds to the starting address of the data to be subjected to the latch related operation, and the size of the window may correspond to the size of the data to be latched.
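As an illustration, a lock window of this kind might be modeled as below; the field names are assumptions, and the hit test anticipates the window-mode comparison described next:

```c
#include <stdint.h>

/* Hypothetical lock window descriptor configured from the parameter table. */
struct lock_window {
    uint64_t base_va;     /* starting virtual address of the data to latch */
    uint64_t size;        /* size of the data to be latched                */
    uint32_t lock_ratio;  /* latch ratio ("Lock Ratio"), e.g. a percentage */
    int      lock;        /* 1 = lock window, 0 = unlock window            */
};

/* Window hit test: is the access address inside the window's address range? */
static int window_hit(const struct lock_window *w, uint64_t va)
{
    return va >= w->base_va && va < w->base_va + w->size;
}
```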
Specifically, in the window mode, the intelligent processor may determine, according to a task issued by the task scheduler, the memory access address of the data in the task (the access address may be a virtual address), and compare the access address of the data in the task with the address range defined by the lock window. If the access address of the data in the task is within the address range of the lock window, the lock window is hit and the lock window may be enabled ("Enabled"). Otherwise, if the access address of the data in the task is outside the address range of the lock window, the lock window is missed. In that case the lock window may be ignored, i.e., the data in the task will not be cached in the cache memory. Further, when the access address of the data hits the lock window, a predetermined proportion of the data may be selected, using a hash algorithm, as the aforementioned partial data to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with fig. 11. The intelligent processor may then send a latch related request accompanied by the Lock attribute to the cache LLC via the SMMU. The latch related request with the Lock attribute may be used to indicate that specific data is to be made resident in the latch area, and the specific data may be the portion of data selected according to the hash algorithm.
The latching and releasing processes of the LLC are described below in window mode in conjunction with FIG. 9.
LLC residency (or locking) process:
Step 1: the task scheduler configures the LLC (e.g., via the cache policy module) by means of the configurator to enable the latch region ("Lock enable"), disable the latch region ("Lock disable"), and set the size of the latch region, i.e., the number of ways ("Way") shown in the figure (way0 to way7, for example).
Step 2: the task scheduler issues a task kernel to the IPU.
Step 3: the IPU obtains the lock window flag from the parameter table, and reads and configures the lock window. In one implementation scenario, the parameter table here may be configured by software and stored at a memory address of the off-chip dynamic random access memory ("DRAM"). The task scheduler may then communicate the address to the IPU, and the IPU may read the parameter table based on the address to complete the configuration of the lock window.
Step 4: the IPU generates a latch related request through the memory management unit SMMU and, when sending the request to the cache policy module of the LLC, may attach the lock attribute to the request according to the lock window information.
Step 5: after receiving the latch related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of that cache line (i.e., the latch area), for example setting it to "persistent" as described above.
LLC release (or unlocking) process:
Step 6: the task scheduler issues a kernel to the IPU.
Step 7: the IPU obtains the unlock window identifier from the parameter table, and reads and configures the unlock window.
Step 8: when the IPU transmits a request, it attaches the unlock (unlock) attribute according to the unlock window information.
Step 9: after receiving the request with the unlock attribute, the cache policy module of the LLC switches the hit cache line with the lock attribute to the Normal attribute, such as the Normal attribute described above in connection with the instruction mode.
Step 10: the task scheduler disables the latch area, i.e., LLC lock disable, by means of the configurator and through the CLR module. In one implementation scenario, the CLR module may clear the previous lock attribute configuration as directed by the configurator.
The latching scheme of the system on chip of the present disclosure in the window mode is described in detail above in conjunction with fig. 9. By such a latch operation, the probability of cache hit can be significantly increased, the use efficiency of the cache memory can be improved, and the application scenario can be expanded.
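Purely as an illustration of the residency and release sequence above (steps 1-10), the following runnable sketch strings the steps together; every function is a hypothetical stub standing in for the task scheduler, IPU, SMMU, and LLC interactions, not an actual driver API.

```c
#include <stdio.h>

/* Illustrative stubs for the scheduler/IPU/SMMU/LLC interactions of steps 1-10. */
enum kern { KERNEL_PRODUCER, KERNEL_CONSUMER };
enum attr { ATTR_LOCK_REQ, ATTR_UNLOCK_REQ };

static void configurator_enable_latch_region(int ways) { printf("step 1: lock enable, %d ways\n", ways); }
static void scheduler_issue_kernel(enum kern k)         { printf("step 2/6: issue kernel %d\n", (int)k); }
static void ipu_configure_lock_window(void)             { printf("step 3: configure lock window from parameter table\n"); }
static void smmu_send_request(enum attr a)              { printf("step 4/8: request with %s attribute\n", a == ATTR_LOCK_REQ ? "lock" : "unlock"); }
static void llc_mark_persistent(void)                   { printf("step 5: LLC marks hit lines persistent\n"); }
static void ipu_configure_unlock_window(void)           { printf("step 7: configure unlock window from parameter table\n"); }
static void llc_restore_normal(void)                    { printf("step 9: LLC restores Normal attribute\n"); }
static void configurator_disable_latch_region(void)     { printf("step 10: lock disable via CLR\n"); }

int main(void)
{
    configurator_enable_latch_region(6);       /* steps 1-2: residency setup  */
    scheduler_issue_kernel(KERNEL_PRODUCER);
    ipu_configure_lock_window();               /* steps 3-5: latch the data   */
    smmu_send_request(ATTR_LOCK_REQ);
    llc_mark_persistent();

    scheduler_issue_kernel(KERNEL_CONSUMER);   /* steps 6-9: use and release  */
    ipu_configure_unlock_window();
    smmu_send_request(ATTR_UNLOCK_REQ);
    llc_restore_normal();

    configurator_disable_latch_region();       /* step 10 */
    return 0;
}
```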
The embodiments of the present disclosure further support latch related operations in the stream mode. When the enable bit corresponding to a data stream in a task of the present disclosure is low, the task is regarded as the default case, that is, no latch related operation is performed in the stream mode. Conversely, when the enable bit is high, the corresponding latch related operation may be performed on the data stream in the stream mode. Specifically, the window mode and the stream mode of the present disclosure operate similarly: a predetermined proportion of data from the data stream may be selected as the aforementioned partial data to be stored in the latch area, using a hash algorithm and the latch ratio of the data stream. The specific use of the hash algorithm will be described in detail later in conjunction with fig. 11.
As previously mentioned, in one embodiment, the disclosed embodiments also support latch-related operations in the page mode, which is described below in conjunction with FIG. 10.
FIG. 10 is a schematic block diagram illustrating the page mode according to an embodiment of the present disclosure. As shown in fig. 10, according to the solution of the present disclosure, a cache page may be directly configured to have the locking property of the present disclosure, so that a cache page forming a mapping relationship with a storage (e.g., "memory") may be used for sharing access data among a plurality of kernels (e.g., kernel 0 to kernel 2 shown in the figure). In one embodiment, a programmer may mark a cache page with the lock attribute using an instruction (e.g., malloc). When a kernel accesses a cache page marked as locked, the SMMU may lock the data corresponding to that cache page into the latch area of the present disclosure. Then, when a subsequent kernel needs to access the aforementioned cache page again, it can read the previously locked data from the corresponding cache line of the latch area, thereby achieving a cache hit. Thus, through the page mode, the scheme of the present disclosure improves the sharing and accessibility of data among multiple kernels.
Specifically, in the page mode, the software driver can directly configure information in the page table of the system memory management unit ("SMMU") by means of an instruction, and determine from that information whether to perform a page-based latch operation or a normal (normal) operation. When the information in the page table indicates that the SMMU is bypassed, no latching is needed in the cache, and the attribute of the cache lines in the cache may be the Normal attribute. When the information indicates that the SMMU uses a linear mapping, the page-based latch operation may be set according to the SMMU linear mapping window configuration; for example, data corresponding to a cache page within the linear mapping window is locked into the latch region of the present disclosure. The SMMU may generate a corresponding latch related request based on the information in the page table and send the latch related request to the LLC, and the cache policy module of the LLC may configure the cache lines of the LLC according to the latch related request so as to perform the corresponding cache related operation.
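The page-mode decision described above can be sketched as follows; the flag and field names are assumptions used only to illustrate the choice between SMMU bypass and the linear mapping window:

```c
/* Hypothetical page-table information consulted by the SMMU in page mode. */
enum smmu_mapping { SMMU_BYPASS, SMMU_LINEAR_MAP };

struct page_info {
    enum smmu_mapping mapping;   /* bypass -> normal caching, linear map -> latching */
    unsigned long     win_base;  /* linear mapping window base                       */
    unsigned long     win_size;  /* linear mapping window size                       */
};

/* Decide whether an access to 'addr' should carry the Lock attribute. */
static int page_mode_should_lock(const struct page_info *pi, unsigned long addr)
{
    if (pi->mapping == SMMU_BYPASS)
        return 0;   /* cache lines keep the Normal attribute, no latching */
    /* Linear mapping: latch only pages that fall inside the mapping window. */
    return addr >= pi->win_base && addr < pi->win_base + pi->win_size;
}
```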
In one embodiment, the embodiment of the present disclosure further supports an instruction mode, in which the system on chip may configure a latch area in the LLC through a memory access instruction (IO instruction) in an instruction set.
For example, an IO instruction is accompanied by at least one configuration field for latch related attributes, so that the LLC can be flexibly configured by means of these configuration fields. Here, the various configuration fields may represent the operational behaviors that the LLC may perform in response to data accesses made to the off-chip memory (e.g., DDR space). In one implementation scenario, the instruction includes the configuration attributes: the transient (Transient) attribute, lock (Lock) attribute, unlock (Unlock) attribute, normal (Normal) attribute, invalid (Invalid) attribute, clean (Clean) attribute, default (Default) attribute, and the like. Since the instruction mode has the highest priority, when the IO access instruction indicates the Default attribute, it means that the latch related operation can instead be performed by another mode (such as the window mode, stream mode, or page mode).
By appending the above exemplary configurable attributes to the latch-related request, aspects of the present disclosure may perform corresponding latch-related operations in the instruction mode according to these appended attributes.
When the task scheduler issues a task to the intelligent processor (IPU), the IPU may determine the latch-related request from the IO instruction in the task. Specifically, when the configuration field of the Lock attribute in the IO instruction is enabled, the Lock attribute may be appended to the latch-related request, so that the LLC stores the specified data in the locked area according to that request. When the configuration field of the Unlock attribute is enabled, the Unlock attribute may be appended to the latch-related request, so that the LLC releases the locked area accordingly. Depending on the application scenario, other attributes may be attached to the latch-related request in a similar way.
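The attribute selection in the instruction mode can be pictured with the minimal sketch below; the field names (lock_en, unlock_en, invalid_en) and the request layout are illustrative assumptions rather than the actual instruction encoding.

```python
# Illustration only: maps assumed IO-instruction configuration fields to the
# attribute carried by a latch-related request. The real encoding is not given here.

def build_latch_request(io_instr: dict, address: int) -> dict:
    if io_instr.get("lock_en"):        # Lock field enabled
        attr = "Lock"                  # LLC stores the data in the locked area
    elif io_instr.get("unlock_en"):    # Unlock field enabled
        attr = "Unlock"                # LLC releases the locked area
    elif io_instr.get("invalid_en"):
        attr = "Invalid"
    else:
        attr = "Default"               # fall back to window/stream/page mode
    return {"address": address, "attribute": attr}

# Example: an instruction with the Lock field enabled
print(build_latch_request({"lock_en": True}, 0x4000))  # {'address': 16384, 'attribute': 'Lock'}
```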
Further, in some operation scenarios, the instruction also includes a specific configuration field for indicating the latch ratio. When this field (e.g., a specific bit inst_ratio_en) is low, the latch operation follows the instruction configuration, i.e., the latch-related request is determined by the specific IO instruction in the task. When the bit is high, a predetermined proportion of data from the data stream may be selected as the partial data to be stored in the latch area, by comparing a hash value against the latch ratio indicated by the instruction. The specific use of the hash algorithm is described in detail below in conjunction with fig. 11.
Fig. 11 is a flowchart illustrating a hash operation in the window mode or the stream mode according to an embodiment of the present disclosure. The scheme of the present disclosure uses a hash operation to reside (i.e., lock) only a certain proportion of data, because one of the key issues with LLC residency is the tradeoff between bandwidth and capacity. The present disclosure therefore proposes residing only a configurable proportion (i.e., the Lock Ratio), so that different tasks can obtain different bandwidths and residency capacities. Assuming a preset Lock Ratio value of P (e.g., expressed as a percentage), the expected bandwidth is B = 6T × P + 2T × (1 - P), where 6T is the read rate when data resides in the LLC, 2T is the read rate when data is stored in memory (e.g., DRAM), and T = 1000 Gbit/s. As previously described, the Lock Ratio may be configured in the lock/unlock window or for a particular data stream. In addition, although the hash operation is described below for the window mode or the stream mode, a similar operation is also applicable to the hash operation in the instruction mode.
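For illustration, plugging numbers into this expression with the stated T = 1000 Gbit/s: a lock ratio of P = 10% gives B = 6T × 0.1 + 2T × 0.9 = 2.4T = 2400 Gbit/s, while P = 50% gives B = 3T + 1T = 4T = 4000 Gbit/s. Increasing the proportion of data resident in the LLC thus raises the expected read bandwidth at the cost of occupying more latch capacity.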
Specifically, in the window mode or the stream mode, the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address falls within that range. When the requested address is within the address range of the lock window, a hash operation may be performed on the hit address, as shown in fig. 11. Here, the access address of each piece of data may be a virtual address.
In particular, by means of a globally fixed hash rule, the virtual address (VA) of the memory access can be mapped onto a hash space (i.e., the "Hash Map" in the figure), and the hashing preferentially retains the lower-order information of the address. The hash value obtained at 1102 may then be compared with the latch ratio (Lock Ratio) at 1104 to randomly select a corresponding proportion of data. Specifically, when the hash value of the access address is less than the latch ratio, the access is considered a hit and that portion of data (i.e., the ratio-matched data) may be latched in the cache. Conversely, when the hash value of the access address is greater than or equal to the latch ratio, the access is considered a miss and that portion of data will not be latched in the cache.
For example, when the latch ratio (Lock Ratio) is set to 10%, the partial data whose hash values fall within the first 10% of the hash space, i.e., the data whose hashed access address is smaller than the latch ratio, may be selected for the latch-related operation. In other examples the latch ratio may take other values; it may be set by the user through software, and the foregoing selection is realized according to the configuration of the hash algorithm. For example, the latch ratio may also be 20% to 30%, in which case the partial data corresponding to the first 20% to 30% of the hash space is selected for the latch-related operation. Thereafter, processing may be performed at 1106 according to the specified request type, i.e., locking or unlocking the partial data.
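A minimal Python sketch of this hash-and-compare selection is given below. The concrete hash function and mixing constant are assumptions chosen only for illustration, since the disclosure merely requires a globally fixed hash rule that favors the lower-order address bits.

```python
# Minimal sketch of ratio-based selection, assuming a fixed hash over the
# low-order bits of the virtual address; the concrete hash rule is not specified.

def hash_map(va: int, buckets: int = 100) -> int:
    # Keep the low-order information of the address, then fold into [0, buckets).
    low_bits = va & 0xFFFF
    return (low_bits * 2654435761) % buckets   # assumed mixing constant

def should_latch(va: int, lock_ratio_percent: int) -> bool:
    """Latch the line only if the hashed address falls below the lock ratio."""
    return hash_map(va) < lock_ratio_percent

# With a 10% lock ratio, roughly one in ten cache-line addresses is selected.
addresses = range(0x80000000, 0x80000000 + 4096, 64)
selected = [hex(a) for a in addresses if should_latch(a, 10)]
print(len(list(addresses)), "addresses checked,", len(selected), "selected for latching")
```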
The cache latching scheme of the present disclosure has been described in detail above in conjunction with figs. 6-11. Building on the ideas of that latching scheme, another extended application of the present disclosure to the cache memory is described below in conjunction with figs. 12-14, namely how to implement communication between clusters in a system on chip through the cache.
FIG. 12 is a simplified block diagram illustrating a system on chip according to an embodiment of the present disclosure. In conjunction with the foregoing description, it can be understood that this system on chip may be the one included in the computing device 201 shown in fig. 2, for example the system on chip formed by the multi-core computing device 41. As exemplarily shown in fig. 12, the system on chip 1200 includes four clusters, cluster 0 to cluster 3. The cluster has been described in detail above and is not repeated here. Further shown is a cache memory 1201, which may be provided, for example, in the SRAM 408 shown previously in fig. 5, for performing inter-cluster data transfer operations. In one implementation scenario, the cache 1201 may also communicate bidirectionally with on-chip and off-chip DRAM (e.g., DDR), including the transfer of various types of data or instructions.
FIG. 13 is a flowchart illustrating a method 1300 for a system on chip according to an embodiment of the present disclosure. The system on chip here may be the one shown in fig. 12. Specifically, the system on chip includes at least a plurality of clusters for performing arithmetic operations and a cache memory interconnected with the plurality of clusters. In one implementation scenario, each cluster may include a plurality of processor cores to perform the arithmetic operations. In one implementation scenario, the latch area determined in the cache memory as described above may be used to carry out data communication between clusters, so that the system on chip no longer needs to be provided with communication modules such as the CDMA 410 and the GDMA 411.
In one embodiment, the latch area may be used to pass data between tasks having dependencies; for example, it may be used to pass data between a producer kernel and a consumer kernel. Specifically, the processor may lock in the LLC, through a configured lock window, the data that the producer kernel needs to hand over to the consumer kernel. In one scenario, when the processor has finished executing the producer kernel, the data that needs to be passed to the consumer kernel (which may be the input data or output data of the producer kernel) may be latched. To this end, the processor may perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU described above, thereby latching in the LLC, in the window mode, the data to be exchanged for later use by the consumer kernel. Correspondingly, the processor may release the latch area according to an unlock window configured for the consumer kernel: when the processor finishes executing the consumer kernel, which reads the data latched in the LLC, the processor may release the corresponding storage space occupied by that data in the latch area of the LLC.
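The ordering of this producer/consumer hand-off can be illustrated with the toy sketch below; the FakeLLC class and its lock_window/read_latched/unlock_window helpers are invented stand-ins for the latch-related operations and do not reflect an actual driver API.

```python
class FakeLLC:
    """Toy stand-in for the latch area of the LLC (illustration only)."""
    def __init__(self):
        self._latched = None
    def lock_window(self, data):
        self._latched = list(data)          # data resides in the latch area
    def read_latched(self):
        return self._latched                # later reads hit in the cache
    def unlock_window(self):
        self._latched = None                # free the latch area for reuse

def run_producer(llc, data):
    output = [x * 2 for x in data]          # stand-in for the producer kernel's work
    llc.lock_window(output)                 # latch the data to be exchanged
    return output

def run_consumer(llc):
    result = sum(llc.read_latched())        # consumer kernel reads the latched data
    llc.unlock_window()                     # release the corresponding storage space
    return result

llc = FakeLLC()
run_producer(llc, [1, 2, 3])
assert run_consumer(llc) == 12
```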
Based on the above, the latch area can be configured not only for transferring data between tasks having dependency relationships, but also in application scenarios of inter-cluster communication. For example, one cluster or processor core transmits data (which may be the data that the producer kernel needs to hand over to the consumer kernel) to processors in other clusters via the latch area for merging processing, and the processors in the other clusters read the data from the latch area for processing, thereby realizing inter-cluster transmission of the data. The way in which the latch area is used for inter-cluster communication is described in detail below.
As shown in fig. 13, the present disclosure also includes a method for inter-cluster communication using a latch region of a cache memory, the method comprising:
at step S1302, a designated storage space of the off-chip memory is mapped to a given storage area of the cache memory (which has the same physical attributes as the lock area described above in connection with the figures), so that the given storage area is used as a cluster storage area for inter-cluster data communication. In one implementation scenario, as illustrated in fig. 8, the cache memory may include the LLC and the off-chip memory may include a DDR. On this basis, the designated storage space may be the storage space designated at 1402 in fig. 14, and correspondingly the cluster storage area may be the given storage area in the cache at 1404 in fig. 14. In one implementation scenario, the designated DDR storage space may be specified by software configuration and mapped to the given space in the cache for communication between clusters (e.g., cluster 0 and cluster 1 shown in fig. 14). After the division and determination of the cluster storage area is completed, at step S1304, operations of the cluster may be performed using the determined cluster storage area.
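For illustration only, the effect of the mapping step can be reduced to an address-range check that decides whether an access targets the cluster storage area; the window base and size below are assumed values standing in for the software-configured designated storage space.

```python
# Assumed parameters for illustration; the actual window is configured by software.
DDR_WINDOW_BASE = 0x2000_0000      # start of the designated DDR storage space
DDR_WINDOW_SIZE = 0x0010_0000      # 1 MiB mapped onto the cluster storage area

def targets_cluster_storage(addr: int) -> bool:
    """True if the access falls inside the space mapped to the cluster storage area."""
    return DDR_WINDOW_BASE <= addr < DDR_WINDOW_BASE + DDR_WINDOW_SIZE

print(targets_cluster_storage(0x2000_4000))  # True  -> served from the cache region
print(targets_cluster_storage(0x3000_0000))  # False -> ordinary DDR access
```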
In one embodiment, using the cluster storage area to perform cluster operations may include using the cluster storage area for inter-cluster communication. In this case, using the cluster storage area for inter-cluster communication may specifically include: using the cluster storage area to realize point-to-point communication between clusters; additionally, the cluster storage area may be used to realize broadcast communication from one of the plurality of clusters to the remaining clusters. In the point-to-point communication scenario, the cluster storage area may be configured to receive a write operation of write data from a first cluster and, in response to a read operation of a second cluster, send the data previously written by the first cluster to the second cluster.
In an example implementation of the above write operation, the cluster storage area may also receive a lock indication, such as the write lock ("write lock") shown in fig. 14, i.e., the above-described latch-related request carrying the Lock attribute, indicating that the write data associated with the write operation is to reside in the cluster storage area. The write data then resides in the cluster storage area based on the lock indication, where the cluster storage area may be the latch area determined in the above embodiments. Through such residency, the hit rate in the cache memory of data that is to be read multiple times can be significantly improved.
In one implementation scenario, the producer kernel executing in one of the clusters may, via the above-mentioned write lock, lock in the LLC the data that needs to be handed over to the consumer kernel for later use; for example, the producer kernel transmits data via the LLC to processors in other clusters for merging processing. The processors in the other clusters can read the data from the cluster storage area for processing, thereby realizing inter-cluster transmission of the data.
In an example implementation of the above read operation, the cluster storage area may also receive a read invalidate indication that prevents the write data from being written back to the off-chip memory, such as the read invalidate ("read invalidate") issued by cluster 1 in fig. 14. The read invalidate indication may be a latch-related request carrying the Invalid attribute, generated in the manner described above; in different latch modes the latch-related requests may differ. After sending the write data to cluster 1, the cluster storage area may then invalidate the cache lines associated with the write data based on the read invalidate indication.
To synchronize the above data transfer (or communication) between clusters, the cluster writing data into the cluster storage area (e.g., cluster 0) may send a synchronization instruction, such as the hsem ("hardware semaphore") in fig. 14, to the other cluster (e.g., cluster 1) after the write operation is completed. Upon receiving the synchronization instruction, cluster 1 may issue the above-mentioned read invalidate request to the cluster storage area, so that after it reads the data written by cluster 0 into the cluster storage area, the corresponding cache lines are invalidated, thereby preventing that data from being written back.
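The complete point-to-point exchange, including the write lock, the hardware semaphore, and the read invalidate, can be summarized by the following toy sketch; the class, method, and variable names are assumptions used solely to make the ordering of the steps explicit.

```python
import threading

class ClusterStorageArea:
    """Toy model of the shared cluster storage area (illustration only)."""
    def __init__(self):
        self._lines = {}
    def write_lock(self, key, data):
        self._lines[key] = data             # write data resides in the cluster storage area
    def read_invalidate(self, key):
        return self._lines.pop(key)         # read, then invalidate so nothing is written back

storage = ClusterStorageArea()
hsem = threading.Event()                    # stand-in for the hardware semaphore

def cluster0():
    storage.write_lock("tile0", [1, 2, 3])  # write operation carrying the lock indication
    hsem.set()                              # synchronization instruction after the write completes

def cluster1(out):
    hsem.wait()                             # wait for the hardware semaphore from cluster 0
    out.append(storage.read_invalidate("tile0"))  # read invalidate: no write-back to DDR

result = []
t0 = threading.Thread(target=cluster0)
t1 = threading.Thread(target=cluster1, args=(result,))
t1.start(); t0.start()
t0.join(); t1.join()
assert result == [[1, 2, 3]]
```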
In the context of the present disclosure, the above actions of writing data to and reading data from the cluster storage area may also be collectively referred to as latch-related operations triggered by latch-related requests, which may be determined in the manner described above. Specifically, a latch-related request may indicate a latching operation: through the latching operation, data is latched in the cluster storage area for subsequent multiple uses. Further, a latch-related request may indicate a release operation: through the release operation, data is unlocked from the cluster storage area to free up storage space for subsequently latched data. It can be understood that, through the release operation, the storage space of the cluster storage area can be used flexibly, thereby improving the usage efficiency of the cluster storage area of the present disclosure.
In one embodiment, for a read operation on the cluster storage area, the data or a selected part of the data may be released from the designated region of the cluster storage area according to the latch-related request after the read operation is performed. Regarding the selected partial data, in one embodiment a predetermined proportion of the data may be selected in a random manner to form the partial data to be latched in the latch area; in another embodiment, a hash algorithm may be used to select a predetermined proportion of the data to be latched in the cluster storage area as the aforementioned partial data, as described in detail above with reference to fig. 11.
The aspects of the present disclosure are described in detail above with reference to the drawings. According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an internet of things terminal, a mobile phone, a drive recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, this disclosure presents some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the disclosed aspects are not limited by the order of acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated unit may be implemented in the form of a software program module. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The aforementioned Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory ("Read Only Memory", abbreviated as ROM), a Random Access Memory ("Random Access Memory", abbreviated as RAM), a removable hard disk, a magnetic disk or an optical disk, and various media capable of storing program codes.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory ("Resistive Random Access Memory", abbreviated as RRAM), a Dynamic Random Access Memory ("Dynamic Random Access Memory", abbreviated as DRAM), a Static Random Access Memory ("Static Random Access Memory", abbreviated as SRAM), an Enhanced Dynamic Random Access Memory ("Enhanced Dynamic Random Access Memory", abbreviated as EDRAM), a High Bandwidth Memory ("High Bandwidth Memory", abbreviated as HBM), a Hybrid Memory Cube ("Hybrid Memory Cube", abbreviated as HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1. A method for a system-on-chip including at least a plurality of clusters for performing arithmetic operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the arithmetic operations, the method comprising:
mapping a designated storage space of the off-chip memory to a given storage area of the cache memory to use the given storage area as a cluster storage area for inter-cluster data communication; and
performing operations of the cluster using the cluster storage area.
Clause a2. The method of clause A1, wherein performing the operation of the cluster using the cluster storage comprises using the cluster storage for inter-cluster communication.
Clause a3. The method according to clause A2, wherein using the cluster storage area for inter-cluster communication comprises:
utilizing the cluster storage area to realize point-to-point communication among clusters; or
Broadcast communication by one of the plurality of clusters to the remaining clusters is implemented using the cluster storage area.
Clause a4. The method of clause A3, wherein using the cluster storage area to enable peer-to-peer communication between clusters comprises:
receiving a write operation from the first cluster for the write data; and
sending the write data to a second cluster in response to a read operation of the second cluster.
Clause a5. The method of clause A4, wherein in the write operation, the method further comprises:
receiving a lock indication that write data associated with the write operation resides in the cluster storage area; and
the write data is resident in the cluster storage area based on the lock indication.
Clause a6. The method of clause A4 or A5, wherein in the read operation, the method further comprises:
receiving a read invalidate indication that causes the write data not to be written back to off-chip memory; and
after sending the write data to the second cluster, invalidating a cache line associated with the write data based on the read invalidate indication.
Clause a7. A system-on-chip, comprising:
a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing arithmetic operations; and
a cache memory interconnected with the plurality of clusters and configured to:
using a given storage area as a cluster storage area for inter-cluster data communication, wherein the given storage area forms a mapping relation with a designated storage space of an off-chip memory; and
performing operations of the cluster using the cluster storage area.
Clause a8. The system-on-chip according to clause A7, wherein the cluster storage area is configured for inter-cluster communication.
Clause a9. The system-on-chip according to clause A8, wherein the cluster storage area is configured for peer-to-peer communication between clusters or broadcast communication of one of the plurality of clusters to the remaining clusters.
Clause a10. The system-on-chip according to clause A9, wherein in the peer-to-peer communication, the cluster storage area is configured to:
receiving a write operation from the first cluster for the write data; and
sending the write data to a second cluster in response to a read operation of the second cluster.
Clause a11. The system-on-chip of clause a10, wherein the second cluster is configured to:
receiving a hardware semaphore from the first cluster; and
in response to receiving the hardware semaphore, performing the read operation on the cluster storage area.
Clause a12. The system-on-chip of clause a10, wherein in the write operation, the first cluster is configured to send a lock indication to the cluster storage area to park the write data in the cluster storage area, such that the cluster storage area parks the write data based on the lock indication.
Clause a13. The system-on-chip of clause a12, wherein in the read operation, the second cluster is configured to send to the cluster storage area a read invalidate indication that causes the write data not to be written back to the off-chip memory, such that the cluster storage area invalidates the cache line associated with the write data based on the read invalidate indication.
Clause a14. A computing device comprising the system-on-chip according to any one of clauses A7-a 13.
Clause a15. A board comprising the computing device of clause a14.
Clause a16. A computing device comprising the board of clause a15.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (15)

1. A method for a system-on-chip including at least a plurality of clusters for performing arithmetic operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the arithmetic operations, the method comprising:
mapping a designated storage space of the off-chip memory to a given storage area of the cache memory to use the given storage area as a cluster storage area for inter-cluster data communication; and
performing operations of the cluster using the cluster storage area.
2. The method of claim 1, wherein using the cluster storage to perform the operations of the cluster comprises using the cluster storage for inter-cluster communication.
3. The method of claim 2, wherein using the cluster storage area for inter-cluster communication comprises:
utilizing the cluster storage area to implement point-to-point communication between clusters; or
Broadcast communication by one of the plurality of clusters to the remaining clusters is implemented using the cluster storage area.
4. The method of claim 3, wherein utilizing the cluster storage area to enable peer-to-peer communication between clusters comprises:
receiving a write operation from the first cluster for write data; and
sending the write data to a second cluster in response to a read operation of the second cluster.
5. The method of claim 4, wherein in the write operation, the method further comprises:
receiving a lock indication that write data associated with the write operation resides in the cluster storage area; and
the write data is resident in the cluster storage based on the lock indication.
6. The method of claim 4 or 5, wherein in the read operation, the method further comprises:
receiving a read invalidate indication that causes the write data not to be written back to off-chip memory; and
after sending the write data to the second cluster, invalidating a cache line associated with the write data based on the read invalidate indication.
7. A system on a chip, comprising:
a plurality of clusters, wherein each cluster includes at least a plurality of processor cores to perform arithmetic operations; and
a cache memory interconnected with the plurality of clusters and configured to:
using a given storage area as a cluster storage area for inter-cluster data communication, wherein the given storage area forms a mapping relation with a designated storage space of an off-chip memory; and
performing operations of the cluster using the cluster storage area.
8. The system on a chip of claim 7, wherein the cluster storage area is configured for inter-cluster communication.
9. The system on chip of claim 8, wherein the cluster storage area is configured for inter-cluster point-to-point communication or broadcast communication by one of the plurality of clusters to the remaining clusters.
10. The system on a chip of claim 9, wherein in the point-to-point communication, the cluster storage area is configured to:
receiving a write operation from the first cluster for the write data; and
sending the write data to a second cluster in response to a read operation of the second cluster.
11. The system on chip of claim 10, wherein the second cluster is configured to:
receiving a hardware semaphore from the first cluster; and
in response to receiving the hardware semaphore, performing the read operation on the cluster storage area.
12. The system on a chip of claim 10, wherein in the write operation, the first cluster is configured to send a lock indication to the cluster storage to park the write data in the cluster storage, such that the cluster storage parks the write data based on the lock indication.
13. The system on a chip of claim 12, wherein in the read operation, the second cluster is configured to send to the cluster storage area a read invalidate indication that causes the write data not to be written back to the off-chip memory, such that the cluster storage area invalidates the cache line associated with the write data based on the read invalidate indication.
14. A computing device comprising the system on chip of any of claims 7-13.
15. A board comprising the computing device of claim 14.
CN202110926703.7A 2021-08-12 2021-08-12 Method for system on chip and related product Pending CN115878553A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110926703.7A CN115878553A (en) 2021-08-12 2021-08-12 Method for system on chip and related product
PCT/CN2022/110740 WO2023016383A1 (en) 2021-08-12 2022-08-08 Method for cache memory and related products


Publications (1)

Publication Number Publication Date
CN115878553A (en) 2023-03-31

Family

ID=85762180



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination