WO2023016383A1 - Method for cache memory and related products

Info

Publication number
WO2023016383A1
Authority
WO
WIPO (PCT)
Prior art keywords
latch
data
area
cluster
chip
Prior art date
Application number
PCT/CN2022/110740
Other languages
French (fr)
Chinese (zh)
Inventor
葛祥轩 (Ge Xiangxuan)
张尧 (Zhang Yao)
刘少礼 (Liu Shaoli)
梁军 (Liang Jun)
Original Assignee
寒武纪(西安)集成电路有限公司 (Cambricon (Xi'an) Integrated Circuit Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110926703.7A external-priority patent/CN115878553A/en
Priority claimed from CN202110926707.5A external-priority patent/CN115705300A/en
Application filed by 寒武纪(西安)集成电路有限公司 (Cambricon (Xi'an) Integrated Circuit Co., Ltd.)
Publication of WO2023016383A1 publication Critical patent/WO2023016383A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems

Definitions

  • the present disclosure generally relates to the field of chip technology. More specifically, the present disclosure relates to a method for a cache memory, a cache memory, a system-on-chip including the cache memory, a board including the system-on-chip, and a computing device including the board.
  • the operational performance of a computing system is largely determined by the average memory access latency.
  • System performance can be significantly improved by effectively reducing the number of memory accesses by increasing the hit rate of the cache memory (referred to as "cache").
  • processors typically employ a cache mechanism, and use the cache to accommodate the mismatch in speed and performance between the processor and slow main memory.
  • the current cache implements a multi-level cache mechanism, such as three-level cache (L1, L2, and L3), and the cache closest to the main memory is called the last level cache (“Last Level Cache", LLC).
  • how to expand the application of LLC for different scenarios has also become a problem that needs to be solved.
  • the present disclosure provides a residency scheme for a cache memory.
  • a specific area in the cache memory can be configured as a locked area, and data used multiple times can be stored in the locked area, thereby improving the cache hit rate and the overall performance of the system.
  • the present disclosure provides a solution for a cache memory in the following aspects.
  • the present disclosure provides a method for a cache memory, comprising: configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each latch mode corresponds to a latch-related operation performed on data in the latch area; receiving a latch-related request for performing a latch-related operation on the data in the latch area; and, according to the latch-related request, performing the latch-related operation on the data in the latch area in the corresponding latch mode.
  • the present disclosure provides a cache memory, including: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each latch mode corresponds to a latch-related operation performed on data in the latch area; and a latch execution module configured to: receive a latch-related request for performing a latch-related operation on the data in the latch area; and perform the latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the present disclosure provides a system-on-chip comprising a cache memory as described above and in various embodiments below, and a processor configured to generate said latch-related request, wherein the latch execution module of the cache memory is configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the present disclosure provides a board, including the system-on-chip as described above and in the following embodiments.
  • the present disclosure provides a computing device, including the board as described above and described in various embodiments below.
  • the latch area can be used to perform latch and unlock operations on data used multiple times, thereby significantly improving the cache hit rate. Further, since the latch area of the present disclosure supports multiple latch modes, and these latch modes can be selected and used according to the configuration, the application scenarios of the latch area are expanded. When used in the scenario of the producer core and the consumer core, the latch area of the present disclosure can serve as a medium for data transfer, thereby improving the accessibility and utilization of data. In addition, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the computing system.
  • the present disclosure proposes the use scenario of expanding the cache memory.
  • the present disclosure provides solutions for a system on chip in the following aspects.
  • the present disclosure provides a method for a system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a plurality of processor cores for performing the operations. The method includes: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and performing the operations of the cluster using the cluster storage area.
  • the present disclosure provides a system-on-chip, comprising: a plurality of clusters, each of which includes a plurality of processor cores for at least performing arithmetic operations; and a cache memory interconnected with the plurality of clusters and configured to: use the latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform the operations of the clusters.
  • the present disclosure provides a computing device comprising a system-on-chip as described above and in various embodiments below.
  • the present disclosure provides a board, including the computing device as described above and described in various embodiments below.
  • the present disclosure provides a computing device, including the board as described above and in the following embodiments.
  • the latch area of the cache memory can be used to realize efficient communication between the clusters of the SoC. Therefore, the data that needs to be transferred through the off-chip memory can be directly transferred through the latch area, thereby speeding up data access and significantly improving the cache hit rate. Further, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the SoC. In addition, the division of the latch area simplifies the management of the cache memory and expands the usage scenarios of the cache memory. With the help of the latch area, multiple clusters of the SoC can implement multiple flexible communication mechanisms, thereby also improving the operational performance of the cluster.
  • FIG. 1 is a structural diagram showing a board according to an embodiment of the present disclosure
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart illustrating a method for a cache memory according to an embodiment of the present disclosure
  • Figure 7 is a simplified block diagram illustrating a cache memory according to an embodiment of the disclosure.
  • FIG. 8 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure.
  • FIG. 9 is a detailed block diagram illustrating a system-on-chip according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram showing a hash operation in window mode according to an embodiment of the present disclosure.
  • Figure 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure.
  • FIG. 13 is a flowchart illustrating a method for a system on a chip according to an embodiment of the present disclosure.
  • FIG. 14 is a block diagram illustrating an operation of a system on chip according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • the phrase "if determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once determined" or "in response to the determination" or "once [the described condition or event] is detected" or "in response to detection of [the described condition or event]".
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 1 are only an example, and are not intended to limit the solution of the present disclosure in any respect.
  • the board 10 includes a chip 101, which may be a system-on-chip (System on Chip, SoC), that is, a system-on-chip described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices.
  • the aforementioned combined processing device can be an artificial intelligence computing unit, used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining; deep learning technology in particular is widely applied in the field of cloud intelligence.
  • a notable feature of cloud intelligent applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligent applications, and has huge off-chip storage, huge on-chip storage and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected and data transmitted with the control device 106 and the chip 101 through the bus.
  • the control device 106 in the board 10 may be configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a structural diagram showing a combination processing device in the chip 101 according to the above-described embodiment.
  • the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a DRAM (Dynamic Random Access Memory) 204.
  • the computing device 201 can be configured to perform user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning calculations, and can also interact with the processing device 203 through the interface device 202 to jointly complete operations specified by the user.
  • the interface device 202 can be used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (Central Processing Unit, CPU) or a graphics processing unit (Graphics Processing Unit, GPU), including but not limited to a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together as an integrated whole, they are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store the data to be processed; it is a DDR memory, usually 16G or larger in size, for storing data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • the single-core computing device 301 is used to process input data from fields such as computer vision, speech, natural language and data mining.
  • the single-core computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (Instruction Fetch Unit, IFU) 311 and an instruction decoding unit (Instruction Decode Unit, IDU) 312.
  • the instruction fetch unit 311 is used to obtain instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoding results to the operation module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (Direct Memory Access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons, and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • the DMA 333 is connected to the DRAM 204 through the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core.
  • the multi-core computing device 41 adopts a hierarchical structure design: the multi-core computing device 41 is a system-on-chip that includes at least one cluster according to the present disclosure, and each cluster includes multiple processor cores.
  • in other words, the multi-core computing device 41 is organized in a hierarchy of system-on-chip, cluster and processor core.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to execute tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • GBC Global Barrier Controller
  • the plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41 . Although 4 clusters are exemplarily shown in FIG. 4 , with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405 . In an application scenario, the cluster 405 can be used to efficiently execute deep learning algorithms.
  • each cluster 405 may include a plurality of processor cores (IPU cores) 406 and a storage core (MEM core) 407, and the storage core may include, for example, a cache memory (e.g., the LLC).
  • the number of processor cores 406 is exemplarily shown as four in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in FIG. 5.
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and may also include three modules: a control module 51 , an operation module 52 and a storage module 53 .
  • the functions and structures of the control module 51, operation module 52 and storage module 53 are roughly the same as those of the control module 31, operation module 32 and storage module 33, and will not be repeated here.
  • the storage module 53 may include an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 533 and a moving direct memory access module (Move Direct Memory Access, MVDMA) 534.
  • IODMA 533 controls memory access of NRAM 531/WRAM 532 and DRAM 204 through broadcast bus 409;
  • MVDMA 534 is used to control memory access of NRAM 531/WRAM 532 and storage unit (SRAM) 408.
  • the storage core 407 is mainly used for storage and communication, that is, for storing shared data or intermediate results between the processor cores 406, and for performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like.
  • the storage core 407 may have a scalar operation capability to perform scalar operations.
  • the storage core 407 may include a static random access memory (Static Random-Access Memory, SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 410 and a global direct memory access module (Global Direct Memory Access , GDMA) 411.
  • the SRAM 408 can assume the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be obtained from the DRAM 204 through the processor cores 406 respectively, but is transferred between the processor cores 406 through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to multiple processor cores 406, thereby improving the efficiency of inter-core communication and significantly reducing on-chip and off-chip input/output access.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are respectively used to perform communication between the processor cores 406, communication between the clusters 405, and data transmission between the clusters 405 and the DRAM 204. They will be described separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405 .
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • unicast refers to point-to-point (for example, single processor core to single processor core) data transmission; multicast is a communication method that transmits a piece of data from the SRAM 408 to several specific processor cores 406; and broadcast, which transmits a piece of data from the SRAM 408 to all processor cores 406, is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 in the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be realized in two ways.
  • the first way is to communicate directly between the DRAM 204 and the NRAM 531 or WRAM 532 through the IODMA 533; the second way is to first transfer the data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then transfer the data between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534.
  • although the second way may require more components to participate and has a longer data path, in some embodiments the bandwidth of the second way is much larger than that of the first way, so using the second way to implement the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient. It can be understood that the data transmission methods described here are only exemplary, and those skilled in the art can flexibly select and apply various data transmission methods according to the specific arrangement of the hardware, following the teaching of the present disclosure.
  • the function of GDMA 411 and the function of IODMA 533 can be integrated in the same component.
  • although the present disclosure regards the GDMA 411 and the IODMA 533 as different components for convenience of description, for those skilled in the art, implementations whose functions and technical effects are similar to those of the present disclosure belong to the protection scope of the present disclosure.
  • the function of the GDMA 411, the function of the IODMA 533, the function of the CDMA 410, and the function of the MVDMA 534 can also be realized by the same component.
  • the hardware architecture of the present disclosure and its internal structure have been described in detail above with reference to FIGS. 1-5. It is to be understood that the foregoing description is illustrative only and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also make changes to the board of the present disclosure and its internal structure, and these changes still fall within the protection scope of the present disclosure.
  • the corresponding hardware architecture may not include the CDMA 410 used to control the access to the SRAM 408 among different clusters 405 in the same computing device 201 .
  • the underlying approach of the present disclosure involves improving and optimizing the cache, eg, disposed between SRAM 408 and DRAM 204, to enable efficient on-demand latching of data and communication between different clusters through the cache.
  • the following scheme of the present disclosure proposes to configure a specific storage space in the cache memory as a latch area for data latch operations, especially for data that will be used frequently.
  • the aforementioned frequently used data may be data to be reused between at least one task having a data dependency. It will be appreciated that data need not be locked in the cache memory when the data need only be used once.
  • the following solution of the present disclosure also proposes to configure the cache memory to support multiple latch modes, so that when a latch-related request is received, the cache memory operates in the latch mode corresponding to the aforementioned latch-related request.
  • various latch modes of the present disclosure may have a specific priority order to satisfy different latch-related operations.
  • the solution of the present disclosure also proposes a variety of different configuration methods, so that the cache memory can be used more flexibly and used to realize inter-cluster communication.
  • FIG. 6 is a flowchart illustrating a method 600 for a cache memory according to an embodiment of the disclosure.
  • the method 600 includes, at step S602 , configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes.
  • the aforementioned multiple latch modes may include, but are not limited to, an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, a stream mode for performing latch-related operations based on data streams, and/or a page mode for performing latch-related operations based on cache pages.
  • the aforementioned data streams may be instruction streams or data streams of different types.
  • the data stream may be the neuron data stream, weight data stream, output result data stream, etc. of the neural network model.
  • the data targeted by a latch-related operation is data that will be used multiple times by the processor of the system-on-chip, and it therefore has a relatively higher priority than data not subjected to the latch operation.
  • the cache hit rate can be significantly improved, thereby improving the overall performance of the system.
  • by keeping the reused data in the latch area of the LLC, the read and write operations of data between the on-chip system and the off-chip memory (such as DDR or DRAM) can be reduced, thereby improving memory access efficiency.
  • the above-mentioned multiple latch modes can be set to have different priorities according to user preferences or system preferences.
  • in one implementation, the order of priority may be instruction mode -> window mode -> stream mode -> page mode; in another implementation, the order of priority may be instruction mode -> page mode -> stream mode -> window mode.
  • in this way, the latch area in the cache memory can be used in more ways, increasing the flexibility of using the latch area to cope with different application scenarios and system requirements. Further, the latch modes may be traversed sequentially in the above priority order, and when a high-priority latch mode is disabled, a lower-priority latch mode may be adopted.
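  • as an illustration only, the mode-selection traversal described above can be sketched in C as follows; the type, array and function names are assumptions, not an actual hardware interface:

        #include <stdbool.h>
        #include <stdio.h>

        typedef enum { MODE_INSTRUCTION, MODE_WINDOW, MODE_STREAM, MODE_PAGE, MODE_NONE } latch_mode_t;

        /* one possible priority order: instruction -> window -> stream -> page */
        static const latch_mode_t priority[] = { MODE_INSTRUCTION, MODE_WINDOW, MODE_STREAM, MODE_PAGE };

        /* traverse from high to low priority, falling through disabled modes */
        static latch_mode_t resolve_mode(const bool enabled[4]) {
            for (int i = 0; i < 4; i++)
                if (enabled[priority[i]])
                    return priority[i];
            return MODE_NONE;  /* no mode enabled: operate as a normal cache */
        }

        int main(void) {
            /* instruction mode disabled, window mode enabled: window mode wins */
            bool enabled[4] = { false, true, false, true };
            printf("effective mode = %d\n", resolve_mode(enabled));
            return 0;
        }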
  • a specific storage space may be configured as a latch area supporting a corresponding latch mode according to one configuration instruction among the received configuration instructions.
  • the configuration instruction may include one or more configuration items, so as to realize the configuration of the aforementioned latch area.
  • the plurality of configuration items may include configuration items for enabling a latch area, disabling a latch area, and/or a size of a latch area.
  • a corresponding latch strategy (such as the size of the latched data or the specific data to be latched) can be configured in the aforementioned instruction mode, window mode, stream mode or page mode, so as to latch different types of, or specific, instructions, data or data streams.
  • the scheme of the present disclosure can realize the flexible use of the cache memory, so that it can operate in one of the various latch modes of the present disclosure, or operate in the normal mode as required.
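  • a minimal sketch, assuming hypothetical field names, of what the payload of such a configuration instruction could look like in C; the real encoding is not specified in this disclosure:

        #include <stdint.h>

        typedef struct {
            uint8_t  lock_enable;  /* 1: enable the latch area, 0: disable it      */
            uint8_t  num_ways;     /* size of the latch area, expressed in ways    */
            uint8_t  mode;         /* which latch mode this configuration targets  */
            uint32_t policy;       /* optional latch strategy, e.g. the size of    */
                                   /* the latched data or an id of specific data   */
        } latch_config_t;

        int main(void) {
            /* example: enable a 6-way latch area for the window mode */
            latch_config_t cfg = { .lock_enable = 1, .num_ways = 6, .mode = 1, .policy = 0 };
            (void)cfg;  /* would be delivered to the cache's configuration module */
            return 0;
        }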
  • a latch-related request for performing latch-related operations on data in the latch area is received.
  • the latch-related request may be triggered by an operation intended to reside specific data in a latch region.
  • the latch-related request may also be triggered by an operation intended to remove or release specific data from the latch area.
  • the latch-related requests of the present disclosure may also have different expressions or contents. For example, for an instruction mode, a window mode, or a stream mode, the latch-related request may include a configuration item for indicating a behavior attribute of the cache memory, and the like.
  • the above-mentioned configuration item for indicating the behavior attribute of the cache memory includes at least one of the following configuration attributes (a sketch of these attributes as a C enumeration is given below):
  • Transient attribute: do not cache in the LLC, that is, perform data read and write operations directly with the off-chip memory (such as DDR); data that is only accessed once is not cached in the LLC, thereby avoiding occupying LLC resources;
  • Lock attribute: reside specific data in the latch area, and read and write the data from the hit cache line (cacheline). If the cache line belongs to the latch area, its attribute is configured as the persisting attribute; if the cache line does not belong to the latch area, its attribute remains unchanged, that is, it keeps the normal attribute described below. It should be clear that a cache line in the latch area has one of two attributes, namely the persisting attribute and the normal attribute. A cache line with the persisting attribute in the lock area can only be accessed and replaced by a latch-related request carrying the Lock attribute;
  • Unlock attribute: after reading and writing the data from the hit cache line, release the storage space corresponding to the data in the latch area of the LLC, and set the attribute of the corresponding cache line in the latch area to the normal attribute;
  • Invalid attribute: invalidate the data directly after it is read, so that it is not replaced and written back to the off-chip memory;
  • Clean attribute: during a write operation, data can be written into the hit cache line and the stored content of the entire cache can be written back to the off-chip memory, with the attribute of the cache line unchanged; during a read operation, data is read from the hit cache line, and when the hit cache line is dirty, it is written back to the off-chip memory;
  • Default attribute: the default item can be used to indicate that the configuration of the latch mode is ignored.
  • the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.
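  • for illustration, the behavior attributes listed above can be modeled as a C enumeration carried by a latch-related request; the identifier names are assumptions, and only the semantics in the comments come from the list above:

        /* behavior attributes a latch-related request may carry */
        typedef enum {
            ATTR_TRANSIENT, /* bypass the LLC; read/write the off-chip memory directly         */
            ATTR_LOCK,      /* reside data in the latch area; mark hit lines persisting        */
            ATTR_UNLOCK,    /* after access, release the line and restore the normal attribute */
            ATTR_NORMAL,    /* ordinary cached access                                          */
            ATTR_INVALID,   /* invalidate after reading; never written back off-chip           */
            ATTR_CLEAN,     /* write dirty contents back to the off-chip memory                */
            ATTR_DEFAULT    /* ignore the latch-mode configuration for this request            */
        } req_attr_t;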
  • in the page mode, the latch-related request may indicate that the data related to a specific page is to be latched in the latch area for subsequent multiple uses, or that, after multiple uses, the data related to the specific page is to be unlocked from the latch area so as to release more storage space for subsequent data latching. It can be understood that, through the release operation, the storage space of the latch area can be used flexibly, thereby improving the utilization efficiency of the latch area of the present disclosure.
  • a latch-related operation may be performed on data in the latch area in a corresponding latch mode.
  • the aforementioned latch-related operations may include a read operation and a write operation for the latch area.
  • the method 600 may also include latching data or a selected part of the data in a specified area of the latch area according to a latch-related request, so as to be used in subsequent multiple reads.
  • the method 600 may further include, after the read operation is completed, releasing the data or the selected part of the data from the specified area of the latch area according to the latch-related request.
  • a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area.
  • a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data to be latched in the latch area.
  • the aforementioned hash algorithm may be used to select part of the data that can be locked in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .
  • the solution of the present disclosure enables the cache memory to support multiple latch modes, thereby expanding the application scenarios of the cache memory and significantly improving the cache hit rate. Furthermore, the introduction of multiple latch modes makes the use of the latch area more flexible and adaptable, so as to meet different application scenarios and user requirements. In addition, the effective latching of data in the latch area also promotes the sharing of data between the producer kernel ("producer kernel") and one or more consumer kernels ("consumer kernel"), improving data accessibility and utilization.
  • the producer kernel and the consumer kernel here can be understood as two dependent tasks, where the output of the producer kernel will be used as the input to the consumer kernel, so that the consumer kernel can use the input to complete the corresponding task.
  • the output of the producer kernel can serve as data that needs to be used multiple times later, and such data can be temporarily stored in the latch area of the cache memory, so that the consumer kernel can directly obtain its input from the cache memory without accessing the off-chip memory. This reduces the memory interaction between the artificial intelligence processor and the off-chip memory and reduces the IO memory access overhead, which in turn improves the processing efficiency and performance of the artificial intelligence processor.
  • FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the disclosure. It can be understood that the cache memory 700 shown in FIG. 7 may be the cache memory described in conjunction with FIG. 6 , so the cache memory described in FIG. 6 is also applicable to the following description in relation to FIG. 7 .
  • the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. Further, the cache memory 700 also includes a storage space for performing cache operations; for example, as shown in the figure, the storage space is equally divided into 8 ways (way0-way7), where each way includes a number of cache lines (cachelines).
  • the above-mentioned configuration module can be used to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein the size of the specific storage space is smaller than the total storage size of the cache memory .
  • way0-way5 in FIG. 7 can be configured as a specific storage space that supports latching.
  • ways 6-7 in FIG. 7 can keep the normal attributes of the cache memory, that is, serve as a general cache.
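  • a minimal C sketch of the way partition shown in FIG. 7, assuming a bitmask encoding (one bit per way) that this disclosure does not actually specify:

        #include <stdint.h>
        #include <assert.h>

        #define TOTAL_WAYS 8

        /* build a mask selecting the first latch_ways ways as the latch area */
        static uint8_t latch_way_mask(int latch_ways) {
            assert(latch_ways < TOTAL_WAYS);  /* latch area is smaller than the whole cache */
            return (uint8_t)((1u << latch_ways) - 1u);
        }

        int main(void) {
            uint8_t lock_mask   = latch_way_mask(6);             /* 0b00111111: way0-way5 */
            uint8_t normal_mask = (uint8_t)(~lock_mask & 0xFFu); /* 0b11000000: way6-way7 */
            (void)lock_mask; (void)normal_mask;
            return 0;
        }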
  • the latch mode can be instruction mode, window mode, stream mode and/or page mode.
  • the latch execution module may be configured to receive a latch-related request for performing latch-related operations on data in the latch area.
  • the latch execution module can perform latch-related operations on data in the latch area in a corresponding latch mode according to the latch-related request.
  • the latch-related operations here may include a write operation to the latch area (that is, writing data into the latch area) or releasing data from the latch area. For example, when the consumer core has used up the data in the lock area and the data will no longer be used by other consumer cores, the space storing that data in the lock area can be released for latching other data.
  • FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the disclosure.
  • a system-on-chip 800 of the present disclosure may include a cache memory 700 and a processor (or processor core) 802 as shown in FIG. 7 .
  • the latch execution module of the cache memory may be configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the cache memory 700 it has been described above in conjunction with FIG. 6 and FIG. 7 , and will not be repeated here.
  • the processor 802 may be various types of processors, and may include one or more processor cores to generate latch-related requests.
  • the latch execution module of the cache memory is configured to perform latch-related operations on data in the latch area in a corresponding latch mode according to the generated latch-related request.
  • the processor can be configured to generate latch-related requests according to received hardware instructions.
  • when the latch mode is the page mode, the processor may be configured to generate a latch-related request according to the cache page configuration.
  • when the latch mode is the window mode, the processor may be used to configure a lock window and generate a latch-related request according to the lock window.
  • the processor 802 may also be an intelligent processor or intelligence processing unit ("Intelligence Processing Unit", abbreviated "IPU") including multiple computing cores, which may be configured to execute computing tasks in various artificial intelligence fields (such as neural network operations).
  • FIG. 9 is a detailed block diagram illustrating a system on chip 900 according to an embodiment of the present disclosure. It can be understood that the system-on-chip 900 shown here may be a specific implementation of the system-on-chip shown in FIG. 8 , and therefore the content described with respect to FIG. 8 is also applicable to FIG. 9 . Further, for the purpose of example only, the operation of the system-on-chip 900 will be described in a window mode (or stream mode) among a plurality of latch modes.
  • a system on chip 900 may include a task scheduler (“Job Scheduler”) 902 including a scheduling unit 903 and a configurator 904 .
  • the configurator 904 may be configured to generate configuration instructions according to assigned configuration tasks (e.g., obtained from a task queue) and send them to a configuration module (such as the CLR) in the cache memory (that is, the "LLC" 906).
  • the scheduling unit 903 can be used to schedule the multiple tasks in the task scheduler (that is, the "kernels" to be executed on the artificial intelligence processor) and send them to the intelligent processor (IPU) 905 in the system-on-chip of the present disclosure.
  • the intelligent processor 905 here may include multiple processor cores, and the multiple processor cores may form a cluster as shown in FIG. 4 .
  • the scheduling unit may allocate tasks to appropriate processor cores according to the idleness (eg utilization) of the multiple processor cores.
  • the system-on-chip 900 also includes a system memory management unit ("System Memory Management Unit", abbreviated "SMMU"), which is used to convert the virtual address of the accessed data into a physical address, so that the associated storage location can be accessed according to the physical address.
  • the system memory management unit includes an address translation buffer, the TLB (Translation Lookaside Buffer, also called a fast table).
  • a page table is maintained in the TLB, and the page table includes at least one page table entry, and each page table entry includes a page (page) and a page frame (Frame) corresponding to the page.
  • the system memory management unit can determine the page corresponding to a received virtual address, and then determine the physical address (PA, "Physical Address") corresponding to the virtual address through the mapping relationship between the page and the page frame, so that the relevant storage location of the cache memory can be accessed according to the physical address.
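  • a minimal sketch, in C, of the page-to-page-frame translation described above, assuming 4 KiB pages and a linear scan of the page table; the real SMMU/TLB organization is not specified here:

        #include <stdint.h>
        #include <stddef.h>

        #define PAGE_SHIFT 12  /* assume 4 KiB pages */

        typedef struct { uint64_t page; uint64_t frame; } pte_t;

        /* look up va in the page table; returns 0 on a miss (illustration only) */
        static uint64_t translate(const pte_t *tlb, size_t n, uint64_t va) {
            uint64_t vpn = va >> PAGE_SHIFT;  /* page (virtual page number) */
            for (size_t i = 0; i < n; i++)
                if (tlb[i].page == vpn)       /* page -> page frame mapping  */
                    return (tlb[i].frame << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1u));
            return 0;
        }

        int main(void) {
            pte_t tlb[1] = { { 0x12, 0x80 } };
            return translate(tlb, 1, 0x12345) ? 0 : 1;  /* 0x12345 lies in page 0x12 */
        }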
  • access to the cache memory can be implemented through the above-mentioned window mode or stream mode.
  • in the window mode, the intelligent processor can obtain the parameter table from the memory and, according to the parameter table, configure a lock window ("Lock window") associated with the data on which the latch-related operation is to be performed, and then generate latch-related requests (e.g., IO access requests with lock/unlock attributes attached) according to the configured lock window.
  • the SMMU can perform latch-related operations on the LLC according to the IO access request. Specifically, the SMMU may send the aforementioned IO access request to the cache policy module 907 of the LLC 906 (which performs the same operation as the latch execution module 702 in FIG. 7 ) for execution.
  • the parameter table may include parameter items for configuring a lock window or a stream latch attribute in a stream mode.
  • the parameter items may include, but are not limited to, the lock/unlock window ("lock/unlock window"), per-stream lock/unlock ("per stream lock/unlock"), the lock ratio ("Lock Ratio"), the lock window flag ("lock window flag") and other information.
  • the parameters in the parameter table may be user-defined.
  • the relevant parameters in the parameter table can be obtained during the running phase of the program, and the parameter table can be stored in the memory (such as DDR), so that the intelligent processor (such as the IPU 905 in the figure) can be used in the execution phase.
  • the above-mentioned lock window is used to represent the storage space that the software user wishes to lock, and the size of the lock window may be larger than the size of the lock area on the cache memory.
  • the above-mentioned lock window includes one or more of the following: the base address and the size of the window. The base address of the window can be a virtual address ("Virtual Address", abbreviated "VA") configured by upper-layer software and corresponds to the starting address of the data to be latched, while the size of the window may correspond to the size of the data to be latched.
  • the intelligent processor can determine the memory access address of the data in a task (the access address can be a virtual address) according to the task issued by the task scheduler, and compare that access address with the address range defined by the lock window. If the access address of the data in the task is within the address range of the lock window, the lock window is hit, and the lock window can be enabled ("Enabled") at this time. Otherwise, if the access address is outside the address range of the lock window, the lock window is missed; in that case the lock window can be ignored, meaning that the data in the task will not be temporarily stored in the cache memory. A sketch of this hit check is given below.
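  • the hit check just described can be sketched in C as follows; the structure layout and names are assumptions for illustration:

        #include <stdint.h>
        #include <stdbool.h>

        typedef struct {
            uint64_t base_va;  /* virtual base address: start of the data to latch */
            uint64_t size;     /* window size: size of the data to latch           */
            bool     enabled;
        } lock_window_t;

        /* a request hits when its access address falls inside the window range */
        static bool window_hit(const lock_window_t *w, uint64_t access_va) {
            return w->enabled && access_va >= w->base_va && access_va < w->base_va + w->size;
        }

        int main(void) {
            lock_window_t w = { 0x1000, 0x2000, true };
            return window_hit(&w, 0x1800) ? 0 : 1;  /* hit: within [0x1000, 0x3000) */
        }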
  • a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data and stored in the latch area.
  • the specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .
  • the intelligent processor can send the lock-related request attached with the Lock attribute to the cache memory LLC through the SMMU.
  • the lock-related request attached with the Lock attribute may be used to indicate that specific data resides in the lock area, and the specific data may be part of data selected according to a hash algorithm.
  • the latching process and release process of the LLC will be described below in the window mode with reference to FIG. 9 .
  • Step 1: The task scheduler configures the LLC with the help of the configurator (e.g., via the cache policy module), enabling the lock region ("Lock enable") or disabling it ("Lock disable") and setting the size of the lock region, expressed as a number of ways ("Ways", e.g., Way0-Way7) as shown in the figure.
  • Step 2: The task scheduler sends the task kernel to the IPU;
  • Step 3: The IPU obtains the lock window flag ("lock window flag") from the parameter table, then reads and configures the lock window.
  • the parameter table here can be configured by software and stored at a storage address of the off-chip dynamic random access memory ("Dynamic Random Access Memory", abbreviated "DRAM"). The task scheduler can then transmit that address to the IPU, and the IPU can read the parameter table according to the address to complete the configuration of the lock window.
  • Step 4: The IPU generates a latch-related request through the memory management unit SMMU; when the request is sent to the cache policy module of the LLC, the lock attribute can be attached to it according to the lock window information.
  • Step 5: After receiving the latch-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of that cache line (that is, it becomes part of the lock area), for example setting it to "persisting" as described above.
  • Step 6: The task scheduler sends the kernel to the IPU;
  • Step 7: The IPU obtains the unlock window ID from the parameter table, then reads and configures the unlock window;
  • Step 8: When the IPU transmits a request, it attaches the unlock ("unlock") attribute according to the unlock window information;
  • Step 9: After receiving the request with the unlock attribute, the cache policy module of the LLC switches the hit cache line from the lock attribute to a normal attribute, such as the Normal attribute described above in conjunction with the instruction mode;
  • Step 10: The task scheduler disables the lock area (i.e., LLC lock disable) by means of the configurator, through the CLR module (a condensed software-side sketch of these steps is given below).
  • the CLR module may clear the previous locking attribute configuration according to the instruction of the configurator.
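  • purely as an illustration of the control flow of steps 1-10, the following C sketch replaces every hardware interaction with a hypothetical stub; none of these functions is a real API:

        #include <stdio.h>

        /* hypothetical driver hooks standing in for the hardware interactions */
        static void llc_lock_enable(int ways)       { printf("step 1: lock enable, %d ways\n", ways); }
        static void dispatch_kernel(const char *k)  { printf("dispatch kernel: %s\n", k); }
        static void configure_window(const char *w) { printf("configure %s window from parameter table\n", w); }
        static void llc_lock_disable(void)          { printf("step 10: lock disable via CLR\n"); }

        int main(void) {
            llc_lock_enable(6);           /* step 1: e.g. Way0-Way5 as the lock region */
            dispatch_kernel("producer");  /* step 2 */
            configure_window("lock");     /* step 3; steps 4-5 run on each IO access,
                                             attaching the lock attribute to requests  */
            dispatch_kernel("consumer");  /* step 6 */
            configure_window("unlock");   /* step 7; steps 8-9 run on each IO access,
                                             reverting hit lines to the normal attribute */
            llc_lock_disable();           /* step 10 */
            return 0;
        }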
  • the latch scheme of the system on chip of the present disclosure in the window mode has been described in detail above with reference to FIG. 9 .
  • the probability of a cache hit can be significantly increased, the utilization efficiency of the cache memory is improved, and the application scenarios are expanded.
  • the embodiments of the present disclosure also support latch-related operations in stream mode.
  • when the enable bit corresponding to the data stream in the task of the present disclosure is low, this is regarded as the default situation, that is, latch-related operations in the stream mode are not performed.
  • conversely, when the enable bit is high, the corresponding latch-related operations can be performed on the data stream in the stream mode.
  • the window mode and the stream mode of the present disclosure operate similarly: using the hash algorithm and the lock ratio of the data stream, a predetermined proportion of data can be selected from the data stream as the aforementioned partial data to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
  • the embodiment of the present disclosure also supports latch-related operations in the page mode, and the page mode will be described below with reference to FIG. 10 .
  • FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure.
  • in the page mode, the cache page can be directly configured to have the lock attribute of the present disclosure, so that a cache page that forms a mapping relationship with the memory (such as "memory") can be shared by multiple kernels (kernel 0-2 shown in the figure) for data access.
  • the programmer may use an instruction (such as Malloc) to mark the cache page with a lock attribute.
  • the SMMU can lock the data corresponding to the cache page in the latch area of the present disclosure.
  • the disclosed scheme improves the sharing and accessibility of data among multiple cores.
  • in one implementation, the software driver can directly configure the system memory management unit ("SMMU") information in the page table through instructions, and determine which of the two configurations to use: performing page-based latch operations or operating normally.
  • when operating normally, the attribute of the cache line in the cache memory can be the normal (Normal) attribute.
  • the page-based latch operation may be set according to the SMMU linearly mapped window configuration. For example, the data corresponding to the cache page in the linear mapping window is locked in the latch area of the present disclosure.
  • the SMMU can generate a corresponding latch-related request based on the information in the page table and send it to the LLC, and the cache policy module of the LLC can configure the cache lines of the LLC according to the latch-related request to execute the corresponding latch-related operations.
  • the embodiment of the present disclosure also supports an instruction mode, in which the system-on-chip can configure the latch area in the LLC through a memory access instruction (IO instruction) in the instruction set.
  • the IO instruction carries at least one configuration domain for latch-related attributes, so that the LLC can be flexibly configured by means of the configuration domain.
  • various configuration domains may represent corresponding operation behaviors that the LLC may perform when performing data access to off-chip memory (such as DDR space).
  • the instruction may carry the configuration attributes described above: the Transient attribute, Lock attribute, Unlock attribute, Normal attribute, Invalid attribute, Clean attribute or Default attribute, and so on. Since the instruction mode has the highest priority, when the IO access instruction indicates the Default attribute, other modes (such as the window mode, stream mode or page mode) can perform latch-related operations.
  • the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.
  • in the instruction mode, the IPU can determine the latch-related request according to the IO instruction in the task. Specifically, when the configuration domain of the Lock attribute in the IO instruction is enabled, the Lock attribute can be attached to the latch-related request, so that the LLC stores the specific data of the request in the lock area according to the Lock attribute. When the configuration domain of the Unlock attribute in the IO instruction is enabled, the Unlock attribute can be attached to the latch-related request, so that the LLC releases the locked area according to the request carrying the Unlock attribute. Depending on the application scenario, the latch-related request here can similarly carry other attributes.
  • in some embodiments, the instruction also includes a specific configuration field (for example, a specific bit inst_ratio_en) for indicating the latch ratio.
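  • a hypothetical C bitfield layout for such a configuration domain; only inst_ratio_en is named in the text, and the remaining fields and widths are assumptions:

        #include <stdint.h>

        typedef struct {
            uint32_t attr          : 3;  /* Transient/Lock/Unlock/Normal/Invalid/Clean/Default */
            uint32_t inst_ratio_en : 1;  /* enables the latch-ratio configuration field        */
            uint32_t lock_ratio    : 7;  /* latch ratio in percent, used when inst_ratio_en=1  */
            uint32_t reserved      : 21;
        } io_inst_cfg_t;

        int main(void) {
            /* example: a Lock request latching 10% of the accessed data */
            io_inst_cfg_t cfg = { .attr = 1, .inst_ratio_en = 1, .lock_ratio = 10, .reserved = 0 };
            (void)cfg;
            return 0;
        }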
  • the specific use of the hash algorithm will be described in detail below in conjunction with FIG. 11 .
  • FIG. 11 illustrates a hash operation in window mode or stream mode according to an embodiment of the present disclosure.
  • the scheme of the present disclosure uses a hash operation to enforce residency (i.e., locking) at a certain ratio, because one of the key issues with LLC residency is the tradeoff ("tradeoff") between bandwidth and capacity. Therefore, the present disclosure proposes to implement residency at a certain ratio (i.e., the Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks.
  • Lock Ratio can be configured in the lock/unlock window or for specific data streams. Also, although hash operations in window mode or stream mode are described below, similar operations are also applicable to hash operations in instruction mode.
  • the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address is within the address range of the lock window.
  • a hash operation may be performed on the hit window address range.
  • the access address of each data may be a virtual address.
  • the VA of the access address can be mapped to the Hash space (that is, the "Hash Map” in the figure), and the Hash process can preferentially retain the low-order information of the address.
  • the Hash value obtained at 1102 can be compared with the lock ratio Lock Ratio at 1104 to randomly select data of a corresponding ratio.
  • when the hash value of the access address is smaller than the latch ratio, it is considered a hit, and this part of the data (i.e., the data conforming to the ratio) can be latched in the cache memory; when the hash value of the access address is greater than or equal to the latch ratio, it is considered a miss, and this part of the data will not be latched in the cache memory.
  • for example, when the lock ratio (Lock Ratio) is set to 10%, the partial data corresponding to the first 10% of the hash values can be selected in order; that is, latch-related operations are performed on the data whose latch-address hash value is smaller than the lock ratio.
  • the latch ratio can also be other values, and the latch ratio can be customized by the software user, and the aforementioned selection operation can also be implemented according to the setting of the Hash algorithm.
  • the latch ratio may also be 20%-30%, and at this time, partial data corresponding to the first 20%-30% of the Hash values may be sequentially selected to perform latch-related operations. Thereafter, at 1106, it can be processed according to the specified request type, that is, to lock or unlock some data.
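The ratio-based selection described above can be summarized in a short behavioral sketch. It is a model under stated assumptions only: the multiplicative hash below stands in for the hardware "Hash Map", whose exact function the disclosure does not specify beyond favoring the low-order bits of the address.

```python
# Behavioral sketch of ratio-based residency; the hash function is an
# assumption standing in for the hardware "Hash Map".
def hash_va(va: int, bits: int = 16) -> float:
    """Map a virtual address into [0, 1), keeping low-order address information."""
    mixed = (va * 0x9E3779B1) & 0xFFFFFFFF  # Fibonacci-style multiplicative mix
    return (mixed & ((1 << bits) - 1)) / float(1 << bits)

def should_latch(va: int, lock_ratio: float) -> bool:
    """Hit: the hash value is smaller than the configured latch ratio."""
    return hash_va(va) < lock_ratio

# With a 10% lock ratio, roughly one line in ten is selected for residency,
# trading latched capacity against bandwidth for the remaining accesses.
lines = range(0x8000_0000, 0x8000_0000 + 64 * 1024, 64)  # 1024 cache lines
resident = [va for va in lines if should_latch(va, 0.10)]
print(f"{len(resident)} of {len(lines)} lines selected")
```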
• The latch scheme of the cache memory of the present disclosure has been described in detail above with reference to FIGS. 6-11. Based on the idea of the aforementioned latch scheme, and as a supplement to it, another extended application of the present disclosure for the cache memory, namely inter-cluster communication, will be described below in conjunction with FIGS. 12-14.
  • the system on chip here may be the system on chip included in the computing device 201 shown in FIG. 2 , for example, the system on chip constituted by the multi-core computing device 41 .
• The system-on-chip 1200 includes four exemplarily shown clusters, cluster 0 to cluster 3. Since the cluster has been described in detail above, it will not be repeated here.
• A cache memory 1201, which can be provided, for example, in the SRAM 408 previously shown in FIG. 5, is used for performing inter-cluster data transfer operations.
  • the cache memory 1201 can also perform on-chip and off-chip bidirectional communication with DRAM (such as DDR), including the transfer of various types of data or instructions.
  • FIG. 13 is a flowchart illustrating a method 1300 for a system on chip according to an embodiment of the present disclosure.
  • the system on chip here may be the system on chip as shown in FIG. 12 .
  • the system-on-chip includes at least a plurality of clusters for performing computing operations and a cache memory interconnected with the plurality of clusters.
  • each cluster may include multiple processor cores for performing the computing operations.
• The above-mentioned latch area determined in the cache memory can be used to complete inter-cluster data communication, so that the system-on-chip does not need to provide communication modules such as the CDMA 410 and the GDMA 411.
  • the above-mentioned latch area can be used to transfer data between tasks with dependencies, for example, the latch area can be used to transfer data between a producer core and a consumer core.
  • the processor can lock the data that the producer core needs to exchange to the consumer core in the LLC through the configured lock window.
• After the processor finishes executing the producer kernel, it can latch the data that needs to be delivered to the consumer kernel (which may be the input data or the output data of the producer kernel).
  • the processor can perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU, so as to latch the above-mentioned data that needs to be exchanged in the LLC in the window mode, for later use by the consumer kernel.
• The processor can also release the latch area according to the unlock window configured in the consumer kernel. That is, when the processor completes the execution of the consumer kernel by performing a read operation on the data latched in the LLC, it can release the storage space corresponding to that data in the latch area of the LLC.
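The producer/consumer hand-off just described can be sketched as follows. This is a software analogy only: the LLC class and its lock/read/unlock methods are hypothetical stand-ins for the window-mode latch-related requests issued, for example, via the SMMU, not an API defined by the disclosure.

```python
# Software analogy of the producer/consumer hand-off through the latch area;
# the LLC class and kernel functions are hypothetical stand-ins.
class LLC:
    def __init__(self):
        self.latched = {}          # address -> data resident in the latch area

    def lock(self, addr, data):    # lock window: latch-related Lock request
        self.latched[addr] = data

    def read(self, addr):          # hit in the latch area, no DRAM access
        return self.latched[addr]

    def unlock(self, addr):        # unlock window: release the latched space
        self.latched.pop(addr, None)

def producer_kernel(llc, addr):
    output = [x * x for x in range(8)]  # placeholder producer computation
    llc.lock(addr, output)              # keep the output resident for the consumer

def consumer_kernel(llc, addr):
    data = llc.read(addr)               # served from the latch area
    llc.unlock(addr)                    # free the space after the last read
    return sum(data)                    # placeholder consumer computation

llc = LLC()
producer_kernel(llc, 0x1000)
result = consumer_kernel(llc, 0x1000)   # -> 140
```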
  • the latch area can also be used in the application scenario of inter-chip communication.
• A cluster or processor core of the processor transmits data (which may be the data that the producer core needs to exchange to the consumer core) via the latch area to processors in other clusters for merge processing.
  • Processors in other clusters read data from the latch area for processing, thereby realizing inter-chip data transfer.
• For details on how inter-cluster communication is performed using the latch area, please refer to the description below.
  • the present disclosure also includes a method for performing inter-cluster communication using a latch area of a cache memory, the method comprising:
• The specified storage space of the off-chip memory is mapped to a given storage area of the cache (whose physical properties are the same as those of the locking area described above in conjunction with the accompanying drawings), so that the given storage area serves as the cluster storage area for inter-cluster data communication.
• The cache memory may include the LLC, and the off-chip memory may include DDR.
  • the specified storage space may be the storage space specified at 1402 in FIG. 14 .
  • the cluster storage area may be a given storage area in the cache memory at 1404 in FIG. 14 .
• The specified storage space of the DDR can be designated through software configuration and mapped to a given space in the cache for inter-cluster communication (for example, between cluster 0 and cluster 1 shown in FIG. 14).
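For illustration, this software-configured mapping can be modeled as a simple base-and-size translation. The structure and field names below are assumptions, since the disclosure does not define a configuration interface.

```python
# Hedged sketch of mapping a designated DDR range onto the latch area used
# as the cluster storage area; all names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ClusterStoreMap:
    ddr_base: int   # designated storage space in the off-chip DDR
    size: int       # must fit within the configured latch area
    llc_base: int   # given storage area inside the cache

    def to_llc(self, ddr_addr: int) -> int:
        """Translate a DDR address to its resident location in the LLC."""
        assert self.ddr_base <= ddr_addr < self.ddr_base + self.size
        return self.llc_base + (ddr_addr - self.ddr_base)

# e.g. 256 KiB of DDR mapped for cluster 0 <-> cluster 1 communication
m = ClusterStoreMap(ddr_base=0x4000_0000, size=256 * 1024, llc_base=0x0)
assert m.to_llc(0x4000_0040) == 0x40
```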
  • the determined cluster storage area may be used to perform cluster operations.
  • using the cluster store to perform operations of the cluster may include using the cluster store for inter-cluster communication.
  • using the cluster storage area for inter-cluster communication may specifically include: using the cluster storage area to implement point-to-point communication between clusters.
  • the cluster storage area may be used to implement broadcast communication from one of the multiple clusters to other clusters.
• The cluster storage area can be used to receive a write operation of the first cluster for writing data and, in response to a read operation of the second cluster, to send the data previously written by the first cluster to the second cluster.
• The cluster storage area may also be used to receive a lock indication that makes the write data associated with the above-mentioned write operation reside in the cluster storage area, such as the write lock ("write lock") shown in FIG. 14, that is, the above-mentioned latch-related request with the Lock attribute. Then, based on the lock indication, the written data may reside in the cluster storage area, where the cluster storage area may be the latch area determined in the above embodiments. Through such a residency manner, the hit rate in the cache memory of data to be read many times can be significantly improved.
• For example, the producer kernel executing in one of the clusters can lock the data that needs to be exchanged to the consumer kernel in the LLC through the above-mentioned write lock for later use by the consumer kernel; for instance, the producer core transmits data via the LLC to processors in other clusters for merge processing. Processors in other clusters can read the data from the cluster storage area for processing, thereby realizing inter-chip transmission of the data.
• The cluster storage area can also be used to receive a read invalidation indication that causes the write data not to be written back to the off-chip memory, such as the read invalid ("read invalid") indication issued by cluster 1 in FIG. 14.
• The read invalid indication may be a latch-related request with the Invalid attribute; for how such a latch-related request is generated, refer to the description above. In different latch modes, the latch-related requests can differ. Then, after sending the write data to cluster 1, the cluster storage area may invalidate the cache line associated with the write data based on the read invalidation indication.
• In an implementation scenario, the cluster (such as cluster 0) that writes data to the cluster storage area can send a synchronization command to another cluster (such as cluster 1) after the write operation is completed, such as the hsem ("hardware semaphore") in FIG. 14.
• Correspondingly, after reading the data written into the cluster storage area by cluster 0, cluster 1 can send the above-mentioned read invalidation request for the cluster storage area to invalidate the cache line, thereby preventing write-back of the aforementioned data.
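Putting the pieces of FIG. 14 together, the following sketch simulates the hand-shake with threads: cluster 0 writes with a lock indication, signals through a hardware semaphore (hsem), and cluster 1 reads with a read-invalid indication so the line is never written back to DDR. The classes and method names are behavioral stand-ins, not hardware interfaces defined by the disclosure.

```python
# Behavioral model of the FIG. 14 hand-shake; classes and method names are
# illustrative stand-ins for the hardware indications described in the text.
import threading

class ClusterStore:
    def __init__(self):
        self.lines = {}                # addr -> resident write data

    def write_lock(self, addr, data):  # "write lock": data resides in the LLC
        self.lines[addr] = data

    def read_invalid(self, addr):      # "read invalid": read, then drop the line
        return self.lines.pop(addr)    # popped line is never written back to DDR

store = ClusterStore()
hsem = threading.Semaphore(0)          # stand-in for the hardware semaphore
result = []

def cluster0():
    store.write_lock(0x100, "partial-sum")  # write with the lock indication
    hsem.release()                          # synchronize after the write completes

def cluster1():
    hsem.acquire()                          # wait for cluster 0's signal
    result.append(store.read_invalid(0x100))

t0 = threading.Thread(target=cluster0)
t1 = threading.Thread(target=cluster1)
t1.start(); t0.start()
t0.join(); t1.join()
assert result == ["partial-sum"]
```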
• The above-mentioned behaviors of writing data to and reading data from the cluster storage area can also be collectively referred to as latch-related operations triggered by latch-related requests; for how such latch-related requests are determined, see the description above.
  • the latch-related request may be used to indicate a latch operation. Through the latch operation, the data will be latched in the cluster storage area for subsequent multiple uses. Further, the latch-related request can be used to indicate a release operation, and through the release operation, data can be unlocked from the cluster storage area to release more storage space for subsequent data latches. It can be understood that, through the release operation, the storage space of the cluster storage area can be used flexibly, thereby improving the usage efficiency of the cluster storage area in the present disclosure.
  • the data or a selected part of the data may be released from the specified area of the cluster storage area according to a latch-related request.
  • a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area.
  • hash algorithm can be used to select a predetermined proportion of data from the data as the aforementioned partial data to be latched in the cluster storage area.
• The electronic equipment or devices disclosed in this disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
• Said vehicles include airplanes, ships, and/or cars; said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; said medical equipment includes nuclear magnetic resonance instruments, B-ultrasound machines, and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
• Electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
• The hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, so as to achieve unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-end integration.
• The present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the descriptions of some embodiments in this disclosure also have different emphases. In view of this, for the parts that are not described in detail in a certain embodiment of the present disclosure, those skilled in the art may refer to the related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
• The above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, and it can include several instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
• The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("Read Only Memory", ROM), a random access memory ("Random Access Memory", RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP, and ASIC.
• The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory ("Resistive Random Access Memory", RRAM), a dynamic random access memory ("Dynamic Random Access Memory", DRAM), a static random access memory ("Static Random Access Memory", SRAM), an enhanced dynamic random access memory ("Enhanced Dynamic Random Access Memory", EDRAM), a high bandwidth memory ("High Bandwidth Memory", HBM), a hybrid memory cube ("Hybrid Memory Cube", HMC), ROM, RAM, etc.
• Clause A1. A method for a cache memory, comprising:
• Clause A2. The method of Clause A1, wherein the plurality of latch modes are performed in a predetermined order of priority.
• Clause A3. The method of Clause A1 or A2, wherein the plurality of latch modes include an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, a stream mode for performing latch-related operations based on data streams, and/or a page mode for performing latch-related operations based on cache pages.
• Clause A4. The method of Clause A3, wherein in the instruction mode, the latch-related request is determined according to the hardware instruction; in the page mode, the latch-related request is determined according to a cache page configuration; and in the window mode or the stream mode, the latch-related request is determined according to a lock window.
• Clause A5. The method of Clause A4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can be accompanied by a Lock attribute, the Lock attribute being used to indicate that specific data is retained in the latch area, the specific data being a part of the data selected according to a hash algorithm.
• Clause A6. The method of Clause A3 or A4, wherein in the page mode, the method comprises: performing the cache-page-based latch operation according to the linear mapping window of the system memory management unit.
• Clause A7. The method of Clause A3, wherein configuring the latch area to support the plurality of latch modes comprises:
• Clause A8. The method of Clause A7, wherein for a write operation to the latch area, the method includes latching the data, or a selected portion of the data, in a specified area of the latch area according to the latch-related request, for subsequent multiple reads.
• Clause A9. The method of Clause A7, wherein for a read operation of the latch area, the method comprises, after performing the read operation, releasing the data, or the selected portion of the data, from the specified area of the latch area according to the latch-related request.
• Clause A10. A cache memory, comprising: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes; and a latch execution module configured to receive a latch-related request and, according to that request, perform a latch-related operation on the data in the latch area in the corresponding latch mode.
• Clause A11. A system-on-chip, comprising: the cache memory of Clause A10; and a processor configured to generate the latch-related request; wherein the latch execution module of the cache memory is configured to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request.
• Clause A12. The system-on-chip of Clause A11, wherein the latch modes comprise an instruction mode, and in the instruction mode, the processor is configured to generate the latch-related request according to a received hardware instruction.
• Clause A13. The system-on-chip of Clause A11, wherein the latch modes comprise a page mode, and in the page mode, the processor is configured to generate the latch-related request according to a cache page configuration.
• Clause A14. The system-on-chip of Clause A11, wherein the processor comprises a task scheduler including a configurator and a scheduling unit, wherein: the configurator is configured to generate a configuration instruction according to an assigned configuration task and send it to the configuration module of the cache memory; and the scheduling unit is configured to schedule the multiple tasks in the task scheduler and send them to the processor core.
• Clause A15. The system-on-chip of Clause A14, wherein the configuration instruction includes configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
• Clause A16. The system-on-chip of Clause A15, wherein the processor further comprises a system memory management unit configured to, in the window mode or the stream mode:
• Clause A17. The system-on-chip of Clause A16, wherein the configuration items of the lock window include one or more of the following: a base address of the window, corresponding to the start address of the data on which latch-related operations are to be performed; a size of the window, corresponding to the size of that data; and a latch ratio, indicating the proportion of data to be actually latched among the data on which latch-related operations are to be performed.
• Clause A18. The system-on-chip of Clause A17, wherein the processor is further configured to use a hash algorithm to select, from the data whose access addresses fall within the address range of the lock window, the portion of the data that can be locked in the latch area.
• Clause A19. The system-on-chip of Clause A17, wherein the processor is configured to randomly select, according to a hash algorithm, the portion of data satisfying the predetermined latch ratio from the data to be latched, and to generate the latch-related request for latching that portion in the latch area.
• Clause A20. The system-on-chip of Clause A14, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to latch the written data, or a selected portion of it, in a specified area of the latch area according to the latch-related request; and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data, after the read operation is performed, from the specified area of the latch area according to the latch-related request.
• Clause A21. The system-on-chip of any one of Clauses A16-A20, wherein the tasks include producer kernels and consumer kernels, wherein: when executing a producer kernel, the processor is configured to latch the data output by the producer kernel in the latch area through the latch-related request, for use by a consumer kernel; and when executing a consumer kernel, the processor is configured to read data from the latch area and, after reading the data, to unlock the data from the latch area through the latch-related request, so as to release the storage space occupied by the data in the latch area.
• Clause A23. A computing device comprising the board of Clause A22.
• Clause B1. A method for a system-on-chip, the system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a plurality of processor cores for performing the computational operations, the method comprising: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and using the cluster storage area to perform operations of the cluster.
• Clause B2. The method of Clause B1, wherein using the cluster storage area to perform operations of the cluster comprises using the cluster storage area for inter-cluster communication.
• Clause B3. The method of Clause B2, wherein using the cluster storage area for inter-cluster communication comprises: using the cluster storage area to implement point-to-point communication between clusters; or using the cluster storage area to implement broadcast communication from one of the multiple clusters to the other clusters.
• Clause B4. The method of Clause B3, wherein using the cluster storage area to implement point-to-point communication between clusters comprises: receiving a write operation of a first cluster for writing data; and sending the write data to a second cluster in response to a read operation by the second cluster.
• Clause B5. The method of Clause B4, wherein in the write operation, the method further comprises: receiving a lock indication for making the write data reside in the cluster storage area; and residing the write data in the cluster storage area based on the lock indication.
• Clause B6. The method of Clause B4 or B5, wherein in the read operation, the method further comprises: receiving a read invalidation indication that causes the write data not to be written back to the off-chip memory; and invalidating a cache line associated with the write data based on the read invalidation indication after the write data is sent to the second cluster.
• Clause B7. A system-on-chip, comprising: a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing computational operations; and a cache memory interconnected with the plurality of clusters and configured to: use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform operations of the clusters.
• Clause B8. The system-on-chip of Clause B7, wherein the cluster storage area is configured for inter-cluster communication.
• Clause B9. The system-on-chip of Clause B8, wherein the cluster storage area is configured for point-to-point communication between clusters or for broadcast communication from one of the plurality of clusters to the remaining clusters.
• Clause B10. The system-on-chip of Clause B9, wherein in the point-to-point communication, the cluster storage area is configured to: receive a write operation of a first cluster for writing data; and send the write data to a second cluster in response to a read operation by the second cluster.
• Clause B11. The system-on-chip of Clause B10, wherein the second cluster is configured to perform the read operation on the cluster storage area.
• Clause B12. The system-on-chip of Clause B10, wherein in the write operation, the first cluster is configured to send to the cluster storage area a lock indication for making the write data reside in the cluster storage area, so that the cluster storage area makes the write data reside based on the lock indication.
• Clause B13. The system-on-chip of Clause B12, wherein in the read operation, the second cluster is configured to send to the cluster storage area a read invalidation indication that causes the write data not to be written back to the off-chip memory, so that the cluster storage area invalidates the cache line associated with the write data based on the read invalidation indication.
• Clause B14. A computing device comprising the system-on-chip of any one of Clauses B7-B13.
• Clause B15. A board comprising the computing device of Clause B14.
• Clause B16. A computing device comprising the board of Clause B15.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method for a cache memory, a cache memory, a system on chip, a board, and a computing device. The computing device is embodied by a computing processing means comprised in a combined processing means (20); the combined processing means (20) may also comprise a universal interconnection interface and other processing means. The computing processing means interacts with other processing means to jointly complete a computing operation specified by a user. The combined processing means (20) may also comprise a storage means, and the storage means is separately connected to the computing processing means and other processing means to store data of the computing processing means and other processing means. The combined processing means (20) can improve use efficiency of the cache memory.

Description

Method for cache memory and related products

Cross References to Related Applications

This disclosure claims priority to the following Chinese patent applications: Chinese patent application No. 202110926703.7, entitled "Method for System-on-Chip and Related Products", filed on August 12, 2021; and Chinese patent application No. 202110926707.5, entitled "Method for Cache Memory and Related Products", filed on August 12, 2021.

Technical Field

The present disclosure generally relates to the field of chip technology. More specifically, the present disclosure relates to a method for a cache memory, a cache memory, a system-on-chip including the cache memory, a board including the system-on-chip, and a computing device including the board.

Background

The operational performance of a computing system is largely determined by the average memory access latency. System performance can be significantly improved by effectively reducing the number of memory accesses through increasing the hit rate of the cache memory (referred to as the "cache"). To this end, processors typically employ a cache mechanism and use the cache to accommodate the mismatch in speed and performance between the processor and slow main memory. Current caches implement a multi-level cache mechanism, such as a three-level cache (L1, L2, and L3), and the cache closest to main memory is called the last level cache ("Last Level Cache", LLC). In view of the frequent use and important role of the cache in systems-on-chip, an effective management strategy is needed to improve cache utilization and reduce the number of accesses to main memory. In addition, how to expand the application of the LLC for different scenarios has also become a problem to be solved.

Summary

In view of the technical problems mentioned in the background section above, the present disclosure provides a residency scheme for a cache memory. Through the solution of the present disclosure, a specific area in the cache memory can be configured as a locked area, and data used multiple times can reside in the locked area, thereby improving the cache hit rate and the overall performance of the system. Based on this, the present disclosure provides solutions for a cache memory in the following aspects.

In a first aspect, the present disclosure provides a method for a cache memory, including: configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch area; receiving a latch-related request for performing a latch-related operation on the data in the latch area; and performing, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
In a second aspect, the present disclosure provides a cache memory, including: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch area; and a latch execution module configured to: receive a latch-related request for performing a latch-related operation on the data in the latch area; and perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.

In a third aspect, the present disclosure provides a system-on-chip, including the cache memory as described above and in the following embodiments, and a processor configured to generate the latch-related request, wherein the latch execution module of the cache memory is configured to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request.

In a fourth aspect, the present disclosure provides a board, including the system-on-chip as described above and in the following embodiments.

In a fifth aspect, the present disclosure provides a computing device, including the board as described above and in the following embodiments.

According to the solutions provided in the above aspects of the present disclosure, the latch area can be used to perform latch and unlock operations on data used multiple times, thereby significantly improving the cache hit rate. Further, since the latch area of the present disclosure supports multiple latch modes, which can be selected and used according to the configuration, the application scenarios of the latch area are expanded. When used in the scenario of a producer kernel and a consumer kernel, the latch area of the present disclosure can serve as a medium for data transfer, thereby improving the accessibility and utilization of data. In addition, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the computing system.

In addition, in view of the technical problems mentioned in the background section above, the present disclosure proposes to expand the usage scenarios of the cache memory. To this end, the present disclosure provides solutions for a system-on-chip in the following aspects.

In a sixth aspect, the present disclosure provides a method for a system-on-chip, the system-on-chip including at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the computational operations. The method includes: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and using the cluster storage area to perform operations of the cluster.

In a seventh aspect, the present disclosure provides a system-on-chip, including: a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing computational operations; and a cache memory interconnected with the plurality of clusters and configured to: use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform the operations of the cluster.

In an eighth aspect, the present disclosure provides a computing device, including the system-on-chip as described above and in the following embodiments.

In a ninth aspect, the present disclosure provides a board, including the computing device as described above and in the following embodiments.

In a tenth aspect, the present disclosure provides a computing device, including the board as described above and in the following embodiments.

According to the solutions provided in the above aspects of the present disclosure, the latch area of the cache memory can be used to realize efficient inter-cluster communication within the system-on-chip. Thus, data that would otherwise need to be transferred through the off-chip memory can be transferred directly through the latch area, thereby speeding up data access and significantly improving the cache hit rate. Further, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the system-on-chip. In addition, the division of the latch area simplifies the management of the cache memory and expands its usage scenarios. With the help of the latch area, the multiple clusters of the system-on-chip can implement various flexible communication mechanisms, thereby also improving the operational performance of the clusters.
Description of Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of illustration and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts, wherein:

FIG. 1 is a structural diagram of a board according to an embodiment of the present disclosure;

FIG. 2 is a structural diagram of an integrated circuit device according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of the internal structure of a single-core computing device according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the internal structure of a processor core according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of a method for a cache memory according to an embodiment of the present disclosure;

FIG. 7 is a simplified block diagram of a cache memory according to an embodiment of the present disclosure;

FIG. 8 is a simplified block diagram of a system-on-chip according to an embodiment of the present disclosure;

FIG. 9 is a detailed block diagram of a system-on-chip according to an embodiment of the present disclosure;

FIG. 10 is a schematic block diagram of a page mode according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram of a hash operation in window mode according to an embodiment of the present disclosure;

FIG. 12 is a simplified block diagram of a system-on-chip according to an embodiment of the present disclosure;

FIG. 13 is a flowchart of a method for a system-on-chip according to an embodiment of the present disclosure; and

FIG. 14 is an operation block diagram of a system-on-chip according to an embodiment of the present disclosure.
Detailed Description

The following will clearly and completely describe the technical solutions in the embodiments of the present disclosure with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some of the embodiments of the present disclosure rather than all of them, and the described embodiments can be appropriately combined according to the scenario to achieve different applications. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without making creative efforts fall within the protection scope of the present disclosure.

It should be understood that the terms "first", "second", and "third" that may be used in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific sequence. The terms "comprising" and "including" used in the specification and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in the specification of the present disclosure is for the purpose of describing specific embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".

Specific embodiments of the present disclosure will be described in detail below in conjunction with the accompanying drawings.
FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 1 are only an example and are not intended to limit the solution of the present disclosure in any respect.

As shown in FIG. 1, the board 10 includes a chip 101, which may be a system-on-chip (System on Chip, SoC), that is, the system-on-chip described in the context of the present disclosure. In an implementation scenario, it may be integrated with one or more combined processing devices. The aforementioned combined processing device may be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining; deep learning technology in particular is widely applied in the field of cloud intelligence. A notable feature of cloud intelligence applications is the large amount of input data, which places high requirements on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and powerful computing power.

As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102. According to different application scenarios, the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. The data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102. The computation result of the chip 101 can be sent back to the external device 103 via the external interface device 102. According to different application scenarios, the external interface device 102 may have different interface forms, such as a PCIe interface.

The board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transmission. The control device 106 in the board 10 may be configured to regulate the state of the chip 101. To this end, in an application scenario, the control device 106 may include a micro controller unit (MCU).

FIG. 2 is a structural diagram of the combined processing device in the chip 101 according to the above embodiment. As shown in FIG. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.

The computing device 201 may be configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning computations, and it can also interact with the processing device 203 through the interface device 202 to jointly complete the operations specified by the user.

The interface device 202 can be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read the data in the storage device of the computing device 201 and transmit it to the processing device 203.

As a general-purpose processing device, the processing device 203 performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. According to different implementations, the processing device 203 may be one or more types of processors among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number can be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure alone can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.

The DRAM 204 is used to store the data to be processed; it is a DDR memory, usually 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core. The single-core computing device 301 is used to process input data for computer vision, speech, natural language, data mining, and the like, and includes three modules: a control module 31, an operation module 32, and a storage module 33.

The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks, and it includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is used to obtain instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.

The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution. The storage module 33 is used to store or transport related data and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep learning network; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.

FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core device. The multi-core computing device 41 adopts a hierarchical design. As a system-on-chip, it includes at least one cluster according to the present disclosure, and each cluster in turn includes multiple processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip/cluster/processor-core hierarchy. At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnection module 403, a synchronization module 404, and multiple clusters 405.

There may be multiple external storage controllers 401 (two are exemplarily shown in the figure), which are used to access the external storage device, that is, the off-chip memory in the context of the present disclosure (for example, the DRAM 204 in FIG. 2), in response to access requests issued by the processor cores, so as to read data from or write data to the off-chip memory. The peripheral communication module 402 is used to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to execute tasks. The on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402, and the multiple clusters 405 to transmit data and control signals among the modules. The synchronization module 404 is a global barrier controller (GBC) used to coordinate the work progress of each cluster and ensure the synchronization of information. The multiple clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41. Although four clusters are exemplarily shown in FIG. 4, with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In an application scenario, the clusters 405 can be used to efficiently execute deep learning algorithms.

At the cluster level, as shown in FIG. 4, each cluster 405 may include multiple processor cores (IPU cores) 406 and one storage core (MEM core) 407, which may include, for example, the cache memory (e.g., the LLC) described in the context of the present disclosure.

Four processor cores 406 are exemplarily shown in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in FIG. 5. Each processor core 406 is similar to the single-core computing device 301 of FIG. 3 and may likewise include three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, operation module 52, and storage module 53 are roughly the same as those of the control module 31, operation module 32, and storage module 33, and will not be repeated here. It should be noted that the storage module 53 may include an input/output direct memory access module (IODMA) 533 and a move direct memory access module (MVDMA) 534. The IODMA 533 controls memory access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control memory access between the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is mainly used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, executing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the memory core 407 may have scalar operation capability to perform scalar operations.
The memory core 407 may include a static random-access memory (SRAM) 408, a broadcast bus 409, a cluster direct memory access module (CDMA) 410 and a global direct memory access module (GDMA) 411. In one implementation scenario, the SRAM 408 can assume the role of a high-performance data transfer station. Data reused among different processor cores 406 within the same cluster 405 thus does not need to be obtained by each processor core 406 separately from the DRAM 204, but is relayed among the processor cores 406 through the SRAM 408. Further, the memory core 407 only needs to quickly distribute the reused data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and significantly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409, the CDMA 410 and the GDMA 411 are used, respectively, for communication among the processor cores 406, communication among the clusters 405, and data transmission between the clusters 405 and the DRAM 204. They are described separately below.
The broadcast bus 409 is used for high-speed communication among the processor cores 406 within a cluster 405. The broadcast bus 409 of this embodiment supports inter-core communication methods including unicast, multicast and broadcast. Unicast refers to point-to-point data transmission (e.g., from a single processor core to a single processor core); multicast is a communication method that transmits one piece of data from the SRAM 408 to several specific processor cores 406; and broadcast, which transmits one piece of data from the SRAM 408 to all processor cores 406, is a special case of multicast.
The CDMA 410 controls memory accesses to the SRAM 408 among different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control memory accesses from the SRAM 408 of a cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be realized in two ways. The first way is to communicate directly between the DRAM 204 and the NRAM 531 or WRAM 532 through the IODMA 533; the second way is to first transfer data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then transfer the data between the SRAM 408 and the NRAM 531 or WRAM 532 via the MVDMA 534. Although the second way may involve more components and a longer data path, in some embodiments its bandwidth is in fact much larger than that of the first way, so performing communication between the DRAM 204 and the NRAM 531 or WRAM 532 in the second way may be more efficient. It can be understood that the data transmission methods described here are merely exemplary, and those skilled in the art can, in light of the teachings of the present disclosure, flexibly select and apply various data transmission methods according to the specific arrangement of the hardware.
In other embodiments, the function of the GDMA 411 and the function of the IODMA 533 may be integrated in the same component. Although the present disclosure treats the GDMA 411 and the IODMA 533 as different components for convenience of description, for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they belong to the protection scope of the present disclosure. Further, the function of the GDMA 411, the function of the IODMA 533, the function of the CDMA 410 and the function of the MVDMA 534 may also be realized by the same component.
The hardware architecture of the present disclosure and its internal structure have been described in detail above with reference to FIGS. 1-5. It can be understood that the above description is merely exemplary and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also make changes to the board card of the present disclosure and its internal structure, and such changes still fall within the protection scope of the present disclosure. For example, in the solutions described below, the corresponding hardware architecture may omit the CDMA 410 used to control accesses to the SRAM 408 among different clusters 405 within the same computing device 201. Instead, the solutions below involve improving and optimizing the cache memory arranged, for example, between the SRAM 408 and the DRAM 204, so as to realize efficient on-demand latching of data, as well as communication among different clusters, through the cache memory.
In order to use the cache memory (e.g., the LLC) efficiently and improve the hit rate of data accesses, the solutions below propose configuring a specific storage space in the cache memory as a latch area for data latch operations, especially for data that will be used frequently. For example, such frequently used data may be data to be reused between tasks having a data dependency. It can be understood that when data only needs to be used once, it need not be latched in the cache memory.
Further, on the basis of configuring a latch area for data latching, the solutions below also propose configuring the cache memory to support multiple latch modes, so that upon receiving a latch-related request the cache memory operates in the latch mode corresponding to that request. According to different application scenarios and requirements, the multiple latch modes of the present disclosure may have a specific priority order to satisfy different latch-related operations. In addition, in order to make the cache memory support multiple latch modes, the present disclosure also proposes several different configuration methods, so that the cache memory can be used more flexibly and can be employed to realize inter-cluster communication.
FIG. 6 is a flowchart illustrating a method 600 for a cache memory according to an embodiment of the present disclosure. As shown in FIG. 6, the method 600 includes, at step S602, configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes. In one embodiment, the aforementioned latch modes may include, but are not limited to, an instruction mode that performs latch-related operations based on hardware instructions, a window mode that performs latch-related operations based on window attributes, a stream mode that performs latch-related operations based on data streams, and/or a page mode that performs latch-related operations based on cache pages. In one embodiment, the aforementioned streams may be instruction streams or data streams of different types. Taking a data stream as an example, in a neural network application scenario the data stream may be a neuron data stream, a weight data stream, an output result data stream, etc. of the neural network model. In addition, in the context of the present disclosure, the data targeted by latch-related operations is data that will be used multiple times by a processor of the system-on-chip, and it has a relatively high priority compared with data on which no latch operation is performed. By latching (or making resident) such multiply-used data in the latch area of the present disclosure, the cache hit rate can be significantly increased, thereby improving the overall performance of the system. Moreover, keeping reused data resident in the latch area of the LLC reduces read and write operations between the system-on-chip and the off-chip memory (e.g., DDR or DRAM), which also improves memory access efficiency.
In one application scenario, the above latch modes can be given different priorities according to user preferences or system preferences. For example, in one implementation the priority order from high to low may be instruction mode -> window mode -> stream mode -> page mode; in another implementation the priority order may be instruction mode -> page mode -> stream mode -> window mode. Through such multi-mode and priority settings, the latch area in the cache memory can be used in more ways, increasing the flexibility of its use to cope with different application scenarios and system requirements. Further, the latch modes may be traversed in the above priority order: when a higher-priority latch mode is disabled, a lower-priority latch mode may be adopted, as illustrated in the sketch below.
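As a minimal sketch of such priority traversal (all names here are hypothetical and not part of the disclosed hardware), the following C snippet selects the highest-priority enabled latch mode under the first ordering described above:

```c
#include <stdbool.h>

/* Hypothetical enumeration of the four latch modes, listed in the
 * first priority ordering: instruction > window > stream > page. */
typedef enum {
    MODE_INSTRUCTION = 0,
    MODE_WINDOW,
    MODE_STREAM,
    MODE_PAGE,
    MODE_COUNT,
    MODE_NONE = -1          /* no mode enabled: fall back to normal caching */
} latch_mode_t;

/* Walk the modes in priority order; a disabled high-priority mode
 * simply yields to the next lower-priority one. */
latch_mode_t select_latch_mode(const bool enabled[MODE_COUNT])
{
    for (int m = MODE_INSTRUCTION; m < MODE_COUNT; ++m) {
        if (enabled[m])
            return (latch_mode_t)m;
    }
    return MODE_NONE;
}
```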
In one embodiment, the specific storage space can be configured as a latch area supporting a corresponding latch mode according to one of several received configuration instructions. In one scenario, the configuration instruction may include one or more configuration items for configuring the latch area. For example, the configuration items may include items for enabling the latch area, disabling the latch area and/or setting the size of the latch area. Further, a corresponding latch policy (for example, the amount of data to be latched or the specific data to be latched) can be configured in the aforementioned instruction mode, window mode, stream mode or page mode, so as to latch different types of, or specific, instructions, data or data streams. Configuring the corresponding latch policy in the different modes is described in detail below. Through such enabling, disabling and various specific configurations, the solution of the present disclosure allows the cache memory to be used flexibly, so that it can operate in one of the latch modes of the present disclosure, or in the normal mode, as required.
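Purely for illustration, a configuration instruction carrying such items might be modeled as the following C structure; the field names are assumptions, not the actual instruction encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a latch-area configuration instruction. */
typedef struct {
    bool     lock_enable;      /* enable the latch area                    */
    bool     lock_disable;     /* disable the latch area                   */
    uint8_t  num_locked_ways;  /* latch-area size, expressed in cache ways */
    uint32_t policy_bytes;     /* latch policy: amount of data to latch    */
} latch_area_config_t;

/* A valid configuration keeps the latch area strictly smaller than the
 * whole cache, e.g. way0-way5 latched with way6-way7 left as normal cache. */
static inline bool latch_config_valid(const latch_area_config_t *cfg,
                                      uint8_t total_ways)
{
    return cfg->num_locked_ways < total_ways;
}
```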
Returning to the flowchart of FIG. 6, after the configuration operation at step S602 is completed, at step S604 a latch-related request for performing a latch-related operation on data in the latch area is received. According to an embodiment of the present disclosure, the latch-related request may be triggered by an operation intended to make specific data reside in the latch area. Alternatively, the latch-related request may be triggered by an operation intended to remove or release specific data from the latch area. As described in detail above, when operating in different latch modes, the latch-related requests of the present disclosure may also take different forms or carry different contents. For example, for the instruction mode, the window mode or the stream mode, the latch-related request may include a configuration item for indicating a behavior attribute of the cache memory.
In one embodiment, the above configuration item for indicating a behavior attribute of the cache memory includes at least one of the following configuration attributes:
Transient attribute: do not cache in the LLC, i.e., read and write data directly with the off-chip memory (e.g., DDR); used so that data accessed only once is not cached in the LLC, thereby avoiding occupying LLC resources;
Lock attribute: make specific data reside in the latch area, and read and write data from the hit cache line. If the cache line belongs to the latch area, the cache line attribute is configured as the persisting attribute; if the cache line does not belong to the latch area, its attribute is unchanged, i.e., it keeps the normal attribute described below. It should be clear that a cache line in the latch area has one of two attributes, namely the persisting attribute and the normal attribute; a cache line in the latch area with the persisting attribute can only be accessed and replaced by latch-related requests carrying the Lock attribute;
Unlock attribute: after reading or writing data from the hit cache line, release the corresponding storage space of the data in the latch area of the LLC, and set the attribute of the corresponding cache line in the latch area to the normal attribute described below;
Normal attribute: a request cached normally in the LLC, which can read and write data directly with the off-chip memory;
Invalid attribute: invalidate the data immediately after reading, so that it is not written back to the off-chip memory upon replacement;
Clean attribute: when performing a write operation, the data can be written into the hit cache line and the stored content of the entire cache written back to the off-chip memory, with the attribute of the cache line unchanged; during a read operation, data is read from the hit cache line, and when the hit cache line is dirty, it is written back to the off-chip memory;
Default attribute: this default item can be used to indicate that the configuration regarding the latch mode is ignored.
By attaching the above exemplary configurable attributes to latch-related requests, the solution of the present disclosure can perform the corresponding latch-related operations in the instruction mode according to these attached attributes.
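The following C sketch, using invented names, summarizes how a cache policy module might act on these request attributes; it is a schematic of the behaviors listed above, not the actual hardware logic:

```c
#include <stdbool.h>

typedef enum {           /* behavior attribute carried by a request */
    ATTR_TRANSIENT,      /* bypass the LLC, access off-chip memory directly */
    ATTR_LOCK,           /* make the data reside in the latch area          */
    ATTR_UNLOCK,         /* release the latched space after the access      */
    ATTR_NORMAL,         /* ordinary caching                                */
    ATTR_INVALID,        /* invalidate after reading, never write back      */
    ATTR_CLEAN,          /* write back dirty content, keep line attributes  */
    ATTR_DEFAULT         /* ignore the latch-mode configuration             */
} req_attr_t;

typedef enum { LINE_NORMAL, LINE_PERSISTING } line_state_t;

/* On a hit in the latch area, a Lock request pins the line, and an
 * Unlock request returns it to the normal state. */
static void update_line(line_state_t *line, req_attr_t attr, bool in_latch_area)
{
    if (attr == ATTR_LOCK && in_latch_area)
        *line = LINE_PERSISTING;   /* only Lock requests may access/replace it now */
    else if (attr == ATTR_UNLOCK && *line == LINE_PERSISTING)
        *line = LINE_NORMAL;       /* free the space for later latching            */
}
```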
As another example, for the page mode, a latch-related request may indicate that data related to a specific page is to be latched in the latch area for subsequent multiple uses, or may indicate that data related to a specific page, after multiple uses, is to be unlocked from the latch area to free more storage space for subsequent data latching. It can be understood that through the release operation the storage space of the latch area can be used flexibly, which improves the utilization efficiency of the latch area of the present disclosure.
Returning to the flow of FIG. 6, in response to the latch-related request of step S604, at step S606 a latch-related operation can be performed on the data in the latch area in the corresponding latch mode according to the latch-related request. According to an embodiment of the present disclosure, the aforementioned latch-related operations may include read operations and write operations directed at the latch area. In one implementation, for a write operation directed at the latch area, the method 600 may further include latching the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request, for subsequent multiple reads. In another implementation, for a read operation directed at the latch area, the method 600 may further include, after the read operation has been performed, releasing the data or a selected portion of the data from the designated region of the latch area according to the latch-related request.
Regarding the aforementioned selected portion of data, in one embodiment a predetermined proportion of the data can be selected randomly to form this portion to be latched in the latch area. In another embodiment, a hash algorithm can be used to select a predetermined proportion of the data as this portion to be latched in the latch area. In a further embodiment, when the access address of the data on which a latch-related operation is to be performed falls within the address range of the lock window, the aforementioned hash algorithm can be used to select the portion of the data that can be latched in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
With the method described above in conjunction with FIG. 6, the solution of the present disclosure enables the cache memory to support multiple latch modes, which expands the application scenarios of the cache memory and significantly improves the cache hit rate. Further, the introduction of multiple latch modes makes the use of the latch area more flexible and adaptable, satisfying different application scenarios and user requirements. In addition, effectively latching data in the latch area also promotes the sharing of data between a producer kernel and one or more consumer kernels, improving data accessibility and utilization. The producer kernel and the consumer kernel here can be understood as two tasks having a dependency, where the output of the producer kernel is passed as input to the consumer kernel so that the consumer kernel can use this input to complete its task. In this case, since the output of the producer kernel serves as the input of subsequent operations, it can be treated as data that will be used multiple times later and temporarily stored in the latch area of the cache memory, so that the consumer kernel can obtain this input directly from the cache memory without accessing the off-chip memory. This reduces the memory access interactions between the artificial intelligence processor and the off-chip memory and lowers the IO access overhead, which in turn improves the processing efficiency and performance of the artificial intelligence processor.
FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the present disclosure. It can be understood that the cache memory 700 shown in FIG. 7 may be the cache memory described in conjunction with FIG. 6, so the description of the cache memory in FIG. 6 also applies to the description of FIG. 7 below.
As shown in FIG. 7, the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. The cache memory 700 further includes storage space for performing cache operations; for example, as shown in the figure, the storage space is evenly divided into eight ways (way0-way7), each of which includes a number of cache lines.
In one embodiment, the above configuration module can be used to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, where the size of the specific storage space is smaller than the total storage size of the cache memory. For example, way0-way5 in FIG. 7 can be configured as the specific storage space supporting latching, while way6-way7 in FIG. 7 keep the ordinary attributes of the cache memory, i.e., are used as a general cache. As mentioned above, the latch mode may be the instruction mode, the window mode, the stream mode and/or the page mode. Further, the latch execution module can be used to receive latch-related requests for performing latch-related operations on data in the latch area, and then perform those operations on the data in the latch area in the corresponding latch mode according to the request. As described above, the latch-related operations here may include a write operation directed at the latch area (i.e., writing data into the latch area) or releasing data from the latch area. For example, when a consumer kernel has finished using the data in the latch area and the data will no longer be used by other consumer kernels, the space storing that data in the latch area can be released for latching other data.
FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the present disclosure. As shown in FIG. 8, the system-on-chip 800 of the present disclosure may include the cache memory 700 shown in FIG. 7 and a processor (or processor core) 802. In one implementation, the latch execution module of the cache memory can be used to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request. The cache memory 700 has been described above in conjunction with FIGS. 6 and 7 and is not repeated here. As for the processor 802, according to the solution of the present disclosure it may be any of various types of processors and may include one or more processor cores to generate latch-related requests. In operation, the latch execution module of the cache memory performs latch-related operations on data in the latch area in the corresponding latch mode according to the generated latch-related request. For example, when the latch mode is the instruction mode, the processor can generate latch-related requests according to received hardware instructions. As another example, when the latch mode is the page mode, the processor can generate latch-related requests according to the cache page configuration. As yet another example, when the latch mode is the window mode or the stream mode, the processor can configure a lock window and generate latch-related requests according to the lock window.
According to different implementations, the processor 802 may also be an intelligent processor or intelligence processing unit ("IPU") including multiple computing cores, which can be configured to perform computations in various artificial intelligence fields (for example, neural networks).
FIG. 9 is a detailed block diagram illustrating a system-on-chip 900 according to an embodiment of the present disclosure. It can be understood that the system-on-chip 900 shown here may be a specific implementation of the system-on-chip shown in FIG. 8, so the description of FIG. 8 also applies to FIG. 9. Further, for the purpose of example only, the operation of the system-on-chip 900 will be described in the window mode (or stream mode) among the multiple latch modes.
As shown in FIG. 9, the system-on-chip 900 may include a task scheduler ("Job Scheduler") 902, which includes a scheduling unit 903 and a configurator 904. In one embodiment, the configurator 904 can generate configuration instructions according to assigned configuration tasks (obtained, for example, from a task queue) and send them to the configuration module (e.g., the CLR) of the cache memory (i.e., the "LLC" 906). In one embodiment, the scheduling unit 903 can schedule multiple tasks in the task scheduler (i.e., the "kernels" to be executed on the artificial intelligence processor) and send them to the intelligent processor (IPU) 905 of the system-on-chip of the present disclosure. In the solution of the present disclosure, the intelligent processor 905 may include multiple processor cores, which may form a cluster as shown in FIG. 4. In one implementation scenario, in such a multi-processor-core architecture, the scheduling unit can assign tasks to suitable processor cores according to the idleness (e.g., utilization) of the processor cores.
Further, the system-on-chip 900 also includes a system memory management unit ("SMMU"), which converts the virtual addresses of accessed data into physical addresses so that the relevant storage locations can be accessed according to the physical addresses. In one implementation, the system memory management unit includes a translation lookaside buffer (TLB). The TLB maintains a page table that includes at least one page table entry, each entry including a page and the page frame corresponding to that page. In operation, the system memory management unit can determine the page corresponding to a received virtual address, and then determine the physical address (PA) corresponding to the virtual address through the page-to-page-frame mapping, so that access to the relevant storage location of the cache memory can be realized according to that physical address.
In one embodiment, access to the cache memory can be realized through the above window mode or stream mode. In this case, the intelligent processor can obtain a parameter table from the memory, configure, according to the parameter table, a lock window ("Lock window") associated with the data on which latch-related operations are to be performed, and generate latch-related requests (i.e., IO access requests carrying, for example, lock/unlock attributes) according to the configured lock window. The SMMU can then perform latch-related operations on the LLC according to the IO access request. Specifically, the SMMU can send the IO access request to the cache policy module 907 of the LLC 906 (which performs the same operations as the latch execution module 702 in FIG. 7) for execution. In one implementation, the parameter table may include parameter items for configuring the lock window or the stream latch attributes in the stream mode. For example, the parameter items may include, but are not limited to, the lock/unlock window, per-stream lock/unlock, the lock ratio ("Lock Ratio"), the lock window flag, and other information. In one implementation scenario, the parameters in the parameter table can be user-defined. The relevant parameters can thus be obtained during the run phase of the program and the parameter table stored in the memory (e.g., DDR) for use by the intelligent processor (the IPU 905 in the figure) during the execution phase.
In one implementation, the above lock window represents the storage space that the software user wishes to latch, and the size of the lock window may be larger than the size of the latch area in the cache memory. The lock window includes one or more of the following: the base address and the size of the window, where the base address of the window may be a virtual address ("Virtual Address", "VA") configured by upper-layer software. The base address of the window corresponds to the starting address of the data on which latch-related operations are to be performed, and the size of the window may correspond to the size of the data to be latched.
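As a minimal sketch with assumed field names, a lock window and its hit test could be written in C as follows:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock-window descriptor: a virtual base address and a
 * size covering the data to be latched. */
typedef struct {
    uint64_t base_va;   /* starting virtual address of the data */
    uint64_t size;      /* size of the data to be latched       */
    bool     enabled;   /* lock window flag                     */
} lock_window_t;

/* A request hits the window when its access address falls inside
 * [base_va, base_va + size). */
static inline bool window_hit(const lock_window_t *w, uint64_t va)
{
    return w->enabled && va >= w->base_va && va < w->base_va + w->size;
}
```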
Specifically, in the window mode, the intelligent processor can determine, according to a task issued by the task scheduler, the access address of the data in the task (which may be a virtual address), and compare this access address with the address range defined by the lock window. If the access address of the data in the task falls within the address range of the lock window, the lock window is hit and can be enabled (e.g., "Enabled"). Otherwise, if the access address falls outside the address range of the lock window, the lock window is missed; in this case the lock window can be ignored, which means the data in the task will not be temporarily stored in the cache memory. Further, when the access address of the data hits the lock window, a hash algorithm can be used to select a predetermined proportion of the data as the aforementioned portion to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11. Afterwards, the intelligent processor can send a latch-related request carrying the Lock attribute to the cache memory LLC through the SMMU, where this request indicates that specific data is to reside in the latch area, the specific data being the portion selected according to the hash algorithm.
The latch process and the release process of the LLC in the window mode are described below in conjunction with FIG. 9.
LLC residency (or locking) process:
Step 1: the task scheduler, with the aid of the configurator, configures the LLC (e.g., via the cache policy module) to enable the lock area ("Lock enable") or disable it ("Lock disable") and to set the size of the lock area, i.e., the number of ways ("Way") shown in the figure (e.g., Way0-Way7).
Step 2: the task scheduler issues a task kernel to the IPU;
Step 3: the IPU obtains the lock window flag ("lock window flag") from the parameter table, then reads and configures the lock window. In one implementation scenario, the parameter table can be configured by software and stored at a storage address in the off-chip dynamic random access memory ("DRAM"). The task scheduler can then transmit this address to the IPU, and the IPU can read the parameter table according to the address to complete the configuration of the lock window.
Step 4: the IPU generates a latch-related request through the memory management unit SMMU; when sending the request to the cache policy module of the LLC, the request can carry the lock attribute according to the lock window information.
Step 5: after receiving the latch-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of that cache line (i.e., of the latch area), for example setting it to "persisting" as described above.
LLC de-residency (or release) process:
Step 6: the task scheduler issues a kernel to the IPU;
Step 7: the IPU obtains the unlock window flag from the parameter table, then reads and configures the unlock window;
Step 8: when the IPU issues a request, it attaches the unlock ("unlock") attribute according to the unlock window information;
Step 9: after receiving the request with the unlock attribute, the cache policy module of the LLC switches the hit cache lines having the lock attribute to the normal attribute, such as the Normal attribute described above in conjunction with the instruction mode;
Step 10: the task scheduler, with the aid of the configurator, disables the latch area through the CLR module (i.e., LLC lock disable). In one implementation scenario, the CLR module can clear the previous lock attribute configuration according to the instruction of the configurator.
The latch scheme of the system-on-chip of the present disclosure in the window mode has been described in detail above in conjunction with FIG. 9. Through such latch operations, the probability of cache hits can be significantly increased, the utilization efficiency of the cache memory is improved, and the application scenarios are expanded.
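To make the ten steps above concrete, the following self-contained C sketch strings them together as stubbed host-side calls; every function here is invented purely for demonstration and merely logs the step it stands for:

```c
#include <stdio.h>

static void llc_configure(const char *what)     { printf("LLC: %s\n", what); }
static void dispatch_kernel(const char *kernel) { printf("scheduler -> IPU: %s\n", kernel); }
static void configure_window(const char *which) { printf("IPU: configure %s from param table\n", which); }
static void smmu_request(const char *attr)      { printf("SMMU -> LLC: request with %s attribute\n", attr); }

int main(void)
{
    /* Residency (steps 1-5) */
    llc_configure("lock enable, 6 ways");  /* step 1 */
    dispatch_kernel("producer kernel");    /* step 2 */
    configure_window("lock window");       /* step 3 */
    smmu_request("Lock");                  /* steps 4-5: lines become persisting */

    /* Release (steps 6-10) */
    dispatch_kernel("consumer kernel");    /* step 6 */
    configure_window("unlock window");     /* step 7 */
    smmu_request("Unlock");                /* steps 8-9: lines return to normal  */
    llc_configure("lock disable (CLR)");   /* step 10 */
    return 0;
}
```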
The embodiments of the present disclosure also support latch-related operations in the stream mode. When the enable bit corresponding to a data stream in a task of the present disclosure is low, this is regarded as the default case, i.e., no latch-related operation is performed in the stream mode. Conversely, when the enable bit is high, the corresponding latch-related operations can be performed on the data stream in the stream mode. Specifically, the window mode and the stream mode of the present disclosure operate similarly: a hash algorithm together with the lock ratio of the data stream can be used to select a predetermined proportion of the data from the stream as the aforementioned portion to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
As mentioned above, in one embodiment the present disclosure also supports latch-related operations in the page mode, which is described below in conjunction with FIG. 10.
FIG. 10 is a schematic block diagram illustrating the page mode according to an embodiment of the present disclosure. As shown in FIG. 10, according to the solution of the present application, cache pages can be directly configured to carry the lock attribute of the present disclosure, so that cache pages that have a mapping relationship with the memory can be used for shared data access among multiple kernels (kernels 0-2 shown in the figure). In one implementation, a programmer can use an instruction (e.g., Malloc) to mark a cache page with the lock attribute. When a kernel accesses a cache page marked as locked, the SMMU can lock the data corresponding to that cache page in the latch area of the present disclosure. Then, when a subsequent kernel needs to access the same cache page again, it can read the previously locked data from the corresponding cache line of the latch area, achieving a cache hit. Through the page mode, the solution of the present disclosure thus improves the sharing and accessibility of data among multiple kernels.
Specifically, in the page mode, the software driver can directly configure, through instructions, the information in the page table of the system memory management unit ("SMMU"), and determine according to that information whether to perform page-based latch operations or normal operations. When the information in the page table indicates that the SMMU is bypassed, no latching of the cache memory is required, and the attribute of the cache lines in the cache memory can be the Normal attribute. When the information indicates that the SMMU uses linear mapping, page-based latch operations can be set up according to the SMMU linear mapping window configuration; for example, the data corresponding to the cache pages within the linear mapping window is locked in the latch area of the present disclosure. The SMMU can generate a corresponding latch-related request based on the information in the page table and send it to the LLC, and the cache policy module of the LLC can configure the cache lines of the LLC according to that request to perform the corresponding cache-related operations.
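A minimal sketch, assuming invented names for the page-table fields, of how this mapping information could select between normal operation and page-based latching:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SMMU_BYPASS, SMMU_LINEAR_MAP } smmu_mode_t;

/* Hypothetical view of the page-table information relevant to latching. */
typedef struct {
    smmu_mode_t mode;      /* bypass or linear mapping   */
    uint64_t    win_base;  /* linear mapping window base */
    uint64_t    win_size;  /* linear mapping window size */
} smmu_pt_info_t;

/* Pages inside the linear mapping window are latched; a bypassed SMMU
 * means the cache lines simply keep the Normal attribute. */
static bool page_should_latch(const smmu_pt_info_t *pt, uint64_t page_addr)
{
    if (pt->mode == SMMU_BYPASS)
        return false;
    return page_addr >= pt->win_base &&
           page_addr <  pt->win_base + pt->win_size;
}
```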
In one embodiment, the present disclosure also supports the instruction mode. In this case, the system-on-chip can configure the latch area in the LLC through memory access instructions (IO instructions) in the instruction set.
For example, an IO instruction carries at least one configuration field for latch-related attributes, so that the LLC can be configured flexibly by means of this field. Here, the various configuration fields may represent the operation behaviors that the LLC performs when data accesses to the off-chip memory (e.g., the DDR space) are executed. In one implementation scenario, the instruction includes the configuration attributes described above: the Transient attribute, the Lock attribute, the Unlock attribute, the Normal attribute, the Invalid attribute, the Clean attribute, the Default attribute, and so on. Since the instruction mode has the highest priority, when an IO memory access instruction indicates the Default attribute, the latch-related operations may instead be performed by another mode (such as the window mode, the stream mode or the page mode).
By attaching the above exemplary configurable attributes to latch-related requests, the solution of the present disclosure can perform the corresponding latch-related operations in the instruction mode according to these attached attributes.
When the task scheduler issues a task to the intelligent processor IPU, the IPU can determine the latch-related request according to the IO instructions in the task. Specifically, when the configuration field of the Lock attribute in an IO instruction is enabled, the Lock attribute can be attached to the latch-related request, so that the LLC stores the specific data in the lock area according to the request carrying the Lock attribute. When the configuration field of the Unlock attribute in an IO instruction is enabled, the Unlock attribute can be attached to the latch-related request, so that the LLC releases the lock area according to the request carrying the Unlock attribute. Depending on the application scenario, the latch-related request can similarly carry other attributes.
Further, in some operation scenarios, the instruction may also include a specific configuration field for indicating the latch ratio. When this field (for example, a specific bit inst_ratio_en) is low, the latch operation can be considered to depend on the instruction configuration alone, i.e., the latch-related request is determined according to the specific IO instruction in the task. If this bit is high, a hash algorithm can be used in comparison with the lock ratio indicated by the instruction, so that a predetermined proportion of the data is selected from the data stream as the aforementioned portion to be stored in the latch area. The specific use of the hash algorithm is described in detail below in conjunction with FIG. 11.
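For illustration, this decision might be expressed as follows in C; the field layout and names are assumptions rather than the actual instruction format:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of an IO instruction's latch-related fields. */
typedef struct {
    bool    lock_en;        /* Lock attribute configuration field   */
    bool    unlock_en;      /* Unlock attribute configuration field */
    bool    inst_ratio_en;  /* latch-ratio field enabled?           */
    uint8_t lock_ratio;     /* ratio in percent, used when enabled  */
} io_inst_t;

/* When inst_ratio_en is low, the request follows the instruction's
 * Lock field directly; when high, only the hash-selected fraction of
 * addresses (see FIG. 11) is actually latched. */
static bool should_lock(const io_inst_t *inst, uint8_t hash_percent)
{
    if (!inst->lock_en)
        return false;
    if (!inst->inst_ratio_en)
        return true;                        /* purely instruction-driven */
    return hash_percent < inst->lock_ratio; /* ratio-limited latching    */
}
```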
FIG. 11 illustrates the hash operation in the window mode or the stream mode according to an embodiment of the present disclosure. The solution of the present disclosure uses a hash operation to perform residency (i.e., locking) at a certain ratio because one of the key issues of LLC residency is the tradeoff between bandwidth and capacity. The present disclosure therefore proposes performing residency at a configurable ratio (the Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks. Assuming the preset Lock Ratio is P (e.g., expressed as a percentage), the expected bandwidth is B = 6T*P + 2T*(1-P), where 6T is the read rate of data resident in the LLC, 2T is the read rate of data stored in memory (e.g., DRAM), and T = 1000 Gbit/s. As mentioned above, the Lock Ratio can be configured in the lock/unlock window or configured for a specific data stream. In addition, although the hash operation is described below for the window mode or the stream mode, similar operations also apply to the hash operation in the instruction mode.
Specifically, in the window mode or the stream mode, the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address lies within that range. When the requested address is within the address range of the lock window, a hash operation can be performed on the hit window address range, as shown in FIG. 11. Here, the access address of each piece of data may be a virtual address.
Specifically, by means of a globally fixed hash rule, the VA of the access address can be mapped onto the hash space (the "Hash Map" in the figure), and the hash process can preferentially preserve the low-order bits of the address. The hash value obtained at 1102 can then be compared with the Lock Ratio at 1104 to randomly select the corresponding proportion of the data. Specifically, when the hash value of the access address is smaller than the lock ratio, it is considered a hit, and therefore this portion of the data (i.e., the data conforming to the proportion) can be latched in the cache memory. Conversely, when the hash value of the access address is greater than or equal to the lock ratio, it is considered a miss, and this portion of the data will not be latched in the cache memory.
For example, when the Lock Ratio is set to 10%, the portion of the data corresponding to the lowest 10% of the hash values can be selected in order, i.e., the data whose latch-address hash value is smaller than the lock ratio undergoes the latch-related operations. In other examples, of course, the lock ratio may take other values; it can be user-defined by software, and the selection operation can likewise be realized according to the setting of the hash algorithm. For example, the lock ratio may be 20%-30%, in which case the portion of the data corresponding to the lowest 20%-30% of the hash values can be selected in order for the latch-related operations. Thereafter, at 1106 the data can be processed according to the specified request type, i.e., the selected portion of the data is locked or unlocked.
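The following runnable C sketch illustrates the ratio-based selection and the bandwidth estimate; the hash function here is only a stand-in, since the disclosure requires just a globally fixed rule that preferentially preserves low-order address bits:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the globally fixed hash rule: mix the VA into a value
 * in 0..99 so it can be compared against a percentage Lock Ratio. */
static uint8_t va_hash_percent(uint64_t va)
{
    va ^= va >> 17;
    va *= 0x9E3779B97F4A7C15ULL;   /* 64-bit golden-ratio multiplier */
    return (uint8_t)(va % 100);
}

/* Hash value below the ratio: this address's data is latched. */
static bool select_for_latch(uint64_t va, uint8_t lock_ratio_percent)
{
    return va_hash_percent(va) < lock_ratio_percent;
}

int main(void)
{
    /* Expected bandwidth B = 6T*P + 2T*(1-P) with T = 1000 Gbit/s.
     * For P = 10%: B = (0.6 + 1.8) * T = 2.4T = 2400 Gbit/s. */
    double P = 0.10, T = 1000.0;
    printf("expected bandwidth: %.0f Gbit/s\n", (6.0 * P + 2.0 * (1.0 - P)) * T);
    printf("VA 0x1000 latched at 10%%? %d\n", select_for_latch(0x1000, 10));
    return 0;
}
```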
The latch scheme of the cache memory of the present disclosure has been described in detail above in conjunction with FIGS. 6-11. Based on the idea of the foregoing latch scheme, and as a supplement to it, another extended application of the present disclosure for the cache memory is described below in conjunction with FIGS. 12-14, namely how communication among clusters within a system-on-chip can be realized through the cache memory.
FIG. 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the present disclosure. In light of the foregoing description, it can be understood that the system-on-chip here may be the system-on-chip included in the computing device 201 shown in FIG. 2, for example the system-on-chip constituted by the multi-core computing device 41. As shown in FIG. 12, the system-on-chip 1200 includes four clusters, cluster 0 through cluster 3, shown by way of example. Since the clusters have been described in detail above, they are not repeated here. Further shown is a cache memory 1201, which may, for example, be arranged in the SRAM 408 shown in FIG. 5, for performing inter-cluster data transfer operations. In one implementation scenario, the cache memory 1201 can also carry out bidirectional on-chip/off-chip communication with the DRAM (e.g., DDR), including the transfer of various types of data and instructions.
FIG. 13 is a flowchart illustrating a method 1300 for a system-on-chip according to an embodiment of the present disclosure. The system-on-chip here may be the one shown in FIG. 12. Specifically, the system-on-chip includes at least multiple clusters for performing arithmetic operations and a cache memory interconnected with those clusters. In one implementation scenario, each cluster may include multiple processor cores for performing the arithmetic operations. In one implementation scenario, the latch area determined in the cache memory as described above can be used to accomplish inter-cluster data communication, so that the system-on-chip need no longer be provided with communication modules such as the CDMA 410 and the GDMA 411.
在一个实施方式中,上述锁存区可以用于在具有依赖关系的任务之间传递数据,例如,锁存区可以用于在生产者内核和消费者内核之间传递数据。具体来说,处理器可以通过配置的锁定窗口,将生产者内核需要交换给消费者内核的数据锁存在LLC中。在一个场景中,当处理器执行完生产者内核后,可以将需要传递给消费者内核的数据(其可能是生产者内核的输入数据或输出数 据)进行锁存。鉴于此,处理器可以如前所述那样通过配置的锁定窗口并且借助于例如SMMU来对LLC进行本公开的锁存相关操作,从而在窗口模式下将需要交换的上述数据锁存于LLC中,以供消费者内核稍后使用。对应地,处理器还可以根据消费者内核中配置的解锁窗口来释放锁存区,即处理器在通过对于LLC中锁存的数据执行读取操作完成消费者内核的执行时,可以释放LLC中锁存区内数据的对应存储空间。In one embodiment, the above-mentioned latch area can be used to transfer data between tasks with dependencies, for example, the latch area can be used to transfer data between a producer core and a consumer core. Specifically, the processor can lock the data that the producer core needs to exchange to the consumer core in the LLC through the configured lock window. In one scenario, after the processor finishes executing the producer kernel, it can latch the data that needs to be delivered to the consumer kernel (it may be the input data or output data of the producer kernel). In view of this, the processor can perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU, so as to latch the above-mentioned data that needs to be exchanged in the LLC in the window mode, for later use by the consumer kernel. Correspondingly, the processor can also release the latch area according to the unlock window configured in the consumer kernel, that is, when the processor completes the execution of the consumer kernel by performing a read operation on the data latched in the LLC, it can release the latch area in the LLC. The corresponding storage space of the data in the latch area.
Since the latch area can be configured to pass data between tasks with dependencies, it can also be used in inter-cluster communication scenarios. For example, one cluster or processor core of the processor transmits data through the latch area (the data may be data that a producer kernel needs to hand over to a consumer kernel) to processors in other clusters for merge processing. Processors in the other clusters read the data from the latch area for processing, thereby transferring data between clusters. The manner in which the latch area is used for inter-cluster communication is described below.
As shown in FIG. 13, the present disclosure further includes a method for inter-cluster communication using the latch area of the cache memory, the method comprising the following steps.
At step S1302, a designated storage space of the off-chip memory is mapped to a given storage area of the cache memory ("cache"), whose physical properties are the same as those of the locked area described above in conjunction with the drawings, so that the given storage area serves as a cluster storage area for inter-cluster data communication. In an implementation scenario as shown in FIG. 8, the cache memory may include the LLC and the off-chip memory may include a DDR. On this basis, the designated storage space may be the one indicated at 1402 in FIG. 14, and correspondingly the cluster storage area may be the given storage area of the cache memory indicated at 1404 in FIG. 14. In one implementation scenario, the designated DDR storage space can be specified through software configuration and mapped to the given space on the cache for communication between clusters (for example, cluster 0 and cluster 1 shown in FIG. 14). After the cluster storage area has been divided and determined, at step S1304 the determined cluster storage area can be used to perform cluster operations. A hedged sketch of the mapping configuration is given below.
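As a rough illustration of the software configuration in step S1302, the sketch below maps a designated DDR range onto the cache's latch area. The map_ddr_to_llc() driver call, the field names, and the example addresses are all assumptions made for the sake of the example, not the disclosure's actual interface.

    #include <stddef.h>
    #include <stdint.h>

    struct cluster_region {
        uint64_t ddr_base;  /* designated DDR storage space (cf. 1402)      */
        uint64_t llc_off;   /* offset of the given storage area (cf. 1404)  */
        size_t   size;      /* must fit within the configured latch area    */
    };

    int map_ddr_to_llc(const struct cluster_region *r);  /* assumed driver call */

    int setup_cluster_store(void) {
        struct cluster_region r = {
            .ddr_base = 0x80000000ull,  /* example address only            */
            .llc_off  = 0,
            .size     = 1u << 20,       /* e.g., 1 MiB shared by clusters  */
        };
        return map_ddr_to_llc(&r);      /* 0 on success, assumed convention */
    }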
In one embodiment, using the cluster storage area to perform cluster operations may include using it for inter-cluster communication. In this case, using the cluster storage area for inter-cluster communication may specifically include using the cluster storage area to implement point-to-point communication between clusters. Additionally, the cluster storage area may be used to implement broadcast communication from one of the multiple clusters to the remaining clusters. In the point-to-point scenario, the cluster storage area can receive a write operation from a first cluster for write data and, in response to a read operation of a second cluster, send the first cluster's previously written data to the second cluster.
In an example implementation of the above write operation, the cluster storage area can also receive a lock indication specifying that the write data associated with the write operation is to reside in the cluster storage area, such as the write lock ("write lock") shown in FIG. 14, i.e., the lock-related request carrying the Lock attribute described above. The write data can then reside in the cluster storage area based on the lock indication, where the cluster storage area may be the latch area determined in the foregoing embodiments. Such residency can significantly improve the cache hit rate for data that will be read multiple times.
In one implementation scenario, a producer kernel executing in one of the clusters can use the above write lock to latch in the LLC the data that needs to be handed over to a consumer kernel, for later use by that kernel; for example, the producer kernel transmits data via the LLC to processors in other clusters for merge processing. Processors in the other clusters can read the data from the cluster storage area for processing, thereby transferring data between clusters.
In an example implementation of the above read operation, the cluster storage area can also receive a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory, such as the "read invalid" issued by cluster 1 in FIG. 14. The read-invalidate indication may be a latch-related request carrying the invalid attribute; how such a request is generated is described above, and the latch-related requests may differ across latch modes. After sending the write data to cluster 1, the cluster storage area can then invalidate the cache lines associated with the write data based on the read-invalidate indication.
To synchronize the data transfer (or communication) between the above clusters, the cluster writing data into the cluster storage area (e.g., cluster 0) can, after the write operation completes, send a synchronization instruction to the other cluster (e.g., cluster 1), such as the hsem ("hardware semaphore") in FIG. 14. Upon receiving the synchronization instruction, cluster 1 can issue the above read-invalidate request against the cluster storage area, so that after it reads the data cluster 0 wrote into the cluster storage area, the corresponding cache lines are invalidated and the data is prevented from being written back. The handshake is sketched below.
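Under the assumption of four hypothetical primitives — store_locked() (a write carrying the Lock attribute), hsem_post()/hsem_wait() (the hardware semaphore), and load_invalid() (a read carrying the invalid attribute) — the handshake of FIG. 14 could look as follows. This is a sketch of the ordering only, not the disclosure's actual API.

    #include <stddef.h>

    #define HSEM_ID 0                       /* illustrative semaphore index */
    static char *const CLUSTER_STORE = (char *)0x400000;  /* example base  */

    /* assumed primitives, not defined by the disclosure */
    void store_locked(void *dst, const void *src, size_t n);
    void load_invalid(void *dst, const void *src, size_t n);
    void hsem_post(int id);
    void hsem_wait(int id);

    void cluster0_send(const void *src, size_t n) {
        store_locked(CLUSTER_STORE, src, n); /* write lock: data resides in LLC */
        hsem_post(HSEM_ID);                  /* synchronize after the write     */
    }

    void cluster1_receive(void *dst, size_t n) {
        hsem_wait(HSEM_ID);                  /* wait for cluster 0's signal     */
        load_invalid(dst, CLUSTER_STORE, n); /* read invalid: consume the data
                                                and invalidate its cache lines
                                                so nothing is written back      */
    }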
In the context of the present disclosure, the above acts of writing data into and reading data from the cluster storage area may also be collectively referred to as latch-related operations triggered by latch-related requests; how such requests are identified is described above. Specifically, a latch-related request may indicate a latch operation, through which data is latched in the cluster storage area for subsequent multiple uses. Further, a latch-related request may indicate a release operation, through which data is unlocked from the cluster storage area to free up storage space for subsequent data latching. It can be appreciated that the release operation allows the storage space of the cluster storage area to be used flexibly, thereby improving its usage efficiency.
In one embodiment, for a read operation on the cluster storage area, the data, or a selected portion of it, may also be released from a designated region of the cluster storage area according to the latch-related request after the read operation completes. Regarding the selected portion of data, in one embodiment a predetermined proportion of the data may be chosen at random to form the portion to be latched in the latch area. In another embodiment, a hash algorithm may be used to select a predetermined proportion of the data as the portion latched in the cluster storage area; see the description of FIG. 11 above. A sketch of such hash-based selection follows.
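The following minimal sketch illustrates one way to pick a predetermined proportion of cache lines to latch. The disclosure only states that a hash algorithm selects the portion; the 64-byte line size and the particular mixing function (the MurmurHash3 finalizer) are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64u   /* assumed cache line size */

    /* a simple avalanche hash (MurmurHash3 finalizer) */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
    }

    /* Latch a line only if its hashed line index falls below the
     * threshold, so roughly `ratio` (0.0-1.0) of all lines are latched. */
    bool should_latch(uintptr_t addr, double ratio) {
        uint64_t h = mix64((uint64_t)(addr / LINE_SIZE));
        return (double)h < ratio * (double)UINT64_MAX;
    }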
The solution of the present disclosure has been described in detail above with reference to the accompanying drawings. Depending on the application scenario, the electronic equipment or apparatus of the present disclosure may include servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet-of-Things terminals, mobile terminals, mobile phones, dashboard cameras, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous-driving terminals, vehicles, household appliances, and/or medical equipment. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic equipment or apparatus of the present disclosure can also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and other fields. Further, it can also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic equipment or apparatuses with high computing power according to the present disclosure can be applied to cloud devices (e.g., cloud servers), while those with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the cloud device's hardware resources according to the hardware information of the terminal and/or edge device to simulate the hardware resources of the latter, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, the present disclosure presents some methods and embodiments thereof as a series of actions and combinations of actions, but those skilled in the art will understand that the disclosed solution is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching herein, those skilled in the art will appreciate that certain steps may be performed in other orders or simultaneously. Further, the embodiments described herein may be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for implementing one or more of the disclosed solutions. In addition, depending on the solution, the descriptions of some embodiments place emphasis differently. In view of this, for any part not described in detail in a given embodiment, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching herein, those skilled in the art will understand that several of the disclosed embodiments may also be realized in other ways not disclosed in this document. For example, the units in the electronic equipment or apparatus embodiments described above are divided on the basis of logical functions, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, such direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network nodes. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. Moreover, in some scenarios, multiple units of the disclosed embodiments may be integrated into one unit, or each unit may exist physically on its own.
In some implementation scenarios, the above integrated units may be realized in the form of software program modules. If realized as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied as a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the disclosed embodiments. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("ROM"), a random access memory ("RAM"), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing program code.
In other implementation scenarios, the above integrated units may also be realized in the form of hardware, that is, as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., computing apparatuses or other processing apparatuses) may be realized by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic or magneto-optical storage media), such as a resistive random access memory ("RRAM"), a dynamic random access memory ("DRAM"), a static random access memory ("SRAM"), an enhanced dynamic random access memory ("EDRAM"), a high-bandwidth memory ("HBM"), a hybrid memory cube ("HMC"), a ROM, or a RAM.
The foregoing can be better understood in light of the following clauses:
Clause A1. A method for a cache memory, comprising:
configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes;
receiving a latch-related request for performing a latch-related operation on data in the latch area; and
performing, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
Clause A2. The method of Clause A1, wherein the multiple latch modes are executed in a predetermined priority order.
Clause A3. The method of Clause A1 or A2, wherein the multiple latch modes include an instruction mode in which latch-related operations are performed based on hardware instructions, a window mode in which latch-related operations are performed based on window attributes, a stream mode in which latch-related operations are performed based on data streams, and/or a page mode in which latch-related operations are performed based on cache pages.
Clause A4. The method of Clause A3, wherein:
in the instruction mode, the latch-related request is determined according to the hardware instruction;
in the page mode, the latch-related request is determined according to a cache page configuration; and
in the window mode or the stream mode, the latch-related request is determined according to a lock window.
Clause A5. The method of Clause A4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can carry a lock attribute indicating that specific data is to reside in the latch area, the specific data being a portion of the data selected according to a hash algorithm.
Clause A6. The method of Clause A3 or A4, wherein in the page mode the method comprises:
performing a cache-page-based latch operation according to a linear mapping window of a system memory management unit.
Clause A7. The method of Clause A3, wherein configuring the latch area to support the multiple latch modes comprises:
configuring a specific storage space as a latch area supporting a corresponding one of the latch modes according to one of multiple received configuration instructions, wherein the configuration instructions include configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
Clause A8. The method of Clause A7, wherein for a write operation directed at the latch area, the method comprises latching the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request, for subsequent multiple reads.
Clause A9. The method of Clause A7, wherein for a read operation directed at the latch area, the method comprises, after the read operation has been performed, releasing the data, or a selected portion of the data, from a designated region of the latch area according to the latch-related request.
Clause A10. A cache memory, comprising:
a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes; and
a latch execution module configured to:
receive a latch-related request for performing a latch-related operation on data in the latch area; and
perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
Clause A11. A system-on-chip, comprising:
the cache memory of Clause A10; and
a processor configured to generate the latch-related request,
wherein the latch execution module of the cache memory is configured to perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
Clause A12. The system-on-chip of Clause A11, wherein the latch modes include an instruction mode, and in the instruction mode the processor is configured to generate the latch-related request according to a received hardware instruction.
Clause A13. The system-on-chip of Clause A11, wherein the latch modes include a page mode, and in the page mode the processor is configured to generate the latch-related request according to a cache page configuration.
Clause A14. The system-on-chip of Clause A11, wherein the latch modes include a window mode or a stream mode, and in the window mode or the stream mode the system-on-chip further comprises a task scheduler including a configurator and a scheduling unit, wherein:
the configurator is configured to generate the configuration instruction according to an assigned configuration task, for sending to the configuration module of the cache memory; and
the scheduling unit is configured to schedule multiple tasks in the task scheduler for sending to the processor core.
Clause A15. The system-on-chip of Clause A14, wherein the configuration instruction includes configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
Clause A16. The system-on-chip of Clause A15, wherein the processor further includes a system memory management unit configured, in the window mode or the stream mode, to:
configure, according to a parameter table, a lock window associated with data on which a latch-related operation is to be performed; and
generate the latch-related request according to the configured lock window.
Clause A17. The system-on-chip of Clause A16, wherein the configuration items of the lock window include one or more of the following:
a base address and a size of the window, wherein the base address of the window corresponds to the start address of the data on which a latch-related operation is to be performed and the size of the window corresponds to the size of the data;
a latch indication for latching data in the latch area;
an unlock indication for unlocking data from the latch area; and
a latch ratio indicating the proportion of the data that will actually be latched among the data on which latch-related operations are to be performed.
Clause A18. The system-on-chip of Clause A17, wherein the processor is further configured to, when the access address of the data on which a latch-related operation is to be performed falls within the address range of the lock window, use a hash algorithm to select the portion of the data that can be latched in the latch area.
Clause A19. The system-on-chip of Clause A17, wherein the processor is configured to randomly select, according to a hash algorithm, the portion of the data to be latched that satisfies the predetermined latch ratio, and to generate a latch-related request carrying a lock attribute for latching in the latch area.
Clause A20. The system-on-chip of Clause A14, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to latch the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request; and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data from the designated region of the latch area after the read operation according to the latch-related request.
Clause A21. The system-on-chip of any one of Clauses A16-A20, wherein the tasks include a producer kernel and a consumer kernel, wherein:
when executing the producer kernel, the processor is configured to latch, through the latch-related request, the data output by the producer kernel in the latch area for use by the consumer kernel; and
when executing the consumer kernel, the processor is configured to read data from the latch area and, after reading the data, to unlock the data from the latch area through the latch-related request so as to free the storage space used for the data in the latch area.
Clause A22. A board comprising the system-on-chip of any one of Clauses A11-A21.
Clause A23. A computing device comprising the board of Clause A22.
Clause B1. A method for a system-on-chip including at least a plurality of clusters for performing computing operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the computing operations, the method comprising:
mapping a designated storage space of an off-chip memory to a latch area of the cache memory so as to use the latch area as a cluster storage area for inter-cluster data communication; and
performing operations of the clusters using the cluster storage area.
Clause B2. The method of Clause B1, wherein performing operations of the clusters using the cluster storage area comprises using the cluster storage area for inter-cluster communication.
Clause B3. The method of Clause B2, wherein using the cluster storage area for inter-cluster communication comprises:
using the cluster storage area to implement point-to-point communication between clusters; or
using the cluster storage area to implement broadcast communication from one of the plurality of clusters to the remaining clusters.
Clause B4. The method of Clause B3, wherein using the cluster storage area to implement point-to-point communication between clusters comprises:
receiving, from a first cluster, a write operation for write data; and
sending the write data to a second cluster in response to a read operation of the second cluster.
Clause B5. The method of Clause B4, wherein for the write operation the method further comprises:
receiving a lock indication specifying that the write data associated with the write operation is to reside in the cluster storage area; and
causing the write data to reside in the cluster storage area based on the lock indication.
Clause B6. The method of Clause B4 or B5, wherein for the read operation the method further comprises:
receiving a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory; and
after sending the write data to the second cluster, invalidating the cache line associated with the write data based on the read-invalidate indication.
Clause B7. A system-on-chip, comprising:
a plurality of clusters, each cluster including at least a plurality of processor cores for performing computing operations; and
a cache memory interconnected with the plurality of clusters and configured to:
use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area is mapped to a designated storage space of an off-chip memory; and
perform operations of the clusters using the cluster storage area.
Clause B8. The system-on-chip of Clause B7, wherein the cluster storage area is configured for inter-cluster communication.
Clause B9. The system-on-chip of Clause B8, wherein the cluster storage area is configured for point-to-point communication between clusters or for broadcast communication from one of the plurality of clusters to the remaining clusters.
Clause B10. The system-on-chip of Clause B9, wherein in the point-to-point communication the cluster storage area is configured to:
receive, from a first cluster, a write operation for write data; and
send the write data to a second cluster in response to a read operation of the second cluster.
Clause B11. The system-on-chip of Clause B10, wherein the second cluster is configured to:
receive a hardware semaphore from the first cluster; and
perform the read operation on the cluster storage area in response to receiving the hardware semaphore.
Clause B12. The system-on-chip of Clause B10, wherein for the write operation the first cluster is configured to send to the cluster storage area a lock indication specifying that the write data is to reside in the cluster storage area, so that the cluster storage area causes the write data to reside there based on the lock indication.
Clause B13. The system-on-chip of Clause B12, wherein for the read operation the second cluster is configured to send to the cluster storage area a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory, so that the cluster storage area invalidates the cache line associated with the write data based on the read-invalidate indication.
Clause B14. A computing apparatus comprising the system-on-chip of any one of Clauses B7-B13.
Clause B15. A board comprising the computing apparatus of Clause B14.
Clause B16. A computing device comprising the board of Clause B15.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the idea and spirit of the present disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of protection of the present disclosure and therefore to cover equivalents and alternatives within the scope of those claims.

Claims (36)

  1. A method for a cache memory, comprising:
    configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes;
    receiving a latch-related request for performing a latch-related operation on data in the latch area; and
    performing, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
  2. The method of claim 1, wherein the multiple latch modes have a predetermined priority order.
  3. The method of claim 1 or 2, wherein the multiple latch modes include an instruction mode in which latch-related operations are performed based on hardware instructions, a window mode in which latch-related operations are performed based on window attributes, a stream mode in which latch-related operations are performed based on data streams, and/or a page mode in which latch-related operations are performed based on cache pages.
  4. The method of claim 3, wherein:
    in the instruction mode, the latch-related request is determined according to the hardware instruction;
    in the page mode, the latch-related request is determined according to a cache page configuration; and
    in the window mode or the stream mode, the latch-related request is determined according to a lock window.
  5. The method of claim 4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can carry a lock attribute indicating that specific data is to reside in the latch area, the specific data being a portion of the data selected according to a hash algorithm.
  6. The method of claim 3 or 4, wherein in the page mode the method comprises:
    performing a cache-page-based latch operation according to a linear mapping window of a system memory management unit.
  7. The method of claim 3, wherein configuring the latch area to support the multiple latch modes comprises:
    configuring a specific storage space as a latch area supporting a corresponding one of the latch modes according to one of multiple received configuration instructions,
    wherein the configuration instructions include configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
  8. The method of claim 7, wherein for a write operation directed at the latch area, the method comprises latching the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request, for subsequent multiple reads.
  9. The method of claim 7, wherein for a read operation directed at the latch area, the method comprises, after the read operation has been performed, releasing the data, or a selected portion of the data, from a designated region of the latch area according to the latch-related request.
  10. The method of claim 1, wherein the cache memory is located in a system-on-chip and is further interconnected with a plurality of clusters of the system-on-chip, each cluster including a plurality of processor cores for performing computing operations, the method further comprising:
    mapping a designated storage space of an off-chip memory to the latch area of the cache memory so as to use the latch area as a cluster storage area for inter-cluster data communication; and
    performing operations of the clusters using the cluster storage area.
  11. The method of claim 10, wherein performing operations of the clusters using the cluster storage area comprises using the cluster storage area for inter-cluster communication.
  12. The method of claim 11, wherein using the cluster storage area for inter-cluster communication comprises:
    using the cluster storage area to implement point-to-point communication between clusters; or
    using the cluster storage area to implement broadcast communication from one of the plurality of clusters to the remaining clusters.
  13. The method of claim 12, wherein using the cluster storage area to implement point-to-point communication between clusters comprises:
    receiving, from a first cluster, a write operation for write data; and
    sending the write data to a second cluster in response to a read operation of the second cluster.
  14. The method of claim 13, wherein for the write operation the method further comprises:
    receiving a lock indication specifying that the write data associated with the write operation is to reside in the cluster storage area; and
    causing the write data to reside in the cluster storage area based on the lock indication.
  15. The method of claim 13 or 14, wherein for the read operation the method further comprises:
    receiving a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory; and
    after sending the write data to the second cluster, invalidating the cache line associated with the write data based on the read-invalidate indication.
  16. A cache memory, comprising:
    a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes; and
    a latch execution module configured to:
    receive a latch-related request for performing a latch-related operation on data in the latch area; and
    perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
  17. A system-on-chip, comprising:
    the cache memory of claim 16; and
    a processor configured to generate the latch-related request,
    wherein the latch execution module of the cache memory is configured to perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
  18. The system-on-chip of claim 17, wherein the latch modes include an instruction mode, and in the instruction mode the processor is configured to generate the latch-related request according to a received hardware instruction.
  19. The system-on-chip of claim 17, wherein the latch modes include a page mode, and in the page mode the processor is configured to generate the latch-related request according to a cache page configuration.
  20. The system-on-chip of claim 17, wherein the latch modes include a window mode or a stream mode, and in the window mode or the stream mode the system-on-chip further comprises a task scheduler including a configurator and a scheduling unit, wherein:
    the configurator is configured to generate the configuration instruction according to an assigned configuration task, for sending to the configuration module of the cache memory; and
    the scheduling unit is configured to schedule multiple tasks in the task scheduler for sending to the processor.
  21. The system-on-chip of claim 20, wherein the configuration instruction includes configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
  22. The system-on-chip of claim 21, wherein the processor further includes a system memory management unit configured, in the window mode or the stream mode, to:
    configure, according to a parameter table, a lock window associated with data on which a latch-related operation is to be performed; and
    generate the latch-related request according to the configured lock window.
  23. The system-on-chip of claim 22, wherein the configuration items of the lock window include one or more of the following:
    a base address and a size of the window, wherein the base address of the window corresponds to the start address of the data on which a latch-related operation is to be performed and the size of the window corresponds to the size of the data;
    a latch indication for latching data in the latch area;
    an unlock indication for unlocking data from the latch area; and
    a latch ratio indicating the proportion of the data that will actually be latched among the data on which latch-related operations are to be performed.
  24. The system-on-chip of claim 23, wherein the processor is further configured to:
    when the access address of the data on which a latch-related operation is to be performed falls within the address range of the lock window, use a hash algorithm to select the portion of the data that can be latched in the latch area.
  25. The system-on-chip of claim 23, wherein the processor is configured to randomly select, according to a hash algorithm, the portion of the data to be latched that satisfies the predetermined latch ratio, and to generate a latch-related request carrying a lock attribute for latching in the latch area.
  26. The system-on-chip of claim 20, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to latch the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request; and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data from the designated region of the latch area after the read operation according to the latch-related request.
  27. The system-on-chip of any one of claims 22-26, wherein the tasks include a producer kernel and a consumer kernel, wherein:
    when executing the producer kernel, the processor is configured to latch, through the latch-related request, the data output by the producer kernel in the latch area for use by the consumer kernel; and
    when executing the consumer kernel, the processor is configured to read data from the latch area and, after reading the data, to unlock the data from the latch area through the latch-related request so as to free the storage space used for the data in the latch area.
  28. The system-on-chip of claim 27, further comprising a plurality of clusters, each cluster including at least a plurality of processor cores for performing computing operations, wherein the cache memory is further interconnected with the plurality of clusters and configured to:
    use the latch area as a cluster storage area for inter-cluster data communication, wherein the latch area is mapped to a designated storage space of an off-chip memory; and
    perform operations of the clusters using the cluster storage area.
  29. The system-on-chip of claim 28, wherein the cluster storage area is configured for inter-cluster communication.
  30. The system-on-chip of claim 29, wherein the cluster storage area is configured for point-to-point communication between clusters or for broadcast communication from one of the plurality of clusters to the remaining clusters.
  31. The system-on-chip of claim 30, wherein in the point-to-point communication the cluster storage area is configured to:
    receive, from a first cluster, a write operation for write data; and
    send the write data to a second cluster in response to a read operation of the second cluster.
  32. The system-on-chip of claim 31, wherein the second cluster is configured to:
    receive a hardware semaphore from the first cluster; and
    perform the read operation on the cluster storage area in response to receiving the hardware semaphore.
  33. The system-on-chip of claim 31, wherein for the write operation the first cluster is configured to send to the cluster storage area a lock indication specifying that the write data is to reside in the cluster storage area, so that the cluster storage area causes the write data to reside there based on the lock indication.
  34. The system-on-chip of claim 33, wherein for the read operation the second cluster is configured to send to the cluster storage area a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory, so that the cluster storage area invalidates the cache line associated with the write data based on the read-invalidate indication.
  35. A board comprising the system-on-chip of any one of claims 17-34.
  36. A computing device comprising the board of claim 35.
PCT/CN2022/110740 2021-08-12 2022-08-08 Method for cache memory and related products WO2023016383A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110926703.7A CN115878553A (en) 2021-08-12 2021-08-12 Method for system on chip and related product
CN202110926707.5 2021-08-12
CN202110926707.5A CN115705300A (en) 2021-08-12 2021-08-12 Method for cache memory and related product
CN202110926703.7 2021-08-12

Publications (1)

Publication Number Publication Date
WO2023016383A1 (en)

Family

ID=85200562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110740 WO2023016383A1 (en) 2021-08-12 2022-08-08 Method for cache memory and related products

Country Status (1)

Country Link
WO (1) WO2023016383A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030018855A1 (en) * 2001-07-16 2003-01-23 Mcwilliams Thomas M. Method and apparatus for caching with variable size locking regions
CN102750227A (en) * 2011-04-19 2012-10-24 飞思卡尔半导体公司 Cache memory with dynamic lockstep support
CN106547619A (en) * 2016-10-20 2017-03-29 深圳市云海麒麟计算机系统有限公司 Multi-user's memory management method and system
CN110634517A (en) * 2018-06-25 2019-12-31 成都康元多商贸有限公司 High-performance static random access memory

Similar Documents

Publication Publication Date Title
US10389839B2 (en) Method and apparatus for generating data prefetches specifying various sizes to prefetch data from a remote computing node
JP6431536B2 (en) Final level cache system and corresponding method
JP4322259B2 (en) Method and apparatus for synchronizing data access to local memory in a multiprocessor system
JP3802042B2 (en) Cache memory mounting method and apparatus, and cache memory system
US11341059B2 (en) Using multiple memory elements in an input-output memory management unit for performing virtual address to physical address translations
US20230367722A1 (en) Data processing device and method, and related products
JP2012252490A (en) Multiprocessor and image processing system using the same
US11468001B1 (en) Processing-in-memory concurrent processing system and method
US20210224213A1 (en) Techniques for near data acceleration for a multi-core architecture
CN118113631B (en) Data processing system, method, device, medium and computer program product
CN112527729A (en) Tightly-coupled heterogeneous multi-core processor architecture and processing method thereof
US20080244221A1 (en) Exposing system topology to the execution environment
EP3959611A1 (en) Intra-device notational data movement system
US12013780B2 (en) Multi-partition memory sharing with multiple components
WO2024045580A1 (en) Method for scheduling tasks, and related product thereof
WO2023016383A1 (en) Method for cache memory and related products
TW202111545A (en) Unified address translation
Abdallah Heterogeneous Computing: An Emerging Paradigm of Embedded Systems Design
US10884477B2 (en) Coordinating accesses of shared resources by clients in a computing device
CN115705300A (en) Method for cache memory and related product
CN115878553A (en) Method for system on chip and related product
WO2023016382A1 (en) Method for system on chip, and related product thereof
TWI831564B (en) Configurable memory system and memory managing method thereof
Chiu et al. Design and Implementation of the Link-List DMA Controller for High Bandwidth Data Streaming
CN116166468A (en) Method for processing ECC errors in heterogeneous system, heterogeneous system and related products thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22855364

Country of ref document: EP

Kind code of ref document: A1