WO2023016383A1 - Method for cache memory and related products

Info

Publication number
WO2023016383A1
Authority
WO
WIPO (PCT)
Prior art keywords
latch
data
area
cluster
chip
Prior art date
Application number
PCT/CN2022/110740
Other languages
French (fr)
Chinese (zh)
Inventor
葛祥轩 (Ge Xiangxuan)
张尧 (Zhang Yao)
刘少礼 (Liu Shaoli)
梁军 (Liang Jun)
Original Assignee
寒武纪(西安)集成电路有限公司 (Cambricon (Xi'an) Integrated Circuit Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110926703.7A external-priority patent/CN115878553A/en
Priority claimed from CN202110926707.5A external-priority patent/CN115705300A/en
Application filed by 寒武纪(西安)集成电路有限公司 (Cambricon (Xi'an) Integrated Circuit Co., Ltd.)
Publication of WO2023016383A1 publication Critical patent/WO2023016383A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems

Definitions

  • the present disclosure generally relates to the field of chip technology. More specifically, the present disclosure relates to a method for a cache memory, a cache memory, a system-on-chip including the cache memory, a board including the system-on-chip, and a computing device including the board.
  • the operational performance of a computing system is largely determined by the average memory access latency.
  • System performance can be significantly improved by effectively reducing the number of memory accesses by increasing the hit rate of the cache memory (referred to as "cache").
  • processors typically employ a cache mechanism, and use the cache to accommodate the mismatch in speed and performance between the processor and slow main memory.
  • the current cache implements a multi-level cache mechanism, such as three-level cache (L1, L2, and L3), and the cache closest to the main memory is called the last level cache (“Last Level Cache", LLC).
  • how to expand the application of LLC for different scenarios has also become a problem that needs to be solved.
  • the present disclosure provides a residency scheme for a cache memory.
  • a specific area in the cache memory can be configured as a locked area, and data used multiple times can be stored in the locked area, thereby improving the cache hit rate and the overall performance of the system.
  • the present disclosure provides a solution for a cache memory in the following aspects.
  • the present disclosure provides a method for a cache memory, comprising: configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each latch mode corresponds to a latch-related operation performed on data in the latch area; receiving a latch-related request for performing a latch-related operation on the data in the latch area; and, according to the latch-related request, performing the latch-related operation on the data in the latch area in the corresponding latch mode.
  • the present disclosure provides a cache memory, including: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each latch mode corresponds to a latch-related operation performed on data in the latch area; and a latch execution module configured to: receive a latch-related request for performing a latch-related operation on the data in the latch area; and perform the latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the present disclosure provides a system-on-chip comprising a cache memory as described above and in various embodiments below, and a processor configured to generate said latch-related request, wherein the latch execution module of the cache memory is configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the present disclosure provides a board, including the system-on-chip as described above and in the following embodiments.
  • the present disclosure provides a computing device, including the board as described above and described in various embodiments below.
  • the latch area can be used to perform latch and unlock operations on data used multiple times, thereby significantly improving the cache hit rate. Further, since the latch area of the present disclosure supports multiple latch modes, and these latch modes can be selected and used according to the configuration, the application scenarios of the latch area are expanded. When used in the scenario of the producer core and the consumer core, the latch area of the present disclosure can serve as a medium for data transfer, thereby improving the accessibility and utilization of data. In addition, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the computing system.
  • the present disclosure proposes the use scenario of expanding the cache memory.
  • the present disclosure provides solutions for a system on chip in the following aspects.
  • the present disclosure provides a method for a system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a plurality of processor cores for performing the operations. The method includes: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and performing the operations of the cluster using the cluster storage area.
  • the present disclosure provides a system-on-chip, comprising: a plurality of clusters, each of which includes a plurality of processor cores for at least performing arithmetic operations; and a cache memory interconnected with the plurality of clusters and configured to: use the latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform the operations of the clusters.
  • the present disclosure provides a computing device comprising a system-on-chip as described above and in various embodiments below.
  • the present disclosure provides a board, including the computing device as described above and described in various embodiments below.
  • the present disclosure provides a computing device, including the board as described above and in the following embodiments.
  • the latch area of the cache memory can be used to realize efficient communication between the clusters of the SoC. Therefore, the data that needs to be transferred through the off-chip memory can be directly transferred through the latch area, thereby speeding up data access and significantly improving the cache hit rate. Further, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the SoC. In addition, the division of the latch area simplifies the management of the cache memory and expands the usage scenarios of the cache memory. With the help of the latch area, multiple clusters of the SoC can implement multiple flexible communication mechanisms, thereby also improving the operational performance of the cluster.
  • FIG. 1 is a structural diagram showing a board according to an embodiment of the present disclosure
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart illustrating a method for a cache memory according to an embodiment of the present disclosure
  • Figure 7 is a simplified block diagram illustrating a cache memory according to an embodiment of the disclosure.
  • FIG. 8 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure.
  • FIG. 9 is a detailed block diagram illustrating a system-on-chip according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram showing a hash operation in window mode according to an embodiment of the present disclosure.
  • Figure 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure.
  • FIG. 13 is a flowchart illustrating a method for a system on a chip according to an embodiment of the present disclosure.
  • FIG. 14 is a block diagram illustrating an operation of a system on chip according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • the phrase "if determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once determined" or "in response to the determination" or "once [the described condition or event] is detected" or "in response to detection of [the described condition or event]".
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 1 are only an example, and are not intended to limit the solution of the present disclosure in any respect.
  • the board 10 includes a chip 101, which may be a system-on-chip (System on Chip, SoC), that is, a system-on-chip described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices.
  • the aforementioned combined processing device can be an artificial intelligence computing unit, used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining; deep learning technology in particular is widely applied in the field of cloud intelligence.
  • a notable feature of cloud intelligent applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligent applications, and has huge off-chip storage, huge on-chip storage and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected and data transmitted with the control device 106 and the chip 101 through the bus.
  • the control device 106 in the board 10 may be configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a structural diagram showing a combination processing device in the chip 101 according to the above-described embodiment.
  • the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a DRAM (Dynamic Random Access Memory) 204.
  • the computing device 201 can be configured to perform user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning calculations, and can also interact with the processing device 203 through the interface device 202 to jointly complete operations specified by the user.
  • the interface device 202 can be used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (Central Processing Unit, CPU) or a graphics processing unit (Graphics Processing Unit, GPU), including but not limited to a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together as an integrated whole, they are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store the data to be processed; it is a DDR memory, usually 16G or larger in size, for storing data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • the single-core computing device 301 is used to process input data from fields such as computer vision, speech, natural language and data mining.
  • the single-core computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (Instruction Fetch Unit, IFU) 311 and an instruction decoding unit (Instruction Decode Unit, IDU) 312.
  • the instruction fetch unit 311 is used to obtain instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoding results to the operation module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (Direct Memory Access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons, and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • the DMA 333 is connected to the DRAM 204 through the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core.
  • the multi-core computing device 41 adopts a hierarchical structure design: the multi-core computing device 41 is a system-on-chip that includes at least one cluster according to the present disclosure, and each cluster includes multiple processor cores.
  • in other words, the multi-core computing device 41 is organized in a hierarchy of system-on-chip, cluster and processor core.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to execute tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • GBC Global Barrier Controller
  • the plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41 . Although 4 clusters are exemplarily shown in FIG. 4 , with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405 . In an application scenario, the cluster 405 can be used to efficiently execute deep learning algorithms.
  • each cluster 405 may include a plurality of processor cores (IPU cores) 406 and a storage core (MEM core) 407, and the storage core may include, for example, a cache memory (e.g., the LLC).
  • the number of processor cores 406 is exemplarily shown as four in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in FIG. 5.
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and may also include three modules: a control module 51 , an operation module 52 and a storage module 53 .
  • the functions and structures of the control module 51, operation module 52 and storage module 53 are roughly the same as those of the control module 31, operation module 32 and storage module 33, and will not be repeated here.
  • the storage module 53 may include an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 533 and a moving direct memory access module (Move Direct Memory Access, MVDMA) 534.
  • IODMA 533 controls memory access of NRAM 531/WRAM 532 and DRAM 204 through broadcast bus 409;
  • MVDMA 534 is used to control memory access of NRAM 531/WRAM 532 and storage unit (SRAM) 408.
  • the storage core 407 is mainly used for storage and communication, that is, for storing shared data or intermediate results between the processor cores 406, and for performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like.
  • the storage core 407 may have a scalar operation capability to perform scalar operations.
  • the storage core 407 may include a static random access memory (Static Random-Access Memory, SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 410 and a global direct memory access module (Global Direct Memory Access , GDMA) 411.
  • the SRAM 408 can assume the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be obtained from the DRAM 204 through the processor cores 406 respectively, but is transferred between the processor cores 406 through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to multiple processor cores 406, thereby improving the efficiency of inter-core communication and significantly reducing on-chip and off-chip input/output access.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are respectively used to perform communication between the processor cores 406, communication between the clusters 405, and data transmission between the clusters 405 and the DRAM 204. They will be described separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405 .
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • unicast refers to point-to-point (for example, single processor core to single processor core) data transmission; multicast is a communication method that transmits a piece of data from the SRAM 408 to several specific processor cores 406; and broadcast, which transmits a piece of data from the SRAM 408 to all processor cores 406, is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 in the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be realized in two ways.
  • the first way is to communicate directly between the DRAM 204 and the NRAM 531 or WRAM 532 through the IODMA 533; the second way is to first transfer the data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then transfer the data between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534.
  • although the second way may require more components to participate and has a longer data path, in some embodiments the bandwidth of the second way is much larger than that of the first way, so using the second way to implement the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient. It can be understood that the data transmission methods described here are only exemplary, and those skilled in the art can flexibly select and apply various data transmission methods according to the specific arrangement of the hardware, following the teaching of the present disclosure.
  • the function of GDMA 411 and the function of IODMA 533 can be integrated in the same component.
  • although the present disclosure regards the GDMA 411 and the IODMA 533 as different components for convenience of description, for those skilled in the art, implementations whose functions and technical effects are similar to those of the present disclosure belong to the protection scope of the present disclosure.
  • the function of the GDMA 411, the function of the IODMA 533, the function of the CDMA 410, and the function of the MVDMA 534 can also be realized by the same component.
  • the hardware architecture of the present disclosure and its internal structure have been described in detail above with reference to FIGS. 1-5. It is to be understood that the foregoing description is illustrative only and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also make changes to the board of the present disclosure and its internal structure, and these changes still fall within the protection scope of the present disclosure.
  • the corresponding hardware architecture may not include the CDMA 410 used to control the access to the SRAM 408 among different clusters 405 in the same computing device 201 .
  • the underlying approach of the present disclosure involves improving and optimizing the cache, eg, disposed between SRAM 408 and DRAM 204, to enable efficient on-demand latching of data and communication between different clusters through the cache.
  • the following scheme of the present disclosure proposes to configure a specific storage space in the cache memory as a latch area for data latch operations, especially for data that will be used frequently.
  • the aforementioned frequently used data may be data to be reused between at least one task having a data dependency. It will be appreciated that data need not be locked in the cache memory when the data need only be used once.
  • the following solution of the present disclosure also proposes to configure the cache memory to support multiple latch modes, so that when a latch-related request is received, the cache memory operates in the latch mode corresponding to the aforementioned latch-related request.
  • various latch modes of the present disclosure may have a specific priority order to satisfy different latch-related operations.
  • the solution of the present disclosure also proposes a variety of different configuration methods, so that the cache memory can be used more flexibly and used to realize inter-cluster communication.
  • FIG. 6 is a flowchart illustrating a method 600 for a cache memory according to an embodiment of the disclosure.
  • the method 600 includes, at step S602 , configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes.
  • the aforementioned multiple latch modes may include, but are not limited to, an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, a stream mode for performing latch-related operations based on data streams, and/or a page mode for performing latch-related operations based on cache pages.
  • the aforementioned data streams may be instruction streams or data streams of different types.
  • the data stream may be the neuron data stream, weight data stream, output result data stream, etc. of the neural network model.
  • the data targeted by a latch-related operation is data that will be used multiple times by the processor of the system-on-chip, and it therefore has a relatively higher priority than data not subjected to the latch operation.
  • the cache hit rate can be significantly improved, thereby improving the overall performance of the system.
  • by keeping the reused data in the latch area of the LLC, the read and write operations of data between the on-chip system and the off-chip memory (such as DDR or DRAM) can be reduced, thereby improving memory access efficiency.
  • the above-mentioned multiple latch modes can be set to have different priorities according to user preferences or system preferences.
  • in one implementation, the order of priority may be instruction mode -> window mode -> stream mode -> page mode; in another implementation, the order of priority may be instruction mode -> page mode -> stream mode -> window mode.
  • in this way, the latch area in the cache memory can be used in more ways, increasing the flexibility of using the latch area to cope with different application scenarios and system requirements. Further, the latch modes may be traversed sequentially in the above priority order, and when a high-priority latch mode is disabled, a lower-priority latch mode may be adopted.
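  • as an illustration only, the mode-selection traversal described above can be sketched in C as follows; the type, array and function names are assumptions, not an actual hardware interface:

        #include <stdbool.h>
        #include <stdio.h>

        typedef enum { MODE_INSTRUCTION, MODE_WINDOW, MODE_STREAM, MODE_PAGE, MODE_NONE } latch_mode_t;

        /* one possible priority order: instruction -> window -> stream -> page */
        static const latch_mode_t priority[] = { MODE_INSTRUCTION, MODE_WINDOW, MODE_STREAM, MODE_PAGE };

        /* traverse from high to low priority, falling through disabled modes */
        static latch_mode_t resolve_mode(const bool enabled[4]) {
            for (int i = 0; i < 4; i++)
                if (enabled[priority[i]])
                    return priority[i];
            return MODE_NONE;  /* no mode enabled: operate as a normal cache */
        }

        int main(void) {
            /* instruction mode disabled, window mode enabled: window mode wins */
            bool enabled[4] = { false, true, false, true };
            printf("effective mode = %d\n", resolve_mode(enabled));
            return 0;
        }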
  • a specific storage space may be configured as a latch area supporting a corresponding latch mode according to one configuration instruction among the received configuration instructions.
  • the configuration instruction may include one or more configuration items, so as to realize the configuration of the aforementioned latch area.
  • the plurality of configuration items may include configuration items for enabling a latch area, disabling a latch area, and/or a size of a latch area.
  • a corresponding latch strategy (such as the size of the latched data or the specific data to be latched) can be configured in the aforementioned instruction mode, window mode, stream mode or page mode, so as to latch different types of, or specific, instructions, data or data streams.
  • the scheme of the present disclosure can realize the flexible use of the cache memory, so that it can operate in one of the various latch modes of the present disclosure, or operate in the normal mode as required.
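  • a minimal sketch, assuming hypothetical field names, of what the payload of such a configuration instruction could look like in C; the real encoding is not specified in this disclosure:

        #include <stdint.h>

        typedef struct {
            uint8_t  lock_enable;  /* 1: enable the latch area, 0: disable it      */
            uint8_t  num_ways;     /* size of the latch area, expressed in ways    */
            uint8_t  mode;         /* which latch mode this configuration targets  */
            uint32_t policy;       /* optional latch strategy, e.g. the size of    */
                                   /* the latched data or an id of specific data   */
        } latch_config_t;

        int main(void) {
            /* example: enable a 6-way latch area for the window mode */
            latch_config_t cfg = { .lock_enable = 1, .num_ways = 6, .mode = 1, .policy = 0 };
            (void)cfg;  /* would be delivered to the cache's configuration module */
            return 0;
        }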
  • a latch-related request for performing latch-related operations on data in the latch area is received.
  • the latch-related request may be triggered by an operation intended to reside specific data in a latch region.
  • the latch-related request may also be triggered by an operation intended to remove or release specific data from the latch area.
  • the latch-related requests of the present disclosure may also have different expressions or contents. For example, for an instruction mode, a window mode, or a stream mode, the latch-related request may include a configuration item for indicating a behavior attribute of the cache memory, and the like.
  • the above-mentioned configuration item for indicating the behavior attribute of the cache memory includes at least one of the following configuration attributes (a sketch of these attributes as a C enumeration is given below):
  • Transient attribute: do not cache in the LLC, that is, perform data read and write operations directly with the off-chip memory (such as DDR); data that is only accessed once is not cached in the LLC, thereby avoiding occupying LLC resources;
  • Lock attribute: reside specific data in the latch area, and read and write the data from the hit cache line (cacheline). If the cache line belongs to the latch area, its attribute is configured as the persisting attribute; if the cache line does not belong to the latch area, its attribute remains unchanged, that is, it keeps the normal attribute described below. It should be clear that a cache line in the latch area has one of two attributes, namely the persisting attribute and the normal attribute. A cache line with the persisting attribute in the lock area can only be accessed and replaced by a latch-related request carrying the Lock attribute;
  • Unlock attribute: after reading and writing the data from the hit cache line, release the storage space corresponding to the data in the latch area of the LLC, and set the attribute of the corresponding cache line in the latch area to the normal attribute;
  • Invalid attribute: invalidate the data directly after it is read, so that it is not replaced and written back to the off-chip memory;
  • Clean attribute: during a write operation, data can be written into the hit cache line and the stored content of the entire cache can be written back to the off-chip memory, with the attribute of the cache line unchanged; during a read operation, data is read from the hit cache line, and when the hit cache line is dirty, it is written back to the off-chip memory;
  • Default attribute: the default item can be used to indicate that the configuration of the latch mode is ignored.
  • the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.
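  • for illustration, the behavior attributes listed above can be modeled as a C enumeration carried by a latch-related request; the identifier names are assumptions, and only the semantics in the comments come from the list above:

        /* behavior attributes a latch-related request may carry */
        typedef enum {
            ATTR_TRANSIENT, /* bypass the LLC; read/write the off-chip memory directly         */
            ATTR_LOCK,      /* reside data in the latch area; mark hit lines persisting        */
            ATTR_UNLOCK,    /* after access, release the line and restore the normal attribute */
            ATTR_NORMAL,    /* ordinary cached access                                          */
            ATTR_INVALID,   /* invalidate after reading; never written back off-chip           */
            ATTR_CLEAN,     /* write dirty contents back to the off-chip memory                */
            ATTR_DEFAULT    /* ignore the latch-mode configuration for this request            */
        } req_attr_t;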
  • in the page mode, the latch-related request may indicate that the data related to a specific page is to be latched in the latch area for subsequent multiple uses, or that, after multiple uses, the data related to the specific page is to be unlocked from the latch area so as to release more storage space for subsequent data latching. It can be understood that, through the release operation, the storage space of the latch area can be used flexibly, thereby improving the utilization efficiency of the latch area of the present disclosure.
  • a latch-related operation may be performed on data in the latch area in a corresponding latch mode.
  • the aforementioned latch-related operations may include a read operation and a write operation for the latch area.
  • the method 600 may also include latching data or a selected part of the data in a specified area of the latch area according to a latch-related request, so as to be used in subsequent multiple reads.
  • the method 600 may further include, after the read operation is completed, releasing the data or the selected part of the data from the specified area of the latch area according to the latch-related request.
  • a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area.
  • a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data to be latched in the latch area.
  • the aforementioned hash algorithm may be used to select part of the data that can be locked in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .
  • the solution of the present disclosure enables the cache memory to support multiple latch modes, thereby expanding the application scenarios of the cache memory and significantly improving the cache hit rate. Furthermore, the introduction of multiple latch modes makes the use of the latch area more flexible and adaptable, so as to meet different application scenarios and user requirements. In addition, the effective latching of data in the latch area also promotes the sharing of data between the producer kernel ("producer kernel") and one or more consumer kernels ("consumer kernel"), improving data accessibility and utilization.
  • the producer kernel and the consumer kernel here can be understood as two dependent tasks, where the output of the producer kernel will be used as the input to the consumer kernel, so that the consumer kernel can use the input to complete the corresponding task.
  • the output of the producer kernel can serve as data that needs to be used multiple times later, and such data can be temporarily stored in the latch area of the cache memory, so that the consumer kernel can directly obtain its input from the cache memory without accessing the off-chip memory. This reduces the memory interaction between the artificial intelligence processor and the off-chip memory and reduces the IO memory access overhead, which in turn improves the processing efficiency and performance of the artificial intelligence processor.
  • FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the disclosure. It can be understood that the cache memory 700 shown in FIG. 7 may be the cache memory described in conjunction with FIG. 6 , so the cache memory described in FIG. 6 is also applicable to the following description in relation to FIG. 7 .
  • the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. Further, the cache memory 700 also includes a storage space for performing cache operations; for example, as shown in the figure, the storage space is equally divided into 8 ways (way0-way7), where each way includes a number of cache lines (cachelines).
  • the above-mentioned configuration module can be used to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein the size of the specific storage space is smaller than the total storage size of the cache memory .
  • way0-way5 in FIG. 7 can be configured as a specific storage space that supports latching.
  • ways 6-7 in FIG. 7 can keep the normal attributes of the cache memory, that is, serve as a general cache.
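  • a minimal C sketch of the way partition shown in FIG. 7, assuming a bitmask encoding (one bit per way) that this disclosure does not actually specify:

        #include <stdint.h>
        #include <assert.h>

        #define TOTAL_WAYS 8

        /* build a mask selecting the first latch_ways ways as the latch area */
        static uint8_t latch_way_mask(int latch_ways) {
            assert(latch_ways < TOTAL_WAYS);  /* latch area is smaller than the whole cache */
            return (uint8_t)((1u << latch_ways) - 1u);
        }

        int main(void) {
            uint8_t lock_mask   = latch_way_mask(6);             /* 0b00111111: way0-way5 */
            uint8_t normal_mask = (uint8_t)(~lock_mask & 0xFFu); /* 0b11000000: way6-way7 */
            (void)lock_mask; (void)normal_mask;
            return 0;
        }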
  • the latch mode can be instruction mode, window mode, stream mode and/or page mode.
  • the latch execution module may be configured to receive a latch-related request for performing latch-related operations on data in the latch area.
  • the latch execution module can perform latch-related operations on data in the latch area in a corresponding latch mode according to the latch-related request.
  • the latch-related operations here may include a write operation to the latch area (that is, writing data into the latch area) or releasing data from the latch area. For example, when the consumer core has used up the data in the lock area and the data will no longer be used by other consumer cores, the space storing that data in the lock area can be released for latching other data.
  • FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the disclosure.
  • a system-on-chip 800 of the present disclosure may include a cache memory 700 and a processor (or processor core) 802 as shown in FIG. 7 .
  • the latch execution module of the cache memory may be configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request.
  • the cache memory 700 it has been described above in conjunction with FIG. 6 and FIG. 7 , and will not be repeated here.
  • the processor 802 may be various types of processors, and may include one or more processor cores to generate latch-related requests.
  • the latch execution module of the cache memory is configured to perform latch-related operations on data in the latch area in a corresponding latch mode according to the generated latch-related request.
  • the processor can be configured to generate latch-related requests according to received hardware instructions.
  • when the latch mode is the page mode, the processor may be configured to generate a latch-related request according to the cache page configuration.
  • when the latch mode is the window mode, the processor may be used to configure a lock window and generate a latch-related request according to the lock window.
  • the processor 802 may also be an intelligent processor or intelligence processing unit ("Intelligence Processing Unit", abbreviated "IPU") including multiple computing cores, which may be configured to execute computing tasks in various artificial intelligence fields (such as neural network operations).
  • FIG. 9 is a detailed block diagram illustrating a system on chip 900 according to an embodiment of the present disclosure. It can be understood that the system-on-chip 900 shown here may be a specific implementation of the system-on-chip shown in FIG. 8 , and therefore the content described with respect to FIG. 8 is also applicable to FIG. 9 . Further, for the purpose of example only, the operation of the system-on-chip 900 will be described in a window mode (or stream mode) among a plurality of latch modes.
  • a system on chip 900 may include a task scheduler (“Job Scheduler”) 902 including a scheduling unit 903 and a configurator 904 .
  • the configurator 904 may be configured to generate configuration instructions according to assigned configuration tasks (e.g., obtained from a task queue) and send them to a configuration module (such as the CLR) in the cache memory (that is, the "LLC" 906).
  • the scheduling unit 903 can be used to schedule the multiple tasks in the task scheduler (that is, the "kernels" to be executed on the artificial intelligence processor) and send them to the intelligent processor (IPU) 905 in the system-on-chip of the present disclosure.
  • the intelligent processor 905 here may include multiple processor cores, and the multiple processor cores may form a cluster as shown in FIG. 4 .
  • the scheduling unit may allocate tasks to appropriate processor cores according to the idleness (eg utilization) of the multiple processor cores.
  • the system-on-chip 900 also includes a system memory management unit ("System Memory Management Unit", abbreviated "SMMU"), which is used to convert the virtual address of the accessed data into a physical address, so that the associated storage location can be accessed according to the physical address.
  • the system memory management unit includes an address translation buffer, the TLB (Translation Lookaside Buffer, also called a fast table).
  • a page table is maintained in the TLB, and the page table includes at least one page table entry, and each page table entry includes a page (page) and a page frame (Frame) corresponding to the page.
  • the system memory management unit can determine the page corresponding to a received virtual address, and then determine the physical address (PA, "Physical Address") corresponding to the virtual address through the mapping relationship between the page and the page frame, so that the relevant storage location of the cache memory can be accessed according to the physical address.
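  • a minimal sketch, in C, of the page-to-page-frame translation described above, assuming 4 KiB pages and a linear scan of the page table; the real SMMU/TLB organization is not specified here:

        #include <stdint.h>
        #include <stddef.h>

        #define PAGE_SHIFT 12  /* assume 4 KiB pages */

        typedef struct { uint64_t page; uint64_t frame; } pte_t;

        /* look up va in the page table; returns 0 on a miss (illustration only) */
        static uint64_t translate(const pte_t *tlb, size_t n, uint64_t va) {
            uint64_t vpn = va >> PAGE_SHIFT;  /* page (virtual page number) */
            for (size_t i = 0; i < n; i++)
                if (tlb[i].page == vpn)       /* page -> page frame mapping  */
                    return (tlb[i].frame << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1u));
            return 0;
        }

        int main(void) {
            pte_t tlb[1] = { { 0x12, 0x80 } };
            return translate(tlb, 1, 0x12345) ? 0 : 1;  /* 0x12345 lies in page 0x12 */
        }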
  • access to the cache memory can be implemented through the above-mentioned window mode or stream mode.
  • in the window mode, the intelligent processor can obtain the parameter table from the memory and, according to the parameter table, configure a lock window ("Lock window") associated with the data on which the latch-related operation is to be performed, and then generate latch-related requests (e.g., IO access requests with lock/unlock attributes attached) according to the configured lock window.
  • the SMMU can perform latch-related operations on the LLC according to the IO access request. Specifically, the SMMU may send the aforementioned IO access request to the cache policy module 907 of the LLC 906 (which performs the same operation as the latch execution module 702 in FIG. 7 ) for execution.
  • the parameter table may include parameter items for configuring a lock window or a stream latch attribute in a stream mode.
  • the parameter items may include, but are not limited to, the lock/unlock window ("lock/unlock window"), per-stream lock/unlock ("per stream lock/unlock"), the lock ratio ("Lock Ratio"), the lock window flag ("lock window flag") and other information.
  • the parameters in the parameter table may be user-defined.
  • the relevant parameters in the parameter table can be obtained during the running phase of the program, and the parameter table can be stored in the memory (such as DDR), so that the intelligent processor (such as the IPU 905 in the figure) can be used in the execution phase.
  • the above-mentioned lock window is used to represent the storage space that the software user wishes to lock, and the size of the lock window may be larger than the size of the lock area on the cache memory.
  • the above-mentioned lock window includes one or more of the following: the base address and the size of the window. The base address of the window can be a virtual address ("Virtual Address", abbreviated "VA") configured by upper-layer software and corresponds to the starting address of the data to be latched, while the size of the window may correspond to the size of the data to be latched.
  • the intelligent processor can determine the memory access address of the data in a task (the access address can be a virtual address) according to the task issued by the task scheduler, and compare that access address with the address range defined by the lock window. If the access address of the data in the task is within the address range of the lock window, the lock window is hit, and the lock window can be enabled ("Enabled") at this time. Otherwise, if the access address is outside the address range of the lock window, the lock window is missed; in that case the lock window can be ignored, meaning that the data in the task will not be temporarily stored in the cache memory. A sketch of this hit check is given below.
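  • the hit check just described can be sketched in C as follows; the structure layout and names are assumptions for illustration:

        #include <stdint.h>
        #include <stdbool.h>

        typedef struct {
            uint64_t base_va;  /* virtual base address: start of the data to latch */
            uint64_t size;     /* window size: size of the data to latch           */
            bool     enabled;
        } lock_window_t;

        /* a request hits when its access address falls inside the window range */
        static bool window_hit(const lock_window_t *w, uint64_t access_va) {
            return w->enabled && access_va >= w->base_va && access_va < w->base_va + w->size;
        }

        int main(void) {
            lock_window_t w = { 0x1000, 0x2000, true };
            return window_hit(&w, 0x1800) ? 0 : 1;  /* hit: within [0x1000, 0x3000) */
        }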
  • a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data and stored in the latch area.
  • the specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .
  • the intelligent processor can send the lock-related request attached with the Lock attribute to the cache memory LLC through the SMMU.
  • the lock-related request attached with the Lock attribute may be used to indicate that specific data resides in the lock area, and the specific data may be part of data selected according to a hash algorithm.
  • the latching process and release process of the LLC will be described below in the window mode with reference to FIG. 9 .
  • Step 1: The task scheduler configures the LLC with the help of the configurator (e.g., via the cache policy module), enabling the lock region ("Lock enable") or disabling it ("Lock disable") and setting the size of the lock region, expressed as a number of ways ("Ways", e.g., Way0-Way7) as shown in the figure.
  • Step 2: The task scheduler sends the task kernel to the IPU;
  • Step 3: The IPU obtains the lock window flag ("lock window flag") from the parameter table, then reads and configures the lock window.
  • the parameter table here can be configured by software and stored at a storage address of the off-chip dynamic random access memory ("Dynamic Random Access Memory", abbreviated "DRAM"). The task scheduler can then transmit that address to the IPU, and the IPU can read the parameter table according to the address to complete the configuration of the lock window.
  • Step 4: The IPU generates a latch-related request through the memory management unit SMMU; when the request is sent to the cache policy module of the LLC, the lock attribute can be attached to it according to the lock window information.
  • Step 5: After receiving the latch-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of that cache line (that is, it becomes part of the lock area), for example setting it to "persisting" as described above.
  • Step 6: The task scheduler sends the kernel to the IPU;
  • Step 7: The IPU obtains the unlock window ID from the parameter table, then reads and configures the unlock window;
  • Step 8: When the IPU transmits a request, it attaches the unlock ("unlock") attribute according to the unlock window information;
  • Step 9: After receiving the request with the unlock attribute, the cache policy module of the LLC switches the hit cache line from the lock attribute to a normal attribute, such as the Normal attribute described above in conjunction with the instruction mode;
  • Step 10: The task scheduler disables the lock area (i.e., LLC lock disable) by means of the configurator, through the CLR module (a condensed software-side sketch of these steps is given below).
  • the CLR module may clear the previous locking attribute configuration according to the instruction of the configurator.
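  • purely as an illustration of the control flow of steps 1-10, the following C sketch replaces every hardware interaction with a hypothetical stub; none of these functions is a real API:

        #include <stdio.h>

        /* hypothetical driver hooks standing in for the hardware interactions */
        static void llc_lock_enable(int ways)       { printf("step 1: lock enable, %d ways\n", ways); }
        static void dispatch_kernel(const char *k)  { printf("dispatch kernel: %s\n", k); }
        static void configure_window(const char *w) { printf("configure %s window from parameter table\n", w); }
        static void llc_lock_disable(void)          { printf("step 10: lock disable via CLR\n"); }

        int main(void) {
            llc_lock_enable(6);           /* step 1: e.g. Way0-Way5 as the lock region */
            dispatch_kernel("producer");  /* step 2 */
            configure_window("lock");     /* step 3; steps 4-5 run on each IO access,
                                             attaching the lock attribute to requests  */
            dispatch_kernel("consumer");  /* step 6 */
            configure_window("unlock");   /* step 7; steps 8-9 run on each IO access,
                                             reverting hit lines to the normal attribute */
            llc_lock_disable();           /* step 10 */
            return 0;
        }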
  • the latch scheme of the system on chip of the present disclosure in the window mode has been described in detail above with reference to FIG. 9 .
  • the probability of a cache hit can be significantly increased, the utilization efficiency of the cache memory is improved, and the application scenarios are expanded.
  • the embodiments of the present disclosure also support latch-related operations in stream mode.
  • when the enable bit corresponding to the data stream in the task of the present disclosure is low, this is regarded as the default situation, that is, latch-related operations in the stream mode are not performed.
  • conversely, when the enable bit is high, the corresponding latch-related operations can be performed on the data stream in the stream mode.
  • the window mode and the stream mode of the present disclosure operate similarly: using the hash algorithm and the lock ratio of the data stream, a predetermined proportion of data can be selected from the data stream as the aforementioned partial data to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
  • the embodiment of the present disclosure also supports latch-related operations in the page mode, and the page mode will be described below with reference to FIG. 10 .
  • FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure.
  • in the page mode, the cache page can be directly configured to have the lock attribute of the present disclosure, so that a cache page that forms a mapping relationship with the memory (such as "memory") can be shared by multiple kernels (kernel 0-2 shown in the figure) for data access.
  • the programmer may use an instruction (such as Malloc) to mark the cache page with a lock attribute.
  • the SMMU can lock the data corresponding to the cache page in the latch area of the present disclosure.
  • the disclosed scheme improves the sharing and accessibility of data among multiple cores.
  • in one implementation, the software driver can directly configure the system memory management unit ("SMMU") information in the page table through instructions, and determine which of the two configurations to use: performing page-based latch operations or operating normally.
  • when operating normally, the attribute of the cache line in the cache memory can be the normal (Normal) attribute.
  • the page-based latch operation may be set according to the SMMU linearly mapped window configuration. For example, the data corresponding to the cache page in the linear mapping window is locked in the latch area of the present disclosure.
  • the SMMU can generate a corresponding latch-related request based on the information in the page table and send it to the LLC, and the cache policy module of the LLC can configure the cache lines of the LLC according to the latch-related request to execute the corresponding latch-related operations.
  • the embodiment of the present disclosure also supports an instruction mode, in which the system-on-chip can configure the latch area in the LLC through a memory access instruction (IO instruction) in the instruction set.
  • the IO instruction carries at least one configuration domain for latch-related attributes, so that the LLC can be flexibly configured by means of the configuration domain.
  • various configuration domains may represent corresponding operation behaviors that the LLC may perform when performing data access to off-chip memory (such as DDR space).
  • the instruction may carry the configuration attributes described above: the Transient attribute, Lock attribute, Unlock attribute, Normal attribute, Invalid attribute, Clean attribute or Default attribute, and so on. Since the instruction mode has the highest priority, when the IO access instruction indicates the Default attribute, other modes (such as the window mode, stream mode or page mode) can perform latch-related operations.
  • the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.
  • in the instruction mode, the IPU can determine the latch-related request according to the IO instruction in the task. Specifically, when the configuration domain of the Lock attribute in the IO instruction is enabled, the Lock attribute can be attached to the latch-related request, so that the LLC stores the specific data of the request in the lock area according to the Lock attribute. When the configuration domain of the Unlock attribute in the IO instruction is enabled, the Unlock attribute can be attached to the latch-related request, so that the LLC releases the locked area according to the request carrying the Unlock attribute. Depending on the application scenario, the latch-related request here can similarly carry other attributes.
  • in some embodiments, the instruction also includes a specific configuration field (for example, a specific bit inst_ratio_en) for indicating the latch ratio.
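  • a hypothetical C bitfield layout for such a configuration domain; only inst_ratio_en is named in the text, and the remaining fields and widths are assumptions:

        #include <stdint.h>

        typedef struct {
            uint32_t attr          : 3;  /* Transient/Lock/Unlock/Normal/Invalid/Clean/Default */
            uint32_t inst_ratio_en : 1;  /* enables the latch-ratio configuration field        */
            uint32_t lock_ratio    : 7;  /* latch ratio in percent, used when inst_ratio_en=1  */
            uint32_t reserved      : 21;
        } io_inst_cfg_t;

        int main(void) {
            /* example: a Lock request latching 10% of the accessed data */
            io_inst_cfg_t cfg = { .attr = 1, .inst_ratio_en = 1, .lock_ratio = 10, .reserved = 0 };
            (void)cfg;
            return 0;
        }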
  • the specific use of the hash algorithm will be described in detail below in conjunction with FIG. 11 .
  • FIG. 11 illustrates a hash operation in window mode or stream mode according to an embodiment of the present disclosure.
  • the scheme of the present disclosure uses a hash operation to enforce residency (i.e., locking) at a certain ratio, because one of the key issues with LLC residency is the tradeoff ("tradeoff") between bandwidth and capacity. Therefore, the present disclosure proposes to implement residency at a certain ratio (i.e., the Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks.
  • Lock Ratio can be configured in the lock/unlock window or for specific data streams. Also, although hash operations in window mode or stream mode are described below, similar operations are also applicable to hash operations in instruction mode.
  • the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address is within the address range of the lock window.
  • a hash operation may be performed on the hit window address range.
  • the access address of each data may be a virtual address.
  • the VA of the access address can be mapped to the Hash space (that is, the "Hash Map” in the figure), and the Hash process can preferentially retain the low-order information of the address.
  • the Hash value obtained at 1102 can be compared with the lock ratio Lock Ratio at 1104 to randomly select data of a corresponding ratio.
  • when the hash value of the access address is smaller than the latch ratio, it is considered a hit, and this part of the data (i.e., the data conforming to the ratio) can be latched in the cache memory; when the hash value of the access address is greater than or equal to the latch ratio, it is considered a miss, and this part of the data will not be latched in the cache memory.
  • for example, when the lock ratio (Lock Ratio) is set to 10%, the partial data corresponding to the first 10% of the hash values can be selected in order; that is, latch-related operations are performed on the data whose latch-address hash value is smaller than the lock ratio.
  • the latch ratio can also be other values, and the latch ratio can be customized by the software user, and the aforementioned selection operation can also be implemented according to the setting of the Hash algorithm.
  • the latch ratio may also be 20%-30%, and at this time, partial data corresponding to the first 20%-30% of the Hash values may be sequentially selected to perform latch-related operations. Thereafter, at 1106, it can be processed according to the specified request type, that is, to lock or unlock some data.
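The ratio-based selection described above can be summarized in a short behavioral sketch. It is a model under stated assumptions only: the multiplicative hash below stands in for the hardware "Hash Map", whose exact function the disclosure does not specify beyond favoring the low-order bits of the address.

```python
# Behavioral sketch of ratio-based residency; the hash function is an
# assumption standing in for the hardware "Hash Map".
def hash_va(va: int, bits: int = 16) -> float:
    """Map a virtual address into [0, 1), keeping low-order address information."""
    mixed = (va * 0x9E3779B1) & 0xFFFFFFFF  # Fibonacci-style multiplicative mix
    return (mixed & ((1 << bits) - 1)) / float(1 << bits)

def should_latch(va: int, lock_ratio: float) -> bool:
    """Hit: the hash value is smaller than the configured latch ratio."""
    return hash_va(va) < lock_ratio

# With a 10% lock ratio, roughly one line in ten is selected for residency,
# trading latched capacity against bandwidth for the remaining accesses.
lines = range(0x8000_0000, 0x8000_0000 + 64 * 1024, 64)  # 1024 cache lines
resident = [va for va in lines if should_latch(va, 0.10)]
print(f"{len(resident)} of {len(lines)} lines selected")
```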
• The latch scheme of the cache memory of the present disclosure has been described in detail above with reference to FIGS. 6-11. Based on the idea of the aforementioned latch scheme, and as a supplement to it, another extended application of the present disclosure for the cache memory, namely inter-cluster communication, will be described below in conjunction with FIGS. 12-14.
  • the system on chip here may be the system on chip included in the computing device 201 shown in FIG. 2 , for example, the system on chip constituted by the multi-core computing device 41 .
• The system-on-chip 1200 includes four exemplarily shown clusters, cluster 0 to cluster 3. Since the cluster has been described in detail above, it will not be repeated here.
• A cache memory 1201, which can be provided, for example, in the SRAM 408 previously shown in FIG. 5, is used for performing inter-cluster data transfer operations.
  • the cache memory 1201 can also perform on-chip and off-chip bidirectional communication with DRAM (such as DDR), including the transfer of various types of data or instructions.
  • FIG. 13 is a flowchart illustrating a method 1300 for a system on chip according to an embodiment of the present disclosure.
  • the system on chip here may be the system on chip as shown in FIG. 12 .
  • the system-on-chip includes at least a plurality of clusters for performing computing operations and a cache memory interconnected with the plurality of clusters.
  • each cluster may include multiple processor cores for performing the computing operations.
• The above-mentioned latch area determined in the cache memory can be used to complete inter-cluster data communication, so that the system-on-chip does not need to provide communication modules such as the CDMA 410 and the GDMA 411.
  • the above-mentioned latch area can be used to transfer data between tasks with dependencies, for example, the latch area can be used to transfer data between a producer core and a consumer core.
  • the processor can lock the data that the producer core needs to exchange to the consumer core in the LLC through the configured lock window.
• After the processor finishes executing the producer kernel, it can latch the data that needs to be delivered to the consumer kernel (which may be the input data or the output data of the producer kernel).
  • the processor can perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU, so as to latch the above-mentioned data that needs to be exchanged in the LLC in the window mode, for later use by the consumer kernel.
• The processor can also release the latch area according to the unlock window configured in the consumer kernel. That is, when the processor completes the execution of the consumer kernel by performing a read operation on the data latched in the LLC, it can release the storage space corresponding to that data in the latch area of the LLC.
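The producer/consumer hand-off just described can be sketched as follows. This is a software analogy only: the LLC class and its lock/read/unlock methods are hypothetical stand-ins for the window-mode latch-related requests issued, for example, via the SMMU, not an API defined by the disclosure.

```python
# Software analogy of the producer/consumer hand-off through the latch area;
# the LLC class and kernel functions are hypothetical stand-ins.
class LLC:
    def __init__(self):
        self.latched = {}          # address -> data resident in the latch area

    def lock(self, addr, data):    # lock window: latch-related Lock request
        self.latched[addr] = data

    def read(self, addr):          # hit in the latch area, no DRAM access
        return self.latched[addr]

    def unlock(self, addr):        # unlock window: release the latched space
        self.latched.pop(addr, None)

def producer_kernel(llc, addr):
    output = [x * x for x in range(8)]  # placeholder producer computation
    llc.lock(addr, output)              # keep the output resident for the consumer

def consumer_kernel(llc, addr):
    data = llc.read(addr)               # served from the latch area
    llc.unlock(addr)                    # free the space after the last read
    return sum(data)                    # placeholder consumer computation

llc = LLC()
producer_kernel(llc, 0x1000)
result = consumer_kernel(llc, 0x1000)   # -> 140
```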
  • the latch area can also be used in the application scenario of inter-chip communication.
• A cluster or processor core of the processor transmits data (which may be the data that the producer core needs to exchange to the consumer core) via the latch area to processors in other clusters for merge processing.
  • Processors in other clusters read data from the latch area for processing, thereby realizing inter-chip data transfer.
• For details on how inter-cluster communication is performed using the latch area, please refer to the description below.
  • the present disclosure also includes a method for performing inter-cluster communication using a latch area of a cache memory, the method comprising:
• The specified storage space of the off-chip memory is mapped to a given storage area of the cache (whose physical properties are the same as those of the locking area described above in conjunction with the accompanying drawings), so that the given storage area serves as the cluster storage area for inter-cluster data communication.
• The cache memory may include the LLC, and the off-chip memory may include DDR.
  • the specified storage space may be the storage space specified at 1402 in FIG. 14 .
  • the cluster storage area may be a given storage area in the cache memory at 1404 in FIG. 14 .
• The specified storage space of the DDR can be designated through software configuration and mapped to a given space in the cache for inter-cluster communication (for example, between cluster 0 and cluster 1 shown in FIG. 14).
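For illustration, this software-configured mapping can be modeled as a simple base-and-size translation. The structure and field names below are assumptions, since the disclosure does not define a configuration interface.

```python
# Hedged sketch of mapping a designated DDR range onto the latch area used
# as the cluster storage area; all names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ClusterStoreMap:
    ddr_base: int   # designated storage space in the off-chip DDR
    size: int       # must fit within the configured latch area
    llc_base: int   # given storage area inside the cache

    def to_llc(self, ddr_addr: int) -> int:
        """Translate a DDR address to its resident location in the LLC."""
        assert self.ddr_base <= ddr_addr < self.ddr_base + self.size
        return self.llc_base + (ddr_addr - self.ddr_base)

# e.g. 256 KiB of DDR mapped for cluster 0 <-> cluster 1 communication
m = ClusterStoreMap(ddr_base=0x4000_0000, size=256 * 1024, llc_base=0x0)
assert m.to_llc(0x4000_0040) == 0x40
```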
  • the determined cluster storage area may be used to perform cluster operations.
  • using the cluster store to perform operations of the cluster may include using the cluster store for inter-cluster communication.
  • using the cluster storage area for inter-cluster communication may specifically include: using the cluster storage area to implement point-to-point communication between clusters.
  • the cluster storage area may be used to implement broadcast communication from one of the multiple clusters to other clusters.
• The cluster storage area can be used to receive a write operation of the first cluster for writing data and, in response to a read operation of the second cluster, to send the data previously written by the first cluster to the second cluster.
• The cluster storage area may also be used to receive a lock indication that makes the write data associated with the above-mentioned write operation reside in the cluster storage area, such as the write lock ("write lock") shown in FIG. 14, that is, the above-mentioned latch-related request with the Lock attribute. Then, based on the lock indication, the written data may reside in the cluster storage area, where the cluster storage area may be the latch area determined in the above embodiments. Through such a residency manner, the hit rate in the cache memory of data to be read many times can be significantly improved.
• For example, the producer kernel executing in one of the clusters can lock the data that needs to be exchanged to the consumer kernel in the LLC through the above-mentioned write lock for later use by the consumer kernel; for instance, the producer core transmits data via the LLC to processors in other clusters for merge processing. Processors in other clusters can read the data from the cluster storage area for processing, thereby realizing inter-chip transmission of the data.
• The cluster storage area can also be used to receive a read invalidation indication that causes the write data not to be written back to the off-chip memory, such as the read invalid ("read invalid") indication issued by cluster 1 in FIG. 14.
• The read invalid indication may be a latch-related request with the Invalid attribute; for how such a latch-related request is generated, refer to the description above. In different latch modes, the latch-related requests can differ. Then, after sending the write data to cluster 1, the cluster storage area may invalidate the cache line associated with the write data based on the read invalidation indication.
• In an implementation scenario, the cluster (such as cluster 0) that writes data to the cluster storage area can send a synchronization command to another cluster (such as cluster 1) after the write operation is completed, such as the hsem ("hardware semaphore") in FIG. 14.
• Correspondingly, after reading the data written into the cluster storage area by cluster 0, cluster 1 can send the above-mentioned read invalidation request for the cluster storage area to invalidate the cache line, thereby preventing write-back of the aforementioned data.
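Putting the pieces of FIG. 14 together, the following sketch simulates the hand-shake with threads: cluster 0 writes with a lock indication, signals through a hardware semaphore (hsem), and cluster 1 reads with a read-invalid indication so the line is never written back to DDR. The classes and method names are behavioral stand-ins, not hardware interfaces defined by the disclosure.

```python
# Behavioral model of the FIG. 14 hand-shake; classes and method names are
# illustrative stand-ins for the hardware indications described in the text.
import threading

class ClusterStore:
    def __init__(self):
        self.lines = {}                # addr -> resident write data

    def write_lock(self, addr, data):  # "write lock": data resides in the LLC
        self.lines[addr] = data

    def read_invalid(self, addr):      # "read invalid": read, then drop the line
        return self.lines.pop(addr)    # popped line is never written back to DDR

store = ClusterStore()
hsem = threading.Semaphore(0)          # stand-in for the hardware semaphore
result = []

def cluster0():
    store.write_lock(0x100, "partial-sum")  # write with the lock indication
    hsem.release()                          # synchronize after the write completes

def cluster1():
    hsem.acquire()                          # wait for cluster 0's signal
    result.append(store.read_invalid(0x100))

t0 = threading.Thread(target=cluster0)
t1 = threading.Thread(target=cluster1)
t1.start(); t0.start()
t0.join(); t1.join()
assert result == ["partial-sum"]
```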
• The above-mentioned behaviors of writing data to and reading data from the cluster storage area can also be collectively referred to as latch-related operations triggered by latch-related requests; for how such latch-related requests are determined, see the description above.
  • the latch-related request may be used to indicate a latch operation. Through the latch operation, the data will be latched in the cluster storage area for subsequent multiple uses. Further, the latch-related request can be used to indicate a release operation, and through the release operation, data can be unlocked from the cluster storage area to release more storage space for subsequent data latches. It can be understood that, through the release operation, the storage space of the cluster storage area can be used flexibly, thereby improving the usage efficiency of the cluster storage area in the present disclosure.
  • the data or a selected part of the data may be released from the specified area of the cluster storage area according to a latch-related request.
  • a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area.
  • hash algorithm can be used to select a predetermined proportion of data from the data as the aforementioned partial data to be latched in the cluster storage area.
• The electronic equipment or devices disclosed in this disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
• Said vehicles include airplanes, ships, and/or cars; said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; said medical equipment includes nuclear magnetic resonance instruments, B-ultrasound machines, and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
• Electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
• The hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, so as to achieve unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-end integration.
• The present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the descriptions of some embodiments in this disclosure also have different emphases. In view of this, for the parts that are not described in detail in a certain embodiment of the present disclosure, those skilled in the art may refer to the related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
• The above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, and it can include several instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
• The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("Read Only Memory", ROM), a random access memory ("Random Access Memory", RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP, and ASIC.
• The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory ("Resistive Random Access Memory", RRAM), a dynamic random access memory ("Dynamic Random Access Memory", DRAM), a static random access memory ("Static Random Access Memory", SRAM), an enhanced dynamic random access memory ("Enhanced Dynamic Random Access Memory", EDRAM), a high bandwidth memory ("High Bandwidth Memory", HBM), a hybrid memory cube ("Hybrid Memory Cube", HMC), ROM, RAM, etc.
• Clause A1. A method for a cache memory, comprising:
• Clause A2. The method of Clause A1, wherein the plurality of latch modes are performed in a predetermined order of priority.
• Clause A3. The method of Clause A1 or A2, wherein the plurality of latch modes include an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, a stream mode for performing latch-related operations based on data streams, and/or a page mode for performing latch-related operations based on cache pages.
• Clause A4. The method of Clause A3, wherein in the instruction mode, the latch-related request is determined according to the hardware instruction; in the page mode, the latch-related request is determined according to a cache page configuration; and in the window mode or the stream mode, the latch-related request is determined according to a lock window.
• Clause A5. The method of Clause A4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can be accompanied by a Lock attribute, the Lock attribute being used to indicate that specific data is retained in the latch area, the specific data being a part of the data selected according to a hash algorithm.
• Clause A6. The method of Clause A3 or A4, wherein in the page mode, the method comprises: performing the cache-page-based latch operation according to the linear mapping window of the system memory management unit.
• Clause A7. The method of Clause A3, wherein configuring the latch area to support the plurality of latch modes comprises:
• Clause A8. The method of Clause A7, wherein for a write operation to the latch area, the method includes latching the data, or a selected portion of the data, in a specified area of the latch area according to the latch-related request, for subsequent multiple reads.
• Clause A9. The method of Clause A7, wherein for a read operation of the latch area, the method comprises, after performing the read operation, releasing the data, or the selected portion of the data, from the specified area of the latch area according to the latch-related request.
• Clause A10. A cache memory, comprising: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes; and a latch execution module configured to receive a latch-related request and, according to that request, perform a latch-related operation on the data in the latch area in the corresponding latch mode.
• Clause A11. A system-on-chip, comprising: the cache memory of Clause A10; and a processor configured to generate the latch-related request; wherein the latch execution module of the cache memory is configured to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request.
• Clause A12. The system-on-chip of Clause A11, wherein the latch modes comprise an instruction mode, and in the instruction mode, the processor is configured to generate the latch-related request according to a received hardware instruction.
• Clause A13. The system-on-chip of Clause A11, wherein the latch modes comprise a page mode, and in the page mode, the processor is configured to generate the latch-related request according to a cache page configuration.
• Clause A14. The system-on-chip of Clause A11, wherein the processor comprises a task scheduler including a configurator and a scheduling unit, wherein: the configurator is configured to generate a configuration instruction according to an assigned configuration task and send it to the configuration module of the cache memory; and the scheduling unit is configured to schedule the multiple tasks in the task scheduler and send them to the processor core.
• Clause A15. The system-on-chip of Clause A14, wherein the configuration instruction includes configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
• Clause A16. The system-on-chip of Clause A15, wherein the processor further comprises a system memory management unit configured to, in the window mode or the stream mode:
• Clause A17. The system-on-chip of Clause A16, wherein the configuration items of the lock window include one or more of the following: a base address of the window, corresponding to the start address of the data on which latch-related operations are to be performed; a size of the window, corresponding to the size of that data; and a latch ratio, indicating the proportion of data to be actually latched among the data on which latch-related operations are to be performed.
• Clause A18. The system-on-chip of Clause A17, wherein the processor is further configured to use a hash algorithm to select, from the data whose access addresses fall within the address range of the lock window, the portion of the data that can be locked in the latch area.
• Clause A19. The system-on-chip of Clause A17, wherein the processor is configured to randomly select, according to a hash algorithm, the portion of data satisfying the predetermined latch ratio from the data to be latched, and to generate the latch-related request for latching that portion in the latch area.
• Clause A20. The system-on-chip of Clause A14, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to latch the written data, or a selected portion of it, in a specified area of the latch area according to the latch-related request; and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data, after the read operation is performed, from the specified area of the latch area according to the latch-related request.
• Clause A21. The system-on-chip of any one of Clauses A16-A20, wherein the tasks include producer kernels and consumer kernels, wherein: when executing a producer kernel, the processor is configured to latch the data output by the producer kernel in the latch area through the latch-related request, for use by a consumer kernel; and when executing a consumer kernel, the processor is configured to read data from the latch area and, after reading the data, to unlock the data from the latch area through the latch-related request, so as to release the storage space occupied by the data in the latch area.
• Clause A23. A computing device comprising the board of Clause A22.
• Clause B1. A method for a system-on-chip, the system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a plurality of processor cores for performing the computational operations, the method comprising: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and using the cluster storage area to perform operations of the cluster.
• Clause B2. The method of Clause B1, wherein using the cluster storage area to perform operations of the cluster comprises using the cluster storage area for inter-cluster communication.
• Clause B3. The method of Clause B2, wherein using the cluster storage area for inter-cluster communication comprises: using the cluster storage area to implement point-to-point communication between clusters; or using the cluster storage area to implement broadcast communication from one of the multiple clusters to the other clusters.
• Clause B4. The method of Clause B3, wherein using the cluster storage area to implement point-to-point communication between clusters comprises: receiving a write operation of a first cluster for writing data; and sending the write data to a second cluster in response to a read operation by the second cluster.
• Clause B5. The method of Clause B4, wherein in the write operation, the method further comprises: receiving a lock indication for making the write data reside in the cluster storage area; and residing the write data in the cluster storage area based on the lock indication.
• Clause B6. The method of Clause B4 or B5, wherein in the read operation, the method further comprises: receiving a read invalidation indication that causes the write data not to be written back to the off-chip memory; and invalidating a cache line associated with the write data based on the read invalidation indication after the write data is sent to the second cluster.
• Clause B7. A system-on-chip, comprising: a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing computational operations; and a cache memory interconnected with the plurality of clusters and configured to: use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform operations of the clusters.
• Clause B8. The system-on-chip of Clause B7, wherein the cluster storage area is configured for inter-cluster communication.
• Clause B9. The system-on-chip of Clause B8, wherein the cluster storage area is configured for point-to-point communication between clusters or for broadcast communication from one of the plurality of clusters to the remaining clusters.
• Clause B10. The system-on-chip of Clause B9, wherein in the point-to-point communication, the cluster storage area is configured to: receive a write operation of a first cluster for writing data; and send the write data to a second cluster in response to a read operation by the second cluster.
• Clause B11. The system-on-chip of Clause B10, wherein the second cluster is configured to perform the read operation on the cluster storage area.
• Clause B12. The system-on-chip of Clause B10, wherein in the write operation, the first cluster is configured to send to the cluster storage area a lock indication for making the write data reside in the cluster storage area, so that the cluster storage area makes the write data reside based on the lock indication.
• Clause B13. The system-on-chip of Clause B12, wherein in the read operation, the second cluster is configured to send to the cluster storage area a read invalidation indication that causes the write data not to be written back to the off-chip memory, so that the cluster storage area invalidates the cache line associated with the write data based on the read invalidation indication.
• Clause B14. A computing device comprising the system-on-chip of any one of Clauses B7-B13.
• Clause B15. A board comprising the computing device of Clause B14.
• Clause B16. A computing device comprising the board of Clause B15.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method for a cache memory, a cache memory, a system on chip, a board, and a computing device. The computing device is embodied by a computing processing means comprised in a combined processing means (20); the combined processing means (20) may also comprise a universal interconnection interface and other processing means. The computing processing means interacts with other processing means to jointly complete a computing operation specified by a user. The combined processing means (20) may also comprise a storage means, and the storage means is separately connected to the computing processing means and other processing means to store data of the computing processing means and other processing means. The combined processing means (20) can improve use efficiency of the cache memory.

Description

Method for cache memory and related products

Cross References to Related Applications

This disclosure claims priority to the following Chinese patent applications: Chinese patent application No. 202110926703.7, entitled "Method for System-on-Chip and Related Products", filed on August 12, 2021; and Chinese patent application No. 202110926707.5, entitled "Method for Cache Memory and Related Products", filed on August 12, 2021.

Technical Field

The present disclosure generally relates to the field of chip technology. More specifically, the present disclosure relates to a method for a cache memory, a cache memory, a system-on-chip including the cache memory, a board including the system-on-chip, and a computing device including the board.

Background

The operational performance of a computing system is largely determined by the average memory access latency. System performance can be significantly improved by effectively reducing the number of memory accesses through increasing the hit rate of the cache memory (referred to as the "cache"). To this end, processors typically employ a cache mechanism and use the cache to accommodate the mismatch in speed and performance between the processor and slow main memory. Current caches implement a multi-level cache mechanism, such as a three-level cache (L1, L2, and L3), and the cache closest to main memory is called the last level cache ("Last Level Cache", LLC). In view of the frequent use and important role of the cache in systems-on-chip, an effective management strategy is needed to improve cache utilization and reduce the number of accesses to main memory. In addition, how to expand the application of the LLC for different scenarios has also become a problem to be solved.

Summary

In view of the technical problems mentioned in the background section above, the present disclosure provides a residency scheme for a cache memory. Through the solution of the present disclosure, a specific area in the cache memory can be configured as a locked area, and data used multiple times can reside in the locked area, thereby improving the cache hit rate and the overall performance of the system. Based on this, the present disclosure provides solutions for a cache memory in the following aspects.

In a first aspect, the present disclosure provides a method for a cache memory, including: configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch area; receiving a latch-related request for performing a latch-related operation on the data in the latch area; and performing, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
In a second aspect, the present disclosure provides a cache memory, including: a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein each of the latch modes corresponds to a latch-related operation performed on data in the latch area; and a latch execution module configured to: receive a latch-related request for performing a latch-related operation on the data in the latch area; and perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.

In a third aspect, the present disclosure provides a system-on-chip, including the cache memory as described above and in the following embodiments, and a processor configured to generate the latch-related request, wherein the latch execution module of the cache memory is configured to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request.

In a fourth aspect, the present disclosure provides a board, including the system-on-chip as described above and in the following embodiments.

In a fifth aspect, the present disclosure provides a computing device, including the board as described above and in the following embodiments.

According to the solutions provided in the above aspects of the present disclosure, the latch area can be used to perform latch and unlock operations on data used multiple times, thereby significantly improving the cache hit rate. Further, since the latch area of the present disclosure supports multiple latch modes, which can be selected and used according to the configuration, the application scenarios of the latch area are expanded. When used in the scenario of a producer kernel and a consumer kernel, the latch area of the present disclosure can serve as a medium for data transfer, thereby improving the accessibility and utilization of data. In addition, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the computing system.

In addition, in view of the technical problems mentioned in the background section above, the present disclosure proposes to expand the usage scenarios of the cache memory. To this end, the present disclosure provides solutions for a system-on-chip in the following aspects.

In a sixth aspect, the present disclosure provides a method for a system-on-chip, the system-on-chip including at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the computational operations. The method includes: mapping a designated storage space of the off-chip memory to a latch area of the cache memory, so as to use the latch area as a cluster storage area for inter-cluster data communication; and using the cluster storage area to perform operations of the cluster.

In a seventh aspect, the present disclosure provides a system-on-chip, including: a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing computational operations; and a cache memory interconnected with the plurality of clusters and configured to: use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to perform the operations of the cluster.

In an eighth aspect, the present disclosure provides a computing device, including the system-on-chip as described above and in the following embodiments.

In a ninth aspect, the present disclosure provides a board, including the computing device as described above and in the following embodiments.

In a tenth aspect, the present disclosure provides a computing device, including the board as described above and in the following embodiments.

According to the solutions provided in the above aspects of the present disclosure, the latch area of the cache memory can be used to realize efficient inter-cluster communication within the system-on-chip. Thus, data that would otherwise need to be transferred through the off-chip memory can be transferred directly through the latch area, thereby speeding up data access and significantly improving the cache hit rate. Further, since the probability of a cache hit is increased through the latch area, the solution of the present disclosure also significantly improves the overall performance of the system-on-chip. In addition, the division of the latch area simplifies the management of the cache memory and expands its usage scenarios. With the help of the latch area, the multiple clusters of the system-on-chip can implement various flexible communication mechanisms, thereby also improving the operational performance of the clusters.
Description of Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of illustration and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts, wherein:

FIG. 1 is a structural diagram of a board according to an embodiment of the present disclosure;

FIG. 2 is a structural diagram of an integrated circuit device according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of the internal structure of a single-core computing device according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the internal structure of a processor core according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of a method for a cache memory according to an embodiment of the present disclosure;

FIG. 7 is a simplified block diagram of a cache memory according to an embodiment of the present disclosure;

FIG. 8 is a simplified block diagram of a system-on-chip according to an embodiment of the present disclosure;

FIG. 9 is a detailed block diagram of a system-on-chip according to an embodiment of the present disclosure;

FIG. 10 is a schematic block diagram of a page mode according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram of a hash operation in window mode according to an embodiment of the present disclosure;

FIG. 12 is a simplified block diagram of a system-on-chip according to an embodiment of the present disclosure;

FIG. 13 is a flowchart of a method for a system-on-chip according to an embodiment of the present disclosure; and

FIG. 14 is an operation block diagram of a system-on-chip according to an embodiment of the present disclosure.
Detailed Description

The following will clearly and completely describe the technical solutions in the embodiments of the present disclosure with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some of the embodiments of the present disclosure rather than all of them, and the described embodiments can be appropriately combined according to the scenario to achieve different applications. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without making creative efforts fall within the protection scope of the present disclosure.

It should be understood that the terms "first", "second", and "third" that may be used in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific sequence. The terms "comprising" and "including" used in the specification and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in the specification of the present disclosure is for the purpose of describing specific embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".

Specific embodiments of the present disclosure will be described in detail below in conjunction with the accompanying drawings.
FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 1 are only an example and are not intended to limit the solution of the present disclosure in any respect.

As shown in FIG. 1, the board 10 includes a chip 101, which may be a system-on-chip (System on Chip, SoC), that is, the system-on-chip described in the context of the present disclosure. In an implementation scenario, it may be integrated with one or more combined processing devices. The aforementioned combined processing device may be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining; deep learning technology in particular is widely applied in the field of cloud intelligence. A notable feature of cloud intelligence applications is the large amount of input data, which places high requirements on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and powerful computing power.

As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102. According to different application scenarios, the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. The data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102. The computation result of the chip 101 can be sent back to the external device 103 via the external interface device 102. According to different application scenarios, the external interface device 102 may have different interface forms, such as a PCIe interface.

The board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transmission. The control device 106 in the board 10 may be configured to regulate the state of the chip 101. To this end, in an application scenario, the control device 106 may include a micro controller unit (MCU).

FIG. 2 is a structural diagram of the combined processing device in the chip 101 according to the above embodiment. As shown in FIG. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.

The computing device 201 may be configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning computations, and it can also interact with the processing device 203 through the interface device 202 to jointly complete the operations specified by the user.

The interface device 202 can be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read the data in the storage device of the computing device 201 and transmit it to the processing device 203.

As a general-purpose processing device, the processing device 203 performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. According to different implementations, the processing device 203 may be one or more types of processors among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number can be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure alone can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.

The DRAM 204 is used to store the data to be processed; it is a DDR memory, usually 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core. The single-core computing device 301 is used to process input data for computer vision, speech, natural language, data mining, and the like, and includes three modules: a control module 31, an operation module 32, and a storage module 33.

The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks, and it includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is used to obtain instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.

The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution. The storage module 33 is used to store or transport related data and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep learning network; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.

FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core device. The multi-core computing device 41 adopts a hierarchical design. As a system-on-chip, it includes at least one cluster according to the present disclosure, and each cluster in turn includes multiple processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip/cluster/processor-core hierarchy. At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnection module 403, a synchronization module 404, and multiple clusters 405.

There may be multiple external storage controllers 401 (two are exemplarily shown in the figure), which are used to access the external storage device, that is, the off-chip memory in the context of the present disclosure (for example, the DRAM 204 in FIG. 2), in response to access requests issued by the processor cores, so as to read data from or write data to the off-chip memory. The peripheral communication module 402 is used to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to execute tasks. The on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402, and the multiple clusters 405 to transmit data and control signals among the modules. The synchronization module 404 is a global barrier controller (GBC) used to coordinate the work progress of each cluster and ensure the synchronization of information. The multiple clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41. Although four clusters are exemplarily shown in FIG. 4, with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In an application scenario, the clusters 405 can be used to efficiently execute deep learning algorithms.

At the cluster level, as shown in FIG. 4, each cluster 405 may include multiple processor cores (IPU cores) 406 and one storage core (MEM core) 407, which may include, for example, the cache memory (e.g., the LLC) described in the context of the present disclosure.

Four processor cores 406 are exemplarily shown in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in FIG. 5. Each processor core 406 is similar to the single-core computing device 301 of FIG. 3 and may likewise include three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, operation module 52, and storage module 53 are roughly the same as those of the control module 31, operation module 32, and storage module 33, and will not be repeated here. It should be noted that the storage module 53 may include an input/output direct memory access module (IODMA) 533 and a move direct memory access module (MVDMA) 534. The IODMA 533 controls memory access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control memory access between the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is mainly used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, executing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the memory core 407 may have scalar operation capability to perform scalar operations.
The memory core 407 may include a static random-access memory (SRAM) 408, a broadcast bus 409, a cluster direct memory access module (CDMA) 410 and a global direct memory access module (GDMA) 411. In one implementation scenario, the SRAM 408 can assume the role of a high-performance data transfer station. Data reused among different processor cores 406 within the same cluster 405 thus does not need to be obtained by each processor core 406 separately from the DRAM 204, but is relayed among the processor cores 406 through the SRAM 408. Further, the memory core 407 only needs to quickly distribute the reused data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and significantly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409, the CDMA 410 and the GDMA 411 are used, respectively, for communication among the processor cores 406, communication among the clusters 405, and data transmission between the clusters 405 and the DRAM 204. They are described separately below.
The broadcast bus 409 is used for high-speed communication among the processor cores 406 within a cluster 405. The broadcast bus 409 of this embodiment supports inter-core communication methods including unicast, multicast and broadcast. Unicast refers to point-to-point data transmission (e.g., from a single processor core to a single processor core); multicast is a communication method that transmits one piece of data from the SRAM 408 to several specific processor cores 406; and broadcast, which transmits one piece of data from the SRAM 408 to all processor cores 406, is a special case of multicast.
The CDMA 410 controls memory accesses to the SRAM 408 among different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control memory accesses from the SRAM 408 of a cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be realized in two ways. The first way is to communicate directly between the DRAM 204 and the NRAM 531 or WRAM 532 through the IODMA 533; the second way is to first transfer data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then transfer the data between the SRAM 408 and the NRAM 531 or WRAM 532 via the MVDMA 534. Although the second way may involve more components and a longer data path, in some embodiments its bandwidth is in fact much larger than that of the first way, so performing communication between the DRAM 204 and the NRAM 531 or WRAM 532 in the second way may be more efficient. It can be understood that the data transmission methods described here are merely exemplary, and those skilled in the art can, in light of the teachings of the present disclosure, flexibly select and apply various data transmission methods according to the specific arrangement of the hardware.
In other embodiments, the function of the GDMA 411 and the function of the IODMA 533 may be integrated in the same component. Although the present disclosure treats the GDMA 411 and the IODMA 533 as different components for convenience of description, for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they belong to the protection scope of the present disclosure. Further, the function of the GDMA 411, the function of the IODMA 533, the function of the CDMA 410 and the function of the MVDMA 534 may also be realized by the same component.
The hardware architecture of the present disclosure and its internal structure have been described in detail above with reference to FIGS. 1-5. It can be understood that the above description is merely exemplary and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also make changes to the board card of the present disclosure and its internal structure, and such changes still fall within the protection scope of the present disclosure. For example, in the solutions described below, the corresponding hardware architecture may omit the CDMA 410 used to control accesses to the SRAM 408 among different clusters 405 within the same computing device 201. Instead, the solutions below involve improving and optimizing the cache memory arranged, for example, between the SRAM 408 and the DRAM 204, so as to realize efficient on-demand latching of data, as well as communication among different clusters, through the cache memory.
In order to use the cache memory (e.g., the LLC) efficiently and improve the hit rate of data accesses, the solutions below propose configuring a specific storage space in the cache memory as a latch area for data latch operations, especially for data that will be used frequently. For example, such frequently used data may be data to be reused between tasks having a data dependency. It can be understood that when data only needs to be used once, it need not be latched in the cache memory.
Further, on the basis of configuring a latch area for data latching, the solutions below also propose configuring the cache memory to support multiple latch modes, so that upon receiving a latch-related request the cache memory operates in the latch mode corresponding to that request. According to different application scenarios and requirements, the multiple latch modes of the present disclosure may have a specific priority order to satisfy different latch-related operations. In addition, in order to make the cache memory support multiple latch modes, the present disclosure also proposes several different configuration methods, so that the cache memory can be used more flexibly and can be employed to realize inter-cluster communication.
FIG. 6 is a flowchart illustrating a method 600 for a cache memory according to an embodiment of the present disclosure. As shown in FIG. 6, the method 600 includes, at step S602, configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes. In one embodiment, the aforementioned latch modes may include, but are not limited to, an instruction mode that performs latch-related operations based on hardware instructions, a window mode that performs latch-related operations based on window attributes, a stream mode that performs latch-related operations based on data streams, and/or a page mode that performs latch-related operations based on cache pages. In one embodiment, the aforementioned streams may be instruction streams or data streams of different types. Taking a data stream as an example, in a neural network application scenario the data stream may be a neuron data stream, a weight data stream, an output result data stream, etc. of the neural network model. In addition, in the context of the present disclosure, the data targeted by latch-related operations is data that will be used multiple times by a processor of the system-on-chip, and it has a relatively high priority compared with data on which no latch operation is performed. By latching (or making resident) such multiply-used data in the latch area of the present disclosure, the cache hit rate can be significantly increased, thereby improving the overall performance of the system. Moreover, keeping reused data resident in the latch area of the LLC reduces read and write operations between the system-on-chip and the off-chip memory (e.g., DDR or DRAM), which also improves memory access efficiency.
In one application scenario, the above latch modes can be given different priorities according to user preferences or system preferences. For example, in one implementation the priority order from high to low may be instruction mode -> window mode -> stream mode -> page mode; in another implementation the priority order may be instruction mode -> page mode -> stream mode -> window mode. Through such multi-mode and priority settings, the latch area in the cache memory can be used in more ways, increasing the flexibility of its use to cope with different application scenarios and system requirements. Further, the latch modes may be traversed in the above priority order: when a higher-priority latch mode is disabled, a lower-priority latch mode may be adopted, as illustrated in the sketch below.
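As a minimal sketch of such priority traversal (all names here are hypothetical and not part of the disclosed hardware), the following C snippet selects the highest-priority enabled latch mode under the first ordering described above:

```c
#include <stdbool.h>

/* Hypothetical enumeration of the four latch modes, listed in the
 * first priority ordering: instruction > window > stream > page. */
typedef enum {
    MODE_INSTRUCTION = 0,
    MODE_WINDOW,
    MODE_STREAM,
    MODE_PAGE,
    MODE_COUNT,
    MODE_NONE = -1          /* no mode enabled: fall back to normal caching */
} latch_mode_t;

/* Walk the modes in priority order; a disabled high-priority mode
 * simply yields to the next lower-priority one. */
latch_mode_t select_latch_mode(const bool enabled[MODE_COUNT])
{
    for (int m = MODE_INSTRUCTION; m < MODE_COUNT; ++m) {
        if (enabled[m])
            return (latch_mode_t)m;
    }
    return MODE_NONE;
}
```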
In one embodiment, the specific storage space can be configured as a latch area supporting a corresponding latch mode according to one of several received configuration instructions. In one scenario, the configuration instruction may include one or more configuration items for configuring the latch area. For example, the configuration items may include items for enabling the latch area, disabling the latch area and/or setting the size of the latch area. Further, a corresponding latch policy (for example, the amount of data to be latched or the specific data to be latched) can be configured in the aforementioned instruction mode, window mode, stream mode or page mode, so as to latch different types of, or specific, instructions, data or data streams. Configuring the corresponding latch policy in the different modes is described in detail below. Through such enabling, disabling and various specific configurations, the solution of the present disclosure allows the cache memory to be used flexibly, so that it can operate in one of the latch modes of the present disclosure, or in the normal mode, as required.
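Purely for illustration, a configuration instruction carrying such items might be modeled as the following C structure; the field names are assumptions, not the actual instruction encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a latch-area configuration instruction. */
typedef struct {
    bool     lock_enable;      /* enable the latch area                    */
    bool     lock_disable;     /* disable the latch area                   */
    uint8_t  num_locked_ways;  /* latch-area size, expressed in cache ways */
    uint32_t policy_bytes;     /* latch policy: amount of data to latch    */
} latch_area_config_t;

/* A valid configuration keeps the latch area strictly smaller than the
 * whole cache, e.g. way0-way5 latched with way6-way7 left as normal cache. */
static inline bool latch_config_valid(const latch_area_config_t *cfg,
                                      uint8_t total_ways)
{
    return cfg->num_locked_ways < total_ways;
}
```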
Returning to the flowchart of FIG. 6, after the configuration operation at step S602 is completed, at step S604 a latch-related request for performing a latch-related operation on data in the latch area is received. According to an embodiment of the present disclosure, the latch-related request may be triggered by an operation intended to make specific data reside in the latch area. Alternatively, the latch-related request may be triggered by an operation intended to remove or release specific data from the latch area. As described in detail above, when operating in different latch modes, the latch-related requests of the present disclosure may also take different forms or carry different contents. For example, for the instruction mode, the window mode or the stream mode, the latch-related request may include a configuration item for indicating a behavior attribute of the cache memory.
In one embodiment, the above configuration item for indicating a behavior attribute of the cache memory includes at least one of the following configuration attributes:
Transient attribute: do not cache in the LLC, i.e., read and write data directly with the off-chip memory (e.g., DDR); used so that data accessed only once is not cached in the LLC, thereby avoiding occupying LLC resources;
Lock attribute: make specific data reside in the latch area, and read and write data from the hit cache line. If the cache line belongs to the latch area, the cache line attribute is configured as the persisting attribute; if the cache line does not belong to the latch area, its attribute is unchanged, i.e., it keeps the normal attribute described below. It should be clear that a cache line in the latch area has one of two attributes, namely the persisting attribute and the normal attribute; a cache line in the latch area with the persisting attribute can only be accessed and replaced by latch-related requests carrying the Lock attribute;
Unlock attribute: after reading or writing data from the hit cache line, release the corresponding storage space of the data in the latch area of the LLC, and set the attribute of the corresponding cache line in the latch area to the normal attribute described below;
Normal attribute: a request cached normally in the LLC, which can read and write data directly with the off-chip memory;
Invalid attribute: invalidate the data immediately after reading, so that it is not written back to the off-chip memory upon replacement;
Clean attribute: when performing a write operation, the data can be written into the hit cache line and the stored content of the entire cache written back to the off-chip memory, with the attribute of the cache line unchanged; during a read operation, data is read from the hit cache line, and when the hit cache line is dirty, it is written back to the off-chip memory;
Default attribute: this default item can be used to indicate that the configuration regarding the latch mode is ignored.
By attaching the above exemplary configurable attributes to latch-related requests, the solution of the present disclosure can perform the corresponding latch-related operations in the instruction mode according to these attached attributes.
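The following C sketch, using invented names, summarizes how a cache policy module might act on these request attributes; it is a schematic of the behaviors listed above, not the actual hardware logic:

```c
#include <stdbool.h>

typedef enum {           /* behavior attribute carried by a request */
    ATTR_TRANSIENT,      /* bypass the LLC, access off-chip memory directly */
    ATTR_LOCK,           /* make the data reside in the latch area          */
    ATTR_UNLOCK,         /* release the latched space after the access      */
    ATTR_NORMAL,         /* ordinary caching                                */
    ATTR_INVALID,        /* invalidate after reading, never write back      */
    ATTR_CLEAN,          /* write back dirty content, keep line attributes  */
    ATTR_DEFAULT         /* ignore the latch-mode configuration             */
} req_attr_t;

typedef enum { LINE_NORMAL, LINE_PERSISTING } line_state_t;

/* On a hit in the latch area, a Lock request pins the line, and an
 * Unlock request returns it to the normal state. */
static void update_line(line_state_t *line, req_attr_t attr, bool in_latch_area)
{
    if (attr == ATTR_LOCK && in_latch_area)
        *line = LINE_PERSISTING;   /* only Lock requests may access/replace it now */
    else if (attr == ATTR_UNLOCK && *line == LINE_PERSISTING)
        *line = LINE_NORMAL;       /* free the space for later latching            */
}
```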
As another example, for the page mode, a latch-related request may indicate that data related to a specific page is to be latched in the latch area for subsequent multiple uses, or may indicate that data related to a specific page, after multiple uses, is to be unlocked from the latch area to free more storage space for subsequent data latching. It can be understood that through the release operation the storage space of the latch area can be used flexibly, which improves the utilization efficiency of the latch area of the present disclosure.
Returning to the flow of FIG. 6, in response to the latch-related request of step S604, at step S606 a latch-related operation can be performed on the data in the latch area in the corresponding latch mode according to the latch-related request. According to an embodiment of the present disclosure, the aforementioned latch-related operations may include read operations and write operations directed at the latch area. In one implementation, for a write operation directed at the latch area, the method 600 may further include latching the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request, for subsequent multiple reads. In another implementation, for a read operation directed at the latch area, the method 600 may further include, after the read operation has been performed, releasing the data or a selected portion of the data from the designated region of the latch area according to the latch-related request.
Regarding the aforementioned selected portion of data, in one embodiment a predetermined proportion of the data can be selected randomly to form this portion to be latched in the latch area. In another embodiment, a hash algorithm can be used to select a predetermined proportion of the data as this portion to be latched in the latch area. In a further embodiment, when the access address of the data on which a latch-related operation is to be performed falls within the address range of the lock window, the aforementioned hash algorithm can be used to select the portion of the data that can be latched in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
With the method described above in conjunction with FIG. 6, the solution of the present disclosure enables the cache memory to support multiple latch modes, which expands the application scenarios of the cache memory and significantly improves the cache hit rate. Further, the introduction of multiple latch modes makes the use of the latch area more flexible and adaptable, satisfying different application scenarios and user requirements. In addition, effectively latching data in the latch area also promotes the sharing of data between a producer kernel and one or more consumer kernels, improving data accessibility and utilization. The producer kernel and the consumer kernel here can be understood as two tasks having a dependency, where the output of the producer kernel is passed as input to the consumer kernel so that the consumer kernel can use this input to complete its task. In this case, since the output of the producer kernel serves as the input of subsequent operations, it can be treated as data that will be used multiple times later and temporarily stored in the latch area of the cache memory, so that the consumer kernel can obtain this input directly from the cache memory without accessing the off-chip memory. This reduces the memory access interactions between the artificial intelligence processor and the off-chip memory and lowers the IO access overhead, which in turn improves the processing efficiency and performance of the artificial intelligence processor.
FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the present disclosure. It can be understood that the cache memory 700 shown in FIG. 7 may be the cache memory described in conjunction with FIG. 6, so the description of the cache memory in FIG. 6 also applies to the description of FIG. 7 below.
As shown in FIG. 7, the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702. The cache memory 700 further includes storage space for performing cache operations; for example, as shown in the figure, the storage space is evenly divided into eight ways (way0-way7), each of which includes a number of cache lines.
In one embodiment, the above configuration module can be used to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, where the size of the specific storage space is smaller than the total storage size of the cache memory. For example, way0-way5 in FIG. 7 can be configured as the specific storage space supporting latching, while way6-way7 in FIG. 7 keep the ordinary attributes of the cache memory, i.e., are used as a general cache. As mentioned above, the latch mode may be the instruction mode, the window mode, the stream mode and/or the page mode. Further, the latch execution module can be used to receive latch-related requests for performing latch-related operations on data in the latch area, and then perform those operations on the data in the latch area in the corresponding latch mode according to the request. As described above, the latch-related operations here may include a write operation directed at the latch area (i.e., writing data into the latch area) or releasing data from the latch area. For example, when a consumer kernel has finished using the data in the latch area and the data will no longer be used by other consumer kernels, the space storing that data in the latch area can be released for latching other data.
FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the present disclosure. As shown in FIG. 8, the system-on-chip 800 of the present disclosure may include the cache memory 700 shown in FIG. 7 and a processor (or processor core) 802. In one implementation, the latch execution module of the cache memory can be used to perform latch-related operations on the data in the latch area in the corresponding latch mode according to the latch-related request. The cache memory 700 has been described above in conjunction with FIGS. 6 and 7 and is not repeated here. As for the processor 802, according to the solution of the present disclosure it may be any of various types of processors and may include one or more processor cores to generate latch-related requests. In operation, the latch execution module of the cache memory performs latch-related operations on data in the latch area in the corresponding latch mode according to the generated latch-related request. For example, when the latch mode is the instruction mode, the processor can generate latch-related requests according to received hardware instructions. As another example, when the latch mode is the page mode, the processor can generate latch-related requests according to the cache page configuration. As yet another example, when the latch mode is the window mode or the stream mode, the processor can configure a lock window and generate latch-related requests according to the lock window.
According to different implementations, the processor 802 may also be an intelligent processor or intelligence processing unit ("IPU") including multiple computing cores, which can be configured to perform computations in various artificial intelligence fields (for example, neural networks).
FIG. 9 is a detailed block diagram illustrating a system-on-chip 900 according to an embodiment of the present disclosure. It can be understood that the system-on-chip 900 shown here may be a specific implementation of the system-on-chip shown in FIG. 8, so the description of FIG. 8 also applies to FIG. 9. Further, for the purpose of example only, the operation of the system-on-chip 900 will be described in the window mode (or stream mode) among the multiple latch modes.
As shown in FIG. 9, the system-on-chip 900 may include a task scheduler ("Job Scheduler") 902, which includes a scheduling unit 903 and a configurator 904. In one embodiment, the configurator 904 can generate configuration instructions according to assigned configuration tasks (obtained, for example, from a task queue) and send them to the configuration module (e.g., the CLR) of the cache memory (i.e., the "LLC" 906). In one embodiment, the scheduling unit 903 can schedule multiple tasks in the task scheduler (i.e., the "kernels" to be executed on the artificial intelligence processor) and send them to the intelligent processor (IPU) 905 of the system-on-chip of the present disclosure. In the solution of the present disclosure, the intelligent processor 905 may include multiple processor cores, which may form a cluster as shown in FIG. 4. In one implementation scenario, in such a multi-processor-core architecture, the scheduling unit can assign tasks to suitable processor cores according to the idleness (e.g., utilization) of the processor cores.
Further, the system-on-chip 900 also includes a system memory management unit ("SMMU"), which converts the virtual addresses of accessed data into physical addresses so that the relevant storage locations can be accessed according to the physical addresses. In one implementation, the system memory management unit includes a translation lookaside buffer (TLB). The TLB maintains a page table that includes at least one page table entry, each entry including a page and the page frame corresponding to that page. In operation, the system memory management unit can determine the page corresponding to a received virtual address, and then determine the physical address (PA) corresponding to the virtual address through the page-to-page-frame mapping, so that access to the relevant storage location of the cache memory can be realized according to that physical address.
In one embodiment, access to the cache memory can be realized through the above window mode or stream mode. In this case, the intelligent processor can obtain a parameter table from the memory, configure, according to the parameter table, a lock window ("Lock window") associated with the data on which latch-related operations are to be performed, and generate latch-related requests (i.e., IO access requests carrying, for example, lock/unlock attributes) according to the configured lock window. The SMMU can then perform latch-related operations on the LLC according to the IO access request. Specifically, the SMMU can send the IO access request to the cache policy module 907 of the LLC 906 (which performs the same operations as the latch execution module 702 in FIG. 7) for execution. In one implementation, the parameter table may include parameter items for configuring the lock window or the stream latch attributes in the stream mode. For example, the parameter items may include, but are not limited to, the lock/unlock window, per-stream lock/unlock, the lock ratio ("Lock Ratio"), the lock window flag, and other information. In one implementation scenario, the parameters in the parameter table can be user-defined. The relevant parameters can thus be obtained during the run phase of the program and the parameter table stored in the memory (e.g., DDR) for use by the intelligent processor (the IPU 905 in the figure) during the execution phase.
In one implementation, the above lock window represents the storage space that the software user wishes to latch, and the size of the lock window may be larger than the size of the latch area in the cache memory. The lock window includes one or more of the following: the base address and the size of the window, where the base address of the window may be a virtual address ("Virtual Address", "VA") configured by upper-layer software. The base address of the window corresponds to the starting address of the data on which latch-related operations are to be performed, and the size of the window may correspond to the size of the data to be latched.
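As a minimal sketch with assumed field names, a lock window and its hit test could be written in C as follows:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock-window descriptor: a virtual base address and a
 * size covering the data to be latched. */
typedef struct {
    uint64_t base_va;   /* starting virtual address of the data */
    uint64_t size;      /* size of the data to be latched       */
    bool     enabled;   /* lock window flag                     */
} lock_window_t;

/* A request hits the window when its access address falls inside
 * [base_va, base_va + size). */
static inline bool window_hit(const lock_window_t *w, uint64_t va)
{
    return w->enabled && va >= w->base_va && va < w->base_va + w->size;
}
```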
Specifically, in the window mode, the intelligent processor can determine, according to a task issued by the task scheduler, the access address of the data in the task (which may be a virtual address), and compare this access address with the address range defined by the lock window. If the access address of the data in the task falls within the address range of the lock window, the lock window is hit and can be enabled (e.g., "Enabled"). Otherwise, if the access address falls outside the address range of the lock window, the lock window is missed; in this case the lock window can be ignored, which means the data in the task will not be temporarily stored in the cache memory. Further, when the access address of the data hits the lock window, a hash algorithm can be used to select a predetermined proportion of the data as the aforementioned portion to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11. Afterwards, the intelligent processor can send a latch-related request carrying the Lock attribute to the cache memory LLC through the SMMU, where this request indicates that specific data is to reside in the latch area, the specific data being the portion selected according to the hash algorithm.
The latch process and the release process of the LLC in the window mode are described below in conjunction with FIG. 9.
LLC residency (or locking) process:
Step 1: the task scheduler, with the aid of the configurator, configures the LLC (e.g., via the cache policy module) to enable the lock area ("Lock enable") or disable it ("Lock disable") and to set the size of the lock area, i.e., the number of ways ("Way") shown in the figure (e.g., Way0-Way7).
Step 2: the task scheduler issues a task kernel to the IPU;
Step 3: the IPU obtains the lock window flag ("lock window flag") from the parameter table, then reads and configures the lock window. In one implementation scenario, the parameter table can be configured by software and stored at a storage address in the off-chip dynamic random access memory ("DRAM"). The task scheduler can then transmit this address to the IPU, and the IPU can read the parameter table according to the address to complete the configuration of the lock window.
Step 4: the IPU generates a latch-related request through the memory management unit SMMU; when sending the request to the cache policy module of the LLC, the request can carry the lock attribute according to the lock window information.
Step 5: after receiving the latch-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line and marks the lock attribute of that cache line (i.e., of the latch area), for example setting it to "persisting" as described above.
LLC de-residency (or release) process:
Step 6: the task scheduler issues a kernel to the IPU;
Step 7: the IPU obtains the unlock window flag from the parameter table, then reads and configures the unlock window;
Step 8: when the IPU issues a request, it attaches the unlock ("unlock") attribute according to the unlock window information;
Step 9: after receiving the request with the unlock attribute, the cache policy module of the LLC switches the hit cache lines having the lock attribute to the normal attribute, such as the Normal attribute described above in conjunction with the instruction mode;
Step 10: the task scheduler, with the aid of the configurator, disables the latch area through the CLR module (i.e., LLC lock disable). In one implementation scenario, the CLR module can clear the previous lock attribute configuration according to the instruction of the configurator.
The latch scheme of the system-on-chip of the present disclosure in the window mode has been described in detail above in conjunction with FIG. 9. Through such latch operations, the probability of cache hits can be significantly increased, the utilization efficiency of the cache memory is improved, and the application scenarios are expanded.
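To make the ten steps above concrete, the following self-contained C sketch strings them together as stubbed host-side calls; every function here is invented purely for demonstration and merely logs the step it stands for:

```c
#include <stdio.h>

static void llc_configure(const char *what)     { printf("LLC: %s\n", what); }
static void dispatch_kernel(const char *kernel) { printf("scheduler -> IPU: %s\n", kernel); }
static void configure_window(const char *which) { printf("IPU: configure %s from param table\n", which); }
static void smmu_request(const char *attr)      { printf("SMMU -> LLC: request with %s attribute\n", attr); }

int main(void)
{
    /* Residency (steps 1-5) */
    llc_configure("lock enable, 6 ways");  /* step 1 */
    dispatch_kernel("producer kernel");    /* step 2 */
    configure_window("lock window");       /* step 3 */
    smmu_request("Lock");                  /* steps 4-5: lines become persisting */

    /* Release (steps 6-10) */
    dispatch_kernel("consumer kernel");    /* step 6 */
    configure_window("unlock window");     /* step 7 */
    smmu_request("Unlock");                /* steps 8-9: lines return to normal  */
    llc_configure("lock disable (CLR)");   /* step 10 */
    return 0;
}
```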
The embodiments of the present disclosure also support latch-related operations in the stream mode. When the enable bit corresponding to a data stream in a task of the present disclosure is low, this is regarded as the default case, i.e., no latch-related operation is performed in the stream mode. Conversely, when the enable bit is high, the corresponding latch-related operations can be performed on the data stream in the stream mode. Specifically, the window mode and the stream mode of the present disclosure operate similarly: a hash algorithm together with the lock ratio of the data stream can be used to select a predetermined proportion of the data from the stream as the aforementioned portion to be stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11.
As mentioned above, in one embodiment the present disclosure also supports latch-related operations in the page mode, which is described below in conjunction with FIG. 10.
FIG. 10 is a schematic block diagram illustrating the page mode according to an embodiment of the present disclosure. As shown in FIG. 10, according to the solution of the present application, cache pages can be directly configured to carry the lock attribute of the present disclosure, so that cache pages that have a mapping relationship with the memory can be used for shared data access among multiple kernels (kernels 0-2 shown in the figure). In one implementation, a programmer can use an instruction (e.g., Malloc) to mark a cache page with the lock attribute. When a kernel accesses a cache page marked as locked, the SMMU can lock the data corresponding to that cache page in the latch area of the present disclosure. Then, when a subsequent kernel needs to access the same cache page again, it can read the previously locked data from the corresponding cache line of the latch area, achieving a cache hit. Through the page mode, the solution of the present disclosure thus improves the sharing and accessibility of data among multiple kernels.
Specifically, in the page mode, the software driver can directly configure, through instructions, the information in the page table of the system memory management unit ("SMMU"), and determine according to that information whether to perform page-based latch operations or normal operations. When the information in the page table indicates that the SMMU is bypassed, no latching of the cache memory is required, and the attribute of the cache lines in the cache memory can be the Normal attribute. When the information indicates that the SMMU uses linear mapping, page-based latch operations can be set up according to the SMMU linear mapping window configuration; for example, the data corresponding to the cache pages within the linear mapping window is locked in the latch area of the present disclosure. The SMMU can generate a corresponding latch-related request based on the information in the page table and send it to the LLC, and the cache policy module of the LLC can configure the cache lines of the LLC according to that request to perform the corresponding cache-related operations.
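A minimal sketch, assuming invented names for the page-table fields, of how this mapping information could select between normal operation and page-based latching:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SMMU_BYPASS, SMMU_LINEAR_MAP } smmu_mode_t;

/* Hypothetical view of the page-table information relevant to latching. */
typedef struct {
    smmu_mode_t mode;      /* bypass or linear mapping   */
    uint64_t    win_base;  /* linear mapping window base */
    uint64_t    win_size;  /* linear mapping window size */
} smmu_pt_info_t;

/* Pages inside the linear mapping window are latched; a bypassed SMMU
 * means the cache lines simply keep the Normal attribute. */
static bool page_should_latch(const smmu_pt_info_t *pt, uint64_t page_addr)
{
    if (pt->mode == SMMU_BYPASS)
        return false;
    return page_addr >= pt->win_base &&
           page_addr <  pt->win_base + pt->win_size;
}
```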
In one embodiment, the present disclosure also supports the instruction mode. In this case, the system-on-chip can configure the latch area in the LLC through memory access instructions (IO instructions) in the instruction set.
For example, an IO instruction carries at least one configuration field for latch-related attributes, so that the LLC can be configured flexibly by means of this field. Here, the various configuration fields may represent the operation behaviors that the LLC performs when data accesses to the off-chip memory (e.g., the DDR space) are executed. In one implementation scenario, the instruction includes the configuration attributes described above: the Transient attribute, the Lock attribute, the Unlock attribute, the Normal attribute, the Invalid attribute, the Clean attribute, the Default attribute, and so on. Since the instruction mode has the highest priority, when an IO memory access instruction indicates the Default attribute, the latch-related operations may instead be performed by another mode (such as the window mode, the stream mode or the page mode).
By attaching the above exemplary configurable attributes to latch-related requests, the solution of the present disclosure can perform the corresponding latch-related operations in the instruction mode according to these attached attributes.
When the task scheduler issues a task to the intelligent processor IPU, the IPU can determine the latch-related request according to the IO instructions in the task. Specifically, when the configuration field of the Lock attribute in an IO instruction is enabled, the Lock attribute can be attached to the latch-related request, so that the LLC stores the specific data in the lock area according to the request carrying the Lock attribute. When the configuration field of the Unlock attribute in an IO instruction is enabled, the Unlock attribute can be attached to the latch-related request, so that the LLC releases the lock area according to the request carrying the Unlock attribute. Depending on the application scenario, the latch-related request can similarly carry other attributes.
Further, in some operation scenarios, the instruction may also include a specific configuration field for indicating the latch ratio. When this field (for example, a specific bit inst_ratio_en) is low, the latch operation can be considered to depend on the instruction configuration alone, i.e., the latch-related request is determined according to the specific IO instruction in the task. If this bit is high, a hash algorithm can be used in comparison with the lock ratio indicated by the instruction, so that a predetermined proportion of the data is selected from the data stream as the aforementioned portion to be stored in the latch area. The specific use of the hash algorithm is described in detail below in conjunction with FIG. 11.
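For illustration, this decision might be expressed as follows in C; the field layout and names are assumptions rather than the actual instruction format:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of an IO instruction's latch-related fields. */
typedef struct {
    bool    lock_en;        /* Lock attribute configuration field   */
    bool    unlock_en;      /* Unlock attribute configuration field */
    bool    inst_ratio_en;  /* latch-ratio field enabled?           */
    uint8_t lock_ratio;     /* ratio in percent, used when enabled  */
} io_inst_t;

/* When inst_ratio_en is low, the request follows the instruction's
 * Lock field directly; when high, only the hash-selected fraction of
 * addresses (see FIG. 11) is actually latched. */
static bool should_lock(const io_inst_t *inst, uint8_t hash_percent)
{
    if (!inst->lock_en)
        return false;
    if (!inst->inst_ratio_en)
        return true;                        /* purely instruction-driven */
    return hash_percent < inst->lock_ratio; /* ratio-limited latching    */
}
```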
FIG. 11 illustrates the hash operation in the window mode or the stream mode according to an embodiment of the present disclosure. The solution of the present disclosure uses a hash operation to perform residency (i.e., locking) at a certain ratio because one of the key issues of LLC residency is the tradeoff between bandwidth and capacity. The present disclosure therefore proposes performing residency at a configurable ratio (the Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks. Assuming the preset Lock Ratio is P (e.g., expressed as a percentage), the expected bandwidth is B = 6T*P + 2T*(1-P), where 6T is the read rate of data resident in the LLC, 2T is the read rate of data stored in memory (e.g., DRAM), and T = 1000 Gbit/s. As mentioned above, the Lock Ratio can be configured in the lock/unlock window or configured for a specific data stream. In addition, although the hash operation is described below for the window mode or the stream mode, similar operations also apply to the hash operation in the instruction mode.
Specifically, in the window mode or the stream mode, the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address lies within that range. When the requested address is within the address range of the lock window, a hash operation can be performed on the hit window address range, as shown in FIG. 11. Here, the access address of each piece of data may be a virtual address.
Specifically, by means of a globally fixed hash rule, the VA of the access address can be mapped onto the hash space (the "Hash Map" in the figure), and the hash process can preferentially preserve the low-order bits of the address. The hash value obtained at 1102 can then be compared with the Lock Ratio at 1104 to randomly select the corresponding proportion of the data. Specifically, when the hash value of the access address is smaller than the lock ratio, it is considered a hit, and therefore this portion of the data (i.e., the data conforming to the proportion) can be latched in the cache memory. Conversely, when the hash value of the access address is greater than or equal to the lock ratio, it is considered a miss, and this portion of the data will not be latched in the cache memory.
For example, when the Lock Ratio is set to 10%, the portion of the data corresponding to the lowest 10% of the hash values can be selected in order, i.e., the data whose latch-address hash value is smaller than the lock ratio undergoes the latch-related operations. In other examples, of course, the lock ratio may take other values; it can be user-defined by software, and the selection operation can likewise be realized according to the setting of the hash algorithm. For example, the lock ratio may be 20%-30%, in which case the portion of the data corresponding to the lowest 20%-30% of the hash values can be selected in order for the latch-related operations. Thereafter, at 1106 the data can be processed according to the specified request type, i.e., the selected portion of the data is locked or unlocked.
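The following runnable C sketch illustrates the ratio-based selection and the bandwidth estimate; the hash function here is only a stand-in, since the disclosure requires just a globally fixed rule that preferentially preserves low-order address bits:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the globally fixed hash rule: mix the VA into a value
 * in 0..99 so it can be compared against a percentage Lock Ratio. */
static uint8_t va_hash_percent(uint64_t va)
{
    va ^= va >> 17;
    va *= 0x9E3779B97F4A7C15ULL;   /* 64-bit golden-ratio multiplier */
    return (uint8_t)(va % 100);
}

/* Hash value below the ratio: this address's data is latched. */
static bool select_for_latch(uint64_t va, uint8_t lock_ratio_percent)
{
    return va_hash_percent(va) < lock_ratio_percent;
}

int main(void)
{
    /* Expected bandwidth B = 6T*P + 2T*(1-P) with T = 1000 Gbit/s.
     * For P = 10%: B = (0.6 + 1.8) * T = 2.4T = 2400 Gbit/s. */
    double P = 0.10, T = 1000.0;
    printf("expected bandwidth: %.0f Gbit/s\n", (6.0 * P + 2.0 * (1.0 - P)) * T);
    printf("VA 0x1000 latched at 10%%? %d\n", select_for_latch(0x1000, 10));
    return 0;
}
```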
The latch scheme of the cache memory of the present disclosure has been described in detail above in conjunction with FIGS. 6-11. Based on the idea of the foregoing latch scheme, and as a supplement to it, another extended application of the present disclosure for the cache memory is described below in conjunction with FIGS. 12-14, namely how communication among clusters within a system-on-chip can be realized through the cache memory.
FIG. 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the present disclosure. In light of the foregoing description, it can be understood that the system-on-chip here may be the system-on-chip included in the computing device 201 shown in FIG. 2, for example the system-on-chip constituted by the multi-core computing device 41. As shown in FIG. 12, the system-on-chip 1200 includes four clusters, cluster 0 through cluster 3, shown by way of example. Since the clusters have been described in detail above, they are not repeated here. Further shown is a cache memory 1201, which may, for example, be arranged in the SRAM 408 shown in FIG. 5, for performing inter-cluster data transfer operations. In one implementation scenario, the cache memory 1201 can also carry out bidirectional on-chip/off-chip communication with the DRAM (e.g., DDR), including the transfer of various types of data and instructions.
FIG. 13 is a flowchart illustrating a method 1300 for a system-on-chip according to an embodiment of the present disclosure. The system-on-chip here may be the one shown in FIG. 12. Specifically, the system-on-chip includes at least multiple clusters for performing arithmetic operations and a cache memory interconnected with those clusters. In one implementation scenario, each cluster may include multiple processor cores for performing the arithmetic operations. In one implementation scenario, the latch area determined in the cache memory as described above can be used to accomplish inter-cluster data communication, so that the system-on-chip need no longer be provided with communication modules such as the CDMA 410 and the GDMA 411.
在一个实施方式中,上述锁存区可以用于在具有依赖关系的任务之间传递数据,例如,锁存区可以用于在生产者内核和消费者内核之间传递数据。具体来说,处理器可以通过配置的锁定窗口,将生产者内核需要交换给消费者内核的数据锁存在LLC中。在一个场景中,当处理器执行完生产者内核后,可以将需要传递给消费者内核的数据(其可能是生产者内核的输入数据或输出数 据)进行锁存。鉴于此,处理器可以如前所述那样通过配置的锁定窗口并且借助于例如SMMU来对LLC进行本公开的锁存相关操作,从而在窗口模式下将需要交换的上述数据锁存于LLC中,以供消费者内核稍后使用。对应地,处理器还可以根据消费者内核中配置的解锁窗口来释放锁存区,即处理器在通过对于LLC中锁存的数据执行读取操作完成消费者内核的执行时,可以释放LLC中锁存区内数据的对应存储空间。In one embodiment, the above-mentioned latch area can be used to transfer data between tasks with dependencies, for example, the latch area can be used to transfer data between a producer core and a consumer core. Specifically, the processor can lock the data that the producer core needs to exchange to the consumer core in the LLC through the configured lock window. In one scenario, after the processor finishes executing the producer kernel, it can latch the data that needs to be delivered to the consumer kernel (it may be the input data or output data of the producer kernel). In view of this, the processor can perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU, so as to latch the above-mentioned data that needs to be exchanged in the LLC in the window mode, for later use by the consumer kernel. Correspondingly, the processor can also release the latch area according to the unlock window configured in the consumer kernel, that is, when the processor completes the execution of the consumer kernel by performing a read operation on the data latched in the LLC, it can release the latch area in the LLC. The corresponding storage space of the data in the latch area.
Since the latch area can be configured to pass data between tasks with dependencies, it can also be used in inter-cluster communication scenarios. For example, one cluster or processor core of the processor transmits data through the latch area (the data may be data that a producer kernel needs to hand over to a consumer kernel) to processors in other clusters for merge processing. Processors in the other clusters read the data from the latch area for processing, thereby transferring data between clusters. The manner in which the latch area is used for inter-cluster communication is described below.
As shown in FIG. 13, the present disclosure further includes a method for inter-cluster communication using the latch area of the cache memory, the method comprising the following steps.
At step S1302, a designated storage space of the off-chip memory is mapped to a given storage area of the cache memory ("cache"), whose physical properties are the same as those of the locked area described above in conjunction with the drawings, so that the given storage area serves as a cluster storage area for inter-cluster data communication. In an implementation scenario as shown in FIG. 8, the cache memory may include the LLC and the off-chip memory may include a DDR. On this basis, the designated storage space may be the one indicated at 1402 in FIG. 14, and correspondingly the cluster storage area may be the given storage area of the cache memory indicated at 1404 in FIG. 14. In one implementation scenario, the designated DDR storage space can be specified through software configuration and mapped to the given space on the cache for communication between clusters (for example, cluster 0 and cluster 1 shown in FIG. 14). After the cluster storage area has been divided and determined, at step S1304 the determined cluster storage area can be used to perform cluster operations. A hedged sketch of the mapping configuration is given below.
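As a rough illustration of the software configuration in step S1302, the sketch below maps a designated DDR range onto the cache's latch area. The map_ddr_to_llc() driver call, the field names, and the example addresses are all assumptions made for the sake of the example, not the disclosure's actual interface.

    #include <stddef.h>
    #include <stdint.h>

    struct cluster_region {
        uint64_t ddr_base;  /* designated DDR storage space (cf. 1402)      */
        uint64_t llc_off;   /* offset of the given storage area (cf. 1404)  */
        size_t   size;      /* must fit within the configured latch area    */
    };

    int map_ddr_to_llc(const struct cluster_region *r);  /* assumed driver call */

    int setup_cluster_store(void) {
        struct cluster_region r = {
            .ddr_base = 0x80000000ull,  /* example address only            */
            .llc_off  = 0,
            .size     = 1u << 20,       /* e.g., 1 MiB shared by clusters  */
        };
        return map_ddr_to_llc(&r);      /* 0 on success, assumed convention */
    }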
In one embodiment, using the cluster storage area to perform cluster operations may include using it for inter-cluster communication. In this case, using the cluster storage area for inter-cluster communication may specifically include using the cluster storage area to implement point-to-point communication between clusters. Additionally, the cluster storage area may be used to implement broadcast communication from one of the multiple clusters to the remaining clusters. In the point-to-point scenario, the cluster storage area can receive a write operation from a first cluster for write data and, in response to a read operation of a second cluster, send the first cluster's previously written data to the second cluster.
In an example implementation of the above write operation, the cluster storage area can also receive a lock indication specifying that the write data associated with the write operation is to reside in the cluster storage area, such as the write lock ("write lock") shown in FIG. 14, i.e., the lock-related request carrying the Lock attribute described above. The write data can then reside in the cluster storage area based on the lock indication, where the cluster storage area may be the latch area determined in the foregoing embodiments. Such residency can significantly improve the cache hit rate for data that will be read multiple times.
In one implementation scenario, a producer kernel executing in one of the clusters can use the above write lock to latch in the LLC the data that needs to be handed over to a consumer kernel, for later use by that kernel; for example, the producer kernel transmits data via the LLC to processors in other clusters for merge processing. Processors in the other clusters can read the data from the cluster storage area for processing, thereby transferring data between clusters.
In an example implementation of the above read operation, the cluster storage area can also receive a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory, such as the "read invalid" issued by cluster 1 in FIG. 14. The read-invalidate indication may be a latch-related request carrying the invalid attribute; how such a request is generated is described above, and the latch-related requests may differ across latch modes. After sending the write data to cluster 1, the cluster storage area can then invalidate the cache lines associated with the write data based on the read-invalidate indication.
To synchronize the data transfer (or communication) between the above clusters, the cluster writing data into the cluster storage area (e.g., cluster 0) can, after the write operation completes, send a synchronization instruction to the other cluster (e.g., cluster 1), such as the hsem ("hardware semaphore") in FIG. 14. Upon receiving the synchronization instruction, cluster 1 can issue the above read-invalidate request against the cluster storage area, so that after it reads the data cluster 0 wrote into the cluster storage area, the corresponding cache lines are invalidated and the data is prevented from being written back. The handshake is sketched below.
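Under the assumption of four hypothetical primitives — store_locked() (a write carrying the Lock attribute), hsem_post()/hsem_wait() (the hardware semaphore), and load_invalid() (a read carrying the invalid attribute) — the handshake of FIG. 14 could look as follows. This is a sketch of the ordering only, not the disclosure's actual API.

    #include <stddef.h>

    #define HSEM_ID 0                       /* illustrative semaphore index */
    static char *const CLUSTER_STORE = (char *)0x400000;  /* example base  */

    /* assumed primitives, not defined by the disclosure */
    void store_locked(void *dst, const void *src, size_t n);
    void load_invalid(void *dst, const void *src, size_t n);
    void hsem_post(int id);
    void hsem_wait(int id);

    void cluster0_send(const void *src, size_t n) {
        store_locked(CLUSTER_STORE, src, n); /* write lock: data resides in LLC */
        hsem_post(HSEM_ID);                  /* synchronize after the write     */
    }

    void cluster1_receive(void *dst, size_t n) {
        hsem_wait(HSEM_ID);                  /* wait for cluster 0's signal     */
        load_invalid(dst, CLUSTER_STORE, n); /* read invalid: consume the data
                                                and invalidate its cache lines
                                                so nothing is written back      */
    }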
In the context of the present disclosure, the above acts of writing data into and reading data from the cluster storage area may also be collectively referred to as latch-related operations triggered by latch-related requests; how such requests are identified is described above. Specifically, a latch-related request may indicate a latch operation, through which data is latched in the cluster storage area for subsequent multiple uses. Further, a latch-related request may indicate a release operation, through which data is unlocked from the cluster storage area to free up storage space for subsequent data latching. It can be appreciated that the release operation allows the storage space of the cluster storage area to be used flexibly, thereby improving its usage efficiency.
In one embodiment, for a read operation on the cluster storage area, the data, or a selected portion of it, may also be released from a designated region of the cluster storage area according to the latch-related request after the read operation completes. Regarding the selected portion of data, in one embodiment a predetermined proportion of the data may be chosen at random to form the portion to be latched in the latch area. In another embodiment, a hash algorithm may be used to select a predetermined proportion of the data as the portion latched in the cluster storage area; see the description of FIG. 11 above. A sketch of such hash-based selection follows.
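The following minimal sketch illustrates one way to pick a predetermined proportion of cache lines to latch. The disclosure only states that a hash algorithm selects the portion; the 64-byte line size and the particular mixing function (the MurmurHash3 finalizer) are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64u   /* assumed cache line size */

    /* a simple avalanche hash (MurmurHash3 finalizer) */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
    }

    /* Latch a line only if its hashed line index falls below the
     * threshold, so roughly `ratio` (0.0-1.0) of all lines are latched. */
    bool should_latch(uintptr_t addr, double ratio) {
        uint64_t h = mix64((uint64_t)(addr / LINE_SIZE));
        return (double)h < ratio * (double)UINT64_MAX;
    }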
The solution of the present disclosure has been described in detail above with reference to the accompanying drawings. Depending on the application scenario, the electronic equipment or apparatus of the present disclosure may include servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet-of-Things terminals, mobile terminals, mobile phones, dashboard cameras, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous-driving terminals, vehicles, household appliances, and/or medical equipment. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic equipment or apparatus of the present disclosure can also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and other fields. Further, it can also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic equipment or apparatuses with high computing power according to the present disclosure can be applied to cloud devices (e.g., cloud servers), while those with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the cloud device's hardware resources according to the hardware information of the terminal and/or edge device to simulate the hardware resources of the latter, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, the present disclosure presents some methods and embodiments thereof as a series of actions and combinations of actions, but those skilled in the art will understand that the disclosed solution is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching herein, those skilled in the art will appreciate that certain steps may be performed in other orders or simultaneously. Further, the embodiments described herein may be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for implementing one or more of the disclosed solutions. In addition, depending on the solution, the descriptions of some embodiments place emphasis differently. In view of this, for any part not described in detail in a given embodiment, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching herein, those skilled in the art will understand that several of the disclosed embodiments may also be realized in other ways not disclosed in this document. For example, the units in the electronic equipment or apparatus embodiments described above are divided on the basis of logical functions, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, such direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network nodes. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. Moreover, in some scenarios, multiple units of the disclosed embodiments may be integrated into one unit, or each unit may exist physically on its own.
In some implementation scenarios, the above integrated units may be realized in the form of software program modules. If realized as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present disclosure is embodied as a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the disclosed embodiments. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("ROM"), a random access memory ("RAM"), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing program code.
In other implementation scenarios, the above integrated units may also be realized in the form of hardware, that is, as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., computing apparatuses or other processing apparatuses) may be realized by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic or magneto-optical storage media), such as a resistive random access memory ("RRAM"), a dynamic random access memory ("DRAM"), a static random access memory ("SRAM"), an enhanced dynamic random access memory ("EDRAM"), a high-bandwidth memory ("HBM"), a hybrid memory cube ("HMC"), a ROM, or a RAM.
The foregoing can be better understood in light of the following clauses:
Clause A1. A method for a cache memory, comprising:
configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes;
receiving a latch-related request for performing a latch-related operation on data in the latch area; and
performing, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
Clause A2. The method of Clause A1, wherein the multiple latch modes are executed in a predetermined priority order.
Clause A3. The method of Clause A1 or A2, wherein the multiple latch modes include an instruction mode in which latch-related operations are performed based on hardware instructions, a window mode in which latch-related operations are performed based on window attributes, a stream mode in which latch-related operations are performed based on data streams, and/or a page mode in which latch-related operations are performed based on cache pages.
Clause A4. The method of Clause A3, wherein:
in the instruction mode, the latch-related request is determined according to the hardware instruction;
in the page mode, the latch-related request is determined according to a cache page configuration; and
in the window mode or the stream mode, the latch-related request is determined according to a lock window.
Clause A5. The method of Clause A4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can carry a lock attribute indicating that specific data is to reside in the latch area, the specific data being a portion of the data selected according to a hash algorithm.
Clause A6. The method of Clause A3 or A4, wherein in the page mode the method comprises:
performing a cache-page-based latch operation according to a linear mapping window of a system memory management unit.
Clause A7. The method of Clause A3, wherein configuring the latch area to support the multiple latch modes comprises:
configuring a specific storage space as a latch area supporting a corresponding one of the latch modes according to one of multiple received configuration instructions, wherein the configuration instructions include configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
Clause A8. The method of Clause A7, wherein for a write operation directed at the latch area, the method comprises latching the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request, for subsequent multiple reads.
Clause A9. The method of Clause A7, wherein for a read operation directed at the latch area, the method comprises, after the read operation has been performed, releasing the data, or a selected portion of the data, from a designated region of the latch area according to the latch-related request.
Clause A10. A cache memory, comprising:
a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes; and
a latch execution module configured to:
receive a latch-related request for performing a latch-related operation on data in the latch area; and
perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
Clause A11. A system-on-chip, comprising:
the cache memory of Clause A10; and
a processor configured to generate the latch-related request,
wherein the latch execution module of the cache memory is configured to perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
Clause A12. The system-on-chip of Clause A11, wherein the latch modes include an instruction mode, and in the instruction mode the processor is configured to generate the latch-related request according to a received hardware instruction.
Clause A13. The system-on-chip of Clause A11, wherein the latch modes include a page mode, and in the page mode the processor is configured to generate the latch-related request according to a cache page configuration.
Clause A14. The system-on-chip of Clause A11, wherein the latch modes include a window mode or a stream mode, and in the window mode or the stream mode the system-on-chip further comprises a task scheduler including a configurator and a scheduling unit, wherein:
the configurator is configured to generate the configuration instruction according to an assigned configuration task, for sending to the configuration module of the cache memory; and
the scheduling unit is configured to schedule multiple tasks in the task scheduler for sending to the processor core.
Clause A15. The system-on-chip of Clause A14, wherein the configuration instruction includes configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
Clause A16. The system-on-chip of Clause A15, wherein the processor further includes a system memory management unit configured, in the window mode or the stream mode, to:
configure, according to a parameter table, a lock window associated with data on which a latch-related operation is to be performed; and
generate the latch-related request according to the configured lock window.
Clause A17. The system-on-chip of Clause A16, wherein the configuration items of the lock window include one or more of the following:
a base address and a size of the window, wherein the base address of the window corresponds to the start address of the data on which a latch-related operation is to be performed and the size of the window corresponds to the size of the data;
a latch indication for latching data in the latch area;
an unlock indication for unlocking data from the latch area; and
a latch ratio indicating the proportion of the data that will actually be latched among the data on which latch-related operations are to be performed.
Clause A18. The system-on-chip of Clause A17, wherein the processor is further configured to, when the access address of the data on which a latch-related operation is to be performed falls within the address range of the lock window, use a hash algorithm to select the portion of the data that can be latched in the latch area.
Clause A19. The system-on-chip of Clause A17, wherein the processor is configured to randomly select, according to a hash algorithm, the portion of the data to be latched that satisfies the predetermined latch ratio, and to generate a latch-related request carrying a lock attribute for latching in the latch area.
Clause A20. The system-on-chip of Clause A14, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to latch the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request; and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data from the designated region of the latch area after the read operation according to the latch-related request.
Clause A21. The system-on-chip of any one of Clauses A16-A20, wherein the tasks include a producer kernel and a consumer kernel, wherein:
when executing the producer kernel, the processor is configured to latch, through the latch-related request, the data output by the producer kernel in the latch area for use by the consumer kernel; and
when executing the consumer kernel, the processor is configured to read data from the latch area and, after reading the data, to unlock the data from the latch area through the latch-related request so as to free the storage space used for the data in the latch area.
Clause A22. A board comprising the system-on-chip of any one of Clauses A11-A21.
Clause A23. A computing device comprising the board of Clause A22.
Clause B1. A method for a system-on-chip including at least a plurality of clusters for performing computing operations and a cache memory interconnected with the plurality of clusters, each cluster including a plurality of processor cores for performing the computing operations, the method comprising:
mapping a designated storage space of an off-chip memory to a latch area of the cache memory so as to use the latch area as a cluster storage area for inter-cluster data communication; and
performing operations of the clusters using the cluster storage area.
Clause B2. The method of Clause B1, wherein performing operations of the clusters using the cluster storage area comprises using the cluster storage area for inter-cluster communication.
Clause B3. The method of Clause B2, wherein using the cluster storage area for inter-cluster communication comprises:
using the cluster storage area to implement point-to-point communication between clusters; or
using the cluster storage area to implement broadcast communication from one of the plurality of clusters to the remaining clusters.
Clause B4. The method of Clause B3, wherein using the cluster storage area to implement point-to-point communication between clusters comprises:
receiving, from a first cluster, a write operation for write data; and
sending the write data to a second cluster in response to a read operation of the second cluster.
Clause B5. The method of Clause B4, wherein for the write operation the method further comprises:
receiving a lock indication specifying that the write data associated with the write operation is to reside in the cluster storage area; and
causing the write data to reside in the cluster storage area based on the lock indication.
Clause B6. The method of Clause B4 or B5, wherein for the read operation the method further comprises:
receiving a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory; and
after sending the write data to the second cluster, invalidating the cache line associated with the write data based on the read-invalidate indication.
Clause B7. A system-on-chip, comprising:
a plurality of clusters, each cluster including at least a plurality of processor cores for performing computing operations; and
a cache memory interconnected with the plurality of clusters and configured to:
use a latch area as a cluster storage area for inter-cluster data communication, wherein the latch area is mapped to a designated storage space of an off-chip memory; and
perform operations of the clusters using the cluster storage area.
Clause B8. The system-on-chip of Clause B7, wherein the cluster storage area is configured for inter-cluster communication.
Clause B9. The system-on-chip of Clause B8, wherein the cluster storage area is configured for point-to-point communication between clusters or for broadcast communication from one of the plurality of clusters to the remaining clusters.
Clause B10. The system-on-chip of Clause B9, wherein in the point-to-point communication the cluster storage area is configured to:
receive, from a first cluster, a write operation for write data; and
send the write data to a second cluster in response to a read operation of the second cluster.
Clause B11. The system-on-chip of Clause B10, wherein the second cluster is configured to:
receive a hardware semaphore from the first cluster; and
perform the read operation on the cluster storage area in response to receiving the hardware semaphore.
Clause B12. The system-on-chip of Clause B10, wherein for the write operation the first cluster is configured to send to the cluster storage area a lock indication specifying that the write data is to reside in the cluster storage area, so that the cluster storage area causes the write data to reside there based on the lock indication.
Clause B13. The system-on-chip of Clause B12, wherein for the read operation the second cluster is configured to send to the cluster storage area a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory, so that the cluster storage area invalidates the cache line associated with the write data based on the read-invalidate indication.
Clause B14. A computing apparatus comprising the system-on-chip of any one of Clauses B7-B13.
Clause B15. A board comprising the computing apparatus of Clause B14.
Clause B16. A computing device comprising the board of Clause B15.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the idea and spirit of the present disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of protection of the present disclosure and therefore to cover equivalents and alternatives within the scope of those claims.

Claims (36)

  1. A method for a cache memory, comprising:
    configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes;
    receiving a latch-related request for performing a latch-related operation on data in the latch area; and
    performing, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
  2. The method of claim 1, wherein the multiple latch modes have a predetermined priority order.
  3. The method of claim 1 or 2, wherein the multiple latch modes include an instruction mode in which latch-related operations are performed based on hardware instructions, a window mode in which latch-related operations are performed based on window attributes, a stream mode in which latch-related operations are performed based on data streams, and/or a page mode in which latch-related operations are performed based on cache pages.
  4. The method of claim 3, wherein:
    in the instruction mode, the latch-related request is determined according to the hardware instruction;
    in the page mode, the latch-related request is determined according to a cache page configuration; and
    in the window mode or the stream mode, the latch-related request is determined according to a lock window.
  5. The method of claim 4, wherein in the instruction mode, the window mode, or the stream mode, the latch-related request can carry a lock attribute indicating that specific data is to reside in the latch area, the specific data being a portion of the data selected according to a hash algorithm.
  6. The method of claim 3 or 4, wherein in the page mode the method comprises:
    performing a cache-page-based latch operation according to a linear mapping window of a system memory management unit.
  7. The method of claim 3, wherein configuring the latch area to support the multiple latch modes comprises:
    configuring a specific storage space as a latch area supporting a corresponding one of the latch modes according to one of multiple received configuration instructions,
    wherein the configuration instructions include configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
  8. The method of claim 7, wherein for a write operation directed at the latch area, the method comprises latching the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request, for subsequent multiple reads.
  9. The method of claim 7, wherein for a read operation directed at the latch area, the method comprises, after the read operation has been performed, releasing the data, or a selected portion of the data, from a designated region of the latch area according to the latch-related request.
  10. The method of claim 1, wherein the cache memory is located in a system-on-chip and is further interconnected with a plurality of clusters of the system-on-chip, each cluster including a plurality of processor cores for performing computing operations, the method further comprising:
    mapping a designated storage space of an off-chip memory to the latch area of the cache memory so as to use the latch area as a cluster storage area for inter-cluster data communication; and
    performing operations of the clusters using the cluster storage area.
  11. The method of claim 10, wherein performing operations of the clusters using the cluster storage area comprises using the cluster storage area for inter-cluster communication.
  12. The method of claim 11, wherein using the cluster storage area for inter-cluster communication comprises:
    using the cluster storage area to implement point-to-point communication between clusters; or
    using the cluster storage area to implement broadcast communication from one of the plurality of clusters to the remaining clusters.
  13. The method of claim 12, wherein using the cluster storage area to implement point-to-point communication between clusters comprises:
    receiving, from a first cluster, a write operation for write data; and
    sending the write data to a second cluster in response to a read operation of the second cluster.
  14. The method of claim 13, wherein for the write operation the method further comprises:
    receiving a lock indication specifying that the write data associated with the write operation is to reside in the cluster storage area; and
    causing the write data to reside in the cluster storage area based on the lock indication.
  15. The method of claim 13 or 14, wherein for the read operation the method further comprises:
    receiving a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory; and
    after sending the write data to the second cluster, invalidating the cache line associated with the write data based on the read-invalidate indication.
  16. A cache memory, comprising:
    a configuration module configured to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes; and
    a latch execution module configured to:
    receive a latch-related request for performing a latch-related operation on data in the latch area; and
    perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
  17. A system-on-chip, comprising:
    the cache memory of claim 16; and
    a processor configured to generate the latch-related request,
    wherein the latch execution module of the cache memory is configured to perform, according to the latch-related request, a latch-related operation on the data in the latch area in the corresponding latch mode.
  18. The system-on-chip of claim 17, wherein the latch modes include an instruction mode, and in the instruction mode the processor is configured to generate the latch-related request according to a received hardware instruction.
  19. The system-on-chip of claim 17, wherein the latch modes include a page mode, and in the page mode the processor is configured to generate the latch-related request according to a cache page configuration.
  20. The system-on-chip of claim 17, wherein the latch modes include a window mode or a stream mode, and in the window mode or the stream mode the system-on-chip further comprises a task scheduler including a configurator and a scheduling unit, wherein:
    the configurator is configured to generate the configuration instruction according to an assigned configuration task, for sending to the configuration module of the cache memory; and
    the scheduling unit is configured to schedule multiple tasks in the task scheduler for sending to the processor.
  21. The system-on-chip of claim 20, wherein the configuration instruction includes configuration items for enabling the latch area, disabling the latch area, and/or the latch area size.
  22. The system-on-chip of claim 21, wherein the processor further includes a system memory management unit configured, in the window mode or the stream mode, to:
    configure, according to a parameter table, a lock window associated with data on which a latch-related operation is to be performed; and
    generate the latch-related request according to the configured lock window.
  23. The system-on-chip of claim 22, wherein the configuration items of the lock window include one or more of the following:
    a base address and a size of the window, wherein the base address of the window corresponds to the start address of the data on which a latch-related operation is to be performed and the size of the window corresponds to the size of the data;
    a latch indication for latching data in the latch area;
    an unlock indication for unlocking data from the latch area; and
    a latch ratio indicating the proportion of the data that will actually be latched among the data on which latch-related operations are to be performed.
  24. The system-on-chip of claim 23, wherein the processor is further configured to:
    when the access address of the data on which a latch-related operation is to be performed falls within the address range of the lock window, use a hash algorithm to select the portion of the data that can be latched in the latch area.
  25. The system-on-chip of claim 23, wherein the processor is configured to randomly select, according to a hash algorithm, the portion of the data to be latched that satisfies the predetermined latch ratio, and to generate a latch-related request carrying a lock attribute for latching in the latch area.
  26. The system-on-chip of claim 20, wherein the processor is configured to perform a write operation on the data in the latch area, and the latch execution module is configured to latch the data, or a selected portion of the data, in a designated region of the latch area according to the latch-related request; and wherein the processor is further configured to perform a read operation on the data in the latch area, and the latch execution module is configured to release the data from the designated region of the latch area after the read operation according to the latch-related request.
  27. The system-on-chip of any one of claims 22-26, wherein the tasks include a producer kernel and a consumer kernel, wherein:
    when executing the producer kernel, the processor is configured to latch, through the latch-related request, the data output by the producer kernel in the latch area for use by the consumer kernel; and
    when executing the consumer kernel, the processor is configured to read data from the latch area and, after reading the data, to unlock the data from the latch area through the latch-related request so as to free the storage space used for the data in the latch area.
  28. The system-on-chip of claim 27, further comprising a plurality of clusters, each cluster including at least a plurality of processor cores for performing computing operations, wherein the cache memory is further interconnected with the plurality of clusters and configured to:
    use the latch area as a cluster storage area for inter-cluster data communication, wherein the latch area is mapped to a designated storage space of an off-chip memory; and
    perform operations of the clusters using the cluster storage area.
  29. The system-on-chip of claim 28, wherein the cluster storage area is configured for inter-cluster communication.
  30. The system-on-chip of claim 29, wherein the cluster storage area is configured for point-to-point communication between clusters or for broadcast communication from one of the plurality of clusters to the remaining clusters.
  31. The system-on-chip of claim 30, wherein in the point-to-point communication the cluster storage area is configured to:
    receive, from a first cluster, a write operation for write data; and
    send the write data to a second cluster in response to a read operation of the second cluster.
  32. The system-on-chip of claim 31, wherein the second cluster is configured to:
    receive a hardware semaphore from the first cluster; and
    perform the read operation on the cluster storage area in response to receiving the hardware semaphore.
  33. The system-on-chip of claim 31, wherein for the write operation the first cluster is configured to send to the cluster storage area a lock indication specifying that the write data is to reside in the cluster storage area, so that the cluster storage area causes the write data to reside there based on the lock indication.
  34. The system-on-chip of claim 33, wherein for the read operation the second cluster is configured to send to the cluster storage area a read-invalidate indication specifying that the write data is not to be written back to the off-chip memory, so that the cluster storage area invalidates the cache line associated with the write data based on the read-invalidate indication.
  35. A board comprising the system-on-chip of any one of claims 17-34.
  36. A computing device comprising the board of claim 35.
PCT/CN2022/110740 2021-08-12 2022-08-08 Method for cache memory and related products WO2023016383A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110926703.7A CN115878553A (en) 2021-08-12 2021-08-12 Method for system on chip and related product
CN202110926707.5 2021-08-12
CN202110926707.5A CN115705300A (en) 2021-08-12 2021-08-12 Method for cache memory and related product
CN202110926703.7 2021-08-12

Publications (1)

Publication Number Publication Date
WO2023016383A1 (en)

Family

ID=85200562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110740 WO2023016383A1 (en) 2021-08-12 2022-08-08 Method for cache memory and related products

Country Status (1)

Country Link
WO (1) WO2023016383A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030018855A1 (en) * 2001-07-16 2003-01-23 Mcwilliams Thomas M. Method and apparatus for caching with variable size locking regions
CN102750227A (en) * 2011-04-19 2012-10-24 飞思卡尔半导体公司 Cache memory with dynamic lockstep support
CN106547619A (en) * 2016-10-20 2017-03-29 深圳市云海麒麟计算机系统有限公司 Multi-user's memory management method and system
CN110634517A (en) * 2018-06-25 2019-12-31 成都康元多商贸有限公司 High-performance static random access memory

Similar Documents

Publication Publication Date Title
US10389839B2 (en) Method and apparatus for generating data prefetches specifying various sizes to prefetch data from a remote computing node
JP6431536B2 (en) Final level cache system and corresponding method
JP4322259B2 (en) Method and apparatus for synchronizing data access to local memory in a multiprocessor system
JP3802042B2 (en) Cache memory mounting method and apparatus, and cache memory system
US11341059B2 (en) Using multiple memory elements in an input-output memory management unit for performing virtual address to physical address translations
US20230367722A1 (en) Data processing device and method, and related products
JP2012252490A (en) Multiprocessor and image processing system using the same
US11468001B1 (en) Processing-in-memory concurrent processing system and method
US20210224213A1 (en) Techniques for near data acceleration for a multi-core architecture
CN118113631B (en) Data processing system, method, device, medium and computer program product
CN112527729A (en) Tightly-coupled heterogeneous multi-core processor architecture and processing method thereof
US20080244221A1 (en) Exposing system topology to the execution environment
EP3959611A1 (en) Intra-device notational data movement system
US12013780B2 (en) Multi-partition memory sharing with multiple components
WO2024045580A1 (en) Method for scheduling tasks, and related product thereof
WO2023016383A1 (en) Method for cache memory and related products
TW202111545A (en) Unified address translation
Abdallah Heterogeneous Computing: An Emerging Paradigm of Embedded Systems Design
US10884477B2 (en) Coordinating accesses of shared resources by clients in a computing device
CN115705300A (en) Method for cache memory and related product
CN115878553A (en) Method for system on chip and related product
WO2023016382A1 (en) Method for system on chip, and related product thereof
TWI831564B (en) Configurable memory system and memory managing method thereof
Chiu et al. Design and Implementation of the Link-List DMA Controller for High Bandwidth Data Streaming
CN116166468A (en) Method for processing ECC errors in heterogeneous system, heterogeneous system and related products thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22855364

Country of ref document: EP

Kind code of ref document: A1