WO2023124304A1 - Chip cache system, data processing method, device, storage medium, and chip - Google Patents

Publication number: WO2023124304A1
Authority: WIPO (PCT)
Application number: PCT/CN2022/121033
Prior art keywords: access request, read, operation data, computing, local shared
Other languages: French (fr), Chinese (zh)
Inventors: 王文强, 夏晓旭, 朱志岐, 徐宁仪
Original assignee: 上海商汤智能科技有限公司
Publication of WO2023124304A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
            • G06F 12/02 Addressing or allocation; Relocation
              • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
                • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
                  • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
                    • G06F 12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
                    • G06F 12/084 Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
          • G06F 15/00 Digital computers in general; Data processing equipment in general
            • G06F 15/76 Architectures of general purpose stored program computers
              • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to the technical field of integrated circuits, and in particular, relates to a chip cache system, a data processing method, electronic equipment, a computer-readable storage medium, and a chip.
  • A neural network is an algorithmic mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. With the rapid development of neural networks, they have been applied in various fields.
  • a purpose-designed artificial intelligence (AI) chip can be used to handle the computation of a neural network; designing an efficient AI chip has therefore become one of the effective means of improving neural network processing efficiency.
  • the present disclosure at least provides a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
  • the present disclosure provides a chip cache system, including: a plurality of computing subsystems, each of which includes at least one computing unit and at least one local shared buffer, where each computing unit is connected to any one of the local shared buffers in its computing subsystem; the local shared buffer is used to cache the operation data read by the computing units in the computing subsystem; and the computing unit is used to access, based on a generated access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, to perform an operation based on the read operation data.
  • the chip is divided into multiple computing subsystems, and each computing subsystem includes at least one computing unit and at least one local shared buffer.
  • the computing unit in the computing subsystem can cache the read operation data into a local shared buffer of the computing subsystem, so that every computing unit in the computing subsystem can read the operation data from the shared buffer, and each computing unit can read the operation data from the local shared buffer multiple times at different points in time without fetching it from memory each time, which improves the operation efficiency of the computing units.
  • each of the computing subsystems further includes a local interconnection bus unit; the local interconnection bus unit is used to connect each of the computing units with any local shared buffer, and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request; the computing unit is used to access the local shared buffer determined by the local interconnection bus unit that matches the access address indicated by the access request.
  • each computing unit is connected to any local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, thereby improving the utilization rate of the operation data.
  • the local interconnection bus unit determines the local shared buffer matching the access address indicated by the access request, so that the computing unit can read the corresponding local shared buffer based on the access request.
  • the operation unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • the operation unit can read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer, so that the operation data can later be read from the local shared buffer without reading it from the storage module again, which improves the degree of reuse of the operation data.
  • the storage module of the chip includes a global buffer; the operation unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the storage module of the chip further includes an external memory; the operation unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the cache system further includes: a global interconnection bus unit, configured to connect each computing subsystem to the storage module respectively.
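The hierarchy summarized above (computing subsystems containing computing units and local shared buffers, plus a global buffer and external memory in the storage module) can be sketched as follows. This is an illustrative model only; all class and field names are assumptions, not terms defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class LocalSharedBuffer:
    # access address -> cached operation data
    data: dict = field(default_factory=dict)

@dataclass
class ComputingSubsystem:
    # at least one computing unit and at least one local shared buffer,
    # with every unit able to reach every buffer in the subsystem
    num_units: int
    buffers: list

@dataclass
class ChipCacheSystem:
    subsystems: list                              # multiple computing subsystems
    global_buffer: dict = field(default_factory=dict)
    external_memory: dict = field(default_factory=dict)

# a chip with 2 subsystems, each holding 4 units and 1 local shared buffer
chip = ChipCacheSystem(
    subsystems=[ComputingSubsystem(num_units=4, buffers=[LocalSharedBuffer()])
                for _ in range(2)],
)
assert len(chip.subsystems) == 2
```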
  • the present disclosure provides a data processing method, applied to the cache system of the chip described in the first aspect or any implementation thereof, the method including: a computing unit in any computing subsystem included in the cache system obtains an access request; the computing unit accesses, based on the access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
  • the method further includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, and caching the operation data read from the storage module into the local shared buffer.
  • reading the operation data indicated by the access request from the storage module of the chip based on the access request includes: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, reading, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the method further includes: when the operation data indicated by the access request is not read from the global buffer, reading the operation data indicated by the access request from the external memory of the chip, and caching the operation data read from the external memory into the global buffer.
  • the present disclosure provides a data processing device, the device including: an acquisition module, configured to acquire an access request; a reading module, configured to access, based on the access request, the local shared buffer that matches the access address indicated by the access request; and an operation module, configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared buffer to obtain an operation result.
  • the device further includes an access module, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • when reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, the access module is configured to: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the access module is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the present disclosure provides a chip, including the cache system described in the first aspect or any one of its implementations, and a storage module; the cache system is used to obtain operation data from the storage module and cache the operation data.
  • the present disclosure provides an electronic device, including a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor, and when the electronic device runs, the processor and the memory communicate through the bus; when the machine-readable instructions are executed by the processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
  • the present disclosure provides an electronic device, including the chip as described in the fourth aspect.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
  • FIG. 1 shows a schematic diagram of the architecture of a chip cache system provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of the architecture of another chip cache system provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic structural diagram of a chip provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic structural diagram of a data processing device provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of another chip provided by an embodiment of the present disclosure.
  • FIG. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • when the computing unit in the chip is working, it can read data from the external memory of the chip and store the read data in internal memory, so that the computing unit can then read the data from the internal memory.
  • here, an artificial intelligence (AI) chip is taken as an example of the chip for illustration.
  • the designed AI chip can be used to handle the calculation process of the neural network.
  • the inference and training of large-scale neural networks place higher demands on the computing power of AI chips.
  • the scale of AI chips on the cloud side is also getting larger and larger.
  • more computing units also mean that a higher bandwidth data path is required to support the data requirements of the operation.
  • the conventional approach is to enlarge the cache inside the AI chip and reduce the computing units' accesses to external memory by reusing operation data.
  • a cache can be set inside each computing unit of the AI chip; for example, the cache can be the first-level cache (L1 Cache) of a graphics processing unit (GPU).
  • the same computing unit can then access the same cached operation data multiple times, realizing reuse of operation data in the time dimension and meeting the bandwidth requirements of the computing unit.
  • however, with this method, the operation data cached in any one computing unit cannot be read by the other computing units; reuse of operation data in the space dimension therefore cannot be realized, resulting in a low utilization rate of the operation data.
  • an embodiment of the present disclosure provides a chip cache system.
  • the cache system includes a plurality of computing subsystems 11, and each computing subsystem 11 includes at least one computing unit 101 and at least one local shared buffer 102.
  • each computing unit 101 is connected to any local shared buffer 102 in its computing subsystem.
  • the local shared buffer 102 is used for caching the operation data read by the computing units in the computing subsystem to which it belongs.
  • the computing unit 101 is configured to access, based on a generated access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, to perform an operation based on the read operation data.
  • the computing subsystem may include a local cache module, and the local cache module may be divided into multiple physical memory banks; each bank may correspond to one local shared buffer, and each computing unit in the computing subsystem may correspond to a computing core in the AI chip.
  • each local shared buffer in the computing subsystem can cache operation data, and the cached operation data can be read from the storage module of the chip by any computing unit in the computing subsystem. When performing a calculation, the computing unit may generate an access request and, based on the generated access request, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request. If the operation data indicated by the access request is read from the local shared buffer, an operation, such as a convolution operation, is performed with the read operation data to obtain an operation result.
  • the chip is divided into multiple computing subsystems, and each computing subsystem includes at least one computing unit and at least one local shared buffer.
  • the computing unit in the computing subsystem can cache the read operation data into a local shared buffer of the computing subsystem, so that every computing unit in the computing subsystem can read the operation data from the shared buffer, and each computing unit can read the operation data from the local shared buffer multiple times at different points in time without fetching it from memory each time, which improves the operation efficiency of the computing units.
  • each of the computing subsystems further includes a local interconnection bus unit 103; the local interconnection bus unit 103 is used to connect each of the computing units with any local shared buffer in the computing subsystem, and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request.
  • the computing unit 101 is used to access the local shared buffer determined by the local interconnection bus unit that matches the access address indicated by the access request.
  • each computing subsystem may also include a local interconnection bus unit, which is used to connect each computing unit in the computing subsystem with any local shared buffer and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request; the computing unit can then access the local shared buffer matched with the access address indicated by the access request.
  • the local interconnection bus unit may include: a network on chip (Network on Chip, NoC) and the like.
  • each computing unit is connected to any local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, thereby improving the utilization rate of operation data.
  • the local interconnection bus unit determines the local shared buffer matching the access address indicated by the access request, so that the computing unit can access the corresponding local shared buffer based on the access request.
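The patent does not specify how the local interconnection bus unit maps an access address to one of the local shared buffers. One common scheme, assumed here purely for illustration, is low-order interleaving of cache lines across the M buffers; the function name and the `line_bytes` granularity below are hypothetical.

```python
# Illustrative bank-selection scheme for the local interconnection bus unit.
# Low-order address interleaving across the M local shared buffers is an
# assumed choice, not one taken from the patent.

def select_local_buffer(access_address: int, num_buffers: int,
                        line_bytes: int = 64) -> int:
    """Return the index of the local shared buffer matching the address."""
    line = access_address // line_bytes        # cache-line granularity
    return line % num_buffers                  # interleave lines across banks

# Consecutive cache lines land in different banks, so computing units in a
# subsystem can access different local shared buffers in parallel.
assert select_local_buffer(0x0000, num_buffers=4) == 0
assert select_local_buffer(0x0040, num_buffers=4) == 1
assert select_local_buffer(0x0080, num_buffers=4) == 2
```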
  • the operation unit 101 is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • the chip also includes a storage module, which stores the operation data required by the computing units.
  • in this way, the operation unit can read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer, so that the operation data can later be read from the local shared buffer without reading it from the storage module again, which improves the degree of reuse of the operation data.
  • the storage module of the chip includes a global buffer.
  • the operation unit 101 is further configured to: if the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • when the operation unit in the computing subsystem does not read the operation data indicated by the access request from the local shared buffer that matches the access address indicated by the access request, it reads the operation data indicated by the access request from the global buffer of the chip according to the access request. If the operation data indicated by the access request is successfully read from the global buffer, the operation data read from the global buffer can be cached into a local shared buffer in the computing subsystem.
  • the storage module of the chip further includes an external memory.
  • the operation unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • when the operation unit in the computing subsystem does not read the operation data indicated by the access request from the global buffer, it can read the operation data indicated by the access request from the external memory of the chip according to the access request, cache the read operation data into the global buffer so that other computing subsystems can read it from there, and cache the read operation data into a local shared buffer in the computing subsystem so that other computing units in the subsystem, and the computing unit itself next time, can read it from the local shared buffer.
  • the external memory of the chip can be used to store all the operation data required by the operation unit in the chip for operation.
  • the global buffer can be used to store the operation data read by the operation units in each computing subsystem.
  • the local shared buffer in any computing subsystem can be used to store the operation data read by each computing unit in that computing subsystem.
  • the external memory and the global buffer are on the chip and external to the cache system; the global buffer is connected with the external memory and with the cache system, respectively.
  • thus, the operation data indicated by the access request is read from the local shared buffer first; if it is not present in the local shared buffer, the operation data indicated by the access request is read from the global buffer; and if it is not present in the global buffer, the operation data indicated by the access request is read from the external memory.
  • the operation data read in this way can be cached into the global buffer first, and then cached into the local shared buffer.
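The two bullets above describe a three-level read path: local shared buffer first, then the global buffer, then external memory, with read data filled into the global buffer and then the local shared buffer. A minimal sketch, assuming each level can be modelled as an address-to-data mapping (the function and variable names are illustrative):

```python
# Minimal sketch of the three-level read path described above.

def read_operation_data(addr, local_shared_buffer, global_buffer, external_memory):
    # 1. try the local shared buffer matching the access address
    if addr in local_shared_buffer:
        return local_shared_buffer[addr]
    # 2. on a miss, try the chip's global buffer
    if addr in global_buffer:
        data = global_buffer[addr]
        local_shared_buffer[addr] = data       # fill the local shared buffer
        return data
    # 3. on a global miss, read external memory and fill both levels,
    #    caching into the global buffer first, then the local shared buffer
    data = external_memory[addr]
    global_buffer[addr] = data
    local_shared_buffer[addr] = data
    return data

local, glob, ext = {}, {}, {0x100: "weights"}
assert read_operation_data(0x100, local, glob, ext) == "weights"
assert 0x100 in glob and 0x100 in local        # both levels now hold the data
assert read_operation_data(0x100, local, glob, ext) == "weights"  # local hit
```

Later reads of the same address hit the local shared buffer, which is the reuse-in-time and reuse-in-space behaviour the patent aims for.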
  • the cache system further includes a global interconnection bus unit 104; the global interconnection bus unit 104 is configured to connect each computing subsystem to the storage module.
  • the global interconnection bus unit connects each computing subsystem to the storage module, that is, to the global buffer and the external memory in the storage module, so that each computing unit in any computing subsystem can read the global buffer and the external memory based on access requests.
  • the AI chip can be divided into X computing subsystems; each computing subsystem includes N computing units and M local shared buffers, and each computing unit is connected to the M local shared buffers, so that each computing unit can read any connected local shared buffer.
  • X, N, M are positive integers.
  • operation data is cached in the M local shared buffers in computing subsystem 0.
  • when computing unit 0 in computing subsystem 0 performs a computation, computing unit 0 can generate and issue an access request.
  • the local interconnection bus unit determines the local shared buffer in computing subsystem 0 that matches the access address indicated by the access request. If the determined local shared buffer is local shared buffer 0, the computing unit may read the operation data cached in local shared buffer 0.
  • when the operation data indicated by the access request is read, computing unit 0 performs an operation based on the read operation data to obtain an operation result. If the operation data indicated by the access request is not read from local shared buffer 0, computing unit 0 may read the global buffer based on the access request. If the operation data indicated by the access request is read from the global buffer, the operation data read from the global buffer can be cached into any local shared buffer included in computing subsystem 0, and an operation is performed based on the read operation data to obtain an operation result.
  • if the operation data is not read from the global buffer either, computing unit 0 may read the external memory based on the access request, cache the operation data read from the external memory into the global buffer, cache the read operation data into any local shared buffer included in computing subsystem 0, and perform an operation based on the read operation data to obtain an operation result.
  • the operation result can be cached in other configured caches; it does not need to be cached in a local shared buffer.
  • the other caches may be the first-level cache (L1 Cache) inside the computing unit. In actual scheduling, the local shared buffers are therefore mainly used to store read-only operation data, and no write operation is performed on them. As a result, the local shared buffers included in different computing subsystems need not maintain data consistency; for example, the operation data cached in a local shared buffer in computing subsystem 0 may differ from the operation data cached in a local shared buffer in computing subsystem 1.
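The read-only, no-coherence property described above can be shown with a toy example (the addresses and data values below are invented for illustration): each subsystem fills its own local shared buffer independently, and because computing units never write results into these buffers, the differing copies require no invalidation or write-back traffic between subsystems.

```python
# Toy illustration: local shared buffers are read-only caches of operation
# data, so different subsystems may legitimately hold different contents.

global_buffer = {0x0: "filter_A", 0x40: "filter_B"}

subsystem0_local = {}
subsystem1_local = {}

# subsystem 0 only ever reads filter_A; subsystem 1 only reads filter_B
subsystem0_local[0x0] = global_buffer[0x0]
subsystem1_local[0x40] = global_buffer[0x40]

# the two local shared buffers now cache different operation data, which is
# allowed: no coherence protocol is needed because neither copy is written
assert subsystem0_local != subsystem1_local
assert 0x40 not in subsystem0_local
```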
  • in this way, on the basis of meeting the computing units' data requirements on the global cache and the local caches, the design complexity of the cache system can be reduced and the performance of the AI chip improved.
  • the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • the embodiment of the present disclosure also provides a data processing method, which is applied to the cache system of the chip described in the above embodiments. As shown in FIG. 4, which is a schematic flowchart of the data processing method provided by the embodiment of the present disclosure, the data processing method includes the following steps S401-S403.
  • S401: any computing unit in any computing subsystem included in the cache system acquires an access request.
  • S402: based on the access request, the computing unit accesses the local shared buffer in the computing subsystem that matches the access address indicated by the access request.
  • S403: when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
  • the method further includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, the computing unit reads, based on the access request, the operation data indicated by the access request from the storage module of the chip, and caches the operation data read from the storage module into the local shared buffer.
  • reading the operation data indicated by the access request from the storage module of the chip includes: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, the computing unit reads, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the method further includes: when the operation data indicated by the access request is not read from the global buffer, the computing unit reads the operation data indicated by the access request from the external memory of the chip, and caches the operation data read from the external memory into the global buffer.
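Steps S401-S403 can be sketched end to end as follows; the shape of the access request and the stand-in operation (summing the read data, in place of e.g. a convolution) are illustrative assumptions, not details from the patent.

```python
# Sketch of steps S401-S403 of the data processing method.

def process(access_request, local_shared_buffer):
    # S401: a computing unit in some computing subsystem obtains an access request
    addr = access_request["access_address"]
    # S402: access the local shared buffer matching the indicated access address
    operation_data = local_shared_buffer.get(addr)
    if operation_data is None:
        # in the full method this would fall back to the global buffer
        # and then to external memory
        return None
    # S403: perform the operation on the read operation data (stand-in: sum)
    return sum(operation_data)

buffer = {0x200: [1, 2, 3, 4]}
assert process({"access_address": 0x200}, buffer) == 10
assert process({"access_address": 0x300}, buffer) is None
```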
  • the embodiment of the present disclosure also provides a data processing device; as shown in FIG. 5, which is a schematic diagram of the structure of the data processing device provided by the embodiment of the present disclosure, the device includes an acquisition module 501, a reading module 502, and an operation module 503.
  • the acquisition module 501 is used to acquire an access request.
  • the reading module 502 is configured to access, based on the access request, the local shared buffer that matches the access address indicated by the access request.
  • the operation module 503 is configured to perform an operation based on the operation data read from the local shared buffer to obtain an operation result when the operation data indicated by the access request is read.
  • the device further includes an access module 504, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • when reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, the access module 504 is configured to: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the access module 504 is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the functions of the device provided by the embodiments of the present disclosure or the included templates can be used to execute the methods described in the above method embodiments, and its specific implementation can refer to the description of the above method embodiments. For brevity, here No longer.
  • an embodiment of the present disclosure further provides a chip, including: the cache system 601 and the storage module 602 described in the foregoing implementation manners.
  • the cache system 601 is used to acquire operation data from the storage module 602 and cache the operation data.
  • the storage module 602 may include a global cache and an external memory. That is, the cache system can first read the operation data from the global cache based on the access request, and then read the operation data corresponding to the access request from the external memory based on the access request when there is no operation data corresponding to the access request in the global cache.
  • an embodiment of the present disclosure also provides an electronic device.
  • the electronic device includes a processor 701 , a memory 702 and a bus 703 .
  • the memory 702 is used to store execution instructions and includes an internal memory 7021 and an external memory 7022; the internal memory 7021 is used to temporarily store the operation data in the processor 701 and the data exchanged with the external memory 7022, such as a hard disk.
  • the processor 701 exchanges data with the external memory 7022 through the memory 7021.
  • the processor 701 communicates with the memory 702 through the bus 703, so that the processor 701 executes the following instructions: obtaining an access request; based on the access request, reading the local shared buffer that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read, performing an operation based on the operation data read from the local shared buffer to obtain an operation result.
  • the electronic device may be the chip described in the above implementation manner.
  • an embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the data processing method described in the above-mentioned method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • Embodiments of the present disclosure also provide a computer program product carrying a program code, and the instructions included in the program code can be used to execute the steps of the data processing method described in the above method embodiments; for details, reference can be made to the above method embodiments, which are not repeated here.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • in an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and so on.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are realized in the form of software function units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

Provided in the present disclosure are a chip cache system, a data processing method and apparatus, a device, a storage medium, and a chip. The cache system comprises: a plurality of operation subsystems, wherein each operation subsystem comprises at least one operation unit and at least one local shared buffer; each operation unit is connected to any local shared buffer in the operation subsystem where the operation unit is located; the local shared buffers are used for caching operation data read by the operation units in the operation subsystem to which the local shared buffers belong; and the operation units are used for accessing, on the basis of a generated access request, a local shared buffer, which matches an access address indicated by the access request, in the operation subsystem, and performing an operation on the basis of the read operation data when operation data indicated by the access request is read from the accessed local shared buffer.

Description

Chip cache system, data processing method, device, storage medium and chip
Cross-Reference Statement
This application claims priority to the Chinese patent application with application number 202111662634.X, filed with the China Patent Office on December 31, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of integrated circuits, and in particular to a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
Background
A neural network is an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. With the rapid development of neural networks, they have been applied in various fields.
Generally, a purpose-designed artificial intelligence (AI) chip can be used to process the computation of a neural network. Designing an efficient AI chip has therefore become one of the effective means of improving the processing efficiency of neural networks.
Summary
In view of this, the present disclosure provides at least a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
In a first aspect, the present disclosure provides a chip cache system, including a plurality of computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. Each computing unit is connected to any local shared buffer in the computing subsystem where it is located. The local shared buffer is used to cache the operation data read by the computing units in the computing subsystem to which it belongs. The computing unit is used to, based on a generated access request, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, perform an operation based on the read operation data.
Here, the chip is divided into multiple computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. For any computing subsystem, a computing unit in it can cache the operation data it reads into the local shared buffer of that subsystem, so that every computing unit in the subsystem can read the operation data from the local shared buffer, and each computing unit can read it from the local shared buffer multiple times at different points in time. The utilization of the operation data is therefore high, and the data does not need to be read repeatedly from outside the cache system, which improves the computing efficiency of the computing units.
In a possible implementation, each computing subsystem further includes a local interconnection bus unit. The local interconnection bus unit is used to connect each computing unit with any local shared buffer in the computing subsystem where it is located, and, after receiving the access request sent by the computing unit, to determine the local shared buffer that matches the access address indicated by the access request. The computing unit is used to access the local shared buffer determined by the local interconnection bus unit that matches the access address indicated by the access request.
Here, each computing unit is connected to any local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, which improves the utilization of the operation data. Meanwhile, after receiving the access request sent by the computing unit, the local interconnection bus unit determines the local shared buffer that matches the access address indicated by the access request, so that the computing unit can read the corresponding local shared buffer based on the access request.
In a possible implementation, the computing unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
Here, when the operation data indicated by the access request does not exist in the local shared buffer, the computing unit can read that operation data from the storage module of the chip based on the access request and cache it into the local shared buffer, so that the operation data can subsequently be read from the local shared buffer without reading it from the storage module again, which improves the degree of reuse of the operation data.
In a possible implementation, the storage module of the chip includes a global buffer. The computing unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the global buffer of the chip based on the access request.
In a possible implementation, the storage module of the chip further includes an external memory. The computing unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
In a possible implementation, the cache system further includes a global interconnection bus unit, configured to connect each computing subsystem to the storage module.
For descriptions of the effects of the following apparatuses, electronic devices, and so on, refer to the description of the above method; they are not repeated here.
In a second aspect, the present disclosure provides a data processing method, applied to the chip cache system described in the first aspect or any implementation thereof. The method includes: a computing unit in any computing subsystem included in the cache system obtains an access request; based on the access request, the computing unit accesses the local shared buffer in the computing subsystem where it is located that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
In a possible implementation, the method further includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading the operation data indicated by the access request from the storage module of the chip based on the access request, and caching the operation data read from the storage module into the local shared buffer.
In an optional implementation, reading the operation data indicated by the access request from the storage module of the chip when that operation data is not read from the local shared buffer that matches the access address indicated by the access request includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading the operation data indicated by the access request from the global buffer of the chip based on the access request.
In an optional implementation, the method further includes: when the operation data indicated by the access request is not read from the global buffer, reading the operation data indicated by the access request from the external memory of the chip, and caching the operation data read from the external memory into the global buffer.
In a third aspect, the present disclosure provides a data processing device, including: an obtaining module, configured to obtain an access request; a reading module, configured to, based on the access request, read the local shared buffer that matches the access address indicated by the access request; and an operation module, configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared buffer to obtain an operation result.
In a possible implementation, the device further includes an access module, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
In an optional implementation, when reading the operation data indicated by the access request from the storage module of the chip in the case where that operation data is not read from the local shared buffer that matches the access address indicated by the access request, the access module is configured to: read the operation data indicated by the access request from the global buffer of the chip based on the access request.
In an optional implementation, the access module is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
In a fourth aspect, the present disclosure provides a chip, including the cache system described in the first aspect or any implementation thereof and a storage module; the cache system is used to obtain operation data from the storage module and cache the operation data.
In a fifth aspect, the present disclosure provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
In a sixth aspect, the present disclosure provides an electronic device, including the chip described in the fourth aspect.
In a seventh aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
To make the above objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below. These drawings show embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a schematic architectural diagram of a chip cache system provided by an embodiment of the present disclosure;
FIG. 2 is a schematic architectural diagram of another chip cache system provided by an embodiment of the present disclosure;
FIG. 3 is a schematic architectural diagram of a chip provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic architectural diagram of a data processing device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic architectural diagram of another chip provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. The described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure generally described and illustrated in the figures herein can be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
When a computing-power unit in a chip works, it can read data from the external memory of the chip and store the read data in internal memory, so that the unit can then read the data from internal memory. However, when a large amount of data needs to be read, or data is read many times, the interaction between external and internal memory makes the computing efficiency of the unit low. The following description takes an artificial intelligence (AI) chip as an example.
Generally, a purpose-designed AI chip can be used to process the computation of a neural network. Taking cloud-side scenarios as an example, the inference and training of large-scale neural networks place higher demands on the computing power of AI chips. To meet this demand, cloud-side AI chips are growing ever larger, increasing chip area to accommodate more computing units so that a single chip has greater computing power. However, more computing units also mean that higher-bandwidth data paths are needed to sustain the data requirements of the computation. To provide the bandwidth required for computation, the conventional approach is to add caches inside the AI chip and reduce the computing units' accesses to external memory by reusing the operation data.
For example, a cache can be placed inside each computing unit of the AI chip; for instance, the cache can be a first-level cache (L1 cache) of a graphics processing unit (GPU). In this way, the same computing unit can access the same cached operation data multiple times, realizing reuse of the operation data in the time dimension and meeting the bandwidth requirements of the computing unit. However, with this approach the operation data cached in any computing unit cannot be read by other computing units: for any computing unit, the other computing units cannot access the operation data cached inside it. Thus, reuse of the operation data in the spatial dimension cannot be achieved, resulting in a low utilization rate of the operation data.
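The drawback described above can be made concrete with a small sketch: with a private cache per computing unit, data fetched by one unit is invisible to its peers, so every unit pays its own external fetch. All names here are illustrative, not from the patent:

```python
# Hedged sketch of per-unit private caches: time-dimension reuse only,
# no spatial reuse across computing units.

class UnitWithPrivateCache:
    def __init__(self, storage):
        self.cache = {}         # private L1-style cache, invisible to peers
        self.storage = storage  # stands in for external memory
        self.fetches = 0

    def read(self, address):
        if address not in self.cache:
            self.cache[address] = self.storage[address]
            self.fetches += 1   # each unit pays its own external fetch
        return self.cache[address]

storage = {40: "filter-tile"}
units = [UnitWithPrivateCache(storage) for _ in range(4)]

# All four units read the same address: four external fetches in total,
# one per unit, because no unit can see another unit's cache.
for u in units:
    u.read(40)
assert sum(u.fetches for u in units) == 4
```

This is the baseline the disclosed local shared buffer is meant to improve on: sharing the cache across a subsystem collapses those four fetches into one.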
Based on this, an embodiment of the present disclosure provides a chip cache system.
It should be noted that similar numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further defined or explained in subsequent figures.
Referring to FIG. 1, which is a schematic architectural diagram of the chip cache system provided by an embodiment of the present disclosure, the cache system includes a plurality of computing subsystems 11, and each computing subsystem 11 includes at least one computing unit 101 and at least one local shared buffer 102. Each computing unit 101 is connected to any local shared buffer 102 in the computing subsystem where it is located.
The local shared buffer 102 is used to cache the operation data read by the computing units in the computing subsystem to which it belongs.
The computing unit 101 is used to, based on a generated access request, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, perform an operation based on the read operation data.
In implementation, each computing subsystem can include a local cache module, which is divided into multiple physical memory banks. For example, each bank can correspond to one local shared buffer, and each computing unit in the computing subsystem can correspond to one compute core in the AI chip.
Within each computing subsystem, each local shared buffer can cache operation data, and the cached operation data can be data that any computing unit in the subsystem has read from the storage module of the chip. When performing a computation, a computing unit can generate an access request and, based on it, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request. If the operation data indicated by the access request is read from the local shared buffer, the read operation data is used to perform an operation, such as a convolution, to obtain an operation result.
Here, the chip is divided into multiple computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. For any computing subsystem, a computing unit in it can cache the operation data it reads into the local shared buffer of that subsystem, so that every computing unit in the subsystem can read the operation data from the local shared buffer, and each computing unit can read it from the local shared buffer multiple times at different points in time. The utilization of the operation data is therefore high, and the data does not need to be read repeatedly from outside the cache system, which improves the computing efficiency of the computing units.
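The subsystem-level sharing described above can be sketched as follows: several computing units share a pool of local banks, so data fetched once by any unit can be re-read by all of them. The names and the bank-interleaving policy are ours, assumed for illustration only:

```python
# Hedged sketch of one computing subsystem with shared local banks:
# one external fetch serves every computing unit in the subsystem.

class ComputeSubsystem:
    def __init__(self, num_units, num_banks, storage):
        self.units = list(range(num_units))
        self.banks = [{} for _ in range(num_banks)]  # local shared buffers
        self.storage = storage                       # chip storage module
        self.fetches = 0                             # external reads issued

    def bank_for(self, address):
        # Simple interleaving: low address bits select the bank.
        return self.banks[address % len(self.banks)]

    def read(self, unit, address):
        bank = self.bank_for(address)
        if address not in bank:              # miss: fetch once, cache it
            bank[address] = self.storage[address]
            self.fetches += 1
        return bank[address]                 # every unit sees the copy

storage = {40: "filter-tile"}
sub = ComputeSubsystem(num_units=4, num_banks=2, storage=storage)

# Four different computing units read the same address, but only the
# first read goes out to the storage module.
results = [sub.read(u, 40) for u in sub.units]
assert results == ["filter-tile"] * 4
assert sub.fetches == 1
```

Compared with private per-unit caches, this captures the spatial reuse the disclosure targets: one fetch, many readers.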
In an optional implementation, as shown in FIG. 2, each computing subsystem further includes a local interconnect bus unit 103. The local interconnect bus unit 103 is configured to connect each computing unit to any local shared cache in the same computing subsystem, and, after receiving an access request issued by a computing unit, to determine the local shared cache matching the access address indicated by the access request.
The computing unit 101 is configured to access the local shared cache determined by the local interconnect bus unit as matching the access address indicated by the access request.
Each computing subsystem may further include a local interconnect bus unit, which connects each computing unit in the subsystem to any local shared cache and, upon receiving an access request issued by a computing unit, determines the local shared cache matching the access address indicated by the request.
Then, once the local shared cache matching the access address indicated by the access request has been determined, the computing unit can access that local shared cache. For example, the local interconnect bus unit may include a Network on Chip (NoC) or the like.
Here, connecting each computing unit to every local shared cache through the local interconnect bus unit allows each computing unit to access any local shared cache, improving the utilization of operation data. Meanwhile, after receiving an access request issued by a computing unit, the local interconnect bus unit determines the local shared cache matching the access address indicated by the request, so that the computing unit can access the corresponding local shared cache based on the request.
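The patent does not specify how the local interconnect bus unit matches an access address to one of the local shared caches. The following sketch assumes a simple interleaved (modulo) mapping purely for illustration; the bank count and line size are invented parameters, not values from the disclosure.

```python
# Assumed address-to-bank mapping for a local interconnect bus unit.
# Both constants are hypothetical.

M_BANKS = 4          # assumed number of local shared caches per subsystem
LINE_BYTES = 64      # assumed cache-line size

def match_bank(access_address: int) -> int:
    """Return the index of the local shared cache matching the address."""
    line_index = access_address // LINE_BYTES
    return line_index % M_BANKS

print(match_bank(0x0000))   # 0
print(match_bank(0x0040))   # 1
```

With such an interleaved scheme, consecutive cache lines land in different banks, so computing units issuing requests to different lines can be served by different local shared caches concurrently.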
In an optional implementation, the computing unit 101 is further configured to: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the chip's storage module, and cache the operation data read from the storage module into the local shared cache.
The chip further includes a storage module that stores the operation data needed by the computing units. During implementation, for any computing subsystem, if a computing unit in the subsystem does not read the operation data indicated by an access request from the local shared cache matching the access address indicated by the request, it can, based on the access request, read the indicated operation data from the chip's storage module and cache the data read from the storage module into a local shared cache included in the subsystem.
Here, when the operation data indicated by the access request is not present in the local shared cache, the computing unit can read it from the chip's storage module based on the access request and cache the data read from the storage module into the local shared cache, so that the data can subsequently be read from the local shared cache without being fetched from the storage module again, improving the degree of reuse of the operation data.
In an optional implementation, the chip's storage module includes a global cache. The computing unit 101 is further configured to: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the chip's global cache.
During implementation, when a computing unit in a computing subsystem does not read the operation data indicated by an access request from the matching local shared cache, it reads the indicated operation data from the chip's global cache according to the request. If the operation data indicated by the access request is successfully read from the global cache, that data can be cached into a local shared cache within the subsystem.
In an optional implementation, the chip's storage module further includes an external memory. The computing unit is further configured to: when the operation data indicated by the access request is not read from the global cache, read the operation data indicated by the access request from the chip's external memory and cache the data read from the external memory into the global cache.
If a computing unit in a computing subsystem does not read the operation data indicated by the access request from the global cache, it can read the indicated data from the chip's external memory according to the request and cache the read data into the global cache, so that other computing subsystems can read it from the global cache; it also caches the read data into a local shared cache within its own subsystem, so that the other computing units in the subsystem can read the data from the local shared cache and the computing unit itself can read it from the local shared cache next time.
The chip's external memory can store all the operation data required by the chip's computing units for their operations. The global cache can store the operation data that has been read by the computing units in the various computing subsystems. The local shared cache in any computing subsystem can store the operation data that has been read by each computing unit in that subsystem.
The external memory and the global cache are located within the chip and outside the cache system, with the global cache connected to the external memory and to the cache system respectively. Illustratively, when operation data is read based on an access request, the data indicated by the request is first read from the local shared cache; if it is not present there, it is read from the global cache; and if it is not present in the global cache, it is read from the external memory. Moreover, after the operation data indicated by the access request has been read from the external memory, the read data can first be cached into the global cache and then cached into the local shared cache.
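The three-level read path just described (local shared cache, then global cache, then external memory, with fills on the way back) can be sketched as follows. Dicts keyed by address stand in for the hardware caches, and all names and contents are illustrative rather than taken from the disclosure.

```python
# Illustrative sketch of the read path: local shared cache -> global cache
# -> external memory, filling the global cache first and then the local
# shared cache on a miss, as described in the text above.

local_shared_cache = {}                      # subsystem-local banks (flattened)
global_cache = {}                            # shared by all subsystems
external_memory = {0x1000: "weights_block"}  # backing store with all data

def read_operation_data(addr):
    if addr in local_shared_cache:           # hit in the local shared cache
        return local_shared_cache[addr]
    if addr in global_cache:                 # local miss, global hit
        data = global_cache[addr]
    else:                                    # miss everywhere: external memory
        data = external_memory[addr]
        global_cache[addr] = data            # fill the global cache first
    local_shared_cache[addr] = data          # then fill the local shared cache
    return data

read_operation_data(0x1000)   # first read: fetched from external memory
read_operation_data(0x1000)   # second read: served by the local shared cache
```

After the first miss, the data sits in both the global cache (visible to other subsystems) and the local shared cache (visible to the other units in the same subsystem), which is exactly the fill order the paragraph above specifies.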
In an optional implementation, the cache system further includes a global interconnect bus unit 104; the global interconnect bus unit 104 is configured to connect each of the computing subsystems to the storage module.
During implementation, the global interconnect bus unit can connect each computing subsystem to the storage module, that is, connect each subsystem to the global cache and the external memory in the storage module, so that every computing unit in any subsystem can read the global cache and the external memory based on access requests.
As shown in FIG. 3, the AI chip can be divided into X computing subsystems, each including N computing units and M local shared caches, with each computing unit connected to the M local shared caches so that each computing unit can read any connected local shared cache, where X, N and M are positive integers.
The workflow of the cache system is described by way of example with reference to FIG. 3. Operation data is cached in the M local shared caches of computing subsystem 0. When computing unit 0 in subsystem 0 performs an operation, it can generate and issue an access request. After receiving the request, the local interconnect bus unit determines the local shared cache in subsystem 0 that matches the access address indicated by the request. If the determined cache is local shared cache 0, the computing unit can read the operation data cached in local shared cache 0.
If the operation data indicated by the access request is read, computing unit 0 performs an operation based on the read data to obtain an operation result. If the indicated data is not read from local shared cache 0, computing unit 0 can read the global cache based on the access request. If the indicated data is read from the global cache, the data read from the global cache can be cached into any local shared cache included in subsystem 0, and the operation is performed on the read data to obtain the result.
If the operation data indicated by the access request is not read from the global cache, computing unit 0 can read the external memory based on the request, cache the data read from the external memory into the global cache and into any local shared cache included in subsystem 0, and perform the operation based on the read data to obtain the operation result.
In general, operation results can be cached in other caches provided for that purpose, with no need to cache them in the local shared caches; for example, such other caches may be a level-1 cache (L1 cache) inside the computing unit. Hence, in actual scheduling, the local shared caches are mainly used to store read-only operation data and are not written to. Consequently, the local shared caches included in different computing subsystems need not maintain data consistency with one another; for example, the operation data cached in the local shared caches of subsystem 0 may differ from that cached in the local shared caches of subsystem 1. This design meets the computing units' data requirements for both global and local caching while reducing the design complexity of the cache system and improving the performance of the AI chip.
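The no-coherence property described above can be illustrated with a sketch in which each subsystem fills its own read-only caches independently; all structures and names are hypothetical.

```python
# Sketch: read-only local shared caches need no coherence protocol.
# Each subsystem fills its own cache on a miss, writes never target it,
# so no invalidation traffic is required and the subsystems' cached
# contents are allowed to diverge.

external_memory = {a: f"data@{a:#x}" for a in (0x100, 0x200, 0x300)}
subsystem_caches = [dict(), dict()]          # subsystem 0 and subsystem 1

def read(subsystem: int, addr):
    cache = subsystem_caches[subsystem]
    if addr not in cache:                    # fill on miss; never written back
        cache[addr] = external_memory[addr]
    return cache[addr]

read(0, 0x100)
read(1, 0x200)
# The two subsystems now cache different address sets, which is acceptable
# because the data is read-only and always consistent with external memory.
print(sorted(subsystem_caches[0]) == sorted(subsystem_caches[1]))   # False
```

Because no cache line is ever dirty, a line cached in one subsystem can never contradict the same line cached elsewhere, which is what lets the design omit coherence machinery.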
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same concept, an embodiment of the present disclosure further provides a data processing method applied to the chip cache system described in the foregoing implementations. Referring to FIG. 4, a schematic flowchart of the data processing method provided by an embodiment of the present disclosure, the method includes the following steps S401 to S403.
S401: Any computing unit in any computing subsystem included in the cache system acquires an access request.
S402: Based on the access request, the computing unit accesses the local shared cache in its computing subsystem that matches the access address indicated by the access request.
S403: When the operation data indicated by the access request is read from the accessed local shared cache, the computing unit performs an operation based on the read operation data to obtain an operation result.
In an optional implementation, the method further includes: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, the computing unit reads, based on the access request, the operation data indicated by the access request from the chip's storage module and caches the data read from the storage module into the local shared cache.
In an optional implementation, reading, by the computing unit based on the access request, the operation data indicated by the access request from the chip's storage module when the data is not read from the matching local shared cache includes: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, reading, by the computing unit based on the access request, the operation data indicated by the access request from the chip's global cache.
In an optional implementation, the method further includes: when the operation data indicated by the access request is not read from the global cache, the computing unit reads the operation data indicated by the access request from the chip's external memory and caches the data read from the external memory into the global cache.
Based on the same concept, an embodiment of the present disclosure further provides a data processing apparatus. Referring to FIG. 5, a schematic architectural diagram of the data processing apparatus provided by an embodiment of the present disclosure, the apparatus includes an acquisition module 501, a reading module 502 and an operation module 503.
The acquisition module 501 is configured to acquire an access request.
The reading module 502 is configured to read, based on the access request, the local shared cache matching the access address indicated by the access request.
The operation module 503 is configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared cache to obtain an operation result.
In a possible implementation, the apparatus further includes an access module 504, configured to, when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read the operation data indicated by the access request from the chip's storage module based on the access request, and cache the data read from the storage module into the local shared cache.
In an optional implementation, when reading the operation data indicated by the access request from the chip's storage module in the case where the data is not read from the local shared cache matching the access address indicated by the access request, the access module 504 is configured to: when the operation data indicated by the access request is not read from the matching local shared cache, read the operation data indicated by the access request from the chip's global cache based on the access request.
In an optional implementation, the access module 504 is further configured to: when the operation data indicated by the access request is not read from the global cache, read the operation data indicated by the access request from the chip's external memory and cache the data read from the external memory into the global cache.
In some embodiments, the functions or modules of the apparatus provided by the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments; for their specific implementation, reference may be made to the descriptions of those method embodiments, which are not repeated here for brevity.
Based on the same concept, an embodiment of the present disclosure further provides a chip, including the cache system 601 described in the foregoing implementations and a storage module 602.
The cache system 601 is configured to acquire operation data from the storage module 602 and cache the operation data.
The storage module 602 may include a global cache and an external memory. That is, the cache system can first read operation data from the global cache based on an access request and, when the operation data corresponding to the request is not present in the global cache, read the corresponding data from the external memory based on the request.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Referring to FIG. 7, a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure, the electronic device includes a processor 701, a memory 702 and a bus 703. The memory 702 is configured to store execution instructions and includes an internal memory 7021 and an external storage 7022; the internal memory 7021 temporarily stores operation data of the processor 701 and data exchanged with the external storage 7022 such as a hard disk, and the processor 701 exchanges data with the external storage 7022 through the internal memory 7021. When the electronic device 700 runs, the processor 701 communicates with the memory 702 via the bus 703, causing the processor 701 to execute the following instructions: acquiring an access request; reading, based on the access request, the local shared cache matching the access address indicated by the access request; and, when the operation data indicated by the access request is read, performing an operation based on the operation data read from the local shared cache to obtain an operation result.
For the specific processing flow of the processor 701, reference may be made to the descriptions in the foregoing method embodiments, which are not repeated here.
Alternatively, the electronic device may be the chip described in the foregoing implementations.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code, where the instructions included in the program code can be used to execute the steps of the data processing method described in the above method embodiments; for details, reference may be made to those method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems and apparatuses described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and other division manners are possible in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure in essence, the part thereof contributing to the prior art, or a part of the technical solutions may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and such changes or substitutions shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

  1. A chip cache system, characterized by comprising a plurality of computing subsystems, wherein
    each of the computing subsystems comprises at least one computing unit and at least one local shared cache;
    each of the computing units is connected to any one of the local shared caches in the computing subsystem in which it is located;
    the local shared cache is configured to cache operation data read by a computing unit in the computing subsystem to which the local shared cache belongs; and
    the computing unit is configured to:
    access, based on a generated access request, a local shared cache in the computing subsystem that matches an access address indicated by the access request; and
    when the operation data indicated by the access request is read from the accessed local shared cache, perform an operation based on the read operation data.
  2. The cache system according to claim 1, characterized in that
    each of the computing subsystems further comprises a local interconnect bus unit;
    the local interconnect bus unit is configured to:
    connect each of the computing units to any one of the local shared caches in the computing subsystem in which the computing unit is located; and
    after receiving the access request issued by the computing unit, determine the local shared cache matching the access address indicated by the access request; and
    the computing unit is configured to access the local shared cache determined by the local interconnect bus unit as matching the access address indicated by the access request.
  3. The cache system according to claim 1 or 2, characterized in that the computing unit is further configured to:
    when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from a storage module of the chip, and cache the operation data read from the storage module into the local shared cache.
  4. The cache system according to claim 3, characterized in that the storage module of the chip comprises a global cache, and the computing unit is further configured to:
    when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global cache of the chip.
  5. The cache system according to claim 4, wherein the storage module of the chip further comprises an external memory, and the computing unit is further configured to:
    in a case where the operation data indicated by the access request is not read from the global cache, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global cache.
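Claims 3 to 5 together describe a multi-level read path: local shared cache first, then the global cache, then off-chip external memory, with misses filled on the way back. A minimal sketch of this lookup order is below; the function name and the dict-based stand-ins for the three storage levels are illustrative assumptions only.

```python
# Illustrative sketch of the read path in claims 3-5. Each storage level is
# modeled as a dict keyed by address; real hardware would use tag lookup.

def read_operand(address, local_cache, global_cache, external_memory):
    """Return the operation data for `address`, filling caches on a miss."""
    if address in local_cache:           # hit in the local shared cache
        return local_cache[address]
    if address in global_cache:          # local miss, global-cache hit (claim 4)
        data = global_cache[address]
    else:                                # global miss: read external memory
        data = external_memory[address]
        global_cache[address] = data     # claim 5: cache it in the global cache
    local_cache[address] = data          # claim 3: cache it in the local cache
    return data

local, glob = {}, {0x100: "warm"}
mem = {0x100: "dram-a", 0x200: "dram-b"}
print(read_operand(0x100, local, glob, mem))  # served from the global cache
print(read_operand(0x200, local, glob, mem))  # served from external memory
```

After the second call, the data for both addresses is resident in the local shared cache, so repeated accesses by computing units in the same subsystem no longer leave the subsystem.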
  6. The cache system according to any one of claims 3 to 5, wherein:
    the cache system further comprises a global interconnect bus unit; and
    the global interconnect bus unit is configured to connect each of the computing subsystems to the storage module.
  7. A data processing method, applied to the cache system of the chip according to any one of claims 1 to 6, the method comprising:
    acquiring, by a computing unit in any computing subsystem included in the cache system, an access request;
    accessing, by the computing unit based on the access request, a local shared cache in the computing subsystem in which the computing unit is located that matches the access address indicated by the access request; and
    in a case where the operation data indicated by the access request is read from the accessed local shared cache, performing, by the computing unit, an operation based on the read operation data to obtain an operation result.
  8. A chip, comprising:
    a storage module; and
    the cache system according to any one of claims 1 to 6, wherein the cache system is configured to acquire operation data from the storage module and cache the operation data.
  9. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus; and the machine-readable instructions, when executed by the processor, implement the steps of the data processing method according to claim 7.
  10. An electronic device, comprising the chip according to claim 8.
  11. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when run by a processor, executes the steps of the data processing method according to claim 7.
PCT/CN2022/121033 2021-12-31 2022-09-23 Chip cache system, data processing method, device, storage medium, and chip WO2023124304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111662634.XA CN114297098A (en) 2021-12-31 2021-12-31 Chip cache system, data processing method, device, storage medium and chip
CN202111662634.X 2021-12-31

Publications (1)

Publication Number Publication Date
WO2023124304A1 true WO2023124304A1 (en) 2023-07-06

Family

ID=80974096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121033 WO2023124304A1 (en) 2021-12-31 2022-09-23 Chip cache system, data processing method, device, storage medium, and chip

Country Status (2)

Country Link
CN (1) CN114297098A (en)
WO (1) WO2023124304A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297098A (en) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 Chip cache system, data processing method, device, storage medium and chip

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233409A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Managing shared memory used by compute nodes
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN104981786A (en) * 2013-03-05 2015-10-14 国际商业机器公司 Prefetching for parent core in multi-core chip
CN107291629A (en) * 2016-04-12 2017-10-24 华为技术有限公司 A kind of method and apparatus for accessing internal memory
CN114297098A (en) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 Chip cache system, data processing method, device, storage medium and chip

Also Published As

Publication number Publication date
CN114297098A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
US11294599B1 (en) Registers for restricted memory
JP3628595B2 (en) Interconnected processing nodes configurable as at least one NUMA (NON-UNIFORMMOMERYACCESS) data processing system
US8352656B2 (en) Handling atomic operations for a non-coherent device
CN108268385B (en) Optimized caching agent with integrated directory cache
TWI767111B (en) Sever system
US20200242042A1 (en) System, Apparatus and Method for Performing a Remote Atomic Operation Via an Interface
WO2019153702A1 (en) Interrupt processing method, apparatus and server
US7454576B2 (en) System and method for cache coherency in a cache with different cache location lengths
US11868306B2 (en) Processing-in-memory concurrent processing system and method
CN114860329A (en) Dynamic consistency biasing configuration engine and method
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
US11157191B2 (en) Intra-device notational data movement system
US20220269433A1 (en) System, method and apparatus for peer-to-peer communication
US20180336034A1 (en) Near memory computing architecture
TW202215223A (en) Devices for accelerators and method for processing data
WO2023134735A1 (en) Computing device, data processing method and system, and related device
WO2016049807A1 (en) Cache directory processing method and directory controller of multi-core processor system
US11847049B2 (en) Processing system that increases the memory capacity of a GPGPU
US20220342835A1 (en) Method and apparatus for disaggregation of computing resources
CN115563053A (en) High-performance on-chip memory controller and execution method thereof
US7930459B2 (en) Coherent input output device
US11275669B2 (en) Methods and systems for hardware-based statistics management using a general purpose memory
US11907144B1 (en) Early semaphore update
US20240070107A1 (en) Memory device with embedded deep learning accelerator in multi-client environment
US11550736B1 (en) Tensorized direct memory access descriptors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913569

Country of ref document: EP

Kind code of ref document: A1