WO2023124304A1 - Chip cache system, data processing method, device, storage medium, and chip - Google Patents

Publication number: WO2023124304A1
Authority: WIPO (PCT)
Application number: PCT/CN2022/121033
Prior art keywords: access request, read, operation data, computing, local shared
Other languages: French (fr), Chinese (zh)
Inventors: 王文强, 夏晓旭, 朱志岐, 徐宁仪
Original assignee: 上海商汤智能科技有限公司
Publication of WO2023124304A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
            • G06F 12/02 Addressing or allocation; Relocation
              • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
                • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
                  • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
                    • G06F 12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
                    • G06F 12/084 Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
          • G06F 15/00 Digital computers in general; Data processing equipment in general
            • G06F 15/76 Architectures of general purpose stored program computers
              • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to the technical field of integrated circuits, and in particular, relates to a chip cache system, a data processing method, electronic equipment, a computer-readable storage medium, and a chip.
  • A neural network is an algorithmic mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. With the rapid development of neural networks, they have been applied in various fields.
  • a purpose-designed artificial intelligence (AI) chip can be used to handle the computation of a neural network; designing an efficient AI chip has therefore become one of the effective means of improving neural network processing efficiency.
  • the present disclosure at least provides a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
  • the present disclosure provides a chip cache system, including: a plurality of computing subsystems, each of which includes at least one computing unit and at least one local shared buffer, where each computing unit is connected to any one of the local shared buffers in its computing subsystem; the local shared buffer is used to cache the operation data read by the computing units in the computing subsystem; and the computing unit is used to access, based on a generated access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, to perform an operation based on the read operation data.
  • the chip is divided into multiple computing subsystems, and each computing subsystem includes at least one computing unit and at least one local shared buffer.
  • the computing unit in the computing subsystem can cache the read operation data into a local shared buffer of the computing subsystem, so that every computing unit in the computing subsystem can read the operation data from the shared buffer, and each computing unit can read the operation data from the local shared buffer multiple times at different points in time without fetching it from memory each time, which improves the operation efficiency of the computing units.
  • each of the computing subsystems further includes a local interconnection bus unit; the local interconnection bus unit is used to connect each of the computing units with any local shared buffer, and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request; the computing unit is used to access the local shared buffer determined by the local interconnection bus unit that matches the access address indicated by the access request.
  • each computing unit is connected to any local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, thereby improving the utilization rate of the operation data.
  • the local interconnection bus unit determines the local shared buffer matching the access address indicated by the access request, so that the computing unit can read the corresponding local shared buffer based on the access request.
  • the operation unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • the operation unit can read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer, so that the operation data can later be read from the local shared buffer without reading it from the storage module again, which improves the degree of reuse of the operation data.
  • the storage module of the chip includes a global buffer; the operation unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the storage module of the chip further includes an external memory; the operation unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the cache system further includes: a global interconnection bus unit, configured to connect each computing subsystem to the storage module respectively.
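The hierarchy summarized above (computing subsystems containing computing units and local shared buffers, plus a global buffer and external memory in the storage module) can be sketched as follows. This is an illustrative model only; all class and field names are assumptions, not terms defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class LocalSharedBuffer:
    # access address -> cached operation data
    data: dict = field(default_factory=dict)

@dataclass
class ComputingSubsystem:
    # at least one computing unit and at least one local shared buffer,
    # with every unit able to reach every buffer in the subsystem
    num_units: int
    buffers: list

@dataclass
class ChipCacheSystem:
    subsystems: list                              # multiple computing subsystems
    global_buffer: dict = field(default_factory=dict)
    external_memory: dict = field(default_factory=dict)

# a chip with 2 subsystems, each holding 4 units and 1 local shared buffer
chip = ChipCacheSystem(
    subsystems=[ComputingSubsystem(num_units=4, buffers=[LocalSharedBuffer()])
                for _ in range(2)],
)
assert len(chip.subsystems) == 2
```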
  • the present disclosure provides a data processing method, applied to the cache system of the chip described in the first aspect or any implementation thereof, the method including: a computing unit in any computing subsystem included in the cache system obtains an access request; the computing unit accesses, based on the access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
  • the method further includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, and caching the operation data read from the storage module into the local shared buffer.
  • reading the operation data indicated by the access request from the storage module of the chip based on the access request includes: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, reading, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the method further includes: when the operation data indicated by the access request is not read from the global buffer, reading the operation data indicated by the access request from the external memory of the chip, and caching the operation data read from the external memory into the global buffer.
  • the present disclosure provides a data processing device, the device including: an acquisition module, configured to acquire an access request; a reading module, configured to access, based on the access request, the local shared buffer that matches the access address indicated by the access request; and an operation module, configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared buffer to obtain an operation result.
  • the device further includes an access module, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • when reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, the access module is configured to: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the access module is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the present disclosure provides a chip, including the cache system described in the first aspect or any one of its implementations, and a storage module; the cache system is used to obtain operation data from the storage module and cache the operation data.
  • the present disclosure provides an electronic device, including a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor, and when the electronic device runs, the processor and the memory communicate through the bus; when the machine-readable instructions are executed by the processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
  • the present disclosure provides an electronic device, including the chip as described in the fourth aspect.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
  • FIG. 1 shows a schematic diagram of the architecture of a chip cache system provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of the architecture of another chip cache system provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic structural diagram of a chip provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic structural diagram of a data processing device provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of another chip provided by an embodiment of the present disclosure.
  • FIG. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • when the computing unit in the chip is working, it can read data from the external memory of the chip and store the read data in internal memory, so that the computing unit can then read the data from the internal memory.
  • here, an artificial intelligence (AI) chip is taken as an example of the chip for illustration.
  • the designed AI chip can be used to handle the calculation process of the neural network.
  • the inference and training of large-scale neural networks place higher demands on the computing power of AI chips.
  • the scale of AI chips on the cloud side is also getting larger and larger.
  • more computing units also mean that a higher bandwidth data path is required to support the data requirements of the operation.
  • the conventional approach is to enlarge the cache inside the AI chip and reduce the computing units' accesses to external memory by reusing operation data.
  • a cache can be set inside each computing unit of the AI chip; for example, the cache can be the first-level cache (L1 Cache) of a graphics processing unit (GPU).
  • the same computing unit can then access the same cached operation data multiple times, realizing reuse of operation data in the time dimension and meeting the bandwidth requirements of the computing unit.
  • however, with this method, the operation data cached in any one computing unit cannot be read by the other computing units; reuse of operation data in the space dimension therefore cannot be realized, resulting in a low utilization rate of the operation data.
  • an embodiment of the present disclosure provides a chip cache system.
  • the cache system includes a plurality of computing subsystems 11, and each computing subsystem 11 includes at least one computing unit 101 and at least one local shared buffer 102.
  • each computing unit 101 is connected to any local shared buffer 102 in its computing subsystem.
  • the local shared buffer 102 is used for caching the operation data read by the computing units in the computing subsystem to which it belongs.
  • the computing unit 101 is configured to access, based on a generated access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, to perform an operation based on the read operation data.
  • the computing subsystem may include a local cache module, and the local cache module may be divided into multiple physical memory banks; each bank may correspond to one local shared buffer, and each computing unit in the computing subsystem may correspond to a computing core in the AI chip.
  • each local shared buffer in the computing subsystem can cache operation data, and the cached operation data can be read from the storage module of the chip by any computing unit in the computing subsystem. When performing a calculation, the computing unit may generate an access request and, based on the generated access request, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request. If the operation data indicated by the access request is read from the local shared buffer, an operation, such as a convolution operation, is performed with the read operation data to obtain an operation result.
  • the chip is divided into multiple computing subsystems, and each computing subsystem includes at least one computing unit and at least one local shared buffer.
  • the computing unit in the computing subsystem can cache the read operation data into a local shared buffer of the computing subsystem, so that every computing unit in the computing subsystem can read the operation data from the shared buffer, and each computing unit can read the operation data from the local shared buffer multiple times at different points in time without fetching it from memory each time, which improves the operation efficiency of the computing units.
  • each of the computing subsystems further includes a local interconnection bus unit 103; the local interconnection bus unit 103 is used to connect each of the computing units with any local shared buffer in the computing subsystem, and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request.
  • the computing unit 101 is used to access the local shared buffer determined by the local interconnection bus unit that matches the access address indicated by the access request.
  • each computing subsystem may also include a local interconnection bus unit, which is used to connect each computing unit in the computing subsystem with any local shared buffer and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request; the computing unit can then access the local shared buffer matched with the access address indicated by the access request.
  • the local interconnection bus unit may include: a network on chip (Network on Chip, NoC) and the like.
  • each computing unit is connected to any local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, thereby improving the utilization rate of operation data.
  • the local interconnection bus unit determines the local shared buffer matching the access address indicated by the access request, so that the computing unit can access the corresponding local shared buffer based on the access request.
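The patent does not specify how the local interconnection bus unit maps an access address to one of the local shared buffers. One common scheme, assumed here purely for illustration, is low-order interleaving of cache lines across the M buffers; the function name and the `line_bytes` granularity below are hypothetical.

```python
# Illustrative bank-selection scheme for the local interconnection bus unit.
# Low-order address interleaving across the M local shared buffers is an
# assumed choice, not one taken from the patent.

def select_local_buffer(access_address: int, num_buffers: int,
                        line_bytes: int = 64) -> int:
    """Return the index of the local shared buffer matching the address."""
    line = access_address // line_bytes        # cache-line granularity
    return line % num_buffers                  # interleave lines across banks

# Consecutive cache lines land in different banks, so computing units in a
# subsystem can access different local shared buffers in parallel.
assert select_local_buffer(0x0000, num_buffers=4) == 0
assert select_local_buffer(0x0040, num_buffers=4) == 1
assert select_local_buffer(0x0080, num_buffers=4) == 2
```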
  • the operation unit 101 is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • the chip also includes a storage module, which stores the operation data required by the computing units.
  • in this way, the operation unit can read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer, so that the operation data can later be read from the local shared buffer without reading it from the storage module again, which improves the degree of reuse of the operation data.
  • the storage module of the chip includes a global buffer.
  • the operation unit 101 is further configured to: if the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • when the operation unit in the computing subsystem does not read the operation data indicated by the access request from the local shared buffer that matches the access address indicated by the access request, it reads the operation data indicated by the access request from the global buffer of the chip according to the access request. If the operation data indicated by the access request is successfully read from the global buffer, the operation data read from the global buffer can be cached into a local shared buffer in the computing subsystem.
  • the storage module of the chip further includes an external memory.
  • the operation unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • when the operation unit in the computing subsystem does not read the operation data indicated by the access request from the global buffer, it can read the operation data indicated by the access request from the external memory of the chip according to the access request, cache the read operation data into the global buffer so that other computing subsystems can read it from there, and cache the read operation data into a local shared buffer in the computing subsystem so that other computing units in the subsystem, and the computing unit itself next time, can read it from the local shared buffer.
  • the external memory of the chip can be used to store all the operation data required by the operation unit in the chip for operation.
  • the global buffer can be used to store the operation data read by the operation units in each computing subsystem.
  • the local shared buffer in any computing subsystem can be used to store the operation data read by each computing unit in that computing subsystem.
  • the external memory and the global buffer are on the chip and external to the cache system; the global buffer is connected with the external memory and with the cache system, respectively.
  • thus, the operation data indicated by the access request is read from the local shared buffer first; if it is not present in the local shared buffer, the operation data indicated by the access request is read from the global buffer; and if it is not present in the global buffer, the operation data indicated by the access request is read from the external memory.
  • the operation data read in this way can be cached into the global buffer first, and then cached into the local shared buffer.
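The two bullets above describe a three-level read path: local shared buffer first, then the global buffer, then external memory, with read data filled into the global buffer and then the local shared buffer. A minimal sketch, assuming each level can be modelled as an address-to-data mapping (the function and variable names are illustrative):

```python
# Minimal sketch of the three-level read path described above.

def read_operation_data(addr, local_shared_buffer, global_buffer, external_memory):
    # 1. try the local shared buffer matching the access address
    if addr in local_shared_buffer:
        return local_shared_buffer[addr]
    # 2. on a miss, try the chip's global buffer
    if addr in global_buffer:
        data = global_buffer[addr]
        local_shared_buffer[addr] = data       # fill the local shared buffer
        return data
    # 3. on a global miss, read external memory and fill both levels,
    #    caching into the global buffer first, then the local shared buffer
    data = external_memory[addr]
    global_buffer[addr] = data
    local_shared_buffer[addr] = data
    return data

local, glob, ext = {}, {}, {0x100: "weights"}
assert read_operation_data(0x100, local, glob, ext) == "weights"
assert 0x100 in glob and 0x100 in local        # both levels now hold the data
assert read_operation_data(0x100, local, glob, ext) == "weights"  # local hit
```

Later reads of the same address hit the local shared buffer, which is the reuse-in-time and reuse-in-space behaviour the patent aims for.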
  • the cache system further includes a global interconnection bus unit 104; the global interconnection bus unit 104 is configured to connect each computing subsystem to the storage module.
  • the global interconnection bus unit connects each computing subsystem to the storage module, that is, to the global buffer and the external memory in the storage module, so that each computing unit in any computing subsystem can read the global buffer and the external memory based on access requests.
  • the AI chip can be divided into X computing subsystems; each computing subsystem includes N computing units and M local shared buffers, and each computing unit is connected to the M local shared buffers, so that each computing unit can read any connected local shared buffer.
  • X, N, M are positive integers.
  • operation data is cached in the M local shared buffers in computing subsystem 0.
  • when computing unit 0 in computing subsystem 0 performs a computation, computing unit 0 can generate and issue an access request.
  • the local interconnection bus unit determines the local shared buffer in computing subsystem 0 that matches the access address indicated by the access request. If the determined local shared buffer is local shared buffer 0, the computing unit may read the operation data cached in local shared buffer 0.
  • when the operation data indicated by the access request is read, computing unit 0 performs an operation based on the read operation data to obtain an operation result. If the operation data indicated by the access request is not read from local shared buffer 0, computing unit 0 may read the global buffer based on the access request. If the operation data indicated by the access request is read from the global buffer, the operation data read from the global buffer can be cached into any local shared buffer included in computing subsystem 0, and an operation is performed based on the read operation data to obtain an operation result.
  • if the operation data is not read from the global buffer either, computing unit 0 may read the external memory based on the access request, cache the operation data read from the external memory into the global buffer, cache the read operation data into any local shared buffer included in computing subsystem 0, and perform an operation based on the read operation data to obtain an operation result.
  • the operation result can be cached in other configured caches; it does not need to be cached in a local shared buffer.
  • the other caches may be the first-level cache (L1 Cache) inside the computing unit. In actual scheduling, the local shared buffers are therefore mainly used to store read-only operation data, and no write operation is performed on them. As a result, the local shared buffers included in different computing subsystems need not maintain data consistency; for example, the operation data cached in a local shared buffer in computing subsystem 0 may differ from the operation data cached in a local shared buffer in computing subsystem 1.
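The read-only, no-coherence property described above can be shown with a toy example (the addresses and data values below are invented for illustration): each subsystem fills its own local shared buffer independently, and because computing units never write results into these buffers, the differing copies require no invalidation or write-back traffic between subsystems.

```python
# Toy illustration: local shared buffers are read-only caches of operation
# data, so different subsystems may legitimately hold different contents.

global_buffer = {0x0: "filter_A", 0x40: "filter_B"}

subsystem0_local = {}
subsystem1_local = {}

# subsystem 0 only ever reads filter_A; subsystem 1 only reads filter_B
subsystem0_local[0x0] = global_buffer[0x0]
subsystem1_local[0x40] = global_buffer[0x40]

# the two local shared buffers now cache different operation data, which is
# allowed: no coherence protocol is needed because neither copy is written
assert subsystem0_local != subsystem1_local
assert 0x40 not in subsystem0_local
```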
  • in this way, on the basis of meeting the computing units' data requirements on the global cache and the local caches, the design complexity of the cache system can be reduced and the performance of the AI chip improved.
  • the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • the embodiment of the present disclosure also provides a data processing method, which is applied to the cache system of the chip described in the above embodiments. As shown in FIG. 4, which is a schematic flowchart of the data processing method provided by the embodiment of the present disclosure, the data processing method includes the following steps S401-S403.
  • S401: any computing unit in any computing subsystem included in the cache system acquires an access request.
  • S402: based on the access request, the computing unit accesses the local shared buffer in the computing subsystem that matches the access address indicated by the access request.
  • S403: when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
  • the method further includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, the computing unit reads, based on the access request, the operation data indicated by the access request from the storage module of the chip, and caches the operation data read from the storage module into the local shared buffer.
  • reading the operation data indicated by the access request from the storage module of the chip includes: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, the computing unit reads, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the method further includes: when the operation data indicated by the access request is not read from the global buffer, the computing unit reads the operation data indicated by the access request from the external memory of the chip, and caches the operation data read from the external memory into the global buffer.
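Steps S401-S403 can be sketched end to end as follows; the shape of the access request and the stand-in operation (summing the read data, in place of e.g. a convolution) are illustrative assumptions, not details from the patent.

```python
# Sketch of steps S401-S403 of the data processing method.

def process(access_request, local_shared_buffer):
    # S401: a computing unit in some computing subsystem obtains an access request
    addr = access_request["access_address"]
    # S402: access the local shared buffer matching the indicated access address
    operation_data = local_shared_buffer.get(addr)
    if operation_data is None:
        # in the full method this would fall back to the global buffer
        # and then to external memory
        return None
    # S403: perform the operation on the read operation data (stand-in: sum)
    return sum(operation_data)

buffer = {0x200: [1, 2, 3, 4]}
assert process({"access_address": 0x200}, buffer) == 10
assert process({"access_address": 0x300}, buffer) is None
```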
  • the embodiment of the present disclosure also provides a data processing device; as shown in FIG. 5, which is a schematic diagram of the structure of the data processing device provided by the embodiment of the present disclosure, the device includes an acquisition module 501, a reading module 502, and an operation module 503.
  • the acquisition module 501 is used to acquire an access request.
  • the reading module 502 is configured to access, based on the access request, the local shared buffer that matches the access address indicated by the access request.
  • the operation module 503 is configured to perform an operation based on the operation data read from the local shared buffer to obtain an operation result when the operation data indicated by the access request is read.
  • the device further includes an access module 504, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the storage module of the chip, and cache the operation data read from the storage module into the local shared buffer.
  • when reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, the access module 504 is configured to: when the operation data indicated by the access request is not read from the local shared buffer matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global buffer of the chip.
  • the access module 504 is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the functions of the device provided by the embodiments of the present disclosure or the included templates can be used to execute the methods described in the above method embodiments, and its specific implementation can refer to the description of the above method embodiments. For brevity, here No longer.
  • an embodiment of the present disclosure further provides a chip, including: the cache system 601 and the storage module 602 described in the foregoing implementation manners.
  • the cache system 601 is used to acquire operation data from the storage module 602 and cache the operation data.
  • the storage module 602 may include a global cache and an external memory. That is, the cache system can first read the operation data from the global cache based on the access request, and then read the operation data corresponding to the access request from the external memory based on the access request when there is no operation data corresponding to the access request in the global cache.
  • an embodiment of the present disclosure also provides an electronic device.
  • the electronic device includes a processor 701 , a memory 702 and a bus 703 .
  • the memory 702 is used to store execution instructions and includes an internal memory 7021 and an external memory 7022; the internal memory 7021 is used to temporarily store the operation data in the processor 701 and the data exchanged with the external memory 7022, such as a hard disk.
  • the processor 701 exchanges data with the external memory 7022 through the memory 7021.
  • the processor 701 communicates with the memory 702 through the bus 703, so that the processor 701 executes the following instructions: obtaining an access request; based on the access request, reading the local shared buffer that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read, performing an operation based on the operation data read from the local shared buffer to obtain an operation result.
  • the electronic device may be the chip described in the above implementation manner.
  • an embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the data processing method described in the above-mentioned method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • Embodiments of the present disclosure also provide a computer program product carrying a program code, and the instructions included in the program code can be used to execute the steps of the data processing method described in the above method embodiments; for details, reference can be made to the above method embodiments, which are not repeated here.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • in an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and so on.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are realized in the form of software function units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

Provided in the present disclosure are a chip cache system, a data processing method and apparatus, a device, a storage medium, and a chip. The cache system comprises: a plurality of operation subsystems, wherein each operation subsystem comprises at least one operation unit and at least one local shared buffer; each operation unit is connected to any local shared buffer in the operation subsystem where the operation unit is located; the local shared buffers are used for caching operation data read by the operation units in the operation subsystem to which the local shared buffers belong; and the operation units are used for accessing, on the basis of a generated access request, a local shared buffer, which matches an access address indicated by the access request, in the operation subsystem, and performing an operation on the basis of the read operation data when operation data indicated by the access request is read from the accessed local shared buffer.

Description

Chip cache system, data processing method, device, storage medium and chip
Cross-Reference Statement
This application claims priority to the Chinese patent application with application number 202111662634.X, filed with the China Patent Office on December 31, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of integrated circuits, and in particular to a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
Background
A neural network is an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. With the rapid development of neural networks, they have been applied in various fields.
Generally, a purpose-designed artificial intelligence (AI) chip can be used to process the computation of a neural network. Designing an efficient AI chip has therefore become one of the effective means of improving the processing efficiency of neural networks.
Summary
In view of this, the present disclosure provides at least a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
In a first aspect, the present disclosure provides a chip cache system, including a plurality of computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. Each computing unit is connected to any local shared buffer in the computing subsystem where it is located. The local shared buffer is used to cache the operation data read by the computing units in the computing subsystem to which it belongs. The computing unit is used to, based on a generated access request, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, perform an operation based on the read operation data.
Here, the chip is divided into multiple computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. For any computing subsystem, a computing unit in it can cache the operation data it reads into the local shared buffer of that subsystem, so that every computing unit in the subsystem can read the operation data from the local shared buffer, and each computing unit can read it from the local shared buffer multiple times at different points in time. The utilization of the operation data is therefore high, and the data does not need to be read repeatedly from outside the cache system, which improves the computing efficiency of the computing units.
In a possible implementation, each computing subsystem further includes a local interconnection bus unit. The local interconnection bus unit is used to connect each computing unit with any local shared buffer in the computing subsystem where it is located, and, after receiving the access request sent by the computing unit, to determine the local shared buffer that matches the access address indicated by the access request. The computing unit is used to access the local shared buffer determined by the local interconnection bus unit that matches the access address indicated by the access request.
Here, each computing unit is connected to any local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, which improves the utilization of the operation data. Meanwhile, after receiving the access request sent by the computing unit, the local interconnection bus unit determines the local shared buffer that matches the access address indicated by the access request, so that the computing unit can read the corresponding local shared buffer based on the access request.
In a possible implementation, the computing unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
Here, when the operation data indicated by the access request does not exist in the local shared buffer, the computing unit can read that operation data from the storage module of the chip based on the access request and cache it into the local shared buffer, so that the operation data can subsequently be read from the local shared buffer without reading it from the storage module again, which improves the degree of reuse of the operation data.
In a possible implementation, the storage module of the chip includes a global buffer. The computing unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the global buffer of the chip based on the access request.
In a possible implementation, the storage module of the chip further includes an external memory. The computing unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
In a possible implementation, the cache system further includes a global interconnection bus unit, configured to connect each computing subsystem to the storage module.
For descriptions of the effects of the following apparatuses, electronic devices, and so on, refer to the description of the above method; they are not repeated here.
In a second aspect, the present disclosure provides a data processing method, applied to the chip cache system described in the first aspect or any implementation thereof. The method includes: a computing unit in any computing subsystem included in the cache system obtains an access request; based on the access request, the computing unit accesses the local shared buffer in the computing subsystem where it is located that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
In a possible implementation, the method further includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading the operation data indicated by the access request from the storage module of the chip based on the access request, and caching the operation data read from the storage module into the local shared buffer.
In an optional implementation, reading the operation data indicated by the access request from the storage module of the chip when that operation data is not read from the local shared buffer that matches the access address indicated by the access request includes: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading the operation data indicated by the access request from the global buffer of the chip based on the access request.
In an optional implementation, the method further includes: when the operation data indicated by the access request is not read from the global buffer, reading the operation data indicated by the access request from the external memory of the chip, and caching the operation data read from the external memory into the global buffer.
In a third aspect, the present disclosure provides a data processing device, including: an obtaining module, configured to obtain an access request; a reading module, configured to, based on the access request, read the local shared buffer that matches the access address indicated by the access request; and an operation module, configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared buffer to obtain an operation result.
In a possible implementation, the device further includes an access module, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
In an optional implementation, when reading the operation data indicated by the access request from the storage module of the chip in the case where that operation data is not read from the local shared buffer that matches the access address indicated by the access request, the access module is configured to: read the operation data indicated by the access request from the global buffer of the chip based on the access request.
In an optional implementation, the access module is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
In a fourth aspect, the present disclosure provides a chip, including the cache system described in the first aspect or any implementation thereof and a storage module; the cache system is used to obtain operation data from the storage module and cache the operation data.
In a fifth aspect, the present disclosure provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
In a sixth aspect, the present disclosure provides an electronic device, including the chip described in the fourth aspect.
In a seventh aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
To make the above objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below. These drawings show embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a schematic architectural diagram of a chip cache system provided by an embodiment of the present disclosure;
FIG. 2 is a schematic architectural diagram of another chip cache system provided by an embodiment of the present disclosure;
FIG. 3 is a schematic architectural diagram of a chip provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic architectural diagram of a data processing device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic architectural diagram of another chip provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. The described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure generally described and illustrated in the figures herein can be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
When a computing-power unit in a chip works, it can read data from the external memory of the chip and store the read data in internal memory, so that the unit can then read the data from internal memory. However, when a large amount of data needs to be read, or data is read many times, the interaction between external and internal memory makes the computing efficiency of the unit low. The following description takes an artificial intelligence (AI) chip as an example.
Generally, a purpose-designed AI chip can be used to process the computation of a neural network. Taking cloud-side scenarios as an example, the inference and training of large-scale neural networks place higher demands on the computing power of AI chips. To meet this demand, cloud-side AI chips are growing ever larger, increasing chip area to accommodate more computing units so that a single chip has greater computing power. However, more computing units also mean that higher-bandwidth data paths are needed to sustain the data requirements of the computation. To provide the bandwidth required for computation, the conventional approach is to add caches inside the AI chip and reduce the computing units' accesses to external memory by reusing the operation data.
For example, a cache can be placed inside each computing unit of the AI chip; for instance, the cache can be a first-level cache (L1 cache) of a graphics processing unit (GPU). In this way, the same computing unit can access the same cached operation data multiple times, realizing reuse of the operation data in the time dimension and meeting the bandwidth requirements of the computing unit. However, with this approach the operation data cached in any computing unit cannot be read by other computing units: for any computing unit, the other computing units cannot access the operation data cached inside it. Thus, reuse of the operation data in the spatial dimension cannot be achieved, resulting in a low utilization rate of the operation data.
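The drawback described above can be made concrete with a small sketch: with a private cache per computing unit, data fetched by one unit is invisible to its peers, so every unit pays its own external fetch. All names here are illustrative, not from the patent:

```python
# Hedged sketch of per-unit private caches: time-dimension reuse only,
# no spatial reuse across computing units.

class UnitWithPrivateCache:
    def __init__(self, storage):
        self.cache = {}         # private L1-style cache, invisible to peers
        self.storage = storage  # stands in for external memory
        self.fetches = 0

    def read(self, address):
        if address not in self.cache:
            self.cache[address] = self.storage[address]
            self.fetches += 1   # each unit pays its own external fetch
        return self.cache[address]

storage = {40: "filter-tile"}
units = [UnitWithPrivateCache(storage) for _ in range(4)]

# All four units read the same address: four external fetches in total,
# one per unit, because no unit can see another unit's cache.
for u in units:
    u.read(40)
assert sum(u.fetches for u in units) == 4
```

This is the baseline the disclosed local shared buffer is meant to improve on: sharing the cache across a subsystem collapses those four fetches into one.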
Based on this, an embodiment of the present disclosure provides a chip cache system.
It should be noted that similar numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further defined or explained in subsequent figures.
Referring to FIG. 1, which is a schematic architectural diagram of the chip cache system provided by an embodiment of the present disclosure, the cache system includes a plurality of computing subsystems 11, and each computing subsystem 11 includes at least one computing unit 101 and at least one local shared buffer 102. Each computing unit 101 is connected to any local shared buffer 102 in the computing subsystem where it is located.
The local shared buffer 102 is used to cache the operation data read by the computing units in the computing subsystem to which it belongs.
The computing unit 101 is used to, based on a generated access request, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, perform an operation based on the read operation data.
In implementation, each computing subsystem can include a local cache module, which is divided into multiple physical memory banks. For example, each bank can correspond to one local shared buffer, and each computing unit in the computing subsystem can correspond to one compute core in the AI chip.
Within each computing subsystem, each local shared buffer can cache operation data, and the cached operation data can be data that any computing unit in the subsystem has read from the storage module of the chip. When performing a computation, a computing unit can generate an access request and, based on it, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request. If the operation data indicated by the access request is read from the local shared buffer, the read operation data is used to perform an operation, such as a convolution, to obtain an operation result.
Here, the chip is divided into multiple computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. For any computing subsystem, a computing unit in it can cache the operation data it reads into the local shared buffer of that subsystem, so that every computing unit in the subsystem can read the operation data from the local shared buffer, and each computing unit can read it from the local shared buffer multiple times at different points in time. The utilization of the operation data is therefore high, and the data does not need to be read repeatedly from outside the cache system, which improves the computing efficiency of the computing units.
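The subsystem-level sharing described above can be sketched as follows: several computing units share a pool of local banks, so data fetched once by any unit can be re-read by all of them. The names and the bank-interleaving policy are ours, assumed for illustration only:

```python
# Hedged sketch of one computing subsystem with shared local banks:
# one external fetch serves every computing unit in the subsystem.

class ComputeSubsystem:
    def __init__(self, num_units, num_banks, storage):
        self.units = list(range(num_units))
        self.banks = [{} for _ in range(num_banks)]  # local shared buffers
        self.storage = storage                       # chip storage module
        self.fetches = 0                             # external reads issued

    def bank_for(self, address):
        # Simple interleaving: low address bits select the bank.
        return self.banks[address % len(self.banks)]

    def read(self, unit, address):
        bank = self.bank_for(address)
        if address not in bank:              # miss: fetch once, cache it
            bank[address] = self.storage[address]
            self.fetches += 1
        return bank[address]                 # every unit sees the copy

storage = {40: "filter-tile"}
sub = ComputeSubsystem(num_units=4, num_banks=2, storage=storage)

# Four different computing units read the same address, but only the
# first read goes out to the storage module.
results = [sub.read(u, 40) for u in sub.units]
assert results == ["filter-tile"] * 4
assert sub.fetches == 1
```

Compared with private per-unit caches, this captures the spatial reuse the disclosure targets: one fetch, many readers.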
In an optional implementation, as shown in FIG. 2, each computing subsystem further includes a local interconnect bus unit 103. The local interconnect bus unit 103 is configured to connect each computing unit to any local shared cache in the same computing subsystem, and, after receiving an access request issued by a computing unit, to determine the local shared cache matching the access address indicated by the access request.
The computing unit 101 is configured to access the local shared cache determined by the local interconnect bus unit as matching the access address indicated by the access request.
Each computing subsystem may further include a local interconnect bus unit, which connects each computing unit in the subsystem to any local shared cache and, upon receiving an access request issued by a computing unit, determines the local shared cache matching the access address indicated by the request.
Then, once the local shared cache matching the access address indicated by the access request has been determined, the computing unit can access that local shared cache. For example, the local interconnect bus unit may include a Network on Chip (NoC) or the like.
Here, connecting each computing unit to every local shared cache through the local interconnect bus unit allows each computing unit to access any local shared cache, improving the utilization of operation data. Meanwhile, after receiving an access request issued by a computing unit, the local interconnect bus unit determines the local shared cache matching the access address indicated by the request, so that the computing unit can access the corresponding local shared cache based on the request.
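The patent does not specify how the local interconnect bus unit matches an access address to one of the local shared caches. The following sketch assumes a simple interleaved (modulo) mapping purely for illustration; the bank count and line size are invented parameters, not values from the disclosure.

```python
# Assumed address-to-bank mapping for a local interconnect bus unit.
# Both constants are hypothetical.

M_BANKS = 4          # assumed number of local shared caches per subsystem
LINE_BYTES = 64      # assumed cache-line size

def match_bank(access_address: int) -> int:
    """Return the index of the local shared cache matching the address."""
    line_index = access_address // LINE_BYTES
    return line_index % M_BANKS

print(match_bank(0x0000))   # 0
print(match_bank(0x0040))   # 1
```

With such an interleaved scheme, consecutive cache lines land in different banks, so computing units issuing requests to different lines can be served by different local shared caches concurrently.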
In an optional implementation, the computing unit 101 is further configured to: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the chip's storage module, and cache the operation data read from the storage module into the local shared cache.
The chip further includes a storage module that stores the operation data needed by the computing units. During implementation, for any computing subsystem, if a computing unit in the subsystem does not read the operation data indicated by an access request from the local shared cache matching the access address indicated by the request, it can, based on the access request, read the indicated operation data from the chip's storage module and cache the data read from the storage module into a local shared cache included in the subsystem.
Here, when the operation data indicated by the access request is not present in the local shared cache, the computing unit can read it from the chip's storage module based on the access request and cache the data read from the storage module into the local shared cache, so that the data can subsequently be read from the local shared cache without being fetched from the storage module again, improving the degree of reuse of the operation data.
In an optional implementation, the chip's storage module includes a global cache. The computing unit 101 is further configured to: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the chip's global cache.
During implementation, when a computing unit in a computing subsystem does not read the operation data indicated by an access request from the matching local shared cache, it reads the indicated operation data from the chip's global cache according to the request. If the operation data indicated by the access request is successfully read from the global cache, that data can be cached into a local shared cache within the subsystem.
In an optional implementation, the chip's storage module further includes an external memory. The computing unit is further configured to: when the operation data indicated by the access request is not read from the global cache, read the operation data indicated by the access request from the chip's external memory and cache the data read from the external memory into the global cache.
If a computing unit in a computing subsystem does not read the operation data indicated by the access request from the global cache, it can read the indicated data from the chip's external memory according to the request and cache the read data into the global cache, so that other computing subsystems can read it from the global cache; it also caches the read data into a local shared cache within its own subsystem, so that the other computing units in the subsystem can read the data from the local shared cache and the computing unit itself can read it from the local shared cache next time.
The chip's external memory can store all the operation data required by the chip's computing units for their operations. The global cache can store the operation data that has been read by the computing units in the various computing subsystems. The local shared cache in any computing subsystem can store the operation data that has been read by each computing unit in that subsystem.
The external memory and the global cache are located within the chip and outside the cache system, with the global cache connected to the external memory and to the cache system respectively. Illustratively, when operation data is read based on an access request, the data indicated by the request is first read from the local shared cache; if it is not present there, it is read from the global cache; and if it is not present in the global cache, it is read from the external memory. Moreover, after the operation data indicated by the access request has been read from the external memory, the read data can first be cached into the global cache and then cached into the local shared cache.
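The three-level read path just described (local shared cache, then global cache, then external memory, with fills on the way back) can be sketched as follows. Dicts keyed by address stand in for the hardware caches, and all names and contents are illustrative rather than taken from the disclosure.

```python
# Illustrative sketch of the read path: local shared cache -> global cache
# -> external memory, filling the global cache first and then the local
# shared cache on a miss, as described in the text above.

local_shared_cache = {}                      # subsystem-local banks (flattened)
global_cache = {}                            # shared by all subsystems
external_memory = {0x1000: "weights_block"}  # backing store with all data

def read_operation_data(addr):
    if addr in local_shared_cache:           # hit in the local shared cache
        return local_shared_cache[addr]
    if addr in global_cache:                 # local miss, global hit
        data = global_cache[addr]
    else:                                    # miss everywhere: external memory
        data = external_memory[addr]
        global_cache[addr] = data            # fill the global cache first
    local_shared_cache[addr] = data          # then fill the local shared cache
    return data

read_operation_data(0x1000)   # first read: fetched from external memory
read_operation_data(0x1000)   # second read: served by the local shared cache
```

After the first miss, the data sits in both the global cache (visible to other subsystems) and the local shared cache (visible to the other units in the same subsystem), which is exactly the fill order the paragraph above specifies.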
In an optional implementation, the cache system further includes a global interconnect bus unit 104; the global interconnect bus unit 104 is configured to connect each of the computing subsystems to the storage module.
During implementation, the global interconnect bus unit can connect each computing subsystem to the storage module, that is, connect each subsystem to the global cache and the external memory in the storage module, so that every computing unit in any subsystem can read the global cache and the external memory based on access requests.
As shown in FIG. 3, the AI chip can be divided into X computing subsystems, each including N computing units and M local shared caches, with each computing unit connected to the M local shared caches so that each computing unit can read any connected local shared cache, where X, N and M are positive integers.
The workflow of the cache system is described by way of example with reference to FIG. 3. Operation data is cached in the M local shared caches of computing subsystem 0. When computing unit 0 in subsystem 0 performs an operation, it can generate and issue an access request. After receiving the request, the local interconnect bus unit determines the local shared cache in subsystem 0 that matches the access address indicated by the request. If the determined cache is local shared cache 0, the computing unit can read the operation data cached in local shared cache 0.
If the operation data indicated by the access request is read, computing unit 0 performs an operation based on the read data to obtain an operation result. If the indicated data is not read from local shared cache 0, computing unit 0 can read the global cache based on the access request. If the indicated data is read from the global cache, the data read from the global cache can be cached into any local shared cache included in subsystem 0, and the operation is performed on the read data to obtain the result.
If the operation data indicated by the access request is not read from the global cache, computing unit 0 can read the external memory based on the request, cache the data read from the external memory into the global cache and into any local shared cache included in subsystem 0, and perform the operation based on the read data to obtain the operation result.
In general, operation results can be cached in other caches provided for that purpose, with no need to cache them in the local shared caches; for example, such other caches may be a level-1 cache (L1 cache) inside the computing unit. Hence, in actual scheduling, the local shared caches are mainly used to store read-only operation data and are not written to. Consequently, the local shared caches included in different computing subsystems need not maintain data consistency with one another; for example, the operation data cached in the local shared caches of subsystem 0 may differ from that cached in the local shared caches of subsystem 1. This design meets the computing units' data requirements for both global and local caching while reducing the design complexity of the cache system and improving the performance of the AI chip.
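The no-coherence property described above can be illustrated with a sketch in which each subsystem fills its own read-only caches independently; all structures and names are hypothetical.

```python
# Sketch: read-only local shared caches need no coherence protocol.
# Each subsystem fills its own cache on a miss, writes never target it,
# so no invalidation traffic is required and the subsystems' cached
# contents are allowed to diverge.

external_memory = {a: f"data@{a:#x}" for a in (0x100, 0x200, 0x300)}
subsystem_caches = [dict(), dict()]          # subsystem 0 and subsystem 1

def read(subsystem: int, addr):
    cache = subsystem_caches[subsystem]
    if addr not in cache:                    # fill on miss; never written back
        cache[addr] = external_memory[addr]
    return cache[addr]

read(0, 0x100)
read(1, 0x200)
# The two subsystems now cache different address sets, which is acceptable
# because the data is read-only and always consistent with external memory.
print(sorted(subsystem_caches[0]) == sorted(subsystem_caches[1]))   # False
```

Because no cache line is ever dirty, a line cached in one subsystem can never contradict the same line cached elsewhere, which is what lets the design omit coherence machinery.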
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same concept, an embodiment of the present disclosure further provides a data processing method applied to the chip cache system described in the foregoing implementations. Referring to FIG. 4, a schematic flowchart of the data processing method provided by an embodiment of the present disclosure, the method includes the following steps S401 to S403.
S401: Any computing unit in any computing subsystem included in the cache system acquires an access request.
S402: Based on the access request, the computing unit accesses the local shared cache in its computing subsystem that matches the access address indicated by the access request.
S403: When the operation data indicated by the access request is read from the accessed local shared cache, the computing unit performs an operation based on the read operation data to obtain an operation result.
In an optional implementation, the method further includes: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, the computing unit reads, based on the access request, the operation data indicated by the access request from the chip's storage module and caches the data read from the storage module into the local shared cache.
In an optional implementation, reading, by the computing unit based on the access request, the operation data indicated by the access request from the chip's storage module when the data is not read from the matching local shared cache includes: when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, reading, by the computing unit based on the access request, the operation data indicated by the access request from the chip's global cache.
In an optional implementation, the method further includes: when the operation data indicated by the access request is not read from the global cache, the computing unit reads the operation data indicated by the access request from the chip's external memory and caches the data read from the external memory into the global cache.
Based on the same concept, an embodiment of the present disclosure further provides a data processing apparatus. Referring to FIG. 5, a schematic architectural diagram of the data processing apparatus provided by an embodiment of the present disclosure, the apparatus includes an acquisition module 501, a reading module 502 and an operation module 503.
The acquisition module 501 is configured to acquire an access request.
The reading module 502 is configured to read, based on the access request, the local shared cache matching the access address indicated by the access request.
The operation module 503 is configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared cache to obtain an operation result.
In a possible implementation, the apparatus further includes an access module 504, configured to, when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read the operation data indicated by the access request from the chip's storage module based on the access request, and cache the data read from the storage module into the local shared cache.
In an optional implementation, when reading the operation data indicated by the access request from the chip's storage module in the case where the data is not read from the local shared cache matching the access address indicated by the access request, the access module 504 is configured to: when the operation data indicated by the access request is not read from the matching local shared cache, read the operation data indicated by the access request from the chip's global cache based on the access request.
In an optional implementation, the access module 504 is further configured to: when the operation data indicated by the access request is not read from the global cache, read the operation data indicated by the access request from the chip's external memory and cache the data read from the external memory into the global cache.
In some embodiments, the functions or modules of the apparatus provided by the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments; for their specific implementation, reference may be made to the descriptions of those method embodiments, which are not repeated here for brevity.
Based on the same concept, an embodiment of the present disclosure further provides a chip, including the cache system 601 described in the foregoing implementations and a storage module 602.
The cache system 601 is configured to acquire operation data from the storage module 602 and cache the operation data.
The storage module 602 may include a global cache and an external memory. That is, the cache system can first read operation data from the global cache based on an access request and, when the operation data corresponding to the request is not present in the global cache, read the corresponding data from the external memory based on the request.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Referring to FIG. 7, a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure, the electronic device includes a processor 701, a memory 702 and a bus 703. The memory 702 is configured to store execution instructions and includes an internal memory 7021 and an external storage 7022; the internal memory 7021 temporarily stores operation data of the processor 701 and data exchanged with the external storage 7022 such as a hard disk, and the processor 701 exchanges data with the external storage 7022 through the internal memory 7021. When the electronic device 700 runs, the processor 701 communicates with the memory 702 via the bus 703, causing the processor 701 to execute the following instructions: acquiring an access request; reading, based on the access request, the local shared cache matching the access address indicated by the access request; and, when the operation data indicated by the access request is read, performing an operation based on the operation data read from the local shared cache to obtain an operation result.
For the specific processing flow of the processor 701, reference may be made to the descriptions in the foregoing method embodiments, which are not repeated here.
Alternatively, the electronic device may be the chip described in the foregoing implementations.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code, where the instructions included in the program code can be used to execute the steps of the data processing method described in the above method embodiments; for details, reference may be made to those method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems and apparatuses described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and other division manners are possible in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure in essence, the part thereof contributing to the prior art, or a part of the technical solutions may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and such changes or substitutions shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

  1. A chip cache system, characterized by comprising a plurality of computing subsystems, wherein
    each of the computing subsystems comprises at least one computing unit and at least one local shared cache;
    each of the computing units is connected to any one of the local shared caches in the computing subsystem in which it is located;
    the local shared cache is configured to cache operation data read by a computing unit in the computing subsystem to which the local shared cache belongs; and
    the computing unit is configured to:
    access, based on a generated access request, a local shared cache in the computing subsystem that matches an access address indicated by the access request; and
    when the operation data indicated by the access request is read from the accessed local shared cache, perform an operation based on the read operation data.
  2. The cache system according to claim 1, characterized in that
    each of the computing subsystems further comprises a local interconnect bus unit;
    the local interconnect bus unit is configured to:
    connect each of the computing units to any one of the local shared caches in the computing subsystem in which the computing unit is located; and
    after receiving the access request issued by the computing unit, determine the local shared cache matching the access address indicated by the access request; and
    the computing unit is configured to access the local shared cache determined by the local interconnect bus unit as matching the access address indicated by the access request.
  3. The cache system according to claim 1 or 2, characterized in that the computing unit is further configured to:
    when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from a storage module of the chip, and cache the operation data read from the storage module into the local shared cache.
  4. The cache system according to claim 3, characterized in that the storage module of the chip comprises a global cache, and the computing unit is further configured to:
    when the operation data indicated by the access request is not read from the local shared cache matching the access address indicated by the access request, read, based on the access request, the operation data indicated by the access request from the global cache of the chip.
  5. The cache system according to claim 4, wherein the storage module of the chip further comprises an external memory, and the computing unit is further configured to:
    in a case where the operation data indicated by the access request is not read from the global cache, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global cache.
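Claims 3 to 5 together describe a multi-level read path: local shared cache first, then the global cache, then off-chip external memory, with misses filled on the way back. A minimal sketch of this lookup order is below; the function name and the dict-based stand-ins for the three storage levels are illustrative assumptions only.

```python
# Illustrative sketch of the read path in claims 3-5. Each storage level is
# modeled as a dict keyed by address; real hardware would use tag lookup.

def read_operand(address, local_cache, global_cache, external_memory):
    """Return the operation data for `address`, filling caches on a miss."""
    if address in local_cache:           # hit in the local shared cache
        return local_cache[address]
    if address in global_cache:          # local miss, global-cache hit (claim 4)
        data = global_cache[address]
    else:                                # global miss: read external memory
        data = external_memory[address]
        global_cache[address] = data     # claim 5: cache it in the global cache
    local_cache[address] = data          # claim 3: cache it in the local cache
    return data

local, glob = {}, {0x100: "warm"}
mem = {0x100: "dram-a", 0x200: "dram-b"}
print(read_operand(0x100, local, glob, mem))  # served from the global cache
print(read_operand(0x200, local, glob, mem))  # served from external memory
```

After the second call, the data for both addresses is resident in the local shared cache, so repeated accesses by computing units in the same subsystem no longer leave the subsystem.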
  6. The cache system according to any one of claims 3 to 5, wherein:
    the cache system further comprises a global interconnect bus unit; and
    the global interconnect bus unit is configured to connect each of the computing subsystems to the storage module.
  7. A data processing method, applied to the cache system of the chip according to any one of claims 1 to 6, the method comprising:
    acquiring, by a computing unit in any computing subsystem included in the cache system, an access request;
    accessing, by the computing unit based on the access request, a local shared cache in the computing subsystem in which the computing unit is located that matches the access address indicated by the access request; and
    in a case where the operation data indicated by the access request is read from the accessed local shared cache, performing, by the computing unit, an operation based on the read operation data to obtain an operation result.
  8. A chip, comprising:
    a storage module; and
    the cache system according to any one of claims 1 to 6, wherein the cache system is configured to acquire operation data from the storage module and cache the operation data.
  9. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus; and the machine-readable instructions, when executed by the processor, implement the steps of the data processing method according to claim 7.
  10. An electronic device, comprising the chip according to claim 8.
  11. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when run by a processor, executes the steps of the data processing method according to claim 7.
PCT/CN2022/121033 2021-12-31 2022-09-23 Chip cache system, data processing method, device, storage medium, and chip WO2023124304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111662634.XA CN114297098A (en) 2021-12-31 2021-12-31 Chip cache system, data processing method, device, storage medium and chip
CN202111662634.X 2021-12-31

Publications (1)

Publication Number Publication Date
WO2023124304A1 true WO2023124304A1 (en) 2023-07-06

Family

ID=80974096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121033 WO2023124304A1 (en) 2021-12-31 2022-09-23 Chip cache system, data processing method, device, storage medium, and chip

Country Status (2)

Country Link
CN (1) CN114297098A (en)
WO (1) WO2023124304A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297098A (en) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 Chip cache system, data processing method, device, storage medium and chip

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233409A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Managing shared memory used by compute nodes
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN104981786A (en) * 2013-03-05 2015-10-14 国际商业机器公司 Prefetching for parent core in multi-core chip
CN107291629A (en) * 2016-04-12 2017-10-24 华为技术有限公司 A kind of method and apparatus for accessing internal memory
CN114297098A (en) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 Chip cache system, data processing method, device, storage medium and chip

Also Published As

Publication number Publication date
CN114297098A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
US11294599B1 (en) Registers for restricted memory
JP3628595B2 (en) Interconnected processing nodes configurable as at least one NUMA (NON-UNIFORMMOMERYACCESS) data processing system
US8352656B2 (en) Handling atomic operations for a non-coherent device
CN108268385B (en) Optimized caching agent with integrated directory cache
TWI767111B (en) Sever system
US20200242042A1 (en) System, Apparatus and Method for Performing a Remote Atomic Operation Via an Interface
WO2019153702A1 (en) Interrupt processing method, apparatus and server
US7454576B2 (en) System and method for cache coherency in a cache with different cache location lengths
US11868306B2 (en) Processing-in-memory concurrent processing system and method
CN114860329A (en) Dynamic consistency biasing configuration engine and method
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
US11157191B2 (en) Intra-device notational data movement system
US20220269433A1 (en) System, method and apparatus for peer-to-peer communication
US20180336034A1 (en) Near memory computing architecture
TW202215223A (en) Devices for accelerators and method for processing data
WO2023134735A1 (en) Computing device, data processing method and system, and related device
WO2016049807A1 (en) Cache directory processing method and directory controller of multi-core processor system
US11847049B2 (en) Processing system that increases the memory capacity of a GPGPU
US20220342835A1 (en) Method and apparatus for disaggregation of computing resources
CN115563053A (en) High-performance on-chip memory controller and execution method thereof
US7930459B2 (en) Coherent input output device
US11275669B2 (en) Methods and systems for hardware-based statistics management using a general purpose memory
US11907144B1 (en) Early semaphore update
US20240070107A1 (en) Memory device with embedded deep learning accelerator in multi-client environment
US11550736B1 (en) Tensorized direct memory access descriptors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913569

Country of ref document: EP

Kind code of ref document: A1