WO2023124304A1 - Chip cache system, data processing method, device, storage medium and chip - Google Patents

Chip cache system, data processing method, device, storage medium and chip

Info

Publication number
WO2023124304A1
Authority
WO
WIPO (PCT)
Prior art keywords
access request
read
operation data
computing
local shared
Prior art date
Application number
PCT/CN2022/121033
Other languages
English (en)
Chinese (zh)
Inventor
王文强
夏晓旭
朱志岐
徐宁仪
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023124304A1



Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present disclosure relates to the technical field of integrated circuits, and in particular to a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
  • A neural network is an algorithmic mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. With their rapid development, neural networks have been applied in a wide range of fields.
  • A purpose-designed artificial intelligence (AI) chip can be used to handle the computation of a neural network. Designing an efficient AI chip has therefore become one of the effective means of improving neural network processing efficiency.
  • the present disclosure at least provides a chip cache system, a data processing method, an electronic device, a computer-readable storage medium, and a chip.
  • In a first aspect, the present disclosure provides a chip cache system, including a plurality of computing subsystems, each of which includes at least one computing unit and at least one local shared buffer. Each computing unit is connected to each of the local shared buffers in its computing subsystem. The local shared buffer is used to cache the operation data read by the computing units in the computing subsystem. The computing unit is used to access, based on a generated access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, to perform an operation based on the read operation data.
  • In the above cache system, the chip is divided into multiple computing subsystems, and each computing subsystem includes at least one computing unit and at least one local shared buffer. A computing unit in a computing subsystem can cache the operation data it reads into a local shared buffer of that subsystem, so that every computing unit in the subsystem can read the operation data from the shared buffer, and each computing unit can read it multiple times at different points in time. The operation data thus needs to be read into the subsystem only once, which improves the operation efficiency of the computing units.
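The sharing behavior described above can be sketched in Python. This is purely illustrative and not part of the disclosure; all class and method names (ComputeSubsystem, LocalSharedBuffer, ComputeUnit, and the modulo address routing) are assumptions made for the example:

```python
class LocalSharedBuffer:
    """One buffer shared by every computing unit in a subsystem."""
    def __init__(self):
        self.lines = {}                 # address -> operation data

    def read(self, address):
        return self.lines.get(address)  # None models a cache miss

    def fill(self, address, data):
        self.lines[address] = data


class ComputeSubsystem:
    """Groups computing units around M local shared buffers."""
    def __init__(self, num_buffers):
        self.buffers = [LocalSharedBuffer() for _ in range(num_buffers)]

    def buffer_for(self, address):
        # Stand-in for the local interconnection bus: route the access
        # address to the matching local shared buffer.
        return self.buffers[address % len(self.buffers)]


class ComputeUnit:
    def __init__(self, subsystem):
        self.subsystem = subsystem

    def load(self, address):
        # A unit may hit on data another unit cached earlier (reuse in
        # the space dimension) or that it cached itself (time dimension).
        return self.subsystem.buffer_for(address).read(address)
```

Under this sketch, once one unit fills an address, a later load of the same address by any other unit in the subsystem hits the shared buffer instead of going back to memory.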
  • Each of the computing subsystems may further include a local interconnection bus unit. The local interconnection bus unit is used to connect each computing unit with each of the local shared buffers in the computing subsystem, and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request. The computing unit is used to access the local shared buffer determined by the local interconnection bus unit as matching the access address indicated by the access request.
  • In this way, each computing unit is connected to each local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, thereby improving the utilization rate of the operation data.
  • In addition, the local interconnection bus unit determines the local shared buffer matching the access address indicated by the access request, so that the computing unit can read the corresponding local shared buffer based on the access request.
  • The computing unit may be further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
  • In this way, the computing unit can read the operation data indicated by the access request from the storage module of the chip and cache it into the local shared buffer, so that the operation data can later be read from the local shared buffer without being fetched from the storage module again, which improves the reuse of the operation data.
  • The storage module of the chip may include a global buffer; the computing unit is further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the global buffer of the chip based on the access request.
  • The storage module of the chip may further include an external memory; the computing unit is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • the cache system further includes: a global interconnection bus unit, configured to connect each computing subsystem to the storage module respectively.
  • In a second aspect, the present disclosure provides a data processing method, applied to the chip cache system described in the first aspect or any implementation thereof. The method includes: a computing unit in any computing subsystem included in the cache system obtains an access request; based on the access request, the computing unit accesses the local shared buffer in the computing subsystem that matches the access address indicated by the access request; and, when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
  • The method may further include: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, reading, based on the access request, the operation data indicated by the access request from the storage module of the chip, and caching the operation data read from the storage module into the local shared buffer.
  • Reading the operation data indicated by the access request from the storage module of the chip based on the access request may include: when the operation data is not read from the local shared buffer that matches the access address indicated by the access request, reading the operation data indicated by the access request from the global buffer of the chip based on the access request.
  • The method may further include: when the operation data indicated by the access request is not read from the global buffer, reading the operation data indicated by the access request from the external memory of the chip, and caching the operation data read from the external memory into the global buffer.
  • In a third aspect, the present disclosure provides a data processing device, the device including: an acquisition module, configured to acquire an access request; a reading module, configured to read, based on the access request, the local shared buffer that matches the access address indicated by the access request; and an operation module, configured to, when the operation data indicated by the access request is read from the local shared buffer, perform an operation based on the read operation data to obtain an operation result.
  • The device may further include an access module, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
  • When reading the operation data indicated by the access request from the storage module of the chip based on the access request in the case that the operation data is not read from the local shared buffer matching the access address indicated by the access request, the access module is configured to: read the operation data indicated by the access request from the global buffer of the chip based on the access request.
  • The access module may be further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • In a fourth aspect, the present disclosure provides a chip, including the cache system described in the first aspect or any implementation thereof, and a storage module; the cache system is used to obtain operation data from the storage module and to cache the operation data.
  • The present disclosure further provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor and the memory communicate with each other through the bus, and when the machine-readable instructions are executed by the processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
  • the present disclosure provides an electronic device, including the chip as described in the fourth aspect.
  • The present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the data processing method described in the second aspect or any implementation thereof are executed.
  • FIG. 1 shows a schematic diagram of the architecture of a chip cache system provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of the architecture of another chip cache system provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic structural diagram of a chip provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic structural diagram of a data processing device provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of another chip provided by an embodiment of the present disclosure.
  • FIG. 7 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • When the computing unit in the chip is working, it can read data from the external memory of the chip and store the read data in the internal memory, so that the computing unit can subsequently read the data from the internal memory.
  • In the following, the chip is taken to be an artificial intelligence (AI) chip as an example for illustration.
  • A purpose-designed AI chip can be used to handle the calculation process of a neural network.
  • The inference and training of large-scale neural networks place higher demands on the computing power of AI chips.
  • As a result, the scale of cloud-side AI chips keeps growing.
  • More computing units also mean that a higher-bandwidth data path is required to support the data requirements of the operations.
  • The conventional approach is to enlarge the cache inside the AI chip and reduce the computing units' accesses to the external memory by reusing operation data.
  • For example, a cache can be set inside each computing unit of the AI chip; the cache can be, for example, a first-level cache (L1 cache) such as that of a graphics processing unit (GPU).
  • In this way, the same computing unit can access the same cached operation data multiple times, realizing reuse of the operation data in the time dimension and meeting the bandwidth requirements of the computing unit.
  • However, with this approach, the operation data cached in any one computing unit cannot be read by the other computing units.
  • As a result, reuse of the operation data in the space dimension cannot be realized, resulting in a low utilization rate of the operation data.
  • an embodiment of the present disclosure provides a chip cache system.
  • The cache system includes a plurality of computing subsystems 11, and each computing subsystem 11 includes at least one computing unit 101 and at least one local shared buffer 102.
  • Each computing unit 101 is connected to each of the local shared buffers 102 in its computing subsystem.
  • The local shared buffer 102 is used for caching the operation data read by the computing units in the computing subsystem to which it belongs.
  • The computing unit 101 is configured to access, based on a generated access request, the local shared buffer in the computing subsystem that matches the access address indicated by the access request, and, when the operation data indicated by the access request is read from the accessed local shared buffer, to perform an operation based on the read operation data.
  • The computing subsystem may include a local cache module, and the local cache module may be divided into multiple physical memory banks.
  • Each bank may correspond to one local shared buffer, and each computing unit in the computing subsystem may correspond to a computing core in the AI chip.
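As an illustration of the bank split, one common choice is to interleave cache lines across banks by address. The disclosure does not specify the mapping; the line size and modulo interleaving below are assumptions for the example:

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value

def bank_index(address: int, num_banks: int) -> int:
    """Map an access address to one of the physical memory banks
    (i.e. to one local shared buffer) by cache-line interleaving."""
    return (address // LINE_SIZE) % num_banks
```

With 4 banks, consecutive cache lines land on consecutive banks, so units streaming through adjacent addresses spread their traffic: addresses 0, 64, 128, and 192 map to banks 0, 1, 2, and 3.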
  • Each local shared buffer in the computing subsystem can cache operation data, and the cached operation data, read from the storage module of the chip, can be read by any computing unit in the computing subsystem. When performing calculations, a computing unit may generate an access request and, based on it, access the local shared buffer in the computing subsystem that matches the access address indicated by the access request. If the operation data indicated by the access request is read from the local shared buffer, an operation, such as a convolution operation, is performed using the read operation data to obtain an operation result.
  • In the above cache system, the chip is divided into multiple computing subsystems, and each computing subsystem includes at least one computing unit and at least one local shared buffer. A computing unit in a computing subsystem can cache the operation data it reads into a local shared buffer of that subsystem, so that every computing unit in the subsystem can read the operation data from the shared buffer, and each computing unit can read it multiple times at different points in time. The operation data thus needs to be read into the subsystem only once, which improves the operation efficiency of the computing units.
  • Each of the computing subsystems may further include a local interconnection bus unit 103. The local interconnection bus unit 103 is used to connect each computing unit with each of the local shared buffers in the computing subsystem, and, after receiving an access request sent by a computing unit, to determine the local shared buffer that matches the access address indicated by the access request.
  • The computing unit 101 is used to access the local shared buffer determined by the local interconnection bus unit as matching the access address indicated by the access request.
  • Each computing subsystem may also include a local interconnection bus unit, which connects each computing unit in the computing subsystem with each local shared buffer and, after receiving an access request sent by a computing unit, determines the local shared buffer that matches the access address indicated by the access request.
  • The computing unit can then access the local shared buffer matched with the access address indicated by the access request.
  • the local interconnection bus unit may include: a network on chip (Network on Chip, NoC) and the like.
  • In this way, each computing unit is connected to each local shared buffer through the local interconnection bus unit, so that each computing unit can access any local shared buffer, thereby improving the utilization rate of the operation data.
  • In addition, the local interconnection bus unit determines the local shared buffer matching the access address indicated by the access request, so that the computing unit can access the corresponding local shared buffer based on the access request.
  • The computing unit 101 may be further configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
  • The chip also includes a storage module, which stores the operation data required by the computing units.
  • In this way, the computing unit can read the operation data indicated by the access request from the storage module of the chip based on the access request and cache it into the local shared buffer, so that the operation data can later be read from the local shared buffer without being fetched from the storage module again, which improves the reuse of the operation data.
  • The storage module of the chip may include a global buffer.
  • The computing unit 101 may be further configured to: if the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the global buffer of the chip based on the access request.
  • That is, when the computing unit in a computing subsystem does not read the operation data indicated by the access request from the local shared buffer that matches the access address indicated by the access request, it reads the operation data indicated by the access request from the global buffer of the chip according to the access request. If the operation data is successfully read from the global buffer, the operation data read from the global buffer can be cached into a local shared buffer in the computing subsystem.
  • the storage module of the chip further includes an external memory.
  • The computing unit may be further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
  • That is, if the computing unit in a computing subsystem does not read the operation data indicated by the access request from the global buffer, it can read the operation data from the external memory of the chip according to the access request, cache the read operation data into the global buffer so that other computing subsystems can read it from there, and also cache it into a local shared buffer in the computing subsystem so that other computing units in the subsystem, and the computing unit itself on its next access, can read the operation data from the local shared buffer.
  • The external memory of the chip can be used to store all the operation data required by the computing units in the chip.
  • The global buffer can be used to store the operation data read by the computing units in each computing subsystem.
  • The local shared buffers in any computing subsystem can be used to store the operation data read by the computing units in that subsystem.
  • The external memory and the global buffer lie outside the cache system.
  • The global buffer is connected to both the external memory and the cache system.
  • When a computing unit performs an operation, the operation data indicated by the access request is first read from the local shared buffer; if it is not present in the local shared buffer, the operation data indicated by the access request is read from the global buffer; and if it is not present in the global buffer either, the operation data indicated by the access request is read from the external memory.
  • After the operation data indicated by the access request is read from the external memory, it can first be cached into the global buffer and then cached into the local shared buffer.
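The three-level read path and fill order just described can be summarized in a short sketch. This is an illustration only, not the disclosure's implementation: dictionaries stand in for the actual storage levels, and the function name is an assumption:

```python
def read_operation_data(address, local_shared, global_buffer, external_memory):
    """Local shared buffer -> global buffer -> external memory, filling
    the caches on the way back in the order described above."""
    data = local_shared.get(address)
    if data is not None:
        return data                       # hit in the local shared buffer
    data = global_buffer.get(address)
    if data is None:
        data = external_memory[address]   # backing store holds all data
        global_buffer[address] = data     # fill the global buffer first ...
    local_shared[address] = data          # ... then the local shared buffer
    return data
```

After one miss, the data sits in both the global buffer and the local shared buffer, so later reads by this unit, or any other unit in the subsystem, hit locally.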
  • the cache system further includes: a global interconnection bus unit 104; the global interconnection bus unit 104 is configured to respectively connect the respective computing subsystems to the storage modules.
  • The global interconnection bus unit can connect each computing subsystem to the storage module, that is, each computing subsystem is connected to the global buffer and the external memory in the storage module, so that each computing unit in any computing subsystem can read the global buffer and the external memory based on access requests.
  • For example, the AI chip can be divided into X computing subsystems, each computing subsystem includes N computing units and M local shared buffers, and each computing unit is connected to the M local shared buffers, so that each computing unit can read any connected local shared buffer.
  • X, N, and M are positive integers.
  • Operation data is cached in the M local shared buffers in computing subsystem 0.
  • When computing unit 0 in computing subsystem 0 performs an operation, it can generate and issue an access request.
  • The local interconnection bus unit determines the local shared buffer in computing subsystem 0 that matches the access address indicated by the access request. If the determined buffer is local shared buffer 0, the computing unit can read the operation data cached in local shared buffer 0.
  • When the operation data indicated by the access request is read, computing unit 0 performs an operation based on the read operation data to obtain an operation result. If the operation data indicated by the access request is not read from local shared buffer 0, computing unit 0 can read the global buffer based on the access request. If the operation data indicated by the access request is read from the global buffer, the operation data read from the global buffer can be cached into any local shared buffer included in computing subsystem 0, and the operation is performed based on the read operation data to obtain the operation result.
  • If the operation data indicated by the access request is not read from the global buffer either, computing unit 0 can read the external memory based on the access request, cache the operation data read from the external memory into the global buffer, cache the read operation data into any local shared buffer included in computing subsystem 0, and perform the operation based on the read operation data to obtain the operation result.
  • The operation result can be cached in another configured cache; it does not need to be cached in the local shared buffer.
  • For example, the other cache may be the first-level cache (L1 cache) inside the computing unit. Therefore, in actual scheduling, the local shared buffers are mainly used to store read-only operation data, and no write operation is performed on them. Consequently, the local shared buffers included in different computing subsystems need not maintain data consistency; for example, the operation data cached in a local shared buffer in computing subsystem 0 may differ from the operation data cached in a local shared buffer in computing subsystem 1.
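The scheduling convention above — local shared buffers hold read-only operation data while results land in a separate per-unit cache — can be sketched as follows. The class name, the private result cache, and the multiply operation are illustrative assumptions, not terms from the disclosure:

```python
class ComputeUnitWithResultCache:
    """Reads operands from the read-only local shared buffer; writes
    results only to its private result cache (e.g. an L1-like cache),
    so no write or coherence traffic ever touches the shared buffers."""
    def __init__(self, local_shared):
        self.local_shared = local_shared  # never written by this unit
        self.result_cache = {}            # private, per-unit storage

    def multiply(self, addr_a, addr_b, out_addr):
        a = self.local_shared[addr_a]
        b = self.local_shared[addr_b]
        self.result_cache[out_addr] = a * b  # result bypasses shared buffer
        return self.result_cache[out_addr]
```

Because the units never write the shared buffers, buffers in different subsystems may legitimately hold different data without any consistency protocol, which is what keeps the design complexity low.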
  • In this way, on the basis of meeting the computing units' data requirements on the global buffer and the local buffers, the design complexity of the cache system can be reduced and the performance of the AI chip improved.
  • Those skilled in the art can understand that the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • The embodiment of the present disclosure also provides a data processing method applied to the chip cache system described in the above embodiments. As shown in FIG. 4, which is a schematic flowchart of the data processing method provided by an embodiment of the present disclosure, the data processing method includes the following steps S401-S403.
  • S401: a computing unit in any computing subsystem included in the cache system acquires an access request.
  • S402: based on the access request, the computing unit accesses the local shared buffer in the computing subsystem that matches the access address indicated by the access request.
  • S403: when the operation data indicated by the access request is read from the accessed local shared buffer, the computing unit performs an operation based on the read operation data to obtain an operation result.
  • The method may further include: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, the computing unit reads the operation data indicated by the access request from the storage module of the chip based on the access request, and caches the operation data read from the storage module into the local shared buffer.
  • Reading the operation data indicated by the access request from the storage module of the chip includes: when the operation data is not read from the local shared buffer matching the access address indicated by the access request, the computing unit reads the operation data indicated by the access request from the global buffer of the chip based on the access request.
  • The method may further include: when the operation data indicated by the access request is not read from the global buffer, the computing unit reads the operation data indicated by the access request from the external memory of the chip, and caches the operation data read from the external memory into the global buffer.
  • The embodiment of the present disclosure also provides a data processing device. As shown in FIG. 5, which is a schematic structural diagram of the data processing device provided by an embodiment of the present disclosure, the device includes an acquisition module 501, a reading module 502, and an operation module 503.
  • The acquisition module 501 is used to acquire an access request.
  • The reading module 502 is configured to read, based on the access request, the local shared buffer that matches the access address indicated by the access request.
  • The operation module 503 is configured to, when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared buffer to obtain an operation result.
  • The device further includes an access module 504, configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the storage module of the chip based on the access request, and cache the operation data read from the storage module into the local shared buffer.
  • when reading the operation data indicated by the access request from the storage module of the chip, the access module 504 is configured to: when the operation data indicated by the access request is not read from the local shared buffer that matches the access address indicated by the access request, read the operation data indicated by the access request from the global buffer of the chip based on the access request.
  • the access module 504 is further configured to: when the operation data indicated by the access request is not read from the global buffer, read the operation data indicated by the access request from the external memory of the chip, and cache the operation data read from the external memory into the global buffer.
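The division of labor among modules 501 through 504 can be sketched as a small composition of callables: one fetches requests, one reads the local shared buffer, one computes on the result, and one handles misses by falling back to the storage module. This is an illustrative model only; the module methods, the request shape, and the computation passed in are all assumptions, not the claimed implementation.

```python
class DataProcessingDeviceModel:
    """Illustrative composition of the claimed modules:
    obtaining (501), reading (502), operation (503), access (504).
    Names and the request/operation shapes are assumptions."""

    def __init__(self, local_shared, storage, operation):
        self.local_shared = local_shared  # local shared buffer contents
        self.storage = storage            # chip storage module (global buffer + external memory)
        self.operation = operation        # computation applied to the read data

    def obtain(self, request_queue):
        # Obtaining module 501: fetch the next access request.
        return request_queue.pop(0)

    def read(self, request):
        # Reading module 502: try the matching local shared buffer first.
        addr = request["addr"]
        if addr in self.local_shared:
            return self.local_shared[addr]
        # Access module 504: on a miss, read from the storage module
        # and cache the result into the local shared buffer.
        data = self.storage[addr]
        self.local_shared[addr] = data
        return data

    def process(self, request_queue):
        # Operation module 503: compute on the data that was read.
        request = self.obtain(request_queue)
        return self.operation(self.read(request))
```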
  • the functions of the device provided by the embodiments of the present disclosure, or the modules it includes, can be used to execute the methods described in the above method embodiments; for their specific implementation, refer to the description of the above method embodiments. For brevity, details are not repeated here.
  • an embodiment of the present disclosure further provides a chip, including: the cache system 601 and the storage module 602 described in the foregoing implementation manners.
  • the cache system 601 is used to acquire operation data from the storage module 602 and cache the operation data.
  • the storage module 602 may include a global cache and an external memory. That is, based on the access request, the cache system can first read the operation data from the global cache, and, when the global cache holds no operation data corresponding to the access request, read the operation data corresponding to the access request from the external memory.
  • an embodiment of the present disclosure also provides an electronic device.
  • the electronic device includes a processor 701 , a memory 702 and a bus 703 .
  • the memory 702 is used to store execution instructions, and includes an internal memory 7021 and an external memory 7022; the internal memory 7021 here, also called main memory, is used to temporarily store calculation data in the processor 701 and data exchanged with an external memory 7022 such as a hard disk.
  • the processor 701 exchanges data with the external memory 7022 through the memory 7021.
  • the processor 701 communicates with the memory 702 through the bus 703, so that the processor 701 executes the following instructions: obtain an access request; based on the access request, read from the local shared buffer that matches the access address indicated by the access request; when the operation data indicated by the access request is read, perform an operation based on the operation data read from the local shared buffer to obtain an operation result.
  • the electronic device may be the chip described in the above implementation manner.
  • an embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the data processing method described in the above-mentioned method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • Embodiments of the present disclosure also provide a computer program product. The computer program product carries program code, and the instructions included in the program code can be used to execute the steps of the data processing method described in the above method embodiments; for details, refer to the above method embodiments, which are not repeated here.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • in one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are realized in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to a chip cache system, a data processing method and apparatus, a device, a storage medium and a chip. The cache system comprises a plurality of operation subsystems, each operation subsystem comprising at least one operation unit and at least one local shared buffer; each operation unit is connected to every local shared buffer in the operation subsystem in which the operation unit is located; the local shared buffers are used to cache operation data read by the operation units in the operation subsystem to which the local shared buffers belong; and the operation units are used to access, based on a generated access request, a local shared buffer in the operation subsystem that matches an access address indicated by the access request, and to perform an operation based on the read operation data when the operation data indicated by the access request is read from the accessed local shared buffer.
PCT/CN2022/121033 2021-12-31 2022-09-23 Système de cache de puce, procédé de traitement de données, dispositif, support de stockage et puce WO2023124304A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111662634.XA CN114297098A (zh) 2021-12-31 2021-12-31 芯片的缓存系统、数据处理方法、设备、存储介质及芯片
CN202111662634.X 2021-12-31

Publications (1)

Publication Number Publication Date
WO2023124304A1 true WO2023124304A1 (fr) 2023-07-06

Family

ID=80974096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121033 WO2023124304A1 (fr) 2021-12-31 2022-09-23 Système de cache de puce, procédé de traitement de données, dispositif, support de stockage et puce

Country Status (2)

Country Link
CN (1) CN114297098A (fr)
WO (1) WO2023124304A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297098A (zh) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 芯片的缓存系统、数据处理方法、设备、存储介质及芯片

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233409A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Managing shared memory used by compute nodes
CN104699631A (zh) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Gpdsp中多层次协同与共享的存储装置和访存方法
CN104981786A (zh) * 2013-03-05 2015-10-14 国际商业机器公司 在多核芯片中为母核预取
CN107291629A (zh) * 2016-04-12 2017-10-24 华为技术有限公司 一种用于访问内存的方法和装置
CN114297098A (zh) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 芯片的缓存系统、数据处理方法、设备、存储介质及芯片

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233409A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Managing shared memory used by compute nodes
CN104981786A (zh) * 2013-03-05 2015-10-14 国际商业机器公司 在多核芯片中为母核预取
CN104699631A (zh) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Gpdsp中多层次协同与共享的存储装置和访存方法
CN107291629A (zh) * 2016-04-12 2017-10-24 华为技术有限公司 一种用于访问内存的方法和装置
CN114297098A (zh) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 芯片的缓存系统、数据处理方法、设备、存储介质及芯片

Also Published As

Publication number Publication date
CN114297098A (zh) 2022-04-08

Similar Documents

Publication Publication Date Title
US11294599B1 (en) Registers for restricted memory
JP3628595B2 (ja) 少なくとも1つのnuma(non−uniformmemoryaccess)データ処理システムとして構成可能な相互接続された処理ノード
US8352656B2 (en) Handling atomic operations for a non-coherent device
CN108268385B (zh) 具有集成目录高速缓存的优化的高速缓存代理
JP2008525904A (ja) 異なるキャッシュロケーション長を有するキャッシュにおいてキャッシュコヒーレンシを保持するためのシステム及び方法
US11914903B2 (en) Systems, methods, and devices for accelerators with virtualization and tiered memory
TWI767111B (zh) 伺服器系統
US20200242042A1 (en) System, Apparatus and Method for Performing a Remote Atomic Operation Via an Interface
CN114860329A (zh) 动态一致性偏置配置引擎及方法
WO2023124304A1 (fr) Système de cache de puce, procédé de traitement de données, dispositif, support de stockage et puce
US20220269433A1 (en) System, method and apparatus for peer-to-peer communication
CN117377943A (zh) 存算一体化并行处理系统和方法
WO2016049807A1 (fr) Procédé de traitement de répertoire de cache et dispositif de commande de répertoire d'un système de processeur multicœur
US11157191B2 (en) Intra-device notational data movement system
WO2023134735A1 (fr) Dispositif informatique, procédé et système de traitement de données, et dispositif associé
US11847049B2 (en) Processing system that increases the memory capacity of a GPGPU
US11550736B1 (en) Tensorized direct memory access descriptors
US20220342835A1 (en) Method and apparatus for disaggregation of computing resources
US11275669B2 (en) Methods and systems for hardware-based statistics management using a general purpose memory
CN115563053A (zh) 高性能片上内存控制器及其执行的方法
US7930459B2 (en) Coherent input output device
US12001352B1 (en) Transaction ordering based on target address
US11907144B1 (en) Early semaphore update
US20240070107A1 (en) Memory device with embedded deep learning accelerator in multi-client environment
CN116775510B (zh) 数据访问方法、装置、服务器和计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913569

Country of ref document: EP

Kind code of ref document: A1