WO2023134360A1 - Data processing method and apparatus, and storage medium - Google Patents

Data processing method and apparatus, and storage medium

Info

Publication number
WO2023134360A1
Authority
WO
WIPO (PCT)
Prior art keywords
output data
data
cache
memory
cache unit
Prior art date
Application number
PCT/CN2022/138424
Other languages
French (fr)
Chinese (zh)
Inventor
孙炜 (SUN Wei)
祝叶华 (ZHU Yehua)
Original Assignee
哲库科技(上海)有限公司 (Zeku Technology (Shanghai) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 哲库科技(上海)有限公司 (Zeku Technology (Shanghai) Co., Ltd.)
Publication of WO2023134360A1 publication Critical patent/WO2023134360A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 2015/761 Indexing scheme relating to architectures of general purpose stored programme computers
    • G06F 2015/765 Cache

Definitions

  • the present application relates to the field of artificial intelligence, in particular to a data processing method and device, and a storage medium.
  • In the architecture design of artificial intelligence processors, a structure separating computation and storage is often adopted.
  • A hierarchical storage structure is used in artificial intelligence processors: a buffer memory is placed between the computing engine and the internal memory and holds some data for the computing engine's temporary data interactions. When data read by the computing engine misses in the buffer memory, new data must be transferred from the internal memory into the buffer memory so that the computing engine can read it from there.
  • At present, the data mapping between internal memory and buffer memory is designed for the CPU, reflecting the high flexibility and unpredictable data-access addresses of CPU execution. In an embedded neural-network processing unit (NPU) architecture, if a data caching mechanism is added by reusing the CPU's buffer-memory design, the result is low data-caching efficiency for the NPU.
  • Embodiments of the present application provide a data processing method and device, and a storage medium, which can improve data-caching efficiency for an NPU.
  • An embodiment of the present application provides a data processing device, which includes: a neural network processor, a buffer memory, and an internal memory, where the buffer memory includes cache units, and each cache unit includes a cache data block and a cache counter;
  • the cache data block is used to cache the stored data in the internal memory and/or the output data generated by the neural network processor;
  • the cache counter is used to cache the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
  • An embodiment of the present application provides a data processing method applied to the above data processing device, the method including: obtaining the network structure of the algorithm network to be executed, and determining from it the number of algorithm layers that read the stored data in the internal memory and/or the output data of each algorithm layer; determining that number as the read count of the stored data and/or the output data; and adding the stored data with its read count, and/or the output data with its read count, to the buffer memory.
  • An embodiment of the present application provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above data processing method.
  • FIG. 1 is a schematic structural diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of an exemplary data processing device using separation of computing and storage provided in the embodiment of the present application;
  • FIG. 3 is a schematic network structure diagram of an exemplary NPU-executed algorithm network provided in an embodiment of the present application
  • FIG. 4 is a schematic diagram of an exemplary storage mapping method between an internal memory and a buffer memory provided in an embodiment of the present application;
  • FIG. 5 is a flowchart of a data processing method provided by an embodiment of the present application.
  • An embodiment of the present application provides a data processing device, which includes: a neural network processor, a buffer memory, and an internal memory, where the buffer memory includes cache units, and each cache unit includes a cache data block and a cache counter;
  • the cache data block is used to cache the stored data in the internal memory and/or the output data generated by the neural network processor;
  • the cache counter is used to cache the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
  • The buffer memory is further configured to decrement by one the read count in the cache counter corresponding to the cache data block each time a read operation on the stored data and/or the output data is detected; when the read count in the cache counter reaches zero, the stored data and/or the output data are deleted.
  • The neural network processor is configured to determine the read count of the output data according to the number of algorithm layers that read the output data; determine a first storage unit from the buffer memory and/or the internal memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory; and cache the output data, and/or the output data together with its read count, into the first storage unit.
  • The neural network processor is further configured to determine, from the buffer memory, the first cache unit group corresponding to the output data according to the destination address of the output data in the internal memory and the storage mapping; if the first cache unit group includes a first idle cache unit, the first idle cache unit is determined as the first cache unit;
  • The neural network processor is further configured to determine the first storage unit from the internal memory according to the destination address, if the first cache unit group does not include a first idle cache unit and no cache unit whose read count is less than the read count of the output data is found in the first cache unit group;
  • The neural network processor is further configured to determine, as the first cache unit, a cache unit whose read count is less than the read count of the output data, if the first cache unit group does not include a first idle cache unit but such a cache unit is found in the first cache unit group;
  • The buffer memory is further configured to delete, from the cache unit whose read count is less than the read count of the output data, the currently stored output data and the remaining read count corresponding to it.
  • The neural network processor is further configured to update the output data, and/or the output data together with its read count, to the internal memory when caching the output data and its read count into the buffer memory.
  • The neural network processor is further configured to set a to-be-synchronized flag for the output data when caching the output data and its read count into the buffer memory, and, when the buffer memory deletes the output data and its read count, to update the output data, and/or the output data together with its read count, to the internal memory according to the flag.
  • The internal memory is configured to determine, from the buffer memory, the second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping; if a second idle storage unit exists in the second cache unit group, the stored data is cached into the second idle storage unit; if no second idle storage unit exists, the second cache unit with the smallest read count is found in the second cache unit group and the stored data is cached into that second cache unit;
  • The neural network processor is further configured to determine the read count of the stored data; if a second idle storage unit exists in the second cache unit group, the read count of the stored data is cached into the second idle storage unit; if not, the read count of the stored data is cached into the second cache unit.
  • If the internal memory includes storage data blocks and storage counters corresponding to the storage data blocks, the neural network processor is further configured to obtain the read count of the stored data from the storage counter, where the read count in the storage counter is determined according to the number of algorithm layers that read the data in the storage data block, and/or according to a read count transmitted by the neural network processor.
  • If the internal memory includes only the storage data blocks, the neural network processor is further configured to determine, from the buffer memory, the read count currently stored in the second cache unit and determine the read count of the stored data from it, and/or to determine the read count of the stored data according to the number of algorithm layers that read the stored data.
  • An embodiment of the present application provides a data processing method applied to the above data processing device, the method including: obtaining the network structure of the algorithm network to be executed, and determining from it the number of algorithm layers that read the stored data in the internal memory and/or the output data of each algorithm layer; determining that number as the read count of the stored data and/or the output data; and adding the stored data with its read count, and/or the output data with its read count, to the buffer memory.
  • Each time an algorithm layer's read operation on the stored data and/or the output data is detected, the read count of the stored data and/or the output data in the buffer memory is decremented by one; when the read count reaches zero, the corresponding stored data and/or output data are deleted from the buffer memory.
  • Adding the output data and its read count to the buffer memory includes: determining a first storage unit from the buffer memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory; and caching the output data and its read count into the first storage unit.
  • Determining the first storage unit from the buffer memory according to the destination address of the output data in the internal memory and the storage mapping includes: determining, from the buffer memory, the first cache unit group corresponding to the output data; if the first cache unit group includes a first idle cache unit, determining the first idle cache unit as the first cache unit.
  • If the first cache unit group does not include a first idle cache unit, and a cache unit whose read count is less than the read count of the output data is found in the first cache unit group, that cache unit is determined as the first cache unit.
  • If the first cache unit group does not include a first idle cache unit, and no cache unit whose read count is less than the read count of the output data is found in the first cache unit group, the first storage unit is determined from the internal memory according to the destination address.
  • After the output data and its read count are added to the buffer memory, the method further includes: updating the output data, and/or the output data together with its read count, to the internal memory; or setting a to-be-synchronized flag for the output data and, when the buffer memory deletes the output data and its read count, updating the output data, and/or the output data together with its read count, to the internal memory according to the flag.
  • Adding the stored data and its read count to the buffer memory includes: determining the read count of the stored data; determining, from the buffer memory, the second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping; if a second idle storage unit exists in the second cache unit group, caching the stored data and its read count into the second idle storage unit; if not, finding the second cache unit with the smallest read count in the second cache unit group and caching the stored data and its read count into that unit.
  • An embodiment of the present application provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above data processing method.
  • An embodiment of the present application provides a data processing method and device, and a storage medium. The device includes: a neural network processor, a buffer memory, and an internal memory, where the buffer memory includes cache units, each comprising a cache data block and a cache counter. The cache data block caches the stored data in the internal memory and/or the output data generated by the neural network processor; the cache counter caches the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
  • With this device, exploiting the fixed and predictable data flow of a neural network processor, the number of algorithm layers that will read the data in the buffer memory is known in advance from the network structure of the algorithm network, and cache counters in the buffer memory store that number. This ensures that data about to be processed stays cached in the buffer memory, greatly reducing the number of times data is written from the internal memory to the buffer memory and thereby improving data-caching efficiency for the NPU.
  • References to "some embodiments" describe a subset of all possible embodiments; it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another when there is no conflict.
  • The terms "first", "second", and "third" in the embodiments of the present application are only used to distinguish similar objects and do not imply a particular ordering of objects. Where permitted, the specific order or sequence may be interchanged so that the embodiments described herein can be implemented in orders other than those illustrated or described.
  • The device 1 includes: a neural network processor 10, a buffer memory 11, and an internal memory 12, where the buffer memory 11 includes cache units 110, and each cache unit 110 includes a cache data block 1100 and a cache counter 1101;
  • the cache data block 1100 is used to cache the stored data in the internal memory 12 and/or the output data generated by the neural network processor 10;
  • the cache counter 1101 is used to cache the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
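  • As an illustrative aside (not part of the patent text), the cache-unit organization described above can be modeled as a data block paired with a read counter. The Python sketch below is a hypothetical model; all names and fields are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheUnit:
    """One cache unit: a cache data block plus a cache counter (hypothetical model)."""
    data: Optional[bytes] = None   # cached stored data or NPU output data
    tag: Optional[int] = None      # which internal-memory block is cached here
    reads_left: int = 0            # remaining read count for the cached data

    def is_idle(self) -> bool:
        # A unit holding no data is idle and free to receive new data.
        return self.data is None
```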
  • the data processing device proposed in the embodiment of the present application is a data cache device designed for an NPU architecture.
  • the neural network processor is an NPU
  • the buffer memory is a cache memory
  • the internal memory may be a synchronous dynamic random-access memory (SDRAM), a double data rate SDRAM (DDR SDRAM), or another type of memory.
  • the data processing device adopts a structure in which calculation and storage are separated.
  • The cache memory is a buffer close to the NPU computing engine; it stores a certain amount of data for the NPU computing engine's temporary data interactions, and its read/write speed is fast but its capacity is small. The internal memory is far from the NPU computing engine; it stores all of the data and has a large capacity, but its read/write speed is slow and each access path is long, so its read/write efficiency is low.
  • The algorithm layers in the algorithm network can be determined by analyzing the network structure of the algorithm network executed by the NPU. FIG. 3 is a schematic diagram of the network structure of an algorithm network executed by the NPU, where each circle is an algorithm layer. An algorithm layer's workflow is to read in the source data and perform operator processing, where the operator can be convolution, pooling, activation, fully connected, and so on; after operator processing completes, processing proceeds to the next algorithm layer. For example, the output data of algorithm layer 0 is read as input data by algorithm layers 1 and 2, the output data of algorithm layer 1 is read as input data by algorithm layers 3 and 4, and the output data of algorithm layer 2 is read as input data by algorithm layer 5.
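  • Because the network structure is fixed before execution, each layer's output read count can be computed by simply counting its consumer layers. A minimal sketch under that reading (the edge list mirrors the example above; all names are hypothetical):

```python
# Producer -> consumer edges for the example network: layer 0 feeds
# layers 1 and 2, layer 1 feeds layers 3 and 4, layer 2 feeds layer 5.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]

def read_counts(edges):
    """Read count of a layer's output = number of layers that consume it."""
    counts = {}
    for producer, _consumer in edges:
        counts[producer] = counts.get(producer, 0) + 1
    return counts

print(read_counts(edges))  # {0: 2, 1: 2, 2: 1}
```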
  • a cache counter is set for each cache data block, and a cache data block and a corresponding cache counter together form a cache unit.
  • What is filled into the cache counter is the number of times the data in the corresponding cache data block will be read out.
  • The data stored in the cache memory can be the stored data from the internal memory, or the output data generated after the neural network processor performs operator processing on the stored data, selected according to the actual situation.
  • the embodiments of the present application do not make specific limitations.
  • The buffer memory 11 is further configured to decrement by one the read count in the cache counter corresponding to the cache data block each time a read operation on the stored data and/or the output data is detected; when the read count in the cache counter reaches zero, the stored data and/or the output data are deleted.
  • In practice, the operator layers in the NPU read stored data and/or output data from the buffer memory. Each time stored data and/or output data are read from the buffer memory, the buffer memory determines the cache unit holding that data and decrements the read count in that unit's counter by one. When the read count in the cache counter reaches zero, the data in the cache unit will not be read again; at that point the stored data and/or output data are deleted from the buffer memory and the corresponding cache unit is cleared, so that new data can later be written into it.
  • The read count also reflects the importance of the corresponding cache data block. The larger the read count, the more times the data stored in the block will be read by subsequent algorithm layers, and hence the more important that data is; conversely, the smaller the read count, the less frequently the data will be read by subsequent algorithm layers, and hence the less important it is.
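  • A hedged sketch of this decrement-on-read, evict-at-zero behavior, reusing the hypothetical CacheUnit model from above:

```python
def read_from_unit(unit: CacheUnit) -> Optional[bytes]:
    """Serve one read, decrement the counter, and clear the unit at zero."""
    data = unit.data
    unit.reads_left -= 1
    if unit.reads_left == 0:
        # No later algorithm layer will read this block again, so the
        # unit is cleared and becomes available for new data.
        unit.data = None
        unit.tag = None
    return data
```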
  • The neural network processor 10 is configured to determine the read count of the output data according to the number of algorithm layers that read the output data; determine a first storage unit from the buffer memory and/or the internal memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory; and cache the output data, and/or the output data together with its read count, into the first storage unit.
  • In practice, the neural network processor may determine the number of algorithm layers that read the output data according to the network structure of the algorithm network, and then determine the read count of the output data from that number.
  • The storage mapping between the internal memory and the cache memory can be determined according to hardware parameters and cache efficiency. As shown in FIG. 4, the stored data in each storage data block of the internal memory can be mapped to one of a group of four cache data blocks in the cache memory. For example, the cache memory includes 16 cache data blocks, numbered 0-15, each preceded by a cache counter cnt, together forming 16 cache units. Storage data blocks No. 0, No. 8, …, No. 2040 in the internal memory are mapped to cache data blocks No. 0-3 in the cache memory, and so on, and storage data blocks No. 7, No. 15, …, No. 2047 are mapped to cache data blocks No. 12-15, realizing the storage mapping between the internal memory and the buffer memory.
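  • This grouped mapping is effectively set-associative. The text does not spell out the group-selection function, so the modulo hash in the sketch below is an assumption, chosen only because it is consistent with the two mappings stated above:

```python
UNITS_PER_GROUP = 4
NUM_GROUPS = 4  # 16 cache units / 4 units per group

def cache_group(block_index: int) -> range:
    """Map an internal-memory storage-block index to its candidate cache units.
    Assumption: group = block_index mod NUM_GROUPS, consistent with blocks
    0, 8, ..., 2040 -> units 0-3 and blocks 7, 15, ..., 2047 -> units 12-15."""
    group = block_index % NUM_GROUPS
    start = group * UNITS_PER_GROUP
    return range(start, start + UNITS_PER_GROUP)

assert list(cache_group(2040)) == [0, 1, 2, 3]
assert list(cache_group(2047)) == [12, 13, 14, 15]
```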
  • The NPU receives an instruction that includes a source data address, a target data address, and a convolution operation command, where the target data address is the destination address of the output data in the internal memory in this application.
  • the NPU may determine the first storage unit from the buffer memory and/or the internal memory according to the destination address and the storage mapping relationship.
  • The neural network processor 10 is further configured to determine, from the buffer memory, the first cache unit group corresponding to the output data according to the destination address of the output data in the internal memory and the storage mapping; if the first cache unit group includes a first idle cache unit, the first idle cache unit is determined as the first cache unit;
  • The neural network processor 10 is further configured to determine the first storage unit from the internal memory according to the destination address, if the first cache unit group does not include a first idle cache unit and no cache unit whose read count is less than the read count of the output data is found in the first cache unit group;
  • The neural network processor 10 is further configured to determine, as the first cache unit, a cache unit whose read count is less than the read count of the output data, if the first cache unit group does not include a first idle cache unit but such a cache unit is found in the first cache unit group;
  • The buffer memory 11 is further configured to delete, from the cache unit whose read count is less than the read count of the output data, the currently stored output data and the remaining read count corresponding to it.
  • In practice, the neural network processor writes the output data back to the cache memory or the internal memory. Specifically, the neural network processor first determines, from the buffer memory, the first cache unit group for the output data according to the destination address of the output data in the internal memory and the storage mapping, and judges whether a first idle cache unit exists in the first cache unit group; if so, the first idle cache unit is determined as the first cache unit, and the output data and its read count are cached directly into that idle cache unit of the buffer memory.
  • If the first cache unit group contains no idle cache unit, the read count of the output data is compared in turn with the read counts stored in the first cache unit group. If the group contains a cache unit whose read count is less than the read count of the output data, the output data is more important than the data cached in that unit; the unit is determined as the first cache unit, the output data currently stored in it and the corresponding remaining read count are deleted, and the output data and its read count are then cached into it.
  • If the first cache unit group contains no idle cache unit and no cache unit whose read count is less than the read count of the output data, the data currently cached in the group is more important than the output data. In that case, the first storage unit is determined directly from the internal memory according to the destination address, and the output data, and/or the output data together with its read count, is stored into that first storage unit in the internal memory.
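  • Pulling the three cases together, the write-back placement of output data can be sketched as follows (a simplified model of the policy just described; the helper names and the memory dictionary are assumptions):

```python
def place_output(group, output, reads, memory, dest):
    """Write NPU output into its cache group, or to internal memory as a last resort."""
    # Case 1: an idle cache unit exists in the group.
    for unit in group:
        if unit.is_idle():
            unit.data, unit.tag, unit.reads_left = output, dest, reads
            return "cached in idle unit"
    # Case 2: evict a unit whose remaining read count is smaller, i.e. whose
    # data is less important than this output.
    for unit in group:
        if unit.reads_left < reads:
            unit.data, unit.tag, unit.reads_left = output, dest, reads
            return "cached after eviction"
    # Case 3: everything cached is at least as important; bypass the cache.
    memory[dest] = (output, reads)
    return "stored in internal memory"
```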
  • The neural network processor 10 is further configured to update the output data, and/or the output data together with its read count, to the internal memory when caching the output data and its read count into the buffer memory;
  • The neural network processor 10 is further configured to set a to-be-synchronized flag for the output data when caching the output data and its read count into the buffer memory, and, when the buffer memory deletes the output data and its read count, to update the output data, and/or the output data together with its read count, to the internal memory according to the flag.
  • The data in the cache memory is effectively a backup of the data in the internal memory, so when the neural network processor caches the output data and its read count into the buffer memory, the output data, and/or the output data together with its read count, must also be synchronized to the internal memory to ensure data consistency between the cache memory and the internal memory.
  • There are two ways to synchronize. One is to update the output data, and/or the output data together with its read count, to the internal memory at the moment the output data and its read count are cached into the buffer memory. The other is to set a to-be-synchronized flag for the output data and, when the buffer memory deletes the output data and its read count, to update the output data, and/or the output data together with its read count, to the internal memory according to that flag.
  • The buffer memory may delete the output data and its read count in several scenarios: the read count of the output data has been decremented to zero; when the NPU caches new output data into the buffer memory, there is no idle cache unit in the corresponding cache unit group and the read count of the output data is less than that of the new output data; or, when the internal memory writes new stored data into the buffer memory, there is no idle cache unit in the corresponding cache unit group and the read count of the output data is the minimum in that group.
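  • The two synchronization options read like write-through versus flag-deferred write-back. A minimal sketch of the deferred variant (the to_sync flag and the deletion hook are illustrative assumptions):

```python
def cache_output_deferred(unit: CacheUnit, output: bytes, reads: int) -> None:
    """Cache output and mark it for later synchronization (write-back style)."""
    unit.data, unit.reads_left = output, reads
    unit.to_sync = True  # to-be-synchronized flag (assumed extra field)

def on_delete(unit: CacheUnit, memory: dict, dest: int) -> None:
    """When the buffer deletes the data, flush it to the internal memory."""
    if getattr(unit, "to_sync", False):
        memory[dest] = (unit.data, unit.reads_left)
    unit.data, unit.tag, unit.reads_left = None, None, 0
```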
  • The internal memory 12 is configured to determine, from the buffer memory, the second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping; if a second idle storage unit exists in the second cache unit group, the stored data is cached into the second idle storage unit; if no second idle storage unit exists, the second cache unit with the smallest read count is found in the second cache unit group and the stored data is cached into it;
  • The neural network processor 10 is further configured to determine the read count of the stored data; if a second idle storage unit exists in the second cache unit group, the read count of the stored data is cached into the second idle storage unit; if not, the read count of the stored data is cached into the second cache unit.
  • In practice, when the internal memory caches stored data into the cache memory, the neural network processor also determines the read count of the stored data. The second cache unit group corresponding to the stored data is determined, and it is judged whether a second idle storage unit exists in the group. If one exists, the stored data and its read count are cached into the second idle storage unit; if not, the second cache unit with the smallest read count is found in the group, and the stored data and its read count are cached into that unit.
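  • Note one difference from the output-data path: when no idle unit exists, the stored-data path always evicts the unit with the smallest read count rather than comparing it against the incoming count. A hedged sketch of this branch:

```python
def place_stored(group, data, reads, src):
    """Cache data fetched from internal memory into its cache unit group."""
    # Prefer an idle unit in the group.
    for unit in group:
        if unit.is_idle():
            unit.data, unit.tag, unit.reads_left = data, src, reads
            return unit
    # No idle unit: evict the unit with the smallest remaining read count.
    victim = min(group, key=lambda u: u.reads_left)
    victim.data, victim.tag, victim.reads_left = data, src, reads
    return victim
```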
  • If the internal memory includes storage data blocks and storage counters corresponding to the storage data blocks, the neural network processor 10 is further configured to obtain the read count of the stored data from the storage counter, where the read count in the storage counter is determined according to the number of algorithm layers that read the data in the storage data block, and/or according to a read count transmitted by the neural network processor;
  • If the internal memory includes only the storage data blocks, the neural network processor 10 is further configured to determine, from the buffer memory, the read count currently stored in the second cache unit and determine the read count of the stored data from it, and/or to determine the read count of the stored data according to the number of algorithm layers that read the stored data.
  • In practice, a storage counter can be set for each storage data block in the internal memory, with the read count of the stored data held in the corresponding storage counter; the neural network processor can then read the read count of the stored data directly from the storage counter.
  • Alternatively, the internal memory may contain only storage data blocks, in which case the read count of the stored data is not held in the internal memory. The neural network processor can determine the read count of the stored data according to the number of algorithm layers that read it, or according to the read count currently stored in the second cache unit; the read count of the stored data is greater than the count currently stored in the second cache unit, and the specific difference between the two can be obtained from the prior evaluation of the algorithm network.
  • It can be understood that, exploiting the fixed and predictable data flow of a neural network processor, the number of algorithm layers that will read the data in the buffer memory is known in advance from the network structure of the algorithm network, and cache counters in the buffer memory store that number. This ensures that data about to be processed is cached in the buffer memory, greatly reducing the number of times data is written from the internal memory to the buffer memory and thereby improving data-caching efficiency for the NPU.
  • The embodiment of the present application also proposes a data processing method, as shown in FIG. 5, applied to the above data processing device. The method includes:
  • S101: Obtain the network structure of the algorithm network to be executed, and determine from it the number of algorithm layers that read the stored data in the internal memory and/or the output data of each algorithm layer. The network structure of the algorithm network executed by the NPU can be analyzed to determine the data dependencies between the algorithm layers; FIG. 3 is a schematic diagram of the network structure of an algorithm network executed by the NPU.
  • Each circle is an algorithm layer.
  • An algorithm layer's workflow is to read the source data and perform operator processing, where the operator can be convolution, pooling, activation, fully connected, and so on; after operator processing completes, processing proceeds to the next algorithm layer.
  • The output data of algorithm layer 0 is read as input data by algorithm layers 1 and 2, the output data of algorithm layer 1 is read as input data by algorithm layers 3 and 4, and the output data of algorithm layer 2 is read as input data by algorithm layer 5.
  • S102: Determine the number of algorithm layers as the read count of the stored data and/or the read count of the output data; and add the stored data with its read count, and/or the output data with its read count, to the buffer memory.
  • In some embodiments, adding the output data and its read count to the buffer memory includes: determining a first storage unit from the buffer memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory; and caching the output data and its read count into the first storage unit.
  • In some embodiments, determining the first storage unit from the buffer memory includes: determining, from the buffer memory, the first cache unit group corresponding to the output data according to the destination address of the output data in the internal memory and the storage mapping; if the first cache unit group includes a first idle cache unit, determining the first idle cache unit as the first cache unit; if the first cache unit group does not include a first idle cache unit and a cache unit whose read count is less than the read count of the output data is found in the group, determining that cache unit as the first cache unit; if the first cache unit group does not include a first idle cache unit and no such cache unit is found, determining the first storage unit from the internal memory according to the destination address.
  • In some embodiments, adding the stored data and its read count to the buffer memory includes: determining the read count of the stored data; determining, from the buffer memory, the second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping; if a second idle storage unit exists in the second cache unit group, caching the stored data and its read count into the second idle storage unit; if not, finding the second cache unit with the smallest read count in the group and caching the stored data and its read count into that unit.
  • Each time an algorithm layer's read operation on the stored data and/or the output data is detected, the read count of the stored data and/or the output data in the buffer memory is decremented by one; when the read count reaches zero, the corresponding stored data and/or output data are deleted from the buffer memory.
  • After the output data and its read count are added to the buffer memory, a data synchronization process is also performed: the output data, and/or the output data together with its read count, is updated to the internal memory; or a to-be-synchronized flag is set for the output data, and when the buffer memory deletes the output data and its read count, the output data, and/or the output data together with its read count, is updated to the internal memory according to the flag.
  • In this way, exploiting the fixed and predictable data flow of a neural network processor, the number of algorithm layers that will read the data in the buffer memory is known in advance from the network structure of the algorithm network, and cache counters in the buffer memory store that number. This ensures that data about to be processed is cached in the buffer memory, greatly reducing the number of times data is written from the internal memory to the buffer memory and thereby improving data-caching efficiency for the NPU.
  • An embodiment of the present application provides a storage medium on which a computer program is stored.
  • the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more neural network processors.
  • The computer program, when executed, implements the above data processing method.
  • The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions for causing an image display device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data processing method and apparatus, and a storage medium. The apparatus comprises: a neural network processor (10), a cache memory (11), and an internal memory (12). The cache memory (11) comprises cache units (110). Each cache unit (110) comprises a cache data block (1100) and a cache counter (1101); the cache data block (1100) is used for caching stored data in the internal memory (12) and/or output data generated by the neural network processor (10); and the cache counter (1101) is used for caching the number of reads corresponding to the stored data and/or the output data, the number of reads being the same as the number of algorithm layers for reading the stored data and/or the output data, the number of algorithm layers being determined according to a network structure of an algorithm network.

Description

Data processing method and device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202210044147.5, filed on January 14, 2022, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a data processing method and device, and a storage medium.
Background
In the architecture design of artificial intelligence processors, a structure separating computation and storage is often adopted, with hierarchical storage: a buffer memory is placed between the computing engine and the internal memory, holding some data for the computing engine's temporary data interactions. When data read by the computing engine misses in the buffer memory, new data must be transferred from the internal memory into the buffer memory so that the computing engine can read it from there.
At present, the data mapping between internal memory and buffer memory is designed for the CPU, reflecting the high flexibility and unpredictable data-access addresses of CPU execution. In an embedded neural-network processing unit (NPU) architecture, if a data caching mechanism is added by reusing the CPU's buffer-memory design, the result is low data-caching efficiency for the NPU.
Summary of the Invention
Embodiments of the present application provide a data processing method and device, and a storage medium, which can improve data-caching efficiency for an NPU.
The technical solution of the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides a data processing device, including: a neural network processor, a buffer memory, and an internal memory, where the buffer memory includes cache units, and each cache unit includes a cache data block and a cache counter;
the cache data block is used to cache the stored data in the internal memory and/or the output data generated by the neural network processor;
the cache counter is used to cache the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
In a second aspect, an embodiment of the present application provides a data processing method applied to the above data processing device, the method including:
obtaining the network structure of the algorithm network to be executed, and determining from it the number of algorithm layers that read the stored data in the internal memory and/or the output data of each algorithm layer in the algorithm network;
determining the number of algorithm layers as the read count of the stored data and/or the read count of the output data; and adding the stored data with its read count, and/or the output data with its read count, to the buffer memory.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above data processing method.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a data processing device provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an exemplary data processing device using separate computation and storage provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the network structure of an exemplary algorithm network executed by an NPU provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an exemplary storage mapping method between an internal memory and a buffer memory provided in an embodiment of the present application;
FIG. 5 is a flowchart of a data processing method provided by an embodiment of the present application.
Detailed Description
In a first aspect, an embodiment of the present application provides a data processing device, including: a neural network processor, a buffer memory, and an internal memory, where the buffer memory includes cache units, and each cache unit includes a cache data block and a cache counter;
the cache data block is used to cache the stored data in the internal memory and/or the output data generated by the neural network processor;
the cache counter is used to cache the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
Optionally, the buffer memory is further configured to decrement by one the read count in the cache counter corresponding to the cache data block each time a read operation on the stored data and/or the output data is detected; when the read count in the cache counter reaches zero, the stored data and/or the output data are deleted.
Optionally, the neural network processor is configured to determine the read count of the output data according to the number of algorithm layers that read the output data; determine a first storage unit from the buffer memory and/or the internal memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory; and cache the output data, and/or the output data together with its read count, into the first storage unit.
Optionally, the neural network processor is further configured to determine, from the buffer memory, the first cache unit group corresponding to the output data according to the destination address of the output data in the internal memory and the storage mapping; if the first cache unit group includes a first idle cache unit, the first idle cache unit is determined as the first cache unit;
the neural network processor is further configured to determine the first storage unit from the internal memory according to the destination address, if the first cache unit group does not include the first idle cache unit and no cache unit whose read count is less than the read count of the output data is found in the first cache unit group;
the neural network processor is further configured to determine, as the first cache unit, a cache unit whose read count is less than the read count of the output data, if the first cache unit group does not include the first idle cache unit but such a cache unit is found in the first cache unit group;
the buffer memory is further configured to delete, from the cache unit whose read count is less than the read count of the output data, the currently stored output data and the remaining read count corresponding to it.
Optionally, the neural network processor is further configured to update the output data, and/or the output data together with its read count, to the internal memory when caching the output data and its read count into the buffer memory.
Optionally, the neural network processor is further configured to set a to-be-synchronized flag for the output data when caching the output data and its read count into the buffer memory, and, when the buffer memory deletes the output data and its read count, to update the output data, and/or the output data together with its read count, to the internal memory according to the flag.
Optionally, the internal memory is configured to determine, from the buffer memory, the second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping; if a second idle storage unit exists in the second cache unit group, the stored data is cached into the second idle storage unit; if no second idle storage unit exists, the second cache unit with the smallest read count is found in the second cache unit group and the stored data is cached into it;
the neural network processor is further configured to determine the read count of the stored data; if a second idle storage unit exists in the second cache unit group, the read count of the stored data is cached into the second idle storage unit; if not, the read count of the stored data is cached into the second cache unit.
Optionally, if the internal memory includes storage data blocks and storage counters corresponding to the storage data blocks, the neural network processor is further configured to obtain the read count of the stored data from the storage counter, where the read count in the storage counter is determined according to the number of algorithm layers that read the data in the storage data block, and/or according to a read count transmitted by the neural network processor.
Optionally, if the internal memory includes only the storage data blocks, the neural network processor is further configured to determine, from the buffer memory, the read count currently stored in the second cache unit and determine the read count of the stored data from it, and/or to determine the read count of the stored data according to the number of algorithm layers that read the stored data.
In a second aspect, an embodiment of the present application provides a data processing method applied to the above data processing device, the method including:
obtaining the network structure of the algorithm network to be executed, and determining from it the number of algorithm layers that read the stored data in the internal memory and/or the output data of each algorithm layer in the algorithm network;
determining the number of algorithm layers as the read count of the stored data and/or the read count of the output data; and adding the stored data with its read count, and/or the output data with its read count, to the buffer memory.
Optionally, each time an algorithm layer's read operation on the stored data and/or the output data is detected, the read count of the stored data and/or the output data in the buffer memory is decremented by one;
when the read count of the stored data and/or the output data in the buffer memory reaches zero, the corresponding stored data and/or output data are deleted from the buffer memory.
Optionally, adding the output data and its read count to the buffer memory includes:
determining a first storage unit from the buffer memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory;
caching the output data and its read count into the first storage unit.
Optionally, determining the first storage unit from the buffer memory according to the destination address of the output data in the internal memory and the storage mapping between the internal memory and the buffer memory includes:
determining, from the buffer memory, the first cache unit group corresponding to the output data according to the destination address of the output data in the internal memory and the storage mapping.
Optionally, if the first cache unit group includes a first idle cache unit, the first idle cache unit is determined as the first cache unit.
Optionally, if the first cache unit group does not include the first idle cache unit, and a cache unit whose read count is less than the read count of the output data is found in the first cache unit group, that cache unit is determined as the first cache unit.
Optionally, if the first cache unit group does not include the first idle cache unit, and no cache unit whose read count is less than the read count of the output data is found in the first cache unit group, the first storage unit is determined from the internal memory according to the destination address.
Optionally, after the output data and its read count are added to the buffer memory, the method further includes:
updating the output data, and/or the output data together with its read count, to the internal memory.
Optionally, after the output data and its read count are added to the buffer memory, the method further includes:
setting a to-be-synchronized flag for the output data, and, when the buffer memory deletes the output data and its read count, updating the output data, and/or the output data together with its read count, to the internal memory according to the flag.
Optionally, adding the stored data and its read count to the buffer memory includes:
determining the read count of the stored data;
determining, from the buffer memory, the second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping;
if a second idle storage unit exists in the second cache unit group, caching the stored data and its read count into the second idle storage unit;
if no second idle storage unit exists in the second cache unit group, finding the second cache unit with the smallest read count in the second cache unit group and caching the read count of the stored data into it.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above data processing method.
本申请实施例提供了一种数据处理方法及装置、存储介质,该装置包括:神经网络处理器、缓冲存储器和内存存储器;其中,缓冲存储器中包括缓存单元,每个缓存单元包括一个缓存数据块和一个缓存计数器;缓存数据块,用于缓存内存存储器中的存储数据和/或神经网络处理器生成的输出数据;缓存计数器,用于缓存存储数据和/或输出数据对应的读取次数,读取次数与根据算法网络的网络结构确定出的读取所述存储数据和/或所述 输出数据的算法层数量相同。采用上述装置实现方案,针对神经网络处理器的数据流固定、可预先判断的特点,预先根据算法网络的网络结构得知读取缓冲存储器中数据的算法层数量,在缓冲存储器中设置缓存计数器来存储该算法层数量,能够保证即将处理的数据缓存在缓冲存储器中,大大减少了数据从内存存储器写入缓冲存储器的次数,进而提高了针对NPU的数据缓存效率。An embodiment of the present application provides a data processing method and device, and a storage medium, the device including: a neural network processor, a buffer memory, and a memory memory; wherein, the buffer memory includes cache units, and each cache unit includes a cache data block and a cache counter; cache data blocks for caching stored data in memory storage and/or output data generated by the neural network processor; cache counters for caching stored data and/or output data corresponding to the number of reads, read The fetch times are the same as the number of algorithm layers for reading the stored data and/or the output data determined according to the network structure of the algorithm network. Using the above-mentioned device implementation scheme, aiming at the fixed and predictable data flow of the neural network processor, the number of algorithm layers for reading data in the buffer memory is known in advance according to the network structure of the algorithm network, and the buffer counter is set in the buffer memory. Storing the number of algorithm layers can ensure that the data to be processed is cached in the buffer memory, greatly reducing the number of times data is written from the memory memory to the buffer memory, thereby improving the data cache efficiency for the NPU.
To allow a more detailed understanding of the features and technical content of the embodiments of the present application, the implementation of the embodiments is described in detail below with reference to the accompanying drawings, which are provided for reference and illustration only and are not intended to limit the embodiments of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are intended only to describe the embodiments of the present application and are not intended to limit the present application.
In the following description, reference to "some embodiments" describes a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and they may be combined with one another where no conflict arises. It should also be noted that the terms "first/second/third" in the embodiments of the present application are used only to distinguish similar objects and do not imply a particular ordering of those objects; where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein.
An embodiment of the present application provides a data processing apparatus 1. As shown in FIG. 1, the apparatus 1 includes: a neural network processor 10, a buffer memory 11, and a memory storage 12; the buffer memory 11 includes cache units 110, and each cache unit 110 includes a cache data block 1100 and a cache counter 1101.
The cache data block 1100 is used to cache stored data from the memory storage 12 and/or output data generated by the neural network processor 10.
The cache counter 1101 is used to cache the read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined from the network structure of the algorithm network, that read the stored data and/or the output data.
The data processing apparatus proposed in the embodiments of the present application is a data caching apparatus designed for the NPU architecture.
In the embodiments of the present application, the neural network processor is an NPU, the buffer memory is a cache memory, and the memory storage may be a memory such as a synchronous dynamic random-access memory (SDRAM) or a double data rate SDRAM (DDR).
In the embodiments of the present application, the data processing apparatus adopts a structure in which computation and storage are separated. As shown in FIG. 2, the data processing apparatus includes an NPU compute engine, a cache memory, and a memory storage. The NPU compute engine contains a large number of compute units. The cache memory is a buffer memory close to the NPU compute engine; it holds a certain amount of data for temporary data exchange with the NPU compute engine, and it reads and writes fast but has a small capacity. The memory storage is far from the NPU compute engine; it holds all the data and has a large capacity, but it reads and writes slowly and each access takes a long path, so its read/write efficiency is low.
It should be noted that, because the NPU has a fixed and predictable data flow, the data dependencies between the algorithm layers of the algorithm network can be determined, before the algorithm network is executed, by parsing the network structure of the algorithm network executed by the NPU. FIG. 3 is a schematic diagram of the network structure of an algorithm network executed by the NPU, in which each circle is an algorithm layer. The workflow of an algorithm layer is to read in source data and apply operator processing, where the operator may be a convolution, pooling, activation, fully connected layer, and so on; once the operator processing is complete, processing moves on to the next algorithm layer. For example, the output data of algorithm layer 0 is read as input data by algorithm layers 1 and 2, the output data of algorithm layer 1 is read as input data by algorithm layers 3 and 4, and the output data of algorithm layer 2 is read as input data by algorithm layers 5, 6, and 7. Therefore, by analyzing the network structure of the algorithm network, the number of times the output data of each algorithm layer will subsequently be read out can be obtained: the output data of algorithm layer 0 is read out 2 times, the output data of algorithm layer 1 is read out 2 times, and the output data of algorithm layer 2 is read out 3 times.
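To make this analysis concrete, note that the read count of each layer's output is simply the number of consumer layers in the layer dependency graph. The following Python sketch illustrates the computation under the assumption that the network structure is available as an adjacency list; the graph literal mirrors FIG. 3 as described above, and all names are illustrative rather than part of the embodiment:

    # Illustrative sketch: derive per-layer output read counts from the
    # algorithm network's layer dependency graph.
    # Keys are producer layers; values are the layers that read their output.
    layer_readers = {
        0: [1, 2],       # layer 0's output is read by layers 1 and 2
        1: [3, 4],       # layer 1's output is read by layers 3 and 4
        2: [5, 6, 7],    # layer 2's output is read by layers 5, 6 and 7
    }

    def output_read_counts(readers):
        # Read count of a layer's output = number of algorithm layers reading it.
        return {layer: len(consumers) for layer, consumers in readers.items()}

    print(output_read_counts(layer_readers))  # {0: 2, 1: 2, 2: 3}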
Based on the above idea, in the embodiments of the present application, in the cache memory between the NPU and the memory storage, a cache counter is provided for each cache data block; a cache data block and its corresponding cache counter together form a cache unit, and the cache counter holds the number of times the data in the corresponding cache data block will be read out.
It should be noted that the data stored in the cache memory may come from the stored data in the memory storage, or from the output data generated after the neural network processor applies operator processing to the stored data; the choice may be made according to the actual situation and is not specifically limited in the embodiments of the present application.
Optionally, the buffer memory 11 is further configured to decrement the read count in the cache counter corresponding to the cache data block by one each time a read operation on the stored data and/or the output data is detected, and to delete the stored data and/or the output data once the read count in the cache counter reaches zero.
In the embodiments of the present application, the operator layers in the NPU read stored data and/or output data from the buffer memory. Each time stored data and/or output data is read from the buffer memory, the buffer memory identifies the cache unit caching that data and decrements the read count in that unit's counter by one. Once the read count in the cache counter reaches zero, the data in that cache unit will not be read out again; the stored data and/or output data is then deleted from the buffer memory and the corresponding cache unit is cleared so that data can later be written into it.
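As a minimal Python sketch of this decrement-and-release behavior, consider the structure below; the CacheUnit class (a data block paired with a counter) is an assumption introduced purely for illustration, and it is reused by the later sketches:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CacheUnit:
        data: Optional[bytes] = None   # cache data block (None = free unit)
        count: int = 0                 # cache counter: remaining read count

        @property
        def free(self) -> bool:
            return self.data is None

    def read_from_cache(unit: CacheUnit) -> bytes:
        # Return the cached data and decrement its counter; once the counter
        # reaches zero the data will never be read again, so the unit is cleared.
        value = unit.data
        unit.count -= 1
        if unit.count == 0:
            unit.data = None
        return value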
It should be noted that the read count also reflects the importance of the corresponding cache data block. The larger the read count, the more times the data stored in the cache data block will be read by subsequent algorithm layers, and hence the more important that data is; conversely, the smaller the read count, the fewer times the data will be read by subsequent algorithm layers, and hence the less important that data is.
Optionally, the neural network processor 10 is configured to: determine the read count of the output data according to the number of algorithm layers that read the output data; determine a first storage unit from the buffer memory and/or the memory storage according to the destination address of the output data in the memory storage and the storage mapping scheme between the memory storage and the buffer memory; and cache the read count of the output data and the output data, and/or the output data, in the first storage unit.
In the embodiments of the present application, the neural network processor may determine, according to the network structure of the algorithm network, the number of algorithm layers that read the output data, and then determine the read count of the output data according to that number of algorithm layers.
In the embodiments of the present application, the storage mapping scheme between the memory storage and the buffer memory may be determined according to hardware parameters and cache efficiency. As shown in FIG. 4, the stored data in each storage data block of the memory storage may be mapped to four cache data blocks in the cache memory. For example, the cache contains 16 cache data blocks numbered 0-15, each preceded by a cache counter cnt, together forming 16 cache units; storage data blocks 0, 8, ..., 2040 in the memory storage are mapped to cache data blocks 0-3, and so on, until storage data blocks 7, 15, ..., 2047 in the memory storage are mapped to cache data blocks 12-15, thereby implementing the storage mapping scheme between the memory storage and the buffer memory.
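One plausible realization of this grouping, consistent with FIG. 4 as described above, is sketched below. The constants (16 cache blocks in groups of 4, storage blocks repeating with stride 8) are taken from the figure; how the intermediate residues map to groups is an assumption, since the text only states the first and last groups explicitly:

    BLOCKS_PER_GROUP = 4                               # cache blocks per group
    NUM_CACHE_BLOCKS = 16
    NUM_GROUPS = NUM_CACHE_BLOCKS // BLOCKS_PER_GROUP  # 4 groups of 4
    STRIDE = 8                                         # storage blocks 0, 8, ... share a group

    def cache_group(storage_block: int) -> range:
        # Cache-block indices that a given storage data block may occupy.
        group = (storage_block % STRIDE) * NUM_GROUPS // STRIDE
        start = group * BLOCKS_PER_GROUP
        return range(start, start + BLOCKS_PER_GROUP)

    assert list(cache_group(0)) == [0, 1, 2, 3]        # blocks 0, 8, ..., 2040
    assert list(cache_group(2040)) == [0, 1, 2, 3]
    assert list(cache_group(7)) == [12, 13, 14, 15]    # blocks 7, 15, ..., 2047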
In the embodiments of the present application, the NPU receives an instruction containing a source data address, a target data address, and a convolution operation command; the target data address is the destination address of the output data in the memory storage referred to in this application. The NPU can determine the first storage unit from the buffer memory and/or the memory storage according to the destination address and the storage mapping relationship.
Specifically, the neural network processor 10 is further configured to determine, from the buffer memory, a first cache unit group corresponding to the output data according to the destination address of the output data in the memory storage and the storage mapping scheme; and, if the first cache unit group includes a first free cache unit, determine the first free cache unit as the first cache unit;
the neural network processor 10 is further configured to determine the first storage unit from the memory storage according to the destination address if the first cache unit group does not include the first free cache unit and no cache unit whose read count is smaller than the read count of the output data is found in the first cache unit group;
the neural network processor 10 is further configured to determine, as the first cache unit, a cache unit whose read count is smaller than the read count of the output data if the first cache unit group does not include the first free cache unit and such a cache unit is found in the first cache unit group;
the buffer memory 11 is further configured to delete, from the cache unit whose read count is smaller than the read count of the output data, the currently stored output data and the remaining read count corresponding to the currently stored output data.
In the embodiments of the present application, the neural network processor writes the output data back to the cache memory or to the memory storage. Specifically, the neural network processor first determines the first cache unit group from the buffer memory according to the destination address of the output data in the memory storage and the storage mapping scheme, and judges whether a first free cache unit exists in the first cache unit group; if one exists, it determines the first free cache unit as the first cache unit and caches the output data and the read count of the output data directly in that first free cache unit of the buffer memory.
In the embodiments of the present application, if no first free cache unit exists in the first cache unit group, the read count of the output data is compared in turn with the read counts stored in the first cache unit group. If the first cache unit group contains a cache unit whose read count is smaller than the read count of the output data, the output data is more important than the data cached in that cache unit; in that case, that cache unit is determined as the first cache unit, the currently stored output data and its remaining read count are deleted from it, and the output data and the read count of the output data are then cached in that cache unit of the buffer memory.
In the embodiments of the present application, if the first cache unit group contains no cache unit whose read count is smaller than the read count of the output data, none of the data cached in the first cache unit group is less important than the output data. In that case, the first storage unit is determined directly from the memory storage according to the destination address, and the output data, and/or the output data and the read count of the output data, is stored in that first storage unit in the memory storage.
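Taken together, the three cases give the write-back placement decision sketched below, reusing the illustrative CacheUnit structure from the earlier sketch; a False return stands for the bypass path in which the output goes straight to the memory storage at its destination address:

    def place_output(group: list, data: bytes, reads: int) -> bool:
        # Case 1: a free unit exists in the first cache unit group.
        for unit in group:
            if unit.free:
                unit.data, unit.count = data, reads
                return True
        # Case 2: evict a less important entry (smaller remaining read count).
        victim = min(group, key=lambda u: u.count)
        if victim.count < reads:
            victim.data, victim.count = data, reads
            return True
        # Case 3: nothing cached is less important; write to memory storage.
        return False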
Optionally, the neural network processor 10 is further configured to update the output data, and/or the output data and the read count of the output data, to the memory storage when caching the read count of the output data and the output data into the buffer memory;
or, the neural network processor 10 is further configured to set a to-be-synchronized flag for the output data when caching the read count of the output data and the output data into the buffer memory, and, when the buffer memory deletes the output data and the read count of the output data, update the output data, and/or the output data and the read count of the output data, to the memory storage according to the to-be-synchronized flag.
It should be noted that the data in the cache memory is effectively a copy of the data in the memory storage. Therefore, when the neural network processor caches the output data and its read count into the buffer memory, the output data, and/or the output data and its read count, also needs to be synchronized to the memory storage to ensure data consistency between the cache memory and the memory storage. There are two ways to synchronize: one is to update the output data, and/or the output data and its read count, to the memory storage at the moment the read count of the output data and the output data are cached into the buffer memory; the other is to set a to-be-synchronized flag for the output data and, when the buffer memory deletes the output data and its read count, update the output data, and/or the output data and its read count, to the memory storage according to the to-be-synchronized flag.
It should be noted that the buffer memory may delete the output data and its read count in several scenarios: the read count of the output data has decreased to zero; or, when the NPU caches new output data into the buffer memory, it determines that the cache unit group corresponding to the output data contains no free cache unit and that the read count of the output data is smaller than the read count of the new output data; or, when the memory storage writes new stored data into the buffer memory, it is determined that the cache unit group corresponding to the output data contains no free cache unit and that the read count of the output data is the smallest in the corresponding cache unit group.
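The deferred synchronization mode behaves like a conventional write-back policy with a dirty bit. Continuing the illustrative sketches above (the dict standing in for the memory storage is a hypothetical stand-in, not an actual interface of the embodiment):

    @dataclass
    class SyncedUnit(CacheUnit):
        dirty: bool = False        # the to-be-synchronized flag

    def evict_with_sync(unit: SyncedUnit, memory: dict, address: int) -> None:
        # On deletion, propagate the output data (and its read count) to the
        # memory storage if the to-be-synchronized flag is set, then free the unit.
        if unit.dirty:
            memory[address] = (unit.data, unit.count)
            unit.dirty = False
        unit.data, unit.count = None, 0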
Optionally, the memory storage 12 is configured to determine, from the buffer memory, a second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping scheme; if a second free storage unit exists in the second cache unit group, cache the stored data in the second free storage unit; and, if no second free storage unit exists in the second cache unit group, search the second cache unit group for the second cache unit with the smallest read count and cache the stored data in that second cache unit;
the neural network processor 10 is further configured to determine the read count of the stored data; if a second free storage unit exists in the second cache unit group, cache the read count of the stored data in the second free storage unit; and, if no second free storage unit exists in the second cache unit group, cache the read count of the stored data in the second cache unit.
In the embodiments of the present application, when the memory storage caches stored data into the cache memory, the neural network processor also determines the read count of the stored data. The memory storage first determines the second cache unit group corresponding to the stored data from the buffer memory according to the storage address of the stored data and the storage mapping relationship, and judges whether a second free storage unit exists in the second cache unit group. If one exists, the stored data and its read count are cached in the second free storage unit; if not, the second cache unit with the smallest read count is found in the second cache unit group, and the stored data and its read count are then cached in that second cache unit.
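A sketch of this fill path, again reusing the illustrative CacheUnit structure: unlike layer output (compare place_output above), stored data fetched from the memory storage always lands in the cache, replacing the entry with the smallest read count when no unit is free.

    def fill_from_memory(group: list, data: bytes, reads: int) -> None:
        # Prefer a free unit; otherwise replace the least important entry.
        target = next((u for u in group if u.free), None)
        if target is None:
            target = min(group, key=lambda u: u.count)
        target.data, target.count = data, reads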
Optionally, if the memory storage 12 includes storage data blocks and storage counters corresponding to the storage data blocks, the neural network processor 10 is further configured to obtain the read count of the stored data from the storage counter, the read count in the storage counter being determined according to the number of algorithm layers that read the data in the storage data block and/or according to a read count transmitted by the neural network processor;
if the memory storage 12 includes only the storage data blocks, the neural network processor 10 is further configured to determine, from the buffer memory, the read count currently stored in the second cache unit, and determine the read count of the stored data according to the read count currently stored in the second cache unit and/or according to the number of algorithm layers that read the stored data.
In an optional embodiment, a storage counter may be provided for each storage data block in the memory storage, with the read count of the stored data held in the corresponding storage counter; the neural network processor can then obtain the read count of the stored data directly from the storage counter.
In another optional embodiment, only storage data blocks may be provided in the memory storage; in that case, the memory storage does not store the read counts of the stored data. The neural network processor may determine the read count of the stored data according to the number of algorithm layers that read the stored data, or according to the read count currently stored in the second cache unit, where the read count of the stored data is greater than the read count currently stored in the second cache unit; the specific difference between the two can be obtained from a prior evaluation of the algorithm network.
It will be appreciated that, given that the data flow of a neural network processor is fixed and can be determined in advance, the number of algorithm layers that will read the data in the buffer memory is known beforehand from the network structure of the algorithm network, and a cache counter is provided in the buffer memory to store this number of algorithm layers. This ensures that the data about to be processed is cached in the buffer memory, greatly reduces the number of times data is written from the memory storage into the buffer memory, and thus improves data caching efficiency for the NPU.
Based on the above embodiments, an embodiment of the present application further proposes a data processing method, as shown in FIG. 5, applied to the above data processing apparatus. The method includes:
S101: obtaining the network structure of the algorithm network to be executed, and determining, according to the network structure, the number of algorithm layers that read the stored data in the memory storage and/or the output data of each algorithm layer in the algorithm network.
In the embodiments of the present application, because the NPU has a fixed and predictable data flow, the data dependencies between the algorithm layers of the algorithm network can be determined, before the algorithm network is executed, by parsing the network structure of the algorithm network executed by the NPU. FIG. 3 is a schematic diagram of the network structure of an algorithm network executed by the NPU, in which each circle is an algorithm layer. The workflow of an algorithm layer is to read in source data and apply operator processing, where the operator may be a convolution, pooling, activation, fully connected layer, and so on; once the operator processing is complete, processing moves on to the next algorithm layer. For example, the output data of algorithm layer 0 is read as input data by algorithm layers 1 and 2, the output data of algorithm layer 1 is read as input data by algorithm layers 3 and 4, and the output data of algorithm layer 2 is read as input data by algorithm layers 5, 6, and 7. Therefore, by analyzing the network structure of the algorithm network, the number of times the output data of each algorithm layer will subsequently be read out can be obtained: the output data of algorithm layer 0 is read out 2 times, the output data of algorithm layer 1 is read out 2 times, and the output data of algorithm layer 2 is read out 3 times.
S102: determining the number of algorithm layers as the read count of the stored data and/or the read count of the output data, and adding the stored data and the read count of the stored data, and/or the output data and the read count of the output data, to the buffer memory.
In the embodiments of the present application, adding the output data and the read count of the output data to the buffer memory includes: determining a first storage unit from the buffer memory according to the destination address of the output data in the memory storage and the storage mapping scheme between the memory storage and the buffer memory; and caching the read count of the output data and the output data in the first storage unit.
Specifically, determining the first storage unit from the buffer memory according to the destination address of the output data in the memory storage and the storage mapping scheme between the memory storage and the buffer memory includes: determining, from the buffer memory, a first cache unit group corresponding to the output data according to the destination address of the output data in the memory storage and the storage mapping scheme; if the first cache unit group includes a first free cache unit, determining the first free cache unit as the first cache unit; and, if the first cache unit group does not include the first free cache unit and a cache unit whose read count is smaller than the read count of the output data is found in the first cache unit group, determining that cache unit as the first cache unit.
Further, if the first cache unit group does not include the first free cache unit and no cache unit whose read count is smaller than the read count of the output data is found in the first cache unit group, the first storage unit is determined from the memory storage according to the destination address.
In the embodiments of the present application, adding the stored data and the read count of the stored data to the buffer memory includes: determining the read count of the stored data; determining, from the buffer memory, a second cache unit group corresponding to the stored data according to the storage address of the stored data and the storage mapping scheme; if a second free storage unit exists in the second cache unit group, caching the stored data and the read count of the stored data in the second free storage unit; and, if no second free storage unit exists in the second cache unit group, searching the second cache unit group for the second cache unit with the smallest read count and caching the stored data and the read count of the stored data in that second cache unit.
In the embodiments of the present application, each time a read operation by an algorithm layer on the stored data and/or the output data is detected, the read count of the stored data and/or the read count of the output data in the buffer memory is decremented by one; once the read count of the stored data and/or the read count of the output data in the buffer memory reaches zero, the corresponding stored data and/or output data is deleted from the buffer memory.
It should be noted that, after the output data and the read count of the output data are added to the buffer memory, a data synchronization process is also performed. Specifically, the output data, and/or the output data and the read count of the output data, is updated to the memory storage; or a to-be-synchronized flag is set for the output data and, when the buffer memory deletes the output data and the read count of the output data, the output data, and/or the output data and the read count of the output data, is updated to the memory storage according to the to-be-synchronized flag.
It will be appreciated that, given that the data flow of a neural network processor is fixed and can be determined in advance, the number of algorithm layers that will read the data in the buffer memory is known beforehand from the network structure of the algorithm network, and a cache counter is provided in the buffer memory to store this number of algorithm layers. This ensures that the data about to be processed is cached in the buffer memory, greatly reduces the number of times data is written from the memory storage into the buffer memory, and thus improves data caching efficiency for the NPU.
An embodiment of the present application provides a storage medium on which a computer program is stored. The computer-readable storage medium stores one or more programs, which can be executed by one or more neural network processors and applied in a data processing apparatus; when executed, the computer program implements the data processing method described above.
It should be noted that, as used herein, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present disclosure, in essence or in the part contributing to the related art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and including several instructions to cause an image display device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the various embodiments of the present disclosure.
The above are only preferred embodiments of the present application and are not intended to limit the protection scope of the present application.

Claims (20)

  1. A data processing apparatus, comprising: a neural network processor, a buffer memory, and a memory storage; wherein the buffer memory includes cache units, each cache unit including a cache data block and a cache counter;
    the cache data block is configured to cache stored data in the memory storage and/or output data generated by the neural network processor;
    the cache counter is configured to cache a read count corresponding to the stored data and/or the output data, the read count being equal to the number of algorithm layers, determined according to a network structure of an algorithm network, that read the stored data and/or the output data.
  2. The apparatus according to claim 1, wherein
    the buffer memory is further configured to decrement the read count in the cache counter corresponding to the cache data block by one each time a read operation on the stored data and/or the output data is detected, and to delete the stored data and/or the output data once the read count in the cache counter reaches zero.
  3. The apparatus according to claim 1, wherein
    the neural network processor is configured to: determine the read count of the output data according to the number of algorithm layers that read the output data; determine a first storage unit from the buffer memory and/or the memory storage according to a destination address of the output data in the memory storage and a storage mapping scheme between the memory storage and the buffer memory; and cache the read count of the output data and the output data, and/or the output data, in the first storage unit.
  4. The apparatus according to claim 3, wherein
    the neural network processor is further configured to determine, from the buffer memory, a first cache unit group corresponding to the output data according to the destination address of the output data in the memory storage and the storage mapping scheme, and, if the first cache unit group includes a first free cache unit, determine the first free cache unit as the first cache unit;
    the neural network processor is further configured to determine the first storage unit from the memory storage according to the destination address if the first cache unit group does not include the first free cache unit and no cache unit whose read count is smaller than the read count of the output data is found in the first cache unit group;
    the neural network processor is further configured to determine, as the first cache unit, a cache unit whose read count is smaller than the read count of the output data if the first cache unit group does not include the first free cache unit and such a cache unit is found in the first cache unit group;
    the buffer memory is further configured to delete, from the cache unit whose read count is smaller than the read count of the output data, the currently stored output data and the remaining read count corresponding to the currently stored output data.
  5. The apparatus according to claim 3, wherein
    the neural network processor is further configured to update the output data, and/or the output data and the read count of the output data, to the memory storage when caching the read count of the output data and the output data into the buffer memory.
  6. The apparatus according to claim 3, wherein
    the neural network processor is further configured to set a to-be-synchronized flag for the output data when caching the read count of the output data and the output data into the buffer memory, and, when the buffer memory deletes the output data and the read count of the output data, update the output data, and/or the output data and the read count of the output data, to the memory storage according to the to-be-synchronized flag.
  7. The apparatus according to claim 2, wherein
    the memory storage is configured to: determine, from the buffer memory, a second cache unit group corresponding to the stored data according to a storage address of the stored data and the storage mapping scheme; if a second free storage unit exists in the second cache unit group, cache the stored data in the second free storage unit; and, if no second free storage unit exists in the second cache unit group, search the second cache unit group for a second cache unit with the smallest read count and cache the stored data in the second cache unit;
    the neural network processor is further configured to: determine the read count of the stored data; if a second free storage unit exists in the second cache unit group, cache the read count of the stored data in the second free storage unit; and, if no second free storage unit exists in the second cache unit group, cache the read count of the stored data in the second cache unit.
  8. The apparatus according to claim 7, wherein
    if the memory storage includes storage data blocks and storage counters corresponding to the storage data blocks, the neural network processor is further configured to obtain the read count of the stored data from the storage counter, the read count in the storage counter being determined according to the number of algorithm layers that read the data in the storage data block and/or according to a read count transmitted by the neural network processor.
  9. The apparatus according to claim 7, wherein
    if the memory storage includes only the storage data blocks, the neural network processor is further configured to determine, from the buffer memory, the read count currently stored in the second cache unit, and determine the read count of the stored data according to the read count currently stored in the second cache unit and/or according to the number of algorithm layers that read the stored data.
  10. A data processing method, applied to the data processing apparatus according to any one of claims 1-9, the method comprising:
    obtaining a network structure of an algorithm network to be executed, and determining, according to the network structure, the number of algorithm layers that read stored data in a memory storage and/or output data of each algorithm layer in the algorithm network;
    determining the number of algorithm layers as a read count of the stored data and/or a read count of the output data, and adding the stored data and the read count of the stored data, and/or the output data and the read count of the output data, to a buffer memory.
  11. The method according to claim 10, further comprising:
    each time a read operation by an algorithm layer on the stored data and/or the output data is detected, decrementing the read count of the stored data and/or the read count of the output data in the buffer memory by one;
    once the read count of the stored data and/or the read count of the output data in the buffer memory reaches zero, deleting the corresponding stored data and/or output data from the buffer memory.
  12. The method according to claim 10, wherein adding the output data and the read count of the output data to the buffer memory comprises:
    determining a first storage unit from the buffer memory according to a destination address of the output data in the memory storage and a storage mapping scheme between the memory storage and the buffer memory;
    caching the read count of the output data and the output data in the first storage unit.
  13. The method according to claim 12, wherein determining the first storage unit from the buffer memory according to the destination address of the output data in the memory storage and the storage mapping scheme between the memory storage and the buffer memory comprises:
    determining, from the buffer memory, a first cache unit group corresponding to the output data according to the destination address of the output data in the memory storage and the storage mapping scheme.
  14. The method according to claim 13, wherein
    if the first cache unit group includes a first free cache unit, the first free cache unit is determined as the first cache unit.
  15. The method according to claim 13, wherein
    if the first cache unit group does not include the first free cache unit and a cache unit whose read count is smaller than the read count of the output data is found in the first cache unit group, that cache unit is determined as the first cache unit.
  16. The method according to claim 13, further comprising:
    if the first cache unit group does not include the first free cache unit and no cache unit whose read count is smaller than the read count of the output data is found in the first cache unit group, determining the first storage unit from the memory storage according to the destination address.
  17. The method according to claim 10, wherein, after adding the output data and the read count of the output data to the buffer memory, the method further comprises:
    updating the output data, and/or the output data and the read count of the output data, to the memory storage.
  18. The method according to claim 10, wherein, after adding the output data and the read count of the output data to the buffer memory, the method further comprises:
    setting a to-be-synchronized flag for the output data, and, when the buffer memory deletes the output data and the read count of the output data, updating the output data, and/or the output data and the read count of the output data, to the memory storage according to the to-be-synchronized flag.
  19. The method according to claim 10, wherein adding the stored data and the read count of the stored data to the buffer memory comprises:
    determining the read count of the stored data;
    determining, from the buffer memory, a second cache unit group corresponding to the stored data according to a storage address of the stored data and the storage mapping scheme;
    if a second free storage unit exists in the second cache unit group, caching the stored data and the read count of the stored data in the second free storage unit;
    if no second free storage unit exists in the second cache unit group, searching the second cache unit group for a second cache unit with the smallest read count, and caching the stored data and the read count of the stored data in the second cache unit.
  20. A storage medium having a computer program stored thereon, wherein, when executed by a neural network processor, the computer program implements the method according to any one of claims 10-19.
PCT/CN2022/138424 2022-01-14 2022-12-12 Data processing method and apparatus, and storage medium WO2023134360A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210044147.5 2022-01-14
CN202210044147.5A CN114492776A (en) 2022-01-14 2022-01-14 Data processing method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2023134360A1 true WO2023134360A1 (en) 2023-07-20

Family

ID=81512398

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138424 WO2023134360A1 (en) 2022-01-14 2022-12-12 Data processing method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN114492776A (en)
WO (1) WO2023134360A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492776A (en) * 2022-01-14 2022-05-13 哲库科技(上海)有限公司 Data processing method and device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470691A (en) * 2004-11-19 2009-07-01 英特尔公司 Heterogeneous processors sharing a common cache
US20200104691A1 (en) * 2018-09-28 2020-04-02 Qualcomm Incorporated Neural processing unit (npu) direct memory access (ndma) memory bandwidth optimization
CN112712167A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Memory access method and system supporting acceleration of multiple convolutional neural networks
CN112732591A (en) * 2021-01-15 2021-04-30 杭州中科先进技术研究院有限公司 Edge computing framework for cache deep learning
CN114492776A (en) * 2022-01-14 2022-05-13 哲库科技(上海)有限公司 Data processing method and device and storage medium


Also Published As

Publication number Publication date
CN114492776A (en) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22919993

Country of ref document: EP

Kind code of ref document: A1