CN117131912A - Neural network operation device and method, and computer readable storage medium - Google Patents

Neural network operation device and method, and computer readable storage medium

Info

Publication number
CN117131912A
CN117131912A CN202311094709.8A
Authority
CN
China
Prior art keywords
chip
layer
data
output data
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311094709.8A
Other languages
Chinese (zh)
Inventor
赵海丞
孙方轩
连朔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN202311094709.8A priority Critical patent/CN117131912A/en
Publication of CN117131912A publication Critical patent/CN117131912A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A neural network operation device and method, and a computer readable storage medium are provided. The neural network operation device includes an external storage space and an on-chip storage space, the on-chip storage space including an on-chip input data cache, an on-chip output data cache and an on-chip internal data cache, wherein: the external storage space is adapted to cache the data to be processed of the ith cascade group and the output data of the ith cascade group, where i is greater than 1 and less than or equal to N, N is the total number of cascade groups, and N is a positive integer; the on-chip input data cache is adapted to cache the input data of the first layer in the ith cascade group, the input data of the first layer being part of the data to be processed; the on-chip output data cache is adapted to cache the operation result corresponding to the input data; and the on-chip internal data cache is adapted to cache the operation results produced by the computing units for each layer in the ith cascade group. With this scheme, the number of banks in the neural network operation device can be reduced, lowering hardware complexity and circuit area.

Description

Neural network operation device and method, and computer readable storage medium
Technical Field
The present application relates to the field of neural networks, and in particular, to a neural network computing device and method, and a computer readable storage medium.
Background
A neural network chip (Neural Processing Unit, NPU) operates at a much higher speed than accesses to external memory. During operation, the neural network chip cannot obtain data from the external storage space in time, so the computing units on the chip stall and computing efficiency is low.
To improve computing efficiency, current neural network chips mostly adopt an inter-layer cascading scheme that processes the data of several network layers within on-chip memory, which reduces how often the on-chip memory interacts with the external storage space. However, due to memory read/write limitations, the computing units still cannot operate in parallel and computing efficiency remains low.
To ensure that every computing unit can operate in parallel, two banks used in a ping-pong fashion are typically provided for the output data of each computing unit. If there are n computing units, 2n banks are needed to store the intermediate computing results.
Disclosure of Invention
The present application addresses the technical problem that the large number of banks in a neural network operation device leads to high hardware complexity and a large circuit area.
To solve the above technical problem, the present application provides a neural network operation device including an external storage space and an on-chip storage space, the on-chip storage space including an on-chip input data cache, an on-chip output data cache and an on-chip internal data cache, wherein: the external storage space is adapted to cache the data to be processed of the ith cascade group and the output data of the ith cascade group, where i is greater than 1 and less than or equal to N, N is the total number of cascade groups, and N is a positive integer; the on-chip input data cache is adapted to cache the input data of the first layer in the ith cascade group, the input data of the first layer being part of the data to be processed; the on-chip output data cache is adapted to cache the operation result corresponding to the input data; and the on-chip internal data cache is adapted to cache the operation results produced by the computing units for each layer in the ith cascade group.
Optionally, the on-chip internal data cache includes M banks, where M is a positive integer.
Optionally, there are a plurality of computing units, and when at least two computing units operate in parallel on computing tasks of different layers in the ith cascade group, the computing tasks of the different layers are associated with different banks.
Optionally, the M is associated with a capacity of the on-chip storage space.
Optionally, the number M of banks in the on-chip internal data cache differs in parity from the number of computing units.
Optionally, the number M of banks in the on-chip internal data cache is the number of computing units plus 1.
Optionally, after the input of the data to be processed is completed, the on-chip input data buffer is released.
Optionally, after the input of the data to be processed is completed, the on-chip input data buffer is adapted to buffer the input data of the first layer in the i+1th cascade group.
Optionally, the on-chip input data cache includes 1 bank; and/or, the on-chip output data cache comprises 1 bank.
The application also provides a neural network operation method, which comprises the following steps: acquiring input data of a first layer in an ith cascade group from an on-chip input data buffer; adopting a corresponding calculation unit to calculate the input data in the ith cascade group; and storing the operation result corresponding to the input data into an on-chip output data cache.
Optionally, the neural network operation method further includes: and after finishing the input of the data to be processed of the ith cascade group, releasing the corresponding on-chip internal data cache in the ith cascade group based on the calculation progress of the calculation unit.
The present application also provides a computer readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium and stores a computer program; when executed by a processor, the computer program performs the steps of any of the above neural network operation methods.
Compared with the prior art, the technical scheme of the application has the following beneficial effects:
For each cascade group, an on-chip input data cache storing the input data required by the cascade group and an on-chip output data cache storing the operation results produced by the computing units are provided. Therefore, only one on-chip input data cache and one on-chip output data cache need to be arranged in the neural network operation device, and two banks do not have to be provided for every computing unit, so the number of banks can be reduced, lowering the hardware complexity and circuit area of the neural network operation device.
Furthermore, the calculation tasks of different layers in the ith cascade group can be performed in parallel, so that the performance of the neural network operation device can be improved.
In addition, the parity of the number M of banks in the on-chip internal data cache differs from the parity of the number of computing units, so at any time there is a computing unit that can execute a computing task, which improves the execution efficiency of the computing units.
Drawings
FIG. 1 is a flow chart of a neural network operation method in an embodiment of the application;
FIG. 2 is a schematic diagram of an on-chip memory space structure in an embodiment of the application;
FIG. 3 is a schematic diagram of a neural network model in an embodiment of the present application;
FIG. 4 is a schematic diagram of data distribution corresponding to a cascading group 1 according to an embodiment of the present application;
FIG. 5 is a schematic diagram of data distribution corresponding to a cascading group 2 according to an embodiment of the present application;
FIGS. 6-9 are computational flow diagrams of cascading group 1 in one cycle in an embodiment of the present application;
FIGS. 10 to 28 are calculation flow diagrams between the cascade group 1 and the cascade group 2 in an embodiment of the present application.
Detailed Description
As described in the background above, to ensure that every computing unit can operate in parallel, two banks used in a ping-pong fashion are typically provided for the output data of each computing unit. If there are n computing units, 2n banks are needed to store the intermediate computing results.
In the embodiments of the present application, for each cascade group an on-chip input data cache and an on-chip output data cache are provided: the on-chip input data cache stores the input data required by the cascade group, and the on-chip output data cache stores the operation results produced by the computing units. Therefore, only one on-chip input data cache and one on-chip output data cache need to be arranged in the neural network operation device, and two banks do not have to be provided for every computing unit, so the number of banks can be reduced, lowering the hardware complexity and circuit area of the neural network operation device.
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
In a specific implementation, each bank described below is formed by memory cells arranged in a plurality of rows and columns, and only one access to a bank can be performed during a read/write operation. A single-port RAM is a RAM with only one set of data lines and address lines, so it cannot be read and written at the same time. Random access memory (Random Access Memory, RAM) is internal memory that can be read and written at any time and is fast; it typically serves as a temporary data storage medium for the operating system or other running programs.
A convolution layer (conv) is a common form of computation in a neural network; it performs specific numerical calculations on the input data to extract features of the input data as the output of the convolution layer.
A pooling layer (pooling) is a common form of computation in a neural network; it divides an input image into a number of rectangular sub-regions and, for each sub-region, computes and outputs a specific value as the output of the pooling layer.
A neural network is an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks to perform distributed parallel information processing; it processes information by adjusting the interconnections among its internal multi-layer nodes according to the complexity of the system.
For the specific meanings of the above terms, reference may be made to the prior art; they are not described in detail here.
The embodiment of the application provides a neural network operation method, and the method is described in detail through specific steps with reference to fig. 1.
Step 101, obtaining input data of a first layer in an ith cascade group from an on-chip input data buffer.
Step 102, using a corresponding calculation unit to calculate the input data in the ith cascade group.
Step 103, storing the operation result corresponding to the input data into an on-chip output data cache.
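To make the flow of steps 101 to 103 concrete, the following is a minimal Python sketch; the list-based buffers and the `compute_unit_for(layer)` lookup are illustrative assumptions and not elements of the disclosed hardware.

```python
# A minimal sketch of steps 101 to 103, assuming simple list-based buffers and
# a compute_unit_for(layer) lookup; these names are illustrative only.

def run_cascade_group(layers, compute_unit_for, on_chip_input_cache, on_chip_output_cache):
    # Step 101: obtain the input data of the first layer of the i-th cascade group
    data = on_chip_input_cache.pop(0)
    # Step 102: operate on the data with the computing unit corresponding to each layer
    for layer in layers:
        data = compute_unit_for(layer)(data)
    # Step 103: store the operation result corresponding to the input data
    on_chip_output_cache.append(data)
```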
In an embodiment of the present application, the neural network computing device may include an external storage space and an on-chip storage space. The external storage space may be a memory provided independently from the neural network operation device, and may communicate with the neural network operation device through a data transmission channel.
In some embodiments, the external storage space may include a flash memory (FLASH) chip, an electrically erasable programmable read-only memory (EEPROM) chip, or the like.
The on-chip memory space may refer to a memory module built in the neural network computing device. In some embodiments, the neural network computing device may be a neural network chip (NPU), and the on-chip memory space is a memory module inside the neural network chip.
In an embodiment of the present application, the on-chip memory space may be divided into three parts, wherein:
the first part is constructed by a single port RAM for storing data read from an external storage space, can be written for direct memory access (Direct Memory Access, DMA) and can be read by a computing unit, and the reading and the writing are not performed simultaneously. In the following embodiments, the first portion of the on-chip storage space may be simply referred to as an on-chip input data cache.
The second part is constructed by adopting a single-port RAM and is used for storing the calculated result into an external storage space, and the result can be read for DMA and can be written in by a calculating unit, and the reading and the writing are not performed simultaneously. In the following embodiments, the second portion of the on-chip storage space may be simply referred to as an on-chip output data buffer.
The third part is constructed by adopting a single-port RAM and is used for storing intermediate data generated in the calculation process of the calculation unit, and the intermediate data can be read and written only by the calculation unit, and the reading and writing can not be performed simultaneously. In the following embodiments, the third portion of the on-chip storage space may be simply referred to as an on-chip internal data cache.
In the embodiments of the present application, the on-chip input data cache may serve as one bank, the on-chip output data cache may serve as one bank, and the on-chip internal data cache is divided into n+1 banks, where n is the number of computing units. Thus, the number of banks in the on-chip internal data cache differs from the number of computing units.
In some embodiments, the computing units may include a convolution computing unit and a pooling computing unit; in this scenario there are two computing units. Accordingly, the on-chip internal data cache includes three banks.
Referring to fig. 2, a schematic diagram of an on-chip storage space structure in an embodiment of the present application is given. In fig. 2, the on-chip memory space may include an on-chip input data buffer, an on-chip output data buffer, and an on-chip internal data buffer, where: the on-chip input data cache comprises 1 bank, the on-chip output data cache comprises 1 bank, and the on-chip internal data cache comprises banks 0-2. The data in the external storage space is input into the on-chip input data cache, and the data stored in the on-chip output data cache is output into the external storage space.
Data transmission channels exist between the on-chip input data cache and the on-chip internal data cache, and between the on-chip internal data cache and the on-chip output data cache, to enable data transfer between the on-chip input data cache and the on-chip internal data cache and between the on-chip internal data cache and the on-chip output data cache. Data transmission channels may also exist among the banks of the on-chip internal data cache to enable data transfer between banks.
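The storage layout of FIG. 2 can be sketched as follows; this is a hedged illustration that models the banks as plain Python lists, and the dictionary keys are assumed names rather than terms from the disclosure.

```python
# A hedged sketch of the on-chip storage space of FIG. 2, modelled as plain
# Python lists; the dictionary keys are illustrative names only.

def build_on_chip_storage(num_compute_units):
    return {
        "input_cache": [],    # 1 bank, written by DMA, read by a computing unit
        "output_cache": [],   # 1 bank, written by a computing unit, read by DMA
        "internal_banks": [[] for _ in range(num_compute_units + 1)],
    }

# With a convolution unit and a pooling unit (n = 2), the internal data cache
# holds n + 1 = 3 banks (bank0 to bank2), versus 2n intermediate-result banks
# in the conventional ping-pong scheme.
storage = build_on_chip_storage(2)
assert len(storage["internal_banks"]) == 3
```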
In specific applications, a neural network may comprise many layers. In the embodiments of the present application, the maximum number of cascaded layers of each cascade group can be calculated according to the capacity of the on-chip storage space, and the neural network is divided into a plurality of cascade groups accordingly. For any cascade group, the input data of the first layer is cached in the on-chip input data cache, and the input data of the other layers is cached in the on-chip internal data cache.
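One way to picture a capacity-derived grouping is the greedy sketch below; the greedy rule and the `layer_output_bytes` sizes are assumptions used only to illustrate the idea, not the exact partitioning rule of the present application.

```python
# A hedged sketch of dividing the network into cascade groups: a group is
# closed once adding the next layer's buffered output would exceed the
# on-chip capacity (assumed greedy rule, illustrative only).

def split_into_cascade_groups(layer_output_bytes, on_chip_capacity):
    groups, current, used = [], [], 0
    for layer, size in enumerate(layer_output_bytes, start=1):
        if current and used + size > on_chip_capacity:
            groups.append(current)
            current, used = [], 0
        current.append(layer)
        used += size
    if current:
        groups.append(current)
    return groups
```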
Specifically, within a cascade group, the input data of each layer other than the first is essentially the output data of the previous layer. For example, the input data of layer 2 is essentially the output data of layer 1; correspondingly, the output data of layer 2 is the input data of layer 3.
For a cascade group, the input data of the first layer can be obtained from the external storage space by direct memory access (Direct Memory Access, DMA) and cached in the on-chip input data cache. The input data passes through the computing units in turn; after passing through the last computing unit it re-enters the first computing unit, and this is repeated until all layers in the cascade group have been calculated. The intermediate results of each computing unit can be stored in the on-chip internal data cache, while the result of the last layer is stored in the on-chip output data cache and written back to the external storage space by DMA.
For the computing units, the corresponding intermediate results flow through the on-chip internal data cache as follows: the first result of the first computing unit is stored in the first bank of the on-chip internal data cache; thereafter, each time the intermediate data passes through a computing unit, the obtained result is stored in the next bank, and after the data in the last bank passes through a computing unit the result is stored back in the first bank.
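This rotation is simply a wrap-around bank index; the following is a hedged sketch of that assumption, checked against the distribution shown in FIG. 4.

```python
# A hedged sketch of how intermediate results rotate through the M banks of
# the on-chip internal data cache: the first result goes to bank0, each later
# result goes to the next bank, and the index wraps back to bank0.

M = 3  # two computing units in the example, so M = 2 + 1

def output_bank(step):
    # step 0 is the first computing unit's first intermediate result
    return step % M

# In the 14-layer example this reproduces the layout of FIG. 4: within cascade
# group 1, layers 1, 4 and 7 land on bank0, layers 2 and 5 on bank1, and
# layers 3 and 6 on bank2 (layer 8 goes to the on-chip output data cache).
assert [output_bank(k) for k in range(7)] == [0, 1, 2, 0, 1, 2, 0]
```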
For the data flow between cascade groups, after a computing unit has read the last line of the input data of any layer in the previous cascade group, the storage space occupied by that layer's input data can be released and used for the next cascade group. Once enough storage space has been released, the data of the next cascade group can be prefetched and operated on even if some data of the previous cascade group has not yet been calculated, achieving seamless hand-over between cascade groups and avoiding stalls of the computing units.
In the embodiments of the present application, a computing task table can be prepared in advance according to the number of computing units, so that all computing units can work at the same time.
Specifically, the computing unit corresponding to each computing task and the bank identifiers of the on-chip internal data cache used by each computing task can be marked first. Each computing task corresponds to one layer of the neural network, and one layer performs one computing task.
Then, it is determined whether, during the processing of one computing task, other computing tasks can be processed in parallel. The conditions for judging whether computing tasks can be processed in parallel may include: at the same time, different computing tasks use different computing units, and a computing task that reads a bank and a computing task that writes the same bank cannot be processed in parallel.
Referring to table 1 below, an example of a calculation task table in an embodiment of the present application is given.
Computing task | Computing unit | Input bank | Output bank | Time 1 | Time 2 | Time 3
Layer 1        | Unit 1         | -          | 0           | ✓      |        |
Layer 2        | Unit 2         | 0          | 1           |        | ✓      |
Layer 3        | Unit 1         | 1          | 2           |        |        | ✓
Layer 4        | Unit 2         | 2          | 0           | ✓      |        |
Layer 5        | Unit 1         | 0          | 1           |        | ✓      |
Layer 6        | Unit 2         | 1          | -           |        |        | ✓
TABLE 1
In table 1, at time 1, the computing unit 1 runs the computing task corresponding to layer 1, and the computing unit 2 runs the computing task corresponding to layer 4; at the moment 2, the computing unit 2 runs the computing task corresponding to the layer 2, and the computing unit 1 runs the computing task corresponding to the layer 5; at time 3, the computing unit 1 runs the computing task corresponding to layer 3 and the computing unit 2 runs the computing task corresponding to layer 6.
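The parallelism condition described above can be expressed as a small check; the task tuples (computing unit, input bank, output bank) below are an assumed representation used only to illustrate the rule, not a data structure defined by the present application.

```python
# A hedged sketch of the parallelism check: two computing tasks may share a
# time slot only if they use different computing units and no bank is read by
# one task while being written by the other.

def can_run_in_parallel(task_a, task_b):
    unit_a, in_a, out_a = task_a
    unit_b, in_b, out_b = task_b
    if unit_a == unit_b:
        return False
    read_write_conflict = (in_a is not None and in_a == out_b) or \
                          (in_b is not None and in_b == out_a)
    return not read_write_conflict

# From Table 1: layer 1 and layer 4 can share time 1, but layer 1 and layer 2
# cannot, because layer 2 would read bank0 while layer 1 writes it.
layer1, layer2, layer4 = ("unit1", None, 0), ("unit2", 0, 1), ("unit2", 2, 0)
assert can_run_in_parallel(layer1, layer4) and not can_run_in_parallel(layer1, layer2)
```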
The memory access method provided in the above-described embodiments of the present application will be described below by way of specific examples.
The neural network is set to comprise 14 layers, namely layers 1 to 14 in sequence. Referring to fig. 3, a schematic structural diagram of a neural network model in an embodiment of the present application is given.
The calculation tasks corresponding to the layers 1, 3, 5, 7, 9, 11 and 13 are convolution calculation, and the calculation tasks corresponding to the layers 2, 4, 6, 8, 10, 12 and 14 are pooling calculation. The calculation unit includes a convolution calculation unit and a pooling calculation unit. Correspondingly, the on-chip internal data cache comprises three banks, namely banks 0 to 2 in sequence.
The 14 layers are divided into 2 cascade groups, wherein the cascade group 1 comprises layers 1 to 8, and the cascade group 2 comprises layers 9 to 14.
Referring to fig. 4, a schematic diagram of data distribution corresponding to the cascade group 1 is given.
The input data and the output data of the cascade group 1 are stored in an external storage space, the input data of the layer 1 is stored in an on-chip input data buffer, the output data of the layer 8 is stored in an on-chip output data buffer, the output data of the layer 1, the output data of the layer 4 and the output data of the layer 7 are stored in a bank0 of an on-chip internal data buffer, the output data of the layer 2 and the output data of the layer 5 are stored in a bank1, and the output data of the layer 3 and the layer 6 are stored in a bank2.
Referring to fig. 5, a schematic diagram of data distribution corresponding to the cascade group 2 is given.
Input data and output data of the cascade group 2 are stored in an external storage space, input data of the layer 9 is stored in an on-chip input data buffer, output data of the layer 14 is stored in an on-chip output data buffer, data of the layer 9 and the layer 12 are stored in a bank0, data of the layer 10 and the layer 13 are stored in a bank1, and data of the layer 11 is stored in a bank2.
Referring to Table 2, a table of computing tasks corresponding to cascading group 1 is presented.
TABLE 2
The operation of the cascade group 1 and the cascade group 2 will be described with reference to table 2.
The cascade group 1 includes 8 layers, and the operations of two layers can be performed in parallel at each time, so one cycle takes 4 times to complete. Accordingly, the cascade group 2 includes 6 layers, so one cycle takes 3 times to complete.
Referring to fig. 6 to fig. 9, calculation flow diagrams of the cascade group 1 within one cycle are given. One cycle may include 4 times; the computing units perform the same operations in different cycles, differing only in the data operated on.
In fig. 6, at time 1, the computing unit 1 runs the computing task of layer 1: it reads data from the on-chip input data cache (i.e., the input data of layer 1), performs the operation, and stores the operation result (i.e., the output data of layer 1) in bank0. The computing unit 2 takes the output data of layer 5 from bank1, performs the operation, and stores the obtained result (i.e., the output data of layer 6) in bank2. The data stored in the on-chip output data cache (the output data of layer 8) is output to the external storage space by DMA.
At time 1, the data stored in the on-chip input data cache is the data to be processed that needs to be operated on by the neural network. In general, the amount of data to be processed is large while the on-chip input data cache is small, so the data to be processed can be divided into a plurality of segments, and one data segment is input into the on-chip input data cache in each cycle. The length of a data segment is not greater than the maximum capacity of the on-chip input data cache.
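The segmentation can be pictured with the short sketch below; the list-slicing model and the byte-count capacity are illustrative assumptions.

```python
# A hedged sketch of segmenting the data to be processed so that each segment
# fits the on-chip input data cache.

def split_into_segments(data_to_process, input_cache_capacity):
    return [data_to_process[i:i + input_cache_capacity]
            for i in range(0, len(data_to_process), input_cache_capacity)]

# One segment is then transferred by DMA into the on-chip input data cache per cycle.
```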
In fig. 7, at time 2, the external storage space inputs data to be processed (input data corresponding to layer 1) to the input data buffer by DMA. The computing unit 1 extracts the output data of the layer 4 from the bank0, performs the operation, and stores the obtained operation result (i.e., the output data of the layer 5) in the bank1. The computing unit 2 acquires the output data of the layer 1 from the bank0, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 2) in the bank1.
The last line of data output by layer 1 has been used at time 2 and the memory space occupied by layer 1 can be freed up at the end of time 2.
In fig. 8, at time 3, the external storage space inputs data to be processed to the input data buffer by DMA. The computing unit 1 acquires the output data of the layer 2 from the bank1, performs an operation, and stores the obtained result (i.e., the output data of the layer 3) in the bank2. The computing unit 2 reads the output data of the layer 7 from the bank0, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 8) in an on-chip output buffer.
The last line of data output by layer 2 has been used at time 3 and the memory space occupied by layer 2 can be freed up at time 3.
In fig. 9, at time 4, the data in the on-chip output data buffer may be output to the external storage space by DMA. The calculation unit 1 acquires the output data of the layer 6 from the bank2, calculates the output data, and stores the obtained result (i.e., the output data of the layer 7) in the bank0. The computing unit 2 reads the output data of the layer 3 from the bank2, performs an operation, and stores the obtained result (i.e., the output data of the layer 4) in the bank0.
The last line of data output by layer 3 has been used at time 4 and the memory space occupied by layer 3 can be freed up at time 4.
As can be seen from fig. 6 to 9, the execution of the computing units 1 and 2 in the cascade group 1 is not performed in the order of layers 1 to 8, but the computing units 1 and 2 operate on the data of different layers in parallel. Since the execution of the computing units 1 and 2 is not sequential, at some time, the computing units 1 and 2 continue to operate on the computing tasks that did not complete in the previous cycle.
Specifically, as shown in fig. 6, at time 1 of the current cycle no data segment is actually read from the external storage space into the on-chip input data cache. The data corresponding to layer 1 stored in the on-chip input data cache was written at time 2 and time 3 of the previous cycle, when the external storage space input the data corresponding to layer 1 into the on-chip input data cache by DMA.
In addition, at time 1 of the current cycle, the computing unit 2 reads the output data of layer 5 from bank1. That output data of layer 5 in bank1 was produced at time 2 of the previous cycle, when the computing unit 1 took the output data of layer 4 from bank0 and performed the operation. That is, the output data of layer 6 in the current cycle is calculated based on the output data of layer 5 from the previous cycle.
Similarly, at time 2 of the current cycle, the output data of layer 4 in bank0 was produced at time 4 of the previous cycle, when the computing unit 2 took the output data of layer 3 from bank2 and performed the operation. That is, the computing unit 1 calculates the output data of layer 5 in the current cycle based on the output data of layer 4 from the previous cycle.
At time 3 of the current period, the calculating unit 2 calculates output data of layer 8 of the current period based on output data of layer 7 of the previous period, and stores the output data of layer 8 in an on-chip output data buffer.
At time 4 of the current cycle, the computing unit 1 calculates the output data of layer 7 based on the output data of layer 6 calculated at time 1, and the data stored in the on-chip output data cache is output to the external storage space by DMA.
When a computing unit has read the last line of the input data of any layer in the cascade group 1, the storage space occupied by the input data of that layer can be released and made available to the next cascade group (for example, cascade group 2). As long as enough space has been released, the data of the next cascade group can be prefetched and operated on even if the remaining layers of the previous cascade group have not finished their calculations, thereby achieving seamless hand-over between cascade groups and avoiding stalls of the computing units.
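The hand-over can be pictured with the bookkeeping sketch below; the `dma_prefetch` callback and the byte counts are illustrative assumptions used only to show when prefetching of the next cascade group could start.

```python
# A hedged sketch of the hand-over between cascade groups: once a layer's
# input storage is released, the freed space is tracked, and the next cascade
# group's first data segment is prefetched by DMA as soon as it fits.

def release_and_maybe_prefetch(freed_bytes, layer_input_bytes, next_segment_bytes, dma_prefetch):
    freed_bytes += layer_input_bytes            # last line of this layer has been read
    if freed_bytes >= next_segment_bytes:       # enough room for the next cascade group
        dma_prefetch()                          # start loading the next group's input early
        freed_bytes -= next_segment_bytes
    return freed_bytes
```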
Referring to fig. 10-28, a computational flow diagram between cascading group 1 and cascading group 2 is presented.
In fig. 10, at a time n, the computing unit 1 reads the last segment of the data to be processed from the on-chip input data cache, performs the operation on it, and stores the obtained result in bank0. The computing unit 2 takes the output data of layer 5 from bank1, performs the operation, and stores the obtained result (i.e., the output data of layer 6) in bank2. The data stored in the on-chip output data cache (the output data of layer 8) is output to the external storage space by DMA.
At time n, some layers of the cascade group 1 have not yet finished their calculations; only the data to be operated on has been completely read.
In fig. 11, at time n+1, the computing unit 1 takes the output data of layer 4 from bank0, performs the operation, and stores the obtained result (i.e., the output data of layer 5) in bank1. The computing unit 2 takes the output data of layer 1 from bank0, performs the operation, and stores the obtained result (i.e., the output data of layer 2) in bank1. Since all the data required by layer 1 has been read, the external storage space can begin inputting the data required by layer 9 into the on-chip input data cache. That is, from time n+1 onward, the on-chip input data cache stores the data required by layer 9.
The output data of layer 1 is already used at time n+1, and the memory space occupied by the output data of layer 1 can be freed up at the end of time n+1.
In fig. 12, at time n+2, the external storage space may continue inputting the data required by layer 9 into the on-chip input data cache. The computing unit 1 reads the output data of layer 2 from bank1, performs the operation, and stores the obtained result (i.e., the output data of layer 3) in bank2. The computing unit 2 reads the output data of layer 7 from bank0, performs the operation, and stores the obtained result (i.e., the output data of layer 8) in the on-chip output data cache.
In fig. 12, since the output data of layer 1 in the cascade group 1 is used at time n+1 and the corresponding calculation task of layer 1 has been completed, the output data of layer 1 in bank0 is released at time n+2.
The output data of layer 2 has been used at time n+2, and the memory space occupied by the output data of layer 2 can be freed up at the end of time n+2.
In fig. 13, at time n+3, the calculation unit 1 reads the output data of the layer 6 from the bank2, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 7) in the bank0. The computing unit 2 reads the output data of the layer 3 from the bank2, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 4) in the bank0. The on-chip output data buffer outputs the output data of the layer 8 to an external storage space in a DMA mode.
The output data of layer 3 has been used at time n+3, and the memory space occupied by the output data of layer 3 can be freed up at the end of time n+3.
In fig. 14, at time n+4, the computing unit 1 reads the input data corresponding to layer 9 from the on-chip input data cache, performs the operation, and stores the obtained result (i.e., the output data of layer 9) in bank0. The computing unit 2 reads the output data of layer 5 from bank1, performs the operation, and stores the obtained result (i.e., the output data of layer 6) in bank2. The on-chip output data cache outputs the output data of layer 8 to the external storage space by DMA.
In fig. 15, at time n+5, the external storage space may continue inputting the data required by layer 9 into the on-chip input data cache. The computing unit 2 may read the output data of layer 9 from bank0, perform the operation, and store the obtained result (the output data of layer 10) in bank1. The computing unit 1 reads the output data of layer 4 from bank0, performs the operation, and stores the obtained result (the output data of layer 5) in bank1.
The output data of layer 4 has been used at time n+5 and the memory space occupied by the output data of layer 4 can be freed up at the end of time n+5.
In fig. 16, at time n+6, the external storage space may continue inputting the data required by layer 9 into the on-chip input data cache. The computing unit 1 reads the output data of layer 10 from bank1, performs the operation, and stores the obtained result (the output data of layer 11) in bank2. The computing unit 2 reads the output data of layer 7 from bank0, performs the operation, and stores the obtained result (the output data of layer 8) in the on-chip output data cache.
In fig. 17, at time n+7, the calculation unit 1 reads the output data of the layer 6 from the bank2, performs an operation, and stores the obtained operation result (output data of the layer 7) in the bank0. The calculating unit 2 reads the output data of the layer 11 from the bank2, performs an operation, and stores the obtained operation result (output data of the layer 12) in the bank0. The on-chip output data buffer outputs the output data of the layer 8 to an external storage space in a DMA mode.
In fig. 18, at time n+8, the computing unit 1 reads the input data of layer 9 from the on-chip input data buffer, performs an operation, and stores the obtained operation result (output data of layer 9) in bank0. The computing unit 2 reads the output data of the layer 5 from the bank1, performs an operation, and stores the obtained operation result (output data of the layer 6) in the bank2. The on-chip output data buffer outputs the output data of the layer 8 to an external storage space in a DMA mode.
The output data of layer 5 has been used at time n+8 and the memory space occupied by the output data of layer 5 can be freed up at the end of time n+8.
In fig. 19, at time n+9, the external storage space may continue inputting the data required by layer 9 into the on-chip input data cache. The computing unit 2 reads the output data of layer 9 from bank0, performs the operation, and stores the obtained result (the output data of layer 10) in bank1. The computing unit 1 reads the output data of layer 12 from bank0, performs the operation, and stores the obtained result (the output data of layer 13) in bank1.
In fig. 20, at time n+10, the external storage space may continue inputting the data required by layer 9 into the on-chip input data cache. The computing unit 1 reads the output data of layer 10 from bank1, performs the operation, and stores the obtained result (the output data of layer 11) in bank2. The computing unit 2 reads the output data of layer 7 from bank0, performs the operation, and stores the obtained result (the output data of layer 8) in the on-chip output data cache.
In fig. 21, at time n+11, the calculating unit 2 reads the output data of the layer 11 from the bank2, performs an operation, and stores the obtained operation result (output data of the layer 12) in the bank0. The computing unit 1 reads the output data of the layer 6 from the bank2, performs an operation, and stores the obtained operation result (output data of the layer 7) in the bank0. The on-chip output data buffer outputs the output data of the layer 8 to an external storage space in a DMA mode.
The output data of layer 6 has been used at time n +11 and the memory space occupied by the output data of layer 6 can be freed up at the end of time n + 11.
In fig. 22, at time n+12, since the cascade group 1 has 8 layers while the cascade group 2 has 6 layers, there is no calculation of the cascade group 2 to run in parallel with the last calculations of the final two layers (layers 7 and 8) of the cascade group 1. Thus, at time n+12, the computing unit 2 takes the output data of layer 7 from bank0, performs the operation, and stores the obtained result (i.e., the output data of layer 8) in the on-chip output data cache, while the computing unit 1 stalls.
In fig. 23, at time n+13, the last layer of the cascade group 1 produces its final output, and the calculation of the last layer of the cascade group 2 cannot be performed in parallel with it; in this scenario, the computing unit 2 stalls.
At time n+13, the computing unit 1 reads the input data of the layer 9 from the on-chip input data buffer, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 9) in the bank0. And, output data of layer 8 stored in the on-chip output data buffer is output to the external storage space.
At time n+13, the operation of the cascade group 1 ends.
At time n+14, after the operation of the cascade group 1 is completed, the calculation of the last layer of the cascade group 2 can start.
In fig. 24, the computing unit 1 reads the input data of layer 9 from the on-chip input data buffer, performs an operation, and stores the obtained operation result (i.e., the output data of layer 9) in bank0. The computing unit 2 reads the output data of the layer 13 from the bank1, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 14) in an on-chip output data buffer.
In fig. 25, at time n+15, the computing unit 2 reads the output data of the layer 9 from the bank0, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 10) in the bank1. The computing unit 1 reads the output data of the layer 12 from the bank0, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 13) in the bank1.
In fig. 26, at time n+16, the external storage space inputs the data required by layer 9 into the on-chip input data cache. The computing unit 1 reads the output data of layer 10 from bank1, performs the operation, and stores the obtained result (i.e., the output data of layer 11) in bank2. The computing unit 2 reads the output data of layer 13 from bank1, performs the operation, and stores the obtained result (i.e., the output data of layer 14) in the on-chip output data cache.
In fig. 27, at time n+17, the computing unit 1 reads the input data of layer 9 from the on-chip input data buffer, performs an operation, and stores the obtained operation result (i.e., the output data of layer 9) in bank0. The computing unit 2 reads the output data of the layer 11 from the bank2, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 12) in the bank0. The output data of the layer 14 stored in the on-chip output data buffer is output to the external storage space by DMA.
In fig. 28, at time n+18, the calculating unit 2 reads the output data of the layer 9 from the bank0, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 10) in the bank1. The computing unit 1 reads the output data of the layer 12 from the bank0, performs an operation, and stores the obtained operation result (i.e., the output data of the layer 13) in the bank1. The output data of the layer 14 stored in the on-chip output data buffer is output to the external storage space by DMA.
Times n+16 to n+18 thus complete the operation process of one cycle of the cascade group 2.
In summary, in the embodiments of the present application, for each cascade group an on-chip input data cache and an on-chip output data cache are provided, where the on-chip input data cache stores the input data required by the cascade group and the on-chip output data cache stores the operation results produced by the computing units. Therefore, only one on-chip input data cache and one on-chip output data cache need to be arranged in the neural network operation device, and two banks do not have to be provided for every computing unit, so the number of banks can be reduced, lowering the hardware complexity and circuit area of the neural network operation device.
The embodiments of the present application also provide a computer readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium and stores a computer program; when executed by a processor, the computer program performs the neural network operation method provided by any of the above embodiments.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs related hardware, the program may be stored on a computer readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, etc.
Although the present application is disclosed above, the present application is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the application, so the protection scope of the application shall be subject to the appended claims.

Claims (12)

1. A neural network computing device, comprising an external storage space and an on-chip storage space, the on-chip storage space including an on-chip input data cache, an on-chip output data cache and an on-chip internal data cache, wherein:
the external storage space is suitable for caching the data to be processed of the ith cascade group and the output data of the ith cascade group; i is more than 1 and less than or equal to N, N is the total number of cascade groups, and N is a positive integer;
the on-chip input data cache is suitable for caching the input data of a first layer in the ith cascade group, wherein the input data of the first layer is part of the data to be processed;
the on-chip output data cache is suitable for caching an operation result corresponding to the input data;
and the on-chip internal data cache is suitable for caching the operation result of the calculation unit on each layer in the ith cascade group.
2. The neural network computing device of claim 1, wherein the on-chip internal data cache includes M banks, M being a positive integer.
3. The neural network computing device of claim 2, wherein the number of computing units is plural, and at least two computing units perform parallel operations on computing tasks of different layers in the ith cascade group, the computing tasks of different layers being associated with different banks.
4. The neural network computing device of claim 2, wherein the M is associated with a capacity of the on-chip memory space.
5. The neural network computing device of claim 2, wherein the number M of banks in the on-chip internal data cache differs in parity from the number of computing units.
6. The neural network computing device of claim 5, wherein the number M of banks in the on-chip internal data cache is the number of computing units plus 1.
7. The neural network computing device of claim 1, wherein the on-chip input data cache is released after the input of the data to be processed is completed.
8. The neural network computing device of claim 7, wherein the on-chip input data cache is adapted to cache input data for a first tier in an i+1 th cascaded group after completion of the input of the data to be processed.
9. The neural network computing device of any one of claims 1-8, wherein the on-chip input data cache includes 1 bank; and/or, the on-chip output data cache comprises 1 bank.
10. A neural network operation method, comprising:
acquiring input data of a first layer in an ith cascade group from an on-chip input data buffer;
adopting a corresponding calculation unit to calculate the input data in the ith cascade group;
and storing the operation result corresponding to the input data into an on-chip output data cache.
11. The neural network operation method of claim 10, further comprising:
and after finishing the input of the data to be processed of the ith cascade group, releasing the corresponding on-chip internal data cache in the ith cascade group based on the calculation progress of the calculation unit.
12. A computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the neural network operation method according to claim 10 or 11.
CN202311094709.8A 2023-08-28 2023-08-28 Neural network operation device and method, and computer readable storage medium Pending CN117131912A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311094709.8A CN117131912A (en) 2023-08-28 2023-08-28 Neural network operation device and method, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311094709.8A CN117131912A (en) 2023-08-28 2023-08-28 Neural network operation device and method, and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117131912A true CN117131912A (en) 2023-11-28

Family

ID=88862419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311094709.8A Pending CN117131912A (en) 2023-08-28 2023-08-28 Neural network operation device and method, and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117131912A (en)

Similar Documents

Publication Publication Date Title
US11574031B2 (en) Method and electronic device for convolution calculation in neural network
CN105843775B (en) On piece data divide reading/writing method, system and its apparatus
US10346507B2 (en) Symmetric block sparse matrix-vector multiplication
US20160078045A1 (en) Selective compression of objects in a storage compute device
CN108573305B (en) Data processing method, equipment and device
CN112668708B (en) Convolution operation device for improving data utilization rate
CN113919477A (en) Acceleration method and device of convolutional neural network
CN108520297B (en) Programmable deep neural network processor
CN105912476A (en) On-chip repeated addressing method and device
CN114565501A (en) Data loading method and device for convolution operation
CN115657946A (en) Off-chip DDR bandwidth unloading method under RAID sequential writing scene, terminal and storage medium
CN113254391B (en) Neural network accelerator convolution calculation and data loading parallel method and device
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
KR20230081697A (en) Method and apparatus for accelerating dilatational convolution calculation
CN108920097B (en) Three-dimensional data processing method based on interleaving storage
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN113869495A (en) Method, device and equipment for optimizing convolutional weight layout of neural network and readable medium
CN112435157B (en) Graphics processing system including different types of memory devices and method of operating the same
CN110837483B (en) Tensor dimension transformation method and device
CN108073548B (en) Convolution operation device and convolution operation method
CN109800867B (en) Data calling method based on FPGA off-chip memory
CN117131912A (en) Neural network operation device and method, and computer readable storage medium
CN110322389B (en) Pooling method, apparatus and system, computer readable storage medium
CN115933994A (en) Data processing method and device, electronic equipment and storage medium
CN103279562B (en) A kind of method, device and database storage system for database L2 cache

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination