CN112597079A - Data write-back system of convolutional neural network accelerator - Google Patents
- Publication number
- CN112597079A (application CN202011527851.3A)
- Authority
- CN
- China
- Prior art keywords
- unit
- write
- data
- cache
- buffer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1605—Handling requests for interconnection or transfer for access to memory bus based on arbitration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1668—Details of memory controller
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4063—Device-to-bus coupling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention provides a data write-back system for a convolutional neural network accelerator, comprising an input cache module, N levels of write-back nodes, and a write-back control module. The input cache module is connected to the computing units to receive data; the write-back nodes at the uppermost level are connected to the input cache module; each write-back node at a lower level is connected to at least two write-back nodes at the level above it, where N is a natural number greater than 1; and the write-back control module is connected to the write-back node at the lowest level to receive data from it and transmit the data to the bus. Because the write-back nodes are organized hierarchically in a tree structure, the transmission efficiency of data write-back is improved.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a data write-back system of a convolutional neural network accelerator.
Background
In the prior art, a cloud Field Programmable Gate Array (FPGA) can provide a large amount of logic and memory resources compared with an edge device. However, neural network models deployed in the cloud are often huge and generate a large number of intermediate results at run time. The on-chip Random Access Memory (RAM) resources of an FPGA platform cannot cache all of this data, so the data must be transferred to off-chip memory. Existing designs cannot meet the high-throughput transmission requirements of such concurrent data, and data transmission efficiency is therefore low.
Therefore, there is a need to provide a new data write-back system for a convolutional neural network accelerator to solve the above problems in the prior art.
Disclosure of Invention
The invention aims to provide a data write-back system of a convolutional neural network accelerator, which improves the transmission efficiency of data write-back of the convolutional neural network accelerator.
In order to achieve the above object, the data write-back system of the convolutional neural network accelerator of the present invention includes:
the input cache module is used for being connected with the computing unit to receive data;
the N levels of write-back nodes, wherein the write-back node at the uppermost level is connected with the input cache module, each write-back node at a lower level is connected with at least two write-back nodes at the level above, and N is a natural number greater than 1;
and the write-back control module is connected with the write-back node at the lowest level so as to receive data from the write-back node at the lowest level and transmit the data to the bus.
The beneficial effects of the data write-back system of the convolutional neural network accelerator are as follows: the system comprises N levels of write-back nodes, the write-back node at the uppermost level is connected with the input cache module, each write-back node at a lower level is connected with at least two write-back nodes at the level above, and N is a natural number greater than 1; organizing the write-back nodes hierarchically in a tree structure improves the transmission efficiency of data write-back.
Preferably, the write-back node includes a first output cache unit, a selection unit, and at least two receiving cache units, an output end of the receiving cache unit is connected to an input end of the selection unit, and an output end of the selection unit is connected to an input end of the first output cache unit. The beneficial effects are that: the write-back node is in standardized design, and the interface is simple, easy to use and easy to transplant.
Further preferably, the number of the write-back nodes at the upper level is adapted to the number of the receiving cache units of the write-back node at the lower level. The beneficial effects are that: and avoiding the waste of the receiving cache unit of the write-back node at the next level.
Further preferably, the write-back control module includes an address mapping unit, the data received by the write-back control module from the write-back node at the lowest level includes computing unit address information and calculation result data, and the address mapping unit calculates the write-back address according to the computing unit address information and the start address information.
Further preferably, the write-back node further comprises an arbitration unit and a cache management unit, the arbitration unit is connected with the selection unit, and the cache management unit is respectively connected with the receiving cache unit and the first output cache unit. The beneficial effects are that: collisions in the data transmission process can be effectively avoided.
Further preferably, the receiving cache unit includes a first cache state unit and a first data cache unit that are connected to each other, and the first cache state unit is connected to the cache management unit. The beneficial effects are that: whether data exist in the first data cache unit or not is judged conveniently.
Further preferably, the first output buffer unit includes a second buffer status unit and a second data buffer unit connected to each other, and the second buffer status unit is connected to the buffer management unit. The beneficial effects are that: it is convenient to judge whether data exists in the second data cache unit.
Further preferably, the cache management units of the writeback nodes connected to each other are connected to each other. The beneficial effects are that: avoiding data collision.
Further preferably, the input cache module includes input cache units, and the number of the input cache units is adapted to the number of the receiving cache units of the write-back node at the top level. The beneficial effects are that: and avoiding the waste of the receiving cache unit of the write-back node at the uppermost level.
Further preferably, the input cache unit includes a cache control unit, a third data cache unit and a second output cache unit, the cache control unit is respectively connected to the computing unit, the third data cache unit and the corresponding cache management unit of the write-back unit, and the third data cache unit is connected to the second output cache unit.
Preferably, the number of the write-back nodes at the lowest stage is 1. The beneficial effects are that: the method can ensure that only one data is transmitted to the bus at the same time, and avoid data transmission conflict.
Drawings
FIG. 1 is a block diagram of an arbitration unit in some embodiments of the present invention;
FIG. 2 is a block diagram of a receive cache unit in accordance with some embodiments of the present invention;
FIG. 3 is a block diagram of a first output buffer unit according to some embodiments of the invention;
FIG. 4 is a block diagram of an input buffer unit in some embodiments of the invention;
FIG. 5 is a block diagram of a data write-back system of a convolutional neural network accelerator in some embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. As used herein, the word "comprising" and similar words are intended to mean that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items.
To address the above problems in the prior art, an embodiment of the invention provides a data write-back system for a convolutional neural network accelerator based on a cloud Field Programmable Gate Array (FPGA). The system comprises an input cache module, N levels of write-back nodes, and a write-back control module. The input cache module is connected to the computing units to receive data; the write-back nodes at the uppermost level are connected to the input cache module; each write-back node at a lower level is connected to at least two write-back nodes at the level above it, and N is a natural number greater than 1. The write-back control module is connected to the write-back node at the lowest level to receive data from it and transmit the data to the bus. Preferably, there is exactly one write-back node at the lowest level.
In some embodiments, the writeback node includes a first output cache unit, a selection unit, an arbitration unit, a cache management unit, and at least two receiving cache units, an output of the receiving cache unit is connected to an input of the selection unit, an output of the selection unit is connected to an input of the first output cache unit, the arbitration unit is connected to the selection unit, and the cache management unit is respectively connected to the receiving cache unit and the first output cache unit. Specifically, the arbitration unit is a shift register, and the bit of the shift register is at least 2.
FIG. 1 is a block diagram of an arbitration unit in some embodiments of the present invention. Referring to fig. 1, the arbitration unit 212 includes a shift register with the same number of bits as the number of receiving cache units connected to it. For example, if 4 receiving cache units are connected to the arbitration unit 212, the shift register contains 4 bits: a first bit 2121, a second bit 2122, a third bit 2123, and a fourth bit 2124. Taking a right shift as an example: in a first clock cycle, the first bit 2121 is 1 and the other three bits are 0; in a second clock cycle, the second bit 2122 is 1 and the other three bits are 0; in a third clock cycle, the third bit 2123 is 1 and the other three bits are 0; in a fourth clock cycle, the fourth bit 2124 is 1 and the other three bits are 0. Four clock cycles form one full rotation. A left shift works on the same principle and is not described in detail here.
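The rotating one-hot grant described above can be sketched in software as follows; the class and method names are illustrative, not taken from the patent:

```python
class RotatingArbiter:
    """One-hot shift-register arbiter, modeling the arbitration unit:
    a single '1' bit rotates one position per clock cycle, so each of
    the connected receiving cache units is granted once per rotation."""

    def __init__(self, n_bits=4):
        # First bit starts at 1, remaining bits at 0, as in the text.
        self.bits = [1] + [0] * (n_bits - 1)

    def tick(self):
        """Advance one clock cycle: rotate the grant bit to the right."""
        self.bits = [self.bits[-1]] + self.bits[:-1]

    def granted_index(self):
        """Index of the receiving cache unit currently selected."""
        return self.bits.index(1)


arb = RotatingArbiter(4)
grants = []
for _ in range(8):  # two full rotations of the 4-bit register
    grants.append(arb.granted_index())
    arb.tick()
# grants == [0, 1, 2, 3, 0, 1, 2, 3]
```

A left-shifting arbiter would simply rotate the list in the other direction; the fairness property is identical.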
Fig. 2 is a block diagram of a receiving cache unit in some embodiments of the invention. Referring to fig. 2, the receiving cache unit 211 includes a first cache status unit 2111 and a first data cache unit 2112 connected to each other, and the first cache status unit 2111 is connected to the cache management unit (not shown in the figure). When the first cache status unit 2111 detects that no data is stored in the first data cache unit 2112, it reports this to the cache management unit, which marks the first data cache unit 2112 as 1; when it detects that data is stored, the cache management unit marks the first data cache unit 2112 as 0.
Fig. 3 is a block diagram of a first output cache unit according to some embodiments of the invention. Referring to fig. 3, the first output cache unit 215 includes a second cache status unit 2151 and a second data cache unit 2152 connected to each other. The second cache status unit 2151 is connected to the cache management unit (not shown); the input of the second data cache unit 2152 is connected to the output of the selection unit (not shown), and its output is connected to the input of a first data cache unit (not shown) in a receiving cache unit of the write-back node at the next level, or to the input of the write-back control module (not shown). When the second cache status unit 2151 detects that no data is stored in the second data cache unit 2152, it reports this to the cache management unit, which marks the second data cache unit 2152 as 1; when it detects that data is stored, the cache management unit marks the second data cache unit 2152 as 0.
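The marking convention shared by both cache units (1 = empty, 0 = holding data) can be modeled as a single-slot buffer with a derived status flag; the class below is a hypothetical sketch, not an interface from the patent:

```python
class StatusBuffer:
    """Single-slot buffer mirroring a receiving/output cache unit: the
    cache management unit marks it 1 when empty and 0 when it holds data."""

    def __init__(self):
        self.data = None

    @property
    def mark(self):
        """Flag as reported to the cache management unit."""
        return 1 if self.data is None else 0

    def write(self, value):
        # Writes are only legal while the unit is marked 1 (empty).
        assert self.mark == 1, "write into a unit marked 0 would collide"
        self.data = value

    def read(self):
        """Drain the slot; the unit reverts to marked 1."""
        value, self.data = self.data, None
        return value


buf = StatusBuffer()
# buf.mark == 1 while empty
buf.write("partial_sum")
# buf.mark == 0 once data is stored
```

The assertion in `write` captures why the flags exist: they prevent an upstream unit from overwriting data that has not yet been drained.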
In some embodiments, the cache management units of write-back nodes that are connected to each other are also connected to each other.
Specifically, when the receiving cache unit in the write-back node at the next level is marked 1 by its cache management unit (that is, it stores no data), and the output cache unit in the write-back node at the upper level is also marked 1, the output cache unit may receive data from the receiving cache unit corresponding to the bit that is 1 in the arbitration unit, and then pass that data downstream.
For example, let the write-back node at the upper level be a first write-back node and the write-back node at the lower level be a second write-back node. The first write-back node includes a first receiving cache unit, a second receiving cache unit, a third receiving cache unit, a fourth receiving cache unit, a first selection unit, a first arbitration unit, a first cache management unit, and a third output cache unit. The outputs of the four receiving cache units are respectively connected to the four inputs of the first selection unit, the first arbitration unit is connected to the control terminal of the first selection unit, and the output of the first selection unit is connected to the third output cache unit. The first cache management unit is respectively connected to the first, second, third, and fourth receiving cache units and the third output cache unit, so as to mark each of them 1 or 0.
the first write-back node comprises a fifth receiving cache unit, a sixth receiving cache unit, a seventh receiving cache unit, an eighth receiving cache unit, a second selection unit, a second arbitration unit, a second cache management unit and a fourth output cache unit, wherein the output ends of the fifth receiving cache unit, the sixth receiving cache unit, the seventh receiving cache unit and the eighth receiving cache unit are respectively connected with the four input ends of the second selection unit, the second arbitration unit is connected with the control end of the second selection unit, the output end of the second selection unit is connected with the fourth output cache unit, the second arbitration unit is respectively connected with the fifth receiving cache unit, the sixth receiving cache unit, the seventh receiving cache unit, the eighth receiving cache unit and the fourth output cache unit, to mark 1 or 0 for the fifth receiving buffer unit, the sixth receiving buffer unit, the seventh receiving buffer unit, the eighth receiving buffer unit and the fourth output buffer unit;
the first write-back node and the second write-back node are connected to each other, specifically, an output terminal of the third output cache unit is connected to an input terminal of the fifth receiving cache unit, the first cache management unit is connected to the second cache management unit, when the fifth receiving cache unit does not store data, the second cache unit feeds back the mark of the fifth receiving cache unit to the first cache unit as 1, when the third output buffer unit does not store data, the first buffer management unit marks the third output buffer unit as 1, and if the first bit of the first arbitration unit is 1, the first receiving cache unit transmits the data stored therein to the third output cache unit through the first selection unit, and the third output cache unit transmits the data to the fifth receiving cache unit.
FIG. 4 is a block diagram of an input cache unit according to some embodiments of the invention. Referring to fig. 4, the input cache module includes input cache units 11, whose number matches the number of receiving cache units of the write-back nodes at the uppermost level. Each input cache unit 11 includes a cache control unit 111, a third data cache unit 112, and a second output cache unit 113. The cache control unit 111 is respectively connected to the control terminal of the computing unit (not shown), to the third data cache unit 112, and to the cache management unit (not shown) of the corresponding write-back node. The input of the third data cache unit 112 is connected to the data output of the computing unit, the third data cache unit 112 is connected to the second output cache unit 113, and the output of the second output cache unit 113 is connected to a first data cache unit (not shown) in a receiving cache unit of the write-back node at the uppermost level. Specifically, the third data cache unit 112 is a First-In First-Out (FIFO) memory.
In some embodiments, when the cache management unit of the write-back node at the uppermost level feeds back 0 to the cache control unit (that is, data is stored in the first data cache unit): if data is also stored in the third data cache unit, the cache control unit sends a non-empty signal to the computing unit so that the computing unit stops working; if no data is stored in the third data cache unit, the cache control unit takes no action, or sends an empty signal to the computing unit so that the computing unit immediately continues working. When the cache management unit of the write-back node at the uppermost level feeds back 1 to the cache control unit (that is, no data is stored in the first data cache unit), if data is stored in the third data cache unit, the second output cache unit reads data from the third data cache unit.
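The cases handled by the cache control unit can be summarized as a small decision function. The signal names below are hypothetical, and the mapping of feedback values (0 = downstream full, 1 = downstream empty) follows the convention in the text:

```python
def cache_control_action(downstream_mark, fifo_has_data):
    """Sketch of the cache control unit's backpressure decision.

    downstream_mark : 0 if the first data cache unit downstream holds data,
                      1 if it is empty (as fed back by the cache manager).
    fifo_has_data   : whether the local third data cache unit (FIFO) is non-empty.
    """
    if downstream_mark == 0:
        # Downstream is full: stall the computing unit only if the local
        # FIFO also has data waiting; otherwise let it keep producing.
        return "stall" if fifo_has_data else "run"
    # Downstream is empty: forward a datum if the FIFO has one to give.
    return "forward" if fifo_has_data else "run"


# The three explicit cases from the text:
case_stall = cache_control_action(0, True)     # "stall"
case_idle = cache_control_action(0, False)     # "run"
case_forward = cache_control_action(1, True)   # "forward"
```

This is exactly the flow-control role of the input cache unit: it decouples the computing unit's production rate from the tree's drain rate while never dropping a result.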
In some embodiments, the number of the write-back nodes at the upper level is adapted to the number of the receiving cache units at the write-back node at the lower level.
In some embodiments, the write-back control module includes an address mapping unit. The data that the write-back control module receives from the write-back node at the lowest level includes computing unit address information and calculation result data. The address mapping unit calculates a write-back address by address mapping from the computing unit address information and the start address information, and transmits the write-back address and the calculation result data over the bus to a Block Random Access Memory (BRAM) of the neural network accelerator.
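The patent specifies only the inputs of the address mapping (computing unit address information plus a start address), not the mapping function itself; the row-major layout below is purely an illustrative assumption, with all parameter names invented for the sketch:

```python
def map_write_back_address(unit_row, unit_col, start_addr,
                           units_per_row=4, words_per_unit=1):
    """Hypothetical address mapping: derive a linear BRAM address from a
    computing unit's (row, col) position and a layer's start address.
    Row-major ordering is an assumption, not stated in the patent."""
    offset = (unit_row * units_per_row + unit_col) * words_per_unit
    return start_addr + offset


# Unit (2, 3) in a 4x4 grid, results for this layer starting at 0x1000:
addr = map_write_back_address(unit_row=2, unit_col=3, start_addr=0x1000)
# addr == 0x1000 + (2 * 4 + 3) == 0x100B
```

Whatever the real mapping is, computing it centrally in the write-back control module means the tree nodes themselves stay address-agnostic, which is what makes them uniform and easy to replicate.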
In some embodiments, the second output buffer unit, the first data buffer unit, and the second data buffer unit in the present application are all Random Access Memories (RAMs).
FIG. 5 is a block diagram of a data write-back system of a convolutional neural network accelerator in some embodiments of the present invention. Referring to fig. 5, the data write-back system 100 of the convolutional neural network accelerator includes an input cache module 10, 2 levels of write-back nodes 20, and a write-back control module 30. The 2 levels of write-back nodes 20 include a first-level write-back node 21 and a second-level write-back node 22, where the first-level write-back node 21 is the level above the second-level write-back node 22. The input cache module 10 is connected to the first-level write-back node 21, the first-level write-back node 21 is connected to the second-level write-back node 22, the second-level write-back node 22 is connected to the write-back control module 30, and the write-back control module 30 is connected to a bus (not shown).
Referring to fig. 5, the input buffer module 10 includes 16 input buffer units 11, and the 16 input buffer units 11 are connected to 16 computing units (not shown) in a one-to-one correspondence manner to receive data from the corresponding computing units.
Referring to fig. 5, the first-level write-back node 21 comprises 4 write-back nodes and the second-level write-back node 22 comprises 1 write-back node. Each write-back node, at either level, includes 4 receiving cache units 211, 1 arbitration unit 212, 1 selection unit 213, 1 cache management unit 214, and 1 first output cache unit 215. The inputs of the receiving cache units 211 at the first level are connected to the input cache units 11 in one-to-one correspondence. Within the same write-back node, the outputs of the 4 receiving cache units 211 are respectively connected to the 4 inputs of the selection unit 213, the output of the arbitration unit 212 is connected to the control terminal of the selection unit 213, and the cache management unit 214 is connected to the 4 receiving cache units 211 and the first output cache unit 215.
Referring to fig. 5, the output terminals of the first output cache units 215 of the 4 write-back nodes in the first-level write-back node 21 are respectively connected with the input terminals of the 4 receiving cache units 211 of the write-back node in the second-level write-back node 22; the cache management units 214 of the 4 writeback nodes in the first level writeback node 21 are all connected with the cache management unit 214 in the second level writeback node 22.
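The overall 16 → 4 → 1 topology of fig. 5 can be exercised with a behavioural sketch. This is a simplification under stated assumptions: real nodes move one word per clock under the buffer handshake, and the draining order below merely illustrates the round-robin interleaving the arbiters produce:

```python
from collections import deque


def drain_tree(inputs, fan_in=4):
    """Drain a write-back tree level by level: every group of `fan_in`
    buffers is merged by one node in round-robin order (one element per
    buffer per pass), mirroring the shift-register arbiter."""
    level = [deque(buf) for buf in inputs]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            merged = deque()
            while any(group):  # round-robin across the group's buffers
                for buf in group:
                    if buf:
                        merged.append(buf.popleft())
            next_level.append(merged)
        level = next_level
    return list(level[0])


# One result per computing unit, 16 units feeding 4 nodes feeding 1 node:
inputs = [[f"u{i}"] for i in range(16)]
order = drain_tree(inputs)
# First-level nodes emit u0..u3, u4..u7, u8..u11, u12..u15; the root then
# interleaves across them, so the bus sees u0, u4, u8, u12, u1, u5, ...
```

The sketch makes the fairness property of the tree visible: every computing unit's output reaches the bus within a bounded number of arbitration rounds, regardless of which unit finished first.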
Although the embodiments of the present invention have been described in detail hereinabove, it is apparent to those skilled in the art that various modifications and variations can be made to these embodiments. However, it is to be understood that such modifications and variations are within the scope and spirit of the present invention as set forth in the following claims. Moreover, the invention as described herein is capable of other embodiments and of being practiced or of being carried out in various ways.
Claims (11)
1. A data write back system for a convolutional neural network accelerator, comprising:
the input cache module is used for being connected with the computing unit to receive data;
N levels of write-back nodes, wherein the write-back node at the uppermost level is connected with the input cache module, each write-back node at a lower level is connected with at least two write-back nodes at the level above, and N is a natural number greater than 1;
and the write-back control module is connected with the write-back node at the lowest level so as to receive data from the write-back node at the lowest level and transmit the data to the bus.
2. The data write-back system of the convolutional neural network accelerator as claimed in claim 1, wherein the write-back node comprises a first output buffer unit, a selection unit and at least two receiving buffer units, an output of the receiving buffer unit is connected to an input of the selection unit, and an output of the selection unit is connected to an input of the first output buffer unit.
3. The data write-back system of the convolutional neural network accelerator as claimed in claim 2, wherein the number of the write-back nodes at the upper level is adapted to the number of the receiving cache units at the write-back nodes at the lower level.
4. The data write-back system of the convolutional neural network accelerator as claimed in claim 2, wherein the write-back control module comprises an address mapping unit, the data received by the write-back control module from the write-back node at the lowest level comprises computing unit address information and calculation result data, and the address mapping unit calculates a write-back address according to the computing unit address information and the start address information.
5. The data write-back system of the convolutional neural network accelerator as claimed in claim 2, wherein the write-back node further comprises an arbitration unit and a cache management unit, the arbitration unit is connected to the selection unit, and the cache management unit is respectively connected to the receiving cache unit and the first output cache unit.
6. The data write-back system of the convolutional neural network accelerator as claimed in claim 5, wherein the receiving cache unit comprises a first cache state unit and a first data cache unit connected to each other, and the first cache state unit is connected to the cache management unit.
7. The data write-back system of the convolutional neural network accelerator as claimed in claim 5, wherein the first output buffer unit comprises a second buffer status unit and a second data buffer unit connected to each other, and the second buffer status unit is connected to the buffer management unit.
8. The data write-back system of the convolutional neural network accelerator as claimed in any one of claims 5, 6 or 7, wherein the cache management units of the write-back nodes connected to each other are connected to each other.
9. The data write-back system of the convolutional neural network accelerator as claimed in claim 5, wherein the input buffer module comprises input buffer units, and the number of the input buffer units is adapted to the number of the receiving buffer units of the write-back node at the top level.
10. The data write-back system of the convolutional neural network accelerator as claimed in claim 9, wherein the input buffer unit comprises a buffer control unit, a third data buffer unit and a second output buffer unit, the buffer control unit is respectively connected to the calculation unit, the third data buffer unit and the corresponding buffer management unit of the write-back unit, and the third data buffer unit is connected to the second output buffer unit.
11. The data write-back system of the convolutional neural network accelerator as claimed in claim 1, wherein the number of write-back nodes at the lowest stage is 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011527851.3A CN112597079B (en) | 2020-12-22 | 2020-12-22 | Data write-back system of convolutional neural network accelerator |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011527851.3A CN112597079B (en) | 2020-12-22 | 2020-12-22 | Data write-back system of convolutional neural network accelerator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112597079A true CN112597079A (en) | 2021-04-02 |
CN112597079B CN112597079B (en) | 2023-10-17 |
Family
ID=75199931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011527851.3A Active CN112597079B (en) | 2020-12-22 | 2020-12-22 | Data write-back system of convolutional neural network accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112597079B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08123725A (en) * | 1994-10-20 | 1996-05-17 | Hitachi Ltd | Write-back type cache system |
US5924115A (en) * | 1996-03-29 | 1999-07-13 | Interval Research Corporation | Hierarchical memory architecture for a programmable integrated circuit having an interconnect structure connected in a tree configuration |
JP2006072832A (en) * | 2004-09-03 | 2006-03-16 | Nec Access Technica Ltd | Image processing system |
CN101430664A (en) * | 2008-09-12 | 2009-05-13 | 中国科学院计算技术研究所 | Multiprocessor system and Cache consistency message transmission method |
US20110153979A1 (en) * | 2009-12-22 | 2011-06-23 | Nathanial Kiel Boyle | Modified b+ tree to store nand memory indirection maps |
CN107329734A (en) * | 2016-04-29 | 2017-11-07 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing convolutional neural networks forward operation |
US20180157967A1 (en) * | 2016-12-01 | 2018-06-07 | Via Alliance Semiconductor Co., Ltd. | Processor with memory array operable as either last level cache slice or neural network unit memory |
US20180267898A1 (en) * | 2015-10-08 | 2018-09-20 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Processor with selective data storage operable as either victim cache data storage or accelerator memory and having victim cache tags in lower level cache |
CN109739696A (en) * | 2018-12-13 | 2019-05-10 | 北京计算机技术及应用研究所 | Double-control storage array solid state disk cache acceleration method |
CN110516801A (en) * | 2019-08-05 | 2019-11-29 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput |
CN111126584A (en) * | 2019-12-25 | 2020-05-08 | 上海安路信息科技有限公司 | Data write-back system |
CN112100097A (en) * | 2020-11-17 | 2020-12-18 | 杭州长川科技股份有限公司 | Multi-test channel priority adaptive arbitration method and memory access controller |
Non-Patent Citations (1)
Title |
---|
SHAO ZHENG; XIE JING; WANG QIN; MAO ZHIGANG: "Write-back design for distributed registers in a digital signal processor", Microelectronics & Computer, no. 07, pages 30 - 33 *
Also Published As
Publication number | Publication date |
---|---|
CN112597079B (en) | 2023-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101252536B (en) | Router multi-queue data pack buffer management and output queue scheduling system | |
CN101692651A (en) | Method and device for Hash lookup table | |
CN208283943U (en) | A kind of CNN acceleration optimization device based on FPGA | |
CN110969198A (en) | Distributed training method, device, equipment and storage medium for deep learning model | |
CN102508803A (en) | Matrix transposition memory controller | |
CN109359729B (en) | System and method for realizing data caching on FPGA | |
CN104765701B (en) | Data access method and equipment | |
CN107800644A (en) | Dynamically configurable pipelined token bucket speed limiting method and device | |
CN102622323B (en) | Data transmission management method based on switch matrix in dynamic configurable serial bus | |
CN107391402A (en) | A kind of data operating method, device and a kind of data operation card | |
CN102999443B (en) | A kind of management method of Computer Cache system | |
CN106843803B (en) | A kind of full sequence accelerator and application based on merger tree | |
CN112597079B (en) | Data write-back system of convolutional neural network accelerator | |
CN112242963B (en) | Rapid high-concurrency neural pulse data packet distribution and transmission method and system | |
CN105095114A (en) | Management method for computer cache system | |
CN111126584B (en) | Data write-back system | |
CN104407992A (en) | Four-port memory based on dual-port RA (register array) | |
CN101976229A (en) | Data reading method, system and device for peripheral equipment of system | |
CN102571609A (en) | Recombination sequencing method of fast serial interface programmable communication interface-express (PCI-E) protocol completion with data (CplD) | |
CN102780620A (en) | Network processor and message processing method | |
CN105608046A (en) | Multi-core processor architecture based on MapReduce programming model | |
CN105049377B (en) | AFDX exchange datas bus structures and method for interchanging data based on Crossbar frameworks | |
CN108763421A (en) | A kind of data search method and system of logic-based circuit | |
CN101848091B (en) | Method and system for processing data search | |
CN114610231A (en) | Control method, system, equipment and medium for large-bit-width data bus segmented storage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
CB02 | Change of applicant information | Address after: 200434 Room 202, building 5, No. 500, Memorial Road, Hongkou District, Shanghai; Applicant after: Shanghai Anlu Information Technology Co.,Ltd. Address before: Room 501-504, building 9, Pudong Software Park, 498 GuoShouJing Road, Pudong New Area, Shanghai, 201203; Applicant before: ANLOGIC INFOTECH Co.,Ltd. |
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||