WO2019041833A1 - Compression apparatus used for deep neural network - Google Patents

Compression apparatus used for deep neural network

Info

Publication number
WO2019041833A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
deep neural
neural network
input buffer
Prior art date
Application number
PCT/CN2018/083880
Other languages
French (fr)
Chinese (zh)
Inventor
翁凯衡
韩银和
王颖
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所
Publication of WO2019041833A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention relates to acceleration of deep neural networks, and more particularly to data processing of deep neural networks.
  • The deep neural network can be understood as an operational model that contains a large number of data nodes; each data node is connected to other data nodes, and the connections between nodes are represented by weights.
  • As deep neural networks continue to develop, their complexity keeps increasing. Because computation with a deep neural network typically iterates over large amounts of data, memory must be accessed frequently, and a relatively high memory bandwidth is required to sustain the computation speed.
  • FIG. 1 shows a schematic diagram of the three-dimensional structure of the HMC. Unlike conventional 2D memory, the HMC uses a multi-layer stacked circuit structure in which parallel stacked chips are vertically linked by through-silicon vias (TSVs).
  • TSV: through-silicon via technology
  • The HMC includes a plurality of memory layers for storing data and a circuit logic layer that performs ordering, refresh, data routing, and error correction for the memory layers.
  • The set of chip layers stacked vertically over each unit area (i.e., each small square in FIG. 1) is called a vault, and each vault has, at the corresponding position of the circuit logic layer, a memory controller that manages the memory operations of that vault, for example controlling data transfers between vaults; this structure allows each vault to independently provide relatively high bandwidth.
  • NoC: Network on Chip
  • The computational acceleration units of the neural network can be integrated into the vaults of the HMC by exploiting the vaults' high bandwidth and low access latency.
  • In implementing this technique, however, many problems remain to be solved. Conventional 2D memory has neither the concept of a vault nor a logic layer; in other words, 2D memory contains no unit usable for computation. How to deploy the HMC within a deep neural network, and how to place the neural network's computational acceleration units in the HMC so that the 3D memory better serves the deep neural network, are therefore questions that must be considered.
  • The 3D memory can internally support very high data throughput, while the neural network computing units in the circuit logic layer can provide high computational performance; the on-chip network connecting the vaults and the neural network computing units must therefore offer high data throughput to satisfy the deep neural network's need to fetch data from memory frequently for computation.
  • Such massive data transfers impose a heavy burden on the on-chip network, causing its transmission latency to rise sharply and degrading system performance. Therefore, when deploying a solution such as integrating the neural network's computational acceleration units into the vaults of the HMC, how to relieve the resulting load on the on-chip network must also be considered.
  • An acceleration system for a deep neural network comprises: a 3D memory; a deep neural network computing unit connected to a memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor;
  • the memory controller of each vault transmits data over the on-chip network via the router connected to it;
  • the compressor is configured to compress deep-neural-network data that needs to be transmitted over the on-chip network, and the decompressor is configured to decompress deep-neural-network data received from the on-chip network.
  • Preferably, the compressor is disposed in the router, at the network interface of the on-chip network, or at the memory controller, and the decompressor is disposed in the router, at the network interface of the on-chip network, or in the deep neural network computing unit.
  • Preferably, the compressor is disposed at the input of the router connected to the 3D memory, and the decompressor is disposed at the output of the router connected to the deep neural network computing unit.
  • Preferably, the compressor is disposed between the memory controller and the router, and the decompressor is disposed between the router and the deep neural network computing unit.
  • A compressor for a deep neural network comprises:
  • an input buffer (11) for buffering matrix data of the deep neural network to be compressed;
  • a comparator (12) for reading an element from the input buffer (11) and determining whether the element is 0;
  • a counter (14) for recording the number of elements read from the input buffer (11);
  • a switch (15) that takes the output of the counter (14) as its input and the output of the comparator (12) as its control signal, and that passes the output of the counter (14) when the comparator (12) determines YES;
  • a first output buffer (13) for storing the element from the input buffer (11) when the comparator (12) determines NO, so as to obtain the data values of the matrix;
  • a second output buffer (16) for storing the output of the counter (14) provided by the switch (15), so as to obtain the column values of the matrix.
  • Preferably, the compressor further comprises:
  • a row offset calculating means for calculating, for each row of the data matrix, the position of that row's first non-zero element within all outputs of the first output buffer (13), so as to obtain the row offset values of the matrix; and
  • a third output buffer configured to store the row offset values.
  • Preferably, the length of the input buffer (11), the length of the first output buffer (13), and the length of the second output buffer (16) are greater than or equal to the number of rows of the matrix; the units of the input buffer (11) cache the rows of the matrix in parallel, and each unit of the input buffer (11) corresponds to one unit of the first output buffer (13) and one unit of the second output buffer (16);
  • furthermore, the length of the third output buffer is less than or equal to the number of rows of the matrix.
  • A decompressor for a deep neural network comprises:
  • a first input buffer (23) for buffering the data values of the matrix of the deep neural network to be decompressed;
  • a second input buffer (22) for buffering the column values of the matrix to be decompressed;
  • a third input buffer (21) for buffering the row offset values of the matrix to be decompressed;
  • a counter (25) for recording the number of elements read from the third input buffer (21);
  • a comparator (24) for comparing whether an element read from the third input buffer (21) is equal to the count of the counter (25);
  • a write controller (26) for storing elements from the first input buffer (23) and the second input buffer (22);
  • an output buffer (27) for determining, for each count of the counter (25) and when the comparator (24) determines YES, one row of the decompressed matrix from the elements stored in the write controller (26).
  • Preferably, the lengths of the first input buffer (23) and the second input buffer (22) are less than or equal to the total number of elements in the matrix, and the length of the third input buffer (21) is less than or equal to the number of rows of the matrix.
  • Preferably, the output buffer (27) is further configured to compute, for each count of the counter (25), the individual elements of one row of the decompressed matrix from the data-value elements stored in the write controller (26), the corresponding column-value elements, and the number of columns of the matrix.
  • The acceleration system for a deep neural network reduces the amount of data that must be transmitted and/or stored in the on-chip network by adding a compressor and a decompressor to the system, thereby relieving the load that combining 3D memory with deep neural network computing units imposes on the on-chip network and reducing data transmission latency.
  • A compressor and a decompressor dedicated to data processed in matrix form in a deep neural network are also provided; they automatically compress and decompress the data matrix by exploiting the sparsity of the neural network's data matrix.
  • HMC: hybrid memory cube memory
  • FIG. 2 is a schematic structural diagram of a prior art solution combining an HMC with a deep neural network computing unit;
  • FIG. 3 is a block diagram showing the arrangement of a compressor and a decompressor in the router of FIG. 2, in accordance with one embodiment of the present invention
  • FIG. 4(a) is a schematic structural view showing a compressor disposed between the memory controller and the router shown in FIG. 2 according to an embodiment of the present invention
  • FIG. 4(b) is a schematic structural view showing a decompressor disposed between the router and the computing unit shown in FIG. 2 according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of the process of performing a convolution operation on a data matrix (i.e., an image) and a convolution kernel matrix in a deep neural network;
  • FIG. 6 is a schematic diagram of the CSR encoding method suitable for compressing a sparse matrix;
  • FIG. 7 is a schematic structural diagram of a compressor for using data in a matrix form in a deep neural network according to an embodiment of the present invention.
  • FIG. 8 is a block diagram showing a structure of a decompressor for using data in a matrix form in a deep neural network according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a 3D memory-based deep neural network acceleration system in the prior art.
  • As shown in FIG. 2, deep neural network computing units are integrated on the logic layer of the 3D memory, and each deep neural network computing unit is connected, through a memory controller, to the local vault that contains that memory controller.
  • The memory controllers of different vaults exchange data over a common on-chip network, and the local vault's memory controller routes data to remote vaults through the routers of the on-chip network.
  • When a deep neural network computing unit begins a neural network computation, it sends a data request to the memory controller of its local vault; if the requested data is not located in the local vault, the request is injected into the connected on-chip network router and transmitted over the on-chip network to the router corresponding to the remote vault at the destination.
  • The router at the destination delivers the request to the memory controller of its remote vault; the required data is read from the HMC through that memory controller and injected into an on-chip network router, travels across the on-chip network back to the router that issued the request, and is provided to the corresponding deep neural network computing unit for computation.
  • The inventors recognized that the memory controllers of the vaults are independent of one another and that data must travel between vaults over the on-chip network, so suitable processing devices can be placed at the vaults themselves, for example a compressor for compressing data and a corresponding decompressor, to relieve the load that massive data transfers impose on the on-chip network.
  • Figure 3 illustrates an embodiment in which a compressor and a decompressor are disposed within a router of an on-chip network, where only components corresponding to one of the crossbars in the router are drawn for simplicity.
  • In FIG. 3, the compressor is connected to the multiplexer whose input comes from the local memory, so that the compressed data is output through the crossbar inside the router to the on-chip network (i.e., northward);
  • conversely, the decompressor is placed at the multiplexer whose input comes from outside the router, so that the received data is decompressed and the decompressed data is output to the deep neural network computing unit for computation.
  • Any suitable compressor and decompressor can be used here.
  • Preferably, the compressor and decompressor are ones suited to compressing sparse matrix data.
  • When data in the memory layers of the local vault of the 3D memory is read into the router corresponding to that vault (i.e., the input comes from the memory), the data packet first enters the corresponding virtual channel through the multiplexer. If the packet's tag bit indicates that it has not yet been compressed or that compression is required, the packet is routed through the multiplexer into the compressor, compressed, and its tag bit is updated to indicate that it has been compressed or that no compression is needed; when the packet passes through the virtual channel again, the multiplexer feeds it, according to its tag bit, into the crossbar so that it is output from the router into the on-chip network.
  • When the on-chip network delivers data from the router of a remote vault into this router (i.e., the input comes from the north), the packet first passes through the multiplexer, virtual channel, and crossbar to the multiplexer on the other side.
  • That multiplexer checks the packet's tag bit to determine whether the packet has been compressed and whether decompression is required, and thus whether the packet must be transferred to the decompressor. A packet transferred into the decompressor is decompressed there and its tag bit is modified; when the packet passes through the multiplexer again, the multiplexer outputs it from the router to the deep neural network computing unit according to its tag bit.
  • FIG. 4 shows another embodiment of the compressor and decompressor placement: in FIG. 4(a) the compressor is placed at the vault's memory controller (that is, between the memory controller and the router), and in FIG. 4(b) the decompressor is placed between the vault's router and the deep neural network computing unit.
  • In this embodiment, the compressor compresses the data read from memory so that it can be transmitted to a remote vault through the router and the on-chip network, whereas the decompressor decompresses the data that the router receives from the on-chip network, restoring the data matrix used by the deep neural network computing unit for computation.
  • Referring to FIG. 4(a), after the memory controller reads a data packet from the memory layers of its vault, the multiplexer uses the packet's tag bit to identify packets that need compression and transfers them to the compressor for compression; the tag bits of the compressed packets are modified, and the resulting packets are handed to the router for transmission to the remote vault.
  • Referring to FIG. 4(b), when the router receives a data packet from the on-chip network, the multiplexer uses the packet's tag bit to route packets that need decompression to the decompressor for decompression; the tag bits of the decompressed packets are modified, and the resulting packets are handed to the deep neural network computing unit for computation.
  • It should be understood that FIGS. 3 and 4 show only two embodiments of the present invention; the compressor and decompressor may also be placed, as needed, at other suitable locations in the deep neural network acceleration system shown in FIG. 2.
  • According to one embodiment of the invention, the compressor can be placed inside a router of the on-chip network.
  • The router can then adaptively decide whether or not to perform compression, based on the current network transmission conditions in the router and/or the sparsity of the data packet (i.e., the proportion of the data whose value is "0").
  • For example, thresholds are set separately for the fluency of network transmission and for the sparsity of the data; if the fluency of the current network transmission exceeds its threshold and the sparsity of the data exceeds its threshold, the data packets to be routed by the router are compressed.
  • According to a further embodiment of the invention, the compressor can be placed at the network interface of the on-chip network.
  • The data can be compressed while the data content is being packed into a data packet, or after packetization and before the packet is injected into the on-chip network router, in which case the data is compressed and encapsulated into a new packet. After compression, the resulting packet (the packet whose content was compressed, or the newly generated packet) is injected into the on-chip network router and awaits transmission. In this way, the size or number of packets to be transmitted can be reduced, avoiding additional burden on the on-chip network routers.
  • According to yet another embodiment of the invention, the compressor can be placed at a memory controller of the 3D memory.
  • Data read from memory can then be compressed directly, for example by compressing the data matrix itself, and the compressed data content is subsequently encapsulated into packets for routing.
  • This approach saves the time needed to move data from the memory controller to the network interface, but it makes it harder to hide the compression latency through pipelining.
  • As with the compressor, the decompressor can be placed inside a router of the on-chip network, at a network interface of the on-chip network, or in the deep neural network computing unit.
  • Preferably, the location of the decompressor is determined according to the type of data compressed by the compressor; for example, when the compressor is placed inside an on-chip network router, the decompressor can likewise be placed inside an on-chip network router.
  • Other embodiments of the invention also provide concrete implementations of the compressor and decompressor.
  • The data used for computation in a deep neural network has its own characteristic format: it is usually organized as matrices to facilitate the computation.
  • Deep neural network data also tends to be highly sparse, that is, the matrices contain a large number of elements whose value is "0".
  • The inventors therefore considered that a compressor and a decompressor dedicated to deep neural networks can be designed around these characteristics.
  • FIG. 5 shows an example of a simplified convolution computation, in which the data matrix is 5×5 and the convolution kernel matrix is 3×3; in a typical deep neural network, convolving the data matrix with the convolution kernel matrix is the principal operation of the computing unit.
  • Encoding schemes that compress sparse matrix data, such as CSR and CSC, can therefore be used in accelerating the deep neural network.
  • FIG. 6 illustrates the principle of the CSR encoding scheme. If the matrix to be compressed is the 5×8 matrix shown in the figure, all of its non-zero elements are collected one by one into the data values, traversing from the first row to the last row and, within each row, from left to right, giving the data values 1, 5, 2, 4, 3, 6, 7, 8, and 9. The column in which each of these data values lies in the matrix is also extracted: element 1 is in column 1, so its column value is 1; element 5 is in column 3, so its column value is 3; and so on, giving the column values 1, 3, 3, 5, 1, 6, 3, 8, and 4.
  • In addition, the first non-zero element (from the left) of each row is 1, 2, 3, 7, and 9 respectively, and the position of each of these elements within the data values is taken as the row offset: element 1 is the first entry of the data values, so its row offset is 1; element 2 is the third entry of the data values, so its row offset is 3; and so on, giving a number of row offset values equal to the number of rows of the matrix, namely 1, 3, 5, 7, and 9. A minimal code sketch of this encoding follows.
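The CSR layout just described can be captured in a few lines of code. The following is a minimal, non-authoritative sketch and not part of the patent text: the function name csr_encode is an illustrative assumption, the example matrix is reconstructed from the data values, column values, and row offsets quoted above, and the 1-based indexing follows the FIG. 6 convention.

```python
def csr_encode(matrix):
    """Encode a dense matrix as (data values, column values, row offsets).

    Follows the FIG. 6 convention: 1-based column indices, and each row offset
    is the position, within the data values, of that row's first non-zero element.
    """
    values, columns, row_offsets = [], [], []
    for row in matrix:
        row_offsets.append(len(values) + 1)        # where this row's first non-zero will land
        for col, element in enumerate(row, start=1):
            if element != 0:                        # only non-zero elements are kept
                values.append(element)
                columns.append(col)
    return values, columns, row_offsets

# 5x8 matrix reconstructed from the values/columns/offsets quoted in the text
example = [
    [1, 0, 5, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 4, 0, 0, 0],
    [3, 0, 0, 0, 0, 6, 0, 0],
    [0, 0, 7, 0, 0, 0, 0, 8],
    [0, 0, 0, 9, 0, 0, 0, 0],
]
print(csr_encode(example))
# ([1, 5, 2, 4, 3, 6, 7, 8, 9], [1, 3, 3, 5, 1, 6, 3, 8, 4], [1, 3, 5, 7, 9])
```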
  • On this basis, the present invention provides the structure of a specific compressor.
  • This compressor fits naturally with the existing matrix-based computation of data in a deep neural network while compressing the data by means of CSR.
  • FIG. 7 illustrates a compressor according to one embodiment of the present invention, comprising an input buffer 11, a comparator 12, a counter 14, a switch 15, a left-hand output buffer 13, and a right-hand output buffer 16.
  • The input buffer 11 is connected to the comparator 12 so that non-zero elements of the input buffer are stored, via the comparator 12, into the left-hand output buffer 13; when an element of the input buffer is 0, the comparator 12 generates a control signal for the switch 15, and when that control signal indicates storage, the count of the counter 14 is stored into the right-hand output buffer 16. The counter counts up each time an element is read into the comparator 12.
  • The counter 14 here may also be connected to the input buffer 11 in addition to the comparator 12, as long as the counter 14 can record the number of elements read from the input buffer 11.
  • The input buffer 11 is a multi-bit register used to buffer the data matrix (or part of such a matrix) of the neural network computation that is to be compressed; the lengths of the input buffer 11, the left-hand output buffer 13, and the right-hand output buffer 16 can be chosen according to the size of the data matrix.
  • For example, for a matrix with a rows and b columns, a input buffers 11 may be provided (one per row), each of length less than or equal to b, together with a left-hand output buffers 13 and a right-hand output buffers 16, each likewise of length less than or equal to b.
  • Each input buffer 11 is assigned to one row of the data matrix, and each input buffer 11 corresponds to one left-hand output buffer 13 and one right-hand output buffer 16.
  • The input buffers 11 read their data at the same rate, and the counter 14 counts the number of elements that have been read from the input buffers 11.
  • For example, when the count of the counter 14 is 1, the first to fifth input buffers 11 read the elements 1, 0, 3, 0, and 0 respectively (the first element of each row); when the count of the counter 14 is 2, they read the elements 0, 0, 0, 0, and 0; and so on.
  • The compression of the 5×8 data matrix shown in FIG. 6 proceeds as follows, taking the first row (and its input buffer 11) as an example.
  • When the count of the counter 14 is 1, the input buffer 11 reads the element 1; the comparator 12 connected to that input buffer 11 determines that the element 1 is not 0, and the element 1 is stored in the left-hand output buffer 13 corresponding to that input buffer 11.
  • When the count of the counter 14 is 2, the input buffer 11 reads the element 0; the comparator 12 connected to that input buffer 11 determines that the element is equal to 0 and sends a control signal to the corresponding switch 15, so that the switch 15 writes the count 2 of the counter 14 into the right-hand output buffer 16.
  • When the count of the counter 14 is 3, the input buffer 11 reads the element 5, and so on.
  • After compression, the contents stored in the parallel left-hand output buffers 13 can be combined to form the data values of the data matrix, and the contents stored in the parallel right-hand output buffers 16 can be combined to form the column values of the data matrix. Since the position of each left-hand output buffer 13 and right-hand output buffer 16 in the buffer queue determines which row of the data matrix it corresponds to, the row offset of each row of the data matrix can be determined by finding the position, within the data values, of the first non-zero element held in that row's left-hand output buffer 13 (the row offset computing device that performs this comparison is not shown in FIG. 7).
  • A third type of output buffer may be added to buffer the resulting row offset values; the maximum length of this third output buffer depends on the number of rows of the data matrix.
  • In this way, the compressor performs a CSR compression of the contents of the data matrix, yielding the data values, column values, and row offset values of the data matrix. A behavioral sketch of this process is given below.
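To make the dataflow above concrete, here is a minimal, non-authoritative Python model of the parallel row lanes (one input buffer 11, comparator 12, counter 14, switch 15, and pair of output buffers 13/16 per row). It is a sketch only: the function and variable names are illustrative, and, so that the outputs match the CSR data values, column values, and row offsets of FIG. 6, the counter value is recorded into the column-value buffer for non-zero elements.

```python
def compress_matrix(matrix):
    """Behavioral model of the FIG. 7 compressor, one row per lane."""
    left_buffers = []    # one output buffer 13 per row: non-zero data values
    right_buffers = []   # one output buffer 16 per row: column positions (counter values)
    for row in matrix:                        # each row has its own input buffer 11
        left, right = [], []
        counter = 0                           # counter 14 for this lane
        for element in row:                   # comparator 12 examines each element
            counter += 1
            if element != 0:
                left.append(element)          # data value into output buffer 13
                right.append(counter)         # switch 15 passes the count into buffer 16
        left_buffers.append(left)
        right_buffers.append(right)

    # Combine the per-row buffers into CSR-style outputs; row offsets are the 1-based
    # positions of each row's first non-zero element within the combined data values.
    values, columns, row_offsets = [], [], []
    for left, right in zip(left_buffers, right_buffers):
        row_offsets.append(len(values) + 1)
        values.extend(left)
        columns.extend(right)
    return values, columns, row_offsets
```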
  • FIG. 8 illustrates, according to an embodiment of the present invention, a decompressor for decompressing data compressed in CSR form, comprising an input buffer 21 for row offset values, an input buffer 22 for column values, an input buffer 23 for data values, a comparator 24, a counter 25, a write controller 26, and an output buffer 27.
  • The input buffer 21 for row offset values feeds the counter 25, which counts the row offset values read, and also provides its buffered content to the comparator 24. Based on its count, the counter 25 provides control signals to the input buffer 22 for column values and the input buffer 23 for data values so that they deliver their data row by row to the write controller 26, while the control signal obtained by comparing, in the comparator 24, the count of the counter 25 with the content of the input buffer 21 instructs the output buffer 27 to fetch data row by row from the write controller 26.
  • Specifically, the input buffers 21, 22, and 23 for row offset values, column values, and data values buffer, respectively, the row offset values, column values, and data values of the to-be-decompressed data obtained from the on-chip network while they await decompression.
  • The counter 25 counts the number of entries read from the input buffer 21 for row offset values, provides a control signal to the input buffer 22 for column values and the input buffer 23 for data values according to this count, and also provides the count to the comparator 24. Since each row offset value corresponds to one row of the original data matrix, each value that the counter 25 reads from the input buffer 21 corresponds to decompressing one row of the original data matrix.
  • While the row being decompressed has not changed, the counter 25 generates a corresponding control signal instructing the column-value and data-value input buffers 22 and 23 to provide column values and data values to the write controller 26.
  • The write controller 26 temporarily stores the column values and data values received from the input buffer 22 for column values and the input buffer 23 for data values, and writes the stored column values and data values into the output buffer 27 when the output buffer 27 performs a read.
  • The comparator 24 compares the content of the input buffer 21 for row offset values with the count of the counter 25; if the two are equal, it generates a corresponding control signal that causes the output buffer 27 to read the data from the write controller 26.
  • Since each count of the counter 25 corresponds to decompressing one row of data, connection schemes other than connecting the counter 25 to the column-value and data-value input buffers 22 and 23 may also be adopted, as long as the output buffer 27 can distinguish the individual rows of the data matrix according to the count.
  • the length of each buffer in the decompressor may be determined according to the size of the data matrix and/or the size of the convolution kernel in the deep neural network.
  • For example, for the 5×8 data matrix of FIG. 6, the lengths of the input buffers 22 and 23 for column values and data values can be set to 40 data entries (that is, at most as many entries as there are elements in the data matrix), the length of the input buffer 21 for row offsets can be set to 5 data entries (i.e., the number of rows of the data matrix), and the length of the output buffer 27 can be set to 8 data entries (if one row of decompressed data is output at a time) or to 40 data entries (if the complete decompressed data matrix is output at once).
  • Suppose the decompressor is to decompress the data values, column values, and row offsets shown in FIG. 6 so as to restore the original 5×8 matrix.
  • The row offset values are output from the input buffer 21 for row offsets in the order 1, 3, 5, 7, and 9.
  • First, the counter 25 and the comparator 24 read the first row offset value 1; the count of the counter 25 becomes 1, indicating that the first row of the original data matrix is being decompressed until the count changes to 2.
  • The input buffers 22 and 23 for column values and data values receive control signals instructing them to write "1, 3" and "1, 5" respectively into the write controller 26.
  • The comparator 24 compares the current row offset value 1 with the count 1 of the counter 25; since they are equal, the comparator 24 sends a control signal to the output buffer 27, instructing it to restore the first row of the data matrix to "1, 0, 5, 0, 0, 0, 0, 0" from the data values "1, 5" and the column values "1, 3" held in the write controller 26 together with the number of columns of the matrix, and to output the decompressed first row of data.
  • Next, the counter 25 and the comparator 24 read the second row offset value 3, and the counter 25 updates its count to 2 to begin decompressing the second row of the data matrix. The foregoing steps are repeated in the same way until all five rows of data in the original data matrix have been decompressed. A minimal code sketch of this row-by-row reconstruction is given below.
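The row-by-row reconstruction performed by the decompressor can be modelled compactly. The following is a minimal, non-authoritative sketch (the function name and the 1-based index handling are illustrative assumptions); it rebuilds each row from the data values, column values, row offsets, and the number of columns of the matrix, as the output buffer 27 does.

```python
def csr_decode(values, columns, row_offsets, num_cols):
    """Rebuild the dense matrix row by row from 1-based CSR-style arrays."""
    num_rows = len(row_offsets)
    matrix = []
    for r in range(num_rows):
        start = row_offsets[r] - 1                                    # this row's first non-zero
        end = row_offsets[r + 1] - 1 if r + 1 < num_rows else len(values)
        row = [0] * num_cols                                          # start from an all-zero row
        for k in range(start, end):
            row[columns[k] - 1] = values[k]                           # place value in its column
        matrix.append(row)
    return matrix

first_row = csr_decode([1, 5, 2, 4, 3, 6, 7, 8, 9],
                       [1, 3, 3, 5, 1, 6, 3, 8, 4],
                       [1, 3, 5, 7, 9], 8)[0]
print(first_row)  # [1, 0, 5, 0, 0, 0, 0, 0], as in the FIG. 6 walk-through
```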
  • The compressor and decompressor of FIGS. 7 and 8 of the present invention are particularly suitable for compressing data matrices in deep neural networks, since such matrices are usually large and highly sparse, that is, elements with the value 0 account for a considerable proportion. It will be appreciated that the compressor and decompressor provided by the present invention are suitable not only for schemes that combine deep neural network computing units with 3D memory, but also for data matrices in other deep neural network settings.
  • the present invention provides a scheme for compressing and decompressing data content in a deep neural network to reduce the amount of data that needs to be transmitted and/or stored in the network.
  • a specific deployment scheme for the compressor and the decompressor is provided, which relieves the load pressure of the on-chip network, thereby reducing the delay of data transmission.
  • A compressor and a decompressor dedicated to data processed in matrix form in a deep neural network are also provided; they automatically compress and decompress the data matrix by exploiting the sparsity of the neural network's data matrix.
  • HBM: high bandwidth memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided is an acceleration system for a deep neural network. The system comprises: a 3D memory; a deep neural network computing unit connected to a memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor. The memory controller of each vault transmits data over the network-on-chip via the router connected to it. The compressor is used to compress data for the deep neural network that needs to be transmitted over the network-on-chip, and the decompressor is used to decompress data for the deep neural network that arrives from the network-on-chip.

Description

Compression apparatus for a deep neural network
Technical Field
The present invention relates to the acceleration of deep neural networks, and more particularly to data processing for deep neural networks.
Background Art
With the development of artificial intelligence technology, techniques involving deep neural networks, and convolutional neural networks in particular, have advanced rapidly in recent years and have been widely applied in fields such as image recognition, speech recognition, natural language understanding, weather prediction, gene expression, content recommendation, and intelligent robotics. A deep neural network can be understood as an operational model containing a large number of data nodes; each data node is connected to other data nodes, and the connections between nodes are represented by weights. As deep neural networks continue to develop, their complexity keeps increasing. Because computation with a deep neural network typically iterates over large amounts of data, memory must be accessed frequently, and a relatively high memory bandwidth is required to sustain the computation speed.
To accelerate deep neural networks, some prior art proposes improving the memory itself and applying the improved memory to deep neural networks. For example, the article "Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory", published by Duckhwan Kim et al. at ISCA in 2016, proposes a convolutional neural network acceleration system based on the Hybrid Memory Cube (HMC). The HMC is a new 3D memory structure with large storage capacity and low on-chip access latency, and is therefore regarded by Duckhwan Kim et al. as a potential storage-and-compute substrate for convolutional neural networks.
FIG. 1 is a schematic diagram of the three-dimensional structure of the HMC. Unlike conventional 2D memory, the HMC uses a multi-layer stacked circuit structure in which parallel stacked chips are vertically linked by through-silicon vias (TSVs). The HMC includes a plurality of memory layers for storing data and one circuit logic layer that performs ordering, refresh, data routing, and error correction for the memory layers. The set of chip layers stacked vertically over each unit area (i.e., each small square in FIG. 1) is called a vault, and each vault has, at the corresponding position of the circuit logic layer, a memory controller that manages the memory operations of that vault, for example controlling data transfers between vaults. This structure allows each vault to independently provide relatively high bandwidth. For convenience of description, the system that transfers data between the vaults of the 3D memory is referred to abstractly as the Network on Chip (NoC).
The article by Duckhwan Kim et al. proposes integrating the computational acceleration units of the neural network into the vaults of the HMC by exploiting the vaults' high bandwidth and low access latency. However, many problems remain to be solved in implementing this technique. Conventional 2D memory has neither the concept of a vault nor a logic layer; in other words, 2D memory contains no unit usable for computation. How to deploy the HMC within a deep neural network, and how to place the neural network's computational acceleration units in the HMC so that the 3D memory better serves the deep neural network, are therefore questions that must be considered.
On the other hand, the 3D memory can internally support very high data throughput, while the neural network computing units in the circuit logic layer can provide very high computational performance. The on-chip network connecting the vaults and the neural network computing units must therefore offer high data throughput to satisfy the deep neural network's need to fetch data from memory frequently for computation. However, such massive data transfers impose a heavy burden on the on-chip network, causing its transmission latency to rise sharply and degrading system performance. Consequently, when deploying a solution such as integrating the neural network's computational acceleration units into the vaults of the HMC, how to relieve the resulting load on the on-chip network must also be considered.
Summary of the Invention
It is therefore an object of the present invention to overcome the above-described deficiencies of the prior art and to provide an acceleration system for a deep neural network, comprising: a 3D memory; a deep neural network computing unit connected to a memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor;
wherein the memory controller of each vault transmits data over the on-chip network via the router connected to it; and
wherein the compressor is configured to compress deep-neural-network data that needs to be transmitted over the on-chip network, and the decompressor is configured to decompress deep-neural-network data received from the on-chip network.
Preferably, in the system, the compressor is disposed in the router, at the network interface of the on-chip network, or at the memory controller, and the decompressor is disposed in the router, at the network interface of the on-chip network, or in the deep neural network computing unit.
Preferably, in the system, the compressor is disposed at the input of the router connected to the 3D memory, and the decompressor is disposed at the output of the router connected to the deep neural network computing unit.
Preferably, in the system, the compressor is disposed between the memory controller and the router, and the decompressor is disposed between the router and the deep neural network computing unit.
A compressor for a deep neural network comprises:
an input buffer (11) for buffering matrix data of the deep neural network to be compressed;
a comparator (12) for reading an element from the input buffer (11) and determining whether the element is 0;
a counter (14) for recording the number of elements read from the input buffer (11);
a switch (15) that takes the output of the counter (14) as its input and the output of the comparator (12) as its control signal, and that passes the output of the counter (14) when the comparator (12) determines YES;
a first output buffer (13) for storing the element from the input buffer (11) when the comparator (12) determines NO, so as to obtain the data values of the matrix; and
a second output buffer (16) for storing the output of the counter (14) provided by the switch (15), so as to obtain the column values of the matrix.
Preferably, the compressor further comprises:
a row offset calculating means for calculating, for each row of the data matrix, the position of that row's first non-zero element within all outputs of the first output buffer (13), so as to obtain the row offset values of the matrix; and
a third output buffer for storing the row offset values.
Preferably, in the compressor, the length of the input buffer (11), the length of the first output buffer (13), and the length of the second output buffer (16) are greater than or equal to the number of rows of the matrix; the units of the input buffer (11) cache the rows of the matrix in parallel, and each unit of the input buffer (11) corresponds to one unit of the first output buffer (13) and one unit of the second output buffer (16);
furthermore, the length of the third output buffer is less than or equal to the number of rows of the matrix.
A decompressor for a deep neural network comprises:
a first input buffer (23) for buffering the data values of the matrix of the deep neural network to be decompressed;
a second input buffer (22) for buffering the column values of the matrix to be decompressed;
a third input buffer (21) for buffering the row offset values of the matrix to be decompressed;
a counter (25) for recording the number of elements read from the third input buffer (21);
a comparator (24) for comparing whether an element read from the third input buffer (21) is equal to the count of the counter (25);
a write controller (26) for storing elements from the first input buffer (23) and the second input buffer (22);
an output buffer (27) for determining, for each count of the counter (25) and when the comparator (24) determines YES, one row of the decompressed matrix from the elements stored in the write controller (26).
Preferably, in the decompressor, the lengths of the first input buffer (23) and the second input buffer (22) are less than or equal to the total number of elements in the matrix, and the length of the third input buffer (21) is less than or equal to the number of rows of the matrix.
Preferably, in the decompressor, the output buffer (27) is further configured to compute, for each count of the counter (25), the individual elements of one row of the decompressed matrix from the data-value elements stored in the write controller (26), the corresponding column-value elements, and the number of columns of the matrix.
Compared with the prior art, the advantages of the present invention are as follows:
An acceleration system for a deep neural network is provided which, by adding a compressor and a decompressor to the system, reduces the amount of data that must be transmitted and/or stored in the on-chip network, thereby relieving the load that combining 3D memory with deep neural network computing units imposes on the on-chip network and reducing data transmission latency. Moreover, the present invention also provides a compressor and a decompressor dedicated to data processed in matrix form in a deep neural network, which automatically compress and decompress the data matrix by exploiting the sparsity of the neural network's data matrix.
Brief Description of the Drawings
Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of the multi-layer structure of a hybrid memory cube (HMC) memory in the prior art;
FIG. 2 is a schematic structural diagram of a prior art solution combining an HMC with deep neural network computing units;
FIG. 3 is a schematic structural diagram in which a compressor and a decompressor are arranged in the router of FIG. 2, according to an embodiment of the present invention;
FIG. 4(a) is a schematic structural diagram in which a compressor is arranged between the memory controller and the router shown in FIG. 2, according to an embodiment of the present invention;
FIG. 4(b) is a schematic structural diagram in which a decompressor is arranged between the router and the computing unit shown in FIG. 2, according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the process of performing a convolution operation on a data matrix (i.e., an image) and a convolution kernel matrix in a deep neural network;
FIG. 6 is a schematic diagram of the CSR encoding method suitable for compressing a sparse matrix;
FIG. 7 is a schematic structural diagram of a compressor for data in matrix form in a deep neural network, according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a decompressor for data in matrix form in a deep neural network, according to an embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the drawings and specific embodiments.
FIG. 2 is a schematic structural diagram of a prior art 3D-memory-based deep neural network acceleration system. As shown in FIG. 2, deep neural network computing units are integrated on the logic layer of the 3D memory, and each deep neural network computing unit is connected, through a memory controller, to the local vault that contains that memory controller. The memory controllers of different vaults exchange data over a common on-chip network, and the local vault's memory controller routes data to remote vaults through the routers of the on-chip network.
Referring to FIG. 2, when a deep neural network computing unit begins a neural network computation, it sends a data request to the memory controller of its local vault. If the requested data is not located in the local vault, the request is injected into the connected on-chip network router and transmitted over the on-chip network to the router corresponding to the remote vault at the destination. The router at the destination delivers the request to the memory controller of its remote vault; the required data is read from the HMC through that memory controller and injected into the on-chip network router, travels across the on-chip network back to the router that issued the request, and is provided to the corresponding deep neural network computing unit for computation.
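For orientation only, the following is a minimal, non-authoritative sketch of the request flow just described; the patent specifies only the behaviour, so the object and method names (local_vault.holds, noc.route_request, and so on) are illustrative assumptions rather than a real API.

```python
def fetch_data(address, local_vault, noc):
    """Serve a data request issued by a deep neural network computing unit."""
    if local_vault.holds(address):
        return local_vault.read(address)          # data resides in the local vault
    # Otherwise the request is injected into the connected router and carried over
    # the on-chip network to the remote vault at the destination; that vault's
    # memory controller reads the data, which then travels back over the on-chip
    # network to the requesting router and on to the computing unit.
    remote_vault = noc.route_request(address)
    return noc.return_data(remote_vault.read(address))
```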
The inventors recognized that the memory controllers of the vaults in the 3D memory are independent of one another and that data must be transferred between vaults over the on-chip network; the characteristics of the vaults themselves can therefore be exploited to place suitable processing devices, for example a compressor for compressing data and a corresponding decompressor, in order to relieve the load that massive data transfers impose on the on-chip network.
Two concrete examples of how the compressor and decompressor for a deep neural network can be arranged according to the present invention are described below.
FIG. 3 shows an embodiment in which the compressor and decompressor are placed inside a router of the on-chip network; for simplicity, only the components corresponding to one crossbar of the router are drawn. As can be seen in FIG. 3, the compressor is connected to the multiplexer whose input comes from the local memory, so that the compressed data is output through the crossbar inside the router to the on-chip network (i.e., northward); conversely, the decompressor is placed at the multiplexer whose input comes from outside the router, so that the received data is decompressed and the decompressed data is output to the deep neural network computing unit for computation. Any suitable compressor and decompressor can be used here; preferably, they are a compressor and decompressor suited to compressing sparse matrix data.
When data in the memory layers of a local vault of the 3D memory is read into the router corresponding to that vault (i.e., the input comes from the memory), the data packet first enters the corresponding virtual channel through the multiplexer. If the packet's tag bit indicates that it has not yet been compressed or that compression is required, the packet is routed through the multiplexer into the compressor, compressed, and its tag bit is updated to indicate that it has been compressed or that no compression is needed. When the packet passes through the virtual channel again, the multiplexer feeds it, according to its tag bit, into the crossbar so that it is output from the router into the on-chip network.
When the on-chip network delivers data from the router of a remote vault into this router (i.e., the input comes from the north), the packet first passes through the multiplexer, virtual channel, and crossbar to the multiplexer on the other side. That multiplexer checks the packet's tag bit to determine whether the packet has been compressed and whether decompression is required, and thus whether the packet must be transferred to the decompressor. A packet transferred into the decompressor is decompressed there and its tag bit is modified; when the packet passes through the multiplexer again, the multiplexer outputs it from the router to the deep neural network computing unit according to its tag bit.
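As a rough illustration of this tag-bit-driven flow, here is a minimal, non-authoritative sketch; the Packet dataclass and the two handler functions are assumptions made for the example, not structures defined by the patent, and the compress/decompress callables stand in for whatever compressor and decompressor are used.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    payload: bytes
    compressed: bool  # the packet's tag bit

def handle_input_from_memory(pkt: Packet, compress) -> Packet:
    # Packet read from the local vault's memory layers: if the tag bit says it is
    # not yet compressed, compress it and update the tag bit before it is sent
    # through the crossbar into the on-chip network.
    if not pkt.compressed:
        pkt.payload = compress(pkt.payload)
        pkt.compressed = True
    return pkt

def handle_input_from_network(pkt: Packet, decompress) -> Packet:
    # Packet arriving from a remote vault over the on-chip network: if the tag bit
    # says it is compressed, decompress it and update the tag bit before it is
    # delivered to the deep neural network computing unit.
    if pkt.compressed:
        pkt.payload = decompress(pkt.payload)
        pkt.compressed = False
    return pkt
```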
FIG. 4 shows another embodiment of the compressor and decompressor placement: in FIG. 4(a) the compressor is placed at the vault's memory controller (that is, between the memory controller and the router), and in FIG. 4(b) the decompressor is placed between the vault's router and the deep neural network computing unit. In the embodiment of FIG. 4, the compressor compresses the data read from memory so that it can be transmitted to a remote vault through the router and the on-chip network, whereas the decompressor decompresses the data that the router receives from the on-chip network, restoring the data matrix used by the deep neural network computing unit for computation.
Referring to FIG. 4(a), after the memory controller reads a data packet from the memory layers of its vault, the multiplexer uses the packet's tag bit to identify packets that need compression and transfers them to the compressor for compression; the tag bits of the compressed packets are modified, and the resulting packets are handed to the router for transmission to the remote vault.
Referring to FIG. 4(b), when the router receives a data packet from the on-chip network, the multiplexer uses the packet's tag bit to route packets that need decompression to the decompressor for decompression; the tag bits of the decompressed packets are modified, and the resulting packets are handed to the deep neural network computing unit for computation.
It should be understood that FIGS. 3 and 4 show only two embodiments of the present invention; the compressor and decompressor may also be placed, as needed, at other suitable locations in the deep neural network acceleration system shown in FIG. 2.
According to one embodiment of the present invention, the compressor may be placed inside a router of the on-chip network. The router may then adaptively decide whether or not to perform compression, based on the current network transmission conditions in the router and/or the sparsity of the data packet (i.e., the proportion of the data whose value is "0"). For example, thresholds are set separately for the fluency of network transmission and for the sparsity of the data; if the fluency of the current network transmission exceeds its threshold and the sparsity of the data exceeds its threshold, the data packets to be routed by the router are compressed.
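A minimal, non-authoritative sketch of such an adaptive decision is shown below; the metric names and the threshold values are illustrative assumptions, since the patent only states that thresholds are set for the fluency of network transmission and for the sparsity of the data.

```python
def sparsity(elements):
    """Proportion of zero-valued elements in a packet's data."""
    return sum(1 for e in elements if e == 0) / len(elements) if elements else 0.0

def should_compress(transmission_fluency, zero_ratio,
                    fluency_threshold=0.8, sparsity_threshold=0.5):
    """Decide whether a packet waiting in the router should be compressed.

    transmission_fluency: measure of current on-chip network transmission (assumed 0..1)
    zero_ratio:           fraction of elements in the packet whose value is "0"
    """
    return transmission_fluency > fluency_threshold and zero_ratio > sparsity_threshold
```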
According to a further embodiment of the invention, the compressor may be placed at a network interface of the on-chip network. The data can be compressed while its content is being packed into a packet, or after packing is finished and before the packet is injected into a router of the on-chip network, in which case the compressed data is encapsulated as a new packet. After compression, the resulting packet (either the packet whose content was compressed or the newly generated compressed packet) is injected into a router of the on-chip network to await transmission. In this way the size or number of packets to be transmitted is reduced, so that no extra burden is placed on the on-chip network routers.
According to yet another embodiment of the invention, the compressor may be placed at the memory controller of the 3D memory. The data read out of memory can then be compressed directly, for example by compressing the data matrix itself, after which the compressed content is encapsulated into packets for routing. This saves the time needed to transfer the data from the memory controller to the network interface; on the other hand, it is difficult in this arrangement to hide the compression latency through pipelining.
In the present invention, and analogously to the compressor, the decompressor may be placed inside a router of the on-chip network, at a network interface of the on-chip network, or in the deep neural network computing unit. Preferably, the location of the decompressor is chosen according to the type of data compressed by the compressor. For example, if the compressor is placed inside a router of the on-chip network, the decompressor may likewise be placed inside a router of the on-chip network.
Other embodiments of the invention provide concrete implementations of the compressor and decompressor. As described above, the data used for computation in a deep neural network has its own characteristic format: it is usually organized as matrices to make the operations convenient. Moreover, deep neural network data tends to be highly sparse, i.e., the matrices contain a large number of elements whose value is "0". The inventors therefore consider that a compressor and decompressor dedicated to deep neural networks can be designed around these properties.
Through their research the inventors found that, in a typical deep neural network, the dominant operation of the computing unit is the convolution of a data matrix with a convolution-kernel matrix. Figure 5 shows a simplified example of this convolution, in which the data matrix is 5×5 and the kernel matrix is 3×3. Referring to Figure 5(a), the computation first takes the elements in rows 1-3 and columns 1-3 of the data matrix as a data sub-matrix of the same size as the kernel; each element of this sub-matrix is multiplied by the element at the corresponding position in the kernel, and the products are accumulated (giving "4" here) to form the element in row 1, column 1 of the convolution result. Then, referring to Figure 5(b), the elements in rows 1-3 and columns 2-4 are taken as the next data sub-matrix and the step is repeated, and so on, until all sub-matrices have been processed and the 3×3 convolution result matrix is obtained.
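As an illustration of this sliding-window accumulation, the following sketch (a plain Python loop, not the hardware computing unit) computes a valid convolution of a 5×5 data matrix with a 3×3 kernel, producing the 3×3 result described above:

```python
def conv2d_valid(data, kernel):
    """Slide the kernel over the data matrix; each output element is the
    sum of element-wise products of the kernel with one data sub-matrix."""
    n, m = len(data), len(data[0])
    k = len(kernel)
    out = [[0] * (m - k + 1) for _ in range(n - k + 1)]
    for i in range(n - k + 1):
        for j in range(m - k + 1):
            out[i][j] = sum(data[i + a][j + b] * kernel[a][b]
                            for a in range(k) for b in range(k))
    return out
```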
In a computational data matrix of a deep neural network such as the one in Figure 5, the activation functions the network relies on (for example, sigmoid) produce many values of "0" during computation, and the pruning performed during computation further increases the amount of data whose value is "0". Here we refer to this abundance of zeros as the "sparsity" of matrix data in deep neural networks. That such matrix data is sparse has been demonstrated in the prior art, for example in "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing" by Jorge Albericio et al., ISCA 2016, and "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding" by S. Han et al., ICLR 2016.
The inventors consider that if the zeros could be removed by encoding for storage and/or transmission, and the original matrix data restored by decoding whenever it is needed, the data volume could be reduced substantially, lowering both the amount of data held in memory and the load on the on-chip network. For example, encodings such as CSR or CSC, or any other coding scheme suitable for compressing sparse matrix data, can be used to accelerate the deep neural network.
Figure 6 illustrates the principle of CSR encoding. Suppose the matrix to be compressed is the 5×8 matrix shown in the figure. All non-zero elements of the matrix are listed one by one as the data values, proceeding from the first row to the last and, within each row, from left to right, giving the data values 1, 5, 2, 4, 3, 6, 7, 8, 9. For each of these data values, the column in which it sits in the matrix is recorded: element 1 is in column 1, so its column number is 1; element 5 is in column 3, so its column number is 3; and so on, giving the column values 1, 3, 3, 5, 1, 6, 3, 8, 4. Furthermore, the first non-zero element (counting from the left) of each row of the matrix is 1, 2, 3, 7, 9 respectively; the positions of these elements within the data values are taken as the row offsets. For example, element 1 is in position 1 of the data values, so its row offset is 1; element 2 is the third element of the data values, so its row offset is 3; and so on, giving the 5 row-offset elements corresponding to the matrix size, namely 1, 3, 5, 7, 9.
It can be seen that the original 5×8 matrix has 40 elements, whereas after CSR encoding only 23 elements are needed to represent its contents (9 data values + 9 column values + 5 row-offset values). This encoding is therefore particularly well suited to compressing sparse matrices.
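The CSR encoding of Figure 6 can be reproduced with a short sketch. Following the figure, column indices and row offsets here are 1-based, the row offset of a row is the position of its first non-zero element within the value list (every row is assumed to contain at least one non-zero element), and the 5×8 matrix below is only a reconstruction consistent with the values given in the description:

```python
def csr_encode(matrix):
    """Return (values, column_indices, row_offsets), all 1-based as in Figure 6."""
    values, cols, row_offsets = [], [], []
    for row in matrix:
        row_offsets.append(len(values) + 1)   # position of the row's first non-zero
        for j, v in enumerate(row, start=1):
            if v != 0:
                values.append(v)
                cols.append(j)
    return values, cols, row_offsets

matrix = [
    [1, 0, 5, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 4, 0, 0, 0],
    [3, 0, 0, 0, 0, 6, 0, 0],
    [0, 0, 7, 0, 0, 0, 0, 8],
    [0, 0, 0, 9, 0, 0, 0, 0],
]
values, cols, offsets = csr_encode(matrix)
# values  -> [1, 5, 2, 4, 3, 6, 7, 8, 9]
# cols    -> [1, 3, 3, 5, 1, 6, 3, 8, 4]
# offsets -> [1, 3, 5, 7, 9]      # 9 + 9 + 5 = 23 elements instead of 40
```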
Based on the CSR encoding procedure described above, and on the fact that deep neural networks organize the data they compute on as matrices, the present invention provides a concrete compressor structure. The compressor can interface with the existing matrix-based computation of data in deep neural networks while compressing that data using CSR.
Figure 7 shows a compressor according to one embodiment of the invention, comprising an input buffer 11, a comparator 12, a counter 14, a switch 15, an output buffer 13 on the left, and an output buffer 16 on the right. Referring to Figure 7, the input buffer 11 is connected to the comparator 12 so that the comparator 12 stores the non-zero elements of the input buffer into the left output buffer 13; based on the elements of the input buffer that are 0, the comparator 12 generates a control signal for the switch 15, so that when the control signal indicates that the switch 15 should store, the count of the counter 14 is written into the right output buffer 16. The counter increments each time an element is read into the comparator 12. It should be understood that, besides being connected to the input buffer 11, the counter 14 may also be connected to the comparator 12, as long as it can record the number of elements read from the input buffer 11.
The input buffer 11 is a multi-bit register used to buffer the neural network computation data matrix, or the part of it, that needs to be compressed. The lengths of the input buffer 11 and/or the left output buffer 13 and the right output buffer 16 can be chosen according to the data size of the matrix. For example, for a matrix with a rows and b columns, a input buffers 11 may be provided, each of length at most b, together with a left output buffers 13 of length at most b and a right output buffers 16 of length at most b.
Taking Figure 7 as an example, suppose that five parallel input buffers 11 are provided for the 5×8 data matrix of Figure 6, each input buffer 11 handling one row of the data matrix and corresponding to one left output buffer 13 and one right output buffer 16. Assume that all input buffers 11 read data at the same rate, and that the counter 14 counts the number of elements read from the input buffers 11. When the count of the counter 14 is 1, the first through fifth input buffers 11 read the elements 1, 0, 3, 0, 0 respectively; when the count is 2, they read the elements 0, 0, 0, 0, 0 respectively; and so on.
Taking the first input buffer 11 as an example, the compression of the 5×8 data matrix of Figure 6 proceeds as follows. When the count of the counter 14 is 1, this input buffer 11 reads the element 1; the comparator 12 connected to it determines that this element is not 0 and stores it in the left output buffer 13 corresponding to this input buffer. When the count of the counter 14 is 2, this input buffer 11 reads the element 0; the comparator 12 determines that the element equals 0 and issues a control signal to the corresponding switch 15, causing the switch 15 to write the count 2 of the counter 14 into the right output buffer 16. Similarly, when the count of the counter 14 is 3, this input buffer 11 reads the element 5, and so on.
After these steps have been performed for every element of every row of the data matrix, the contents stored in the parallel left output buffers 13 are concatenated to form the data values of the matrix, and the contents stored in the parallel right output buffers 16 are concatenated to form the column values of the matrix. Since the position of each buffer queue determines which row of the data matrix each left output buffer 13 and each right output buffer 16 corresponds to, the row offsets can be determined, for each row of the data matrix, by finding the position of the first non-zero element of the corresponding left output buffer 13 within the data values (the row-offset computation unit that performs this comparison is not shown in Figure 7; a third type of output buffer may additionally be provided to hold the obtained row-offset values, whose maximum length depends on the number of rows of the data matrix). The compressor can thus carry out the CSR compression of the contents of a data matrix to obtain its data values, column values, and row-offset values.
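A purely software model of these per-row buffers might look like the sketch below. Here the counter value is recorded for the non-zero elements, so that the right-hand buffers reproduce the column values of Figure 6 and the merge step yields the same CSR triple as the `csr_encode` sketch above; this is one interpretation of the comparator/switch interaction and a behavioral model only, not the hardware itself:

```python
def compress_row(row):
    """Model one input buffer 11 with its comparator 12, counter 14, switch 15."""
    left_out, right_out = [], []       # output buffers 13 (values) and 16 (columns)
    counter = 0                        # counter 14: elements read so far
    for element in row:                # input buffer 11 feeds the comparator 12
        counter += 1
        if element != 0:
            left_out.append(element)   # non-zero element -> left output buffer 13
            right_out.append(counter)  # switch 15 latches the count -> buffer 16
    return left_out, right_out

def merge_rows(rows):
    """Concatenate the parallel per-row buffers and derive the row offsets."""
    values, cols, offsets = [], [], []
    for row in rows:
        left, right = compress_row(row)
        offsets.append(len(values) + 1)  # position of the row's first value (1-based)
        values.extend(left)
        cols.extend(right)
    return values, cols, offsets
```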
Figure 8 shows, according to one embodiment of the invention, a decompressor for decompressing data compressed with the CSR scheme, comprising: an input buffer 21 for the row-offset values, an input buffer 22 for the column values, an input buffer 23 for the data values, a comparator 24, a counter 25, a write controller 26, and an output buffer 27. Referring to Figure 8, the input buffer 21 for the row-offset values feeds the counter 25, which counts its entries, and provides its buffered content to the comparator 24. Based on its count of the row-offset values, the counter 25 sends control signals to the input buffer 22 for the column values and the input buffer 23 for the data values so that they buffer their data, row by row, into the write controller 26, until a control signal obtained by the comparator 24 from the counter 25 and the input buffer 21 for the row-offset values instructs the output buffer 27 to fetch the data from the write controller 26 row by row.
The input buffers 21, 22, 23 for the row-offset values, column values, and data values buffer the row-offset values, column values, and data values of the to-be-decompressed data obtained from the on-chip network while they await decompression. The counter 25 counts the number of entries read from the input buffer 21 for the row-offset values, uses this count to provide control signals to the input buffer 22 for the column values and the input buffer 23 for the data values, and also provides the count to the comparator 24. Since each row-offset value corresponds to one row of the original data matrix, each value the counter 25 reads from the input buffer 21 corresponds to the decompression of one row of that matrix; as long as the row being decompressed has not changed, the counter 25 generates the corresponding control signal telling the input buffers 22, 23 for the column values and data values to supply column values and data values to the write controller 26. The write controller 26 temporarily stores the column values and data values coming from the input buffers 22 and 23, so that they can be written into the output buffer 27 when the output buffer 27 performs a read. The comparator 24 compares the content of the input buffer 21 for the row-offset values with the count of the counter 25; if the two are equal, it generates the corresponding control signal causing the output buffer 27 to read data from the write controller 26. It should be understood that each count of the counter 25 corresponds to the decompression of one row of the data matrix; therefore, besides connecting the counter 25 to the input buffers 22, 23 for the column values and data values, other connection schemes may be used, as long as the output buffer 27 can distinguish the rows of the data matrix according to the count.
As with the compressor, the lengths of the various buffers in the decompressor according to the above embodiment of the invention can be chosen according to the size of the data matrix and/or the size of the convolution kernel in the deep neural network. For example, if the data matrix is 5×8 and the convolution kernel is 4×4, the input buffers 22 and 23 for the column values and data values can each be given a length of 40 data entries (i.e., at most as many elements need to be buffered as the data matrix contains), the input buffer 21 for the row offsets a length of 5 data entries (the number of rows of the data matrix), and the output buffer 27 a length of 8 data entries (assuming one decompressed row is output at a time) or of 40 data entries (if the complete decompressed data matrix is output at a time).
Taking Figure 8 as an example, suppose the decompressor needs to decompress the data values, column values, and row offsets of Figure 6 to restore the original 5×8 matrix. In the decompressor, the row-offset values are output from the input buffer 21 for the row offsets in the order 1, 3, 5, 7, 9. First, the counter 25 and the comparator 24 read the first row-offset value 1; the counter 25 now holds the count 1, indicating that row 1 of the original data matrix is being decompressed. Before the count changes to 2, it issues write control signals to the input buffers 22, 23 for the column values and data values, instructing them to write "1, 3" and "1, 5" into the write controller 26. At the same time, the comparator 24 compares the current row-offset value 1 with the count 1 of the counter 25; since they are equal, the comparator 24 issues a write control signal to the output buffer 27, instructing it to restore the first row of the data matrix to "1, 0, 5, 0, 0, 0, 0, 0" from the data values "1, 5" and column values "1, 3" stored in the write controller 26 together with the number of columns of the matrix, and to output the decompressed first row of data. The counter 25 and the comparator 24 then read the second row-offset value 3, the counter 25 changes its count to 2, and decompression of the second row of the data matrix begins. The preceding steps are repeated, and so on, until all five rows of the original data matrix have been decompressed.
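The corresponding decode step can be modelled in a few lines. As in the encoder sketch above, indices are 1-based, the row-offset convention follows Figure 6, and every row is assumed to contain at least one non-zero element:

```python
def csr_decode(values, cols, row_offsets, num_cols):
    """Rebuild the dense matrix row by row, mirroring the walkthrough above:
    for each row, take the values between its row offset and the next one
    and scatter them into the columns recorded for that row."""
    matrix = []
    bounds = row_offsets + [len(values) + 1]           # 1-based end sentinel
    for r in range(len(row_offsets)):
        start, end = bounds[r] - 1, bounds[r + 1] - 1  # convert to 0-based slices
        row = [0] * num_cols
        for v, c in zip(values[start:end], cols[start:end]):
            row[c - 1] = v
        matrix.append(row)
    return matrix

# csr_decode([1,5,2,4,3,6,7,8,9], [1,3,3,5,1,6,3,8,4], [1,3,5,7,9], 8)
# first row -> [1, 0, 5, 0, 0, 0, 0, 0], as in the walkthrough above
```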
It can be seen here that the compressor and decompressor of Figures 7 and 8 of the present invention are particularly suited to compressing the data matrices of deep neural networks, because such matrices are usually highly sparse, i.e., elements whose value is 0 make up a considerable proportion of them. It will be appreciated that the compressor and decompressor provided by the present invention are applicable not only to schemes that combine a deep neural network computing unit with 3D memory, but also to data matrices in other deep neural networks.
The above embodiments show that the present invention provides a scheme for compressing and decompressing the data content of a deep neural network, so as to reduce the amount of data that has to be transmitted and/or stored in the network. For schemes that use 3D memory in a deep neural network, specific placements of the compressor and decompressor are provided, which relieve the load on the on-chip network and thereby reduce data-transmission latency. In addition, the invention provides a compressor and decompressor dedicated to the matrix-form data used for computation in deep neural networks, which exploit the sparsity of neural network data matrices to compress and decompress them automatically.
It should be noted that not all of the steps described in the above embodiments are mandatory; those skilled in the art may make appropriate omissions, substitutions, modifications, and the like according to actual needs.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. For example, the invention does not restrict the type of 3D memory used; besides the HMC presented in the specific embodiments, High Bandwidth Memory (HBM) may also be used.
Although the present invention has been described in detail above with reference to embodiments, those of ordinary skill in the art will understand that modifications or equivalent substitutions of the technical solutions of the present invention do not depart from the spirit and scope of those solutions, and all such changes fall within the scope of the claims of the present invention.

Claims (10)

  1. An acceleration system for a deep neural network, comprising: a 3D memory; a deep neural network computing unit connected to the memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor;
    wherein the memory controller of each vault transmits data over an on-chip network via the router connected to it; and
    wherein the compressor is configured to compress to-be-compressed data for the deep neural network that needs to be transmitted over the on-chip network, and the decompressor is configured to decompress to-be-decompressed data for the deep neural network coming from the on-chip network.
  2. The system of claim 1, wherein the compressor is arranged inside the router, at a network interface of the on-chip network, or at the memory controller, and the decompressor is arranged inside the router, at a network interface of the on-chip network, or in the deep neural network computing unit.
  3. The system of claim 2, wherein the compressor is arranged at the input port through which the router is connected to the 3D memory, and the decompressor is arranged at the output port through which the router is connected to the deep neural network computing unit.
  4. The system of claim 2, wherein the compressor is arranged between the memory controller and the router, and the decompressor is arranged between the router and the deep neural network computing unit.
  5. A compressor for a deep neural network, comprising:
    an input buffer (11) for buffering the matrix data to be compressed in the deep neural network;
    a comparator (12) for reading an element from the input buffer (11) and determining whether the element is 0;
    a counter (14) for recording the number of elements read from the input buffer (11);
    a switch (15), which takes the output of the counter (14) as its input and the output of the comparator (12) as its control signal, for providing the output of the counter (14) when the comparator (12) determines yes;
    a first output buffer (13) for storing the element from the input buffer (11) when the comparator (12) determines no, so as to obtain the data values of the matrix;
    a second output buffer (16) for storing the output of the counter (14) provided by the switch (15), so as to obtain the column values of the matrix.
  6. The compressor of claim 5, further comprising:
    a row-offset computation unit for computing, for each row of the data matrix, the position of the first non-zero element of the first output buffer (13) within the overall output of the first output buffer (13), so as to obtain the row-offset values of the matrix; and
    a third output buffer for storing the row-offset values.
  7. The compressor of claim 6, wherein the length of the input buffer (11), the length of the first output buffer (13), and the length of the second output buffer (16) are greater than or equal to the number of rows of the matrix, the units of the input buffer (11) buffer the rows of the matrix in parallel, and one unit of the input buffer (11) corresponds to one unit of the first output buffer (13) and one unit of the second output buffer (16);
    and the length of the third output buffer is less than or equal to the number of rows of the matrix.
  8. A decompressor for a deep neural network, comprising:
    a first input buffer (23) for buffering the data values of the matrix to be decompressed in the deep neural network;
    a second input buffer (22) for buffering the column values of the matrix to be decompressed in the deep neural network;
    a third input buffer (21) for buffering the row-offset values of the matrix to be decompressed in the deep neural network;
    a counter (25) for recording the number of elements read from the third input buffer (21);
    a comparator (24) for comparing whether an element read from the third input buffer (21) is equal to the count of the counter (25);
    a write controller (26) for storing elements coming from the first input buffer (23) and the second input buffer (22);
    an output buffer (27) for determining, for each count of the counter (25) and when the comparator (24) determines yes, one decompressed row of the matrix from the elements stored in the write controller (26).
  9. The decompressor of claim 8, wherein the length of the first input buffer (23) and the length of the second input buffer (22) are less than or equal to the total number of elements of the matrix, and the length of the third input buffer (21) is less than or equal to the number of rows of the matrix.
  10. The decompressor of claim 9, wherein the output buffer (27) is further configured to compute, for each count of the counter (25), the individual elements of one decompressed row of the matrix from the elements of the data values and the elements of the corresponding column values stored in the write controller (26), together with the number of columns of the matrix.
PCT/CN2018/083880 2017-08-29 2018-04-20 Compression apparatus used for deep neural network WO2019041833A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710753293.4A CN107590533B (en) 2017-08-29 2017-08-29 Compression device for deep neural network
CN201710753293.4 2017-08-29

Publications (1)

Publication Number Publication Date
WO2019041833A1 true WO2019041833A1 (en) 2019-03-07

Family

ID=61050227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/083880 WO2019041833A1 (en) 2017-08-29 2018-04-20 Compression apparatus used for deep neural network

Country Status (2)

Country Link
CN (1) CN107590533B (en)
WO (1) WO2019041833A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590533B (en) * 2017-08-29 2020-07-31 中国科学院计算技术研究所 Compression device for deep neural network
CN110084364B (en) * 2018-01-25 2021-08-27 赛灵思电子科技(北京)有限公司 Deep neural network compression method and device
CN108470009B (en) * 2018-03-19 2020-05-29 上海兆芯集成电路有限公司 Processing circuit and neural network operation method thereof
CN108629410B (en) * 2018-04-28 2021-01-22 中国科学院计算技术研究所 Neural network processing method based on principal component analysis dimension reduction and/or dimension increase
CN108665062B (en) * 2018-04-28 2020-03-10 中国科学院计算技术研究所 Neural network processing system for reducing IO (input/output) overhead based on wavelet transformation
CN108629409B (en) * 2018-04-28 2020-04-10 中国科学院计算技术研究所 Neural network processing system for reducing IO overhead based on principal component analysis
CN109240605B (en) * 2018-08-17 2020-05-19 华中科技大学 Rapid repeated data block identification method based on 3D stacked memory
CN109298884B (en) * 2018-08-29 2021-05-25 北京中科睿芯科技集团有限公司 Universal character operation accelerated processing hardware device and control method
CN109325590B (en) * 2018-09-14 2020-11-03 中国科学院计算技术研究所 Device for realizing neural network processor with variable calculation precision
WO2020062312A1 (en) * 2018-09-30 2020-04-02 华为技术有限公司 Signal processing device and signal processing method
CN109104197B (en) * 2018-11-12 2022-02-11 合肥工业大学 Coding and decoding circuit and coding and decoding method for non-reduction sparse data applied to convolutional neural network
US12008475B2 (en) 2018-11-14 2024-06-11 Nvidia Corporation Transposed sparse matrix multiply by dense matrix for neural network training
CN109785905B (en) * 2018-12-18 2021-07-23 中国科学院计算技术研究所 Accelerating device for gene comparison algorithm
CN109800869B (en) * 2018-12-29 2021-03-05 深圳云天励飞技术有限公司 Data compression method and related device
CN110738310B (en) * 2019-10-08 2022-02-01 清华大学 Sparse neural network accelerator and implementation method thereof
CN110889259B (en) * 2019-11-06 2021-07-09 北京中科胜芯科技有限公司 Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix
CN110958177B (en) * 2019-11-07 2022-02-18 浪潮电子信息产业股份有限公司 Network-on-chip route optimization method, device, equipment and readable storage medium
CN110943744B (en) * 2019-12-03 2022-12-02 嘉楠明芯(北京)科技有限公司 Data compression, decompression and processing method and device based on data compression and decompression
CN111240743B (en) * 2020-01-03 2022-06-03 格兰菲智能科技有限公司 Artificial intelligence integrated circuit
CN111431539B (en) * 2020-03-04 2023-12-08 嘉楠明芯(北京)科技有限公司 Compression method and device for neural network data and computer readable storage medium
US11604976B2 (en) * 2020-04-29 2023-03-14 International Business Machines Corporation Crossbar arrays for computations in memory-augmented neural networks
CN116661707B (en) * 2023-07-28 2023-10-31 北京算能科技有限公司 Data processing method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140167987A1 (en) * 2012-12-17 2014-06-19 Maxeler Technologies Ltd. Systems and methods for data compression and parallel, pipelined decompression
CN105184362A (en) * 2015-08-21 2015-12-23 中国科学院自动化研究所 Depth convolution neural network acceleration and compression method based on parameter quantification
CN106447034A (en) * 2016-10-27 2017-02-22 中国科学院计算技术研究所 Neutral network processor based on data compression, design method and chip
CN106557812A (en) * 2016-11-21 2017-04-05 北京大学 The compression of depth convolutional neural networks and speeding scheme based on dct transform
CN107590533A (en) * 2017-08-29 2018-01-16 中国科学院计算技术研究所 A kind of compression set for deep neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013120497A1 (en) * 2012-02-15 2013-08-22 Curevac Gmbh Nucleic acid comprising or coding for a histone stem-loop and a poly(a) sequence or a polyadenylation signal for increasing the expression of an encoded therapeutic protein
CN107092961B (en) * 2017-03-23 2018-08-28 中国科学院计算技术研究所 A kind of neural network processor and design method based on mode frequency statistical coding

Also Published As

Publication number Publication date
CN107590533A (en) 2018-01-16
CN107590533B (en) 2020-07-31

Similar Documents

Publication Publication Date Title
WO2019041833A1 (en) Compression apparatus used for deep neural network
US20190196907A1 (en) Compression techniques for distributed data
US10771090B2 (en) Data processing unit having hardware-based range encoding and decoding
TWI814975B (en) Method and apparatus for storage media programming with adaptive write buffer release, and the system-on-chip thereof
JP7221242B2 (en) Neural network data processor, method and electronics
EP2104236B1 (en) Table device, variable-length encoding device, variable-length decoding device, and variable-length encoding/decoding device
US20140108481A1 (en) Universal fpga/asic matrix-vector multiplication architecture
US9176977B2 (en) Compression/decompression accelerator protocol for software/hardware integration
US9479194B2 (en) Data compression apparatus and data decompression apparatus
JPH04213223A (en) Device for variable-length coding-decoding of digital signal
US20190123763A1 (en) Data compression engine for dictionary based lossless data compression
US20200142642A1 (en) Data processing unit having hardware-based range encoding and decoding
US10877668B2 (en) Storage node offload of residual part of a portion of compressed and distributed data to a second storage node for decompression
JP2022551266A (en) Representation format of neural network
CN107801044B (en) Backward adaptive device and correlation technique
US20210224191A1 (en) Compression and decompression module in a cache controller for reducing off-chip data traffic
CN103974090B (en) Image encoding apparatus
CN112290953B (en) Array encoding device and method, array decoding device and method for multi-channel data stream
CN106817584A (en) A kind of MJPEG compressions implementation method and FPGA based on FPGA
WO2020211000A1 (en) Apparatus and method for data decompression
US11424761B2 (en) Multiple symbol decoder
JP2023155450A5 (en) Computer system and computer system control method
RU2450441C1 (en) Data compression method and apparatus
CN115705150A (en) System, method and apparatus for partitioning and compressing data
JP7381393B2 (en) Conditional transcoder and transcoding method for encoded data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18850251

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18850251

Country of ref document: EP

Kind code of ref document: A1