WO2019041833A1 - Compression apparatus used for deep neural network - Google Patents

Compression apparatus used for deep neural network

Info

Publication number
WO2019041833A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
deep neural
neural network
input buffer
Prior art date
Application number
PCT/CN2018/083880
Other languages
French (fr)
Chinese (zh)
Inventor
翁凯衡
韩银和
王颖
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所
Publication of WO2019041833A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention relates to acceleration of deep neural networks, and more particularly to data processing of deep neural networks.
  • The deep neural network can be understood as an operational model that contains a large number of data nodes; each data node is connected to other data nodes, and the connections between nodes are represented by weights.
  • As deep neural networks continue to develop, their complexity keeps increasing. Because computation with a deep neural network typically iterates over large amounts of data, memory must be accessed frequently, and a relatively high memory bandwidth is required to sustain the computation speed.
  • FIG. 1 shows a schematic diagram of the three-dimensional structure of the HMC. Unlike conventional 2D memory, the HMC uses a multi-layer stacked circuit structure in which parallel stacked chips are vertically linked by through-silicon vias (TSVs).
  • TSV: through-silicon via technology
  • The HMC includes a plurality of memory layers for storing data and a circuit logic layer that performs ordering, refresh, data routing, and error correction for the memory layers.
  • The set of chip layers stacked vertically over each unit area (i.e., each small square in FIG. 1) is called a vault, and each vault has, at the corresponding position of the circuit logic layer, a memory controller that manages the memory operations of that vault, for example controlling data transfers between vaults; this structure allows each vault to independently provide relatively high bandwidth.
  • NoC: Network on Chip
  • The computational acceleration units of the neural network can be integrated into the vaults of the HMC by exploiting the vaults' high bandwidth and low access latency.
  • In implementing this technique, however, many problems remain to be solved. Conventional 2D memory has neither the concept of a vault nor a logic layer; in other words, 2D memory contains no unit usable for computation. How to deploy the HMC within a deep neural network, and how to place the neural network's computational acceleration units in the HMC so that the 3D memory better serves the deep neural network, are therefore questions that must be considered.
  • The 3D memory can internally support very high data throughput, while the neural network computing units in the circuit logic layer can provide high computational performance; the on-chip network connecting the vaults and the neural network computing units must therefore offer high data throughput to satisfy the deep neural network's need to fetch data from memory frequently for computation.
  • Such massive data transfers impose a heavy burden on the on-chip network, causing its transmission latency to rise sharply and degrading system performance. Therefore, when deploying a solution such as integrating the neural network's computational acceleration units into the vaults of the HMC, how to relieve the resulting load on the on-chip network must also be considered.
  • An acceleration system for a deep neural network comprises: a 3D memory; a deep neural network computing unit connected to a memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor;
  • the memory controller of each vault transmits data over the on-chip network via the router connected to it;
  • the compressor is configured to compress deep-neural-network data that needs to be transmitted over the on-chip network, and the decompressor is configured to decompress deep-neural-network data received from the on-chip network.
  • Preferably, the compressor is disposed in the router, at the network interface of the on-chip network, or at the memory controller, and the decompressor is disposed in the router, at the network interface of the on-chip network, or in the deep neural network computing unit.
  • Preferably, the compressor is disposed at the input of the router connected to the 3D memory, and the decompressor is disposed at the output of the router connected to the deep neural network computing unit.
  • Preferably, the compressor is disposed between the memory controller and the router, and the decompressor is disposed between the router and the deep neural network computing unit.
  • A compressor for a deep neural network comprises:
  • an input buffer (11) for buffering matrix data of the deep neural network to be compressed;
  • a comparator (12) for reading an element from the input buffer (11) and determining whether the element is 0;
  • a counter (14) for recording the number of elements read from the input buffer (11);
  • a switch (15) that takes the output of the counter (14) as its input and the output of the comparator (12) as its control signal, and that passes the output of the counter (14) when the comparator (12) determines YES;
  • a first output buffer (13) for storing the element from the input buffer (11) when the comparator (12) determines NO, so as to obtain the data values of the matrix;
  • a second output buffer (16) for storing the output of the counter (14) provided by the switch (15), so as to obtain the column values of the matrix.
  • Preferably, the compressor further comprises:
  • a row offset calculating means for calculating, for each row of the data matrix, the position of that row's first non-zero element within all outputs of the first output buffer (13), so as to obtain the row offset values of the matrix; and
  • a third output buffer configured to store the row offset values.
  • Preferably, the length of the input buffer (11), the length of the first output buffer (13), and the length of the second output buffer (16) are greater than or equal to the number of rows of the matrix; the units of the input buffer (11) cache the rows of the matrix in parallel, and each unit of the input buffer (11) corresponds to one unit of the first output buffer (13) and one unit of the second output buffer (16);
  • furthermore, the length of the third output buffer is less than or equal to the number of rows of the matrix.
  • A decompressor for a deep neural network comprises:
  • a first input buffer (23) for buffering the data values of the matrix of the deep neural network to be decompressed;
  • a second input buffer (22) for buffering the column values of the matrix to be decompressed;
  • a third input buffer (21) for buffering the row offset values of the matrix to be decompressed;
  • a counter (25) for recording the number of elements read from the third input buffer (21);
  • a comparator (24) for comparing whether an element read from the third input buffer (21) is equal to the count of the counter (25);
  • a write controller (26) for storing elements from the first input buffer (23) and the second input buffer (22);
  • an output buffer (27) for determining, for each count of the counter (25) and when the comparator (24) determines YES, one row of the decompressed matrix from the elements stored in the write controller (26).
  • Preferably, the lengths of the first input buffer (23) and the second input buffer (22) are less than or equal to the total number of elements in the matrix, and the length of the third input buffer (21) is less than or equal to the number of rows of the matrix.
  • Preferably, the output buffer (27) is further configured to compute, for each count of the counter (25), the individual elements of one row of the decompressed matrix from the data-value elements stored in the write controller (26), the corresponding column-value elements, and the number of columns of the matrix.
  • The acceleration system for a deep neural network reduces the amount of data that must be transmitted and/or stored in the on-chip network by adding a compressor and a decompressor to the system, thereby relieving the load that combining 3D memory with deep neural network computing units imposes on the on-chip network and reducing data transmission latency.
  • A compressor and a decompressor dedicated to data processed in matrix form in a deep neural network are also provided; they automatically compress and decompress the data matrix by exploiting the sparsity of the neural network's data matrix.
  • HMC: hybrid memory cube memory
  • FIG. 2 is a schematic structural diagram of a prior art solution combining an HMC with a deep neural network computing unit;
  • FIG. 3 is a block diagram showing the arrangement of a compressor and a decompressor in the router of FIG. 2, in accordance with one embodiment of the present invention
  • FIG. 4(a) is a schematic structural view showing a compressor disposed between the memory controller and the router shown in FIG. 2 according to an embodiment of the present invention
  • FIG. 4(b) is a schematic structural view showing a decompressor disposed between the router and the computing unit shown in FIG. 2 according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of the process of performing a convolution operation on a data matrix (i.e., an image) and a convolution kernel matrix in a deep neural network;
  • FIG. 6 is a schematic diagram of the CSR encoding method suitable for compressing a sparse matrix;
  • FIG. 7 is a schematic structural diagram of a compressor for using data in a matrix form in a deep neural network according to an embodiment of the present invention.
  • FIG. 8 is a block diagram showing a structure of a decompressor for using data in a matrix form in a deep neural network according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a 3D memory-based deep neural network acceleration system in the prior art.
  • As shown in FIG. 2, deep neural network computing units are integrated on the logic layer of the 3D memory, and each deep neural network computing unit is connected, through a memory controller, to the local vault that contains that memory controller.
  • The memory controllers of different vaults exchange data over a common on-chip network, and the local vault's memory controller routes data to remote vaults through the routers of the on-chip network.
  • When a deep neural network computing unit begins a neural network computation, it sends a data request to the memory controller of its local vault; if the requested data is not located in the local vault, the request is injected into the connected on-chip network router and transmitted over the on-chip network to the router corresponding to the remote vault at the destination.
  • The router at the destination delivers the request to the memory controller of its remote vault; the required data is read from the HMC through that memory controller and injected into an on-chip network router, travels across the on-chip network back to the router that issued the request, and is provided to the corresponding deep neural network computing unit for computation.
  • The inventors recognized that the memory controllers of the vaults are independent of one another and that data must travel between vaults over the on-chip network, so suitable processing devices can be placed at the vaults themselves, for example a compressor for compressing data and a corresponding decompressor, to relieve the load that massive data transfers impose on the on-chip network.
  • Figure 3 illustrates an embodiment in which a compressor and a decompressor are disposed within a router of an on-chip network, where only components corresponding to one of the crossbars in the router are drawn for simplicity.
  • In FIG. 3, the compressor is connected to the multiplexer whose input comes from the local memory, so that the compressed data is output through the crossbar inside the router to the on-chip network (i.e., northward);
  • conversely, the decompressor is placed at the multiplexer whose input comes from outside the router, so that the received data is decompressed and the decompressed data is output to the deep neural network computing unit for computation.
  • Any suitable compressor and decompressor can be used here.
  • Preferably, the compressor and decompressor are ones suited to compressing sparse matrix data.
  • When data in the memory layers of the local vault of the 3D memory is read into the router corresponding to that vault (i.e., the input comes from the memory), the data packet first enters the corresponding virtual channel through the multiplexer. If the packet's tag bit indicates that it has not yet been compressed or that compression is required, the packet is routed through the multiplexer into the compressor, compressed, and its tag bit is updated to indicate that it has been compressed or that no compression is needed; when the packet passes through the virtual channel again, the multiplexer feeds it, according to its tag bit, into the crossbar so that it is output from the router into the on-chip network.
  • When the on-chip network delivers data from the router of a remote vault into this router (i.e., the input comes from the north), the packet first passes through the multiplexer, virtual channel, and crossbar to the multiplexer on the other side.
  • That multiplexer checks the packet's tag bit to determine whether the packet has been compressed and whether decompression is required, and thus whether the packet must be transferred to the decompressor. A packet transferred into the decompressor is decompressed there and its tag bit is modified; when the packet passes through the multiplexer again, the multiplexer outputs it from the router to the deep neural network computing unit according to its tag bit.
  • FIG. 4 shows another embodiment of the compressor and decompressor placement: in FIG. 4(a) the compressor is placed at the vault's memory controller (that is, between the memory controller and the router), and in FIG. 4(b) the decompressor is placed between the vault's router and the deep neural network computing unit.
  • In this embodiment, the compressor compresses the data read from memory so that it can be transmitted to a remote vault through the router and the on-chip network, whereas the decompressor decompresses the data that the router receives from the on-chip network, restoring the data matrix used by the deep neural network computing unit for computation.
  • Referring to FIG. 4(a), after the memory controller reads a data packet from the memory layers of its vault, the multiplexer uses the packet's tag bit to identify packets that need compression and transfers them to the compressor for compression; the tag bits of the compressed packets are modified, and the resulting packets are handed to the router for transmission to the remote vault.
  • Referring to FIG. 4(b), when the router receives a data packet from the on-chip network, the multiplexer uses the packet's tag bit to route packets that need decompression to the decompressor for decompression; the tag bits of the decompressed packets are modified, and the resulting packets are handed to the deep neural network computing unit for computation.
  • It should be understood that FIGS. 3 and 4 show only two embodiments of the present invention; the compressor and decompressor may also be placed, as needed, at other suitable locations in the deep neural network acceleration system shown in FIG. 2.
  • According to one embodiment of the invention, the compressor can be placed inside a router of the on-chip network.
  • The router can then adaptively decide whether or not to perform compression, based on the current network transmission conditions in the router and/or the sparsity of the data packet (i.e., the proportion of the data whose value is "0").
  • For example, thresholds are set separately for the fluency of network transmission and for the sparsity of the data; if the fluency of the current network transmission exceeds its threshold and the sparsity of the data exceeds its threshold, the data packets to be routed by the router are compressed.
  • According to a further embodiment of the invention, the compressor can be placed at the network interface of the on-chip network.
  • The data can be compressed while the data content is being packed into a data packet, or after packetization and before the packet is injected into the on-chip network router, in which case the data is compressed and encapsulated into a new packet. After compression, the resulting packet (the packet whose content was compressed, or the newly generated packet) is injected into the on-chip network router and awaits transmission. In this way, the size or number of packets to be transmitted can be reduced, avoiding additional burden on the on-chip network routers.
  • According to yet another embodiment of the invention, the compressor can be placed at a memory controller of the 3D memory.
  • Data read from memory can then be compressed directly, for example by compressing the data matrix itself, and the compressed data content is subsequently encapsulated into packets for routing.
  • This approach saves the time needed to move data from the memory controller to the network interface, but it makes it harder to hide the compression latency through pipelining.
  • As with the compressor, the decompressor can be placed inside a router of the on-chip network, at a network interface of the on-chip network, or in the deep neural network computing unit.
  • Preferably, the location of the decompressor is determined according to the type of data compressed by the compressor; for example, when the compressor is placed inside an on-chip network router, the decompressor can likewise be placed inside an on-chip network router.
  • Other embodiments of the invention also provide concrete implementations of the compressor and decompressor.
  • The data used for computation in a deep neural network has its own characteristic format: it is usually organized as matrices to facilitate the computation.
  • Deep neural network data also tends to be highly sparse, that is, the matrices contain a large number of elements whose value is "0".
  • The inventors therefore considered that a compressor and a decompressor dedicated to deep neural networks can be designed around these characteristics.
  • FIG. 5 shows an example of a simplified convolution computation, in which the data matrix is 5×5 and the convolution kernel matrix is 3×3; in a typical deep neural network, convolving the data matrix with the convolution kernel matrix is the principal operation of the computing unit.
  • Encoding schemes that compress sparse matrix data, such as CSR and CSC, can therefore be used in accelerating the deep neural network.
  • FIG. 6 illustrates the principle of the CSR encoding scheme. If the matrix to be compressed is the 5×8 matrix shown in the figure, all of its non-zero elements are collected one by one into the data values, traversing from the first row to the last row and, within each row, from left to right, giving the data values 1, 5, 2, 4, 3, 6, 7, 8, and 9. The column in which each of these data values lies in the matrix is also extracted: element 1 is in column 1, so its column value is 1; element 5 is in column 3, so its column value is 3; and so on, giving the column values 1, 3, 3, 5, 1, 6, 3, 8, and 4.
  • In addition, the first non-zero element (from the left) of each row is 1, 2, 3, 7, and 9 respectively, and the position of each of these elements within the data values is taken as the row offset: element 1 is the first entry of the data values, so its row offset is 1; element 2 is the third entry of the data values, so its row offset is 3; and so on, giving a number of row offset values equal to the number of rows of the matrix, namely 1, 3, 5, 7, and 9. A minimal code sketch of this encoding follows.
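The CSR layout just described can be captured in a few lines of code. The following is a minimal, non-authoritative sketch and not part of the patent text: the function name csr_encode is an illustrative assumption, the example matrix is reconstructed from the data values, column values, and row offsets quoted above, and the 1-based indexing follows the FIG. 6 convention.

```python
def csr_encode(matrix):
    """Encode a dense matrix as (data values, column values, row offsets).

    Follows the FIG. 6 convention: 1-based column indices, and each row offset
    is the position, within the data values, of that row's first non-zero element.
    """
    values, columns, row_offsets = [], [], []
    for row in matrix:
        row_offsets.append(len(values) + 1)        # where this row's first non-zero will land
        for col, element in enumerate(row, start=1):
            if element != 0:                        # only non-zero elements are kept
                values.append(element)
                columns.append(col)
    return values, columns, row_offsets

# 5x8 matrix reconstructed from the values/columns/offsets quoted in the text
example = [
    [1, 0, 5, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 4, 0, 0, 0],
    [3, 0, 0, 0, 0, 6, 0, 0],
    [0, 0, 7, 0, 0, 0, 0, 8],
    [0, 0, 0, 9, 0, 0, 0, 0],
]
print(csr_encode(example))
# ([1, 5, 2, 4, 3, 6, 7, 8, 9], [1, 3, 3, 5, 1, 6, 3, 8, 4], [1, 3, 5, 7, 9])
```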
  • On this basis, the present invention provides the structure of a specific compressor.
  • This compressor fits naturally with the existing matrix-based computation of data in a deep neural network while compressing the data by means of CSR.
  • FIG. 7 illustrates a compressor according to one embodiment of the present invention, comprising an input buffer 11, a comparator 12, a counter 14, a switch 15, a left-hand output buffer 13, and a right-hand output buffer 16.
  • The input buffer 11 is connected to the comparator 12 so that non-zero elements of the input buffer are stored, via the comparator 12, into the left-hand output buffer 13; when an element of the input buffer is 0, the comparator 12 generates a control signal for the switch 15, and when that control signal indicates storage, the count of the counter 14 is stored into the right-hand output buffer 16. The counter counts up each time an element is read into the comparator 12.
  • The counter 14 here may also be connected to the input buffer 11 in addition to the comparator 12, as long as the counter 14 can record the number of elements read from the input buffer 11.
  • The input buffer 11 is a multi-bit register used to buffer the data matrix (or part of such a matrix) of the neural network computation that is to be compressed; the lengths of the input buffer 11, the left-hand output buffer 13, and the right-hand output buffer 16 can be chosen according to the size of the data matrix.
  • For example, for a matrix with a rows and b columns, a input buffers 11 may be provided (one per row), each of length less than or equal to b, together with a left-hand output buffers 13 and a right-hand output buffers 16, each likewise of length less than or equal to b.
  • Each input buffer 11 is assigned to one row of the data matrix, and each input buffer 11 corresponds to one left-hand output buffer 13 and one right-hand output buffer 16.
  • The input buffers 11 read their data at the same rate, and the counter 14 counts the number of elements that have been read from the input buffers 11.
  • For example, when the count of the counter 14 is 1, the first to fifth input buffers 11 read the elements 1, 0, 3, 0, and 0 respectively (the first element of each row); when the count of the counter 14 is 2, they read the elements 0, 0, 0, 0, and 0; and so on.
  • The compression of the 5×8 data matrix shown in FIG. 6 proceeds as follows, taking the first row (and its input buffer 11) as an example.
  • When the count of the counter 14 is 1, the input buffer 11 reads the element 1; the comparator 12 connected to that input buffer 11 determines that the element 1 is not 0, and the element 1 is stored in the left-hand output buffer 13 corresponding to that input buffer 11.
  • When the count of the counter 14 is 2, the input buffer 11 reads the element 0; the comparator 12 connected to that input buffer 11 determines that the element is equal to 0 and sends a control signal to the corresponding switch 15, so that the switch 15 writes the count 2 of the counter 14 into the right-hand output buffer 16.
  • When the count of the counter 14 is 3, the input buffer 11 reads the element 5, and so on.
  • After compression, the contents stored in the parallel left-hand output buffers 13 can be combined to form the data values of the data matrix, and the contents stored in the parallel right-hand output buffers 16 can be combined to form the column values of the data matrix. Since the position of each left-hand output buffer 13 and right-hand output buffer 16 in the buffer queue determines which row of the data matrix it corresponds to, the row offset of each row of the data matrix can be determined by finding the position, within the data values, of the first non-zero element held in that row's left-hand output buffer 13 (the row offset computing device that performs this comparison is not shown in FIG. 7).
  • A third type of output buffer may be added to buffer the resulting row offset values; the maximum length of this third output buffer depends on the number of rows of the data matrix.
  • In this way, the compressor performs a CSR compression of the contents of the data matrix, yielding the data values, column values, and row offset values of the data matrix. A behavioral sketch of this process is given below.
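To make the dataflow above concrete, here is a minimal, non-authoritative Python model of the parallel row lanes (one input buffer 11, comparator 12, counter 14, switch 15, and pair of output buffers 13/16 per row). It is a sketch only: the function and variable names are illustrative, and, so that the outputs match the CSR data values, column values, and row offsets of FIG. 6, the counter value is recorded into the column-value buffer for non-zero elements.

```python
def compress_matrix(matrix):
    """Behavioral model of the FIG. 7 compressor, one row per lane."""
    left_buffers = []    # one output buffer 13 per row: non-zero data values
    right_buffers = []   # one output buffer 16 per row: column positions (counter values)
    for row in matrix:                        # each row has its own input buffer 11
        left, right = [], []
        counter = 0                           # counter 14 for this lane
        for element in row:                   # comparator 12 examines each element
            counter += 1
            if element != 0:
                left.append(element)          # data value into output buffer 13
                right.append(counter)         # switch 15 passes the count into buffer 16
        left_buffers.append(left)
        right_buffers.append(right)

    # Combine the per-row buffers into CSR-style outputs; row offsets are the 1-based
    # positions of each row's first non-zero element within the combined data values.
    values, columns, row_offsets = [], [], []
    for left, right in zip(left_buffers, right_buffers):
        row_offsets.append(len(values) + 1)
        values.extend(left)
        columns.extend(right)
    return values, columns, row_offsets
```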
  • FIG. 8 illustrates, according to an embodiment of the present invention, a decompressor for decompressing data compressed in CSR form, comprising an input buffer 21 for row offset values, an input buffer 22 for column values, an input buffer 23 for data values, a comparator 24, a counter 25, a write controller 26, and an output buffer 27.
  • The input buffer 21 for row offset values feeds the counter 25, which counts the row offset values read, and also provides its buffered content to the comparator 24. Based on its count, the counter 25 provides control signals to the input buffer 22 for column values and the input buffer 23 for data values so that they deliver their data row by row to the write controller 26, while the control signal obtained by comparing, in the comparator 24, the count of the counter 25 with the content of the input buffer 21 instructs the output buffer 27 to fetch data row by row from the write controller 26.
  • Specifically, the input buffers 21, 22, and 23 for row offset values, column values, and data values buffer, respectively, the row offset values, column values, and data values of the to-be-decompressed data obtained from the on-chip network while they await decompression.
  • The counter 25 counts the number of entries read from the input buffer 21 for row offset values, provides a control signal to the input buffer 22 for column values and the input buffer 23 for data values according to this count, and also provides the count to the comparator 24. Since each row offset value corresponds to one row of the original data matrix, each value that the counter 25 reads from the input buffer 21 corresponds to decompressing one row of the original data matrix.
  • While the row being decompressed has not changed, the counter 25 generates a corresponding control signal instructing the column-value and data-value input buffers 22 and 23 to provide column values and data values to the write controller 26.
  • The write controller 26 temporarily stores the column values and data values received from the input buffer 22 for column values and the input buffer 23 for data values, and writes the stored column values and data values into the output buffer 27 when the output buffer 27 performs a read.
  • The comparator 24 compares the content of the input buffer 21 for row offset values with the count of the counter 25; if the two are equal, it generates a corresponding control signal that causes the output buffer 27 to read the data from the write controller 26.
  • Since each count of the counter 25 corresponds to decompressing one row of data, connection schemes other than connecting the counter 25 to the column-value and data-value input buffers 22 and 23 may also be adopted, as long as the output buffer 27 can distinguish the individual rows of the data matrix according to the count.
  • the length of each buffer in the decompressor may be determined according to the size of the data matrix and/or the size of the convolution kernel in the deep neural network.
  • For example, for the 5×8 data matrix of FIG. 6, the lengths of the input buffers 22 and 23 for column values and data values can be set to 40 data entries (that is, at most as many entries as there are elements in the data matrix), the length of the input buffer 21 for row offsets can be set to 5 data entries (i.e., the number of rows of the data matrix), and the length of the output buffer 27 can be set to 8 data entries (if one row of decompressed data is output at a time) or to 40 data entries (if the complete decompressed data matrix is output at once).
  • Suppose the decompressor is to decompress the data values, column values, and row offsets shown in FIG. 6 so as to restore the original 5×8 matrix.
  • The row offset values are output from the input buffer 21 for row offsets in the order 1, 3, 5, 7, and 9.
  • First, the counter 25 and the comparator 24 read the first row offset value 1; the count of the counter 25 becomes 1, indicating that the first row of the original data matrix is being decompressed until the count changes to 2.
  • The input buffers 22 and 23 for column values and data values receive control signals instructing them to write "1, 3" and "1, 5" respectively into the write controller 26.
  • The comparator 24 compares the current row offset value 1 with the count 1 of the counter 25; since they are equal, the comparator 24 sends a control signal to the output buffer 27, instructing it to restore the first row of the data matrix to "1, 0, 5, 0, 0, 0, 0, 0" from the data values "1, 5" and the column values "1, 3" held in the write controller 26 together with the number of columns of the matrix, and to output the decompressed first row of data.
  • Next, the counter 25 and the comparator 24 read the second row offset value 3, and the counter 25 updates its count to 2 to begin decompressing the second row of the data matrix. The foregoing steps are repeated in the same way until all five rows of data in the original data matrix have been decompressed. A minimal code sketch of this row-by-row reconstruction is given below.
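The row-by-row reconstruction performed by the decompressor can be modelled compactly. The following is a minimal, non-authoritative sketch (the function name and the 1-based index handling are illustrative assumptions); it rebuilds each row from the data values, column values, row offsets, and the number of columns of the matrix, as the output buffer 27 does.

```python
def csr_decode(values, columns, row_offsets, num_cols):
    """Rebuild the dense matrix row by row from 1-based CSR-style arrays."""
    num_rows = len(row_offsets)
    matrix = []
    for r in range(num_rows):
        start = row_offsets[r] - 1                                    # this row's first non-zero
        end = row_offsets[r + 1] - 1 if r + 1 < num_rows else len(values)
        row = [0] * num_cols                                          # start from an all-zero row
        for k in range(start, end):
            row[columns[k] - 1] = values[k]                           # place value in its column
        matrix.append(row)
    return matrix

first_row = csr_decode([1, 5, 2, 4, 3, 6, 7, 8, 9],
                       [1, 3, 3, 5, 1, 6, 3, 8, 4],
                       [1, 3, 5, 7, 9], 8)[0]
print(first_row)  # [1, 0, 5, 0, 0, 0, 0, 0], as in the FIG. 6 walk-through
```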
  • The compressor and decompressor of FIGS. 7 and 8 of the present invention are particularly suitable for compressing data matrices in deep neural networks, since such matrices are usually large and highly sparse, that is, elements with the value 0 account for a considerable proportion. It will be appreciated that the compressor and decompressor provided by the present invention are suitable not only for schemes that combine deep neural network computing units with 3D memory, but also for data matrices in other deep neural network settings.
  • the present invention provides a scheme for compressing and decompressing data content in a deep neural network to reduce the amount of data that needs to be transmitted and/or stored in the network.
  • a specific deployment scheme for the compressor and the decompressor is provided, which relieves the load pressure of the on-chip network, thereby reducing the delay of data transmission.
  • A compressor and a decompressor dedicated to data processed in matrix form in a deep neural network are also provided; they automatically compress and decompress the data matrix by exploiting the sparsity of the neural network's data matrix.
  • HBM: high bandwidth memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided is an acceleration system for a deep neural network. The system comprises: a 3D memory; a deep neural network computing unit connected to a memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor. The memory controller of each vault transmits data over the network-on-chip via the router connected to it. The compressor is used to compress data for the deep neural network that needs to be transmitted over the network-on-chip, and the decompressor is used to decompress data for the deep neural network that arrives from the network-on-chip.

Description

Compression apparatus for a deep neural network
Technical Field
The present invention relates to the acceleration of deep neural networks, and more particularly to data processing for deep neural networks.
Background Art
With the development of artificial intelligence technology, techniques involving deep neural networks, and convolutional neural networks in particular, have advanced rapidly in recent years and have been widely applied in fields such as image recognition, speech recognition, natural language understanding, weather prediction, gene expression, content recommendation, and intelligent robotics. A deep neural network can be understood as an operational model containing a large number of data nodes; each data node is connected to other data nodes, and the connections between nodes are represented by weights. As deep neural networks continue to develop, their complexity keeps increasing. Because computation with a deep neural network typically iterates over large amounts of data, memory must be accessed frequently, and a relatively high memory bandwidth is required to sustain the computation speed.
To accelerate deep neural networks, some prior art proposes improving the memory itself and applying the improved memory to deep neural networks. For example, the article "Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory", published by Duckhwan Kim et al. at ISCA in 2016, proposes a convolutional neural network acceleration system based on the Hybrid Memory Cube (HMC). The HMC is a new 3D memory structure with large storage capacity and low on-chip access latency, and is therefore regarded by Duckhwan Kim et al. as a potential storage-and-compute substrate for convolutional neural networks.
FIG. 1 is a schematic diagram of the three-dimensional structure of the HMC. Unlike conventional 2D memory, the HMC uses a multi-layer stacked circuit structure in which parallel stacked chips are vertically linked by through-silicon vias (TSVs). The HMC includes a plurality of memory layers for storing data and one circuit logic layer that performs ordering, refresh, data routing, and error correction for the memory layers. The set of chip layers stacked vertically over each unit area (i.e., each small square in FIG. 1) is called a vault, and each vault has, at the corresponding position of the circuit logic layer, a memory controller that manages the memory operations of that vault, for example controlling data transfers between vaults. This structure allows each vault to independently provide relatively high bandwidth. For convenience of description, the system that transfers data between the vaults of the 3D memory is referred to abstractly as the Network on Chip (NoC).
The article by Duckhwan Kim et al. proposes integrating the computational acceleration units of the neural network into the vaults of the HMC by exploiting the vaults' high bandwidth and low access latency. However, many problems remain to be solved in implementing this technique. Conventional 2D memory has neither the concept of a vault nor a logic layer; in other words, 2D memory contains no unit usable for computation. How to deploy the HMC within a deep neural network, and how to place the neural network's computational acceleration units in the HMC so that the 3D memory better serves the deep neural network, are therefore questions that must be considered.
On the other hand, the 3D memory can internally support very high data throughput, while the neural network computing units in the circuit logic layer can provide very high computational performance. The on-chip network connecting the vaults and the neural network computing units must therefore offer high data throughput to satisfy the deep neural network's need to fetch data from memory frequently for computation. However, such massive data transfers impose a heavy burden on the on-chip network, causing its transmission latency to rise sharply and degrading system performance. Consequently, when deploying a solution such as integrating the neural network's computational acceleration units into the vaults of the HMC, how to relieve the resulting load on the on-chip network must also be considered.
Summary of the Invention
It is therefore an object of the present invention to overcome the above-described deficiencies of the prior art and to provide an acceleration system for a deep neural network, comprising: a 3D memory; a deep neural network computing unit connected to a memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor;
wherein the memory controller of each vault transmits data over the on-chip network via the router connected to it; and
wherein the compressor is configured to compress deep-neural-network data that needs to be transmitted over the on-chip network, and the decompressor is configured to decompress deep-neural-network data received from the on-chip network.
Preferably, in the system, the compressor is disposed in the router, at the network interface of the on-chip network, or at the memory controller, and the decompressor is disposed in the router, at the network interface of the on-chip network, or in the deep neural network computing unit.
Preferably, in the system, the compressor is disposed at the input of the router connected to the 3D memory, and the decompressor is disposed at the output of the router connected to the deep neural network computing unit.
Preferably, in the system, the compressor is disposed between the memory controller and the router, and the decompressor is disposed between the router and the deep neural network computing unit.
A compressor for a deep neural network comprises:
an input buffer (11) for buffering matrix data of the deep neural network to be compressed;
a comparator (12) for reading an element from the input buffer (11) and determining whether the element is 0;
a counter (14) for recording the number of elements read from the input buffer (11);
a switch (15) that takes the output of the counter (14) as its input and the output of the comparator (12) as its control signal, and that passes the output of the counter (14) when the comparator (12) determines YES;
a first output buffer (13) for storing the element from the input buffer (11) when the comparator (12) determines NO, so as to obtain the data values of the matrix; and
a second output buffer (16) for storing the output of the counter (14) provided by the switch (15), so as to obtain the column values of the matrix.
Preferably, the compressor further comprises:
a row offset calculating means for calculating, for each row of the data matrix, the position of that row's first non-zero element within all outputs of the first output buffer (13), so as to obtain the row offset values of the matrix; and
a third output buffer for storing the row offset values.
Preferably, in the compressor, the length of the input buffer (11), the length of the first output buffer (13), and the length of the second output buffer (16) are greater than or equal to the number of rows of the matrix; the units of the input buffer (11) cache the rows of the matrix in parallel, and each unit of the input buffer (11) corresponds to one unit of the first output buffer (13) and one unit of the second output buffer (16);
furthermore, the length of the third output buffer is less than or equal to the number of rows of the matrix.
A decompressor for a deep neural network comprises:
a first input buffer (23) for buffering the data values of the matrix of the deep neural network to be decompressed;
a second input buffer (22) for buffering the column values of the matrix to be decompressed;
a third input buffer (21) for buffering the row offset values of the matrix to be decompressed;
a counter (25) for recording the number of elements read from the third input buffer (21);
a comparator (24) for comparing whether an element read from the third input buffer (21) is equal to the count of the counter (25);
a write controller (26) for storing elements from the first input buffer (23) and the second input buffer (22);
an output buffer (27) for determining, for each count of the counter (25) and when the comparator (24) determines YES, one row of the decompressed matrix from the elements stored in the write controller (26).
Preferably, in the decompressor, the lengths of the first input buffer (23) and the second input buffer (22) are less than or equal to the total number of elements in the matrix, and the length of the third input buffer (21) is less than or equal to the number of rows of the matrix.
Preferably, in the decompressor, the output buffer (27) is further configured to compute, for each count of the counter (25), the individual elements of one row of the decompressed matrix from the data-value elements stored in the write controller (26), the corresponding column-value elements, and the number of columns of the matrix.
Compared with the prior art, the advantages of the present invention are as follows:
An acceleration system for a deep neural network is provided which, by adding a compressor and a decompressor to the system, reduces the amount of data that must be transmitted and/or stored in the on-chip network, thereby relieving the load that combining 3D memory with deep neural network computing units imposes on the on-chip network and reducing data transmission latency. Moreover, the present invention also provides a compressor and a decompressor dedicated to data processed in matrix form in a deep neural network, which automatically compress and decompress the data matrix by exploiting the sparsity of the neural network's data matrix.
Brief Description of the Drawings
Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of the multi-layer structure of a hybrid memory cube (HMC) memory in the prior art;
FIG. 2 is a schematic structural diagram of a prior art solution combining an HMC with deep neural network computing units;
FIG. 3 is a schematic structural diagram in which a compressor and a decompressor are arranged in the router of FIG. 2, according to an embodiment of the present invention;
FIG. 4(a) is a schematic structural diagram in which a compressor is arranged between the memory controller and the router shown in FIG. 2, according to an embodiment of the present invention;
FIG. 4(b) is a schematic structural diagram in which a decompressor is arranged between the router and the computing unit shown in FIG. 2, according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the process of performing a convolution operation on a data matrix (i.e., an image) and a convolution kernel matrix in a deep neural network;
FIG. 6 is a schematic diagram of the CSR encoding method suitable for compressing a sparse matrix;
FIG. 7 is a schematic structural diagram of a compressor for data in matrix form in a deep neural network, according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a decompressor for data in matrix form in a deep neural network, according to an embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the drawings and specific embodiments.
FIG. 2 is a schematic structural diagram of a prior art 3D-memory-based deep neural network acceleration system. As shown in FIG. 2, deep neural network computing units are integrated on the logic layer of the 3D memory, and each deep neural network computing unit is connected, through a memory controller, to the local vault that contains that memory controller. The memory controllers of different vaults exchange data over a common on-chip network, and the local vault's memory controller routes data to remote vaults through the routers of the on-chip network.
Referring to FIG. 2, when a deep neural network computing unit begins a neural network computation, it sends a data request to the memory controller of its local vault. If the requested data is not located in the local vault, the request is injected into the connected on-chip network router and transmitted over the on-chip network to the router corresponding to the remote vault at the destination. The router at the destination delivers the request to the memory controller of its remote vault; the required data is read from the HMC through that memory controller and injected into the on-chip network router, travels across the on-chip network back to the router that issued the request, and is provided to the corresponding deep neural network computing unit for computation.
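For orientation only, the following is a minimal, non-authoritative sketch of the request flow just described; the patent specifies only the behaviour, so the object and method names (local_vault.holds, noc.route_request, and so on) are illustrative assumptions rather than a real API.

```python
def fetch_data(address, local_vault, noc):
    """Serve a data request issued by a deep neural network computing unit."""
    if local_vault.holds(address):
        return local_vault.read(address)          # data resides in the local vault
    # Otherwise the request is injected into the connected router and carried over
    # the on-chip network to the remote vault at the destination; that vault's
    # memory controller reads the data, which then travels back over the on-chip
    # network to the requesting router and on to the computing unit.
    remote_vault = noc.route_request(address)
    return noc.return_data(remote_vault.read(address))
```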
The inventors recognized that the memory controllers of the vaults in the 3D memory are independent of one another and that data must be transferred between vaults over the on-chip network; the characteristics of the vaults themselves can therefore be exploited to place suitable processing devices, for example a compressor for compressing data and a corresponding decompressor, in order to relieve the load that massive data transfers impose on the on-chip network.
Two concrete examples of how the compressor and decompressor for a deep neural network can be arranged according to the present invention are described below.
FIG. 3 shows an embodiment in which the compressor and decompressor are placed inside a router of the on-chip network; for simplicity, only the components corresponding to one crossbar of the router are drawn. As can be seen in FIG. 3, the compressor is connected to the multiplexer whose input comes from the local memory, so that the compressed data is output through the crossbar inside the router to the on-chip network (i.e., northward); conversely, the decompressor is placed at the multiplexer whose input comes from outside the router, so that the received data is decompressed and the decompressed data is output to the deep neural network computing unit for computation. Any suitable compressor and decompressor can be used here; preferably, they are a compressor and decompressor suited to compressing sparse matrix data.
When data in the memory layers of a local vault of the 3D memory is read into the router corresponding to that vault (i.e., the input comes from the memory), the data packet first enters the corresponding virtual channel through the multiplexer. If the packet's tag bit indicates that it has not yet been compressed or that compression is required, the packet is routed through the multiplexer into the compressor, compressed, and its tag bit is updated to indicate that it has been compressed or that no compression is needed. When the packet passes through the virtual channel again, the multiplexer feeds it, according to its tag bit, into the crossbar so that it is output from the router into the on-chip network.
When the on-chip network delivers data from the router of a remote vault into this router (i.e., the input comes from the north), the packet first passes through the multiplexer, virtual channel, and crossbar to the multiplexer on the other side. That multiplexer checks the packet's tag bit to determine whether the packet has been compressed and whether decompression is required, and thus whether the packet must be transferred to the decompressor. A packet transferred into the decompressor is decompressed there and its tag bit is modified; when the packet passes through the multiplexer again, the multiplexer outputs it from the router to the deep neural network computing unit according to its tag bit.
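As a rough illustration of this tag-bit-driven flow, here is a minimal, non-authoritative sketch; the Packet dataclass and the two handler functions are assumptions made for the example, not structures defined by the patent, and the compress/decompress callables stand in for whatever compressor and decompressor are used.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    payload: bytes
    compressed: bool  # the packet's tag bit

def handle_input_from_memory(pkt: Packet, compress) -> Packet:
    # Packet read from the local vault's memory layers: if the tag bit says it is
    # not yet compressed, compress it and update the tag bit before it is sent
    # through the crossbar into the on-chip network.
    if not pkt.compressed:
        pkt.payload = compress(pkt.payload)
        pkt.compressed = True
    return pkt

def handle_input_from_network(pkt: Packet, decompress) -> Packet:
    # Packet arriving from a remote vault over the on-chip network: if the tag bit
    # says it is compressed, decompress it and update the tag bit before it is
    # delivered to the deep neural network computing unit.
    if pkt.compressed:
        pkt.payload = decompress(pkt.payload)
        pkt.compressed = False
    return pkt
```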
FIG. 4 shows another embodiment of the compressor and decompressor placement: in FIG. 4(a) the compressor is placed at the vault's memory controller (that is, between the memory controller and the router), and in FIG. 4(b) the decompressor is placed between the vault's router and the deep neural network computing unit. In the embodiment of FIG. 4, the compressor compresses the data read from memory so that it can be transmitted to a remote vault through the router and the on-chip network, whereas the decompressor decompresses the data that the router receives from the on-chip network, restoring the data matrix used by the deep neural network computing unit for computation.
Referring to FIG. 4(a), after the memory controller reads a data packet from the memory layers of its vault, the multiplexer uses the packet's tag bit to identify packets that need compression and transfers them to the compressor for compression; the tag bits of the compressed packets are modified, and the resulting packets are handed to the router for transmission to the remote vault.
Referring to FIG. 4(b), when the router receives a data packet from the on-chip network, the multiplexer uses the packet's tag bit to route packets that need decompression to the decompressor for decompression; the tag bits of the decompressed packets are modified, and the resulting packets are handed to the deep neural network computing unit for computation.
It should be understood that FIGS. 3 and 4 show only two embodiments of the present invention; the compressor and decompressor may also be placed, as needed, at other suitable locations in the deep neural network acceleration system shown in FIG. 2.
According to one embodiment of the present invention, the compressor may be placed inside a router of the on-chip network. The router may then adaptively decide whether or not to perform compression, based on the current network transmission conditions in the router and/or the sparsity of the data packet (i.e., the proportion of the data whose value is "0"). For example, thresholds are set separately for the fluency of network transmission and for the sparsity of the data; if the fluency of the current network transmission exceeds its threshold and the sparsity of the data exceeds its threshold, the data packets to be routed by the router are compressed.
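A minimal, non-authoritative sketch of such an adaptive decision is shown below; the metric names and the threshold values are illustrative assumptions, since the patent only states that thresholds are set for the fluency of network transmission and for the sparsity of the data.

```python
def sparsity(elements):
    """Proportion of zero-valued elements in a packet's data."""
    return sum(1 for e in elements if e == 0) / len(elements) if elements else 0.0

def should_compress(transmission_fluency, zero_ratio,
                    fluency_threshold=0.8, sparsity_threshold=0.5):
    """Decide whether a packet waiting in the router should be compressed.

    transmission_fluency: measure of current on-chip network transmission (assumed 0..1)
    zero_ratio:           fraction of elements in the packet whose value is "0"
    """
    return transmission_fluency > fluency_threshold and zero_ratio > sparsity_threshold
```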
According to a further embodiment of the invention, the compressor may be placed at a network interface of the on-chip network. The data can be compressed while its content is being packed into a packet, or after packing is finished and before the packet is injected into a router of the on-chip network, in which case the compressed data is encapsulated as a new packet. After compression, the resulting packet (either the packet whose content was compressed or the newly generated compressed packet) is injected into a router of the on-chip network to await transmission. In this way the size or number of packets to be transmitted is reduced, so that no extra burden is placed on the on-chip network routers.
According to yet another embodiment of the invention, the compressor may be placed at the memory controller of the 3D memory. The data read out of memory can then be compressed directly, for example by compressing the data matrix itself, after which the compressed content is encapsulated into packets for routing. This saves the time needed to transfer the data from the memory controller to the network interface; on the other hand, it is difficult in this arrangement to hide the compression latency through pipelining.
In the present invention, and analogously to the compressor, the decompressor may be placed inside a router of the on-chip network, at a network interface of the on-chip network, or in the deep neural network computing unit. Preferably, the location of the decompressor is chosen according to the type of data compressed by the compressor. For example, if the compressor is placed inside a router of the on-chip network, the decompressor may likewise be placed inside a router of the on-chip network.
Other embodiments of the invention provide concrete implementations of the compressor and decompressor. As described above, the data used for computation in a deep neural network has its own characteristic format: it is usually organized as matrices to make the operations convenient. Moreover, deep neural network data tends to be highly sparse, i.e., the matrices contain a large number of elements whose value is "0". The inventors therefore consider that a compressor and decompressor dedicated to deep neural networks can be designed around these properties.
Through their research the inventors found that, in a typical deep neural network, the dominant operation of the computing unit is the convolution of a data matrix with a convolution-kernel matrix. Figure 5 shows a simplified example of this convolution, in which the data matrix is 5×5 and the kernel matrix is 3×3. Referring to Figure 5(a), the computation first takes the elements in rows 1-3 and columns 1-3 of the data matrix as a data sub-matrix of the same size as the kernel; each element of this sub-matrix is multiplied by the element at the corresponding position in the kernel, and the products are accumulated (giving "4" here) to form the element in row 1, column 1 of the convolution result. Then, referring to Figure 5(b), the elements in rows 1-3 and columns 2-4 are taken as the next data sub-matrix and the step is repeated, and so on, until all sub-matrices have been processed and the 3×3 convolution result matrix is obtained.
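As an illustration of this sliding-window accumulation, the following sketch (a plain Python loop, not the hardware computing unit) computes a valid convolution of a 5×5 data matrix with a 3×3 kernel, producing the 3×3 result described above:

```python
def conv2d_valid(data, kernel):
    """Slide the kernel over the data matrix; each output element is the
    sum of element-wise products of the kernel with one data sub-matrix."""
    n, m = len(data), len(data[0])
    k = len(kernel)
    out = [[0] * (m - k + 1) for _ in range(n - k + 1)]
    for i in range(n - k + 1):
        for j in range(m - k + 1):
            out[i][j] = sum(data[i + a][j + b] * kernel[a][b]
                            for a in range(k) for b in range(k))
    return out
```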
In a computational data matrix of a deep neural network such as the one in Figure 5, the activation functions the network relies on (for example, sigmoid) produce many values of "0" during computation, and the pruning performed during computation further increases the amount of data whose value is "0". Here we refer to this abundance of zeros as the "sparsity" of matrix data in deep neural networks. That such matrix data is sparse has been demonstrated in the prior art, for example in "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing" by Jorge Albericio et al., ISCA 2016, and "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding" by S. Han et al., ICLR 2016.
The inventors consider that if the zeros could be removed by encoding for storage and/or transmission, and the original matrix data restored by decoding whenever it is needed, the data volume could be reduced substantially, lowering both the amount of data held in memory and the load on the on-chip network. For example, encodings such as CSR or CSC, or any other coding scheme suitable for compressing sparse matrix data, can be used to accelerate the deep neural network.
Figure 6 illustrates the principle of CSR encoding. Suppose the matrix to be compressed is the 5×8 matrix shown in the figure. All non-zero elements of the matrix are listed one by one as the data values, proceeding from the first row to the last and, within each row, from left to right, giving the data values 1, 5, 2, 4, 3, 6, 7, 8, 9. For each of these data values, the column in which it sits in the matrix is recorded: element 1 is in column 1, so its column number is 1; element 5 is in column 3, so its column number is 3; and so on, giving the column values 1, 3, 3, 5, 1, 6, 3, 8, 4. Furthermore, the first non-zero element (counting from the left) of each row of the matrix is 1, 2, 3, 7, 9 respectively; the positions of these elements within the data values are taken as the row offsets. For example, element 1 is in position 1 of the data values, so its row offset is 1; element 2 is the third element of the data values, so its row offset is 3; and so on, giving the 5 row-offset elements corresponding to the matrix size, namely 1, 3, 5, 7, 9.
It can be seen that the original 5×8 matrix has 40 elements, whereas after CSR encoding only 23 elements are needed to represent its contents (9 data values + 9 column values + 5 row-offset values). This encoding is therefore particularly well suited to compressing sparse matrices.
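The CSR encoding of Figure 6 can be reproduced with a short sketch. Following the figure, column indices and row offsets here are 1-based, the row offset of a row is the position of its first non-zero element within the value list (every row is assumed to contain at least one non-zero element), and the 5×8 matrix below is only a reconstruction consistent with the values given in the description:

```python
def csr_encode(matrix):
    """Return (values, column_indices, row_offsets), all 1-based as in Figure 6."""
    values, cols, row_offsets = [], [], []
    for row in matrix:
        row_offsets.append(len(values) + 1)   # position of the row's first non-zero
        for j, v in enumerate(row, start=1):
            if v != 0:
                values.append(v)
                cols.append(j)
    return values, cols, row_offsets

matrix = [
    [1, 0, 5, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 4, 0, 0, 0],
    [3, 0, 0, 0, 0, 6, 0, 0],
    [0, 0, 7, 0, 0, 0, 0, 8],
    [0, 0, 0, 9, 0, 0, 0, 0],
]
values, cols, offsets = csr_encode(matrix)
# values  -> [1, 5, 2, 4, 3, 6, 7, 8, 9]
# cols    -> [1, 3, 3, 5, 1, 6, 3, 8, 4]
# offsets -> [1, 3, 5, 7, 9]      # 9 + 9 + 5 = 23 elements instead of 40
```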
Based on the CSR encoding procedure described above, and on the fact that deep neural networks organize the data they compute on as matrices, the present invention provides a concrete compressor structure. The compressor can interface with the existing matrix-based computation of data in deep neural networks while compressing that data using CSR.
Figure 7 shows a compressor according to one embodiment of the invention, comprising an input buffer 11, a comparator 12, a counter 14, a switch 15, an output buffer 13 on the left, and an output buffer 16 on the right. Referring to Figure 7, the input buffer 11 is connected to the comparator 12 so that the comparator 12 stores the non-zero elements of the input buffer into the left output buffer 13; based on the elements of the input buffer that are 0, the comparator 12 generates a control signal for the switch 15, so that when the control signal indicates that the switch 15 should store, the count of the counter 14 is written into the right output buffer 16. The counter increments each time an element is read into the comparator 12. It should be understood that, besides being connected to the input buffer 11, the counter 14 may also be connected to the comparator 12, as long as it can record the number of elements read from the input buffer 11.
The input buffer 11 is a multi-bit register used to buffer the neural network computation data matrix, or the part of it, that needs to be compressed. The lengths of the input buffer 11 and/or the left output buffer 13 and the right output buffer 16 can be chosen according to the data size of the matrix. For example, for a matrix with a rows and b columns, a input buffers 11 may be provided, each of length at most b, together with a left output buffers 13 of length at most b and a right output buffers 16 of length at most b.
Taking Figure 7 as an example, suppose that five parallel input buffers 11 are provided for the 5×8 data matrix of Figure 6, each input buffer 11 handling one row of the data matrix and corresponding to one left output buffer 13 and one right output buffer 16. Assume that all input buffers 11 read data at the same rate, and that the counter 14 counts the number of elements read from the input buffers 11. When the count of the counter 14 is 1, the first through fifth input buffers 11 read the elements 1, 0, 3, 0, 0 respectively; when the count is 2, they read the elements 0, 0, 0, 0, 0 respectively; and so on.
Taking the first input buffer 11 as an example, the compression of the 5×8 data matrix of Figure 6 proceeds as follows. When the count of the counter 14 is 1, this input buffer 11 reads the element 1; the comparator 12 connected to it determines that this element is not 0 and stores it in the left output buffer 13 corresponding to this input buffer. When the count of the counter 14 is 2, this input buffer 11 reads the element 0; the comparator 12 determines that the element equals 0 and issues a control signal to the corresponding switch 15, causing the switch 15 to write the count 2 of the counter 14 into the right output buffer 16. Similarly, when the count of the counter 14 is 3, this input buffer 11 reads the element 5, and so on.
After these steps have been performed for every element of every row of the data matrix, the contents stored in the parallel left output buffers 13 are concatenated to form the data values of the matrix, and the contents stored in the parallel right output buffers 16 are concatenated to form the column values of the matrix. Since the position of each buffer queue determines which row of the data matrix each left output buffer 13 and each right output buffer 16 corresponds to, the row offsets can be determined, for each row of the data matrix, by finding the position of the first non-zero element of the corresponding left output buffer 13 within the data values (the row-offset computation unit that performs this comparison is not shown in Figure 7; a third type of output buffer may additionally be provided to hold the obtained row-offset values, whose maximum length depends on the number of rows of the data matrix). The compressor can thus carry out the CSR compression of the contents of a data matrix to obtain its data values, column values, and row-offset values.
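A purely software model of these per-row buffers might look like the sketch below. Here the counter value is recorded for the non-zero elements, so that the right-hand buffers reproduce the column values of Figure 6 and the merge step yields the same CSR triple as the `csr_encode` sketch above; this is one interpretation of the comparator/switch interaction and a behavioral model only, not the hardware itself:

```python
def compress_row(row):
    """Model one input buffer 11 with its comparator 12, counter 14, switch 15."""
    left_out, right_out = [], []       # output buffers 13 (values) and 16 (columns)
    counter = 0                        # counter 14: elements read so far
    for element in row:                # input buffer 11 feeds the comparator 12
        counter += 1
        if element != 0:
            left_out.append(element)   # non-zero element -> left output buffer 13
            right_out.append(counter)  # switch 15 latches the count -> buffer 16
    return left_out, right_out

def merge_rows(rows):
    """Concatenate the parallel per-row buffers and derive the row offsets."""
    values, cols, offsets = [], [], []
    for row in rows:
        left, right = compress_row(row)
        offsets.append(len(values) + 1)  # position of the row's first value (1-based)
        values.extend(left)
        cols.extend(right)
    return values, cols, offsets
```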
Figure 8 shows, according to one embodiment of the invention, a decompressor for decompressing data compressed with the CSR scheme, comprising: an input buffer 21 for the row-offset values, an input buffer 22 for the column values, an input buffer 23 for the data values, a comparator 24, a counter 25, a write controller 26, and an output buffer 27. Referring to Figure 8, the input buffer 21 for the row-offset values feeds the counter 25, which counts its entries, and provides its buffered content to the comparator 24. Based on its count of the row-offset values, the counter 25 sends control signals to the input buffer 22 for the column values and the input buffer 23 for the data values so that they buffer their data, row by row, into the write controller 26, until a control signal obtained by the comparator 24 from the counter 25 and the input buffer 21 for the row-offset values instructs the output buffer 27 to fetch the data from the write controller 26 row by row.
The input buffers 21, 22, 23 for the row-offset values, column values, and data values buffer the row-offset values, column values, and data values of the to-be-decompressed data obtained from the on-chip network while they await decompression. The counter 25 counts the number of entries read from the input buffer 21 for the row-offset values, uses this count to provide control signals to the input buffer 22 for the column values and the input buffer 23 for the data values, and also provides the count to the comparator 24. Since each row-offset value corresponds to one row of the original data matrix, each value the counter 25 reads from the input buffer 21 corresponds to the decompression of one row of that matrix; as long as the row being decompressed has not changed, the counter 25 generates the corresponding control signal telling the input buffers 22, 23 for the column values and data values to supply column values and data values to the write controller 26. The write controller 26 temporarily stores the column values and data values coming from the input buffers 22 and 23, so that they can be written into the output buffer 27 when the output buffer 27 performs a read. The comparator 24 compares the content of the input buffer 21 for the row-offset values with the count of the counter 25; if the two are equal, it generates the corresponding control signal causing the output buffer 27 to read data from the write controller 26. It should be understood that each count of the counter 25 corresponds to the decompression of one row of the data matrix; therefore, besides connecting the counter 25 to the input buffers 22, 23 for the column values and data values, other connection schemes may be used, as long as the output buffer 27 can distinguish the rows of the data matrix according to the count.
As with the compressor, the lengths of the various buffers in the decompressor according to the above embodiment of the invention can be chosen according to the size of the data matrix and/or the size of the convolution kernel in the deep neural network. For example, if the data matrix is 5×8 and the convolution kernel is 4×4, the input buffers 22 and 23 for the column values and data values can each be given a length of 40 data entries (i.e., at most as many elements need to be buffered as the data matrix contains), the input buffer 21 for the row offsets a length of 5 data entries (the number of rows of the data matrix), and the output buffer 27 a length of 8 data entries (assuming one decompressed row is output at a time) or of 40 data entries (if the complete decompressed data matrix is output at a time).
Taking Figure 8 as an example, suppose the decompressor needs to decompress the data values, column values, and row offsets of Figure 6 to restore the original 5×8 matrix. In the decompressor, the row-offset values are output from the input buffer 21 for the row offsets in the order 1, 3, 5, 7, 9. First, the counter 25 and the comparator 24 read the first row-offset value 1; the counter 25 now holds the count 1, indicating that row 1 of the original data matrix is being decompressed. Before the count changes to 2, it issues write control signals to the input buffers 22, 23 for the column values and data values, instructing them to write "1, 3" and "1, 5" into the write controller 26. At the same time, the comparator 24 compares the current row-offset value 1 with the count 1 of the counter 25; since they are equal, the comparator 24 issues a write control signal to the output buffer 27, instructing it to restore the first row of the data matrix to "1, 0, 5, 0, 0, 0, 0, 0" from the data values "1, 5" and column values "1, 3" stored in the write controller 26 together with the number of columns of the matrix, and to output the decompressed first row of data. The counter 25 and the comparator 24 then read the second row-offset value 3, the counter 25 changes its count to 2, and decompression of the second row of the data matrix begins. The preceding steps are repeated, and so on, until all five rows of the original data matrix have been decompressed.
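The corresponding decode step can be modelled in a few lines. As in the encoder sketch above, indices are 1-based, the row-offset convention follows Figure 6, and every row is assumed to contain at least one non-zero element:

```python
def csr_decode(values, cols, row_offsets, num_cols):
    """Rebuild the dense matrix row by row, mirroring the walkthrough above:
    for each row, take the values between its row offset and the next one
    and scatter them into the columns recorded for that row."""
    matrix = []
    bounds = row_offsets + [len(values) + 1]           # 1-based end sentinel
    for r in range(len(row_offsets)):
        start, end = bounds[r] - 1, bounds[r + 1] - 1  # convert to 0-based slices
        row = [0] * num_cols
        for v, c in zip(values[start:end], cols[start:end]):
            row[c - 1] = v
        matrix.append(row)
    return matrix

# csr_decode([1,5,2,4,3,6,7,8,9], [1,3,3,5,1,6,3,8,4], [1,3,5,7,9], 8)
# first row -> [1, 0, 5, 0, 0, 0, 0, 0], as in the walkthrough above
```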
It can be seen here that the compressor and decompressor of Figures 7 and 8 of the present invention are particularly suited to compressing the data matrices of deep neural networks, because such matrices are usually highly sparse, i.e., elements whose value is 0 make up a considerable proportion of them. It will be appreciated that the compressor and decompressor provided by the present invention are applicable not only to schemes that combine a deep neural network computing unit with 3D memory, but also to data matrices in other deep neural networks.
The above embodiments show that the present invention provides a scheme for compressing and decompressing the data content of a deep neural network, so as to reduce the amount of data that has to be transmitted and/or stored in the network. For schemes that use 3D memory in a deep neural network, specific placements of the compressor and decompressor are provided, which relieve the load on the on-chip network and thereby reduce data-transmission latency. In addition, the invention provides a compressor and decompressor dedicated to the matrix-form data used for computation in deep neural networks, which exploit the sparsity of neural network data matrices to compress and decompress them automatically.
It should be noted that not all of the steps described in the above embodiments are mandatory; those skilled in the art may make appropriate omissions, substitutions, modifications, and the like according to actual needs.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. For example, the invention does not restrict the type of 3D memory used; besides the HMC presented in the specific embodiments, High Bandwidth Memory (HBM) may also be used.
Although the present invention has been described in detail above with reference to embodiments, those of ordinary skill in the art will understand that modifications or equivalent substitutions of the technical solutions of the present invention do not depart from the spirit and scope of those solutions, and all such changes fall within the scope of the claims of the present invention.

Claims (10)

  1. An acceleration system for a deep neural network, comprising: a 3D memory; a deep neural network computing unit connected to the memory controller on the logic layer of a vault of the 3D memory; a router connected to the memory controller; and a compressor and a decompressor;
    wherein the memory controller of each vault transmits data over an on-chip network via the router connected to it; and
    wherein the compressor is configured to compress to-be-compressed data for the deep neural network that needs to be transmitted over the on-chip network, and the decompressor is configured to decompress to-be-decompressed data for the deep neural network coming from the on-chip network.
  2. The system of claim 1, wherein the compressor is arranged inside the router, at a network interface of the on-chip network, or at the memory controller, and the decompressor is arranged inside the router, at a network interface of the on-chip network, or in the deep neural network computing unit.
  3. The system of claim 2, wherein the compressor is arranged at the input port through which the router is connected to the 3D memory, and the decompressor is arranged at the output port through which the router is connected to the deep neural network computing unit.
  4. The system of claim 2, wherein the compressor is arranged between the memory controller and the router, and the decompressor is arranged between the router and the deep neural network computing unit.
  5. A compressor for a deep neural network, comprising:
    an input buffer (11) for buffering the matrix data to be compressed in the deep neural network;
    a comparator (12) for reading an element from the input buffer (11) and determining whether the element is 0;
    a counter (14) for recording the number of elements read from the input buffer (11);
    a switch (15), which takes the output of the counter (14) as its input and the output of the comparator (12) as its control signal, for providing the output of the counter (14) when the comparator (12) determines yes;
    a first output buffer (13) for storing the element from the input buffer (11) when the comparator (12) determines no, so as to obtain the data values of the matrix;
    a second output buffer (16) for storing the output of the counter (14) provided by the switch (15), so as to obtain the column values of the matrix.
  6. The compressor of claim 5, further comprising:
    a row-offset computation unit for computing, for each row of the data matrix, the position of the first non-zero element of the first output buffer (13) within the overall output of the first output buffer (13), so as to obtain the row-offset values of the matrix; and
    a third output buffer for storing the row-offset values.
  7. The compressor of claim 6, wherein the length of the input buffer (11), the length of the first output buffer (13), and the length of the second output buffer (16) are greater than or equal to the number of rows of the matrix, the units of the input buffer (11) buffer the rows of the matrix in parallel, and one unit of the input buffer (11) corresponds to one unit of the first output buffer (13) and one unit of the second output buffer (16);
    and the length of the third output buffer is less than or equal to the number of rows of the matrix.
  8. A decompressor for a deep neural network, comprising:
    a first input buffer (23) for buffering the data values of the matrix to be decompressed in the deep neural network;
    a second input buffer (22) for buffering the column values of the matrix to be decompressed in the deep neural network;
    a third input buffer (21) for buffering the row-offset values of the matrix to be decompressed in the deep neural network;
    a counter (25) for recording the number of elements read from the third input buffer (21);
    a comparator (24) for comparing whether an element read from the third input buffer (21) is equal to the count of the counter (25);
    a write controller (26) for storing elements coming from the first input buffer (23) and the second input buffer (22);
    an output buffer (27) for determining, for each count of the counter (25) and when the comparator (24) determines yes, one decompressed row of the matrix from the elements stored in the write controller (26).
  9. The decompressor of claim 8, wherein the length of the first input buffer (23) and the length of the second input buffer (22) are less than or equal to the total number of elements of the matrix, and the length of the third input buffer (21) is less than or equal to the number of rows of the matrix.
  10. The decompressor of claim 9, wherein the output buffer (27) is further configured to compute, for each count of the counter (25), the individual elements of one decompressed row of the matrix from the elements of the data values and the elements of the corresponding column values stored in the write controller (26), together with the number of columns of the matrix.
PCT/CN2018/083880 2017-08-29 2018-04-20 Compression apparatus used for deep neural network WO2019041833A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710753293.4A CN107590533B (en) 2017-08-29 2017-08-29 Compression device for deep neural network
CN201710753293.4 2017-08-29

Publications (1)

Publication Number Publication Date
WO2019041833A1 true WO2019041833A1 (en) 2019-03-07

Family

ID=61050227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/083880 WO2019041833A1 (en) 2017-08-29 2018-04-20 Compression apparatus used for deep neural network

Country Status (2)

Country Link
CN (1) CN107590533B (en)
WO (1) WO2019041833A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590533B (en) * 2017-08-29 2020-07-31 中国科学院计算技术研究所 Compression device for deep neural network
CN110084364B (en) * 2018-01-25 2021-08-27 赛灵思电子科技(北京)有限公司 Deep neural network compression method and device
CN108470009B (en) * 2018-03-19 2020-05-29 上海兆芯集成电路有限公司 Processing circuit and neural network operation method thereof
CN108629410B (en) * 2018-04-28 2021-01-22 中国科学院计算技术研究所 Neural network processing method based on principal component analysis dimension reduction and/or dimension increase
CN108665062B (en) * 2018-04-28 2020-03-10 中国科学院计算技术研究所 Neural network processing system for reducing IO (input/output) overhead based on wavelet transformation
CN108629409B (en) * 2018-04-28 2020-04-10 中国科学院计算技术研究所 Neural network processing system for reducing IO overhead based on principal component analysis
CN109240605B (en) * 2018-08-17 2020-05-19 华中科技大学 Rapid repeated data block identification method based on 3D stacked memory
CN109298884B (en) * 2018-08-29 2021-05-25 北京中科睿芯科技集团有限公司 Universal character operation accelerated processing hardware device and control method
CN109325590B (en) * 2018-09-14 2020-11-03 中国科学院计算技术研究所 Device for realizing neural network processor with variable calculation precision
WO2020062312A1 (en) * 2018-09-30 2020-04-02 华为技术有限公司 Signal processing device and signal processing method
CN109104197B (en) * 2018-11-12 2022-02-11 合肥工业大学 Coding and decoding circuit and coding and decoding method for non-reduction sparse data applied to convolutional neural network
US12008475B2 (en) 2018-11-14 2024-06-11 Nvidia Corporation Transposed sparse matrix multiply by dense matrix for neural network training
CN109785905B (en) * 2018-12-18 2021-07-23 中国科学院计算技术研究所 Accelerating device for gene comparison algorithm
CN109800869B (en) * 2018-12-29 2021-03-05 深圳云天励飞技术有限公司 Data compression method and related device
CN110738310B (en) * 2019-10-08 2022-02-01 清华大学 Sparse neural network accelerator and implementation method thereof
CN110889259B (en) * 2019-11-06 2021-07-09 北京中科胜芯科技有限公司 Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix
CN110958177B (en) * 2019-11-07 2022-02-18 浪潮电子信息产业股份有限公司 Network-on-chip route optimization method, device, equipment and readable storage medium
CN110943744B (en) * 2019-12-03 2022-12-02 嘉楠明芯(北京)科技有限公司 Data compression, decompression and processing method and device based on data compression and decompression
CN111240743B (en) * 2020-01-03 2022-06-03 格兰菲智能科技有限公司 Artificial intelligence integrated circuit
CN111431539B (en) * 2020-03-04 2023-12-08 嘉楠明芯(北京)科技有限公司 Compression method and device for neural network data and computer readable storage medium
US11604976B2 (en) * 2020-04-29 2023-03-14 International Business Machines Corporation Crossbar arrays for computations in memory-augmented neural networks
CN116661707B (en) * 2023-07-28 2023-10-31 北京算能科技有限公司 Data processing method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140167987A1 (en) * 2012-12-17 2014-06-19 Maxeler Technologies Ltd. Systems and methods for data compression and parallel, pipelined decompression
CN105184362A (en) * 2015-08-21 2015-12-23 中国科学院自动化研究所 Depth convolution neural network acceleration and compression method based on parameter quantification
CN106447034A (en) * 2016-10-27 2017-02-22 中国科学院计算技术研究所 Neutral network processor based on data compression, design method and chip
CN106557812A (en) * 2016-11-21 2017-04-05 北京大学 The compression of depth convolutional neural networks and speeding scheme based on dct transform
CN107590533A (en) * 2017-08-29 2018-01-16 中国科学院计算技术研究所 A kind of compression set for deep neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013120497A1 (en) * 2012-02-15 2013-08-22 Curevac Gmbh Nucleic acid comprising or coding for a histone stem-loop and a poly(a) sequence or a polyadenylation signal for increasing the expression of an encoded therapeutic protein
CN107092961B (en) * 2017-03-23 2018-08-28 中国科学院计算技术研究所 A kind of neural network processor and design method based on mode frequency statistical coding

Also Published As

Publication number Publication date
CN107590533A (en) 2018-01-16
CN107590533B (en) 2020-07-31

Similar Documents

Publication Publication Date Title
WO2019041833A1 (en) Compression apparatus used for deep neural network
US20190196907A1 (en) Compression techniques for distributed data
US10771090B2 (en) Data processing unit having hardware-based range encoding and decoding
TWI814975B (en) Method and apparatus for storage media programming with adaptive write buffer release, and the system-on-chip thereof
JP7221242B2 (en) Neural network data processor, method and electronics
EP2104236B1 (en) Table device, variable-length encoding device, variable-length decoding device, and variable-length encoding/decoding device
US20140108481A1 (en) Universal fpga/asic matrix-vector multiplication architecture
US9176977B2 (en) Compression/decompression accelerator protocol for software/hardware integration
US9479194B2 (en) Data compression apparatus and data decompression apparatus
JPH04213223A (en) Device for variable-length coding-decoding of digital signal
US20190123763A1 (en) Data compression engine for dictionary based lossless data compression
US20200142642A1 (en) Data processing unit having hardware-based range encoding and decoding
US10877668B2 (en) Storage node offload of residual part of a portion of compressed and distributed data to a second storage node for decompression
JP2022551266A (en) Representation format of neural network
CN107801044B (en) Backward adaptive device and correlation technique
US20210224191A1 (en) Compression and decompression module in a cache controller for reducing off-chip data traffic
CN103974090B (en) Image encoding apparatus
CN112290953B (en) Array encoding device and method, array decoding device and method for multi-channel data stream
CN106817584A (en) A kind of MJPEG compressions implementation method and FPGA based on FPGA
WO2020211000A1 (en) Apparatus and method for data decompression
US11424761B2 (en) Multiple symbol decoder
JP2023155450A5 (en) Computer system and computer system control method
RU2450441C1 (en) Data compression method and apparatus
CN115705150A (en) System, method and apparatus for partitioning and compressing data
JP7381393B2 (en) Conditional transcoder and transcoding method for encoded data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18850251

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18850251

Country of ref document: EP

Kind code of ref document: A1