CN110766136A - Compression method of sparse matrix and vector - Google Patents

Compression method of sparse matrix and vector

Info

Publication number
CN110766136A
Authority
CN
China
Prior art keywords
segment
elements
data
zero
sparse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910982345.4A
Other languages
Chinese (zh)
Other versions
CN110766136B (en)
Inventor
杨建磊 (Yang Jianlei)
赵巍胜 (Zhao Weisheng)
付文智 (Fu Wenzhi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Aeronautics and Astronautics
Original Assignee
Beijing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Aeronautics and Astronautics
Priority to CN201910982345.4A
Publication of CN110766136A
Application granted
Publication of CN110766136B
Active legal status
Anticipated expiration of legal status

Classifications

    • G06N5/04: Inference or reasoning models (computing arrangements using knowledge-based models)
    • G06N3/045: Combinations of networks (neural network architecture, e.g. interconnection topology)
    • G06N3/063: Physical realisation of neural networks using electronic means
    • G06N3/08: Learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a compression method for sparse matrices and vectors, termed the enhanced coordinate list (ECOO). During compression, the matrix or vector is divided into a plurality of data segments along its rows (or columns). After segmentation, each non-zero element is represented by a triplet comprising its value, its offset within the segment, and an end-of-group (EOG) flag. The method can represent arbitrarily large sparse matrices and vectors with a limited bit width, and decoding can be completed at low cost. Compared with the naive coordinate list (COO) format, it is therefore particularly suitable for hardware with limited bit width that stores and accesses sparse matrices or vectors by rows or columns.

Description

Compression method of sparse matrix and vector
Technical Field
The invention relates to the technical field of data processing for neural network processors, and in particular to a compression method for sparse matrices and vectors.
Background
In recent years, deep learning has achieved growing success in fields such as image recognition and speech processing. However, as network depth increases, the computing power, memory bandwidth and other resources demanded by deep neural network training and inference are becoming difficult for traditional computing platforms to satisfy.
The sparsity of a neural network refers to the proportion of zeros contained in its weights and features. During the training and inference of a highly sparse neural network, most operations involve one or more operands equal to zero and can be removed without affecting the result. Sparse matrices and vectors are also important objects in scientific and engineering practice: they contain a large number of zero elements, and removing these zeros reduces the space required for storage. An ill-chosen compression scheme, however, reduces the storage and usage efficiency of sparse matrices and vectors, so various compression methods have been proposed to optimize their storage and computation. For example, in the COO (Coordinate List) format, each non-zero element of a sparse matrix is represented by a triplet (row index, column index, value); in the CSR (Compressed Row Storage) format, each non-zero element retains only its relative offset from the previous non-zero element. When sparse matrices and vectors are processed on dedicated hardware, however, these existing formats impose constraints on the computation. The COO format cannot easily compress arbitrarily large sparse matrices with a limited bit width, while the CSR format requires the relative offsets to be accumulated before subsequent operations, incurring extra computational overhead. A new compression format that allows sparse matrices and vectors to be processed efficiently on dedicated hardware is therefore of great practical significance.
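As a small illustration (not part of the patent text), the following Python sketch contrasts COO triplets with the relative-offset scheme described above for a toy row vector; the accumulation loop at the end is the decoding overhead referred to:

```python
# Illustration only: a toy 1x8 sparse row and two ways to index its non-zeros.
dense = [0, 5, 0, 0, 3, 0, 0, 8]

# COO: each non-zero element is kept as a (row, column, value) triplet.
coo = [(0, 1, 5), (0, 4, 3), (0, 7, 8)]

# Relative-offset encoding as described above: each non-zero element keeps
# its distance from the previous non-zero element instead of an absolute index.
rel = [(2, 5), (3, 3), (3, 8)]  # offsets: 1-(-1)=2, 4-1=3, 7-4=3

# Recovering absolute positions requires accumulating the offsets,
# which is the extra computational overhead mentioned above.
pos, acc = [], -1
for off, val in rel:
    acc += off
    pos.append((acc, val))
assert pos == [(1, 5), (4, 3), (7, 8)]
```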
Disclosure of Invention
To address at least these technical problems, the invention provides a compression method for sparse matrices and vectors. The method can compress arbitrarily large matrices and vectors with a limited bit width and is particularly suitable for hardware processing of sparse matrices and vectors. Compared with CSR, matrices and vectors compressed by this method can be fed directly to a computing unit for processing, reducing the computational overhead of the decoding process.
The complete technical scheme of the invention comprises the following steps:
a compression method of sparse vectors comprises the following steps:
(1) segmenting elements in the vector according to a given length;
(2) mark the offset value of each datum in the segment; specifically, for element x_i in the segment, the offset value is recorded as N-i, where x_i is the i-th element of the segment and N is the total number of elements in the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
The vector compression method used may also be the following steps:
(1) segmenting elements in the vector according to a given length;
(2) mark the offset value of each datum in the segment; specifically, for element x_i in the segment, the offset value is recorded as i-1, where x_i is the i-th element of the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
Further, for the above two vector compression methods, in step (2), in addition to labeling the offset values in a sequential or reverse order, the offset values may be labeled in an arbitrary order.
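For illustration only, a minimal Python sketch of the two vector compression methods above; the function name and the `reverse` flag (selecting N-i versus i-1 labeling) are my own:

```python
def ecoo_compress_vector(vec, seg_len, reverse=True):
    """Sketch of steps (1)-(5): compress a vector into ECOO
    (value, offset, EOG) triplets. reverse=True labels offsets as N-i
    (the last element of a segment gets offset 0); reverse=False labels
    them as i-1 (the first element gets offset 0)."""
    triplets = []
    for start in range(0, len(vec), seg_len):           # step (1): segment
        seg = vec[start:start + seg_len]
        n = len(seg)
        offsets = [n - i if reverse else i - 1           # step (2): offsets
                   for i in range(1, n + 1)]
        # steps (3)-(4): drop zeros; keep one zero as a placeholder
        # when the whole segment is zero.
        kept = [(v, o) for v, o in zip(seg, offsets) if v != 0]
        if not kept:
            kept = [(seg[0], offsets[0])]                # first zero element
        # step (5): EOG = 1 on the remaining element with the largest offset.
        max_off = max(o for _, o in kept)
        triplets += [(v, o, 1 if o == max_off else 0) for v, o in kept]
    return triplets

# e.g. ecoo_compress_vector([0, 0, 3, 0, 7, 0], seg_len=6)
#      -> [(3, 3, 1), (7, 1, 0)]
```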
A compression method of a sparse matrix comprises the following steps:
(1) the elements in the matrix are segmented row by row or column by column with a given length,
(2) mark the offset value of each datum in the segment; specifically, for element x_i in the segment, the offset value is recorded as N-i, where x_i is the i-th element of the segment and N is the total number of elements in the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
The matrix compression method used may also be the following steps:
(1) the elements in the matrix are segmented row by row or column by column with a given length,
(2) mark the offset of each datum in the segment; specifically, for element x_i in the segment, the offset is recorded as i-1, where x_i is the i-th element of the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
Further, for the above two matrix compression methods, in step (2), in addition to labeling the offset values in a sequential or reverse order, the offset values may also be labeled in an arbitrary order.
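Under the same assumptions, a sketch of row-by-row (or column-by-column) matrix compression that reuses the vector routine above; the example matrix is hypothetical:

```python
def ecoo_compress_matrix(mat, seg_len, by_row=True, reverse=True):
    """Sketch: compress a matrix row by row (or column by column) by
    applying the vector routine above to each row (column)."""
    lines = mat if by_row else [list(col) for col in zip(*mat)]
    return [ecoo_compress_vector(line, seg_len, reverse) for line in lines]

# Hypothetical 2 x 6 matrix, segment length 3 as in the example of FIG. 3:
# ecoo_compress_matrix([[0, 2, 0, 0, 0, 0],
#                       [1, 0, 4, 0, 0, 5]], seg_len=3)
# -> [[(2, 1, 1), (0, 2, 1)], [(1, 2, 1), (4, 0, 0), (5, 0, 1)]]
```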
The disclosed compression method can represent arbitrarily large sparse matrices and vectors with a limited bit width, and decoding can be completed at low cost. Compared with the naive coordinate list (COO) format, it is therefore particularly suitable for hardware with limited bit width that stores and accesses sparse matrices or vectors by rows or columns. Meanwhile, compared with CSR, matrices and vectors compressed by this method can be fed directly to the computing units of a neural network processor, reducing the computational overhead of the decoding process.
Drawings
Fig. 1 is a flow chart of compressing a sparse vector according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating a process of compressing a sparse vector according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an exemplary sparse matrix for processing in accordance with the present invention.
Fig. 4 is a schematic diagram of the sparse matrix shown in fig. 3 after being compressed by the method proposed by the present invention.
FIG. 5a is a schematic diagram of an embodiment of a sparse neural network processor operating on compressed data of the present invention.
FIG. 5b is a block diagram of a PE in an embodiment of the sparse neural network processor of FIG. 5a using the architecture of the present invention.
FIG. 6 is a schematic diagram of a CE in the embodiment of the sparse neural network processor of FIG. 5a employing the architecture of the present invention.
FIG. 7 is a graph comparing performance with a naive systolic array when running a real neural network.
In the figure, a, b and c are results of three different neural networks of AlexNet, VGG16 and ResNet50 respectively.
FIG. 8 is a graph of the evaluation of the sensitivity to sparsity of a neural network processor employing the compression method of the present invention.
In the figure, a, b and c are results of three different neural networks of AlexNet, VGG16 and ResNet50 respectively.
FIG. 9 is a diagram illustrating a process of performing operations on the compressed sparse vector according to the embodiment of the PE shown in FIG. 5b.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by combining specific embodiments with the accompanying drawings.
In order to improve the efficiency of sparse matrix and vector storage and calculation, the invention provides a compression method for sparse matrices and vectors, so that arbitrarily large matrices and vectors can be represented with a limited bit width.
FIG. 1 is a flow chart for compressing sparse vectors according to one embodiment of the present invention. The sparse vector is first divided into segments of a given length, and subsequent operations are performed at segment granularity. Then the offset value of each datum within the segment is recorded, and the data that need not be retained are removed. Finally, after EOG labeling is applied within each segment, the compression process is complete.
Fig. 2 illustrates the compression process in conjunction with a particular sparse vector.
For a sparse vector with many zero elements, the elements are first divided into segments of a given length 6, giving 3 segments in total.
The offset of each datum in the segment is then recorded: for the last datum in the segment the offset is 0, for the datum one position before it the offset is 1, for the datum two positions before it the offset is 2, and so on for all data in the segment.
Each element of the segment is then checked for being zero or non-zero. If the segment contains no non-zero element, the first zero element of the segment is retained as a placeholder; if it contains non-zero elements, all zero elements in the segment are removed.
EOG labeling is then applied to the data in the segment: the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
The offsets may also be labeled in the opposite order: the first datum in the segment has offset 0, the next datum has offset 1, the one after that has offset 2, and so on for all data in the segment. When EOG labeling is performed in step 4, the element with the largest offset value among the remaining elements is again marked 1 and the others 0.
The method can also be used to compress sparse matrices. FIG. 3 shows a typical sparse matrix to be compressed according to the present invention. The segment length is set to 3, and after segmenting and compressing row by row, the compression result shown in FIG. 4 is obtained. The compressed triplet of each non-zero element includes its value, its offset within the segment, and its EOG flag. If a segment has no non-zero element, one zero element is retained in the segment as a placeholder. Finally, the element with the largest offset value among the remaining elements of each segment is marked as EOG, completing the compression process.
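As an illustration of the low-cost decoding claimed above, a sketch that rebuilds one dense segment from its triplets, assuming the same (value, offset, EOG) layout and segment length as in the compression sketches:

```python
def ecoo_decode_segment(triplets, seg_len, reverse=True):
    """Sketch of the low-cost decode: rebuild one dense segment from its
    ECOO triplets. With reverse offsets (N-i) an offset o maps back to
    index N-1-o; with forward offsets (i-1), o is the index itself."""
    seg = [0] * seg_len
    for value, offset, _eog in triplets:
        idx = seg_len - 1 - offset if reverse else offset
        seg[idx] = value
    return seg

# ecoo_decode_segment([(3, 3, 1), (7, 1, 0)], seg_len=6)
# -> [0, 0, 3, 0, 7, 0]
```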
The compression method can be applied to the systolic-array neural network processor designed by the inventors, where compressed matrices and vectors can be fed directly to the processor's computing units, reducing the computational overhead of decoding. The processor, and the application of the compression method to it, are briefly described as follows:
The sparse neural network processor applying this compression method comprises a storage unit, a control unit, a bus (confluence) array, a sparse matrix operation array and a calculation unit. The storage unit stores weights, gradients, features and the instruction sequences used for data-flow scheduling. The control unit, connected to the storage unit, fetches the required data from storage according to the instruction sequence, reshapes it into matrix-operation form, bridges it through the bus array, and sends it to the sparse matrix operation array to complete the corresponding calculation. The calculation unit receives the results of the sparse matrix operation array and applies the non-linear activation function, pooling and similar operations to complete the final computation. The processor exploits sparsity to improve speed and power efficiency during the training and inference of various neural networks.
The weights and the intermediate results (the features of each layer) are fetched from the storage unit under the scheduling of the control unit, bridged by the bus array, and sent to the sparse matrix operation array. The sparse matrix operation array sends its results to the calculation unit for the required non-linear operations. Finally, under the scheduling of the control unit, the results are taken from the calculation unit and written back to the storage unit for the computation of the next layer.
The sparse matrix operation array comprises a plurality of processing elements PE_ij. Each processing element PE consists of a dynamic selection (DS) component, a multiply-accumulate (MAC) component and a result forwarding (RF) component, and has two data input ports A and B and two data output ports A' and B'. Ports A and A' may be responsible for data transfer between rows and ports B and B' for data transfer between columns, or vice versa. Each PE also has an input port C and an output port C' for forwarding calculation results. When multiple PEs are assembled into a systolic array, the A port of each PE is connected to the A' port of another PE, and likewise B to B' and C to C'.
During processing, the two input data streams enter the processing element PE from A and B respectively, first pass through the dynamic selection (DS) module, and are then output to the adjacent PEs through the A' and B' ports respectively.
In this process, the DS component, which bridges the compressed sparse data streams in the A-A' and B-B' directions, selects the data pair (a, b) to be calculated and outputs it to the MAC component. The MAC component contains an internal register storing the partial sum c; on receiving the pair (a, b) it performs the multiply-accumulate operation c ← c + ab. When the computation is complete, the accumulated sum c is output to the RF component, which sends the result through the C' port to the RF component of the adjacent PE, thereby forwarding it out of the systolic array. The results of other PEs enter through the C port, pass through the RF component, and are sent onwards to the adjacent PE through the C' port, so that the results of all PEs are forwarded out of the systolic array.
taking a sparse matrix-matrix multiplication process of 4 × 4 as an example, let a be (a ═ as shown in the following equationij) And B ═ Bij) For both sparse matrices, the zero element in both is represented by "0". C ═ Cij) Is the product of A and B.
[Matrix equation from the original publication (image not reproduced): the 4 × 4 sparse matrices A and B and their product C.]
Each row of matrix A is compressed and fed into a different column of the systolic array, and each column of matrix B is compressed and fed into a different row. The data input/output ports A, A', B, B' of each PE handle the data transfer between rows and columns. In this process, the DS component selects the data pairs that the PE needs to compute (if element c_21 of matrix C is assigned to PE_21, the PE in the second row and first column, then PE_21 needs to select pairs such as (a_22, b_21)) and outputs them to the MAC component for multiply-accumulate operations. After the computation is complete, the accumulated sum c is output to the RF component, which passes the result through the C' port to the RF component of the adjacent PE, forwarding it out of the systolic array.
In multi-layer perceptrons (MLPs), most of the computational work of training and inference can be decomposed into sparse matrix-vector multiplications, and in convolutional neural networks most of it can be decomposed into sparse convolution operations. In this processor, the assembled sparse matrix is therefore compressed and sent to the sparse matrix operation array to complete the corresponding calculation, with each PE in the array independently computing one element of the result matrix R. The sparse matrix assembled from a convolution operation is handled in the same way: it is compressed and sent to the array, and again each PE independently computes one element of the result matrix R. A processor applying this compression method can thus exploit sparsity to improve the speed and power efficiency of neural network training and inference.
Moreover, in many applications the bit width required by different data in the same matrix or vector varies, and representing all values with a single uniform bit width incurs unnecessary overhead. General-purpose computing platforms, however, can hardly accelerate fine-grained mixed-precision data effectively, while adding an extra datapath to an accelerator for high-precision data brings its own costs: when the proportion of high-precision data is too low, the high-precision datapath sits idle and is wasted; when it is too high, the fully loaded high-precision datapath blocks and degrades the performance of the whole systolic array. The invention therefore computes mixed-precision data on a unified datapath in the accelerator, exploiting the differences in data precision to optimize storage space and computation power consumption. Specifically:
for the vectors with sparse mixed precision, a unified data path is used for processing, firstly, the input sparse mixed precision vectors are preprocessed, and the data is divided into two or more precision levels. For example, an 8-bit data path is adopted for data processing, 16-bit unsigned fixed point number in a vector is split into two 8-bit unsigned fixed point numbers, an additional mark is adopted for marking in the data compression process, and then the two 8-bit unsigned fixed point numbers are fed into a PE for normal processing.
When two 16-bit data meet at the same PE, they are split into four 8-bit pairs that are sent to the PE in turn for processing.
The above mixed precision processing method may also be used to process floating point data.
The above-described processing of unsigned mixed-precision data can likewise be applied to signed data. The mixed-precision sparse vector-vector multiplication performed by a single PE extends to one- or two-dimensional systolic arrays, and hence to sparse mixed-precision matrix-vector operations or mixed-precision sparse matrix-matrix operations. Optimization is thus achieved by exploiting differences in data precision while avoiding the overhead of an extra high-precision datapath.
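As a sketch of the arithmetic behind this splitting (illustrative only; the hardware performs these partial products in sequence on its 8-bit datapath):

```python
def split_u16(x):
    """Split a 16-bit unsigned value into (hi, lo) 8-bit halves (sketch)."""
    return (x >> 8) & 0xFF, x & 0xFF

def mixed_precision_product(a16, b16):
    """Multiply two 16-bit operands using only 8-bit partial products,
    as when two 16-bit data meet at the same PE (sketch):
    a*b = (ah*256 + al) * (bh*256 + bl)
        = (ah*bh << 16) + (ah*bl << 8) + (al*bh << 8) + al*bl"""
    ah, al = split_u16(a16)
    bh, bl = split_u16(b16)
    return (ah * bh << 16) + (ah * bl << 8) + (al * bh << 8) + (al * bl)

assert mixed_precision_product(50000, 1234) == 50000 * 1234
```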
On the other hand, during neural network computation the same data may need to be sent to different rows of the sparse matrix operation array. In convolutional neural network inference, the feature maps sent to different rows of the systolic array for the corresponding convolutions often overlap. If each row is given an independent memory, the overlap means that copies of the same data must be stored in several memories, wasting storage space; during operation the same data must also be read repeatedly from several memories to feed different rows of the array, adding memory-access overhead. The neural network processor applying this compression method therefore bridges the data flow from the storage unit to the sparse matrix operation array with a bus (confluence) array to reduce storage overhead. The bus array is composed of several bus cells (CEs), each containing a local memory. A CE can receive data from outside the array or from an adjacent CE, and can buffer received data internally or output it out of the array. Specifically, each CE has an off-array input port C, an off-array output port D, and data ports A and B connected to neighbouring CEs: it receives data from outside the array through port C, exchanges data with other CEs through ports A and B, and outputs data out of the array through port D.
A typical data-transfer scenario for the bus array is as follows. Let x_0, x_1 and x_2 denote three different data blocks, and l_0, l_1 and l_2 three output ports, with the requirements: block x_0 must reach port l_0; blocks x_0 and x_1 must reach port l_1; and blocks x_0, x_1 and x_2 must reach port l_2. Let the CEs in the array be numbered CE_0, CE_1 and CE_2. First, blocks x_0, x_1 and x_2 are sent to the off-array data ports of CE_0, CE_1 and CE_2 respectively. Each CE first sends its own block to its off-array data output port while retaining it in internal memory. CE_1 then receives block x_0 from CE_0, sends it to its own off-array data port, and keeps it in internal memory; CE_2 likewise receives block x_1 from CE_1, outputs it, and retains it. In the next stage, CE_2 receives block x_0 from CE_1 and sends it to its off-array data port. The bus array can thus deliver data held in one memory to one or more output interfaces, removing data redundancy across different memories and reducing the required memory capacity. At the same time, by preventing the same data from being written to and repeatedly read from multiple memories, storage power consumption is reduced.
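A toy software model of this transfer pattern (illustrative only; names and the stage-by-stage scheduling are simplifications of the hardware behaviour described above):

```python
def confluence_forward(own_blocks, needs):
    """Toy model of the bus-array transfer described above (sketch).

    own_blocks[k] is the block initially sent to CE_k's off-array input
    port; needs[k] is the set of blocks that output port l_k must
    receive. Each CE emits its own block first, keeps a local copy of
    every block it sees, and pulls missing blocks from its neighbour,
    so each block is read from the backing memory only once."""
    n = len(own_blocks)
    emitted = [[] for _ in range(n)]   # what each off-array port l_k outputs
    local = [set() for _ in range(n)]  # each CE's local memory
    for k, blk in enumerate(own_blocks):
        emitted[k].append(blk)         # stage 0: emit own block
        local[k].add(blk)              # ... and retain it internally
    changed = True
    while changed:                     # later stages: forward along the chain
        changed = False
        for k in range(1, n):
            for blk in sorted(local[k - 1]):
                if blk in needs[k] and blk not in local[k]:
                    local[k].add(blk)          # retain in internal memory
                    emitted[k].append(blk)     # send to own off-array port
                    changed = True
    return emitted

# The example above: l_0 needs x0; l_1 needs x0, x1; l_2 needs x0, x1, x2.
print(confluence_forward(["x0", "x1", "x2"],
                         [{"x0"}, {"x0", "x1"}, {"x0", "x1", "x2"}]))
# -> [['x0'], ['x1', 'x0'], ['x2', 'x0', 'x1']]
```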
In the following, a specific embodiment is used to evaluate and explain the overall performance and technical effect of the disclosed vector and matrix compression method, together with the systolic-array-based sparse neural network processor shown in FIG. 5a that adopts it, by converting the operation of three different neural networks (AlexNet, VGG16, ResNet50) into sparse matrix operations.
First, as shown in FIG. 5b, the sparse neural network processor based on the compression method of the present invention uses synchronous sequential logic in this embodiment, buffers the input data streams with FIFOs in the DS component, and uses an 8-bit fixed-point multiplier-accumulator in the MAC component. The encoding segment length is set to 16. The DS components run at a higher frequency than the MAC components so that the MAC components are kept as fully loaded as possible.
As shown in FIG. 6, the bus cell likewise uses synchronous sequential logic and uses FIFOs of depth 16 for temporary data storage.
The following examples and figures use the labels below:
delay (cycles): the number of cycles required to complete the operation of a given neural network;
operation speed (ops): the number of operations (MACs) completed per unit time, where zero operations are not removed when counting the operands of a given neural network;
power efficiency (ops/W): the operation speed provided per unit of power consumption;
area efficiency (ops/m²): the operation speed provided per unit of area;
"x, y, z" in the figure legends: the depths of the feature FIFO (F-FIFO), weight FIFO (W-FIFO) and weight-feature-pair FIFO (WF-FIFO) are x, y and z respectively;
real neural network: a sparse neural network obtained by compressing a neural network with an existing pruning algorithm;
synthetic (generated) neural network: a neural network generated layer by layer to meet given targets such as sparsity and the proportion of 16-bit data.
As shown in FIG. 7, taking a 16 × 16 systolic array as an example, the invention achieves a stable performance improvement over a naive systolic array when running various real networks under different parameter configurations, showing that it can exploit sparsity to accelerate computation. Notably, the speedup essentially converges while the FIFOs are still small, which avoids the area and power overhead of introducing oversized FIFOs. Likewise, the speedup essentially converges at a small DS-to-MAC frequency ratio, so the DS components can run at a lower frequency to avoid hardware design overhead.
FIG. 8 shows the sensitivity of the neural network processor to sparsity, evaluated with a series of generated neural networks of different sparsity after the input data has been compressed, again with the systolic array size fixed at 16 × 16.
As FIG. 8 shows, the neural network processor can exploit sparsity to optimize computation speed. When the input neural network is completely dense, the power consumed by the additional components lowers the power efficiency of the systolic array; however, because the input data is compressed, the required buffer capacity shrinks and a smaller on-chip buffer (SRAM) can be used, yielding an advantage in area efficiency. Compared with a naive systolic array, the invention therefore effectively improves operation speed, power efficiency and area efficiency by exploiting data sparsity in sparse matrix operations.
Fig. 9 is a schematic diagram illustrating a process of performing an operation on the compressed sparse vector according to the embodiment of the PE shown in FIG. 5b. The behavior in each cycle is described below.
In cycle 0, the offset (0) of the data in the weight register is smaller than the offset (1) of the data in the feature register, so the weight stream lags the feature stream. The feature stream stays still while the first element of the weight FIFO (W-FIFO) is fetched, written into the weight register (W-Reg) and simultaneously forwarded to the adjacent PE.
In cycle 1, the offset (1) of the data in the weight register equals the offset (1) of the data in the feature register, so the weight-feature pair (w_1, f_0) is the desired pair and is fed into the MAC component. In addition, the first element of the weight FIFO is fetched, written into the weight register and forwarded to the adjacent PE; likewise, the first element of the feature FIFO (F-FIFO) is fetched, written into the feature register (F-Reg) and forwarded to the adjacent PE.
In cycle 2, the situation is the reverse of cycle 0.
In cycle 3, although the offset (3) of the data in the weight register is smaller than the offset (4) of the data in the feature register, the data in the weight register has reached the end-of-segment flag (EOG = 1); as in cycle 2, the weight stream stays still and the feature stream advances.
In cycle 4, the data in the weight register and the data in the feature register have both reached the end mark, so both streams advance. At this point the PE has finished processing the data segment.
In cycle 5, the PE begins processing the next data segment.
It can be seen that, simply by comparing the offsets of the data and the EOG flag, the dynamic selection component in the PE can select the data pairs to be computed from the two input data streams. This compression strategy therefore reduces the design complexity of the PE and enables the performance improvement.
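For illustration, a Python sketch of this selection logic for one pair of segments; the function and variable names are my own, and offsets are assumed to increase along each stream (the forward i-1 labeling), as in the cycles above:

```python
def ds_select_pairs(weight_seg, feature_seg):
    """Sketch of the dynamic-selection (DS) logic for one pair of
    segments. Inputs are lists of (value, offset, eog) triplets whose
    offsets increase along the stream and whose last triplet has
    eog = 1. Equal offsets yield a pair for the MAC component;
    otherwise the lagging stream advances; a stream that has reached
    EOG = 1 waits for the other."""
    pairs, i, j = [], 0, 0
    while True:
        wv, wo, weog = weight_seg[i]
        fv, fo, feog = feature_seg[j]
        if wo == fo:
            pairs.append((wv, fv))     # matching offsets: feed the MAC
        if weog and feog:
            break                      # both segments finished
        if (not weog) and (feog or wo <= fo):
            i += 1                     # weight stream lags (or both match)
        if (not feog) and (weog or fo <= wo):
            j += 1                     # feature stream lags (or both match)
    return pairs

# e.g. ds_select_pairs([(2, 0, 0), (5, 1, 1)], [(3, 1, 1)]) -> [(5, 3)]
# The MAC accumulation would then be:
# c = sum(w * f for w, f in ds_select_pairs(ws, fs))
```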
The invention has been further illustrated above using specific embodiments. It should be noted that these embodiments are only specific instances of the invention and should not be construed as limiting it. Any modification, replacement or improvement within the spirit of the invention falls within its scope of protection.

Claims (6)

1. A method for compressing a sparse vector, comprising the steps of:
(1) segmenting elements in the vector according to a given length;
(2) mark the offset value of each datum in the segment; specifically, for element x_i in the segment, the offset value is recorded as N-i, where x_i is the i-th element of the segment and N is the total number of elements in the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain one zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
2. A method for compressing a sparse vector, comprising the steps of:
(1) segmenting elements in the vector according to a given length;
(2) mark the offset value of each datum in the segment; specifically, for element x_i in the segment, the offset value is recorded as i-1, where x_i is the i-th element of the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
3. A method for compressing a sparse vector, comprising the steps of:
(1) segmenting elements in the vector according to a given length;
(2) mark the offset value of each datum in the segment; specifically, the offset values of the elements in the segment are labeled in an arbitrary order;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
4. A compression method of a sparse matrix is characterized by comprising the following steps:
(1) the elements in the matrix are segmented row by row or column by column with a given length,
(2) mark the offset value of each datum in the segment; specifically, for element x_i in the segment, the offset value is recorded as N-i, where x_i is the i-th element of the segment and N is the total number of elements in the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
5. A compression method of a sparse matrix is characterized by comprising the following steps:
(1) the elements in the matrix are segmented row by row or column by column with a given length,
(2) mark the offset of each datum in the segment; specifically, for element x_i in the segment, the offset is recorded as i-1, where x_i is the i-th element of the segment;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
6. A compression method of a sparse matrix is characterized by comprising the following steps:
(1) the elements in the matrix are segmented row by row or column by column with a given length,
(2) mark the offset of each datum in the segment; specifically, the offset values of the elements in the segment are labeled in an arbitrary order;
(3) determine whether each element in the segment is zero or non-zero;
(4) if the segment contains no non-zero element, retain the first (or any) zero element of the segment as a placeholder; if the segment contains non-zero elements, remove all zero elements from the segment;
(5) perform EOG labeling on the data in the segment; specifically, the element with the largest offset value among the remaining elements is marked 1, and the other elements are marked 0.
CN201910982345.4A 2019-10-16 2019-10-16 Compression method of sparse matrix and vector Active CN110766136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910982345.4A CN110766136B (en) 2019-10-16 2019-10-16 Compression method of sparse matrix and vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910982345.4A CN110766136B (en) 2019-10-16 2019-10-16 Compression method of sparse matrix and vector

Publications (2)

Publication Number Publication Date
CN110766136A true CN110766136A (en) 2020-02-07
CN110766136B CN110766136B (en) 2022-09-09

Family

ID=69331377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910982345.4A Active CN110766136B (en) 2019-10-16 2019-10-16 Compression method of sparse matrix and vector

Country Status (1)

Country Link
CN (1) CN110766136B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336758A (en) * 2013-06-29 2013-10-02 中国科学院软件研究所 Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN107273094A (en) * 2017-05-18 2017-10-20 中国科学院软件研究所 One kind is adapted to the data structure and its efficient implementation method that HPCG optimizes on " light in martial prowess Taihu Lake "
US20190042542A1 (en) * 2018-03-28 2019-02-07 Intel Corporation Accelerator for sparse-dense matrix multiplication
CN109255429A (en) * 2018-07-27 2019-01-22 中国人民解放军国防科技大学 Parameter decompression method for sparse neural network model
CN109726314A (en) * 2019-01-03 2019-05-07 中国人民解放军国防科技大学 Bitmap-based sparse matrix compression storage method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN Tao, "Research on a Hybrid Scheduling Model Based on Parallel Sparse Matrix Algorithms", China Master's Theses Full-text Database, Information Science and Technology Series *
WU Guiming, "Sparse Matrix Blocking Method for Custom Architectures", Computer Science *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113835754A (en) * 2021-08-26 2021-12-24 电子科技大学 Active sparsification vector processor
CN114416180A (en) * 2022-03-28 2022-04-29 腾讯科技(深圳)有限公司 Vector data compression method, vector data decompression method, device and equipment
CN114416180B (en) * 2022-03-28 2022-07-15 腾讯科技(深圳)有限公司 Vector data compression method, vector data decompression method, device and equipment

Also Published As

Publication number Publication date
CN110766136B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN110705703B (en) Sparse neural network processor based on systolic array
CN109063825B (en) Convolutional neural network accelerator
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN107256424B (en) Three-value weight convolution network processing system and method
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN112668708B (en) Convolution operation device for improving data utilization rate
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN112836813B (en) Reconfigurable pulse array system for mixed-precision neural network calculation
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN110766136B (en) Compression method of sparse matrix and vector
CN110580519A (en) Convolution operation structure and method thereof
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN112639839A (en) Arithmetic device of neural network and control method thereof
CN110764602B (en) Bus array for reducing storage overhead
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
JP6888073B2 (en) Chip equipment and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant