CN113835754A - Active sparsification vector processor - Google Patents
- Publication number
- CN113835754A (application number CN202110986231A)
- Authority
- CN
- China
- Prior art keywords
- naf
- bit
- compressed
- input
- circuit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an active sparsification vector processor, belonging to the technical field of vector operation control and implementation. The vector processor comprises an input temporary storage circuit, an arithmetic operation circuit, and an output temporary storage circuit. The input temporary storage circuit receives the control signal and the input data, the arithmetic operation circuit processes the input data according to the control signal, and the output temporary storage circuit stores and sends the operation result. The vector processor processes vectors directly, retains no intermediate results, and performs fewer redundant operations and fewer memory accesses; by actively sparsifying the input data it achieves better performance, lower power consumption, and greater generality; and by supporting multiple data formats and multiple operation modes on the same hardware it occupies a smaller area and can be used flexibly.
Description
Technical Field
The invention belongs to the technical field of vector operation control and implementation, and particularly relates to an active sparsification vector processor.
Background
The inference process of a neural network involves a large number of vector operations, in scenarios such as image and video denoising, feature extraction, object recognition, and voice keyword retrieval. The data are by nature multidimensional vectors, and the inference process decomposes into basic logical operations, mainly vector multiplication, vector addition, and vector comparison. However, for both fixed-point and floating-point numbers, conventional processors mostly adopt single-instruction-multiple-data (SIMD) parallelism to increase computational capability: several basic operations execute simultaneously, with no logical association between operands. For vector operations, this parallel mode ignores the logical association among vector elements and therefore introduces a large number of redundant intermediate operations, and the capability of the basic operations also limits the performance of vector computation.
In a single-instruction-multiple-data design, computing the dot product of two P-dimensional vectors A and B requires an intermediate variable c, and the operation c = c + A_i * B_i must be executed P times to obtain the final result. This retains P intermediate results c, that is, P redundant operations of formatting, storing, and reading data, and the P operations must be executed sequentially, so parallelism offers no advantage at all.
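A minimal Python sketch (illustrative names, not from the patent) of the accumulation pattern just described, making the P retained intermediates explicit:

```python
def simd_style_dot(a, b):
    """SIMD-style sequential dot product: the running sum c is an
    intermediate result after every one of the P multiply-accumulate
    steps, so the steps serialize, and in a real SIMD pipeline each
    value of c would be formatted, stored, and re-read."""
    c = 0
    intermediates = []
    for ai, bi in zip(a, b):
        c = c + ai * bi          # c := c + A_i * B_i, executed P times
        intermediates.append(c)  # one retained intermediate per step
    return c, intermediates

total, steps = simd_style_dot([1, 2, 3], [4, 5, 6])
# total == 32 after P == 3 sequential steps (intermediates [4, 14, 32])
```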
Existing bit-skipping multiply-add circuits include bit-serial, fully parallel, and partially parallel designs, all of which use the shift-accumulate method: when the corresponding bit is 1, the operand is added; when it is 0, no operation is performed. A bit-serial circuit examines one bit at a time, a fully parallel circuit examines all bits at once, and a partially parallel circuit examines a subset of the bits at a time. These designs can only rely on the natural bit-level sparsity of the data to accelerate multiply-add operations; such sparsity is not common, so many meaningless operations remain.
Taking the dot product of the P-dimensional vectors A and B as an example, bit skipping is performed according to the sparsity of each element. The mathematical expression is:

A · B = Σ_{i=1..P} Σ_{j=0..Q-1} B_{i,j} · 2^j · A_i

where Q is the number of bits of the value of each element of vector B, and B_{i,j} is the j-th bit of the i-th element in the generated control flow, whose value differs according to the algorithm. The number of additions equals the total number of non-zero bits of the elements of B: on average half of the bits of each element are non-zero, i.e. Q/2 additions per element, with a minimum of 0 and a maximum of the total number of bits.
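The shift-accumulate scheme behind this expression can be sketched as follows (a hedged illustration, not the patent's circuit); the cost in additions is exactly the number of non-zero bits of the multiplier, which is what bit skipping exploits:

```python
def shift_add_multiply(a, b, q=8):
    """Multiply a by an unsigned q-bit integer b via shift-and-accumulate:
    each set bit of b costs one addition of a shifted copy of a; zero bits
    are skipped, so the addition count equals b's non-zero bit count."""
    acc, additions = 0, 0
    for j in range(q):
        if (b >> j) & 1:          # bit-skipping: zero bits cost nothing
            acc += a << j         # add a * 2^j
            additions += 1
    return acc, additions

prod, adds = shift_add_multiply(7, 0b10100101)
# b = 165 has four set bits, so prod == 7 * 165 == 1155 after adds == 4 additions
```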
In summary, the prior art performs redundant calculations and has low computational performance.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide an active sparsification vector processor.
The invention solves the technical problem described above as follows:
an active sparsification vector processor comprises an input temporary storage circuit, an arithmetic operation circuit and an output temporary storage circuit;
the input temporary storage circuit comprises an input controller and an input register; the input controller receives the handshake signals, the control signals and the input data sent by the input end, stores the control signals and the input data into the input register, and feeds the handshake signals back to the input end;
the arithmetic operation circuit comprises a preprocessing circuit, an operation controller, an arithmetic operator and a formatting circuit;
the preprocessing circuit reads the control signal and the input data from the input register and preprocesses the input data according to the control signal; the operation controller reads the control signal from the input register and generates the control signal group of the arithmetic operator from the control signal and the preprocessed input data; the arithmetic operator performs mathematical operations on the preprocessed input data according to the control signal group to obtain the operation result; the formatting circuit converts the operation result into a specified format and sends it to the output temporary storage circuit;
the output temporary storage circuit comprises an output controller and an output register; and the output controller receives the operation result converted into the specified format, stores the operation result into the output register and sends the operation result to the output end.
Furthermore, the working mode of the active sparsification vector processor is floating-point dot multiplication, floating-point addition, floating-point maximum, fixed-point dot multiplication, fixed-point addition, or fixed-point maximum.
Further, when the working mode is floating-point dot multiplication, the input data comprise two input vectors. The preprocessing circuit splits and recombines each pair of corresponding elements of the two input vectors into a sign, an exponent, a first mantissa, and a second mantissa; all exponents are compared to obtain the maximum exponent; the current exponent is subtracted from the maximum exponent to obtain the mantissa offset; the second mantissa is converted to a compressed NAF code, which is right-shifted by the mantissa offset to align the mantissas. The operation controller uses the aligned compressed NAF code for bit-jump zero control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator; the selected elements of the first input vector are NAF-weighted and output as data to be operated on, until all non-zero bits of the compressed NAF code have been traversed. The addition tree and the accumulator of the arithmetic operator are fully activated; they sum and accumulate the data to be operated on and output the accumulated result to the formatting circuit. The formatting circuit packs the accumulated result and the maximum exponent from the preprocessing circuit into floating-point format.
Further, when the working mode is floating-point addition, the input data comprise two input vectors. The preprocessing circuit splits each pair of corresponding elements of the two input vectors into a first sign, a first exponent, a first mantissa, a second sign, a second exponent, and a second mantissa; the first and second exponents of each pair are compared and subtracted to obtain the maximum exponent and the mantissa offset; the mantissa corresponding to the smaller exponent is shifted by the mantissa offset to align the mantissas, and the aligned first and second mantissas form an element group. The operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed. All adders in the arithmetic operator are activated; they sum the elements of the groups selected by the operation controller and output the sums to the formatting circuit. The formatting circuit packs the sum of each element group with the corresponding maximum exponent from the preprocessing circuit into floating-point format.
Furthermore, when the working mode is floating-point maximum, the input data is one input vector. The preprocessing circuit compares all elements of the input vector, finds the extreme value and its index, and sends them directly to the formatting circuit; the formatting circuit retains the extreme value and packs its index separately.
Further, when the working mode is fixed-point dot multiplication, the input data comprise two input vectors. The preprocessing circuit sign-extends all elements of the two input vectors and converts all elements of the second input vector into compressed NAF codes. The operation controller performs bit-jump zero control on the compressed NAF codes and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator; the selected elements are NAF-weighted and output as data to be operated on, until all non-zero bits of the compressed NAF codes have been traversed. The addition tree and the accumulator of the arithmetic operator are fully activated; they sum and accumulate the data to be operated on and output the accumulated result to the formatting circuit. The formatting circuit re-quantizes the accumulated result according to the fixed-point quantization requirement.
Further, when the working mode is fixed-point addition, the input data comprise two input vectors. The preprocessing circuit sign-extends all elements of the two input vectors; the two elements at corresponding positions of the two input vectors form an element group. The operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed. All adders in the arithmetic operator are activated; they sum the elements of the groups selected by the operation controller and output the sums to the formatting circuit. The formatting circuit re-quantizes the sums of all element groups according to the fixed-point quantization requirement.
Furthermore, when the working mode is fixed-point maximum, the input data is one input vector. The preprocessing circuit compares all elements of the input vector, finds the extreme value and its index, and sends them directly to the formatting circuit; the formatting circuit retains the extreme value and packs its index separately.
Further, the conversion method for the compressed NAF code is as follows:
the signed binary data is first converted into a separated NAF encoding, in one of two ways;
the specific process of the first separated NAF encoding is:
right-shift the N-bit signed binary data x by one bit (arithmetic shift) to obtain a first N-bit binary intermediate result xh, where N is a positive integer; add x and xh and discard the overflow to obtain a second N-bit binary intermediate result x3; combine x3 and xh bitwise to obtain the +1 flag and the -1 flag of the first separated NAF encoding: ANDing x3 with the complement of xh gives the +1 flag naf.pos, and ANDing the complement of x3 with xh gives the -1 flag naf.neg;
when the n-th bits of the +1 flag and the -1 flag of the first separated NAF encoding are both 0, the n-th bit of the NAF is 0, where 0 ≤ n ≤ N-1; when they are 0 and 1 respectively, the n-th bit of the NAF is -1; when they are 1 and 0 respectively, the n-th bit of the NAF is +1;
the specific process of the second separated NAF encoding is:
right-shift the N-bit signed binary data x by one bit (arithmetic shift) to obtain a first N-bit binary intermediate result xh, where N is a positive integer; add x and xh and discard the overflow to obtain a second N-bit binary intermediate result x3; combine x3 and xh bitwise to obtain the -1 flag and the non-zero flag of the second separated NAF encoding: ANDing the complement of x3 with xh gives the -1 flag naf.neg, and XORing x3 with xh gives the non-zero flag naf.non0;
when the n-th bits of the -1 flag and the non-zero flag of the second separated NAF encoding are both 1, the n-th bit of the NAF is -1; when both are 0, the n-th bit of the NAF is 0; when they are 0 and 1 respectively, the n-th bit of the NAF is +1;
the separated NAF encoding is then converted into the compressed NAF encoding, in one of two ways;
for an N-bit NAF, if N is odd, a 0 is prepended before the most significant bit to obtain an extended NAF of N+1 bits, and M = N+1; if N is even, the N-bit NAF itself serves as the extended NAF and M = N;
the first compressed NAF encoding is:
if bits 2m+1 and 2m of the NAF are {-1, 0}, with 0 ≤ m ≤ M/2-1, the first compressed NAF code is 110, representing a compressed NAF digit of value -2; if they are {0, -1}, the code is 101, representing the digit -1; if they are {0, 0}, the code is 000, representing the digit 0; if they are {0, 1}, the code is 001, representing the digit 1; if they are {1, 0}, the code is 010, representing the digit 2;
the second compressed NAF encoding is:
if bits 2m+1 and 2m of the NAF are {-1, 0}, with 0 ≤ m ≤ M/2-1, the second compressed NAF code is 111, representing a compressed NAF digit of value -2; if they are {0, -1}, the code is 101, representing the digit -1; if they are {0, 0}, the code is 000, representing the digit 0; if they are {0, 1}, the code is 001, representing the digit 1; if they are {1, 0}, the code is 011, representing the digit 2.
The invention has the beneficial effects that:
the vector processor directly processes the vector, has a simpler vector operation control flow and is more friendly to a large-bandwidth memory; the vector processor directly obtains vector operation results without reserving intermediate results in the processing process, and has less redundant operation and less access times; the vector processor actively sparsifies the input data, converts the input data into a form with strong bit-level sparsity, can perform operation acceleration by applying a bit-level sparsity optimization method unconditionally, and has better performance, lower power consumption and more universality; the vector processor realizes the support of multiple data formats and multiple operation modes on the same hardware, and has smaller area and flexible use mode.
Drawings
FIG. 1 is a block diagram of a vector processor according to the present invention;
FIG. 2 is a flow chart of a first separated NAF encoding mode in the embodiment;
fig. 3 is a flow chart of a second separated NAF encoding method in the embodiment.
Detailed Description
The invention is further described below with reference to the figures and examples.
The present embodiment provides an active sparsification vector processor, the structural composition diagram of which is shown in fig. 1, and which includes an input temporary storage circuit, an arithmetic operation circuit, and an output temporary storage circuit;
the input temporary storage circuit comprises an input controller and an input register; the input controller receives the handshake signals, the control signals and the input data sent by the input end, stores the control signals and the input data into the input register, and feeds the handshake signals back to the input end;
the arithmetic operation circuit comprises a preprocessing circuit, an operation controller, an arithmetic operator and a formatting circuit;
the preprocessing circuit reads the control signal and the input data from the input register, extracts the working-mode information from the control signal, routes the data stream for the corresponding working mode, and preprocesses the input data; the operation controller reads the control signal from the input register, extracts the working-mode information, and generates the control signal group of the arithmetic operator from the working-mode information and the preprocessed input data; the arithmetic operator routes the data stream for the corresponding working mode according to the control signal group and performs mathematical operations on the preprocessed input data to obtain the operation result; the formatting circuit converts the operation result into a specified format and sends it to the output temporary storage circuit;
the output temporary storage circuit comprises an output controller and an output register; and the output controller receives the operation result converted into the specified format, stores the operation result into the output register and sends the operation result to the output end.
The working mode of the active sparsification vector processor in this embodiment is floating-point dot multiplication, floating-point addition, floating-point maximum, fixed-point dot multiplication, fixed-point addition, or fixed-point maximum.
When the working mode is floating-point dot multiplication, the input data comprise two floating-point input vectors. The preprocessing circuit routes the data stream for the floating-point dot-multiplication mode and splits and recombines each pair of corresponding elements of the two input vectors into a sign, an exponent, a first mantissa, and a second mantissa. All exponents are fed to a comparator, which outputs the maximum exponent; the maximum exponent and the current exponent are fed to a subtracter, whose difference is the mantissa offset. The second mantissa is converted to a compressed NAF code and fed to a shifter whose right-shift amount is the mantissa offset, aligning the mantissas. The operation controller uses the aligned compressed NAF code for bit-jump zero control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator; the control signal group carries the working-mode information. The selected elements are NAF-weighted and output as data to be operated on, until all non-zero bits of the compressed NAF code have been traversed. The arithmetic operator routes the data stream according to the control signal group; the addition tree and the accumulator are fully activated, sum and accumulate the data, and output the accumulated result to the formatting circuit, which packs the accumulated result and the maximum exponent from the preprocessing circuit into floating-point format.
When the working mode is floating-point addition, the input data comprise two floating-point input vectors. The preprocessing circuit routes the data stream for the floating-point addition mode and splits each pair of corresponding elements into a first sign, a first exponent, a first mantissa, a second sign, a second exponent, and a second mantissa. The first and second exponents of each pair are fed to a comparator, which outputs the maximum exponent, and to a subtracter, whose difference is the mantissa offset. The mantissa corresponding to the smaller exponent is fed to a shifter whose right-shift amount is the mantissa offset, aligning the mantissas; the aligned first and second mantissas form an element group. The operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed; the control signal group carries the working-mode information. All adders in the arithmetic operator are activated, sum the elements of the selected groups, and output the sums to the formatting circuit, which packs the sum of each element group with the corresponding maximum exponent from the preprocessing circuit into floating-point format.
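The comparator, subtracter, and shifter chain described above can be illustrated with a simplified behavioral sketch (Python, integer mantissas, illustrative names; the real design is a circuit and also handles signs and normalization, which are omitted here):

```python
def align_and_add(e1, m1, e2, m2):
    """Align two (exponent, unsigned integer mantissa) pairs the way the
    preprocessing circuit does: the comparator picks the maximum exponent,
    the subtracter gives the mantissa offset, and the shifter right-shifts
    the mantissa belonging to the smaller exponent before the adder sums."""
    emax = max(e1, e2)            # comparator output: maximum exponent
    off = abs(e1 - e2)            # subtracter output: mantissa offset
    if e1 < e2:
        m1 >>= off                # shifter aligns the smaller operand
    else:
        m2 >>= off
    return emax, m1 + m2          # sum, packed with the maximum exponent

# 1.5 * 2**3 + 1.0 * 2**1 with 4 fractional mantissa bits:
e, m = align_and_add(3, 0b11000, 1, 0b10000)
# e == 3, m == 28, i.e. (28 / 16) * 2**3 == 14.0
```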
When the working mode is floating-point maximum, the input data is one floating-point input vector. The preprocessing circuit routes the data stream for the floating-point maximum mode; all elements of the input vector are fed to a comparator, which outputs the extreme value and its index directly to the formatting circuit. The formatting circuit retains the extreme value and packs its index separately.
When the working mode is fixed-point dot multiplication, the input data comprise two fixed-point input vectors. The preprocessing circuit routes the data stream for the fixed-point dot-multiplication mode, sign-extends all elements of the two input vectors, and converts all elements of the second input vector into compressed NAF codes. The operation controller performs bit-jump zero control on the compressed NAF codes and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator; the control signal group carries the working-mode information. The selected elements are NAF-weighted and output as data to be operated on, until all non-zero bits of the compressed NAF codes have been traversed. The arithmetic operator routes the data stream according to the control signal group; the addition tree and the accumulator are fully activated, sum and accumulate the data, and output the accumulated result to the formatting circuit, which re-quantizes it according to the fixed-point quantization requirement.
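A behavioral sketch of this zero-skipping dot product (Python, illustrative names; for simplicity it uses plain NAF digits rather than the patent's compressed NAF, so only the skip-and-weight idea is shown):

```python
def naf_digits(x):
    """Non-adjacent form (NAF) digits of integer x, least significant first;
    each digit is in {-1, 0, +1} and no two adjacent digits are non-zero."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)   # x mod 4 == 1 -> +1, x mod 4 == 3 -> -1
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits

def naf_dot(a, b):
    """Dot product in which each element of b is NAF-recoded and zero
    digits are skipped (the 'bit-jump zero control'): only non-zero NAF
    digits cost a shifted, NAF-weighted addition of the paired a element."""
    acc, additions = 0, 0
    for ai, bi in zip(a, b):
        for j, d in enumerate(naf_digits(bi)):
            if d:                        # skip zero digits entirely
                acc += d * (ai << j)     # NAF weighting: add ±a_i * 2^j
                additions += 1
    return acc, additions

total, adds = naf_dot([3, 5], [7, 12])
# 3*7 + 5*12 == 81, reached with only 4 additions
# (7 and 12 each have two non-zero NAF digits)
```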
When the working mode is fixed-point addition, the input data comprise two fixed-point input vectors. The preprocessing circuit routes the data stream for the fixed-point addition mode and sign-extends all elements of the two input vectors; the two elements at corresponding positions of the two vectors form an element group. The operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed; the control signal group carries the working-mode information. All adders in the arithmetic operator are activated, sum the elements of the selected groups, and output the sums to the formatting circuit, which re-quantizes the sums of all element groups according to the fixed-point quantization requirement.
When the working mode is fixed-point maximum, the input data is one fixed-point input vector. The preprocessing circuit routes the data stream for the fixed-point maximum mode; all elements of the input vector are fed to a comparator, which outputs the extreme value and its index directly to the formatting circuit. The formatting circuit retains the extreme value and packs its index separately.
When the working mode is floating-point or fixed-point dot multiplication, the arithmetic operation circuit can take the form of a bit-serial, fully parallel, or partially parallel algorithm. With the fully parallel algorithm, the number of addition-tree channels is determined by the maximum bit width of the preprocessed data format; with the partially parallel algorithm, it is determined by the required parallelism; with the bit-serial algorithm, the addition tree degenerates into a 2-input adder with two channels.
The conversion method for the compressed NAF code is as follows:
There are two ways to convert signed binary data into the separated NAF encoding.
Fig. 2 shows the flow chart of the first separated NAF encoding, which proceeds as follows:
right-shift the N-bit signed binary data x by one bit (arithmetic shift) to obtain a first N-bit binary intermediate result xh, where N is a positive integer; add x and xh and discard the overflow to obtain a second N-bit binary intermediate result x3; combine x3 and xh bitwise to obtain the +1 flag and the -1 flag of the first separated NAF encoding: ANDing x3 with the complement of xh gives the +1 flag naf.pos, and ANDing the complement of x3 with xh gives the -1 flag naf.neg;
when the n-th bits of the +1 flag and the -1 flag are both 0, the n-th bit of the NAF is 0, where 0 ≤ n ≤ N-1; when they are 0 and 1 respectively, the n-th bit of the NAF is -1; when they are 1 and 0 respectively, the n-th bit of the NAF is +1. The case where the n-th bits of both the +1 flag and the -1 flag are 1 cannot occur. The first separated NAF encoding table is shown in Table 1.
TABLE 1 First separated NAF encoding table

-1 flag, bit n | +1 flag, bit n | NAF bit n |
---|---|---|
1 | 0 | -1 |
0 | 1 | +1 |
0 | 0 | 0 |
1 | 1 | does not occur |
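The first separated NAF method can be sketched in software. The function below is an illustrative model only (the function name and the mask-based modeling of fixed-width registers are ours, not the patent's): masks emulate the arithmetic right shift and the overflow-discarding addition on an n-bit two's-complement pattern.

```python
def separated_naf_1(x: int, n: int):
    """First separated NAF encoding: return (naf_pos, naf_neg) flag words.

    x is treated as an n-bit two's-complement value; masks model the
    fixed-width registers described in the text above.
    """
    mask = (1 << n) - 1
    u = x & mask                          # n-bit pattern of x
    sign = (u >> (n - 1)) & 1
    xh = (u >> 1) | (sign << (n - 1))     # arithmetic right shift: xh = x >> 1
    x3 = (u + xh) & mask                  # x3 = x + xh, overflow discarded
    naf_pos = x3 & ~xh & mask             # +1 flags: x3 AND NOT xh
    naf_neg = ~x3 & xh & mask             # -1 flags: NOT x3 AND xh
    return naf_pos, naf_neg
```

For every 4-bit x, weighting the +1 flags positively and the -1 flags negatively reconstructs x, the two flags never overlap, and the non-zero digits are non-adjacent, consistent with Table 1.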
Fig. 3 shows a flow chart of the second separated NAF encoding method, which comprises the following steps:
Right-shift the N-bit signed binary data x by one bit (arithmetic shift) to obtain an N-bit first intermediate result xh, where N is a positive integer. Add x and xh, discarding any overflow, to obtain an N-bit second intermediate result x3. Combine x3 and xh bitwise to obtain the -1 flag and the non-0 flag of the second separated NAF encoding: ANDing the bitwise complement of x3 with xh yields the -1 flag naf.neg; XORing x3 with xh yields the non-0 flag naf.non0.
When bit n of both the -1 flag and the non-0 flag is 1, bit n of the NAF is -1; when bit n of both flags is 0, bit n of the NAF is 0; when bit n of the -1 flag is 0 and bit n of the non-0 flag is 1, bit n of the NAF is +1. The case where bit n of the -1 flag is 1 and bit n of the non-0 flag is 0 cannot occur. The second separated NAF encoding table is shown in Table 2.
TABLE 2 Second separated NAF encoding table

-1 flag, bit n | non-0 flag, bit n | NAF bit n |
---|---|---|
1 | 1 | -1 |
0 | 1 | +1 |
0 | 0 | 0 |
1 | 0 | does not occur |
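The second separated NAF method differs only in the two bitwise combinations it emits. A matching illustrative sketch (naming and register modeling ours, as above):

```python
def separated_naf_2(x: int, n: int):
    """Second separated NAF encoding: return (naf_neg, naf_non0) flag words."""
    mask = (1 << n) - 1
    u = x & mask
    sign = (u >> (n - 1)) & 1
    xh = (u >> 1) | (sign << (n - 1))   # arithmetic right shift
    x3 = (u + xh) & mask                # add, overflow discarded
    naf_neg = ~x3 & xh & mask           # -1 flags: NOT x3 AND xh
    naf_non0 = (x3 ^ xh) & mask         # non-0 flags: x3 XOR xh
    return naf_neg, naf_non0
```

Per Table 2, the +1 positions are the non-0 bits that are not -1 bits, and the {1, 0} flag combination never arises.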
There are two ways to convert the separated NAF encoding to a compressed NAF encoding.
By exploiting the property that non-zero NAF digits are never adjacent, each group of two NAF digits is re-encoded into a single digit taking a value in {-2, -1, 0, 1, 2}: the number of non-zero digits is unchanged while the total number of digits is halved. To simplify zero-skipping control, two "compressed NAF" encoding schemes are proposed: the first directly generates a 3-bit value in "sign-magnitude" (sign-and-absolute-value) format; the second uses the 3 bits as separate flags for "sign", "multiplied by 2" and "non-zero".
For an N-bit NAF, if N is odd, a 0 is prepended before the most significant digit to obtain an (N+1)-bit extended NAF, and M = N+1; if N is even, the N-bit NAF itself serves as the extended NAF and M = N.
The first compressed NAF encoding method is:
If bits 2m+1 and 2m of the extended NAF are {-1, 0}, where 0 ≤ m ≤ M/2-1, the first compressed NAF code is 110, representing a compressed-NAF digit value of -2; if they are {0, -1}, the code is 101 (value -1); if {0, 0}, the code is 000 (value 0); if {0, 1}, the code is 001 (value +1); if {1, 0}, the code is 010 (value +2).
The second compressed NAF encoding method is:
If bits 2m+1 and 2m of the extended NAF are {-1, 0}, where 0 ≤ m ≤ M/2-1, the second compressed NAF code is 111 (value -2); if they are {0, -1}, the code is 101 (value -1); if {0, 0}, the code is 000 (value 0); if {0, 1}, the code is 001 (value +1); if {1, 0}, the code is 011 (value +2). Because non-zero NAF digits are never adjacent, the combinations {-1, +1}, {-1, -1}, {+1, -1} and {+1, +1} cannot occur.
The two compressed NAF encoding tables are shown in Table 3.
TABLE 3 The two compressed NAF encoding tables

NAF bits {2m+1, 2m} | Digit value | First code (sign-magnitude) | Second code (sign / ×2 / non-0) |
---|---|---|---|
{-1, 0} | -2 | 110 | 111 |
{0, -1} | -1 | 101 | 101 |
{0, 0} | 0 | 000 | 000 |
{0, 1} | +1 | 001 | 001 |
{1, 0} | +2 | 010 | 011 |
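The digit-pair re-encoding can be sketched as two lookup tables. The helper below is illustrative (its name and interface are ours): it takes a list of NAF digits, least-significant digit first, pads an odd length with a leading 0, and emits the compressed digit values with both 3-bit codes.

```python
def compress_naf(digits):
    """Pair NAF digits (LSB first, each in {-1, 0, 1}) into compressed NAF.

    Returns (values, first_codes, second_codes); the two code tables
    follow the first (sign-magnitude) and second (sign/x2/non-0) schemes.
    """
    if len(digits) % 2:                  # odd N: prepend a 0 before the MSB
        digits = digits + [0]
    first = {-2: 0b110, -1: 0b101, 0: 0b000, 1: 0b001, 2: 0b010}
    second = {-2: 0b111, -1: 0b101, 0: 0b000, 1: 0b001, 2: 0b011}
    values, c1, c2 = [], [], []
    for m in range(len(digits) // 2):
        v = digits[2 * m] + 2 * digits[2 * m + 1]  # digit pair -> {-2..2}
        values.append(v)
        c1.append(first[v])
        c2.append(second[v])
    return values, c1, c2
```

For example, the NAF of 7 is (+1, 0, 0, -1), i.e. [-1, 0, 0, 1] LSB first, which compresses to the digits (-1, +2).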
In this embodiment, vector basic operations such as dot product, addition and maximum are computed directly as a whole; the inputs and outputs are vectors whose elements are logically correlated in a mathematical sense. For the dot product, all multiplications may be computed in parallel and the accumulation is then completed with an adder tree, so no intermediate results need to be saved and no redundant intermediate operations exist; meanwhile, because a large number of identical computations share an optimized digital circuit design, the method achieves higher performance, smaller area and lower power consumption.
In this embodiment, the Non-Adjacent Form (NAF) is used for bit-level sparsification: data are converted into the encoding with the strongest bit-level sparsity, reducing the average number of non-zero bits to one third of the total bit count and the maximum number to one half. Processing is thus no longer limited by the natural bit-level sparsity of the data; by artificially creating sparsity, the bit-level zero-skipping method can be applied unconditionally, meaningless intermediate operations are eliminated, and the amount of computation is greatly reduced.
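To illustrate how zero-skipping turns the dot product into a sum of shifted partial terms, here is a software sketch (the function names and structure are ours, not the patent's): each element of the second vector is expanded into NAF digits, and only non-zero digits generate work, mimicking the control flow that the operation controller, adder tree and accumulator implement in hardware.

```python
def naf_dot(a, b, n=8):
    """Zero-skipping dot product: sum shifted/negated copies of a[i]
    for each non-zero NAF digit of b[i]; zero digits generate no work."""
    def naf_digits(x, n):
        # NAF digits (LSB first) via the first separated-NAF flag method
        mask = (1 << n) - 1
        u = x & mask
        xh = (u >> 1) | (((u >> (n - 1)) & 1) << (n - 1))  # arithmetic shift
        x3 = (u + xh) & mask                               # overflow discarded
        pos, neg = x3 & ~xh & mask, ~x3 & xh & mask
        return [((pos >> i) & 1) - ((neg >> i) & 1) for i in range(n)]

    acc = 0                                  # models the hardware accumulator
    for ai, bi in zip(a, b):
        for i, d in enumerate(naf_digits(bi, n)):
            if d:                            # skip zero digits
                acc += (ai << i) if d > 0 else -(ai << i)
    return acc
```

On average only one third of the digit positions are non-zero, so roughly two thirds of the inner-loop iterations are skipped.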
Claims (9)
1. An active sparsification vector processor is characterized by comprising an input temporary storage circuit, an arithmetic operation circuit and an output temporary storage circuit;
the input temporary storage circuit comprises an input controller and an input register; the input controller receives the handshake signals, the control signals and the input data sent by the input end, stores the control signals and the input data into the input register, and feeds the handshake signals back to the input end;
the arithmetic operation circuit comprises a preprocessing circuit, an operation controller, an arithmetic operator and a formatting circuit;
the preprocessing circuit reads the control signal and the input data in the input register and preprocesses the input data according to the control signal; the arithmetic controller reads the control signal in the input register and generates a control signal group of the arithmetic operator according to the control signal and the preprocessed input data; the arithmetic operator performs mathematical operation on the preprocessed input data according to the control signal group to obtain an operation result; the formatting circuit converts the format of the operation result into a specified format and sends the specified format to the output temporary storage circuit;
the output temporary storage circuit comprises an output controller and an output register; and the output controller receives the operation result converted into the specified format, stores the operation result into the output register and sends the operation result to the output end.
2. The active sparsification vector processor of claim 1, wherein the working mode of the vector processor is floating-point multiplication, floating-point addition, floating-point extremum, fixed-point multiplication, fixed-point addition, or fixed-point extremum.
3. The active sparsification vector processor of claim 2, wherein, when the working mode is floating-point multiplication, the input data comprise two input vectors; the preprocessing circuit splits and recombines each pair of corresponding elements of the two input vectors into a sign, an exponent, a first mantissa and a second mantissa; all exponents are compared to obtain the maximum exponent, and the difference between the maximum exponent and each current exponent gives the mantissa offset; the second mantissa is converted to a compressed NAF code, which is right-shifted by the mantissa offset to align the mantissas; the operation controller uses the aligned compressed NAF codes for bit-level zero-skipping control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the adder tree in the arithmetic operator; the elements of the first input vector are NAF-weighted and the data to be computed are output, until all non-zero bits of the compressed NAF codes have been traversed; the adder tree and the accumulator of the arithmetic operator are fully activated, the data to be computed are summed and accumulated, and the accumulated result is output to the formatting circuit; the formatting circuit packs the accumulated result and the maximum exponent from the preprocessing circuit into floating-point format.
4. The active sparsification vector processor of claim 2, wherein, when the working mode is floating-point addition, the input data comprise two input vectors; the preprocessing circuit splits each pair of corresponding elements of the two input vectors into a first sign, a first exponent, a first mantissa, a second sign, a second exponent and a second mantissa; the first and second exponents of each pair are compared and subtracted to obtain the maximum exponent and the mantissa offset, the mantissa with the smaller exponent is right-shifted by the mantissa offset to align the mantissas, and the aligned first and second mantissas form an element group; the operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed; all adders in the arithmetic operator are activated, the elements of the element groups selected by the operation controller are summed and output to the formatting circuit; and the formatting circuit packs the sum of each element group together with the corresponding maximum exponent from the preprocessing circuit into floating-point format.
5. The active sparsification vector processor of claim 2, wherein, when the working mode is floating-point extremum, the input data are one input vector; the preprocessing circuit compares all elements of the input vector, finds the extremum and its index, and sends them directly to the formatting circuit; the formatting circuit keeps the extremum and packs its index separately.
6. The active sparsification vector processor of claim 2, wherein, when the working mode is fixed-point multiplication, the input data comprise two input vectors; the preprocessing circuit sign-extends all elements of the two input vectors and converts all elements of the second input vector into compressed NAF codes; the operation controller performs bit-level zero-skipping control on the compressed NAF codes and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the adder tree in the arithmetic operator; the elements of the first input vector are NAF-weighted and the data to be computed are output, until all non-zero bits of the compressed NAF codes have been traversed; the adder tree and the accumulator of the arithmetic operator are fully activated, the data to be computed are summed and accumulated, and the accumulated result is output to the formatting circuit; the formatting circuit re-quantizes the accumulated result according to the fixed-point quantization requirement.
7. The active sparsification vector processor of claim 2, wherein, when the working mode is fixed-point addition, the input data comprise two input vectors; the preprocessing circuit sign-extends all elements of the two input vectors; the two elements at corresponding positions of the two input vectors form an element group; the operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed; all adders in the arithmetic operator are activated, the elements of the element groups selected by the operation controller are summed and output to the formatting circuit; the formatting circuit re-quantizes the sum of each element group according to the fixed-point quantization requirement.
8. The active sparsification vector processor of claim 2, wherein, when the working mode is fixed-point extremum, the input data are one input vector; the preprocessing circuit compares all elements of the input vector, finds the extremum and its index, and sends them directly to the formatting circuit; the formatting circuit keeps the extremum and packs its index separately.
9. The active sparsification vector processor of claim 3 or 6, wherein the method for converting to compressed NAF encoding is:
the signed binary data are converted into a separated NAF encoding in either of two ways;
the first separated NAF encoding method proceeds as follows:
right-shift the N-bit signed binary data x by one bit (arithmetic shift) to obtain an N-bit first intermediate result xh, where N is a positive integer; add x and xh, discarding any overflow, to obtain an N-bit second intermediate result x3; combine x3 and xh bitwise to obtain the +1 flag and the -1 flag of the first separated NAF encoding, wherein ANDing x3 with the bitwise complement of xh yields the +1 flag naf.pos, and ANDing the bitwise complement of x3 with xh yields the -1 flag naf.neg;
when bit n (0 ≤ n ≤ N-1) of both the +1 flag and the -1 flag is 0, bit n of the NAF is 0; when bit n of the +1 flag is 0 and bit n of the -1 flag is 1, bit n of the NAF is -1; when bit n of the +1 flag is 1 and bit n of the -1 flag is 0, bit n of the NAF is +1;
the second separated NAF encoding method proceeds as follows:
right-shift the N-bit signed binary data x by one bit (arithmetic shift) to obtain an N-bit first intermediate result xh, where N is a positive integer; add x and xh, discarding any overflow, to obtain an N-bit second intermediate result x3; combine x3 and xh bitwise to obtain the -1 flag and the non-0 flag of the second separated NAF encoding, wherein ANDing the bitwise complement of x3 with xh yields the -1 flag naf.neg, and XORing x3 with xh yields the non-0 flag naf.non0;
when bit n of both the -1 flag and the non-0 flag is 1, bit n of the NAF is -1; when bit n of both flags is 0, bit n of the NAF is 0; when bit n of the -1 flag is 0 and bit n of the non-0 flag is 1, bit n of the NAF is +1;
the separated NAF encoding is converted into a compressed NAF encoding in either of two ways;
for an N-bit NAF, if N is odd, a 0 is prepended before the most significant digit to obtain an (N+1)-bit extended NAF and M = N+1; if N is even, the N-bit NAF itself serves as the extended NAF and M = N;
the first compressed NAF encoding method is:
if bits 2m+1 and 2m of the extended NAF are {-1, 0}, where 0 ≤ m ≤ M/2-1, the first compressed NAF code is 110, representing a compressed-NAF digit value of -2; if they are {0, -1}, the code is 101 (value -1); if {0, 0}, the code is 000 (value 0); if {0, 1}, the code is 001 (value +1); if {1, 0}, the code is 010 (value +2);
the second compressed NAF encoding method is:
if bits 2m+1 and 2m of the extended NAF are {-1, 0}, where 0 ≤ m ≤ M/2-1, the second compressed NAF code is 111 (value -2); if they are {0, -1}, the code is 101 (value -1); if {0, 0}, the code is 000 (value 0); if {0, 1}, the code is 001 (value +1); if {1, 0}, the code is 011 (value +2).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110986231.4A CN113835754B (en) | 2021-08-26 | 2021-08-26 | Active sparsification vector processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113835754A true CN113835754A (en) | 2021-12-24 |
CN113835754B CN113835754B (en) | 2023-04-18 |
Family
ID=78961217
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110986231.4A Active CN113835754B (en) | 2021-08-26 | 2021-08-26 | Active sparsification vector processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113835754B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030028572A1 (en) * | 2001-06-29 | 2003-02-06 | Yatin Hoskote | Fast single precision floating point accumulator using base 32 system |
US20090248778A1 (en) * | 2008-03-28 | 2009-10-01 | Magerlein Karen A | Systems and methods for a combined matrix-vector and matrix transpose vector multiply for a block-sparse matrix |
US20110078226A1 (en) * | 2009-09-30 | 2011-03-31 | International Business Machines Corporation | Sparse Matrix-Vector Multiplication on Graphics Processor Units |
US20120259906A1 (en) * | 2011-04-08 | 2012-10-11 | Fujitsu Limited | Arithmetic circuit, arithmetic processing apparatus and method of controlling arithmetic circuit |
CN110766136A (en) * | 2019-10-16 | 2020-02-07 | 北京航空航天大学 | Compression method of sparse matrix and vector |
US20200293488A1 (en) * | 2019-03-15 | 2020-09-17 | Intel Corporation | Scalar core integration |
US20210191733A1 (en) * | 2019-12-23 | 2021-06-24 | Western Digital Technologies, Inc. | Flexible accelerator for sparse tensors (fast) in machine learning |
Non-Patent Citations (3)
Title |
---|
YAN ZHANG: "FPGA vs. GPU for sparse matrix vector multiply" * |
CHANG Liang, YANG Siqi: "Development of magnetic random access memory and its cache applications" *
BAI Hongtao: "GPU-based optimization of sparse matrix-vector multiplication" *
Also Published As
Publication number | Publication date |
---|---|
CN113835754B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111832719A (en) | Fixed point quantization convolution neural network accelerator calculation circuit | |
KR102153791B1 (en) | Digital neural, artificial neuron for artificial neuron network and inference engine having the same | |
US10872295B1 (en) | Residual quantization of bit-shift weights in an artificial neural network | |
CN101981618B (en) | Reduced-complexity vector indexing and de-indexing | |
JP3902990B2 (en) | Hadamard transform processing method and apparatus | |
CN109389208B (en) | Data quantization device and quantization method | |
US5577132A (en) | Image coding/decoding device | |
CN114647399B (en) | Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device | |
CN110019184B (en) | Method for compressing and decompressing ordered integer array | |
CN111488133A (en) | High-radix approximate Booth coding method and mixed-radix Booth coding approximate multiplier | |
CN114615507A (en) | Image coding method, decoding method and related device | |
CN113835754B (en) | Active sparsification vector processor | |
JP3033671B2 (en) | Image signal Hadamard transform encoding / decoding method and apparatus | |
CN100493199C (en) | Coding apparatus, coding method, and codebook | |
CN112702600B (en) | Image coding and decoding neural network layered fixed-point method | |
CN110766136B (en) | Compression method of sparse matrix and vector | |
CN113283591B (en) | Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier | |
CN111492369A (en) | Residual quantization of shift weights in artificial neural networks | |
EP1617324A1 (en) | Method and system for digital signal processing and program product therefor | |
CN116205244A (en) | Digital signal processing structure | |
CN116451769A (en) | Quantization method of language model and electronic equipment | |
CN113794709B (en) | Hybrid coding method for binary sparse matrix | |
KR20220045920A (en) | Method and apparatus for processing images/videos for machine vision | |
CN110737869B (en) | DCT/IDCT multiplier circuit optimization method and application | |
Swilem | A fast vector quantization encoding algorithm based on projection pyramid with Hadamard transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||