CN113835754A - Active sparsification vector processor - Google Patents

Active sparsification vector processor

Info

Publication number
CN113835754A
CN113835754A (application CN202110986231.4A)
Authority
CN
China
Prior art keywords
naf
bit
compressed
input
circuit
Prior art date
Legal status
Granted
Application number
CN202110986231.4A
Other languages
Chinese (zh)
Other versions
CN113835754B (en)
Inventor
常亮
周军
竹子轩
杨思琪
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110986231.4A
Publication of CN113835754A
Application granted
Publication of CN113835754B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an active sparsification vector processor, belonging to the technical field of vector operation control and implementation. The vector processor comprises an input temporary storage circuit, an arithmetic operation circuit, and an output temporary storage circuit. The input temporary storage circuit receives the control signal and the input data, the arithmetic operation circuit processes the input data according to the control signal, and the output temporary storage circuit stores and sends the operation result. The vector processor processes vectors directly and retains no intermediate results, so it performs fewer redundant operations and fewer memory accesses; by actively sparsifying the input data it achieves better performance, lower power consumption, and wider applicability; and it supports multiple data formats and multiple operation modes on the same hardware, giving a smaller area and a flexible usage model.

Description

Active sparsification vector processor
Technical Field
The invention belongs to the technical field of vector operation control and implementation, and in particular relates to an active sparsification vector processor.
Background
The inference process of a neural network involves a large number of vector operations, in scenarios such as image and video denoising, feature extraction, object recognition, and voice keyword spotting. The data are inherently multidimensional vectors, and the inference process decomposes into basic operations, mainly vector multiplication, vector addition, and vector comparison. However, whether for fixed-point or floating-point numbers, conventional processors mostly increase throughput through single-instruction-multiple-data parallelism, which executes several basic operations simultaneously with no logical association between operands. For vector operations, this parallel mode ignores the logical association among vector elements and therefore introduces a large number of redundant intermediate operations, while the throughput of the basic operations limits the performance of vector computation.
Taking the dot product of the P-dimensional vectors A = (A_1, ..., A_P) and B = (B_1, ..., B_P) as an example, the mathematical expression is:

A · B = Σ_{i=1}^{P} A_i · B_i

In a single-instruction-multiple-data design, an intermediate variable c is needed and the operation c = c + A_i · B_i is executed P times to obtain the final result. In this way P intermediate results c are retained, that is, there are P redundant operations of formatting, storing, and reading data, and the P operations must be executed sequentially, so the approach has no parallel advantage at all.
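The redundancy just described can be illustrated with a short Python sketch (illustrative only, not the patent's hardware): the SIMD-style dot product retains an intermediate result c and updates it strictly in sequence.

```python
# Python sketch of a SIMD-style dot product: the intermediate variable c is
# updated P times via c = c + A[i] * B[i]; in hardware, each update implies
# formatting, storing, and re-reading an intermediate result.
def simd_dot(A, B):
    assert len(A) == len(B)
    c = 0                      # intermediate result, retained P times
    for a, b in zip(A, B):
        c = c + a * b          # P sequential accumulations
    return c

print(simd_dot([1, 2, 3], [4, 5, 6]))  # 32
```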
Existing bit-skipping multiply-add circuits fall into bit-serial, fully parallel, and partially parallel designs, all based on shift-and-accumulate: when the corresponding bit is 1 the operand is added, and when it is 0 no operation is executed. A bit-serial circuit checks one bit at a time, a fully parallel circuit checks all bits at once, and a partially parallel circuit checks a subset of bits at a time. These designs can only exploit the natural bit sparsity of the data to accelerate multiply-add operations; such sparsity is not common, so many meaningless operations remain.
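A minimal Python model of the shift-accumulate method these circuits share (the bit-serial variant is shown; the function name and bit width are illustrative, and the operands are assumed unsigned):

```python
# Bit-serial shift-accumulate multiplication: check one bit of b at a time,
# add the shifted operand a when the bit is 1, and skip when it is 0.
def bit_serial_mul(a, b, width=8):
    acc = 0
    adds = 0                       # additions actually performed
    for j in range(width):
        if (b >> j) & 1:           # non-zero bit: shift-add
            acc += a << j
            adds += 1
        # zero bit: no operation is executed
    return acc, adds

prod, adds = bit_serial_mul(5, 0b1011)
print(prod, adds)  # 55 3
```

The acceleration is bounded by the number of naturally non-zero bits of b, which is why relying on natural sparsity alone leaves many meaningless operations.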
Taking again the dot product of the P-dimensional vectors A and B as an example, with bit skipping applied according to the sparsity of each element of B, the mathematical expression is:

A · B = Σ_{i=1}^{P} Σ_{j=0}^{Q-1} A_i · B_{i,j} · 2^j

where Q is the number of bits of the value of each element in B, and B_{i,j} is the j-th bit of the i-th element in the generated control flow, whose value differs according to the algorithm. The number of additions equals the total number of non-zero bits of the elements of B: on average half of the total number of bits, namely Q/2 per element, at least 0 and at most the total number of bits.
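The addition count can be checked empirically. A hedged sketch (assuming unsigned 8-bit elements; names are illustrative) counts one addition per non-zero bit of the elements of B:

```python
# Count the shift-additions a bit-skipping dot product needs: one per
# non-zero bit of B. For uniformly random data the average is about Q/2
# per element (here Q = 8).
import random

def additions_needed(B, q=8):
    return sum(bin(b & ((1 << q) - 1)).count("1") for b in B)

random.seed(0)
B = [random.randrange(256) for _ in range(1000)]
avg = additions_needed(B) / len(B)
print(avg)  # close to Q/2 = 4 for uniformly random bytes
```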
In conclusion, the prior art performs redundant computation and achieves low computational performance.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide an active sparsification vector processor.
The technical problem proposed by the invention is solved as follows:
an active sparsification vector processor comprises an input temporary storage circuit, an arithmetic operation circuit and an output temporary storage circuit;
the input temporary storage circuit comprises an input controller and an input register; the input controller receives the handshake signals, the control signals and the input data sent by the input end, stores the control signals and the input data into the input register, and feeds the handshake signals back to the input end;
the arithmetic operation circuit comprises a preprocessing circuit, an operation controller, an arithmetic operator and a formatting circuit;
the preprocessing circuit reads the control signal and the input data in the input register and preprocesses the input data according to the control signal; the arithmetic controller reads the control signal in the input register and generates a control signal group of the arithmetic operator according to the control signal and the preprocessed input data; the arithmetic operator performs mathematical operation on the preprocessed input data according to the control signal group to obtain an operation result; the formatting circuit converts the format of the operation result into a specified format and sends the specified format to the output temporary storage circuit;
the output temporary storage circuit comprises an output controller and an output register; and the output controller receives the operation result converted into the specified format, stores the operation result into the output register and sends the operation result to the output end.
Furthermore, the working mode of the active sparsification vector processor is floating-point dot product, floating-point addition, floating-point maximum, fixed-point dot product, fixed-point addition, or fixed-point maximum.
Further, when the working mode is floating-point dot product, the input data comprise two input vectors; the preprocessing circuit splits and recombines each pair of corresponding elements of the two input vectors into a sign, an exponent, a first mantissa, and a second mantissa; all exponents are compared to obtain the maximum exponent; the current exponent is subtracted from the maximum exponent to obtain the mantissa offset; the second mantissa is converted into compressed NAF code, which is right-shifted by the mantissa offset to align the mantissas; the operation controller uses the aligned compressed NAF code for skip-zero control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator; the elements of the first input vector are NAF-weighted and the data to be operated on are output, until all non-zero bits of the compressed NAF code have been traversed; the addition tree and the accumulator of the arithmetic operator are all activated, the data to be operated on are summed and accumulated, and the accumulated result is output to the formatting circuit; the formatting circuit packs the accumulated result and the maximum exponent from the preprocessing circuit into floating-point format.
Further, when the working mode is floating-point addition, the input data comprise two input vectors; the preprocessing circuit splits each pair of corresponding elements of the two input vectors into a first sign, a first exponent, a first mantissa, a second sign, a second exponent, and a second mantissa; the first and second exponents of each pair are compared and subtracted to obtain the maximum exponent and the mantissa offset, the mantissa corresponding to the smaller exponent is right-shifted by the mantissa offset to align the mantissas, and the aligned first and second mantissas form an element group; the operation controller periodically selects element groups, the number selected being the number of adders in the arithmetic operator, until all element groups have been traversed; all adders in the arithmetic operator are activated, sum the elements of the element groups selected by the operation controller, and output the sums to the formatting circuit; the formatting circuit packs the sum of each element group together with the corresponding maximum exponent from the preprocessing circuit into floating-point format.
Furthermore, when the working mode is floating-point maximum, the input data are a single input vector; the preprocessing circuit compares all elements of the input vector, finds the extreme value and its index, and sends them directly to the formatting circuit; the formatting circuit keeps the extreme value and packs its index separately.
Further, when the working mode is fixed-point dot product, the input data comprise two input vectors; the preprocessing circuit sign-extends all elements of the two input vectors and converts all elements of the second input vector into compressed NAF code; the operation controller uses the compressed NAF code for skip-zero control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator; the elements of the first input vector are NAF-weighted and the data to be operated on are output, until all non-zero bits of the compressed NAF code have been traversed; the addition tree and the accumulator of the arithmetic operator are all activated, the data to be operated on are summed and accumulated, and the accumulated result is output to the formatting circuit; the formatting circuit re-quantizes the accumulated result according to the fixed-point quantization requirement.
Further, when the working mode is fixed-point addition, the input data comprise two input vectors; the preprocessing circuit sign-extends all elements of the two input vectors; the two elements at corresponding positions of the two input vectors form an element group; the operation controller periodically selects element groups, the number selected being the number of adders in the arithmetic operator, until all element groups have been traversed; all adders in the arithmetic operator are activated, sum the elements of the element groups selected by the operation controller, and output the sums to the formatting circuit; the formatting circuit re-quantizes the sum of each element group according to the fixed-point quantization requirement.
Furthermore, when the working mode is fixed-point maximum, the input data are a single input vector; the preprocessing circuit compares all elements of the input vector, finds the extreme value and its index, and sends them directly to the formatting circuit; the formatting circuit keeps the extreme value and packs its index separately.
Further, the conversion method of the compressed NAF code is as follows:
the signed binary data are first converted into separated NAF code, in either of two ways;
the specific process of the first separated NAF coding mode is as follows:
right-shifting the N-bit signed binary data x by one bit to obtain an N-bit first intermediate result xh, wherein N is a positive integer; adding x and xh and discarding the overflow to obtain an N-bit second intermediate result x3; operating bitwise on x3 and xh to obtain the +1 mark and the -1 mark of the first separated NAF code; wherein the bitwise AND of x3 and NOT xh gives the +1 mark naf.pos, and the bitwise AND of NOT x3 and xh gives the -1 mark naf.neg;
when the nth bits of the +1 mark and the -1 mark of the first separated NAF code are both 0, the nth NAF digit is 0, wherein 0 ≤ n ≤ N-1; when they are 0 and 1 respectively, the nth NAF digit is -1; when they are 1 and 0 respectively, the nth NAF digit is +1;
the specific process of the second separated NAF coding mode is as follows:
right-shifting the N-bit signed binary data x by one bit to obtain an N-bit first intermediate result xh, wherein N is a positive integer; adding x and xh and discarding the overflow to obtain an N-bit second intermediate result x3; operating bitwise on x3 and xh to obtain the -1 mark and the non-0 mark of the second separated NAF code; wherein the bitwise AND of NOT x3 and xh gives the -1 mark naf.neg, and the bitwise XOR of x3 and xh gives the non-0 mark naf.non0;
when the nth bits of the -1 mark and the non-0 mark of the second separated NAF code are both 1, the nth NAF digit is -1; when both are 0, the nth NAF digit is 0; when they are 0 and 1 respectively, the nth NAF digit is +1;
the separated NAF code is then converted into compressed NAF code, also in either of two ways;
for an N-bit NAF, if N is odd, a 0 is prepended before the most significant bit to obtain an (N+1)-bit extended NAF, and M = N + 1; if N is even, the N-bit NAF itself is used as the extended NAF, and M = N;
the first compressed NAF encoding method is:
if bits 2m+1 and 2m of the extended NAF are {-1, 0}, wherein 0 ≤ m ≤ M/2 - 1, the first compressed NAF code is 110, representing that the m-th digit of the compressed NAF has the value -2; if they are {0, -1}, the code is 101, representing the value -1; if {0, 0}, the code is 000, representing the value 0; if {0, 1}, the code is 001, representing the value 1; and if {1, 0}, the code is 010, representing the value 2;
the second compressed NAF encoding method is:
if bits 2m+1 and 2m of the extended NAF are {-1, 0}, wherein 0 ≤ m ≤ M/2 - 1, the second compressed NAF code is 111, representing that the m-th digit of the compressed NAF has the value -2; if they are {0, -1}, the code is 101, representing the value -1; if {0, 0}, the code is 000, representing the value 0; if {0, 1}, the code is 001, representing the value 1; and if {1, 0}, the code is 011, representing the value 2.
The invention has the beneficial effects that:
the vector processor directly processes the vector, has a simpler vector operation control flow and is more friendly to a large-bandwidth memory; the vector processor directly obtains vector operation results without reserving intermediate results in the processing process, and has less redundant operation and less access times; the vector processor actively sparsifies the input data, converts the input data into a form with strong bit-level sparsity, can perform operation acceleration by applying a bit-level sparsity optimization method unconditionally, and has better performance, lower power consumption and more universality; the vector processor realizes the support of multiple data formats and multiple operation modes on the same hardware, and has smaller area and flexible use mode.
Drawings
FIG. 1 is a block diagram of a vector processor according to the present invention;
FIG. 2 is a flow chart of a first separated NAF encoding mode in the embodiment;
fig. 3 is a flow chart of a second separated NAF encoding method in the embodiment.
Detailed Description
The invention is further described below with reference to the figures and examples.
The present embodiment provides an active sparsification vector processor, whose block diagram is shown in FIG. 1; it comprises an input temporary storage circuit, an arithmetic operation circuit, and an output temporary storage circuit;
the input temporary storage circuit comprises an input controller and an input register; the input controller receives the handshake signals, the control signals and the input data sent by the input end, stores the control signals and the input data into the input register, and feeds the handshake signals back to the input end;
the arithmetic operation circuit comprises a preprocessing circuit, an operation controller, an arithmetic operator and a formatting circuit;
the preprocessing circuit reads the control signal and the input data from the input register, extracts the working-mode information from the control signal, enables the datapath of the corresponding working mode, and preprocesses the input data accordingly; the operation controller reads the control signal from the input register, extracts the working-mode information, and generates the control signal group of the arithmetic operator from the working-mode information and the preprocessed input data; the arithmetic operator enables the datapath of the corresponding working mode according to the control signal group and performs the mathematical operation on the preprocessed input data to obtain the operation result; the formatting circuit converts the operation result into the specified format and sends it to the output temporary storage circuit;
the output temporary storage circuit comprises an output controller and an output register; and the output controller receives the operation result converted into the specified format, stores the operation result into the output register and sends the operation result to the output end.
The working mode of the active sparsification vector processor in this embodiment is floating-point dot product, floating-point addition, floating-point maximum, fixed-point dot product, fixed-point addition, or fixed-point maximum.
When the working mode is floating-point dot product, the input data comprise two floating-point input vectors; the preprocessing circuit enables the datapath of the floating-point dot-product mode and splits and recombines each pair of corresponding elements of the two input vectors into a sign, an exponent, a first mantissa, and a second mantissa; all exponents are fed to a comparator, which outputs the maximum exponent; the maximum exponent and the current exponent are fed to a subtracter, whose difference is the mantissa offset; the second mantissa is converted into compressed NAF code and fed to a shifter whose right-shift amount is the mantissa offset, aligning the mantissas; the operation controller uses the aligned compressed NAF code for skip-zero control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator, the control signal group carrying the working-mode information; the selected elements are NAF-weighted and the data to be operated on are output, until all non-zero bits of the compressed NAF code have been traversed; the arithmetic operator enables the datapath of the corresponding mode according to the control signal group, the addition tree and the accumulator are all activated, the data to be operated on are summed and accumulated, and the accumulated result is output to the formatting circuit; the formatting circuit packs the accumulated result and the maximum exponent from the preprocessing circuit into floating-point format.
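The exponent-comparison and mantissa-alignment flow of this mode can be modeled in Python. The sketch below is a simplification under stated assumptions: Python floats and ordinary multiplication stand in for the hardware format and the NAF-coded bit-skipping, and frac_bits and the function name are illustrative.

```python
# Model of the floating-point dot-product flow: split each product into
# mantissa and exponent, find the maximum exponent, right-shift each
# fixed-point mantissa by its exponent difference, accumulate, and repack.
import math

def fp_dot(A, B, frac_bits=16):
    exps, mants = [], []
    for a, b in zip(A, B):
        m, e = math.frexp(a * b)          # split product into mantissa, exponent
        exps.append(e)
        mants.append(m)
    e_max = max(exps)                     # comparator: maximum exponent
    acc = 0
    for m, e in zip(mants, exps):
        fixed = int(round(m * (1 << frac_bits)))
        acc += fixed >> (e_max - e)       # shifter: align to maximum exponent
    return math.ldexp(acc / (1 << frac_bits), e_max)  # repack as float

print(fp_dot([1.5, 2.0], [2.0, 0.25]))  # 3.5
```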
When the working mode is floating-point addition, the input data comprise two floating-point input vectors; the preprocessing circuit enables the datapath of the floating-point addition mode and splits each pair of corresponding elements of the two input vectors into a first sign, a first exponent, a first mantissa, a second sign, a second exponent, and a second mantissa; the first and second exponents of each pair are fed to a comparator, which outputs the maximum exponent, and to a subtracter, whose difference is the mantissa offset; the mantissa corresponding to the smaller exponent is fed to a shifter whose right-shift amount is the mantissa offset, aligning the mantissas, and the aligned first and second mantissas form an element group; the operation controller periodically selects element groups, the number selected being the number of adders in the arithmetic operator, until all element groups have been traversed, the control signal group carrying the working-mode information; all adders in the arithmetic operator are activated, sum the elements of the element groups selected by the operation controller, and output the sums to the formatting circuit; the formatting circuit packs the sum of each element group together with the corresponding maximum exponent from the preprocessing circuit into floating-point format.
When the working mode is floating-point maximum, the input data are a single floating-point input vector; the preprocessing circuit enables the datapath of the floating-point maximum mode and feeds all elements of the input vector to a comparator, which outputs the extreme value and its index and sends them directly to the formatting circuit; the formatting circuit keeps the extreme value and packs its index separately.
When the working mode is fixed-point dot product, the input data comprise two fixed-point input vectors; the preprocessing circuit enables the datapath of the fixed-point dot-product mode, sign-extends all elements of the two input vectors, and converts all elements of the second input vector into compressed NAF code; the operation controller uses the compressed NAF code for skip-zero control and periodically selects elements from the first input vector, the number of selected elements being the number of channels of the addition tree in the arithmetic operator, the control signal group carrying the working-mode information; the selected elements are NAF-weighted and the data to be operated on are output, until all non-zero bits of the compressed NAF code have been traversed; the arithmetic operator enables the datapath of the corresponding mode according to the control signal group, the addition tree and the accumulator are all activated, the data to be operated on are summed and accumulated, and the accumulated result is output to the formatting circuit; the formatting circuit re-quantizes the accumulated result according to the fixed-point quantization requirement.
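A hedged end-to-end Python model of this mode: each element of the second vector is converted to radix-4 compressed-NAF digits, zero digits are skipped (the skip-zero control), and elements of the first vector are NAF-weighted and accumulated. Bit widths and function names are illustrative, and small non-negative operands are assumed.

```python
# Fixed-point dot product with active sparsification: b is recoded into
# non-adjacent-form digits, compressed two NAF digits per radix-4 digit,
# and only non-zero digits trigger a weighted addition of a.
def naf_digits(x, n_bits=10):
    mask = (1 << n_bits) - 1
    xh = x >> 1                          # first intermediate result
    x3 = (x + xh) & mask                 # second intermediate result
    pos = x3 & ~xh & mask                # +1 marks
    neg = ~x3 & xh & mask                # -1 marks
    return [((pos >> i) & 1) - ((neg >> i) & 1) for i in range(n_bits)]

def radix4(digits):                      # pair NAF digits into {-2, ..., 2}
    if len(digits) % 2:
        digits = digits + [0]
    return [digits[2 * m] + 2 * digits[2 * m + 1]
            for m in range(len(digits) // 2)]

def naf_dot(A, B, n_bits=10):
    acc = 0
    for a, b in zip(A, B):
        for m, d in enumerate(radix4(naf_digits(b, n_bits))):
            if d:                          # skip-zero control
                acc += (a * d) << (2 * m)  # NAF weighting: a * d * 4**m
    return acc

print(naf_dot([3, 5], [7, 11]))  # 76 (= 3*7 + 5*11)
```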
When the working mode is fixed-point addition, the input data comprise two fixed-point input vectors; the preprocessing circuit enables the datapath of the fixed-point addition mode and sign-extends all elements of the two input vectors; the two elements at corresponding positions of the two input vectors form an element group; the operation controller periodically selects element groups, the number selected being the number of adders in the arithmetic operator, until all element groups have been traversed, the control signal group carrying the working-mode information; all adders in the arithmetic operator are activated, sum the elements of the element groups selected by the operation controller, and output the sums to the formatting circuit; the formatting circuit re-quantizes the sum of each element group according to the fixed-point quantization requirement.
When the working mode is fixed-point maximum, the input data are a single fixed-point input vector; the preprocessing circuit enables the datapath of the fixed-point maximum mode and feeds all elements of the input vector to a comparator, which outputs the extreme value and its index and sends them directly to the formatting circuit; the formatting circuit keeps the extreme value and packs its index separately.
When the working mode is floating-point or fixed-point dot product, the arithmetic operation circuit can take the form of a bit-serial, fully parallel, or partially parallel algorithm; with the fully parallel algorithm, the number of addition-tree channels is determined by the maximum bit width of the preprocessed data format; with the partially parallel algorithm, it is determined by the required parallelism; with the bit-serial algorithm, the addition tree degenerates into a 2-input adder, i.e. two channels.
The conversion method of the compressed NAF code is as follows:
there are two ways to convert signed binary data to separate NAF encoding.
Fig. 2 shows a flow chart of a first separated NAF encoding method, which specifically includes the following steps:
right-shifting the N-bit signed binary data x by one bit to obtain an N-bit first intermediate result xh, wherein N is a positive integer; adding x and xh and discarding the overflow to obtain an N-bit second intermediate result x3; operating bitwise on x3 and xh to obtain the +1 mark and the -1 mark of the first separated NAF code; wherein the bitwise AND of x3 and NOT xh gives the +1 mark naf.pos, and the bitwise AND of NOT x3 and xh gives the -1 mark naf.neg;
when the nth bits of the +1 mark and the -1 mark are both 0, the nth NAF digit is 0, wherein 0 ≤ n ≤ N-1; when they are 0 and 1 respectively, the nth NAF digit is -1; when they are 1 and 0 respectively, the nth NAF digit is +1. The case where the nth bits of both the +1 mark and the -1 mark are 1 does not occur. The first separated NAF code table is shown in Table 1.
TABLE 1. First separated NAF encoding

-1 flag, bit n    +1 flag, bit n    NAF, bit n
1                 0                 -1
0                 1                 +1
0                 0                 0
1                 1                 (cannot occur)
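The steps above can be sketched in Python (a minimal model under the AND-based reading of the flag definitions; the function name is illustrative):

```python
def separated_naf_v1(x, n):
    """First separated NAF encoding of the n-bit signed integer x.

    Returns (naf_pos, naf_neg): bitmasks of the +1 and -1 NAF digits.
    """
    mask = (1 << n) - 1
    xh = (x >> 1) & mask        # arithmetic right shift, kept to n bits
    x3 = (x + xh) & mask        # x + xh with overflow discarded
    naf_pos = x3 & ~xh & mask   # +1 flag: x3 bit set, xh bit clear
    naf_neg = ~x3 & xh & mask   # -1 flag: x3 bit clear, xh bit set
    return naf_pos, naf_neg

# Example: x = 3, n = 4 gives NAF (+1 0 0 -1), i.e. 4 - 1
# separated_naf_v1(3, 4) -> (0b0100, 0b0001)
```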
Fig. 3 shows the flow chart of the second separated NAF encoding, which proceeds as follows:
right-shift the N-bit signed binary datum x by one bit (arithmetic shift) to obtain an N-bit first intermediate result xh, where N is a positive integer; add x and xh, discarding any overflow, to obtain an N-bit second intermediate result x3; perform bitwise operations on x3 and xh to obtain the -1 flag and the non-0 flag of the second separated NAF encoding: ANDing the complement of x3 bitwise with xh gives the -1 flag naf.neg, and XORing x3 bitwise with xh gives the non-0 flag naf.non0.
When bit n of both the -1 flag and the non-0 flag of the second separated NAF encoding is 1, bit n of the NAF is -1; when both are 0, bit n of the NAF is 0; when bits n of the -1 flag and the non-0 flag are 0 and 1 respectively, bit n of the NAF is +1. The case where bits n of the -1 flag and the non-0 flag are 1 and 0 respectively cannot occur. The second separated NAF encoding is summarized in Table 2.
TABLE 2. Second separated NAF encoding

-1 flag, bit n    non-0 flag, bit n    NAF, bit n
1                 1                    -1
0                 1                    +1
0                 0                    0
1                 0                    (cannot occur)
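The second encoding can likewise be sketched in Python (a minimal model; the function name is illustrative):

```python
def separated_naf_v2(x, n):
    """Second separated NAF encoding: (-1 flag, non-0 flag) bitmasks."""
    mask = (1 << n) - 1
    xh = (x >> 1) & mask         # arithmetic right shift, kept to n bits
    x3 = (x + xh) & mask         # x + xh with overflow discarded
    naf_neg = ~x3 & xh & mask    # -1 flag: NOT x3 AND xh
    naf_non0 = (x3 ^ xh) & mask  # non-0 flag: x3 XOR xh
    return naf_neg, naf_non0
```

A +1 digit sits wherever the non-0 flag is set but the -1 flag is not, which is how the decode rule of Table 2 reads the two flags.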
There are two ways to convert the separated NAF encoding into a compressed NAF encoding.
Using the property that non-zero NAF digits are never adjacent, each group of two NAF bits is re-encoded, yielding a numerical representation in which the number of non-zero digits is unchanged, the total number of digits is halved, and each digit takes a value in {-2, -1, 0, 1, 2}. To simplify zero-skipping control, two "compressed NAF" encoding schemes are proposed: the first directly generates a 3-bit value in sign-magnitude (i.e. sign-and-absolute-value) format; the second uses 3 bits that respectively mark "sign", "multiplied by 2" and "non-zero".
For an N-bit NAF, if N is odd, a 0 is prepended before the most significant bit to obtain an (N+1)-bit extended NAF and M = N + 1; if N is even, the N-bit NAF itself serves as the extended NAF and M = N.
The first compressed NAF encoding method is:
If bits 2m+1 and 2m of the NAF are {-1, 0}, with 0 ≤ m ≤ M/2 - 1, the first compressed NAF code is 110, representing a compressed-NAF digit of value -2; if they are {0, -1}, the code is 101, representing -1; if {0, 0}, the code is 000, representing 0; if {0, 1}, the code is 001, representing +1; if {1, 0}, the code is 010, representing +2.
The second compressed NAF encoding method is:
If bits 2m+1 and 2m of the NAF are {-1, 0}, with 0 ≤ m ≤ M/2 - 1, the second compressed NAF code is 111, representing a compressed-NAF digit of value -2; if they are {0, -1}, the code is 101, representing -1; if {0, 0}, the code is 000, representing 0; if {0, 1}, the code is 001, representing +1; if {1, 0}, the code is 011, representing +2. Because non-zero NAF digits are never adjacent, the pairs {-1, -1}, {-1, +1}, {+1, -1} and {+1, +1} cannot occur.
The two compressed NAF encodings are shown in Table 3.
TABLE 3. The two compressed NAF encodings

NAF bits {2m+1, 2m}    Digit value    First code    Second code
{-1, 0}                -2             110           111
{0, -1}                -1             101           101
{0, 0}                 0              000           000
{0, 1}                 +1             001           001
{1, 0}                 +2             010           011
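The two pair-wise re-encodings described above can be sketched as table lookups (an illustrative model; the names and the least-significant-first digit order are assumptions):

```python
# Map a pair of NAF digits (d[2m+1], d[2m]) to a 3-bit code.
# Format 1: sign + 2-bit magnitude; format 2: sign / times-2 / non-0 flags.
CODE_FMT1 = {(-1, 0): 0b110, (0, -1): 0b101, (0, 0): 0b000,
             (0, 1): 0b001, (1, 0): 0b010}
CODE_FMT2 = {(-1, 0): 0b111, (0, -1): 0b101, (0, 0): 0b000,
             (0, 1): 0b001, (1, 0): 0b011}

def compress_naf(digits, fmt=1):
    """digits: NAF digits, least significant first; returns 3-bit codes."""
    if len(digits) % 2:
        digits = digits + [0]          # prepend a 0 before the MSB
    table = CODE_FMT1 if fmt == 1 else CODE_FMT2
    # each pair (d[2m+1], d[2m]) becomes one radix-4 digit in {-2,...,2}
    return [table[(digits[i + 1], digits[i])] for i in range(0, len(digits), 2)]

# NAF of 7 is (+1 0 0 -1), i.e. digits [-1, 0, 0, +1] LSB-first:
# compress_naf([-1, 0, 0, 1], 1) -> [0b101, 0b010]   # radix-4 digits -1, +2
```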
In this embodiment, basic vector operations such as dot product, addition and extremum are computed directly as a whole; the inputs and outputs are vectors whose elements are logically related in the mathematical sense. For the dot product, all multiplications may be computed in parallel and the accumulation completed with an adder tree, so that no intermediate results need to be stored and no redundant intermediate operations arise; at the same time, because the large number of identical computations shares an optimized digital circuit design, the processor achieves higher performance, smaller area and lower power consumption.
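The NAF-weighted, zero-skipping dot product at the heart of this design can be sketched as follows (an illustrative software model, not the patent's circuit; the NAF construction follows the first separated encoding from the description):

```python
def naf_digits(x, n):
    """NAF digits of the n-bit signed integer x, least significant first."""
    mask = (1 << n) - 1
    xh = (x >> 1) & mask
    x3 = (x + xh) & mask
    pos, neg = x3 & ~xh & mask, ~x3 & xh & mask
    return [(pos >> i & 1) - (neg >> i & 1) for i in range(n)]

def naf_dot(a, b, n=8):
    """Dot product a . b; each b[i] costs one shift-add per non-zero NAF digit."""
    acc = 0
    for ai, bi in zip(a, b):
        for i, d in enumerate(naf_digits(bi, n)):
            if d:                      # zero digits are skipped entirely
                acc += d * (ai << i)   # NAF weighting: +/- (ai << i)
    return acc
```

Because at most half of the NAF digits are non-zero, at least half of the shift-add steps are skipped unconditionally, which is the "artificially created sparsity" discussed below.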
In this embodiment, the Non-Adjacent Form (NAF) is used for bit-level sparsification: data are converted into the encoding with the strongest bit-level sparsity, which reduces the average number of non-zero bits to one third of the total bit count and the maximum number to one half. Processing is thus no longer limited by the natural bit-level sparsity of the data: by artificially creating sparsity, bit-level zero-skipping can be applied unconditionally, meaningless intermediate operations are merged away, and the amount of computation is greatly reduced.
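The sparsity claim can be checked empirically with a short script (illustrative; it uses the x + (x >> 1) NAF construction from the description over all 8-bit signed values):

```python
def naf_nonzero_count(x, n=8):
    """Number of non-zero NAF digits of the n-bit signed integer x."""
    mask = (1 << n) - 1
    xh = (x >> 1) & mask
    x3 = (x + xh) & mask
    pos = x3 & ~xh & mask   # +1 digits
    neg = ~x3 & xh & mask   # -1 digits
    return bin(pos | neg).count("1")

counts = [naf_nonzero_count(x) for x in range(-128, 128)]
avg = sum(counts) / len(counts)
# avg comes out near n/3; no value needs more than n/2 non-zero digits,
# because non-zero NAF digits are never adjacent.
```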

Claims (9)

1. An active sparsification vector processor is characterized by comprising an input temporary storage circuit, an arithmetic operation circuit and an output temporary storage circuit;
the input temporary storage circuit comprises an input controller and an input register; the input controller receives the handshake signals, the control signals and the input data sent by the input end, stores the control signals and the input data into the input register, and feeds the handshake signals back to the input end;
the arithmetic operation circuit comprises a preprocessing circuit, an operation controller, an arithmetic operator and a formatting circuit;
the preprocessing circuit reads the control signal and the input data in the input register and preprocesses the input data according to the control signal; the operation controller reads the control signal in the input register and generates a control signal group for the arithmetic operator according to the control signal and the preprocessed input data; the arithmetic operator performs mathematical operations on the preprocessed input data according to the control signal group to obtain an operation result; and the formatting circuit converts the operation result into a specified format and sends it to the output temporary storage circuit;
the output temporary storage circuit comprises an output controller and an output register; and the output controller receives the operation result converted into the specified format, stores the operation result into the output register and sends the operation result to the output end.
2. The active sparsification vector processor of claim 1, wherein the operating mode of the vector processor is floating-point multiplication, floating-point addition, floating-point extremum, fixed-point multiplication, fixed-point addition, or fixed-point extremum.
3. The active sparsification vector processor of claim 2, wherein, when the operating mode is floating-point multiplication, the input data comprises two input vectors; the preprocessing circuit splits each pair of corresponding elements of the two input vectors and recombines them into a sign, an exponent, a first mantissa and a second mantissa; compares all exponents to obtain a maximum exponent; subtracts each current exponent from the maximum exponent to obtain a mantissa offset; and converts the second mantissa into a compressed NAF code and right-shifts the compressed NAF code by the mantissa offset to align the mantissas; the operation controller performs bit-level zero-skipping control with the aligned compressed NAF code and periodically selects elements from the first input vector, the number of selected elements being the number of adder-tree channels in the arithmetic operator; the selected elements of the first input vector are NAF-weighted and the data to be computed are output, until all non-zero bits of the compressed NAF code have been traversed; the adder tree and the accumulator of the arithmetic operator are fully activated, sum and accumulate the data to be computed, and output the accumulated result to the formatting circuit; and the formatting circuit packs the accumulated result and the maximum exponent from the preprocessing circuit into a floating-point format.
4. The active sparsification vector processor of claim 2, wherein, when the operating mode is floating-point addition, the input data comprises two input vectors; the preprocessing circuit splits each pair of corresponding elements of the two input vectors into a first sign, a first exponent, a first mantissa, a second sign, a second exponent and a second mantissa; the first and second exponents of each pair are compared and subtracted to obtain a maximum exponent and a mantissa offset, the mantissa belonging to the smaller exponent is shifted by the mantissa offset to align the mantissas, and the aligned first and second mantissas form an element group; the operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed; all adders in the arithmetic operator are activated, sum the elements of the groups selected by the operation controller, and output the sums to the formatting circuit; and the formatting circuit packs the sum of each element group and the corresponding maximum exponent from the preprocessing circuit into a floating-point format.
5. The active sparsification vector processor of claim 2, wherein, when the operating mode is floating-point extremum, the input data is one input vector; the preprocessing circuit compares all elements of the input vector, finds the extremum and its index, and sends them directly to the formatting circuit; and the formatting circuit keeps the extremum unchanged and separately packs its index.
6. The active sparsification vector processor of claim 2, wherein, when the operating mode is fixed-point multiplication, the input data comprises two input vectors; the preprocessing circuit sign-extends all elements of the two input vectors and converts all elements of the second input vector into compressed NAF codes; the operation controller performs bit-level zero-skipping control on the compressed NAF codes and periodically selects elements from the first input vector, the number of selected elements being the number of adder-tree channels in the arithmetic operator; the selected elements of the first input vector are NAF-weighted and the data to be computed are output, until all non-zero bits of the compressed NAF codes have been traversed; the adder tree and the accumulator of the arithmetic operator are fully activated, sum and accumulate the data to be computed, and output the accumulated result to the formatting circuit; and the formatting circuit re-quantizes the accumulated result according to the fixed-point quantization requirement.
7. The active sparsification vector processor of claim 2, wherein, when the operating mode is fixed-point addition, the input data comprises two input vectors; the preprocessing circuit sign-extends all elements of the two input vectors; the two elements at corresponding positions of the two input vectors form an element group; the operation controller periodically selects element groups, the number of selected groups being the number of adders in the arithmetic operator, until all element groups have been traversed; all adders in the arithmetic operator are activated, sum the elements of the groups selected by the operation controller, and output the sums to the formatting circuit; and the formatting circuit re-quantizes the sums of all element groups according to the fixed-point quantization requirement.
8. The active sparsification vector processor of claim 2, wherein, when the operating mode is fixed-point extremum, the input data is one input vector; the preprocessing circuit compares all elements of the input vector, finds the extremum and its index, and sends them directly to the formatting circuit; and the formatting circuit keeps the extremum unchanged and separately packs its index.
9. The active sparsification vector processor of claim 3 or 6, wherein the method for conversion to compressed NAF encoding is:
the signed binary data is converted into a separated NAF encoding in either of two ways;
the first separated NAF encoding proceeds as follows:
right-shift the N-bit signed binary datum x by one bit (arithmetic shift) to obtain an N-bit first intermediate result xh, N being a positive integer; add x and xh, discarding overflow, to obtain an N-bit second intermediate result x3; perform bitwise operations on x3 and xh to obtain the +1 flag and the -1 flag of the first separated NAF encoding, wherein ANDing x3 bitwise with the complement of xh gives the +1 flag naf.pos, and ANDing the complement of x3 bitwise with xh gives the -1 flag naf.neg;
when bit n of both the +1 flag and the -1 flag of the first separated NAF encoding is 0, bit n of the NAF is 0, with 0 ≤ n ≤ N-1; when bits n of the +1 flag and the -1 flag are 0 and 1 respectively, bit n of the NAF is -1; when they are 1 and 0 respectively, bit n of the NAF is +1;
the second separated NAF encoding proceeds as follows:
right-shift the N-bit signed binary datum x by one bit (arithmetic shift) to obtain an N-bit first intermediate result xh, N being a positive integer; add x and xh, discarding overflow, to obtain an N-bit second intermediate result x3; perform bitwise operations on x3 and xh to obtain the -1 flag and the non-0 flag of the second separated NAF encoding, wherein ANDing the complement of x3 bitwise with xh gives the -1 flag naf.neg, and XORing x3 bitwise with xh gives the non-0 flag naf.non0;
when bit n of both the -1 flag and the non-0 flag of the second separated NAF encoding is 1, bit n of the NAF is -1; when both are 0, bit n of the NAF is 0; when bits n of the -1 flag and the non-0 flag are 0 and 1 respectively, bit n of the NAF is +1;
the separated NAF encoding is converted into a compressed NAF encoding in either of two ways;
for an N-bit NAF, if N is odd, a 0 is prepended before the most significant bit to obtain an (N+1)-bit extended NAF and M = N + 1; if N is even, the N-bit NAF itself serves as the extended NAF and M = N;
the first compressed NAF encoding method is:
if bits 2m+1 and 2m of the NAF are {-1, 0}, with 0 ≤ m ≤ M/2 - 1, the first compressed NAF code is 110, representing a compressed-NAF digit of value -2; if they are {0, -1}, the code is 101, representing -1; if {0, 0}, the code is 000, representing 0; if {0, 1}, the code is 001, representing +1; if {1, 0}, the code is 010, representing +2;
the second compressed NAF encoding method is:
if bits 2m+1 and 2m of the NAF are {-1, 0}, with 0 ≤ m ≤ M/2 - 1, the second compressed NAF code is 111, representing a compressed-NAF digit of value -2; if they are {0, -1}, the code is 101, representing -1; if {0, 0}, the code is 000, representing 0; if {0, 1}, the code is 001, representing +1; if {1, 0}, the code is 011, representing +2.
CN202110986231.4A 2021-08-26 2021-08-26 Active sparsification vector processor Active CN113835754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110986231.4A CN113835754B (en) 2021-08-26 2021-08-26 Active sparsification vector processor

Publications (2)

Publication Number Publication Date
CN113835754A true CN113835754A (en) 2021-12-24
CN113835754B CN113835754B (en) 2023-04-18

Family

ID=78961217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110986231.4A Active CN113835754B (en) 2021-08-26 2021-08-26 Active sparsification vector processor

Country Status (1)

Country Link
CN (1) CN113835754B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030028572A1 (en) * 2001-06-29 2003-02-06 Yatin Hoskote Fast single precision floating point accumulator using base 32 system
US20090248778A1 (en) * 2008-03-28 2009-10-01 Magerlein Karen A Systems and methods for a combined matrix-vector and matrix transpose vector multiply for a block-sparse matrix
US20110078226A1 (en) * 2009-09-30 2011-03-31 International Business Machines Corporation Sparse Matrix-Vector Multiplication on Graphics Processor Units
US20120259906A1 (en) * 2011-04-08 2012-10-11 Fujitsu Limited Arithmetic circuit, arithmetic processing apparatus and method of controlling arithmetic circuit
CN110766136A (en) * 2019-10-16 2020-02-07 北京航空航天大学 Compression method of sparse matrix and vector
US20200293488A1 (en) * 2019-03-15 2020-09-17 Intel Corporation Scalar core integration
US20210191733A1 (en) * 2019-12-23 2021-06-24 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors (fast) in machine learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YAN ZHANG: "FPGA vs. GPU for sparse matrix vector multiply" *
常亮, 杨思琪: "Development of magnetic random access memory and its cache applications" *
白洪涛: "GPU-based optimization of sparse matrix-vector multiplication" *

Also Published As

Publication number Publication date
CN113835754B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
KR102153791B1 (en) Digital neural, artificial neuron for artificial neuron network and inference engine having the same
US10872295B1 (en) Residual quantization of bit-shift weights in an artificial neural network
CN101981618B (en) Reduced-complexity vector indexing and de-indexing
JP3902990B2 (en) Hadamard transform processing method and apparatus
CN109389208B (en) Data quantization device and quantization method
US5577132A (en) Image coding/decoding device
CN114647399B (en) Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device
CN110019184B (en) Method for compressing and decompressing ordered integer array
CN111488133A (en) High-radix approximate Booth coding method and mixed-radix Booth coding approximate multiplier
CN114615507A (en) Image coding method, decoding method and related device
CN113835754B (en) Active sparsification vector processor
JP3033671B2 (en) Image signal Hadamard transform encoding / decoding method and apparatus
CN100493199C (en) Coding apparatus, coding method, and codebook
CN112702600B (en) Image coding and decoding neural network layered fixed-point method
CN110766136B (en) Compression method of sparse matrix and vector
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN111492369A (en) Residual quantization of shift weights in artificial neural networks
EP1617324A1 (en) Method and system for digital signal processing and program product therefor
CN116205244A (en) Digital signal processing structure
CN116451769A (en) Quantization method of language model and electronic equipment
CN113794709B (en) Hybrid coding method for binary sparse matrix
KR20220045920A (en) Method and apparatus for processing images/videos for machine vision
CN110737869B (en) DCT/IDCT multiplier circuit optimization method and application
Swilem A fast vector quantization encoding algorithm based on projection pyramid with Hadamard transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant