CN112712168A - Method and system for realizing high-efficiency calculation of neural network - Google Patents


Info

Publication number
CN112712168A
Authority
CN
China
Prior art keywords
data
quantization
address
processed
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011640098.9A
Other languages
Chinese (zh)
Inventor
周方坤
唐士斌
欧阳鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingwei Intelligent Technology Co ltd
Original Assignee
Beijing Qingwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Technology Co ltd
Priority to CN202011640098.9A
Publication of CN112712168A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention provides a method for efficient neural network computation that can run on a reconfigurable processor. The reconfigurable processor has a set processing amount and contains a plurality of types of operation-unit hardware. The method comprises the following steps: a quantization parameter is obtained according to the operation type, the data to be processed and the processing amount; the data to be processed are stored in the storage layout of the HWCn model, where n in the HWCn model is a set single storage number corresponding to the processing amount; the parallelism of the operation units is obtained from the single storage number; and the data to be processed and the intermediate computation data fetched from the HWCn model on each access are registered in the order of the registers occupied by the computing units. The invention thus combines quantization of the data to be processed with an efficient storage layout, and improves the running speed and reliability of the neural network on hardware through an energy-efficient pipeline design on the hardware. The invention also provides a system for efficient neural network computation.

Description

Method and system for realizing high-efficiency calculation of neural network
Technical Field
The invention relates to the field of convolutional neural network computing. The invention particularly relates to a method and a system for realizing high-efficiency calculation of a neural network.
Background
In convolutional neural network computation, the prediction error of the network is reduced by iteratively updating the model parameters so that the network output comes closer to what is expected. This involves a large number of activation functions and loss functions, and in addition a large amount of data multiplication, data addition, quantization of intermediate operation results, quantization of nonlinear operations, and the like. These operations are vector-wise in character and also have a per-channel character. In the prior art, however, because the implementing hardware has a fixed processing and memory structure, the computation requirements of the convolutional neural network cannot be met while computation speed and reliability are ensured, efficient convolutional neural network computation cannot be guaranteed, and implementation and application are affected.
Disclosure of Invention
The invention aims to provide a method for efficient neural network computation that combines quantization of the data to be processed with an efficient storage layout, and improves the running speed and reliability of the neural network on hardware through an energy-efficient pipeline design on the hardware.
Another object of the present invention is to provide a system for efficient neural network computation that likewise improves the running speed and reliability of the neural network on hardware through an energy-efficient pipeline design on the hardware.
In one aspect of the invention, a method for achieving efficient computation of a neural network is provided that is capable of operating on a reconfigurable processor. The reconfigurable processor has a set processing amount. The reconfigurable processor has a plurality of types of arithmetic unit hardware therein.
The method for realizing the high-efficiency calculation of the neural network comprises the following steps:
and step S101, obtaining a quantization parameter according to the operation type, the data to be processed and the processing amount.
And S102, storing the data to be processed in a storage mode of the HWCn model. N in the HWCn model is a set single-time storage number. The number of single stores corresponds to the throughput.
And step S103, acquiring the parallel number of the arithmetic unit according to the single storage number. And acquiring a plurality of operation units corresponding to the operation types and the calling orders of the operation units according to the operation types. The plurality of arithmetic units correspond to arithmetic unit hardware types. And combining the operation units into a data operation combination according to the calling sequence and the quantization parameters of the operation units.
Step S104, obtaining the sequence of the registers occupied by the multiple computing units according to the data operation combination and the calling sequence of the multiple computing units, so that the execution data operation combination can register the to-be-processed data and the intermediate computing data called from the HWCn model each time through the sequence of the registers occupied by the multiple computing units.
In one embodiment of the method for implementing efficient calculation of a neural network according to the present invention, step S101 includes:
Step S1011, perform a convolution forward operation on the picture data to be processed, and obtain statistically the maximum absolute value max of the convolution input.
Step S1012, obtain the floating-point quantization interval dist_scale through formula 1:
dist_scale = max / 2048.0 (formula 1)
Step S1013, perform the convolution forward operation on the picture data to be processed again, and construct a histogram of length 2048 according to the convolution floating-point quantization interval dist_scale.
Step S1014, obtain the data processing interval [128, 2048] according to the multiple of the set 8-byte processing amount and the histogram length 2048.
Step S1015, cyclically traverse the quantization threshold th over the data interval [128, 2048]. For each threshold th, collapse the 2048-bin histogram into a histogram of length 128.
Step S1016, compute the KL divergence from the histogram of length 128, and take the threshold with the smallest KL divergence as the optimal threshold target_th.
Step S1017, obtain the quantization parameter scale through formula 2:
scale = (target_th + 0.5) × dist_scale / 127.0 (formula 2)
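As a non-authoritative illustration of steps S1011 to S1017, the following Python sketch shows one way the calibration could be reproduced in software. numpy, the function name calibrate_conv_input, and in particular the way the 2048-bin histogram is merged to 128 bins and spread back for the KL comparison are assumptions of this sketch; the patent only states that the histogram is changed to length 128 and that the KL divergence is computed.

    import numpy as np

    def calibrate_conv_input(abs_inputs, num_bins=2048, target_bins=128):
        """abs_inputs: absolute values of the convolution inputs collected during
        the forward pass over the calibration pictures (step S1011)."""
        max_val = float(np.max(abs_inputs))                          # step S1011
        dist_scale = max_val / 2048.0                                # step S1012, formula 1
        hist, _ = np.histogram(abs_inputs, bins=num_bins,
                               range=(0.0, num_bins * dist_scale))   # step S1013

        best_th, best_kl = target_bins, float("inf")
        for th in range(target_bins, num_bins + 1):                  # step S1015: th in [128, 2048]
            p = hist[:th].astype(np.float64)
            p[-1] += hist[th:].sum()                                 # fold the clipped tail into the last bin
            edges = np.linspace(0, th, target_bins + 1).astype(int)
            q = np.zeros(th)
            for i in range(target_bins):                             # collapse th bins to 128 bins, then
                lo, hi = edges[i], edges[i + 1]                      # spread each merged count back over
                mask = p[lo:hi] > 0                                  # its non-empty source bins
                if mask.any():
                    q[lo:hi][mask] = p[lo:hi].sum() / mask.sum()
            p = p / p.sum()
            q = q / max(q.sum(), 1e-12)
            nz = p > 0
            kl = float(np.sum(p[nz] * np.log(p[nz] / np.maximum(q[nz], 1e-12))))
            if kl < best_kl:                                         # step S1016: smallest KL divergence wins
                best_kl, best_th = kl, th
        return (best_th + 0.5) * dist_scale / 127.0                  # step S1017, formula 2

For a collected array conv_inputs of convolution input values, scale = calibrate_conv_input(np.abs(conv_inputs)) would return the quantization parameter of formula 2 under the assumptions above.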
In another embodiment of the method for implementing efficient calculation of a neural network according to the present invention, step S101 includes:
and step S1, when the operation type is the Eltwise layer operation, respectively carrying out forward operation on the two paths of input data to be processed, and obtaining two paths of quantization coefficients SApre and SBpre through KL divergence. And each path is int8 quantization or uint8 quantization during quantization.
And step S2, obtaining the quantized coefficient Scur according to the two paths of quantized coefficients SApre and SBpre. When two paths of input are quantized to int8 in the quantization process, the output is also quantized to int 8. While the two inputs are quantized by the uint8, the output is also quantized by the uint 8. When the two inputs are different, the output is int8 quantization.
In another embodiment of the method for implementing efficient calculation of a neural network according to the present invention, in step S103, when the operation type is a convolution operation, the corresponding operations are addition, multiplication or exclusive-or. The corresponding operation-unit hardware types are an adder for addition, a multiplier for multiplication, and a comparator for the exclusive-or operation.
In another embodiment of the method for implementing efficient calculation of a neural network according to the present invention, in step S102, when the set single storage number is 8, the data width is 16 bits, there are 10 data in the H (height) direction and 10 in the W (width) direction, and there are 16 data in the C (depth) direction, the HWC8 model storage layout is as follows:
After the data to be processed are imported from external storage into the internal memory of the reconfigurable processor, they are buffered in a RAM with a data bit width of 128 bits.
The first address in the RAM stores the data C0-C7 of H0W0. H0 is the 0 address in the height direction, W0 is the 0 address in the width direction, and C0-C7 are the 0-7 data in the depth direction. The second address stores C0-C7 of H0W1, and so on until the tenth address stores C0-C7 of H0W9.
The eleventh address then stores C0-C7 of H1W0, the twentieth address stores C0-C7 of H1W9, and so on until the hundredth address stores C0-C7 of H9W9, at which point the first HWC8 block is full and the C0-C7 data are fully stored. The above is repeated so that the second HWC8 block is filled with the C8-C15 data.
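As an illustration of the storage order just described, the following sketch maps an (h, w, c) element to its RAM address and to its lane inside the 128-bit word under the HWC8 layout with H = W = 10 and C = 16; the helper name hwc8_address and its interface are assumptions made for illustration, not part of the disclosure.

    def hwc8_address(h, w, c, H=10, W=10, n=8):
        """Return (ram_address, lane) of element (h, w, c) under HWC8 storage."""
        group = c // n                  # which block of n channels: C0-C7, then C8-C15
        lane = c % n                    # position inside the 128-bit word
        return group * H * W + h * W + w, lane

    # H0W0 C0-C7 -> address 0, H0W9 C0-C7 -> address 9 (the tenth address),
    # H9W9 C0-C7 -> address 99 (the hundredth address), then H0W0 C8-C15 -> address 100.
    assert hwc8_address(0, 0, 0) == (0, 0)
    assert hwc8_address(0, 9, 0) == (9, 0)
    assert hwc8_address(9, 9, 7) == (99, 7)
    assert hwc8_address(0, 0, 8) == (100, 0)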
In a second aspect of the invention, a system for implementing neural network efficient computing is provided that is capable of running on a reconfigurable processor. The reconfigurable processor has a set processing amount. The reconfigurable processor has a plurality of types of arithmetic unit hardware therein.
The system for implementing efficient neural network computation comprises a quantization parameter acquisition unit, a storage unit, a data operation acquisition unit and a register unit, wherein:
The quantization parameter acquisition unit is configured to obtain a quantization parameter according to the operation type, the data to be processed and the processing amount.
The storage unit is configured to store the data to be processed in the storage layout of the HWCn model. n in the HWCn model is a set single storage number. The single storage number corresponds to the processing amount.
The data operation acquisition unit is configured to obtain the parallelism of the operation units according to the single storage number; to obtain, according to the operation type, a plurality of operation units corresponding to the operation type together with the calling order of the operation units, where the operation units correspond to operation-unit hardware types; and to combine the operation units into a data operation combination according to the calling order of the operation units and the quantization parameter.
The register unit is configured to obtain the order of the registers occupied by the computing units according to the data operation combination and the calling order of the computing units, so that when the data operation combination is executed, the data to be processed and the intermediate computation data fetched from the HWCn model on each access can be registered in the order of the registers occupied by the computing units.
In an embodiment of the system for implementing efficient calculation of a neural network according to the present invention, the quantization parameter obtaining unit is configured to:
and carrying out convolution forward operation on the image data to be processed, and acquiring the maximum value max of the convolution input absolute value in a statistical mode.
And obtaining a floating point quantization interval dist _ scale through formula 3.
dist _ scale ═ max/2048.0. Equation 3
And performing convolution forward operation on the picture data to be processed again, and constructing a histogram with the length of 2048 according to the convolution floating point quantization interval dist _ scale.
A data processing section [128,2048] is acquired based on the multiple of the set throughput of 8 bytes and the histogram length 2048.
The quantization threshold th is traversed cyclically in the data interval 128,2048. For each threshold th, the histogram of 2048 is changed to a histogram of length 128.
The KL divergence is calculated from the histogram of length 128, and the threshold value when the KL divergence is the smallest is taken as the optimal threshold value target _ th.
The quantization parameter scale is obtained by equation 4.
scale ═ (target _ th +0.5) × (dist _ scale/127.0 equation 4.
In another embodiment of the present invention, a system for neural network efficient computation is provided, wherein the quantization parameter obtaining unit is configured to:
when the operation type is the operation of an Eltwise layer, forward operation is respectively carried out on two paths of input data to be processed, and two paths of quantization coefficients SApre and SBpre are obtained through KL divergence. And each path is int8 quantization or uint8 quantization during quantization.
And obtaining a quantization coefficient Scur according to the two paths of quantization coefficients SApre and SBpre. When two paths of input are quantized to int8 in the quantization process, the output is also quantized to int 8. While the two inputs are quantized by the uint8, the output is also quantized by the uint 8. When the two inputs are different, the output is int8 quantization.
In another embodiment of the system for implementing efficient computation of a neural network according to the present invention, the data operation acquisition unit is configured so that, when the operation type is a convolution operation, the corresponding operations are addition, multiplication or exclusive-or, and the corresponding operation-unit hardware types are an adder for addition, a multiplier for multiplication, and a comparator for the exclusive-or operation.
In another embodiment of the system for implementing efficient calculation of a neural network according to the present invention, when the set single storage number in the storage unit is 8, the data width is 16 bits, there are 10 data in the H (height) direction and 10 in the W (width) direction, and there are 16 data in the C (depth) direction, the HWC8 model storage layout is as follows:
After the data to be processed are imported from external storage into the internal memory of the reconfigurable processor, they are buffered in a RAM with a data bit width of 128 bits.
The first address in the RAM stores the data C0-C7 of H0W0. H0 is the 0 address in the height direction, W0 is the 0 address in the width direction, and C0-C7 are the 0-7 data in the depth direction. The second address stores C0-C7 of H0W1, and so on until the tenth address stores C0-C7 of H0W9.
The eleventh address then stores C0-C7 of H1W0, the twentieth address stores C0-C7 of H1W9, and so on until the hundredth address stores C0-C7 of H9W9, at which point the first HWC8 block is full and the C0-C7 data are fully stored. The above is repeated so that the second HWC8 block is filled with the C8-C15 data.
The characteristics, technical features, advantages and implementation manners of the method and system for realizing efficient calculation of the neural network are further described in a clear and understandable manner by combining the attached drawings.
Drawings
Fig. 1 is a flow chart for explaining a method for realizing efficient calculation of a neural network in one embodiment of the present invention.
Fig. 2 is a schematic diagram for explaining a flow of acquiring quantization parameters in convolutional layer calculation according to an embodiment of the present invention.
Fig. 3 is a schematic diagram for explaining the acquisition flow of quantization parameters in the Eltwise layer calculation according to an embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating the components of a system for implementing neural network efficient computing in one embodiment of the present invention.
Fig. 5 is a schematic diagram for explaining the execution of registers in the method for realizing the neural network efficient computation according to another embodiment of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings, in which the same reference numerals indicate the same or structurally similar but functionally identical elements.
"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings only schematically show the parts relevant to the present exemplary embodiment, and they do not represent the actual structure and the true scale of the product.
In one aspect of the invention, a method for achieving efficient computation of a neural network is provided that is capable of operating on a reconfigurable processor. The reconfigurable processor has a set processing amount. The reconfigurable processor has a plurality of types of arithmetic unit hardware therein.
As shown in fig. 1, the method for realizing the neural network efficient computation includes:
step S101, obtaining a quantization parameter.
In this step, a quantization parameter is obtained according to the operation type, the data to be processed, and the processing amount.
Step S102, store data through HWCn model.
In this step, the data to be processed is stored in the storage mode of the HWCn model. N in the HWCn model is a set single-time storage number. The number of single stores corresponds to the throughput.
In step S103, a combination of arithmetic units is acquired.
In this step, the parallel number of the arithmetic unit is obtained according to the single storage number. And acquiring a plurality of operation units corresponding to the operation types and the calling orders of the operation units according to the operation types. The plurality of arithmetic units correspond to arithmetic unit hardware types. And combining the operation units into a data operation combination according to the calling sequence and the quantization parameters of the operation units.
And step S104, calling and calculating the data to be processed.
In this step, the order of the registers occupied by the plurality of computing units is obtained according to the data operation combination and the calling order of the plurality of computing units, so that the execution data operation combination can register the to-be-processed data and the intermediate computing data called from the HWCn model each time through the order of the registers occupied by the plurality of computing units.
As shown in fig. 2, in an embodiment of the method for implementing neural network efficient computation of the present invention, step S101 includes:
in step S1011, the maximum value max of the convolution input absolute value is acquired.
In this step, the convolution forward calculates the image data to be processed, and obtains the maximum max of the convolution input absolute value in a statistical manner.
In step S1012, the calculation floating point quantization interval dist _ scale is obtained.
In this step, the calculation floating point quantization interval dist _ scale is obtained through formula 1.
dist _ scale ═ max/2048.0. Equation 1
In step S1013, a histogram is constructed.
In this step, the image data to be processed is convolved again and a histogram with a length of 2048 is constructed according to the convolved floating point quantization interval dist _ scale.
In step S1014, a data processing section is acquired.
In this step, a data processing interval [128,2048] is obtained based on the multiple of the set throughput of 8 bytes and the histogram length 2048.
In step S1015, the histogram is transformed.
In this step, the quantization threshold th is traversed in a loop in the data interval [128,2048 ]. For each threshold th, the histogram of 2048 is changed to a histogram of length 128.
In step S1016, the optimal threshold value target _ th is acquired.
In this step, KL divergence is calculated from a histogram of length 128, and the threshold value at which KL divergence is minimum is taken as the optimum threshold value target _ th.
Step S1017, a quantization parameter scale is acquired.
In this step, a quantization parameter scale is obtained by formula 2.
scale ═ (target _ th +0.5) × (dist _ scale/127.0 equation 2.
Therefore, the acquisition of the quantization parameters in the convolution layer calculation is realized.
As shown in fig. 3, in another embodiment of the method for implementing neural network efficient computation of the present invention, step S101 includes:
and step S1, acquiring two paths of quantization coefficients SApre and SBpre.
When the operation type is the operation of an Eltwise layer, forward operation is respectively carried out on two paths of input data to be processed, and two paths of quantization coefficients SApre and SBpre are obtained through KL divergence. And each path is int8 quantization or uint8 quantization during quantization. The aforementioned Eltwise layer can be understood as a fully connected layer or an output layer in the neural network.
In step S2, the quantization parameter is output.
Therefore, the quantization parameter in the Eltwise layer operation is obtained.
And obtaining a quantization coefficient Scur according to the two paths of quantization coefficients SApre and SBpre. When two paths of input are quantized to int8 in the quantization process, the output is also quantized to int 8. While the two inputs are quantized by the uint8, the output is also quantized by the uint 8. When the two inputs are different, the output is int8 quantization.
In another embodiment of the method for implementing efficient calculation of a neural network according to the present invention, in step S103, when the operation type is a convolution operation, the corresponding operations are addition, multiplication or exclusive-or. The corresponding operation-unit hardware types are an adder for addition, a multiplier for multiplication, and a comparator for the exclusive-or operation.
In another embodiment of the method for implementing efficient calculation of a neural network according to the present invention, in step S102, when the set single storage number is 8, the data width is 16 bits, there are 10 data in the H (height) direction and 10 in the W (width) direction, and there are 16 data in the C (depth) direction, the HWC8 model storage layout is as follows:
After the data to be processed are imported from external storage into the internal memory of the reconfigurable processor, they are buffered in a RAM with a data bit width of 128 bits.
The first address in the RAM stores the data C0-C7 of H0W0. H0 is the 0 address in the height direction, W0 is the 0 address in the width direction, and C0-C7 are the 0-7 data in the depth direction. The second address stores C0-C7 of H0W1, and so on until the tenth address stores C0-C7 of H0W9.
The eleventh address then stores C0-C7 of H1W0, the twentieth address stores C0-C7 of H1W9, and so on until the hundredth address stores C0-C7 of H9W9, at which point the first HWC8 block is full and the C0-C7 data are fully stored. The above is repeated so that the second HWC8 block is filled with the C8-C15 data.
In a second aspect of the invention, a system for implementing neural network efficient computing is provided that is capable of running on a reconfigurable processor. The reconfigurable processor has a set processing amount. The reconfigurable processor has a plurality of types of arithmetic unit hardware therein.
As shown in fig. 4, a system for implementing neural network efficient computing includes: a quantization parameter acquisition unit 101, a storage unit 201, a data operation acquisition unit 301, and a register unit 401. Wherein:
a quantization parameter obtaining unit 101 configured to obtain a quantization parameter according to an operation type, data to be processed, and a processing amount.
The storage unit 201 is configured to store the data to be processed in the storage layout of the HWCn model. n in the HWCn model is a set single storage number. The single storage number corresponds to the processing amount.
The data operation acquisition unit 301 is configured to obtain the parallelism of the operation units according to the single storage number; to obtain, according to the operation type, a plurality of operation units corresponding to the operation type together with the calling order of the operation units, where the operation units correspond to operation-unit hardware types; and to combine the operation units into a data operation combination according to the calling order of the operation units and the quantization parameter.
The register unit 401 is configured to obtain the order of the registers occupied by the computing units according to the data operation combination and the calling order of the computing units, so that when the data operation combination is executed, the data to be processed and the intermediate computation data fetched from the HWCn model on each access can be registered in the order of the registers occupied by the computing units.
In an embodiment of the system for implementing efficient calculation of a neural network, the quantization parameter obtaining unit 101 is configured to:
and carrying out convolution forward operation on the image data to be processed, and acquiring the maximum value max of the convolution input absolute value in a statistical mode.
And obtaining a floating point quantization interval dist _ scale through formula 3.
dist _ scale ═ max/2048.0. Equation 3
And performing convolution forward operation on the picture data to be processed again, and constructing a histogram with the length of 2048 according to the convolution floating point quantization interval dist _ scale.
A data processing section [128,2048] is acquired based on the multiple of the set throughput of 8 bytes and the histogram length 2048.
The quantization threshold th is traversed cyclically in the data interval 128,2048. For each threshold th, the histogram of 2048 is changed to a histogram of length 128.
The KL divergence is calculated from the histogram of length 128, and the threshold value when the KL divergence is the smallest is taken as the optimal threshold value target _ th.
The quantization parameter scale is obtained by equation 4.
scale ═ (target _ th +0.5) × (dist _ scale/127.0 equation 4.
In another embodiment of the present invention, the quantization parameter obtaining unit 101 is configured to:
when the operation type is the operation of an Eltwise layer, forward operation is respectively carried out on two paths of input data to be processed, and two paths of quantization coefficients SApre and SBpre are obtained through KL divergence. And each path is int8 quantization or uint8 quantization during quantization.
And obtaining a quantization coefficient Scur according to the two paths of quantization coefficients SApre and SBpre. When two paths of input are quantized to int8 in the quantization process, the output is also quantized to int 8. While the two inputs are quantized by the uint8, the output is also quantized by the uint 8. When the two inputs are different, the output is int8 quantization.
In another embodiment of the system for implementing efficient calculation of a neural network according to the present invention, the data operation acquisition unit 301 is configured so that, when the operation type is a convolution operation, the corresponding operations are addition, multiplication or exclusive-or, and the corresponding operation-unit hardware types are an adder for addition, a multiplier for multiplication, and a comparator for the exclusive-or operation.
In another embodiment of the system for implementing efficient calculation of a neural network according to the present invention, when the set single storage number in the storage unit 201 is 8, the data width is 16 bits, there are 10 data in the H (height) direction and 10 in the W (width) direction, and there are 16 data in the C (depth) direction, the HWC8 model storage layout is as follows:
After the data to be processed are imported from external storage into the internal memory of the reconfigurable processor, they are buffered in a RAM with a data bit width of 128 bits.
The first address in the RAM stores the data C0-C7 of H0W0. H0 is the 0 address in the height direction, W0 is the 0 address in the width direction, and C0-C7 are the 0-7 data in the depth direction. The second address stores C0-C7 of H0W1, and so on until the tenth address stores C0-C7 of H0W9.
The eleventh address then stores C0-C7 of H1W0, the twentieth address stores C0-C7 of H1W9, and so on until the hundredth address stores C0-C7 of H9W9, at which point the first HWC8 block is full and the C0-C7 data are fully stored. The above is repeated so that the second HWC8 block is filled with the C8-C15 data.
In another embodiment, the method for implementing efficient neural network computation comprises the following steps:
step S01, data quantization:
1. The quantization operation process is illustrated using 8-bit convolution quantization.
1.1. Run a forward pass, for example over 1000 pictures, and record the maximum absolute value of the convolution input as max.
1.2. Compute the floating-point quantization interval: dist_scale = max / 2048.0.
1.3. Run the forward pass over the same 1000 pictures again; from the convolution input floating-point values, construct a histogram of length 2048 based on the quantization interval dist_scale.
1.4. Cyclically traverse the quantization threshold th over the interval [128, 2048]; for each threshold th, collapse the 2048-bin histogram into a histogram of length 128, compute the KL divergence, and take the threshold with the smallest KL divergence as the optimal threshold target_th.
1.5. Compute the input quantization parameter scale = (target_th + 0.5) × dist_scale / 127.0.
After the above steps, the quantization coefficient scale of the convolution input is obtained.
2. In vector operations with two input data streams (for example, an Eltwise layer operation), the following quantization procedure is used.
2.1. Run a forward pass on each of the two inputs, and use the KL divergence to obtain the two quantization coefficients, denoted SApre and SBpre. Note that during quantization each stream may be int8 or uint8.
2.2. Record the quantization coefficient of the subsequent quantization layer as Scur.
2.3. When both inputs are int8 in the quantization process, the output is also int8; when both inputs are uint8, the output is uint8; when the two inputs differ, the output is int8.
2.4. uint8 is an 8-bit unsigned integer occupying one byte; int8 is the corresponding 8-bit signed integer.
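A minimal sketch of the output-type rule in steps 2.3 and 2.4 is given below; the function name and the string encoding of the quantization types are assumptions made for illustration only.

    def eltwise_output_dtype(dtype_a, dtype_b):
        """dtype_a, dtype_b: 'int8' or 'uint8' quantization of the two input streams."""
        if dtype_a == dtype_b == "uint8":
            return "uint8"              # both inputs uint8 -> output uint8
        return "int8"                   # both int8, or mixed inputs -> output int8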
Step S02, storing the data in a vectorized model:
In the on-chip SRAM there are different storage layouts for buffering input data and intermediate data. For example, the HWC storage model stores the C channels first: after the C channels of one (H, W) position have been stored, the data of (H, W+1) are stored, and when the W direction is finished, storage continues by moving with stride 1 along the H direction.
There is also HWCn storage, in which, after n values along the C direction have been stored, the layout moves along the W direction, and after the W direction is finished it moves again along the H direction. The present invention discusses quantization and pipelining on HWCn storage in detail. In HWCn mode, n data can be read from the SRAM into the computation pipeline in one clock cycle; to match this input bandwidth, computation units capable of processing n data must also be prepared. In this mode, the degree of data parallelism n determines the computational throughput.
For example, when n is 8, i.e. in the HWC8 model, with a data width of 16 bits, H and W equal to 10, and C equal to 16, a RAM with a data bit width of 128 bits is needed for buffering after the input data have been imported from external storage into the hardware. The first address in the RAM stores C0-C7 of H0W0, the second address stores C0-C7 of H0W1, and so on until the tenth address stores C0-C7 of H0W9; the eleventh address then stores C0-C7 of H1W0.
This continues until the twentieth address stores C0-C7 of H1W9, and so on until the hundredth address stores C0-C7 of H9W9, i.e. the first HWC8 block is full. Because C is 16 in this model, a storage space of the same size follows for the second HWC8 block: the one-hundred-and-first address stores C8-C15 of H0W0, and so on until the two-hundredth address stores C8-C15 of H9W9.
With data stored in this model, the operation units fetch sequentially from the SRAM addresses every clock cycle; the HWC8 data of one address are supplied to 8 operation units for computation, and once the 8 values have been computed they are stored in the same format at another blank address of the RAM. The advantage of this data storage is that the throughput of the data-supply side and the throughput of the data-consumption side match exactly, and an operation that does not change the data model does not change the data storage layout, so no external hardware resources are needed to rearrange the data and no resources or clock cycles are wasted.
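The consume-and-produce loop described above can be pictured with the following sketch, in which the RAM is modelled as a list of 8-element words and all names are assumptions; each iteration corresponds to one clock cycle in which one 128-bit word is fetched, eight lanes operate in parallel, and the eight results are written back in the same HWC8 format.

    def process_hwc8(ram, src_base, dst_base, num_addresses, lane_op):
        """ram: list of 8-element words; lane_op: the operation each lane performs."""
        for addr in range(num_addresses):            # one RAM address per clock cycle
            word = ram[src_base + addr]              # e.g. C0-C7 of one (H, W) position
            result = [lane_op(x) for x in word]      # 8 operation units working in parallel
            ram[dst_base + addr] = result            # written back in the same HWC8 layout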
Step S03, the data operation model.
The fewer on-chip resources required to meet the demand, the better. As described in step S02, the parallelism of the operation units is determined by the number n in the HWCn model, and the type of operation unit is determined by the operation requirement. The operation requirements of the network model determine the hardware operation-unit types: a pipelined multiplier is needed where multiplication is required, a pipelined adder where addition is required, a comparator that can output the compared numbers where comparison is required, and so on.
Before the operation is performed, the fixed-point operation types used between the hardware units are determined according to the quantization method described in step S01. The fixed-point data carry shift values and quantization parameters, and these multiplication and shift operations must also be embodied in the data operations.
Take the BN operation, common in neural networks, as an example: the data must be multiplied by a quantization parameter and then have a quantization parameter added, so a group of multipliers and adders with parallelism n is needed; the data flow in HWCn order first into the multipliers and then into the adders, and with pipelined multiplier and adder units the data can enter the operation units continuously without delay. Take the common accumulation operation as another example: the adders must be used repeatedly, and because there is only one group of adders in the hardware, each output result must queue at the addend input of the next round and be added after the current addition has completed.
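To make the BN example concrete, the following sketch shows the per-word computation that a group of eight multipliers followed by eight adders would perform on one HWC8 word; the names and the multiply-then-shift fixed-point convention are assumptions, not the patented circuit. In hardware the multiplier stage and the adder stage are separate pipeline stages, so a new word can enter every cycle.

    def bn_fixed_point(word, scale, bias, shift):
        """word: the 8 values of one HWC8 address; scale, bias, shift: fixed-point
        quantization parameters of the kind derived in step S01."""
        out = []
        for x in word:                     # the 8 lanes run in parallel in hardware
            y = (x * scale) >> shift       # multiplier stage plus re-quantization shift
            out.append(y + bias)           # adder stage
        return out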
Step S04, a method for implementing the vector-computation pipeline design.
In the actual hardware design process, a stable timing must be met: the delay between registers cannot exceed a threshold, each register stage is controlled by the pipeline during operation, and pipelined computation units are used for the operations.
Pipeline control means that, between reading data from the RAM and writing the final result back, several registers are used to buffer intermediate results; this effectively reduces the data delay in the circuit so that a higher clock frequency can be met. The pipeline length is configured according to the operator: a single addition needs a pipeline of only 5 units, whereas a chain of operations such as addition, multiplication and addition can need more than 10 units.
Fig. 5 shows the pipeline of a multiply-then-add operation. rm0_rddata is the first-stage pipeline register; reg1 and reg2 are the input registers of the multiplier and also serve as the second-stage pipeline registers; the multiplication result is stored in reg3, which together with the already present reg4 serves as the third-stage pipeline register; and the final addition result stored in reg5 forms the fourth-stage pipeline register.
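The register chain of Fig. 5 can be modelled cycle by cycle as in the following sketch. The register names follow the figure; the Python model itself, the constant multiplier coefficient and the queued addend are simplifying assumptions.

    def multiply_add_pipeline(stream, coeff, addend):
        """Cycle-level model of the four pipeline stages of Fig. 5."""
        rm0_rddata = reg1 = reg2 = reg3 = reg4 = reg5 = None
        results = []
        for value in list(stream) + [None, None, None, None]:  # extra cycles drain the pipeline
            if reg3 is not None:                # stage 4: the adder writes reg5
                reg5 = reg3 + reg4
                results.append(reg5)
            if reg1 is not None:                # stage 3: product into reg3, addend waits in reg4
                reg3, reg4 = reg1 * reg2, addend
            else:
                reg3 = None
            if rm0_rddata is not None:          # stage 2: multiplier input registers reg1, reg2
                reg1, reg2 = rm0_rddata, coeff
            else:
                reg1 = None
            rm0_rddata = value                  # stage 1: word read from the RAM
        return results

    # multiply_add_pipeline([1, 2, 3], coeff=4, addend=5) returns [9, 13, 17]; each result
    # appears three cycles after its input enters the pipeline, and a new input can enter every cycle.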
It should be understood that although this description is organized by embodiments, each embodiment does not necessarily contain only a single independent technical solution; this manner of presentation is for clarity only, and those skilled in the art will recognize that the embodiments described herein may, taken as a whole, be suitably combined to form other embodiments.
The detailed description above is only a specific description of possible embodiments of the present invention and is not intended to limit the scope of the present invention; equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall all be included within the scope of the present invention.

Claims (10)

1. A method for realizing high-efficiency calculation of a neural network is characterized by being capable of running on a reconfigurable processor; the reconfigurable processor has a set processing amount; the reconfigurable processor is provided with a plurality of types of arithmetic unit hardware;
the method for realizing the high-efficiency calculation of the neural network comprises the following steps:
step S101, obtaining a quantization parameter according to an operation type, data to be processed and the processing amount;
step S102, storing the data to be processed in a storage mode of the HWCn model; n in the HWCn model is a set single storage number; the single storage number corresponds to the processing amount;
step S103, acquiring the parallel number of the arithmetic unit according to the single storage number; acquiring a plurality of operation units corresponding to the operation types and calling orders of the operation units according to the operation types; the plurality of arithmetic units correspond to the arithmetic unit hardware types; combining the operation units into a data operation combination according to the calling order of the operation units and the quantization parameters;
and step S104, acquiring the sequence of the registers occupied by the plurality of computing units according to the data operation combination and the calling sequence of the plurality of computing units, so that the data operation combination can register the data to be processed and the intermediate computing data called from the HWCn model each time through the sequence of the registers occupied by the plurality of computing units.
2. The method according to claim 1, wherein the step S101 comprises:
step S1011, carrying out convolution forward operation on the picture data to be processed, and acquiring the maximum value max of a convolution input absolute value in a statistical manner;
step S1012, obtaining the floating-point quantization interval dist_scale according to formula 1;
dist_scale = max/2048.0; (formula 1)
step S1013, performing the convolution forward operation on the picture data to be processed again, and constructing a histogram with a length of 2048 according to the convolution floating-point quantization interval dist_scale;
step S1014, acquiring a data processing interval [128, 2048] according to the multiple of the set processing amount of 8 bytes and the histogram length 2048;
step S1015, circularly traversing the quantization threshold th in the data interval [128, 2048]; for each threshold th, changing the 2048-bin histogram into a histogram of length 128;
step S1016, calculating the KL divergence according to the histogram of length 128, and taking the threshold with the smallest KL divergence as the optimal threshold target_th;
step S1017, obtaining the quantization parameter scale through formula 2;
scale = (target_th + 0.5) × dist_scale/127.0. (formula 2)
3. The method according to claim 1, wherein the step S101 comprises:
step S1, when the operation type is Eltwise layer operation, forward operation is respectively carried out on the two paths of input data to be processed, and two paths of quantization coefficients SApre and SBpre are obtained through KL divergence; when quantizing, each path is int8 quantization or uint8 quantization;
step S2, obtaining a quantization coefficient Scur according to the two paths of quantization coefficients SApre and SBpre; when both input paths are quantized to int8, the output is also quantized to int8; when both input paths are quantized to uint8, the output is also quantized to uint8; when the two inputs differ, the output is int8 quantization.
4. The method according to claim 1, wherein in step S103, when the operation type is convolution operation, the operation type corresponding thereto is: addition, multiplication or exclusive-or operation; the hardware type of the arithmetic unit is as follows: an adder corresponding to addition and a multiplier corresponding to multiplication; a comparator corresponding to an exclusive-or operation.
5. The method according to claim 1, wherein in step S102, when the set single storage count is 8, the data width is 16 bits, the H height direction data and the W width direction data are 10, and the C depth direction data is 16, the HWC8 model is stored in the following manner:
after data to be processed is imported into an internal memory of the reconfigurable processor from external storage, caching is carried out through an RAM with a data bit width of 128 bit;
the first address H0W0 in the RAM stores data of C0-C7; H0 is the 0 address in the height direction; W0 is the 0 address in the width direction; C0-C7 are the 0-7 data in the depth direction; the second address H0W1 stores data of C0-C7, and so on until the tenth address H0W9 stores data of C0-C7;
the eleventh address H1W0 then stores data of C0-C7, the twentieth address stores data of C0-C7 of H1W9, and so on until the hundredth address H9W9 stores data of C0-C7, so that the first HWC8 block is full and the C0-C7 data are fully stored; the above is repeated so that the second HWC8 block is filled with the C8-C15 data.
6. A system for implementing neural network efficient computing, capable of operating on a reconfigurable processor; the reconfigurable processor has a set processing amount; the reconfigurable processor is provided with a plurality of types of arithmetic unit hardware; the system for realizing the neural network efficient computation comprises the following steps: a quantization parameter obtaining unit, a storage unit, a data operation obtaining unit and a register unit; wherein:
the quantization parameter obtaining unit is configured to obtain quantization parameters according to operation types, data to be processed and the processing amount;
the storage unit is configured to store the to-be-processed data in a storage manner of a HWCn model; n in the HWCn model is a set single storage number; the single storage number corresponds to the processing amount;
the data operation acquisition unit is configured to acquire the parallel number of the operation unit according to the single storage number; acquiring a plurality of operation units corresponding to the operation types and calling orders of the operation units according to the operation types; the plurality of arithmetic units correspond to the arithmetic unit hardware types; combining the operation units into a data operation combination according to the calling order of the operation units and the quantization parameters;
the register unit is configured to acquire the order of the registers occupied by the plurality of computing units according to the data operation combination and the calling order of the plurality of computing units, so that the data operation combination can be executed to register the to-be-processed data and the intermediate computing data called from the HWCn model each time through the order of the registers occupied by the plurality of computing units.
7. The system of claim 6, wherein the quantization parameter obtaining unit is configured to:
carrying out convolution forward operation on the picture data to be processed, and acquiring the maximum value max of convolution input absolute values in a statistical mode;
obtaining the floating-point quantization interval dist_scale through formula 3;
dist_scale = max/2048.0; (formula 3)
carrying out the convolution forward operation on the picture data to be processed again, and constructing a histogram with a length of 2048 according to the convolution floating-point quantization interval dist_scale;
acquiring a data processing interval [128, 2048] according to the multiple of the set processing amount of 8 bytes and the histogram length 2048;
circularly traversing the quantization threshold th in the data interval [128, 2048]; for each threshold th, changing the 2048-bin histogram into a histogram of length 128;
calculating the KL divergence according to the histogram of length 128, and taking the threshold with the smallest KL divergence as the optimal threshold target_th;
obtaining the quantization parameter scale through formula 4;
scale = (target_th + 0.5) × dist_scale/127.0. (formula 4)
8. The system of claim 6, wherein the quantization parameter obtaining unit is configured to:
when the operation type is the operation of an Eltwise layer, forward operation is respectively carried out on the two paths of input data to be processed, and two paths of quantization coefficients SApre and SBpre are obtained through KL divergence; when quantizing, each path is int8 quantization or uint8 quantization;
obtaining a quantization coefficient Scur according to the two paths of quantization coefficients SApre and SBpre; when both input paths are quantized to int8, the output is also quantized to int8; when both input paths are quantized to uint8, the output is also quantized to uint8; when the two inputs differ, the output is int8 quantization.
9. The system according to claim 6, wherein the data operation obtaining unit is configured to, when the operation type is a convolution operation, corresponding to the operation type: addition, multiplication or exclusive-or operation; the hardware type of the arithmetic unit is as follows: an adder corresponding to addition and a multiplier corresponding to multiplication; a comparator corresponding to an exclusive-or operation.
10. The system according to claim 6, wherein when the set number of single-time storage in the storage unit is 8, the data width is 16 bits, the H height direction data and the W width direction data are 10, and the C depth direction data are 16, the HWC8 model is stored in the following way:
after data to be processed is imported into an internal memory of the reconfigurable processor from external storage, caching is carried out through an RAM with a data bit width of 128 bit;
the first address H0W0 in the RAM stores data of C0-C7; H0 is the 0 address in the height direction; W0 is the 0 address in the width direction; C0-C7 are the 0-7 data in the depth direction; the second address H0W1 stores data of C0-C7, and so on until the tenth address H0W9 stores data of C0-C7;
the eleventh address H1W0 then stores data of C0-C7, the twentieth address stores data of C0-C7 of H1W9, and so on until the hundredth address H9W9 stores data of C0-C7, so that the first HWC8 block is full and the C0-C7 data are fully stored; the above is repeated so that the second HWC8 block is filled with the C8-C15 data.
CN202011640098.9A 2020-12-31 2020-12-31 Method and system for realizing high-efficiency calculation of neural network Pending CN112712168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011640098.9A CN112712168A (en) 2020-12-31 2020-12-31 Method and system for realizing high-efficiency calculation of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011640098.9A CN112712168A (en) 2020-12-31 2020-12-31 Method and system for realizing high-efficiency calculation of neural network

Publications (1)

Publication Number Publication Date
CN112712168A (en) 2021-04-27

Family

ID=75547980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011640098.9A Pending CN112712168A (en) 2020-12-31 2020-12-31 Method and system for realizing high-efficiency calculation of neural network

Country Status (1)

Country Link
CN (1) CN112712168A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11261542A (en) * 1998-11-26 1999-09-24 Fujitsu Ltd Method and device for multimedia multiple transmission
CN101848311A (en) * 2010-02-21 2010-09-29 哈尔滨工业大学 JPEG2000 EBCOT encoder based on Avalon bus
CN111695682A (en) * 2019-03-15 2020-09-22 上海寒武纪信息科技有限公司 Operation method, device and related product
CN111915003A (en) * 2019-05-09 2020-11-10 深圳大普微电子科技有限公司 Neural network hardware accelerator

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108896A (en) * 2023-04-11 2023-05-12 上海登临科技有限公司 Model quantization method, device, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination