WO2022174733A1 - Neuron acceleration processing method, apparatus, device, and readable storage medium - Google Patents

Neuron acceleration processing method, apparatus, device, and readable storage medium

Info

Publication number
WO2022174733A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
data
bit
calculation
multiplier
Prior art date
Application number
PCT/CN2022/074429
Other languages
English (en)
French (fr)
Inventor
徐天赐
景璐
Original Assignee
山东英信计算机技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东英信计算机技术有限公司
Publication of WO2022174733A1 publication Critical patent/WO2022174733A1/zh


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of deep learning, and in particular, to a neuron acceleration processing method, apparatus, device, and readable storage medium.
  • A Deep Neural Network (DNN) is a kind of artificial neural network that is widely used in image classification, object recognition, behavior recognition, speech recognition, natural language processing, and document classification.
  • In recent years, with the growth of computing power and the development of DNN structures, the recognition accuracy of DNNs has improved greatly, but at the same time DNNs have become deeper and their computational cost has kept increasing.
  • Heterogeneous computing devices such as GPUs, FPGAs, and ASICs are therefore required to accelerate the computation.
  • Neuron computation is the process of multiplying and accumulating feature map data with weight factors, adding a bias, and finally obtaining the output result through a nonlinear transfer function. It is the core computation of a deep neural network and also the computation that consumes the most resources and time, so current DNN acceleration mainly targets neurons.
  • During neuron computation, traditional DNN inference accelerators generally multiply data directly in floating-point format, quantize data to ordinary integer data and perform integer multiplication, or quantize data to low-bit integer data before multiplying.
  • The scheme that multiplies floating-point data applies no model compression, and floating-point arithmetic is computationally inefficient. The scheme that quantizes data to integers before multiplying improves efficiency, but the multiplication remains the most resource-consuming and slowest step in the DNN inference accelerator's computation and easily becomes its performance bottleneck.
  • DNN inference accelerators based on low-bit quantization are a relatively new technique: although the computation speed improves, they are difficult to implement, usually requiring customized low-bit multipliers and low-bit data encodings, which increases the difficulty of the related hardware and software design.
  • the purpose of this application is to provide a neuron acceleration processing method, apparatus, device and readable storage medium, which can reduce the hardware and software cost of implementation while improving the neuron computing speed.
  • a neuron-accelerated processing method comprising:
  • a first weight is added to bits 0 to N-1 in the multiplicand of the 4N-bit multiplier, a second weight is added to bits 2N to 3N-1 in the multiplicand, and the other bits of the multiplicand are set to zero;
  • Batch Norm calculation and quantization calculation are performed on the accumulated value to obtain an output result, and the output result is used as the output feature map of the current neural network layer.
  • the obtaining of two N-bit weights corresponding to the current neural network layer, as the first weight and the second weight includes:
  • iterative accumulation is performed according to the first product value, the second product value and the output results of other multipliers, including:
  • a parallel summation is performed according to the first product value, the second product value, and the output results of the other multipliers, to obtain a partial sum;
  • Iterative accumulation calculation is performed according to the partial sum reduction result to obtain the accumulated value.
  • performing weight coding on the weight value including:
  • a weight encoding of 2N+1 to N is performed on the weight value.
  • Batch Norm calculation and quantization calculation are performed on the accumulated value to obtain an output result, including:
  • a neuron acceleration processing device comprising:
  • the input acquisition unit is used to acquire the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer;
  • a data extraction unit configured to obtain two N-bit feature map data from the input feature map as first data and second data, where N is a positive integer;
  • a weight acquisition unit used for acquiring two N-bit weights corresponding to the current neural network layer, as the first weight and the second weight;
  • a feature map adding unit, configured to add the first data to bits 0 to N-1 in the multiplier operand of the 4N-bit multiplier, add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero;
  • a weight adding unit, configured to add the first weight to bits 0 to N-1 in the multiplicand of the 4N-bit multiplier, add the second weight to bits 2N to 3N-1 of the multiplicand, and set the other bits of the multiplicand to zero;
  • a result obtaining unit configured to obtain the output data of the 4N-bit multiplier, and use bits 0 to 2N-1 in the output data as the first product value generated by the first data and the first weight, 4N to 6N-1 bits as the second product value generated by the second data and the second weight;
  • an accumulation processing unit configured to iteratively accumulate according to the first product value, the second product value and the output results of other multipliers to obtain an accumulated value
  • the result output unit is used to perform Batch Norm calculation and quantization calculation on the accumulated value, obtain an output result, and use the output result as the output feature map of the current neural network layer.
  • the weight obtaining unit includes:
  • a model output subunit used to obtain the weight value corresponding to the current neural network layer output by the PACT algorithm model
  • a weight coding subunit configured to perform weight coding on the weight value to obtain several N-bit coding weights
  • an encoding extraction subunit configured to obtain two weights from the encoding weights as the first weight and the second weight
  • the accumulation processing unit includes:
  • a parallel summation subunit configured to perform a parallel summation calculation according to the first product value, the second product value and the output results of other multipliers, as a partial sum
  • a partial-sum restoration subunit, configured to perform weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
  • an iterative accumulation subunit, configured to perform iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
  • the weight encoding subunit is specifically: a first encoding subunit, configured to: perform weight encoding of 2N+1 to N on the weight value.
  • the result output unit includes:
  • the fusion subunit is used to determine the fusion multiplier and fusion addend of the Batch Norm calculation and the quantization calculation;
  • a calculation subunit, configured to compute the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend, as the output result.
  • a computer device comprising: a memory for storing a computer program and data; and a processor configured to implement the steps of the above-mentioned neuron acceleration processing method when executing the computer program according to the data.
  • In the method provided by the embodiments of this application, after two sets of N-bit feature map data and the corresponding weights are obtained, a single 4N-bit multiplier performs the multiplication of two N-bit low-bit quantized values, which avoids the design and use of dedicated low-bit multipliers and reduces implementation cost. At the same time, one multiplier invocation completes two multiplications at once, which effectively improves computation speed and accelerates neuron processing. Moreover, low-bit data and integer data can reuse the same set of multipliers to achieve variable-precision computation on the same accelerator, which broadens the applicable scenarios of high-bit multipliers and avoids the limitations of dedicated multipliers.
  • the embodiments of the present application also provide a neuron accelerated processing apparatus, device, and readable storage medium corresponding to the above-mentioned neuron accelerated processing method, which have the above technical effects, and are not repeated here.
  • FIG. 1 is an implementation flowchart of a neuron acceleration processing method in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a multiply-add calculation in an embodiment of the application
  • FIG. 3 is a schematic structural diagram of a neuron acceleration processing device in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a computer device in an embodiment of the present application.
  • the core of this application is to provide a neuron acceleration processing method, which can reduce the hardware and software costs of implementation while improving the neuron computing speed.
  • FIG. 1 is a flowchart of a neuron acceleration processing method in an embodiment of the present application. The method includes the following steps:
  • The neuron processing method provided in this embodiment is applied to a neural network and is deployed in a neural network layer.
  • The layer type of that layer (the "current neural network layer" in this step) is not limited: it may be a convolutional layer, a fully connected layer, or any other layer that multiplies and accumulates feature map data with weight data.
  • The neuron acceleration processing method provided in this embodiment can accelerate different types of layers.
  • A neural network includes several layers.
  • The "current neural network layer" refers to the neural network layer to which the neuron acceleration processing method provided in this embodiment is applied; it may be one layer or several layers.
  • The method may be applied to one layer or to several layers, or it may be applied to every layer to improve the processing speed of the overall neural network.
  • The "current neural network layer" can be determined according to the actual acceleration needs of the neural network.
  • The "current neural network layer" is not the first layer, that is, there is a connected previous neural network layer, and the feature map output by that previous layer is obtained as the input feature map of the current neural network layer.
  • S102 obtain two N-bit feature map data from the input feature map, as the first data and the second data;
  • The input feature map of the current neural network layer has a relatively large bit width, and it is common practice to divide it into several low-bit values for multiplication; in this embodiment the high-bit feature map data likewise needs to be divided into low-bit data for computation.
  • The N-bit feature map data in this step are the low-bit (N-bit) values divided from the input feature map, where N is a positive integer. The value of N is not limited in this embodiment; for example, N can take 1, 2, or 4, and can be set according to actual computation and data-division needs.
  • The way the input feature map is divided into two low-bit (N-bit) feature map values is not limited in this step; reference may be made to the related art, and details are not repeated here.
  • The weight values corresponding to the current neural network layer, like the input feature map of the current neural network layer, are also high-bit data, so the high-bit weight data must be converted to obtain two N-bit weights. The way the weight data is processed in this step is not limited: the division of the feature map data in step S102 above can be followed, or a low-bit conversion can be performed according to the weight type actually used.
  • The execution order of this step and step S101 is not limited: they can be executed sequentially, for example step S101 first and then step S103, or simultaneously as shown in FIG. 1; the order can be set according to actual processing needs and is not repeated here.
  • S104: Add the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier (a digital circuit component that multiplies two binary numbers), add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero;
  • For low-bit multiplication, low-bit multipliers are usually used to multiply each low-bit feature map value by its corresponding weight, and a DNN inference accelerator based on a low-bit quantization algorithm usually has to redesign its low-bit multipliers to make full use of the computing resources.
  • Low-bit multipliers are not very general-purpose and are not supported by most hardware platforms, so they usually need to be designed separately, which is difficult; moreover, feature maps of different bit widths require multipliers of different widths, which greatly increases the cost of using multipliers.
  • In view of this, this embodiment uses a 4N-bit multiplier to perform the multiplication of two N-bit low-bit quantized values and can reuse the existing multiplier modules or IP blocks in a GPU, FPGA, or ASIC, avoiding the design and use of dedicated low-bit multipliers and reducing implementation cost. At the same time, one multiplier invocation completes two multiplications at once, which effectively improves computation speed and accelerates neuron processing; and low-bit data and integer data can reuse the same set of multipliers to achieve variable-precision computation on the same accelerator, broadening the applicable scenarios of high-bit multipliers and avoiding the limitations of dedicated multipliers.
  • Bits 0 to N-1 and bits 2N to 3N-1 of the multiplier operand respectively hold the two N-bit feature map values, the first data and the second data, and the other bits are zero.
  • Take a 16-bit (4N) multiplier implementing two 4-bit (N) multiplications as an example: bits 0-3 and 8-11 of the multiplier operand hold the 4-bit first data (e.g., 1101) and second data (e.g., 1001) respectively, bits 4-7 and 12-15 are filled with 0, and one such multiplier operand is shown in Table 1 below.
  • Bits 0 to N-1 and bits 2N to 3N-1 of the multiplicand respectively hold the two N-bit weights, the first weight and the second weight, and the other bits are zero.
  • Continuing the example of a 16-bit (4N) multiplier implementing two 4-bit (N) multiplications: bits 0-3 and 8-11 of the multiplicand hold the 4-bit first weight (e.g., 0001) and second weight (e.g., 0011) respectively, bits 4-7 and 12-15 are filled with 0, and one such multiplicand is shown in Table 2 below.
  • In this embodiment, the feature map data is placed in the multiplier operand and the weight data in the multiplicand; the two can be interchanged, that is, the weight data placed in the multiplier operand and the feature map data in the multiplicand.
  • The execution order of step S104 and step S105 is not limited: they can be executed sequentially, for example step S104 first and then step S105, or simultaneously as shown in FIG. 1; the order can be set according to actual processing needs and is not repeated here.
  • S106: Obtain the output data of the multiplier, take bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight.
  • the output data of the multiplier is obtained. Since one multiplier is applied to two groups of multiplication calculations in this embodiment, the product values of the two groups of multiplications should be obtained respectively in the output data. Specifically, bits 0 to 2N-1 in the output data are used as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 are used as the second product value generated by the second data and the second weight.
  • FIG. 2 is a schematic diagram of such a multiply-add calculation.
  • Bits 0-3 and 8-11 of the multiplicand hold 4-bit weight data 1 (weight1) and 4-bit weight data 2 (weight2) respectively, bits 0-3 and 8-11 of the multiplier operand hold 4-bit feature map data 1 (feature1) and 4-bit feature map data 2 (feature2) respectively, and the other bits of the multiplicand and the multiplier operand are zero.
  • In the 32-bit output, bits 0-7 are the product of weight data 1 and feature map data 1, bits 16-23 are the product of weight data 2 and feature map data 2, and the remaining bits hold useless data.
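  • To make the packing and extraction concrete, the following is a minimal Python sketch of the scheme described above, using the same example values (first data 1101, second data 1001, first weight 0001, second weight 0011). The function names and the overflow check are illustrative additions rather than part of the patent text, and the sketch assumes the cross products w1*f2 + w2*f1 fit in 2N bits (true for the small values used here), since otherwise a carry would corrupt the second product.

```python
def pack_two(lo: int, hi: int, n: int) -> int:
    """Place lo in bits 0..n-1 and hi in bits 2n..3n-1 of a 4n-bit operand."""
    return lo | (hi << (2 * n))

def dual_multiply(f1: int, f2: int, w1: int, w2: int, n: int = 4):
    """One 4N-bit multiplication that yields two N-bit products (unsigned sketch)."""
    # Illustrative guard: the cross terms must not carry into bit 4N,
    # otherwise the second product would be corrupted.
    assert w1 * f2 + w2 * f1 < (1 << (2 * n)), "cross terms overflow the gap bits"
    multiplier   = pack_two(f1, f2, n)   # feature map data in the multiplier operand
    multiplicand = pack_two(w1, w2, n)   # weight data in the multiplicand
    out = multiplier * multiplicand      # a single 4N-bit (here 16-bit) multiplication
    mask = (1 << (2 * n)) - 1
    p1 = out & mask                      # bits 0 .. 2N-1  -> f1 * w1
    p2 = (out >> (4 * n)) & mask         # bits 4N .. 6N-1 -> f2 * w2
    return p1, p2

# Values from the example above: feature data 1101, 1001 and weights 0001, 0011.
p1, p2 = dual_multiply(0b1101, 0b1001, 0b0001, 0b0011)
print(p1, p2)  # 13 (= 13*1) and 27 (= 9*3)
```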
  • the accumulation calculation is performed according to the accumulation rule in the traditional neuron calculation.
  • the implementation process of the accumulation calculation according to the product value obtained by the multiplier in this step may refer to the implementation methods in the related art. This is not limited and will not be repeated here.
  • Batch Norm calculation and quantization calculation are necessary steps in neuron calculation.
  • The specific rules for performing the Batch Norm calculation and the quantization calculation on the accumulated value can follow the relevant existing implementations and are not limited in this embodiment.
  • the output result is used as the output feature map of the current neural network layer and input to the next neural network layer for calculation.
  • All N-bit data needs to be converted into unsigned data before calculation, and the sign bit of the output data is determined by separately judging the sign bits of the multiplier operand and the multiplicand.
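  • The sign-handling logic is not spelled out beyond the sentence above; the following is a minimal sketch of one straightforward way it could be done, assuming signed N-bit inputs: multiply the unsigned magnitudes with the packed scheme and reattach the signs afterwards. `dual_multiply` refers to the hypothetical helper sketched earlier.

```python
def signed_dual_multiply(f1: int, f2: int, w1: int, w2: int, n: int = 4):
    """Convert operands to unsigned magnitudes, multiply, then restore the signs."""
    s1 = -1 if (f1 < 0) != (w1 < 0) else 1   # sign of the first product
    s2 = -1 if (f2 < 0) != (w2 < 0) else 1   # sign of the second product
    p1, p2 = dual_multiply(abs(f1), abs(f2), abs(w1), abs(w2), n)
    return s1 * p1, s2 * p2

print(signed_dual_multiply(-13, 9, 1, -3))  # (-13, -27)
```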
  • In the technical solution provided by this embodiment, a 4N-bit multiplier is used to perform the multiplication of two N-bit low-bit quantized values, which avoids the design and use of dedicated low-bit multipliers and reduces implementation cost. One multiplier invocation completes two multiplications at once, which effectively improves computation speed and accelerates neuron processing; and low-bit data and integer data can reuse the same set of multipliers to achieve variable-precision computation on the same accelerator, broadening the applicable scenarios of high-bit multipliers and avoiding the limitations of dedicated multipliers.
  • the embodiments of the present application also provide corresponding improvement solutions.
  • the same steps or corresponding steps in the above-mentioned embodiments can be referred to each other, and corresponding beneficial effects can also be referred to each other, which will not be repeated in the preferred/improved embodiments herein.
  • the specific implementation manner of obtaining the two N-bit weights corresponding to the current neural network layer as the first weight and the second weight is not limited.
  • the weight conversion method is as follows:
  • Encoding the weight values expands the range that a given bit width can express, mapping large weights to small code values, so that low-bit code values can represent the weights directly and the extra work caused by splitting the weights is avoided.
  • Currently, the weights of a neural network layer are generally obtained through the training process of a PACT quantization algorithm model.
  • This embodiment keeps that weight-generation process so that the traditional neuron computation flow can be reused.
  • The weight data obtained from the PACT quantization algorithm model training process are not consecutive natural numbers; for example, the 4-bit weights produced by PACT training are: -15, -13, -11, -9, -7, -5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15. The applicant found that the weight data are all of the form 2x+1 (x an integer). To avoid the increase in computation that splitting the data would cause, this embodiment proposes to encode the weight data in the PACT-based neuron acceleration processing method.
  • This application uses the code N to represent the weight with value 2N+1, and the encoded weight data are: -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7 (the codes N). The codes are used directly in the multiply-add calculation (Σweight*feature), which keeps the weight data in the multiply-add at 4 bits, while the quantization calculation keeps the feature map data at 4 bits, so the multiply-add in the neuron computation is ultimately a 4-bit multiply-add. The computation performed by the encoded multiplier is then Σ(2*code+1)*feature = Σ2*code*feature + Σfeature.
  • the process of iterative accumulation according to the first product value, the second product value and the output results of other multipliers specifically includes the following steps:
  • The result of the multiply-add calculation undergoes the weight-encoding restoration calculation to obtain the final calculation result.
  • The final output feature map data is obtained only after the quantization calculation and the Batch Norm calculation are also performed.
  • The feature values and the weight data are multiplied in encoded form, and the calculation result is then inverse-transformed to recover the correct final result for the feature values and the original weight data, which preserves computational efficiency while ensuring the data is computed correctly.
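  • As a concrete illustration of the encoding and restoration described above: a PACT-style weight of the form 2x+1 is stored as the code x, the multiply-accumulate runs on the codes, and the true partial sum is recovered afterwards from the identity Σ(2*code+1)*feature = 2*Σ(code*feature) + Σfeature. This is a small sketch with illustrative function names, not an excerpt of the patent's implementation.

```python
def encode_weight(w: int) -> int:
    """Map a PACT weight of the form 2x+1 (odd values such as -15..15) to its code x."""
    assert w % 2 != 0, "PACT weights are expected to be odd (2x+1)"
    return (w - 1) // 2

def restore_partial_sum(code_dot: int, feature_sum: int) -> int:
    """Recover sum(weight*feature) from sum(code*feature) and sum(feature)."""
    return 2 * code_dot + feature_sum

weights  = [-15, -3, 7, 9]
features = [3, 1, 2, 5]
codes = [encode_weight(w) for w in weights]             # [-8, -2, 3, 4], all 4-bit codes
code_dot = sum(c * f for c, f in zip(codes, features))  # multiply-add performed on codes
restored = restore_partial_sum(code_dot, sum(features))
assert restored == sum(w * f for w, f in zip(weights, features))
print(restored)  # 11
```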
  • The specific implementation of performing the Batch Norm calculation and the quantization calculation on the accumulated value is not limited. The purpose of the quantization calculation is to store the feature map data with a more suitable dynamic range, so that the 4-bit feature map data loses less precision.
  • The quantization calculation consists of floating-point division and floating-point multiplication.
  • The Batch Norm calculation consists of floating-point multiplication and floating-point addition. To reduce the amount of inference computation in the deep neural network and further improve neuron processing speed, the Batch Norm calculation and the quantization calculation can be fused.
  • the fusion multiplier and fusion addend of the Batch Norm calculation and the quantization calculation can be determined, and then the multiplication and addition value of the accumulated value, the fusion multiplier and the fusion addend can be calculated as the output result.
  • the process of performing Batch Norm calculation and quantization calculation on the accumulated value specifically needs to perform a quantization calculation, perform a Batch Norm calculation according to the result of the quantization calculation, and then perform a second quantization calculation according to the result of the Batch Norm calculation.
  • For the first quantization calculation, the accumulated value is divided by quantization factor 1 (multiplied by the reciprocal of quantization factor 1); the Batch Norm calculation computes (accumulated value / quantization factor 1) × α + β, where α is the multiplier specified by Batch Norm and β is the addend specified by Batch Norm; the second quantization calculation is then applied to that result: [(accumulated value / quantization factor 1) × α + β] × factor 2.
  • the above-mentioned quantization factor 1 may also be referred to as "factor 1" in the following description.
  • The fusion is a multiply-add calculation in which all values other than the accumulated value are combined: the result of (α × factor 2 / factor 1) is used as the fusion multiplier and the result of β × factor 2 as the fusion addend, so the process of performing the Batch Norm calculation and the quantization calculation on the accumulated value can be converted into a single multiply-add, namely accumulated value × (α × factor 2 / factor 1) + (β × factor 2). Merging the three calculation passes into one multiply-add simplifies the calculation steps and helps further improve computational efficiency.
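  • The following is a minimal numeric sketch of the fusion just described, with made-up factor values: the three passes (divide by factor 1, Batch Norm with α and β, multiply by factor 2) collapse into one multiply-add with a precomputed fusion multiplier and fusion addend.

```python
def fused_bn_quant(acc: float, alpha: float, beta: float,
                   factor1: float, factor2: float) -> float:
    """((acc / factor1) * alpha + beta) * factor2, fused into a single multiply-add."""
    fusion_mult = alpha * factor2 / factor1   # precomputed once per layer
    fusion_add  = beta * factor2
    return acc * fusion_mult + fusion_add

# Arbitrary example values: the fused form matches the step-by-step form.
acc, alpha, beta, f1, f2 = 120.0, 0.5, 1.0, 4.0, 2.0
step_by_step = ((acc / f1) * alpha + beta) * f2
assert abs(step_by_step - fused_bn_quant(acc, alpha, beta, f1, f2)) < 1e-9
print(fused_bn_quant(acc, alpha, beta, f1, f2))  # 32.0
```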
  • the embodiments of the present application further provide a neuron accelerated processing apparatus, and the neuron accelerated processing apparatus described below and the neuron accelerated processing method described above can be referred to each other correspondingly.
  • the device includes the following modules:
  • the input acquisition unit 110 is mainly used to acquire the feature map output by the connected previous neural network layer, as the input feature map of the current neural network layer;
  • the data extraction unit 120 is mainly used to obtain two N-bit feature map data from the input feature map, as the first data and the second data, wherein N is a positive integer;
  • the weight obtaining unit 130 is mainly used to obtain two N-bit weights corresponding to the current neural network layer, as the first weight and the second weight;
  • the feature map adding unit 140 is mainly used to add the first data to bits 0 to N-1 of the multiplier operand of the 4N-bit multiplier, add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero;
  • the weight adding unit 150 is mainly used to add the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, add the second weight to bits 2N to 3N-1 of the multiplicand, and set the other bits of the multiplicand to zero;
  • the result obtaining unit 160 is mainly used to obtain the output data of the multiplier, take bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight;
  • the accumulation processing unit 170 is mainly used for iterative accumulation according to the first product value, the second product value and the output results of other multipliers to obtain the accumulated value;
  • the result output unit 180 is mainly used to perform Batch Norm calculation and quantization calculation on the accumulated value to obtain an output result, and use the output result as the output feature map of the current neural network layer.
  • the weight obtaining unit includes:
  • the model output subunit is used to obtain the weight value corresponding to the current neural network layer output by the PACT algorithm model
  • the weight coding subunit is used for weight coding the weight value to obtain several N-bit coding weights
  • the encoding extraction subunit is used to obtain two weights from the encoding weights as the first weight and the second weight;
  • the accumulation processing unit includes:
  • a parallel summation subunit configured to perform a parallel summation calculation according to the first product value, the second product value and the output results of other multipliers, as a partial sum
  • a partial-sum restoration subunit, configured to perform weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
  • an iterative accumulation subunit, configured to perform iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
  • the weight encoding subunit is specifically: a first encoding subunit, configured to: perform weight encoding of 2N+1 to N on the weight value.
  • the result output unit includes:
  • the fusion subunit is used to determine the fusion multiplier and fusion addend for Batch Norm calculation and quantization calculation;
  • the calculation subunit is used to compute the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend, as the output result.
  • the embodiments of the present application further provide a computer device, and a computer device described below and a neuron acceleration processing method described above may refer to each other correspondingly.
  • the computer equipment includes: a memory for storing a computer program and data; and a processor configured to implement the steps of the neuron acceleration processing method of the above method embodiments when executing the computer program according to the data.
  • FIG. 4 is a schematic diagram of a specific structure of a computer device provided in this embodiment.
  • the computer device may vary considerably depending on configuration and performance, and may include one or more processors (central processing units, CPU) 322 (for example, one or more processors) and a memory 332 that stores one or more computer application programs 342 or data 344.
  • the memory 332 may be short-lived storage or persistent storage.
  • the programs stored in memory 332 may include one or more modules (not shown), each of which may include a series of instructions to operate on a data processing device.
  • the central processing unit 322 may be configured to communicate with the memory 332 to execute a series of instruction operations in the memory 332 on the computer device 301 .
  • Computer device 301 may also include one or more power supplies 326 , one or more wired or wireless network interfaces 350 , one or more input output interfaces 358 , and/or, one or more operating systems 341 .
  • the steps in the neuron accelerated processing method described above can be implemented by the structure of a computer device.
  • the embodiments of the present application further provide a readable storage medium, and a readable storage medium described below and a neuron acceleration processing method described above may refer to each other correspondingly.
  • the readable storage medium may specifically be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or any other readable storage medium capable of storing program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

Provided is a neuron acceleration processing method. After obtaining two sets of N-bit feature map data and the corresponding weights, the method uses a single 4N-bit multiplier to perform the multiplication of two N-bit low-bit quantized values, which avoids the design and use of dedicated low-bit multipliers and reduces implementation cost. At the same time, because one 4N-bit multiplier performs two N-bit multiplications, one multiplier invocation completes two multiplications at once, effectively improving computation speed and accelerating neuron processing. Moreover, low-bit data and integer data can reuse the same set of multipliers to achieve variable-precision computation on the same accelerator, which broadens the applicable scenarios of high-bit multipliers and avoids the limitations of dedicated multipliers. A neuron acceleration processing apparatus, a device, and a readable storage medium with corresponding technical effects are also provided.

Description

Neuron acceleration processing method, apparatus, device, and readable storage medium
This application claims priority to the Chinese patent application No. 202110189202.5, filed on February 19, 2021 and entitled "Neuron acceleration processing method, apparatus, device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of deep learning, and in particular to a neuron acceleration processing method, apparatus, device, and readable storage medium.
Background
A Deep Neural Network (DNN) is a kind of artificial neural network widely used in image classification, object recognition, behavior recognition, speech recognition, natural language processing, and document classification. In recent years, with the growth of computing power and the development of DNN structures, the recognition accuracy of DNNs has improved greatly, but at the same time DNNs have become deeper and computationally heavier, so heterogeneous computing devices such as GPUs, FPGAs, and ASICs are needed to accelerate the computation.
Neuron computation is the process of multiplying and accumulating feature map data with weight factors, adding a bias, and finally obtaining the output result through a nonlinear transfer function. It is the core computation of a deep neural network and also the most resource- and time-consuming one, so current DNN acceleration mainly targets neurons.
During neuron computation, traditional DNN inference accelerators generally multiply data directly in floating-point format, quantize data to ordinary integer data and perform integer multiplication, or quantize data to low-bit integer data before multiplying.
The floating-point scheme applies no model compression and floating-point arithmetic is computationally inefficient. Quantizing data to integers before multiplying improves efficiency, but integer multiplication remains the most resource-consuming and slowest step in a DNN inference accelerator's computation and easily becomes its performance bottleneck. DNN inference accelerators based on low-bit quantization are a relatively new technique: although their computation is faster, they are difficult to implement, usually requiring customized low-bit multipliers and low-bit data encodings, which increases the difficulty of the related hardware and software design.
In summary, how to improve neuron computing speed while reducing the hardware and software cost of implementation is a technical problem that those skilled in the art urgently need to solve.
Summary of the Invention
The purpose of this application is to provide a neuron acceleration processing method, apparatus, device, and readable storage medium that can improve neuron computing speed while reducing the hardware and software cost of implementation.
To solve the above technical problem, this application provides the following technical solutions:
A neuron acceleration processing method, comprising:
obtaining the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer;
obtaining two N-bit feature map data from the input feature map as first data and second data, where N is a positive integer;
obtaining two N-bit weights corresponding to the current neural network layer as a first weight and a second weight;
adding the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier, adding the second data to bits 2N to 3N-1 of the multiplier operand, and setting the other bits of the multiplier operand to zero;
adding the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, adding the second weight to bits 2N to 3N-1 of the multiplicand, and setting the other bits of the multiplicand to zero;
obtaining the output data of the 4N-bit multiplier, taking bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight;
performing iterative accumulation according to the first product value, the second product value, and the output results of other multipliers to obtain an accumulated value;
performing a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result, and using the output result as the output feature map of the current neural network layer.
Optionally, obtaining the two N-bit weights corresponding to the current neural network layer as the first weight and the second weight comprises:
obtaining the weight values corresponding to the current neural network layer output by a PACT algorithm model;
performing weight encoding on the weight values to obtain several N-bit encoded weights;
obtaining two weights from the encoded weights as the first weight and the second weight;
and correspondingly, performing iterative accumulation according to the first product value, the second product value, and the output results of other multipliers comprises:
performing a parallel summation according to the first product value, the second product value, and the output results of other multipliers to obtain a partial sum;
performing weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
performing iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
Optionally, performing weight encoding on the weight values comprises:
performing 2N+1-to-N weight encoding on the weight values.
Optionally, performing a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result comprises:
determining a fusion multiplier and a fusion addend of the Batch Norm calculation and the quantization calculation;
computing the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend as the output result.
A neuron acceleration processing apparatus, comprising:
an input acquisition unit, configured to obtain the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer;
a data extraction unit, configured to obtain two N-bit feature map data from the input feature map as first data and second data, where N is a positive integer;
a weight acquisition unit, configured to obtain two N-bit weights corresponding to the current neural network layer as a first weight and a second weight;
a feature map adding unit, configured to add the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier, add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero;
a weight adding unit, configured to add the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, add the second weight to bits 2N to 3N-1 of the multiplicand, and set the other bits of the multiplicand to zero;
a result acquisition unit, configured to obtain the output data of the 4N-bit multiplier, take bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight;
an accumulation processing unit, configured to perform iterative accumulation according to the first product value, the second product value, and the output results of other multipliers to obtain an accumulated value;
a result output unit, configured to perform a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result, and use the output result as the output feature map of the current neural network layer.
Optionally, the weight acquisition unit comprises:
a model output subunit, configured to obtain the weight values corresponding to the current neural network layer output by a PACT algorithm model;
a weight encoding subunit, configured to perform weight encoding on the weight values to obtain several N-bit encoded weights;
an encoding extraction subunit, configured to obtain two weights from the encoded weights as the first weight and the second weight;
and correspondingly, the accumulation processing unit comprises:
a parallel summation subunit, configured to perform a parallel summation according to the first product value, the second product value, and the output results of other multipliers to obtain a partial sum;
a partial-sum restoration subunit, configured to perform weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
an iterative accumulation subunit, configured to perform iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
Optionally, the weight encoding subunit is specifically a first encoding subunit, configured to perform 2N+1-to-N weight encoding on the weight values.
Optionally, the result output unit comprises:
a fusion subunit, configured to determine a fusion multiplier and a fusion addend of the Batch Norm calculation and the quantization calculation;
a calculation subunit, configured to compute the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend as the output result.
A computer device, comprising:
a memory for storing a computer program and data;
a processor, configured to implement the steps of the above neuron acceleration processing method when executing the computer program according to the data.
A readable storage medium having a computer program stored thereon, the computer program implementing the steps of the above neuron acceleration processing method when executed by a processor.
In the method provided by the embodiments of this application, after two sets of N-bit feature map data and the corresponding weights are obtained, a single 4N-bit multiplier performs the multiplication of two N-bit low-bit quantized values, which avoids the design and use of dedicated low-bit multipliers and reduces implementation cost. At the same time, one multiplier invocation completes two multiplications at once, effectively improving computation speed and accelerating neuron processing. Moreover, low-bit data and integer data can reuse the same set of multipliers to achieve variable-precision computation on the same accelerator, broadening the applicable scenarios of high-bit multipliers and avoiding the limitations of dedicated multipliers.
Correspondingly, the embodiments of this application also provide a neuron acceleration processing apparatus, a device, and a readable storage medium corresponding to the above neuron acceleration processing method, which have the above technical effects and are not described again here.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the related art more clearly, the drawings needed for the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an implementation flowchart of a neuron acceleration processing method in an embodiment of this application;
FIG. 2 is a schematic diagram of a multiply-add calculation in an embodiment of this application;
FIG. 3 is a schematic structural diagram of a neuron acceleration processing apparatus in an embodiment of this application;
FIG. 4 is a schematic structural diagram of a computer device in an embodiment of this application.
Detailed Description
The core of this application is to provide a neuron acceleration processing method that can improve neuron computing speed while reducing the hardware and software cost of implementation.
To enable those skilled in the art to better understand the solutions of this application, this application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Referring to FIG. 1, FIG. 1 is a flowchart of a neuron acceleration processing method in an embodiment of this application. The method includes the following steps:
S101: Obtain the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer.
The neuron processing method provided in this embodiment is applied to a neural network and deployed in a neural network layer. The layer type of that layer (the "current neural network layer" in this step) is not limited: it may be a convolutional layer, a fully connected layer, or any other layer in which feature map data and weight data are multiplied and accumulated; the neuron acceleration processing method provided in this embodiment can accelerate different types of layers. A neural network includes several layers, and the "current neural network layer" in this step refers to the layer to which the method of this embodiment is applied, which may be one layer or several layers: the method may be applied to one layer or several layers, or to every layer so as to increase the processing speed of the whole network. This is not limited in this embodiment, and the "current neural network layer" can be determined according to the actual acceleration needs of the network.
It should be noted that the "current neural network layer" in this embodiment is not the first layer, i.e., there is a connected previous neural network layer, and the feature map output by that previous layer is obtained as the input feature map of the current neural network layer.
S102: Obtain two N-bit feature map data from the input feature map as the first data and the second data.
The input feature map of the current neural network layer has a relatively large bit width, and it is common practice to divide it into several low-bit values for multiplication; in this embodiment the high-bit feature map data likewise needs to be divided into low-bit data for computation. The N-bit feature map data in this step are the low-bit (N-bit) values divided from the input feature map, where N is a positive integer. The value of N is not limited in this embodiment; for example, N may take 1, 2, or 4, and can be set according to actual computation and data-division needs. In addition, the way the input feature map is divided into two low-bit (N-bit) feature map values is not limited in this step; reference may be made to the related art, and details are not repeated here.
S103: Obtain two N-bit weights corresponding to the current neural network layer as the first weight and the second weight.
The weight values corresponding to the current neural network layer, like its input feature map, are also high-bit data, so the high-bit weight data must be converted to obtain two N-bit weights. The way the weight data is processed in this step is not limited: the division of the feature map data in step S102 above can be followed, or a low-bit conversion can be performed according to the weight type actually used.
It should be noted that the execution order of this step and step S101 is not limited: they can be executed sequentially, for example step S101 first and then step S103, or simultaneously as shown in FIG. 1; the order can be set according to actual processing needs and is not described again here.
S104: Add the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier (a digital circuit component that multiplies two binary numbers), add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero.
For low-bit multiplication, low-bit multipliers are usually used to multiply each low-bit feature map value by its corresponding weight, and a DNN inference accelerator based on a low-bit quantization algorithm usually has to redesign its low-bit multipliers to make full use of the computing resources. Low-bit multipliers are not very general-purpose and are not supported by most hardware platforms, so they typically have to be designed separately, which is difficult; moreover, feature maps of different bit widths require multipliers of different widths, which greatly increases the cost of using multipliers.
In view of this, this embodiment uses one 4N-bit multiplier to perform the multiplication of two N-bit low-bit quantized values and can reuse existing multiplier modules or IP blocks in a GPU, FPGA, or ASIC, avoiding the design and use of dedicated low-bit multipliers and reducing implementation cost. At the same time, one multiplier invocation completes two multiplications at once, effectively improving computation speed and accelerating neuron processing; and low-bit data and integer data can reuse the same set of multipliers to achieve variable-precision computation on the same accelerator, broadening the applicable scenarios of high-bit multipliers and avoiding the limitations of dedicated multipliers.
Specifically, in this embodiment bits 0 to N-1 and bits 2N to 3N-1 of the multiplier operand hold the two N-bit feature map values, the first data and the second data, and the other bits are zero. Take a 16-bit (4N) multiplier implementing two 4-bit (N) multiplications as an example: bits 0-3 and 8-11 of the multiplier operand hold the 4-bit first data (e.g., 1101) and second data (e.g., 1001), and bits 4-7 and 12-15 are filled with 0. One such multiplier operand is shown in Table 1 below:
1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0
Table 1
S105: Add the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, add the second weight to bits 2N to 3N-1 of the multiplicand, and set the other bits of the multiplicand to zero.
Bits 0 to N-1 and bits 2N to 3N-1 of the multiplicand hold the two N-bit weights, the first weight and the second weight, and the other bits are zero. Continuing the example of a 16-bit (4N) multiplier implementing two 4-bit (N) multiplications: bits 0-3 and 8-11 of the multiplicand hold the 4-bit first weight (e.g., 0001) and second weight (e.g., 0011), and bits 4-7 and 12-15 are filled with 0. One such multiplicand is shown in Table 2 below:
0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0
Table 2
It should be noted that in this embodiment the feature map data is placed in the multiplier operand and the weight data in the multiplicand; the two can be interchanged, i.e., the weight data placed in the multiplier operand and the feature map data in the multiplicand, and the latter case can be implemented by analogy with this embodiment, so it is not described again. In addition, the execution order of step S104 and step S105 is not limited in this embodiment: they can be executed sequentially, for example step S104 first and then step S105, or simultaneously as shown in FIG. 1; the order can be set according to actual processing needs and is not described again here.
S106: Obtain the output data of the multiplier, take bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight.
After the multiplier is started with the filling rules above, its output data is obtained. Since one multiplier is used for two multiplications in this embodiment, the two product values must be extracted from the output data separately: bits 0 to 2N-1 of the output data are taken as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight.
To deepen the understanding of the above steps, take a 16-bit (4N) multiplier implementing two 4-bit (N) multiplications as an example; FIG. 2 is a schematic diagram of such a multiply-add calculation. Bits 0-3 and 8-11 of the multiplicand hold 4-bit weight data 1 (weight1) and 4-bit weight data 2 (weight2), bits 0-3 and 8-11 of the multiplier operand hold 4-bit feature map data 1 (feature1) and 4-bit feature map data 2 (feature2), and the other bits of the multiplicand and the multiplier operand are zero. In the 32-bit output of the 16-bit multiplier, bits 0-7 are the product of weight data 1 and feature map data 1, bits 16-23 are the product of weight data 2 and feature map data 2, and the remaining bits hold useless data.
S107: Perform iterative accumulation according to the first product value, the second product value, and the output results of other multipliers to obtain an accumulated value.
After the two product values are obtained, they are combined with the product values output by the other multipliers used in the current neural network layer to compute feature map data and weights, so that all product values of the current layer are obtained, and accumulation is then performed according to the accumulation rule of traditional neuron computation. It should be noted that how the accumulation is implemented from the product values obtained by the multipliers can follow the related art; it is not limited in this embodiment and is not described again here.
S108: Perform a Batch Norm (Batch Normalization, an algorithm commonly used in deep networks to accelerate neural network training, speed up convergence, and improve stability) calculation and a quantization calculation on the accumulated value to obtain an output result, and use the output result as the output feature map of the current neural network layer.
Besides the accumulation in the above steps, the Batch Norm calculation and the quantization calculation are necessary steps of neuron computation. The specific rules for performing them on the accumulated value can follow the relevant existing implementations and are not limited in this embodiment.
After the Batch Norm calculation and the quantization calculation, the output result is used as the output feature map of the current neural network layer and fed to the next neural network layer for computation.
It should be noted that in this embodiment all N-bit data must be converted to unsigned data before computation, and the sign bit of the output data is determined by separately examining the sign bits of the multiplier operand and the multiplicand.
Based on the above, in the technical solution provided by this embodiment, after two sets of N-bit feature map data and the corresponding weights are obtained, a single 4N-bit multiplier performs the multiplication of two N-bit low-bit quantized values, which avoids the design and use of dedicated low-bit multipliers and reduces implementation cost. One multiplier invocation completes two multiplications at once, effectively improving computation speed and accelerating neuron processing; and low-bit data and integer data can reuse the same set of multipliers for variable-precision computation on the same accelerator, broadening the applicable scenarios of high-bit multipliers and avoiding the limitations of dedicated multipliers.
It should be noted that, based on the above embodiment, the embodiments of this application further provide corresponding improvements. The same or corresponding steps in the preferred/improved embodiments and in the above embodiment can be referred to each other, as can the corresponding beneficial effects, and they are not repeated one by one in the preferred/improved embodiments herein.
The above embodiment does not limit how the two N-bit weights corresponding to the current neural network layer are obtained as the first weight and the second weight. This embodiment describes a low-bit weight conversion approach suitable for common weight-acquisition methods, as follows:
(1) Obtain the weight values corresponding to the current neural network layer output by a PACT algorithm (PArameterized Clipping Activation for Quantized Neural Networks, a low-bit quantization algorithm) model, where the weight values conform to the 2N+1 representation;
(2) Perform weight encoding on the weight values to obtain several N-bit encoded weights;
(3) Obtain two weights from the encoded weights as the first weight and the second weight.
Encoding the weight values expands the range that a given bit width can express, mapping large weights to small code values, so that the low-bit code values can represent the weights directly and the extra work caused by splitting the weights is avoided.
This embodiment does not limit the specific implementation of the weight encoding. At present, the weights of a neural network layer are generally obtained through the training process of a PACT quantization algorithm model, and this embodiment keeps that weight-generation process so that the traditional neuron computation flow can be reused. The weight data obtained from PACT training are not consecutive natural numbers; for example, the 4-bit weights produced by PACT training are: -15, -13, -11, -9, -7, -5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15. The applicant found that these weights are all of the form 2x+1 (x an integer). To avoid the extra computation that splitting the data would cause, this embodiment proposes to encode the weight data in the PACT-based neuron acceleration processing method. This application uses the code N to represent the weight with value 2N+1; the encoded weight data are -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7 (the codes N). The codes are used directly in the multiply-add calculation (Σweight*feature), which keeps the weight data in the multiply-add at 4 bits, while the quantization calculation keeps the feature map data at 4 bits, so the multiply-add in the neuron computation is ultimately a 4-bit multiply-add. The computation performed by the encoded multiplier is then: Σ(2*code+1)*feature = Σ2*code*feature + Σfeature.
Correspondingly, the process of performing iterative accumulation according to the first product value, the second product value, and the output results of other multipliers specifically includes the following steps:
(1) performing a parallel summation according to the first product value, the second product value, and the output results of other multipliers to obtain a partial sum;
(2) performing weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
(3) performing iterative accumulation according to the partial-sum restoration result to obtain the accumulated value. After the current parallel summation of the first product value, the second product value, and the outputs of the other multipliers, and before the next accumulation iteration, the result of this iteration's parallel summation, i.e., the partial sum, undergoes weight-encoding restoration to obtain the partial-sum restoration result; over multiple iterations, accumulation is performed on the partial-sum restoration results to obtain the accumulated value. This approach avoids performing the encoding restoration repeatedly.
The result of the multiply-add calculation undergoes the weight-encoding restoration calculation to obtain the final calculation result; the quantization calculation and the Batch Norm calculation must then be performed to obtain the final output feature map data.
In this embodiment the feature values and the weight data are multiplied in encoded form, and the result is then inverse-transformed to recover the correct final result for the feature values and the original weight data, which maintains computational efficiency while ensuring the data is computed correctly.
The above embodiment does not limit the specific implementation of performing the Batch Norm calculation and the quantization calculation on the accumulated value. The purpose of the quantization calculation is to store the feature map data with a more suitable dynamic range so that the 4-bit feature map data loses little precision. The quantization calculation consists of floating-point division and floating-point multiplication, and the Batch Norm calculation consists of floating-point multiplication and floating-point addition. To reduce the amount of inference computation of the deep neural network and further improve neuron processing speed, the Batch Norm calculation and the quantization calculation can be fused into one multiplication and one addition. Specifically, the fusion multiplier and fusion addend of the Batch Norm calculation and the quantization calculation can be determined, and the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend is then computed as the output result.
Performing the Batch Norm calculation and the quantization calculation on the accumulated value specifically requires a first quantization calculation, a Batch Norm calculation on the result of that quantization, and a second quantization calculation on the Batch Norm result. In the first quantization calculation, the accumulated value is divided by quantization factor 1 (multiplied by the reciprocal of quantization factor 1); the Batch Norm calculation computes (accumulated value / quantization factor 1) × α + β, where α is the multiplier specified by Batch Norm and β is the addend specified by Batch Norm; the second quantization calculation is then applied to that result: [(accumulated value / quantization factor 1) × α + β] × factor 2. Quantization factor 1 is also referred to simply as "factor 1" below.
In this embodiment the two algorithms (the Batch Norm calculation and the quantization calculation) and the three calculation passes (a first quantization calculation, a Batch Norm calculation on its result, and a second quantization calculation on the Batch Norm result) are fused into one multiply-add calculation: the values other than the accumulated value are combined, the result of (α × factor 2 / factor 1) is used as the fusion multiplier and the result of β × factor 2 as the fusion addend, so the whole process of performing the Batch Norm calculation and the quantization calculation on the accumulated value reduces to one multiply-add, namely accumulated value × (α × factor 2 / factor 1) + (β × factor 2). Merging the three passes into a single multiply-add simplifies the calculation steps and helps further improve computational efficiency.
It should be noted that the above embodiment only shows the core neuron computation in the accelerator's processing flow, which mainly includes the multiply-add calculation, the restoration (inverse) calculation, the Batch Norm and quantization calculations, and other deep learning inference steps; the other computation processes can follow the related art and are not described again here.
Corresponding to the above method embodiment, an embodiment of this application further provides a neuron acceleration processing apparatus. The neuron acceleration processing apparatus described below and the neuron acceleration processing method described above can be referred to each other correspondingly.
Referring to FIG. 3, the apparatus includes the following modules:
an input acquisition unit 110, mainly used to obtain the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer;
a data extraction unit 120, mainly used to obtain two N-bit feature map data from the input feature map as first data and second data, where N is a positive integer;
a weight acquisition unit 130, mainly used to obtain two N-bit weights corresponding to the current neural network layer as a first weight and a second weight;
a feature map adding unit 140, mainly used to add the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier, add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero;
a weight adding unit 150, mainly used to add the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, add the second weight to bits 2N to 3N-1 of the multiplicand, and set the other bits of the multiplicand to zero;
a result acquisition unit 160, mainly used to obtain the output data of the multiplier, take bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight;
an accumulation processing unit 170, mainly used to perform iterative accumulation according to the first product value, the second product value, and the output results of other multipliers to obtain an accumulated value;
a result output unit 180, mainly used to perform a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result, and use the output result as the output feature map of the current neural network layer.
In a specific implementation of this application, the weight acquisition unit includes:
a model output subunit, configured to obtain the weight values corresponding to the current neural network layer output by a PACT algorithm model;
a weight encoding subunit, configured to perform weight encoding on the weight values to obtain several N-bit encoded weights;
an encoding extraction subunit, configured to obtain two weights from the encoded weights as the first weight and the second weight;
and correspondingly, the accumulation processing unit includes:
a parallel summation subunit, configured to perform a parallel summation according to the first product value, the second product value, and the output results of other multipliers to obtain a partial sum;
a partial-sum restoration subunit, configured to perform weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
an iterative accumulation subunit, configured to perform iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
In a specific implementation of this application, the weight encoding subunit is specifically a first encoding subunit, configured to perform 2N+1-to-N weight encoding on the weight values.
In a specific implementation of this application, the result output unit includes:
a fusion subunit, configured to determine the fusion multiplier and fusion addend of the Batch Norm calculation and the quantization calculation;
a calculation subunit, configured to compute the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend as the output result.
Corresponding to the above method embodiment, an embodiment of this application further provides a computer device. The computer device described below and the neuron acceleration processing method described above can be referred to each other correspondingly.
The computer device includes:
a memory for storing a computer program and data;
a processor, configured to implement the steps of the neuron acceleration processing method of the above method embodiments when executing the computer program according to the data.
Specifically, referring to FIG. 4, which is a schematic diagram of a specific structure of a computer device provided in this embodiment, the computer device may vary considerably depending on configuration and performance, and may include one or more processors (central processing units, CPU) 322 (for example, one or more processors) and a memory 332 that stores one or more computer application programs 342 or data 344. The memory 332 may be transient storage or persistent storage. The programs stored in the memory 332 may include one or more modules (not shown), and each module may include a series of instruction operations on the data processing device. Further, the central processing unit 322 may be configured to communicate with the memory 332 and execute the series of instruction operations in the memory 332 on the computer device 301.
The computer device 301 may further include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The steps of the neuron acceleration processing method described above can be implemented by the structure of the computer device.
Corresponding to the above method embodiment, an embodiment of this application further provides a readable storage medium. The readable storage medium described below and the neuron acceleration processing method described above can be referred to each other correspondingly.
A readable storage medium having a computer program stored thereon, the computer program implementing the steps of the neuron acceleration processing method of the above method embodiments when executed by a processor.
The readable storage medium may specifically be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or any other readable storage medium capable of storing program code.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

Claims (10)

  1. A neuron acceleration processing method, characterized by comprising:
    obtaining the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer;
    obtaining two N-bit feature map data from the input feature map as first data and second data, where N is a positive integer;
    obtaining two N-bit weights corresponding to the current neural network layer as a first weight and a second weight;
    adding the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier, adding the second data to bits 2N to 3N-1 of the multiplier operand, and setting the other bits of the multiplier operand to zero;
    adding the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, adding the second weight to bits 2N to 3N-1 of the multiplicand, and setting the other bits of the multiplicand to zero;
    obtaining the output data of the 4N-bit multiplier, taking bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight;
    performing iterative accumulation according to the first product value, the second product value, and the output results of other multipliers to obtain an accumulated value;
    performing a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result, and using the output result as the output feature map of the current neural network layer.
  2. The neuron acceleration processing method according to claim 1, characterized in that obtaining the two N-bit weights corresponding to the current neural network layer as the first weight and the second weight comprises:
    obtaining the weight values corresponding to the current neural network layer output by a PACT algorithm model;
    performing weight encoding on the weight values to obtain several N-bit encoded weights;
    obtaining two weights from the encoded weights as the first weight and the second weight;
    and correspondingly, performing iterative accumulation according to the first product value, the second product value, and the output results of other multipliers comprises:
    performing a parallel summation according to the first product value, the second product value, and the output results of other multipliers to obtain a partial sum;
    performing weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
    performing iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
  3. The neuron acceleration processing method according to claim 2, characterized in that performing weight encoding on the weight values comprises:
    performing 2N+1-to-N weight encoding on the weight values.
  4. The neuron acceleration processing method according to claim 1, characterized in that performing a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result comprises:
    determining a fusion multiplier and a fusion addend of the Batch Norm calculation and the quantization calculation;
    computing the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend as the output result.
  5. A neuron acceleration processing apparatus, characterized by comprising:
    an input acquisition unit, configured to obtain the feature map output by the connected previous neural network layer as the input feature map of the current neural network layer;
    a data extraction unit, configured to obtain two N-bit feature map data from the input feature map as first data and second data, where N is a positive integer;
    a weight acquisition unit, configured to obtain two N-bit weights corresponding to the current neural network layer as a first weight and a second weight;
    a feature map adding unit, configured to add the first data to bits 0 to N-1 of the multiplier operand of a 4N-bit multiplier, add the second data to bits 2N to 3N-1 of the multiplier operand, and set the other bits of the multiplier operand to zero;
    a weight adding unit, configured to add the first weight to bits 0 to N-1 of the multiplicand of the 4N-bit multiplier, add the second weight to bits 2N to 3N-1 of the multiplicand, and set the other bits of the multiplicand to zero;
    a result acquisition unit, configured to obtain the output data of the 4N-bit multiplier, take bits 0 to 2N-1 of the output data as the first product value generated by the first data and the first weight, and bits 4N to 6N-1 as the second product value generated by the second data and the second weight;
    an accumulation processing unit, configured to perform iterative accumulation according to the first product value, the second product value, and the output results of other multipliers to obtain an accumulated value;
    a result output unit, configured to perform a Batch Norm calculation and a quantization calculation on the accumulated value to obtain an output result, and use the output result as the output feature map of the current neural network layer.
  6. The neuron acceleration processing apparatus according to claim 5, characterized in that the weight acquisition unit comprises:
    a model output subunit, configured to obtain the weight values corresponding to the current neural network layer output by a PACT algorithm model;
    a weight encoding subunit, configured to perform weight encoding on the weight values to obtain several N-bit encoded weights;
    an encoding extraction subunit, configured to obtain two weights from the encoded weights as the first weight and the second weight;
    and correspondingly, the accumulation processing unit comprises:
    a parallel summation subunit, configured to perform a parallel summation according to the first product value, the second product value, and the output results of other multipliers to obtain a partial sum;
    a partial-sum restoration subunit, configured to perform weight-encoding restoration on the partial sum to obtain a partial-sum restoration result;
    an iterative accumulation subunit, configured to perform iterative accumulation according to the partial-sum restoration result to obtain the accumulated value.
  7. The neuron acceleration processing apparatus according to claim 6, characterized in that the weight encoding subunit is specifically a first encoding subunit, configured to perform 2N+1-to-N weight encoding on the weight values.
  8. The neuron acceleration processing apparatus according to claim 5, characterized in that the result output unit comprises:
    a fusion subunit, configured to determine the fusion multiplier and fusion addend of the Batch Norm calculation and the quantization calculation;
    a calculation subunit, configured to compute the multiply-add value of the accumulated value with the fusion multiplier and the fusion addend as the output result.
  9. A computer device, characterized by comprising:
    a memory for storing a computer program and data;
    a processor, configured to implement the steps of the neuron acceleration processing method according to any one of claims 1 to 4 when executing the computer program according to the data.
  10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the neuron acceleration processing method according to any one of claims 1 to 4.
PCT/CN2022/074429 2021-02-19 2022-01-27 Neuron acceleration processing method, apparatus, device, and readable storage medium WO2022174733A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110189202.5 2021-02-19
CN202110189202.5A CN112906863B (zh) 2021-02-19 2021-02-19 Neuron acceleration processing method, apparatus, device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022174733A1 true WO2022174733A1 (zh) 2022-08-25

Family

ID=76123804

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074429 WO2022174733A1 (zh) 2021-02-19 2022-01-27 一种神经元加速处理方法、装置、设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN112906863B (zh)
WO (1) WO2022174733A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906863B (zh) * 2021-02-19 2023-04-07 山东英信计算机技术有限公司 一种神经元加速处理方法、装置、设备及可读存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909970A (zh) * 2017-01-12 2017-06-30 南京大学 一种基于近似计算的二值权重卷积神经网络硬件加速器计算模块
CN107862374A (zh) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 基于流水线的神经网络处理系统和处理方法
CN110428047A (zh) * 2018-05-01 2019-11-08 半导体组件工业公司 神经网络系统以及用于实施神经网络的加速器
US20200042881A1 (en) * 2018-08-01 2020-02-06 Nanjing Iluvatar CoreX Technology Co., Ltd. (DBA "Iluvatar CoreX Inc. Nanjing") Methods and Apparatus of Core Compute Units in Artificial Intelligent Devices
CN111199275A (zh) * 2018-11-20 2020-05-26 上海登临科技有限公司 用于神经网络的片上系统
CN111684473A (zh) * 2018-01-31 2020-09-18 亚马逊技术股份有限公司 提高神经网络阵列的性能
CN111758106A (zh) * 2018-03-30 2020-10-09 国际商业机器公司 大规模并行神经推理计算元件
CN112906863A (zh) * 2021-02-19 2021-06-04 山东英信计算机技术有限公司 一种神经元加速处理方法、装置、设备及可读存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273090B (zh) * 2017-05-05 2020-07-31 中国科学院计算技术研究所 面向神经网络处理器的近似浮点乘法器及浮点数乘法
EP3657399A1 (en) * 2017-05-23 2020-05-27 Shanghai Cambricon Information Technology Co., Ltd Weight pruning and quantization method for a neural network and accelerating device therefor
CN108921292B (zh) * 2018-05-02 2021-11-30 东南大学 面向深度神经网络加速器应用的近似计算系统
CN110659014B (zh) * 2018-06-29 2022-01-14 赛灵思公司 乘法器及神经网络计算平台
US11321606B2 (en) * 2019-01-15 2022-05-03 BigStream Solutions, Inc. Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator
CN111475135B (zh) * 2019-01-23 2023-06-16 阿里巴巴集团控股有限公司 一种乘法器
CN110766155A (zh) * 2019-09-27 2020-02-07 东南大学 一种基于混合精度存储的深度神经网络加速器
CN111966327A (zh) * 2020-08-07 2020-11-20 南方科技大学 基于nas搜索的混合精度时空复用乘法器及其控制方法

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909970A (zh) * 2017-01-12 2017-06-30 南京大学 一种基于近似计算的二值权重卷积神经网络硬件加速器计算模块
CN107862374A (zh) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 基于流水线的神经网络处理系统和处理方法
CN111684473A (zh) * 2018-01-31 2020-09-18 亚马逊技术股份有限公司 提高神经网络阵列的性能
CN111758106A (zh) * 2018-03-30 2020-10-09 国际商业机器公司 大规模并行神经推理计算元件
CN110428047A (zh) * 2018-05-01 2019-11-08 半导体组件工业公司 神经网络系统以及用于实施神经网络的加速器
US20200042881A1 (en) * 2018-08-01 2020-02-06 Nanjing Iluvatar CoreX Technology Co., Ltd. (DBA "Iluvatar CoreX Inc. Nanjing") Methods and Apparatus of Core Compute Units in Artificial Intelligent Devices
CN111199275A (zh) * 2018-11-20 2020-05-26 上海登临科技有限公司 用于神经网络的片上系统
CN112906863A (zh) * 2021-02-19 2021-06-04 山东英信计算机技术有限公司 一种神经元加速处理方法、装置、设备及可读存储介质

Also Published As

Publication number Publication date
CN112906863B (zh) 2023-04-07
CN112906863A (zh) 2021-06-04

Similar Documents

Publication Publication Date Title
CN107451658B (zh) 浮点运算定点化方法及系统
CN106990937B (zh) 一种浮点数处理装置和处理方法
WO2020057162A1 (zh) 一种卷积神经网络加速器
CN110852434B (zh) 基于低精度浮点数的cnn量化方法、前向计算方法及硬件装置
Meng et al. Efficient winograd convolution via integer arithmetic
CN109284761B (zh) 一种图像特征提取方法、装置、设备及可读存储介质
CN113508402A (zh) 从量化的固件神经网络层得出一致的软件神经网络层
CN108196822A (zh) 一种双精度浮点开方运算的方法及系统
WO2022174733A1 (zh) 一种神经元加速处理方法、装置、设备及可读存储介质
Jiang et al. A low-latency LSTM accelerator using balanced sparsity based on FPGA
CN113608718A (zh) 一种实现素数域大整数模乘计算加速的方法
TW202013261A (zh) 算數框架系統及操作浮點至定點算數框架的方法
CN110825346B (zh) 一种低逻辑复杂度的无符号近似乘法器
Tang et al. A high-accuracy hardware-efficient multiply–accumulate (mac) unit based on dual-mode truncation error compensation for cnns
KR20230076641A (ko) 부동-소수점 연산을 위한 장치 및 방법
Asim et al. Centered Symmetric Quantization for Hardware-Efficient Low-Bit Neural Networks.
CN113313253A (zh) 神经网络压缩方法、数据处理方法、装置及计算机设备
CN110738311A (zh) 基于高层次综合的lstm网络加速方法
Murillo et al. PLAM: A Posit Logarithm-Approximate Multiplier for Power Efficient Posit-based DNNs
CN115857873B (zh) 乘法器、乘法计算方法、处理系统及存储介质
CN116151340B (zh) 并行随机计算神经网络系统及其硬件压缩方法、系统
US20230110383A1 (en) Floating-point logarithmic number system scaling system for machine learning
CN115965048A (zh) 数据处理装置、数据处理方法和电子设备
CN117908835B (zh) 一种基于浮点数计算能力加速sm2国密算法的方法
CN117436370B (zh) 面向流体力学网格生成的超定矩阵方程并行方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22755506

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22755506

Country of ref document: EP

Kind code of ref document: A1