WO2019085379A1 - 深度学习softmax分类器的硬件实现电路及其控制方法 - Google Patents

深度学习softmax分类器的硬件实现电路及其控制方法 Download PDF

Info

Publication number
WO2019085379A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
data
calculation module
operation result
calculation
Prior art date
Application number
PCT/CN2018/080608
Other languages
English (en)
French (fr)
Inventor
张玉
康君龙
谢东亮
Original Assignee
北京深鉴智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京深鉴智能科技有限公司
Publication of WO2019085379A1 publication Critical patent/WO2019085379A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487Multiplying; Dividing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/556Logarithmic or exponential functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the invention relates to an artificial neural network, and more particularly to a hardware implementation circuit of a deep learning softmax classifier and a control method thereof.
  • Deep learning derives from research on artificial neural networks (ANNs) and is a representation-learning method within machine learning.
  • A multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, thereby discovering distributed feature representations of the data.
  • Deep learning is a new field in machine learning research. Its motivation is to build and simulate neural networks that analyze and learn the way the human brain does; it mimics the mechanisms of the human brain to interpret data such as images, sounds, and text.
  • Deep learning and traditional neural networks are alike in some respects and different in others. Both adopt a similar layered structure: the system is a multi-layer network composed of an input layer, hidden layers, and an output layer; nodes in adjacent layers are connected, while nodes in the same layer or across non-adjacent layers are not; each layer can be viewed as a logistic regression model.
  • This hierarchical structure is relatively close to the structure of the human brain.
  • The difference lies in the training mechanism. A traditional neural network uses back propagation: in short, an iterative algorithm trains the whole network; initial values are set randomly, the current network output is computed, and the parameters of the preceding layers are adjusted according to the difference between the current output and the label value, until convergence.
  • Deep learning, by contrast, uses a strategy of layer-by-layer training followed by overall tuning.
  • Softmax has a wide range of applications in deep learning. Logistic regression handles two-class problems, while softmax regression mainly solves multi-class problems.
  • Softmax is the generalization of logistic regression to multiple classes, i.e. the class label y takes on two or more values.
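To make the multi-class generalization concrete, here is a minimal numeric sketch of the softmax function (illustrative only and not part of the patent text; subtracting the maximum before exponentiating is a standard numerical-stability trick added here as an assumption):

```python
import math

def softmax(scores):
    """Map real-valued class scores to a probability distribution.

    Subtracting the maximum score before exponentiating avoids overflow
    without changing the mathematical result.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# the probabilities sum to 1, and the largest score gets the largest probability
```

For k = 2 this reduces to logistic regression, matching the statement that softmax generalizes the two-class case.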
  • A hardware implementation circuit of a softmax classifier comprises: an interface data read control module, for reading computation data from an external memory into the exponential calculation module; an exponential calculation module, for performing the exponential operation on floating-point elements in parallel; an addition tree module, for accumulating the operation results of the exponential calculation module; a cache module, for buffering the operation results of the exponential calculation module and the accumulation result of the addition tree module; a division calculation module, for computing in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and an interface data write control module, for writing the results of the division calculation module into the external memory.
  • The computational parallelism of the exponential calculation module and the division calculation module may depend on the data bandwidth of the module interface, as shown in the following formula:
  • IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq
  • where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
  • The computational parallelism of the exponential calculation module and the division calculation module may be 4: the exponential calculation module may include 4 exponential calculation units, the division calculation module may include 4 division calculation units, and the addition tree module may include 3 floating-point addition units arranged in 2 levels.
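The bandwidth-matching constraint can be checked with a one-line calculation. Using the figures of the preferred embodiment (128-bit IO, floating-point units, IO and calculation units at the same frequency; the 32-bit single-precision width is our assumption, since the patent only says "floating point"), solving for Calc_num gives the stated parallelism of 4:

```python
def calc_parallelism(io_data_width, io_freq, calc_data_width, calc_freq):
    """Solve IO_data_width * IO_freq = Calc_num * Calc_data_width * Calc_freq
    for Calc_num, the number of parallel calculation units needed to keep
    up with the interface bandwidth."""
    return (io_data_width * io_freq) / (calc_data_width * calc_freq)

# 128-bit IO bus, 32-bit floating-point units, both sides on the same clock
n = calc_parallelism(io_data_width=128, io_freq=1, calc_data_width=32, calc_freq=1)
# n == 4.0: four exponential units and four division units
```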
  • The cache module may include an exponential-result buffer and an accumulation-result buffer, both of which adopt a first-in first-out (FIFO) structure.
  • A control method for the hardware implementation circuit of a softmax classifier comprises: the interface data read control module reads the data to be computed from the external memory; the data enters the exponential calculation module in parallel for the exponential operation on floating-point elements; the operation results of the exponential calculation module are accumulated in the addition tree module; the operation results of the exponential calculation module and the accumulation result of the addition tree module are buffered by the cache module; by reading the cache module, the division calculation module computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and the results of the division calculation module are written to the external memory via the interface data write control module.
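The sequence of stages in the control method can be summarized as a behavioral software model (illustrative only; function and variable names are hypothetical, and the real circuit performs the exponential and division stages in parallel hardware):

```python
import math

def softmax_circuit_model(data):
    """Behavioral model of the control flow:
    read -> exponential -> addition tree -> cache -> division -> write."""
    read_buf = list(data)                        # interface data read (DMA)
    exp_buf = [math.exp(x) for x in read_buf]    # exponential calculation module
    total = sum(exp_buf)                         # addition tree accumulation
    cached = (exp_buf, total)                    # cache module (FIFO in hardware)
    exps, s = cached
    result = [e / s for e in exps]               # division calculation module
    return result                                # interface data write

out = softmax_circuit_model([0.0, math.log(2.0), math.log(5.0)])
# exponentials are roughly [1, 2, 5], so the ratios are close to [0.125, 0.25, 0.625]
```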
  • The computational parallelism of the exponential calculation module and the division calculation module may depend on the data bandwidth of the module interface, as shown in the following formula:
  • IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq
  • where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
  • The computational parallelism of the exponential calculation module and the division calculation module may be 4: the exponential calculation module may include 4 exponential calculation units, the division calculation module may include 4 division calculation units, and the addition tree module may include 3 floating-point addition units arranged in 2 levels.
  • The cache module includes an exponential-result buffer and an accumulation-result buffer, both of which adopt a first-in first-out (FIFO) structure.
  • A computer readable medium records instructions executable by a processor; when executed, the instructions cause the processor to carry out the control method of the hardware implementation circuit of the softmax classifier.
  • The operations include the following: the interface data read control module reads the data to be computed from the external memory; the data enters the exponential calculation module in parallel for the exponential operation on floating-point elements; the operation results of the exponential calculation module are accumulated in the addition tree module; the operation results of the exponential calculation module and the accumulation result of the addition tree module are buffered by the cache module; by reading the cache module, the division calculation module computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and the results of the division calculation module are written to the external memory via the interface data write control module.
  • The hardware implementation circuit of the deep learning softmax classifier according to the present invention can perform softmax classification efficiently.
  • The parallelism of the circuit is based on the algorithm requirements and also on the module's port bandwidth.
  • In a heterogeneous embedded system, implementing the softmax module with a dedicated circuit architecture improves computational efficiency, reduces operation latency, and facilitates the rapid deployment of deep learning.
  • FIG. 1 is a schematic block diagram of a hardware implementation circuit of a deep learning softmax classifier in accordance with the present invention;
  • FIG. 2 is a flow chart of a control method of a hardware implementation circuit of a deep learning softmax classifier according to the present invention;
  • FIG. 3 is a schematic diagram of a preferred embodiment of a hardware implementation circuit of a deep learning softmax classifier in accordance with the present invention.
  • After Moore's law was proposed in 1965, transistor density grew at a rate that roughly doubled every year. Compared with the previous generation, each new chip generation could raise the frequency by 50% while the process node shrank by a factor of 0.3 and the power density doubled. After the mid-2000s, as manufacturing processes advanced, leakage current became a prominent problem, and it became difficult to keep raising the frequency. Multi-core processors emerged to achieve high performance without increasing the frequency.
  • With the rapid development of the Internet, application software demands have diversified, and merely increasing processor parallelism no longer suffices, so dedicated circuits emerged. Because different application software behaves differently, the dedicated circuit architectures also differ.
  • Image processing workloads are suited to running on a GPU.
  • Voice signal processing usually runs on a DSP.
  • Heavily control-oriented workloads are suited to running on a CPU.
  • Video encoding and decoding is suited to running on dedicated hard cores. The continuing deployment of artificial intelligence has driven the rapid development of heterogeneous computing systems.
  • In a heterogeneous embedded system, implementing the softmax module with a dedicated circuit architecture increases computational efficiency and reduces operation latency, which favors the rapid deployment of deep learning. It is an object of the present invention to provide a hardware implementation circuit for a deep learning softmax classifier.
  • The parallelism design of the circuit depends not only on the algorithm requirements but also on the module's port bandwidth.
  • To achieve the above object, the present invention provides a hardware implementation circuit of a softmax classifier.
  • FIG. 1 is a schematic block diagram of a hardware implementation circuit of a deep learning softmax classifier in accordance with the present invention.
  • As shown in FIG. 1, the hardware implementation circuit 100 of the deep learning softmax classifier according to the present invention may include the following modules.
  • Interface data read control module 110: this module has a direct memory access (DMA) read function and reads the computation data from the external memory into the subsequent exponential calculation module 120.
  • Exponential calculation module 120: computes the exponential of each floating-point element.
  • The parallelism of the calculation depends on the data bandwidth of the module interface, as shown in the following formula:
  • IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq
  • where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
  • Addition tree module 130: accumulates the operation results of the exponential calculation module 120. The number of accumulations depends on the dimension of the input array, which is passed in through the control module.
  • Cache module 140: buffers the calculation results of the exponential calculation module 120 and the addition tree module 130.
  • The cache module may include an exponential-result buffer and an accumulation-result buffer, both of which adopt a first-in first-out (FIFO) structure.
  • For example, the execution period of the exponential calculation module 120 matches that of the division calculation module 150 described below: while the exponential calculation module 120 computes the exponentials of the current array elements, the division calculation module 150 performs the divisions for the previous array, forming a pipeline.
  • Division calculation module 150: computes the ratio of each element's exponential to the sum of the exponentials of all elements. The parallelism of this module depends on the data bandwidth of the interface and matches the parallelism of the exponential calculation module.
  • Interface data write control module 160: writes the calculation results of the division calculation module 150 into the designated external memory. It also provides back-pressure to the upstream modules: when the downstream write operation is slow, back-pressure is asserted upstream.
  • The data processing of the invention adopts a pipeline design, and the cache module adopts a ping-pong cache structure with two internal states: state 1 operates on the ping buffer, and state 2 operates on the pong buffer. The specific execution steps are as follows.
  • Step 1: The calculation control module first receives the instruction that starts the circuit; the instruction contains the read address of the input array, the write address for writing back the results, the array length, and the number of executions.
  • Step 2: The interface data read control module reads the data to be computed from the external memory according to the instruction of step 1.
  • Step 3: The data enters the exponential calculation module in parallel; the module performs the element-wise exponential operation at a concurrency matching the interface. One copy of the results goes to the addition tree module, and one copy is written into the ping buffer of the cache module.
  • Step 4: The addition tree module completes the accumulation of the exponential results and buffers both the intermediate and the final accumulated values.
  • Step 5: Entering state 2, the division calculation module reads the data buffered in the ping buffer in step 3 and divides it by the final result of step 4.
  • Step 6: The division results are written, via the interface data write control module, into the external memory specified by the instruction.
  • More generally, the ping-pong cache structure described above can be regarded as a kind of first-in first-out (FIFO) structure.
  • Because both buffers adopt the FIFO structure, the results of the exponential and accumulation operations are stored into the cache on the one hand, while on the other hand the earliest stored results can be taken out of the cache to perform the division operation.
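The alternation between the two buffers can be sketched as a software model in which one buffer is drained by the division stage while the other is filled by the exponential stage (an illustrative sketch with hypothetical names; the patent describes a hardware pipeline, not this code):

```python
import math

def softmax_pingpong(arrays):
    """Pipelined softmax over a stream of arrays using a ping-pong buffer.

    Stage 1 (exponential units + adder tree) fills one buffer while
    stage 2 (division units) drains the other, mimicking the two-state
    cache of the circuit.
    """
    buf = [None, None]                         # buf[0] = ping, buf[1] = pong
    results = []
    for i, arr in enumerate(arrays):
        cur = i % 2                            # buffer filled in this state
        exps = [math.exp(x) for x in arr]      # exponential units
        buf[cur] = (exps, sum(exps))           # adder-tree total cached with exps
        prev = 1 - cur
        if buf[prev] is not None:              # division stage drains the other buffer
            e, total = buf[prev]
            results.append([v / total for v in e])
            buf[prev] = None
    for b in buf:                              # flush whatever is still buffered
        if b is not None:
            e, total = b
            results.append([v / total for v in e])
    return results

out = softmax_pingpong([[0.0, 0.0], [1.0, 1.0], [0.0, math.log(3.0)]])
```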
  • Based on the above description, the control method of the hardware implementation circuit of the deep learning softmax classifier according to the present invention can be further summarized as follows.
  • FIG. 2 is a flow chart of a method of controlling a hardware implementation circuit of a deep learning softmax classifier in accordance with the present invention.
  • The control method 200 of the hardware implementation circuit of the deep learning softmax classifier according to the present invention begins at step S210, in which the interface data read control module 110 reads the data to be computed from the external memory.
  • Next, in step S220, the data enters the exponential calculation module 120 in parallel for the exponential operation on floating-point elements.
  • In step S230, the operation results of the exponential calculation module 120 are accumulated in the addition tree module 130.
  • Then, in step S240, the operation results of the exponential calculation module 120 and the accumulation result of the addition tree module 130 are buffered by the cache module 140.
  • The cache module 140 may include an exponential-result buffer and an accumulation-result buffer.
  • In step S250, by reading the cache module 140, the division calculation module 150 computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements.
  • Preferably, both the exponential-result buffer and the accumulation-result buffer in the cache module 140 adopt a FIFO structure.
  • Finally, in step S260, the calculation results of the division calculation module 150 are written to the external memory via the interface data write control module 160. The method 200 then ends.
  • In the above steps, as described in connection with the structure of the circuit 100, the computational parallelism of the exponential calculation module 120 and the division calculation module 150 depends on the data bandwidth of the module interface, as shown in the following formula:
  • IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq
  • where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
  • FIG. 3 is a schematic diagram of a preferred embodiment of a hardware implementation circuit of a deep learning softmax classifier in accordance with the present invention.
  • As shown in FIG. 3, in this preferred embodiment, the IO port bit width is 128 bits, the calculation units are floating-point units, and the IO and the calculation units run at the same frequency, so the parallelism of the calculation module is Calc_num = IO_data_width / Calc_data_width = 128 / 32 = 4.
  • Accordingly, the exponential calculation module requires four exponential calculation units (exp), the division calculation module requires four division calculation units (div), and the addition tree module requires three floating-point addition units (add) in two levels plus one accumulator (acc) unit. Before the division, an exponential-result buffer (exp buffer) and an accumulation-result buffer (sum buffer) hold the operands.
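The two-level tree of three adders feeding an accumulator can be modeled behaviorally as follows (a sketch under the assumption of a 4-lane datapath; zero-padding of the last beat is our own convention, not stated in the patent):

```python
def adder_tree_4(lane):
    """Two-level tree of three adders reducing four inputs to one."""
    a = lane[0] + lane[1]   # level-1 adder
    b = lane[2] + lane[3]   # level-1 adder
    return a + b            # level-2 adder

def accumulate(values, lanes=4):
    """Feed an array through the 4-wide tree, folding partial sums
    with the accumulator (acc) unit."""
    # pad with zeros so the last beat still fills all four lanes
    padded = values + [0.0] * (-len(values) % lanes)
    acc = 0.0
    for i in range(0, len(padded), lanes):
        acc += adder_tree_4(padded[i:i + lanes])
    return acc

total = accumulate([1.0, 2.0, 3.0, 4.0, 5.0])
# total == 15.0: beat 1 reduces 1+2+3+4, beat 2 adds 5 (padded with zeros)
```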
  • The programs can be stored on and provided to a computer or processor using various types of non-transitory computer readable media, which include various types of tangible storage media.
  • Examples of non-transitory computer readable media include magnetic recording media (such as floppy disks, magnetic tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R, CD-R/W, and semiconductor memory (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, and RAM (random access memory)).
  • Further, these programs can be provided to a computer using various types of transitory computer readable media.
  • Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can provide a program to a computer via a wired communication path, such as a wire or an optical fiber, or via a wireless communication path.
  • Therefore, according to the present invention, a computer program or a computer readable medium recording instructions executable by a processor may also be proposed; when executed by a processor, the instructions cause the processor to carry out the control method of the hardware implementation circuit of the softmax classifier.
  • The method includes the following operations: the interface data read control module reads the data to be computed from the external memory; the data enters the exponential calculation module in parallel for the exponential operation on floating-point elements; the operation results of the exponential calculation module are accumulated in the addition tree module; the operation results of the exponential calculation module and the accumulation result of the addition tree module are buffered by the cache module; by reading the cache module, the division calculation module computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and the results of the division calculation module are written to the external memory via the interface data write control module.

Abstract

The present disclosure provides a hardware implementation circuit of a deep learning softmax classifier and a control method thereof. The hardware implementation circuit (100) includes: an interface data read control module (110) for reading computation data from an external memory into an exponential calculation module (120); an exponential calculation module (120) for performing the exponential operation on floating-point elements in parallel; an addition tree module (130) for accumulating the operation results of the exponential calculation module (120); a cache module (140) for buffering the operation results of the exponential calculation module (120) and the accumulation result of the addition tree module (130); a division calculation module (150) for computing in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and an interface data write control module (160) for writing the results of the division calculation module (150) into the external memory.

Description

Hardware Implementation Circuit of a Deep Learning Softmax Classifier and Control Method Thereof

Technical Field

The present invention relates to artificial neural networks, and more particularly to a hardware implementation circuit of a deep learning softmax classifier and a control method thereof.
Background

The concept of deep learning originates from research on artificial neural networks (ANNs); it is a representation-learning method within machine learning. A multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, thereby discovering distributed feature representations of the data.

Deep learning is a new field in machine learning research. Its motivation is to build and simulate neural networks that analyze and learn the way the human brain does; it mimics the mechanisms of the human brain to interpret data such as images, sounds, and text.

Deep learning and traditional neural networks are alike in some respects and different in others. Both adopt a similar layered structure: the system is a multi-layer network composed of an input layer, hidden layers, and an output layer; nodes in adjacent layers are connected, while nodes in the same layer or across non-adjacent layers are not; each layer can be viewed as a logistic regression model. This hierarchical structure is relatively close to the structure of the human brain. The difference lies in the training mechanism: a traditional neural network uses back propagation, i.e. an iterative algorithm trains the whole network; initial values are set randomly, the current network output is computed, and the parameters of the preceding layers are adjusted according to the difference between the current output and the label value, until convergence. Deep learning, by contrast, trains layer by layer and then tunes the network as a whole.

Softmax has a wide range of applications in deep learning. Logistic regression handles two-class problems, while softmax regression mainly solves multi-class problems.
Softmax is the generalization of logistic regression to multiple classes, i.e. the class label y takes values greater than or equal to 2. Suppose there are m training samples {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}. For softmax regression, the input features are

    x^(i) ∈ R^(n+1),

and the class labels are y^(i) ∈ {0, 1, ..., k}. The hypothesis function estimates, for each sample, the probability P(y = j | x) of belonging to each class; concretely, the hypothesis function is

    h_θ(x^(i)) = [ P(y^(i) = 1 | x^(i); θ), ..., P(y^(i) = k | x^(i); θ) ]^T
               = (1 / Σ_{j=1}^{k} e^{θ_j^T x^(i)}) · [ e^{θ_1^T x^(i)}, ..., e^{θ_k^T x^(i)} ]^T,

where θ denotes the parameter vectors θ_1, θ_2, ..., θ_k. The probability that each sample belongs to class j is then estimated as

    P(y^(i) = j | x^(i); θ) = e^{θ_j^T x^(i)} / Σ_{l=1}^{k} e^{θ_l^T x^(i)}.
Summary of the Invention

The object of the present invention is to provide a hardware implementation circuit of a deep learning softmax classifier and a control method thereof.

According to a first aspect of the present invention, a hardware implementation circuit of a softmax classifier is provided. The hardware implementation circuit may include: an interface data read control module for reading computation data from an external memory into an exponential calculation module; an exponential calculation module for performing the exponential operation on floating-point elements in parallel; an addition tree module for accumulating the operation results of the exponential calculation module; a cache module for buffering the operation results of the exponential calculation module and the accumulation result of the addition tree module; a division calculation module for computing in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and an interface data write control module for writing the results of the division calculation module into the external memory.
In the hardware implementation circuit according to the first aspect of the present invention, the computational parallelism of the exponential calculation module and the division calculation module may depend on the data bandwidth of the module interface, as shown in the following formula:

IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq,

where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
In the hardware implementation circuit according to the first aspect of the present invention, the computational parallelism of the exponential calculation module and the division calculation module may be 4; the exponential calculation module may include 4 exponential calculation units, the division calculation module may include 4 division calculation units, and the addition tree module may include 3 floating-point addition units arranged in 2 levels.

In the hardware implementation circuit according to the first aspect of the present invention, the cache module may include an exponential-result buffer and an accumulation-result buffer, both of which adopt a first-in first-out (FIFO) structure.
According to a second aspect of the present invention, a control method for the hardware implementation circuit of a softmax classifier is provided, including: the interface data read control module reads the data to be computed from the external memory; the data enters the exponential calculation module in parallel for the exponential operation on floating-point elements; the operation results of the exponential calculation module are accumulated in the addition tree module; the operation results of the exponential calculation module and the accumulation result of the addition tree module are buffered by the cache module; by reading the cache module, the division calculation module computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and the results of the division calculation module are written to the external memory via the interface data write control module.

In the control method according to the second aspect of the present invention, the computational parallelism of the exponential calculation module and the division calculation module may depend on the data bandwidth of the module interface, as shown in the following formula:

IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq,

where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.

In the control method according to the second aspect of the present invention, the computational parallelism of the exponential calculation module and the division calculation module may be 4; the exponential calculation module may include 4 exponential calculation units, the division calculation module may include 4 division calculation units, and the addition tree module may include 3 floating-point addition units arranged in 2 levels.

In the control method according to the second aspect of the present invention, the cache module includes an exponential-result buffer and an accumulation-result buffer, both of which adopt a first-in first-out (FIFO) structure.

According to a third aspect of the present invention, a computer readable medium is provided for recording instructions executable by a processor; when executed by a processor, the instructions cause the processor to carry out the control method of the hardware implementation circuit of the softmax classifier, including the following operations: the interface data read control module reads the data to be computed from the external memory; the data enters the exponential calculation module in parallel for the exponential operation on floating-point elements; the operation results of the exponential calculation module are accumulated in the addition tree module; the operation results of the exponential calculation module and the accumulation result of the addition tree module are buffered by the cache module; by reading the cache module, the division calculation module computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and the results of the division calculation module are written to the external memory via the interface data write control module.
The hardware implementation circuit of the deep learning softmax classifier according to the present invention can perform softmax classification efficiently. The parallelism of the circuit is based on the algorithm requirements and also depends on the module's port bandwidth. In a heterogeneous embedded system, implementing the softmax module with a dedicated circuit architecture improves computational efficiency, reduces operation latency, and facilitates the rapid deployment of deep learning.
Brief Description of the Drawings

The present invention is described below with reference to the drawings and in connection with embodiments. In the drawings:

FIG. 1 is a schematic block diagram of a hardware implementation circuit of a deep learning softmax classifier according to the present invention;

FIG. 2 is a flow chart of a control method of a hardware implementation circuit of a deep learning softmax classifier according to the present invention;

FIG. 3 is a schematic diagram of a preferred embodiment of a hardware implementation circuit of a deep learning softmax classifier according to the present invention.
Detailed Description

The drawings are for illustration only and should not be construed as limiting the present invention. The technical solution of the present invention is further described below with reference to the drawings and embodiments.

After Moore's law was proposed in 1965, transistor density grew at a rate that roughly doubled every year. Compared with the previous generation, each new chip generation could raise the frequency by 50% while the process node shrank by a factor of 0.3 and the power density doubled. After the mid-2000s, as manufacturing processes advanced, leakage current became a prominent problem, and it became difficult to keep raising the frequency. Multi-core processors emerged to achieve high performance without increasing the frequency.

With the rapid development of the Internet, application software demands have diversified, and merely increasing processor parallelism no longer suffices, so dedicated circuits emerged. Because different application software behaves differently, the dedicated circuit architectures also differ. Image processing workloads are suited to GPUs; voice signal processing usually runs on DSPs; heavily control-oriented workloads are suited to CPUs; video encoding and decoding is suited to dedicated hard cores. The continuing deployment of artificial intelligence has driven the rapid development of heterogeneous computing systems.

In a heterogeneous embedded system, implementing the softmax module with a dedicated circuit architecture increases computational efficiency and reduces operation latency, which favors the rapid deployment of deep learning. The object of the present invention is to provide a hardware implementation circuit for a deep learning softmax classifier. The parallelism design of the circuit depends not only on the algorithm requirements but also on the module's port bandwidth.
To achieve the above object, the present invention provides a hardware implementation circuit of a softmax classifier. FIG. 1 is a schematic block diagram of a hardware implementation circuit of a deep learning softmax classifier according to the present invention.

As shown in FIG. 1, the hardware implementation circuit 100 of the deep learning softmax classifier according to the present invention may include the following modules.

Interface data read control module 110: this module has a direct memory access (DMA) read function and reads the computation data from the external memory into the subsequent exponential calculation module 120.

Exponential calculation module 120: computes the exponential of each floating-point element. The parallelism of the calculation depends on the data bandwidth of the module interface, as shown in the following formula:

IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq

where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.

Addition tree module 130: accumulates the operation results of the exponential calculation module 120. The number of accumulations depends on the dimension of the input array, which is passed in through the control module.

Cache module 140: buffers the calculation results of the exponential calculation module 120 and the addition tree module 130. The cache module may include an exponential-result buffer and an accumulation-result buffer, both of which adopt a first-in first-out (FIFO) structure. For example, the execution period of the exponential calculation module 120 matches that of the division calculation module 150 described below: while the exponential calculation module 120 computes the exponentials of the current array elements, the division calculation module 150 performs the divisions for the previous array, forming a pipeline.

Division calculation module 150: computes the ratio of each element's exponential to the sum of the exponentials of all elements. The parallelism of this module depends on the data bandwidth of the interface and matches the parallelism of the exponential calculation module.

Interface data write control module 160: writes the calculation results of the division calculation module 150 into the designated external memory. It also provides back-pressure to the upstream modules: when the downstream write operation is slow, back-pressure is asserted upstream.
The data processing of the present invention adopts a pipeline design, and the cache module adopts a ping-pong cache structure with two internal states: state 1 operates on the ping buffer, and state 2 operates on the pong buffer. The specific execution steps are as follows:

Step 1: The calculation control module first receives the instruction that starts the circuit; the instruction contains the read address of the input array, the write address for writing back the results, the array length, and the number of executions.

Step 2: The interface data read control module reads the data to be computed from the external memory according to the instruction of step 1.

Step 3: The data enters the exponential calculation module in parallel; the module performs the element-wise exponential operation at a concurrency matching the interface. One copy of the results goes to the addition tree module, and one copy is written into the ping buffer of the cache module.

Step 4: The addition tree module completes the accumulation of the exponential results and buffers both the intermediate and the final accumulated values.

Step 5: Entering state 2, the division calculation module reads the data buffered in the ping buffer in step 3 and divides it by the final result of step 4.

Step 6: The division results are written, via the interface data write control module, into the external memory specified by the instruction.

More generally, the ping-pong cache structure described above can be regarded as a kind of first-in first-out (FIFO) structure. Both the exponential-result buffer and the accumulation-result buffer adopt the FIFO structure, so that the results of the exponential and accumulation operations are stored into the cache on the one hand, while on the other hand the earliest stored results can be taken out of the cache to perform the division operation.
Based on the above description, the control method of the hardware implementation circuit of the deep learning softmax classifier according to the present invention can be further summarized as follows.

FIG. 2 is a flow chart of a control method of a hardware implementation circuit of a deep learning softmax classifier according to the present invention.

As shown in FIG. 2, the control method 200 of the hardware implementation circuit of the deep learning softmax classifier according to the present invention begins at step S210, in which the interface data read control module 110 reads the data to be computed from the external memory.

Next, in step S220, the data enters the exponential calculation module 120 in parallel for the exponential operation on floating-point elements.

In step S230, the operation results of the exponential calculation module 120 are accumulated in the addition tree module 130.

Then, in step S240, the operation results of the exponential calculation module 120 and the accumulation result of the addition tree module 130 are buffered by the cache module 140. The cache module 140 may include an exponential-result buffer and an accumulation-result buffer.

In step S250, by reading the cache module 140, the division calculation module 150 computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements. Preferably, both the exponential-result buffer and the accumulation-result buffer in the cache module 140 adopt a FIFO structure.

Finally, in step S260, the calculation results of the division calculation module 150 are written to the external memory via the interface data write control module 160. The method 200 then ends.

In the above steps, as described in connection with the structure of the circuit 100, the computational parallelism of the exponential calculation module 120 and the division calculation module 150 depends on the data bandwidth of the module interface, as shown in the following formula:

IO_data_width × IO_freq = Calc_num × Calc_data_width × Calc_freq,

where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
FIG. 3 is a schematic diagram of a preferred embodiment of a hardware implementation circuit of a deep learning softmax classifier according to the present invention.

As shown in FIG. 3, in this preferred embodiment, the IO port bit width is 128 bits, the calculation units are floating-point units, and the IO and the calculation units run at the same frequency, so the parallelism of the calculation module is:

Calc_num = IO_data_width / Calc_data_width = 128 / 32 = 4

As computed above and as shown in FIG. 3, the exponential calculation module requires four exponential calculation units (exp), the division calculation module requires four division calculation units (div), and the addition tree module requires three floating-point addition units (add) in two levels plus one accumulator (acc) unit. Before the division, an exponential-result buffer (exp buffer) and an accumulation-result buffer (sum buffer) hold the operands.
One of ordinary skill in the art will recognize that the method of the present invention may be implemented as a computer program. As described above in connection with FIG. 2, the method according to the above embodiments may execute one or more programs, including instructions that cause a computer or processor to perform the algorithms described in connection with the drawings. These programs can be stored on and provided to a computer or processor using various types of non-transitory computer readable media. Non-transitory computer readable media include various types of tangible storage media. Examples include magnetic recording media (such as floppy disks, magnetic tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R, CD-R/W, and semiconductor memory (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, and RAM (random access memory)). Further, these programs can be provided to a computer using various types of transitory computer readable media, examples of which include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can provide a program to a computer via a wired communication path, such as a wire or an optical fiber, or via a wireless communication path.

Therefore, according to the present invention, a computer program or a computer readable medium may also be proposed for recording instructions executable by a processor; when executed by a processor, the instructions cause the processor to carry out the control method of the hardware implementation circuit of the softmax classifier, including the following operations: the interface data read control module reads the data to be computed from the external memory; the data enters the exponential calculation module in parallel for the exponential operation on floating-point elements; the operation results of the exponential calculation module are accumulated in the addition tree module; the operation results of the exponential calculation module and the accumulation result of the addition tree module are buffered by the cache module; by reading the cache module, the division calculation module computes in parallel the ratio of each floating-point element's exponential result to the sum of the exponential results of all floating-point elements; and the results of the division calculation module are written to the external memory via the interface data write control module.
Various embodiments and implementations of the present invention have been described above. However, the spirit and scope of the present invention are not limited thereto. Those skilled in the art will be able to make further applications according to the teachings of the present invention, and such applications all fall within the scope of the present invention.

That is, the above embodiments of the present invention are merely examples for clearly illustrating the present invention and do not limit its implementations. Those of ordinary skill in the art can make changes or variations in other forms on the basis of the above description; it is neither necessary nor possible to enumerate all implementations here. Any modification, replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (11)

  1. 一种softmax分类器的硬件实现电路,包括:
    接口数据读控制模块,用于从外部存储器中读取计算数据给指数计算模块;
    指数计算模块,用于并行地进行浮点元素的指数运算;
    加法树模块,用于进行指数计算模块的运算结果的累加运算;
    缓存模块,用于缓存指数计算模块的运算结果以及加法树模块的累加运算结果;
    除法计算模块,用于并行地计算各个浮点元素的指数运算结果与所有浮点元素指数运算结果之和的比值;
    接口数据写控制模块,用于将除法计算模块的计算结果写入外部存储器中。
  2. 根据权利要求1所述的硬件实现电路,其中,所述指数计算模块和所述除法计算模块的计算并行度取决于模块接口的数据带宽,如下公式所示:
    IO_data_width×IO_freq=Calc_num×Calc_data_width×Calc_freq,
    其中IO_data_width是IO数据位宽,IO_freq是IO接口数据频率,Calc_num是计算模块的并行度,Calc_data_width是每一个计算单元支持的数据位宽,Calc_freq是计算模块的运行频率。
  3. 根据权利要求1或2所述的硬件实现电路,其中,所述指数计算模块和所述除法计算模块的计算并行度为4,所述指数计算模块包括4个指数计算单元,所述除法计算模块包括4个除法计算单元,所述加法树模块包括2级共3个浮点加法计算单元。
  4. 根据权利要求1所述的硬件实现电路,其中,所述缓存模块包括指数运算结果缓存和累加运算结果缓存。
  5. 根据权利要求4所述的硬件实现电路,其中,所述指数运算结果缓存和所述累加运算结果缓存都采用先进先出(FIFO)结构。
  6. A control method for a hardware implementation circuit of a softmax classifier, comprising:
    reading, by an interface data read control module, data to be calculated from an external memory;
    the data entering an exponent calculation module in parallel, where exponent operations are performed on floating-point elements;
    accumulating the operation results of the exponent calculation module in an adder tree module;
    buffering, by a buffer module, the operation results of the exponent calculation module and the accumulation operation result of the adder tree module;
    calculating in parallel, by a division calculation module reading the buffer module, the ratio of the exponent operation result of each floating-point element to the sum of the exponent operation results of all floating-point elements; and
    writing the calculation results of the division calculation module into an external memory module via an interface data write control module.
  7. The control method according to claim 6, wherein the calculation parallelism of the exponent calculation module and the division calculation module depends on the data bandwidth of the module interface, as shown in the following formula:
    IO_data_width×IO_freq=Calc_num×Calc_data_width×Calc_freq,
    where IO_data_width is the IO data bit width, IO_freq is the IO interface data frequency, Calc_num is the parallelism of the calculation module, Calc_data_width is the data bit width supported by each calculation unit, and Calc_freq is the operating frequency of the calculation module.
  8. The control method according to claim 6 or 7, wherein the calculation parallelism of the exponent calculation module and the division calculation module is 4, the exponent calculation module comprises 4 exponent calculation units, the division calculation module comprises 4 division calculation units, and the adder tree module comprises 3 floating-point adder units arranged in 2 stages.
  9. The control method according to claim 6, wherein the buffer module comprises an exponent operation result buffer and an accumulation operation result buffer.
  10. The control method according to claim 9, wherein both the exponent operation result buffer and the accumulation operation result buffer adopt a first-in-first-out (FIFO) structure.
  11. A computer-readable medium for recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a control method for a hardware implementation circuit of a softmax classifier, comprising the following operations:
    reading, by an interface data read control module, data to be calculated from an external memory;
    the data entering an exponent calculation module in parallel, where exponent operations are performed on floating-point elements;
    accumulating the operation results of the exponent calculation module in an adder tree module;
    buffering, by a buffer module, the operation results of the exponent calculation module and the accumulation operation result of the adder tree module;
    calculating in parallel, by a division calculation module reading the buffer module, the ratio of the exponent operation result of each floating-point element to the sum of the exponent operation results of all floating-point elements; and
    writing the calculation results of the division calculation module into an external memory module via an interface data write control module.
PCT/CN2018/080608 2017-10-30 2018-03-27 Hardware implementation circuit of a deep learning softmax classifier and control method therefor WO2019085379A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711039589.6A CN109726809B (zh) 2017-10-30 2017-10-30 Hardware implementation circuit of a deep learning softmax classifier and control method therefor
CN201711039589.6 2017-10-30

Publications (1)

Publication Number Publication Date
WO2019085379A1 true WO2019085379A1 (zh) 2019-05-09

Family

ID=66292834

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/080608 WO2019085379A1 (zh) 2017-10-30 2018-03-27 Hardware implementation circuit of a deep learning softmax classifier and control method therefor

Country Status (2)

Country Link
CN (1) CN109726809B (zh)
WO (1) WO2019085379A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728365B (zh) * 2019-09-12 2022-04-01 Southeast University Method for selecting the calculation bit width of a multi-bit-width PE array and calculation precision control circuit
CN112036561B (zh) * 2020-09-30 2024-01-19 Beijing Baidu Netcom Science and Technology Co., Ltd. Data processing method and apparatus, electronic device, and storage medium
CN112685693B (zh) * 2020-12-31 2022-08-02 Electric Power Research Institute of China Southern Power Grid Co., Ltd. Device for implementing a Softmax function

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106919980A * 2017-01-24 2017-07-04 Nanjing University Incremental object recognition system based on ganglion differentiation
US20170206405A1 * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN107229942A * 2017-04-16 2017-10-03 Beijing University of Technology Fast classification method for convolutional neural networks based on multiple classifiers
CN107301453A * 2016-04-15 2017-10-27 Beijing Zhongke Cambricon Technology Co., Ltd. Artificial neural network forward operation apparatus and method supporting discrete data representation

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN106485319B * 2015-10-08 2019-02-12 Shanghai Zhaoxin Integrated Circuit Co., Ltd. Neural network unit with neural processing units dynamically configurable to process multiple data sizes
US10891540B2 * 2015-12-18 2021-01-12 National Technology & Engineering Solutions Of Sandia, Llc Adaptive neural network management system
CN106228238B * 2016-07-27 2019-03-22 Suzhou Institute of University of Science and Technology of China Method and system for accelerating deep learning algorithms on a field-programmable gate array platform


Also Published As

Publication number Publication date
CN109726809B (zh) 2020-12-08
CN109726809A (zh) 2019-05-07

Similar Documents

Publication Publication Date Title
Xie et al. Weight-sharing neural architecture search: A battle to shrink the optimization gap
WO2019085379A1 (zh) Hardware implementation circuit of a deep learning softmax classifier and control method therefor
CN111416743B (zh) 一种卷积网络加速器、配置方法及计算机可读存储介质
CN110070181A (zh) 一种用于边缘计算设备的深度学习的优化方法
WO2021057722A1 (zh) 用多核处理器实现神经网络模型拆分方法及相关产品
JP7366274B2 (ja) ニューラル・ネットワークのための適応的探索方法および装置
Miao et al. HET: scaling out huge embedding model training via cache-enabled distributed framework
CN113051216B (zh) 一种基于FPGA加速的MobileNet-SSD目标检测装置及方法
Mishra et al. Fine-grained accelerators for sparse machine learning workloads
TW202134861A (zh) 交錯記憶體請求以加速記憶體存取
WO2021259098A1 (zh) 一种基于卷积神经网络的加速系统、方法及存储介质
CN116401502B (zh) 一种基于NUMA系统特性优化Winograd卷积的方法及装置
WO2022227962A1 (zh) 一种数据处理方法及装置
Hu et al. What can knowledge bring to machine learning?—a survey of low-shot learning for structured data
WO2022160579A1 (zh) 一种基于深度神经网络的信息处理系统
WO2021243489A1 (zh) 一种神经网络的数据处理方法及装置
US10990525B2 (en) Caching data in artificial neural network computations
US11494624B2 (en) Accelerating neuron computations in artificial neural networks with dual sparsity
CN112200310A (zh) 智能处理器、数据处理方法及存储介质
WO2023011237A1 (zh) 业务处理
WO2021238289A1 (zh) 序列处理的方法与装置
Zhang et al. PCGraph: Accelerating GNN inference on large graphs via partition caching
CN115437778A (zh) 内核调度方法及装置、电子设备、计算机可读存储介质
US20200342327A1 (en) Data Sparsity Monitoring During Neural Network Training
Mu et al. Boosting the Convergence of Reinforcement Learning-based Auto-pruning Using Historical Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18874123

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18874123

Country of ref document: EP

Kind code of ref document: A1