WO2017166568A1 - Neural network accelerator and operation method thereof - Google Patents

Neural network accelerator and operation method thereof

Info

Publication number
WO2017166568A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
network accelerator
computing module
storage medium
perform
Prior art date
Application number
PCT/CN2016/094179
Other languages
English (en)
French (fr)
Inventor
杜子东
郭崎
陈天石
陈云霁
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 filed Critical 中国科学院计算技术研究所
Priority to US16/071,801 priority Critical patent/US20190026626A1/en
Publication of WO2017166568A1 publication Critical patent/WO2017166568A1/zh

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 — Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57 — Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575 — Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 — Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57 — Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 — Multiprogramming arrangements
    • G06F9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 — Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 — Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 — Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 — Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48 — Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802 — Special implementations
    • G06F2207/4818 — Threshold devices
    • G06F2207/4824 — Neural networks

Definitions

  • the invention relates to the field of neural network algorithms, and in particular to a neural network accelerator and an operation method thereof.
  • Common neural network algorithms, including the most popular Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), and Deep Neural Network (DNN), are mostly nonlinear neural networks.
  • MLP: Multi-Layer Perceptron
  • CNN: Convolutional Neural Network
  • DNN: Deep Neural Network
  • the nonlinearity comes from activation functions, such as the sigmoid and tanh functions, or from nonlinear layers such as ReLU.
  • these nonlinear operations are usually independent of the other operations, that is, the input and output form a one-to-one mapping; and they occur in the final stage of the output neurons, that is, the computation of the next layer of the neural network can only proceed after the nonlinear operation is completed, so their operation speed deeply affects the performance of the neural network accelerator.
  • in existing neural network accelerators, these nonlinear operations are performed using a single ALU (Arithmetic Logic Unit) or a simplified ALU. However, this approach reduces the performance of the neural network accelerator.
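  • As a minimal illustrative sketch only (in Python, with function names that are assumptions rather than anything defined by the invention), the serial bottleneck described above looks like the following: every output neuron is mapped one-to-one through the same activation function, and when a single ALU applies it element by element, the next layer cannot start until the loop finishes.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def single_alu_activation(pre_activations):
        # One-to-one mapping: each output depends only on its own input,
        # but a single ALU must walk through the elements serially.
        outputs = []
        for value in pre_activations:      # serial bottleneck at the output stage
            outputs.append(sigmoid(value))
        return outputs

    layer_output = single_alu_activation([0.5, -1.2, 3.0, 0.0])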
  • an object of the present invention is to provide a neural network accelerator and an operation method thereof, which introduce a multi-ALU design into the neural network accelerator, thereby improving the operation speed of nonlinear operations and making the neural network accelerator more efficient.
  • the present invention provides a neural network accelerator including an on-chip storage medium, an on-chip address indexing module, a core computing module, and a multi-ALU device. The on-chip storage medium is used for storing data transmitted from outside the neural network accelerator or data generated during the calculation process; the on-chip data indexing module is used for mapping an input index to the correct storage address when an operation is performed; the core computing module is used for performing the linear operations of the neural network computation; and the multi-ALU device is used for obtaining input data from the core computing module or the on-chip storage medium to perform nonlinear operations that the core computing module cannot complete.
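  • As a schematic software analogy of the four modules just listed (class and method names are assumptions for illustration, not the patented hardware interfaces), one layer could flow through the accelerator roughly as follows: the address indexing module resolves storage addresses, the core computing module performs the linear part, and the multi-ALU device finishes the nonlinear part before the result is written back.

    class NeuralNetworkAccelerator:
        def __init__(self, storage, address_index, core_module, multi_alu):
            self.storage = storage        # on-chip storage medium
            self.index = address_index    # on-chip address indexing module
            self.core = core_module       # linear vector multiply-add operations
            self.multi_alu = multi_alu    # nonlinear operations the core cannot do

        def run_layer(self, input_index, weight_index, output_index):
            x = self.storage.read(self.index.map(input_index))
            w = self.storage.read(self.index.map(weight_index))
            linear = self.core.compute(x, w)        # linear part
            out = self.multi_alu.compute(linear)    # nonlinear part
            self.storage.write(self.index.map(output_index), out)
            return out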
  • the data generated in the calculation process includes calculation results or intermediate calculation results.
  • the multi-ALU device includes an input mapping unit, a plurality of arithmetic logic operation units, and an output mapping unit,
  • the input mapping unit is configured to map input data obtained from the on-chip storage medium or the core calculation module to a plurality of arithmetic logic operation units;
  • the arithmetic logic operation unit is configured to perform a logic operation according to the input data, where the logic operation includes a nonlinear operation;
  • the output mapping unit is configured to integrate the calculation results obtained by the plurality of arithmetic logic units into a correct format for subsequent storage or use by other modules.
  • the input mapping unit either distributes the input data to the plurality of arithmetic logic units so that they respectively perform different operations, or maps multiple input data one-to-one onto the plurality of arithmetic logic units for operation.
  • the plurality of arithmetic logic units are of homogeneous design or heterogeneous design.
  • a single arithmetic logic unit includes a plurality of sub-operation units that implement different functions.
  • the multi-ALU device is further configured to set, according to a control signal at the time of calculation, the arithmetic function performed by each arithmetic logic unit.
  • the on-chip storage medium is a static random access memory, a dynamic random access memory, an enhanced dynamic random access memory, a register file, or a nonvolatile memory.
  • the present invention accordingly provides an operation method using the neural network accelerator described above, including: selecting, according to a control signal, whether to enter the multi-ALU device for operation or to enter the core computing module for calculation;
  • if the core computing module is entered, data is acquired from the on-chip storage medium to perform linear operations;
  • if the multi-ALU device is entered, input data is obtained from the on-chip storage medium or the core computing module to perform nonlinear operations that the core computing module cannot complete.
  • the step of entering the multi-ALU device for operation further includes: the multi-ALU device configuring, according to the control signal, the arithmetic function performed by each arithmetic logic unit.
  • FIG. 1 is a block diagram showing the structure of a neural network accelerator of the present invention
  • FIG. 2 is a block diagram showing the structure of a multi-ALU device according to an embodiment of the present invention.
  • FIG. 3 is a block diagram showing the function implementation of a single arithmetic logic unit in an embodiment of the present invention
  • FIG. 4 is a block diagram showing the function distribution of a plurality of arithmetic logic operation units in an embodiment of the present invention
  • FIG. 5 is a flow chart of the neural network operations performed by the neural network accelerator shown in FIG. 1;
  • FIG. 6 is a block diagram showing the organization of a core computing module of an embodiment of the neural network accelerator of the present invention.
  • FIG. 7 is a block diagram showing the organization of a core computing module of another embodiment of the neural network accelerator of the present invention.
  • the present invention provides a neural network accelerator 100 comprising an on-chip storage medium 10, an on-chip address indexing module 20, a core computing module 30, and a multi-ALU device 40.
  • the on-chip address indexing module 20 is connected to the on-chip storage medium 10, and the on-chip address indexing module 20, the core computing module 30, and the multi-ALU device 40 are connected to one another in pairs.
  • the on-chip storage medium 10 is configured to store data transmitted from outside the neural network accelerator or to store data generated in the calculation process.
  • the data generated during the calculation includes calculation results or intermediate results generated during the calculation. These results may come from the on-chip core computation module 30 of the accelerator, or from other computational components, such as the multi-ALU device 40 of the present invention.
  • the on-chip storage medium 10 may be a common storage medium such as a static random access memory (SRAM), a dynamic random access memory (DRAM), an enhanced dynamic random access memory (e-DRAM), or a register file (RF), or may be a new type of storage device, such as a non-volatile memory (NVM) or a 3D memory device.
  • the on-chip address indexing module 20 is configured to map to the correct storage address according to the input index when performing the operation. This allows data and on-chip memory modules to interact correctly.
  • the address mapping process here includes direct mapping, arithmetic transformation, and the like.
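  • The two mapping styles mentioned above can be pictured with the small sketch below (a software analogy only; the base address and stride are assumed example values): a direct mapping passes the index through unchanged, whereas an arithmetic transformation computes the physical address from the index.

    def direct_mapping(index):
        # the input index is already the storage address
        return index

    def arithmetic_mapping(index, base=0x1000, stride=4):
        # address computed from the index, e.g. base + index * element_size
        return base + index * stride

    assert direct_mapping(7) == 7
    assert arithmetic_mapping(7) == 0x1000 + 28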
  • the core calculation module 30 is configured to perform a linear operation in a neural network operation. Specifically, the core computing module 30 performs most of the operations in the neural network algorithm, that is, vector multiply and add operations.
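  • The vector multiply-add workload of the core computing module corresponds to the following reference loop (plain Python for clarity, not the hardware datapath): for a fully connected layer it accumulates weight-input products for each output neuron, producing the pre-activation values that the multi-ALU device later passes through the nonlinear functions.

    def core_multiply_add(weights, inputs):
        # weights: one row per output neuron
        outputs = []
        for row in weights:
            acc = 0.0
            for w, x in zip(row, inputs):   # multiply-accumulate
                acc += w * x
            outputs.append(acc)
        return outputs

    pre_activations = core_multiply_add([[0.1, 0.2], [0.3, -0.4]], [1.0, 2.0])
    # pre_activations == [0.5, -0.5]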
  • the multi-ALU device 40 is configured to acquire input data from the core computing module or the on-chip storage medium to perform a nonlinear operation that cannot be completed by the core computing module.
  • in the present invention, the multi-ALU device is mainly used for nonlinear operations, so as to improve the operation speed of nonlinear operations and make the neural network accelerator more efficient.
  • the data paths among the core computing module 30, the multi-ALU device 40, and the on-chip storage medium 10 include, but are not limited to, interconnection technologies such as H-TREE or FAT-TREE.
  • the multi-ALU device 40 includes an input mapping unit 41, a plurality of arithmetic logic operation units 42, and an output mapping unit 43.
  • the input mapping unit 41 is configured to map input data obtained from the on-chip storage medium or the core calculation module to the plurality of arithmetic logic operation units 42.
  • Different data distribution principles may exist in different accelerator designs. According to the allocation principle, the input mapping unit 41 either distributes the input data to the plurality of arithmetic logic units 42 so that they respectively perform different operations, or maps multiple input data one-to-one onto the plurality of arithmetic logic units 42 for operation.
  • the input data here may be obtained directly from the on-chip storage medium 10 or obtained from the core computing module 30.
  • a plurality of arithmetic logic operation units 42 are configured to perform logical operations respectively according to the input data, and the logical operations include nonlinear operations.
  • a single arithmetic logic unit 42 includes a plurality of sub-operation units that implement different functions. As shown in FIG. 3, the functions of a single arithmetic logic unit 42 include multiplication, addition, comparison, division, shift operations, and so on, as well as complex functions such as exponentiation; a single arithmetic logic unit 42 includes one or more sub-operation units that perform these functions. The functions of the arithmetic logic unit 42 are determined by the functions required of the neural network accelerator and are not limited to a specific algorithmic operation.
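  • One way to picture a single arithmetic logic unit built from selectable sub-operation units is sketched below (the operation names and the control-signal encoding are assumptions chosen for illustration): a control signal selects which sub-unit processes the operands, matching the configurable behaviour described above.

    import math

    class ArithmeticLogicUnit:
        # Sub-operation units, selected by a control signal (opcode).
        SUB_UNITS = {
            "add":   lambda a, b: a + b,
            "mul":   lambda a, b: a * b,
            "cmp":   lambda a, b: max(a, b),     # comparison, e.g. for ReLU or pooling
            "div":   lambda a, b: a / b,
            "shift": lambda a, b: a * (2 ** b),  # shift as scaling by a power of two
            "exp":   lambda a, b: math.exp(a),   # complex function, e.g. for sigmoid
        }

        def __init__(self, opcode="add"):
            self.opcode = opcode                 # set by the control signal

        def configure(self, opcode):
            self.opcode = opcode

        def compute(self, a, b=0.0):
            return self.SUB_UNITS[self.opcode](a, b)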
  • the plurality of arithmetic logic units 42 are of homogeneous or heterogeneous design, that is, the arithmetic logic units 42 may implement the same functions or different functions. In the embodiment shown in FIG. 4, the functions of the plurality of arithmetic logic units 42 are heterogeneous: the upper two ALUs implement multiplication and addition operations, while the other ALUs each implement other complex functions. A heterogeneous design helps to effectively balance the functionality and overhead of the ALUs.
  • the output mapping unit 43 is configured to integrate the calculation results obtained by the plurality of arithmetic logic units 42 into a correct format for subsequent storage or use by other modules.
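  • The cooperation of the input mapping unit 41, the parallel arithmetic logic units 42, and the output mapping unit 43 can be sketched as follows (again only a software analogy; the round-robin, one-to-one distribution shown here is just one of the allocation principles mentioned above, and the sketch reuses the hypothetical ArithmeticLogicUnit class from the earlier example): the input mapping unit spreads the elements over the ALUs, each ALU processes its own share, and the output mapping unit reassembles the results in the original order.

    class MultiALUDevice:
        def __init__(self, alus):
            self.alus = alus   # plurality of arithmetic logic units

        def compute(self, input_data):
            n = len(self.alus)
            # Input mapping unit: distribute elements one-to-one (round-robin).
            shares = [input_data[i::n] for i in range(n)]
            partial = [[alu.compute(v) for v in share]
                       for alu, share in zip(self.alus, shares)]
            # Output mapping unit: interleave partial results back into input order.
            outputs = [0.0] * len(input_data)
            for i, share in enumerate(partial):
                outputs[i::n] = share
            return outputs

    # e.g. four ALUs, all configured for the exponential sub-operation unit
    device = MultiALUDevice([ArithmeticLogicUnit("exp") for _ in range(4)])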
  • FIG. 5 is a flow chart of a neural network accelerator as shown in FIG. 1 for performing neural network operations; the process includes:
  • in step S501, it is determined according to the control signal whether to enter the multi-ALU device for calculation; if so, the process proceeds to step S502, otherwise it proceeds to step S503.
  • the control signal of the present invention may be implemented as a control instruction, a direct signal, or the like.
  • Step S502 obtaining input data from an on-chip storage medium or a core computing module. After this step is completed, the process proceeds to step S504.
  • generally, a nonlinear operation that follows the core computation acquires its input data on-chip from the core computing module; if the input of the calculation is an intermediate result cached in the on-chip storage medium, the input data is acquired from the on-chip storage medium.
  • step S503 the core computing module is entered to perform calculation. Specifically, the core computing module 30 acquires data from the on-chip storage medium to perform linear operations, and the core computing module 30 performs most of the operations in the neural network algorithm, that is, vector multiply-accumulate operations.
  • in step S504, it is determined whether the ALU functions need to be configured; if so, the process proceeds to step S505, otherwise it proceeds directly to step S506. Specifically, the multi-ALU device 40 also needs to determine, according to the control signal, whether it must perform the related configuration to control the arithmetic functions of the respective arithmetic logic units 42, for example when an arithmetic logic unit 42 needs to perform a specific function. That is, the multi-ALU device 40 is also configured to set, according to the control signal at the time of calculation, the arithmetic functions performed by the respective arithmetic logic units.
  • Step S505 obtaining parameters from the on-chip storage medium for configuration. After the configuration is completed, the process proceeds to step S506.
  • step S506 the multi-ALU device performs calculation.
  • the multi-ALU device 40 is used to perform non-linear operations that the core computing module 30 cannot perform.
  • step S507 it is judged whether all the calculations are completed, and if yes, the process ends. Otherwise, the process returns to step S501 to continue the calculation.
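  • The flow of steps S501 to S507 can be summarised as the following control loop (pseudocode in Python; the control-signal fields and helper names are assumptions, not the signal format defined by the invention):

    def run(accelerator, control_signals):
        for signal in control_signals:                  # S507: repeat until all work is done
            if signal.use_multi_alu:                    # S501: enter the multi-ALU device?
                data = accelerator.fetch_input(signal)  # S502: from storage or core module
                if signal.needs_alu_config:             # S504: configure ALU functions?
                    params = accelerator.read_params(signal)   # S505: parameters from storage
                    accelerator.multi_alu.configure(params)
                accelerator.multi_alu.compute(data)     # S506: nonlinear operations
            else:
                accelerator.core_compute(signal)        # S503: linear vector multiply-add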
  • the core computing module 30 may take various structures, such as the one-dimensional PE (processing element) implementation of FIG. 6 or the two-dimensional PE implementation of FIG. 7.
  • in FIG. 6, multiple PEs perform calculations simultaneously, usually homogeneous operations; common vector accelerators are implementations of this kind.
  • in the two-dimensional PE implementation of FIG. 7, the multiple PEs are usually homogeneous as well, but data may be transferred between PEs along both dimensions; common matrix accelerators, such as two-dimensional systolic structures, are implementations of this kind.
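  • The difference between the two organizations can be pictured functionally as follows (a behavioural analogy only, not a cycle-accurate systolic model): in the one-dimensional case each PE independently produces one element of a homogeneous vector operation, while in the two-dimensional case the PE at row i, column j accumulates products that conceptually flow along its row and column, as in a systolic matrix multiplier.

    def pe_array_1d(a, b):
        # one PE per element, all performing the same (homogeneous) operation
        return [x * y for x, y in zip(a, b)]

    def pe_array_2d(A, B):
        # PE(i, j) accumulates A[i][k] * B[k][j]; operands move along rows and columns
        n, k, m = len(A), len(A[0]), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                for t in range(k):
                    C[i][j] += A[i][t] * B[t][j]
        return C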
  • the present invention adds a multi-ALU device to the neural network accelerator for acquiring input data from the core computing module or the on-chip storage medium to perform a nonlinear operation that cannot be completed by the core computing module.
  • the invention improves the operation speed of the nonlinear operation, making the neural network accelerator more efficient.
  • the present invention adds a multi-ALU device to the neural network accelerator, which acquires input data from the core computing module or the on-chip storage medium to perform operations that the core computing module cannot complete, and these operations mainly include nonlinear operations. Compared with existing neural network accelerator designs, the operation speed of nonlinear operations is improved, making the neural network accelerator more efficient.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Advance Control (AREA)
  • Memory System (AREA)
  • Complex Calculations (AREA)

Abstract

A neural network accelerator (100) and an operation method thereof, applicable to the field of neural network algorithms. The neural network accelerator (100) comprises an on-chip storage medium (10), an on-chip address indexing module (20), a core computing module (30), and a multi-ALU device (40). The on-chip storage medium (10) is used for storing data transmitted from outside the accelerator or data generated during computation; the on-chip data indexing module (20) is used for mapping an input index to the correct storage address when an operation is performed; the core computing module (30) is used for performing neural network operations; and the multi-ALU device (40) is used for obtaining input data from the core computing module (30) or the on-chip storage medium (10) to perform nonlinear operations that the core computing module (30) cannot complete. By introducing a multi-ALU design into the neural network accelerator (100), the operation speed of nonlinear operations is improved, making the neural network accelerator (100) more efficient.

Description

Neural network accelerator and operation method thereof
Technical Field
The present invention relates to the field of neural network algorithms, and in particular to a neural network accelerator and an operation method thereof.
Background Art
In the era of big data, more and more devices need to perform increasingly complex processing of real-time inputs from the real world, such as industrial robots, autonomous driverless cars, and mobile devices. Most of these tasks belong to the field of machine learning, where most of the computation consists of vector operations or matrix operations with a very high degree of parallelism. Compared with traditional general-purpose GPU/CPU acceleration schemes, hardware ASIC accelerators are currently the most popular acceleration solution: on the one hand, they offer a very high degree of parallelism and can achieve very high performance; on the other hand, they are highly energy-efficient.
Common neural network algorithms, including the most popular Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), and Deep Neural Network (DNN), are mostly nonlinear neural networks. The nonlinearity comes from activation functions, such as the sigmoid and tanh functions, or from nonlinear layers such as ReLU. These nonlinear operations are usually independent of the other operations, that is, the input and output form a one-to-one mapping; and they are located in the final stage of the output neurons, that is, the computation of the next layer of the neural network can only proceed after the nonlinear operation is completed, so their operation speed deeply affects the performance of the neural network accelerator. In neural network accelerators, these nonlinear operations are performed using a single ALU (Arithmetic Logic Unit) or a simplified ALU. However, this approach reduces the performance of the neural network accelerator.
In summary, the prior art obviously has inconveniences and defects in practical use, so improvement is necessary.
Disclosure of the Invention
In view of the above defects, the object of the present invention is to provide a neural network accelerator and an operation method thereof, which introduce a multi-ALU design into the neural network accelerator, thereby improving the operation speed of nonlinear operations and making the neural network accelerator more efficient.
In order to achieve the above object, the present invention provides a neural network accelerator comprising an on-chip storage medium, an on-chip address indexing module, a core computing module, and a multi-ALU device. The on-chip storage medium is used for storing data transmitted from outside the neural network accelerator or for storing data generated during computation; the on-chip data indexing module is used for mapping an input index to the correct storage address when an operation is performed; the core computing module is used for performing the linear operations of the neural network computation; and the multi-ALU device is used for obtaining input data from the core computing module or the on-chip storage medium to perform nonlinear operations that the core computing module cannot complete.
According to the neural network accelerator of the present invention, the data generated during computation includes calculation results or intermediate calculation results.
According to the neural network accelerator of the present invention, the multi-ALU device includes an input mapping unit, a plurality of arithmetic logic units, and an output mapping unit, wherein
the input mapping unit is used for mapping the input data obtained from the on-chip storage medium or the core computing module to the plurality of arithmetic logic units;
the arithmetic logic units are used for performing logic operations, including nonlinear operations, according to the input data;
the output mapping unit is used for integrating the calculation results obtained by the plurality of arithmetic logic units into the correct format for subsequent storage or use by other modules.
According to the neural network accelerator of the present invention, the input mapping unit either distributes the input data to the plurality of arithmetic logic units so that they respectively perform different operations, or maps multiple input data one-to-one onto the plurality of arithmetic logic units for operation.
According to the neural network accelerator of the present invention, the plurality of arithmetic logic units are of homogeneous design or heterogeneous design.
According to the neural network accelerator of the present invention, a single arithmetic logic unit includes a plurality of sub-operation units that implement different functions.
According to the neural network accelerator of the present invention, the multi-ALU device is further used for configuring, according to a control signal at the time of calculation, the arithmetic functions performed by each arithmetic logic unit.
According to the neural network accelerator of the present invention, the on-chip storage medium is a static random access memory, a dynamic random access memory, an enhanced dynamic random access memory, a register file, or a non-volatile memory.
The present invention accordingly provides an operation method using the neural network accelerator described above, including:
selecting, according to a control signal, whether to enter the multi-ALU device for operation or to enter the core computing module for calculation;
if the core computing module is entered, acquiring data from the on-chip storage medium to perform linear operations;
if the multi-ALU device is entered, obtaining input data from the on-chip storage medium or the core computing module to perform nonlinear operations that the core computing module cannot complete.
According to the operation method of the neural network accelerator of the present invention, the step of entering the multi-ALU device for operation further includes: the multi-ALU device configuring, according to the control signal, the arithmetic functions performed by each arithmetic logic unit.
Brief Description of the Drawings
FIG. 1 is a structural block diagram of a neural network accelerator of the present invention;
FIG. 2 is a structural block diagram of a multi-ALU device according to an embodiment of the present invention;
FIG. 3 is a block diagram of the function implementation of a single arithmetic logic unit in an embodiment of the present invention;
FIG. 4 is a block diagram of the function distribution of a plurality of arithmetic logic units in an embodiment of the present invention;
FIG. 5 is a flow chart of the neural network operations performed by the neural network accelerator shown in FIG. 1;
FIG. 6 is an organizational block diagram of the core computing module of one embodiment of the neural network accelerator of the present invention;
FIG. 7 is an organizational block diagram of the core computing module of another embodiment of the neural network accelerator of the present invention.
Best Mode for Carrying Out the Invention
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.
As shown in FIG. 1, the present invention provides a neural network accelerator 100 comprising an on-chip storage medium 10, an on-chip address indexing module 20, a core computing module 30, and a multi-ALU device 40. The on-chip address indexing module 20 is connected to the on-chip storage medium 10, and the on-chip address indexing module 20, the core computing module 30, and the multi-ALU device 40 are connected to one another in pairs.
The on-chip storage medium 10 is used for storing data transmitted from outside the neural network accelerator or for storing data generated during computation. The data generated during computation includes calculation results or intermediate results produced in the computation process. These results may come from the accelerator's on-chip core computing module 30, or from other computing components, such as the multi-ALU device 40 of the present invention. The on-chip storage medium 10 may be a common storage medium such as a static random access memory (SRAM), a dynamic random access memory (DRAM), an enhanced dynamic random access memory (e-DRAM), or a register file (RF), or may be a new type of storage device, such as a non-volatile memory (NVM) or a 3D memory device.
The on-chip address indexing module 20 is used for mapping an input index to the correct storage address when an operation is performed, so that data can interact correctly with the on-chip storage module. The address mapping process here includes direct mapping, arithmetic transformation, and the like.
The core computing module 30 is used for performing the linear operations of the neural network computation. Specifically, the core computing module 30 performs most of the operations in the neural network algorithm, namely vector multiply-add operations.
The multi-ALU device 40 is used for obtaining input data from the core computing module or the on-chip storage medium to perform nonlinear operations that the core computing module cannot complete. In the present invention, the multi-ALU device is mainly used for nonlinear operations, so as to improve the operation speed of nonlinear operations and make the neural network accelerator more efficient. In the present invention, the data paths among the core computing module 30, the multi-ALU device 40, and the on-chip storage medium 10 include, but are not limited to, interconnection technologies such as H-TREE or FAT-TREE.
As shown in FIG. 2, the multi-ALU device 40 includes an input mapping unit 41, a plurality of arithmetic logic units 42, and an output mapping unit 43.
The input mapping unit 41 is used for mapping the input data obtained from the on-chip storage medium or the core computing module to the plurality of arithmetic logic units 42. Different data distribution principles may exist in different accelerator designs; according to the allocation principle, the input mapping unit 41 either distributes the input data to the plurality of arithmetic logic units 42 so that they respectively perform different operations, or maps multiple input data one-to-one onto the plurality of arithmetic logic units 42 for operation. The input data here may be obtained directly from the on-chip storage medium 10 or obtained from the core computing module 30.
The plurality of arithmetic logic units 42 are used for performing logic operations, including nonlinear operations, according to the input data. A single arithmetic logic unit 42 includes a plurality of sub-operation units that implement different functions. As shown in FIG. 3, the functions of a single arithmetic logic unit 42 include multiplication, addition, comparison, division, shift operations, and so on, as well as complex functions such as exponentiation; a single arithmetic logic unit 42 includes one or more sub-operation units that perform these functions. The functions of the arithmetic logic unit 42 are determined by the functions required of the neural network accelerator and are not limited to a specific algorithmic operation.
The plurality of arithmetic logic units 42 are of homogeneous or heterogeneous design, that is, the arithmetic logic units 42 may implement the same functions or different functions. In the embodiment shown in FIG. 4, the functions of the plurality of arithmetic logic units 42 are heterogeneous: the upper two ALUs implement multiplication and addition operations, while the other ALUs each implement other complex functions. A heterogeneous design helps to effectively balance the functionality and overhead of the ALUs.
The output mapping unit 43 is used for integrating the calculation results obtained by the plurality of arithmetic logic units 42 into the correct format for subsequent storage or use by other modules.
FIG. 5 is a flow chart of the neural network operations performed by the neural network accelerator shown in FIG. 1; the flow includes:
Step S501: determining, according to the control signal, whether to enter the multi-ALU device for calculation; if so, proceeding to step S502, otherwise proceeding to step S503. The control signal of the present invention may be implemented as a control instruction, a direct signal, or the like.
Step S502: obtaining input data from the on-chip storage medium or the core computing module; after this step is completed, proceeding to step S504. Generally, a nonlinear operation that follows the core computation acquires its input data on-chip from the core computing module; if the input of the calculation is an intermediate result cached in the on-chip storage medium, the input data is acquired from the on-chip storage medium.
Step S503: entering the core computing module for calculation. Specifically, the core computing module 30 acquires data from the on-chip storage medium to perform linear operations; the core computing module 30 performs most of the operations in the neural network algorithm, namely vector multiply-add operations.
Step S504: determining whether the ALU functions need to be configured; if so, proceeding to step S505, otherwise proceeding directly to step S506. Specifically, the multi-ALU device 40 also needs to determine, according to the control signal, whether it must perform the related configuration to control the arithmetic functions of the respective arithmetic logic units 42, for example when an arithmetic logic unit 42 needs to perform a specific function. That is, the multi-ALU device 40 is also used for configuring, according to the control signal at the time of calculation, the arithmetic functions performed by each arithmetic logic unit.
Step S505: obtaining parameters from the on-chip storage medium for configuration; after the configuration is completed, proceeding to step S506.
Step S506: performing calculation in the multi-ALU device. The multi-ALU device 40 is used for performing nonlinear operations that the core computing module 30 cannot complete.
Step S507: determining whether all calculations are completed; if so, the process ends, otherwise the process returns to step S501 to continue the calculation.
In one embodiment of the present invention, the core computing module 30 may take various structures, such as the one-dimensional PE (processing element) implementation of FIG. 6 or the two-dimensional PE implementation of FIG. 7. In FIG. 6, multiple PEs perform calculations simultaneously, usually homogeneous operations; common vector accelerators are implementations of this kind. In the two-dimensional PE implementation of FIG. 7, the multiple PEs usually perform homogeneous computation, but data may be transferred between PEs along both dimensions; common matrix accelerators, such as two-dimensional systolic structures, are implementations of this kind.
In summary, the present invention adds a multi-ALU device to the neural network accelerator, which obtains input data from the core computing module or the on-chip storage medium to perform nonlinear operations that the core computing module cannot complete. The present invention improves the operation speed of nonlinear operations, making the neural network accelerator more efficient.
Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, and these corresponding changes and modifications shall all fall within the protection scope of the claims appended to the present invention.
Industrial Applicability
By adding a multi-ALU device to the neural network accelerator, the present invention obtains input data from the core computing module or the on-chip storage medium to perform operations that the core computing module cannot complete, and these operations mainly include nonlinear operations. Compared with existing neural network accelerator designs, the operation speed of nonlinear operations is improved, making the neural network accelerator more efficient.

Claims (10)

  1. A neural network accelerator, characterized by comprising an on-chip storage medium, an on-chip address indexing module, a core computing module, and a multi-ALU device, wherein
    the on-chip storage medium is used for storing data transmitted from outside the neural network accelerator or for storing data generated during computation;
    the on-chip data indexing module is used for mapping an input index to the correct storage address when an operation is performed;
    the core computing module is used for performing the linear operations of the neural network computation;
    the multi-ALU device is used for obtaining input data from the core computing module or the on-chip storage medium to perform nonlinear operations that the core computing module cannot complete.
  2. The neural network accelerator according to claim 1, characterized in that the data generated during computation includes calculation results or intermediate calculation results.
  3. The neural network accelerator according to claim 1, characterized in that the multi-ALU device includes an input mapping unit, a plurality of arithmetic logic units, and an output mapping unit, wherein
    the input mapping unit is used for mapping the input data obtained from the on-chip storage medium or the core computing module to the plurality of arithmetic logic units;
    the arithmetic logic units are used for performing logic operations, including nonlinear operations, according to the input data;
    the output mapping unit is used for integrating the calculation results obtained by the plurality of arithmetic logic units into the correct format for subsequent storage or use by other modules.
  4. The neural network accelerator according to claim 3, characterized in that the input mapping unit either distributes the input data to the plurality of arithmetic logic units so that they respectively perform different operations, or maps multiple input data one-to-one onto the plurality of arithmetic logic units for operation.
  5. The neural network accelerator according to claim 3, characterized in that the plurality of arithmetic logic units are of homogeneous design or heterogeneous design.
  6. The neural network accelerator according to claim 3, characterized in that a single arithmetic logic unit includes a plurality of sub-operation units that implement different functions.
  7. The neural network accelerator according to claim 3, characterized in that the multi-ALU device is further used for configuring, according to a control signal at the time of calculation, the arithmetic functions performed by each arithmetic logic unit.
  8. The neural network accelerator according to claim 1, characterized in that the on-chip storage medium is a static random access memory, a dynamic random access memory, an enhanced dynamic random access memory, a register file, or a non-volatile memory.
  9. An operation method for the neural network accelerator according to any one of claims 1 to 8, characterized by comprising:
    selecting, according to a control signal, whether to enter the multi-ALU device for operation or to enter the core computing module for calculation;
    if the core computing module is entered, acquiring data from the on-chip storage medium to perform linear operations;
    if the multi-ALU device is entered, obtaining input data from the on-chip storage medium or the core computing module to perform nonlinear operations that the core computing module cannot complete.
  10. The operation method of the neural network accelerator according to claim 9, characterized in that the step of entering the multi-ALU device for operation further includes:
    the multi-ALU device configuring, according to the control signal, the arithmetic functions performed by each arithmetic logic unit.
PCT/CN2016/094179 2016-03-28 2016-08-09 Neural network accelerator and operation method thereof WO2017166568A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/071,801 US20190026626A1 (en) 2016-03-28 2016-08-09 Neural network accelerator and operation method thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610183040.3A CN105892989B (zh) 2016-03-28 2016-03-28 Neural network accelerator and operation method thereof
CN201610183040.3 2016-03-28

Publications (1)

Publication Number Publication Date
WO2017166568A1 true WO2017166568A1 (zh) 2017-10-05

Family

ID=57014899

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/094179 WO2017166568A1 (zh) 2016-03-28 2016-08-09 Neural network accelerator and operation method thereof

Country Status (3)

Country Link
US (1) US20190026626A1 (zh)
CN (1) CN105892989B (zh)
WO (1) WO2017166568A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3660690A4 (en) * 2017-11-30 2020-08-12 SZ DJI Technology Co., Ltd. Calculation unit, calculation system and control method for calculation unit
US11443183B2 (en) 2018-09-07 2022-09-13 Samsung Electronics Co., Ltd. Neural processing system

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102016216947A1 (de) * 2016-09-07 2018-03-08 Robert Bosch Gmbh Model calculation unit and control device for calculating a multilayer perceptron model
DE102016216950A1 (de) * 2016-09-07 2018-03-08 Robert Bosch Gmbh Model calculation unit and control device for calculating a multilayer perceptron model with feedforward and feedback
US10963775B2 (en) * 2016-09-23 2021-03-30 Samsung Electronics Co., Ltd. Neural network device and method of operating neural network device
JP2018060268A (ja) * 2016-10-03 2018-04-12 株式会社日立製作所 Recognition device and learning system
WO2018112699A1 (zh) * 2016-12-19 2018-06-28 上海寒武纪信息科技有限公司 Artificial neural network reverse training device and method
US10402527B2 (en) * 2017-01-04 2019-09-03 Stmicroelectronics S.R.L. Reconfigurable interconnect
CN107392308B (zh) * 2017-06-20 2020-04-03 中国科学院计算技术研究所 Convolutional neural network acceleration method and system based on a programmable device
GB2568776B (en) 2017-08-11 2020-10-28 Google Llc Neural network accelerator with parameters resident on chip
US11609623B2 (en) 2017-09-01 2023-03-21 Qualcomm Incorporated Ultra-low power neuromorphic artificial intelligence computing accelerator
CN109003132B (zh) * 2017-10-30 2021-12-14 上海寒武纪信息科技有限公司 Advertisement recommendation method and related products
CN109960673B (zh) * 2017-12-14 2020-02-18 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
CN109978155A (zh) * 2017-12-28 2019-07-05 北京中科寒武纪科技有限公司 Integrated circuit chip device and related products
US11436483B2 (en) * 2018-01-17 2022-09-06 Mediatek Inc. Neural network engine with tile-based execution
CN110222833B (zh) * 2018-03-01 2023-12-19 华为技术有限公司 Data processing circuit for a neural network
CN110321064A (zh) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Computing platform implementation method and system for a neural network
US20210133854A1 (en) 2018-09-13 2021-05-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109358993A (zh) * 2018-09-26 2019-02-19 中科物栖(北京)科技有限责任公司 Method and device for handling faults of a deep neural network accelerator
WO2020061924A1 (zh) * 2018-09-27 2020-04-02 华为技术有限公司 Operation accelerator and data processing method
CN110597756B (zh) * 2019-08-26 2023-07-25 光子算数(北京)科技有限责任公司 Computing circuit and data operation method
TWI717892B (zh) * 2019-11-07 2021-02-01 財團法人工業技術研究院 Dynamic multi-configuration CNN accelerator architecture and operation method
US11593609B2 (en) 2020-02-18 2023-02-28 Stmicroelectronics S.R.L. Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks
CN111639045B (zh) * 2020-06-03 2023-10-13 地平线(上海)人工智能技术有限公司 Data processing method, apparatus, medium and device
US11531873B2 (en) 2020-06-23 2022-12-20 Stmicroelectronics S.R.L. Convolution acceleration with embedded vector decompression
CN115600659A (zh) * 2021-07-08 2023-01-13 北京嘉楠捷思信息技术有限公司(Cn) Hardware acceleration device and acceleration method for neural network operations

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-level parallel single-instruction multiple-data array processing system
CN104915322A (zh) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Convolutional neural network hardware acceleration method and AXI bus IP core therefor
CN105184366A (zh) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division multiplexed general-purpose neural network processor

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103107879B (zh) * 2012-12-21 2015-08-26 杭州晟元芯片技术有限公司 RSA accelerator
US20140289445A1 (en) * 2013-03-22 2014-09-25 Antony Savich Hardware accelerator system and method
DE102013213420A1 (de) * 2013-04-10 2014-10-16 Robert Bosch Gmbh Model calculation unit, control device and method for calculating a data-based function model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-level parallel single-instruction multiple-data array processing system
CN104915322A (zh) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Convolutional neural network hardware acceleration method and AXI bus IP core therefor
CN105184366A (zh) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division multiplexed general-purpose neural network processor

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3660690A4 (en) * 2017-11-30 2020-08-12 SZ DJI Technology Co., Ltd. CALCULATION UNIT, CALCULATION SYSTEM AND ORDERING PROCEDURE FOR CALCULATION UNIT
US11443183B2 (en) 2018-09-07 2022-09-13 Samsung Electronics Co., Ltd. Neural processing system
US11625606B2 (en) 2018-09-07 2023-04-11 Samsung Electronics Co., Ltd. Neural processing system

Also Published As

Publication number Publication date
US20190026626A1 (en) 2019-01-24
CN105892989A (zh) 2016-08-24
CN105892989B (zh) 2017-04-12

Similar Documents

Publication Publication Date Title
WO2017166568A1 (zh) Neural network accelerator and operation method thereof
WO2017181562A1 (zh) Neural network processing method and system
KR102402111B1 (ko) Apparatus and method for executing forward operation of a convolutional neural network
US11403516B2 (en) Apparatus and method for processing convolution operation of neural network
US10990410B2 (en) Systems and methods for virtually partitioning a machine perception and dense algorithm integrated circuit
US20200234124A1 (en) Winograd transform convolution operations for neural networks
TWI818944B (zh) Neural network processing unit and system on chip
EP3451236A1 (en) Method and device for executing forwarding operation of fully-connected layered neural network
JP2020526830A (ja) Operation accelerator
CN111105023B (zh) Data stream reconstruction method and reconfigurable data stream processor
US20190228307A1 (en) Method and apparatus with data processing
JP2018116469A (ja) Arithmetic system and arithmetic method for neural network
JP7386543B2 (ja) Systems and methods for implementing a machine perception and dense algorithm integrated circuit
CN108960414B (zh) Method for implementing single-broadcast multiple-operation based on a deep learning accelerator
US20210350230A1 (en) Data dividing method and processor for convolution operation
KR20190089685A (ko) 데이터를 처리하는 방법 및 장치
CN108446758B (zh) Serial pipeline processing method for neural network data oriented to artificial intelligence computation
US11709783B1 (en) Tensor data distribution using grid direct-memory access (DMA) controller
WO2021031351A1 (zh) Data processing system, method and medium
CN110929854B (zh) Data processing method and apparatus, and hardware accelerator
Pawanekar et al. Highly scalable processor architecture for reinforcement learning
Zhang et al. Research of Heterogeneous Acceleration Optimization of Convolutional Neural Network Algorithm for Unmanned Vehicle Based on FPGA
US20220197971A1 (en) Systems and methods for an intelligent mapping of neural network weights and input data to an array of processing cores of an integrated circuit
KR20240037146A (ko) Multi-bit accumulator, in-memory computing processor including the multi-bit accumulator, and operating method of the multi-bit accumulator
KR20240025827A (ko) In-memory computing (IMC) processor and operating method of the IMC processor

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16896340

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16896340

Country of ref document: EP

Kind code of ref document: A1