WO2019085378A1 - Hardware implementation device and method for high-speed full-connection calculation - Google Patents

Hardware implementation device and method for high-speed full-connection calculation

Info

Publication number
WO2019085378A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
weight
calculation
module
result
Prior art date
Application number
PCT/CN2018/080600
Other languages
French (fr)
Chinese (zh)
Inventor
康君龙
张玉
谢东亮
Original Assignee
北京深鉴智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京深鉴智能科技有限公司
Publication of WO2019085378A1 publication Critical patent/WO2019085378A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 99/00 Subject matter not provided for in other groups of this subclass


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Provided are a hardware implementation device and method for high-speed full-connection calculation. According to the present invention, the hardware implementation device (200) for high-speed full-connection calculation comprises: a weight storage module (210) for storing the weight data used in the calculation, where m sets of weight data are stored at a time until the weight calculation for all output channels is completed; a vector storage module (220) for storing n input vector data; an output register module (230) implementing an output buffer for the calculation results; and a core calculation module (240) for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, where each multiplication result is added to the previous valid result, the corresponding offset value is added to the multiply-accumulate result, and the final calculation result is output to the output register module.

Description

Hardware implementation device and method for high-speed full-connection calculation
Technical Field
The present invention relates to artificial neural networks, and more particularly to a hardware implementation device and method for high-speed full-connection calculation.
Background Art
The concept of deep learning originates from research on artificial neural networks (ANNs); it is a family of machine learning methods based on learning representations of data. A multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features into more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of the data.
Deep learning is a relatively new field of machine learning research. Its motivation is to build neural networks that simulate the way the human brain analyzes and learns, mimicking the brain's mechanisms to interpret data such as images, sound, and text.
In AlexNet, a classic deep learning network, the network model consists of convolution layers (conv), pooling layers (pooling), fully connected layers (fc), and a softmax layer; the fully connected layer maps the learned distributed feature representation into the sample label space.
Every node of a fully connected layer is connected to all nodes of the previous layer, combining the features extracted by the earlier layers. Figure 1 is a schematic diagram of a simple artificial neural network. As shown in Figure 1, the forward computation is a linear weighted summation: each output of the fully connected layer is obtained by multiplying every node of the previous layer by a weight coefficient W and finally adding an offset value b. In matrix form this can be expressed as:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$
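For illustration only (this sketch is not part of the original disclosure, and the function and variable names are invented), the forward computation above can be written in a few lines of NumPy:

```python
import numpy as np

def fully_connected(x, W, b):
    """Forward pass of one fully connected layer: y = W x + b."""
    # W: (output_channels, input_channels), x: (input_channels,), b: (output_channels,)
    return W @ x + b

# Dimensions borrowed from the first preferred embodiment: 2048 inputs, 30 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((30, 2048)).astype(np.float32)
x = rng.standard_normal(2048).astype(np.float32)
b = rng.standard_normal(30).astype(np.float32)
y = fully_connected(x, W, b)  # y has shape (30,)
```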
In the fully connected (FC) layer operation of a neural network, every input must be multiplied and accumulated with all of the weights, so the amount of data involved in the computation is large and the hardware bandwidth requirement is high. A design that exploits this characteristic can reduce the hardware bandwidth required for the data and improve computational efficiency.
Summary of the Invention
As described above, the present invention is designed around the above characteristics of the fully connected layer operation of a neural network, so as to reduce the hardware bandwidth requirement of the data and thereby improve computational efficiency.
The invention provides a dedicated circuit for implementing the neural network full-connection operation. An object of the present invention is to provide a device implementing an FC accelerator with high reuse of FC data, low interface requirements, high computing power, and high performance.
To achieve the above object, and in view of the large amount of computation in FC layers, the present invention provides a hardware implementation device and method for high-speed FC calculation.
According to a first aspect of the present invention, there is provided a hardware implementation device for high-speed full-connection calculation, which may include: a weight storage module for storing the weight data used in the calculation, storing m sets of weight data at a time until the weight calculation for all output channels is completed; a vector storage module for storing n input vector data; an output register module implementing an output buffer for the calculation results; and a core calculation module for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, adding each multiplication result to the previous valid result, adding the corresponding offset value to the multiply-accumulate result, and outputting the final calculation result to the output register module.
In the hardware implementation device according to the first aspect of the present invention, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
In the device according to the first aspect of the present invention, the core calculation module may include m*n compute cores, so that the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
In the device according to the first aspect of the present invention, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
According to a second aspect of the present invention, there is provided a hardware implementation method for high-speed full-connection calculation, which may include: (1) loading m sets of weight data into a weight storage module; (2) requesting input vector data and storing the n received input vector data in a vector storage module; (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules and sending them to a core calculation module; (4) the core calculation module multiplying the received weight data and input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate over the input channels in a pipelined fashion; (5) adding the multiply-accumulate result of step (4) to the corresponding offset data, completing all full-connection operations for the input channels of the current calculation, and outputting the result to an output register module; (6) the output register module outputting the result data to a target interface; (7) repeating steps (1) to (6) until all full-connection operations are completed.
In the method according to the second aspect of the present invention, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
In the method according to the second aspect of the present invention, the core calculation module may include m*n compute cores, so that in step (4) the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
In the method according to the second aspect of the present invention, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
According to a third aspect of the present invention, there is provided a computer-readable medium for recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a hardware implementation method of high-speed full-connection calculation comprising the following operations: (1) loading m sets of weight data into a weight storage module; (2) requesting input vector data and storing the n received input vector data in a vector storage module; (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules and sending them to a core calculation module; (4) the core calculation module multiplying the received weight data and input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate over the input channels in a pipelined fashion; (5) adding the multiply-accumulate result of step (4) to the corresponding offset data, completing all full-connection operations for the input channels of the current calculation, and outputting the result to an output register module; (6) the output register module outputting the result data to a target interface; (7) repeating steps (1) to (6) until all full-connection operations are completed.
In the computer-readable medium according to the third aspect of the present invention, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
Brief Description of the Drawings
The invention is described below with reference to the accompanying drawings in conjunction with embodiments. In the drawings:
Figure 1 is a schematic diagram of a simple artificial neural network;
Figure 2 is a schematic diagram of the hardware implementation device for high-speed full-connection calculation according to the present invention;
Figure 3 is a flowchart of the hardware implementation method of high-speed full-connection calculation according to the present invention;
Figure 4 is a schematic diagram of the hardware implementation device according to a first preferred embodiment of the present invention;
Figure 5 is a schematic diagram of the hardware implementation device according to a second preferred embodiment of the present invention.
Detailed Description
The drawings are for illustration only and are not to be construed as limiting the invention. The technical solution of the present invention is further described below with reference to the drawings and embodiments.
To achieve the object of the present invention, and given that full-connection (FC) layers are computationally intensive, the present invention provides a high-speed FC computing device that includes, but is not limited to, a weight storage module, a vector storage module, an output register module, and a core calculation module.
Figure 2 is a schematic diagram of the hardware implementation device for high-speed full-connection calculation according to the present invention.
As shown in Figure 2, the composition of the hardware implementation device 200 for high-speed full-connection calculation according to the present invention is described below.
Weight storage module 210: this module stores the weight data used in the calculation. The design implements the FC function by weight sharing: each time a portion of the weights has been combined with all of the inputs, the weight data is updated, until all input vectors have been processed. Preferably, the design caches 4 sets of weights at a time in a ping-pong buffer, until the weight calculation for all output channels is completed.
Vector storage module 220: because the design uses weight sharing, the demand on vector storage is low. Computation can start as soon as, for example, 4 input data are valid, and then proceeds in a pipeline, so only a small number of registers need to be added to this module to hold a small amount of data. The design uses two groups of registers, each holding 4 data, operated ping-pong. When the bandwidth of the input data interface is insufficient, the buffer can be enlarged appropriately so that computational efficiency is not affected.
Output register module 230: its design is similar to that of the input storage modules and implements an output buffer for the calculation results. The output buffer size can be adjusted according to the interface bandwidth, preventing back-pressure on the FC operation caused by results that have not yet been sent out, which would reduce computational efficiency.
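As an illustrative software model only (not from the original disclosure; the class and method names are invented), the ping-pong buffering shared by the storage modules can be sketched as a double buffer:

```python
class PingPongBuffer:
    """Double buffer: the compute side reads one bank while the load side fills the other."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]
        self.write_bank = 0  # index of the bank currently being filled

    def write(self, data):
        """Fill the write bank with the next tile of data."""
        self.banks[self.write_bank][:len(data)] = data

    def read(self):
        """Read the bank opposite the one being filled."""
        return self.banks[self.write_bank ^ 1]

    def swap(self):
        """Flip banks once the write bank is full and the read bank is drained."""
        self.write_bank ^= 1
```

Because loads always target the bank that the compute side is not reading, the load and output steps of the method described below can run concurrently with the multiply-accumulate step.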
Core calculation module 240: this module performs the multiply-accumulate of the inputs and weights and completes the result by adding the corresponding bias. To achieve higher computing power, different numbers of compute cores can be used as the interface bandwidth allows. Preferably, the design uses 16 compute cores, processing 4 weight data and 4 input vectors simultaneously, i.e., 4*4=16.
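One cycle of such an m*n core array can be sketched as follows. This reading, in which the n vector data are same-channel values from n different vectors and the m weight data are same-channel values from m output rows, is our interpretation for illustration; the patent does not spell out the wiring:

```python
def mac_tile(w, x, acc):
    """One pipeline cycle of an m*n multiply-accumulate core array.

    w:   m weight values for the current input channel (one per output row)
    x:   n input values for the current input channel (one per vector)
    acc: m x n partial sums, carried across input channels
    """
    for i in range(len(w)):      # each row of cores shares one weight value
        for j in range(len(x)):  # each column of cores shares one input value
            acc[i][j] += w[i] * x[j]
    return acc
```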
A concrete calculation example is given in the preferred embodiments below.
From the above description, the following can be summarized. The hardware implementation device for high-speed full-connection calculation according to the present invention may include: a weight storage module for storing the weight data used in the calculation, storing m sets of weight data at a time until the weight calculation for all output channels is completed; a vector storage module for storing n input vector data; an output register module implementing an output buffer for the calculation results; and a core calculation module for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, adding each multiplication result to the previous valid result, adding the corresponding offset value to the multiply-accumulate result, and outputting the final calculation result to the output register module.
In general, the core calculation module may include m*n compute cores, so that the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
Note that the values of m and n are usually determined by the actual computing hardware. Those skilled in the art will understand how to set m and n reasonably so as to obtain the desired computing power from the available hardware resources. As mentioned above, in a preferred embodiment the values may be m=4 and n=4; other values are possible in practice, as in the preferred embodiments discussed further below.
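As a back-of-the-envelope aid (ours, not from the patent), the interface bandwidth and peak throughput implied by a choice of m and n can be estimated, assuming one new 32-bit word per vector lane per cycle and one multiply-accumulate (2 operations) per core per cycle:

```python
def fc_config(m, n, clock_hz=300e6, bits=32):
    """Estimate vector-input bandwidth (GB/s) and peak throughput (Gops)
    for an m*n compute-core array."""
    bandwidth_gb = n * (bits // 8) * clock_hz / 1e9  # input words streamed per second
    gops = m * n * 2 * clock_hz / 1e9                # multiply + add per core per cycle
    return bandwidth_gb, gops

print(fc_config(4, 4))  # (4.8, 9.6)  matches the first preferred embodiment
print(fc_config(8, 4))  # (4.8, 19.2) larger weight cache, same interface rate
print(fc_config(4, 8))  # (9.6, 19.2) doubled interface data rate
```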
Furthermore, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module. This may also be taken as a preferred embodiment of the invention.
The present invention also provides a method for high-speed FC calculation, with the following specific steps:
Step 1: load weight data into the weight storage module;
Step 2: request vector data and store the received data in the vector storage module;
Step 3: when both the weight storage module and the vector storage module hold data ready for calculation, read 4 data from each of the above modules and send them to the core calculation module;
Step 4: the core calculation module multiplies the received data pairwise and adds each result to the previous valid result, completing the multiply-accumulate over the input channels in a pipelined fashion;
Step 5: add the multiply-accumulate result of step 4 to the corresponding offset data, completing all FC operations for the input channels of the current calculation, and output the result to the output register module;
Step 6: the output register module outputs the result data to the target interface;
Step 7: repeat steps 1 to 6 until all FC operations are completed.
For high computational efficiency, the buffer units are each given a ping-pong design, including the weight storage, the vector storage, and the calculation result storage. In this way, steps 1, 2, and 6 can be started in parallel while step 4 is running. This fully exploits the performance of the core calculation module, achieves completely pipelined computation, and attains high computing power and high performance.
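Purely as a software model of this control flow (the actual device is a circuit; the names and sequential loop structure are illustrative assumptions), steps 1 through 7 can be sketched as a loop over weight tiles and vector tiles:

```python
def fc_accelerate(weights, vectors, bias, m=4, n=4):
    """Software model of steps 1-7. In hardware the ping-pong buffers let the
    load steps (1, 2) and the output step (6) overlap with step 4; here
    everything runs sequentially for clarity."""
    results = {}
    for wi in range(0, len(weights), m):          # step 1: load m weight rows
        w_tile = weights[wi:wi + m]
        for vi in range(0, len(vectors), n):      # step 2: request n vectors
            v_tile = vectors[vi:vi + n]
            acc = [[0.0] * len(v_tile) for _ in w_tile]
            for c in range(len(v_tile[0])):       # steps 3-4: pipelined MACs over channels
                for i, w_row in enumerate(w_tile):
                    for j, vec in enumerate(v_tile):
                        acc[i][j] += w_row[c] * vec[c]
            for i in range(len(w_tile)):          # step 5: add the per-output bias
                for j in range(len(v_tile)):
                    # step 6: emit the (output row, vector) result
                    results[(wi + i, vi + j)] = acc[i][j] + bias[wi + i]
    return results                                # step 7: the loops cover all tiles
```

Note how the weight tile stays resident while all vectors stream past it; this is the weight sharing described above.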
More generally, in accordance with the above description, reference may be made to Figure 3, which is a flowchart of the hardware implementation method of high-speed full-connection calculation according to the present invention.
As shown in Figure 3, the hardware implementation method 300 of high-speed full-connection calculation according to the present invention may begin at step S310, in which m sets of weight data are loaded into the weight storage module.
At the same time, in step S320, input vector data is requested and the n received input vector data are stored in the vector storage module.
As noted above, the values of m and n are usually determined by the actual computing hardware, and those skilled in the art will understand how to set them reasonably so as to obtain the desired computing power from the available hardware resources. For example, the values may be m=4, n=4.
When both the weight storage module and the vector storage module hold data ready for calculation, in step S330 the m sets of weight data and the n input vector data are read from the two modules and sent to the core calculation module.
Next, in step S340, the core calculation module multiplies the received weight data and input vector data, adds each multiplication result to the previous valid result, and completes the multiply-accumulate over the input channels in a pipelined fashion.
In method 300, the core calculation module may include m*n compute cores, so that in step S340 the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
In step S350, the multiply-accumulate result of step S340 is added to the corresponding offset data, all full-connection operations for the input channels of the current calculation are completed, and the result is output to the output register module.
As described above, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
In step S360, the output register module outputs the result data to the target interface.
Then, in step S370, it is determined whether all full-connection operations have been completed. If the determination in step S370 is negative, i.e., some full-connection operations remain, method 300 returns to step S310 and repeats steps S310 through S360. If, on the other hand, the determination in step S370 is affirmative, i.e., all full-connection operations have been completed, method 300 may end.
Two preferred embodiments are described below.
Figure 4 is a schematic diagram of the hardware implementation device according to a first preferred embodiment of the present invention.
As shown in Figure 4, in the first preferred embodiment, the number of input channels participating in the operation is 2048, the batch size is 210, the number of output channels is 30, and each datum is 32 bits wide. First, 4*2048*32 bits of weight data are loaded into the weight storage module; after part of the weight data has been loaded, vector data is requested and loaded into the vector storage module, 4*32 bits at a time. When both the weight data and the vector data are ready, computation starts, using 16 (4*4) DSPs. The computation is fully pipelined: the weight data is shared while the vector data streams in and is processed on the fly. With an interface clock frequency of 300 MHz, the data rate of the vector input interface is 4.8 GB/s, and the computation achieves 9.6 Gops, i.e., high computational efficiency.
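The quoted figures are consistent with simple arithmetic (our own check, not text from the patent):

```python
clock_hz = 300e6    # interface clock frequency
lanes = 4           # vector words loaded per cycle
word_bytes = 4      # 32-bit data
dsps = 16           # 4*4 compute cores

print(lanes * word_bytes * clock_hz / 1e9)  # 4.8 GB/s vector input rate
print(dsps * 2 * clock_hz / 1e9)            # 9.6 Gops (multiply + add per DSP per cycle)
```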
Figure 5 is a schematic diagram of the hardware implementation device according to a second preferred embodiment of the present invention.
As shown in Figure 5, in the second preferred embodiment, if the interface data rate can be doubled, each calculation can use 4 weight data and 8 vector data; alternatively, the weight data buffer can be enlarged while the interface rate stays unchanged, so that 8 weight data and 4 vector data participate in the operation. With 32 (4*8 or 8*4) DSPs participating in the operation simultaneously, the computing power can reach 19.2 Gops (32 DSPs * 300 MHz * 2 operations per multiply-accumulate). The device therefore adapts well to different interface conditions and can realize high-performance computation to a large extent.
In other words, according to the second preferred embodiment, in the device 200 and method 300 of the present invention, the values of m and n may also be m=8, n=4 or m=4, n=8.
Combining the first and second preferred embodiments, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
Those of ordinary skill in the art will recognize that the method of the present invention can be implemented as a computer program. As described above in connection with Figure 3, the method according to the above embodiments may be carried out by one or more programs, including instructions that cause a computer or processor to perform the algorithms described with reference to the drawings. These programs can be stored using various types of non-transitory computer-readable media and provided to a computer or processor. Non-transitory computer-readable media include various types of tangible storage media. Examples of non-transitory computer-readable media include magnetic recording media (such as floppy disks, magnetic tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R, CD-R/W, and semiconductor memories (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, and RAM (random access memory)). Further, these programs can be provided to a computer using various types of transitory computer-readable media. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer-readable medium can provide a program to a computer via a wired communication path, such as electrical wires and optical fibers, or via a wireless communication path.
Therefore, according to the present invention, a computer program or a computer-readable medium can also be provided for recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a hardware implementation method of high-speed full-connection calculation comprising the following operations: (1) loading m sets of weight data into the weight storage module; (2) requesting input vector data and storing the n received input vector data in the vector storage module; (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules and sending them to the core calculation module; (4) the core calculation module multiplying the received weight data and input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate over the input channels in a pipelined fashion; (5) adding the multiply-accumulate result of step (4) to the corresponding offset data, completing all full-connection operations for the input channels of the current calculation, and outputting the result to the output register module; (6) the output register module outputting the result data to the target interface; (7) repeating steps (1) to (6) until all full-connection operations are completed.
According to a preferred embodiment of the present invention, in the computer-readable medium described above, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
Various embodiments and implementations of the present invention have been described above, but the spirit and scope of the present invention are not limited thereto. Those skilled in the art will be able to make further applications according to the teachings of the present invention, and such applications are all within the scope of the present invention.
That is, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, not limitations on its implementation. Those of ordinary skill in the art may make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

  1. A hardware implementation device for high-speed full-connection calculation, comprising:
    a weight storage module for storing the weight data used in the calculation, storing m sets of weight data at a time until the weight calculation for all output channels is completed;
    a vector storage module for storing n input vector data;
    an output register module implementing an output buffer for the calculation results; and
    a core calculation module for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, adding each multiplication result to the previous valid result, adding the corresponding offset value to the multiply-accumulate result, and outputting the final calculation result to the output register module.
  2. The device according to claim 1, wherein ping-pong buffering is used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
  3. The device according to claim 1, wherein the core calculation module includes m*n compute cores, so that the multiplications of the m sets of weight data with the n input vectors are performed simultaneously.
  4. The device according to claim 1, wherein the values of m and n are one of the following:
    m=4, n=4;
    m=8, n=4; or
    m=4, n=8.
  5. A hardware implementation method for high-speed full-connection calculation, comprising:
    (1) loading m sets of weight data into the weight storage module for storage;
    (2) requesting input vector data and storing the n received input vector data in the vector storage module;
    (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules respectively and sending them to the core calculation module;
    (4) multiplying, by the core calculation module, the received weight data by the received input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate operations over the input channels in a pipelined manner;
    (5) adding the corresponding bias data to the multiply-accumulate result of step (4), thereby completing all the full-connection operations for the input channels of the current calculation, and outputting the result to the output register module;
    (6) outputting, by the output register module, the result data to the target interface;
    (7) repeating steps (1) to (6) until all full-connection operations are completed.
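Read as software, steps (1) to (7) of claim 5 amount to the driver loop sketched below, built on the hypothetical fc_tile function from the sketch under claim 1; weight_tiles, vector_batches and emit are names introduced here for illustration, not terms of the claims.

def run_full_connection(weight_tiles, vector_batches, emit):
    # weight_tiles:   sequence of (m sets of weight data, m bias values)  -- step (1)
    # vector_batches: sequence of groups of n input vectors               -- step (2)
    # emit:           callback standing in for the target interface       -- step (6)
    for w_tile, biases in weight_tiles:
        for vectors in vector_batches:
            # steps (3)-(5): read from both stores, multiply-accumulate over the
            # input channels in a pipelined manner, then add the bias values
            results = fc_tile(w_tile, vectors, biases)
            emit(results)
    # step (7): the nested loops repeat until all full-connection operations are done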
  6. The method according to claim 5, wherein ping-pong buffering is used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
  7. The method according to claim 5, wherein the core calculation module comprises m*n calculation cores, so that the multiplication of the m sets of weight data by the n input vectors in step (4) is performed simultaneously.
  8. The method according to claim 5, wherein the values of m and n are one of the following:
    m=4, n=4;
    m=8, n=4; or
    m=4, n=8.
  9. A computer-readable medium recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a hardware implementation method for high-speed full-connection calculation, comprising the following operations:
    (1) loading m sets of weight data into the weight storage module for storage;
    (2) requesting input vector data and storing the n received input vector data in the vector storage module;
    (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules respectively and sending them to the core calculation module;
    (4) multiplying, by the core calculation module, the received weight data by the received input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate operations over the input channels in a pipelined manner;
    (5) adding the corresponding bias data to the multiply-accumulate result of step (4), thereby completing all the full-connection operations for the input channels of the current calculation, and outputting the result to the output register module;
    (6) outputting, by the output register module, the result data to the target interface;
    (7) repeating steps (1) to (6) until all full-connection operations are completed.
  10. The computer-readable medium according to claim 9, wherein the values of m and n are one of the following:
    m=4, n=4;
    m=8, n=4; or
    m=4, n=8.
PCT/CN2018/080600 (published as WO2019085378A1): Hardware implementation device and method for high-speed full-connection calculation; priority date 2017-10-30, filing date 2018-03-27.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711035020.2 2017-10-30
CN201711035020.2A CN109740749A (en) 2017-10-30 2017-10-30 Hardware implementation device and method for high-speed full-connection calculation

Publications (1)

Publication Number Publication Date
WO2019085378A1 true WO2019085378A1 (en) 2019-05-09

Family

ID=66332784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/080600 WO2019085378A1 (en) 2017-10-30 2018-03-27 Hardware implementation device and method for high-speed full-connection calculation

Country Status (2)

Country Link
CN (1) CN109740749A (en)
WO (1) WO2019085378A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106066783A * 2016-06-02 2016-11-02 华为技术有限公司 Neural network forward operation hardware structure based on power weight quantization
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks
CN106940815A * 2017-02-13 2017-07-11 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN107239824A * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for implementing a sparse convolutional neural network accelerator
CN107273969A * 2017-05-11 2017-10-20 西安交通大学 Parameterizable and scalable multilayer interconnection structure for neural network fully-connected layers

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN103559019A * 2013-11-08 2014-02-05 上海航天测控通信研究所 Universal floating-point fully-pipelined FFT (Fast Fourier Transform) IP core

Also Published As

Publication number Publication date
CN109740749A (en) 2019-05-10

Legal Events

Code  Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18874513; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 18874513; Country of ref document: EP; Kind code of ref document: A1)