WO2020211654A1 - Parallel computing method and computing device based on a line buffer (Linebuffer) - Google Patents

Parallel computing method and computing device based on a line buffer (Linebuffer) Download PDF

Info

Publication number
WO2020211654A1
Authority
WO
WIPO (PCT)
Prior art keywords
template
calculation
data
linebuffer
computing
Prior art date
Application number
PCT/CN2020/082960
Other languages
English (en)
French (fr)
Inventor
张伟豪
李涵
王封
丁瑞强
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司 filed Critical 北京灵汐科技有限公司
Publication of WO2020211654A1 publication Critical patent/WO2020211654A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The invention relates to the field of convolutional neural networks, and in particular to a parallel computing method and computing device based on a line buffer (Linebuffer).
  • Each time a convolutional neural network performs a calculation, it must supply the computing unit with the data required for that calculation.
  • Traditional approaches either store all input data in on-chip storage or continuously access off-chip storage to fetch input data. The first approach increases on-chip storage pressure, while the second increases I/O access pressure.
  • A Linebuffer structure is therefore generally used to buffer intermediate data on chip, but a traditional Linebuffer does not support parallel, synchronized execution by its consumers.
  • the present invention provides a parallel computing method and computing device based on line buffer Linebuffer that overcomes or at least partially solves the above-mentioned problems.
  • a parallel calculation method based on line buffer Linebuffer is provided, which is applied to a template calculation structure, and the method includes:
  • the template data is simultaneously transmitted to the multiple computing units through the preset template of the line buffer, and each computing unit processes its own computing tasks in parallel.
  • With this method, the preset template of the line buffer (Linebuffer) obtains, at the same time, the template data required by multiple calculation units to perform their calculations, so that the multiple calculation units can compute synchronously. Compared with the traditional solution, this is more computationally efficient and faster.
  • the method includes:
  • The template data is transmitted simultaneously to the multiple computing units through the preset template of the line buffer, and each computing unit processes the tasks of the network layer in parallel, wherein the template data is the original template data defined by the template parameters.
  • One or more network layers can be selected from all network layers according to the calculation amount of each layer, and the preset template of the Linebuffer is constructed from the template parameters of the selected network layer and the number of calculation units.
  • Based on this preset template, the template data is transmitted to multiple computing units at the same time and each computing unit processes the tasks of the network layer in parallel, giving higher calculation efficiency and faster calculation speed than traditional solutions.
  • the preset template of the Linebuffer is composed of multiple original templates of a specified size, and the number of the original templates is equal to the number of the calculation units;
  • the multiple original templates are sequentially connected in the preset template, and at least partially overlap.
  • the size of the template is enlarged to obtain the data required by each computing unit at the same time, thereby realizing the parallel calculation of multiple computing units.
  • each computing unit can obtain the required processing data at the same time, thereby realizing parallel computing of multiple computing units.
  • the simultaneous transmission of template data to the multiple computing units through a preset template of the line buffer, and parallel processing of the tasks of the network layer by each computing unit includes:
  • the template data of each original template is simultaneously transmitted to the multiple computing units through the multiple original templates of the line buffer, and each computing unit processes the tasks of the network layer in parallel.
  • Simultaneously transmitting the template data of each original template to the multiple computing units through the multiple original templates of the line buffer, with each computing unit processing the tasks of the network layer in parallel, includes:
  • The multiple original templates are moved continuously in the specified direction by a preset step size; after each move, the new template data required by each computing unit for its current convolution operation is acquired simultaneously and transmitted to the corresponding calculation unit, until all of the data blocks have been read.
  • the method further includes:
  • the new template data is stored in the preset data buffer.
  • The Linebuffer buffer continuously reads the data produced by the previous data layer and can send multiple sets of template data at the same time, so that multiple consumers can compute in parallel simultaneously; this reduces the time the templates spend acquiring data and thereby improves computing efficiency.
  • When the data buffer is full, the preset template moves by the preset step size.
  • The preset step size is p × stride_x, where p is the number of calculation units and stride_x is the horizontal step size of the original template.
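  • As a minimal sketch of this relationship (not taken from the patent; the function name and parameter names are hypothetical), the width and step size of the enlarged preset template can be derived from the original template width k_x, its horizontal stride stride_x and the number of calculation units p:

```python
# Minimal sketch (hypothetical helper): sizes of the enlarged "preset
# template" formed by p consecutive original templates in the horizontal
# direction, assuming all original templates share the same width k_x and
# horizontal stride stride_x.
def preset_template_geometry(k_x: int, stride_x: int, p: int):
    # p consecutive original templates, each shifted by stride_x, overlap
    # whenever stride_x < k_x; together they span this many columns:
    preset_width = k_x + (p - 1) * stride_x
    # After serving p consumers at once, the whole preset template advances by:
    preset_step = p * stride_x
    return preset_width, preset_step

# Example matching the running example below: 3x3 original template,
# stride 1, three calculation units.
print(preset_template_geometry(k_x=3, stride_x=1, p=3))  # -> (5, 3)
```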
  • the Linebuffer is implemented by a set of registers
  • Each original template in the preset template comprises a plurality of registers, which read, from the data blocks of the input feature map, the template data required for each template calculation and write that data to the corresponding calculation unit.
  • a computing device including:
  • the processor is configured to execute the parallel calculation method based on the line buffer described in any one of the above.
  • the computing device further includes:
  • the storage device is used to store a computer program, which is loaded and executed by the processor when the computer program is running in the computing device.
  • The present invention provides a more efficient synchronous calculation method based on a line buffer (Linebuffer). After the network layer that needs to perform parallel calculation is determined, multiple calculation units are allocated to it, and a preset template of the Linebuffer is constructed from the template parameters of that network layer and the number of calculation units. Through this preset template, the template data is transmitted to the multiple computing units at the same time, and the multiple computing units then perform their calculations synchronously, which is more efficient and faster than the traditional solution.
  • Figure 1 shows a schematic diagram of the working principle of Linebuffer
  • Figure 2 shows a schematic diagram of convolution calculation based on Linebuffer
  • Figure 3 shows a schematic diagram of Linebuffer storing intermediate calculation results between layers of a convolutional neural network
  • Figure 4 shows a schematic diagram of Linebuffer implementation using a shift register
  • FIG. 5 shows a schematic diagram of line-wrapping of the Linebuffer shown in FIG. 4;
  • Figure 6 shows a schematic diagram of splitting the neural network layer
  • FIG. 7 shows a schematic diagram of template calculation to which each calculation unit shown in FIG. 6 is allocated
  • Figure 8 shows a schematic diagram of the calculation time of each traditional calculation unit
  • FIG. 9 shows a schematic flowchart of a parallel computing method based on Linebuffer according to an embodiment of the present invention.
  • Figure 10 shows a schematic diagram of the original Linebuffer template
  • Figure 11 shows a schematic diagram of a preset template based on multiple original templates
  • FIG. 12 shows a schematic diagram of buffer setting in the first embodiment
  • FIG. 13 shows a schematic diagram of synchronous calculation time of each calculation unit in the first embodiment
  • FIG. 14 shows a schematic diagram of buffer setting in the second embodiment
  • Figure 15 shows a schematic diagram of the synchronous Linebuffer work
  • Figure 16 shows a schematic diagram of synchronous Linebuffer movement
  • Figure 17 shows a schematic diagram of the synchronous Linebuffer line break work.
  • A Linebuffer, also called a line buffer, is a technique widely used in template (stencil) computing, and template computing is in turn widely used in image processing, artificial intelligence and other fields.
  • Linebuffer can reduce the number of memory accesses and reduce on-chip storage, which is a common structure in pipeline template calculation.
  • the convolution operation in the convolutional neural network is also a kind of template calculation, so Linebuffer technology is often used in some convolution accelerator architectures, which makes the Linebuffer technology once again widely used in recent years.
  • Figure 1 shows a schematic diagram of the working principle of Linebuffer.
  • In the figure, the size of the input feature map is 5 × 5, and a template (sliding window) slides continuously over the input feature map; each time it slides, the data covered by the template undergoes one template calculation.
  • The non-white part (01-21) in Figure 1 represents the data stored in the Linebuffer, and the dark gray part (01, 10, 11, 12, 21) is the template for this template calculation, that is, the input data involved in this calculation.
  • For each template calculation, the Linebuffer must provide the calculation unit with the data needed for that calculation. After completing a template calculation, the Linebuffer is updated: new data is read in and data that will not be reused is discarded. In this example, after the computing unit completes the first calculation, the template moves horizontally with a step size of 1, and the Linebuffer discards data 01 and reads in data 22.
  • FIG. 1 shows one example of template calculation in which the template is cross-shaped. In practice, the template can have any shape; in typical convolutional neural networks the template is preferably rectangular.
  • Figure 2 shows a traditional Linebuffer-based convolution calculation schematic diagram, where the template size is 3 ⁇ 3, and the step size is 1.
  • Linebuffer is often used as a buffer between layers to preserve intermediate results between layers with minimal storage costs.
  • the front and back layers often adopt the producer and consumer model, that is, after the front layer calculates all the data required for one calculation of the later layer, the latter layer immediately starts a calculation. Therefore, the Linebuffer will send the template data to the subsequent network layer after receiving all the data required for this calculation, and the subsequent network layer will start the calculation.
  • In the embodiment shown in FIG. 3, the Linebuffer mainly implements the data transfer between Layer0 (network layer 0) and Layer1 (network layer 1), between Layer1 and Layer2 (network layer 2), and between Layer2 and Layer3 (network layer 3).
  • the Linebuffer can be implemented using a section of memory, or through a set of registers.
  • Figure 4 shows a schematic diagram of building a Linebuffer from shift registers, taking the Linebuffer of Figure 2 as an example. Within a row, for every template computed, the registers shift left once along the black line (by the template's horizontal step size): register R00 discards one value and register R22 reads in a new one, and registers R00 to R22 then output the data contained in the current template. Each time a new row is started, the registers shift left by 3 positions (the horizontal width of the template) and three new values are read in, as shown in Figure 5.
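  • The following behavioural sketch is an illustrative assumption rather than the patent's implementation: it models buffered rows instead of individual shift registers, and shows how a line buffer keeps only the last k_y rows of the input on chip while still emitting one k_x × k_y window per template calculation:

```python
from collections import deque

# Behavioural sketch (assumption: rectangular k_y x k_x template, horizontal
# stride 1): a line buffer keeps only the last k_y rows of the input feature
# map on chip and emits one window per template calculation, instead of
# storing the whole map or re-reading it from off-chip memory.
def linebuffer_windows(feature_map, k_y=3, k_x=3):
    rows = deque(maxlen=k_y)            # on-chip storage: k_y rows only
    for row in feature_map:
        rows.append(row)
        if len(rows) < k_y:
            continue                    # not enough rows buffered yet
        for x in range(len(row) - k_x + 1):
            # the k_y x k_x window handed to the (single) calculation unit
            yield [r[x:x + k_x] for r in rows]

fmap = [[10 * y + x for x in range(5)] for y in range(5)]
print(sum(1 for _ in linebuffer_windows(fmap)))  # 9 windows for a 5x5 input
```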
  • the calculation amount of different layers may be very different, so that the slow calculation layer often needs to wait for the calculation of the previous layer, which forms the bottleneck of the entire network calculation.
  • In this case, the three calculation units share the convolution calculations of Layer1 equally. Assuming the work is distributed as shown in Figure 7, Layer1 performs 9 template calculations in total and each calculation unit is responsible for three of them. Denote each template calculation as stencil[i][j] and the data it requires as data[i][j].
  • After Layer1 is split, the Linebuffer must provide data to each computing unit; the first row is used below to illustrate this process.
  • Once Layer0 has produced data 00 through data 22, the Linebuffer sends data 00-22, i.e. data[0][0], to calculation unit 1, which starts computing stencil[0][0].
  • the calculation unit 2 and the calculation unit 3 cannot start the calculation because the data 23 and the data 24 have not been calculated by Layer 0 yet.
  • When Layer0 completes another template calculation and the Linebuffer receives data 23, the Linebuffer is updated once and data[0][1] is sent to calculation unit 2, which starts computing stencil[0][1].
  • Likewise, after the Linebuffer obtains data 24, it sends data[0][2] to calculation unit 3, which starts computing stencil[0][2]. Clearly, the three computing units cannot start computing at the same time, i.e. they cannot be synchronized.
  • Assume Layer0 takes time S_0 to compute one template and produce one value, and Layer1 takes time S_1 to compute one template, ignoring the time the Linebuffer spends reading, updating and sending data. The process is shown in Figure 8.
  • Calculation unit 2 must wait S_0 before it can start, and calculation unit 3 must wait 2 × S_0.
  • The calculations of the three computing units are therefore not synchronized, and will remain unsynchronized in all subsequent calculations. If the underlying hardware architecture is a strongly synchronized architecture, this lack of synchronization greatly complicates algorithm scheduling, and the hardware architecture may not support such operation at all.
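  • A rough timing sketch of this skew, under the same simplifying assumption that the Linebuffer's own read, update and send time is negligible (the variable names are illustrative):

```python
# Rough timing sketch of the start-time skew described above (assumption:
# the Linebuffer's own read/update/send time is ignored, as in the text).
S0 = 1.0   # time for Layer0 to compute one template / produce one value
p = 3      # number of calculation units assigned to Layer1

# Traditional Linebuffer: unit i can only start after i extra Layer0 results.
traditional_start = [i * S0 for i in range(p)]       # [0, S0, 2*S0]
# Synchronous Linebuffer: all units receive their template data together.
synchronous_start = [0.0] * p
print(traditional_start, synchronous_start)
```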
  • the embodiment of the present invention provides a parallel calculation method based on the line buffer Linebuffer, which can be applied to template calculation, so that the Linebuffer has the ability of synchronous adjustment, thereby enabling the consumers of the Linebuffer to perform synchronous calculation.
  • The specific method may include: first, determining the template calculation object; second, constructing a preset template of the line buffer (Linebuffer) according to the template parameters of the template calculation object and the number of calculation units; and finally, transmitting the template data simultaneously to the multiple computing units through the preset template of the Linebuffer, with each computing unit processing its own computing tasks in parallel.
  • the parallel calculation method based on Linebuffer provided by the embodiment of the present invention may include:
  • Step S901: determine the network layer to be processed in parallel.
  • Step S902: allocate multiple computing units to the network layer. Taking the example shown in FIG. 6, three calculation units are allocated to Layer1, namely calculation unit 1, calculation unit 2 and calculation unit 3.
  • one or more network layers can be selected from all network layers according to the calculation amount of each network layer, which is not limited by the present invention.
  • Step S903: a preset template of the line buffer (Linebuffer) is constructed according to the template parameters of the network layer and the number of calculation units.
  • Step S904: the template data is transmitted simultaneously to the multiple computing units through the preset template of the line buffer, and each computing unit processes the tasks of the network layer in parallel, wherein the template data is the original template data defined by the template parameters.
  • FIG. 10 shows a schematic diagram of the traditional Linebuffer technology, which expands the input feature map on the basis of FIG. 6.
  • An analysis of FIG. 10 with reference to FIG. 6 shows that the traditional solution uses a 3 ⁇ 3 template for each computing unit, and the step size is 1.
  • the template data can be simultaneously transmitted to multiple computing units allocated to the network layer requiring parallel processing through the preset template of the line buffer Linebuffer.
  • The preset template in the embodiment of the present invention is composed of multiple original templates of a specified size, and the template data required by a calculation unit to perform its convolution operation lies within an original template.
  • the number of original templates is equal to the number of computing units, and multiple original templates are sequentially connected in the preset template and at least partially overlapped.
  • the size of the original template may be the same or different, which is not limited in the present invention.
  • In other words, the traditional solution uses a single original template to fetch, one after another, the data required by each computing unit to perform its convolution calculations.
  • In the embodiment of the present invention, multiple original templates are combined into one enlarged preset template, so that every computing unit obtains the data it needs at the same time, thereby enabling parallel computation by the multiple computing units.
  • In practice, the Linebuffer's calculation template is preferably expanded in the horizontal direction first: if the parallelism of the Linebuffer's consumers is p, then p consecutive templates in the horizontal direction form one new template, called the large template (i.e. the preset template above), whose horizontal step size is p × stride_x. FIG. 11 illustrates this for p = 3, where three consecutive, partially overlapping original templates form one large template.
  • step S902 may further include: simultaneously transmitting the template data of each original template to the multiple computing units through the multiple original templates of the line buffer Linebuffer, and each computing unit processes the tasks of the network layer in parallel .
  • S902-1: divide the input feature map of the convolutional neural network evenly into multiple data blocks in advance, for example the 8 × 6 grid of data blocks shown in FIG. 11;
  • S902-2: use the multiple original templates to simultaneously obtain, from these data blocks, the template data each calculation unit needs for its convolution operation, and transmit the obtained template data to the corresponding calculation unit for calculation.
  • During operation, the Linebuffer derives p sets of original-template data from the data contained in the large template and sends them to the p calculation units at the same time for template calculation.
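  • The slicing of one large-template window into the p overlapping original templates can be sketched as follows (a hypothetical helper, not code from the patent):

```python
# Sketch (hypothetical helper): slice one "large template" window into the
# p overlapping original templates that are sent to the p calculation units
# simultaneously. Assumes rectangular templates and a shared horizontal stride.
def split_preset_template(window, k_x=3, stride_x=1, p=3):
    # window: list of rows, each row k_x + (p - 1) * stride_x values wide
    return [[row[i * stride_x: i * stride_x + k_x] for row in window]
            for i in range(p)]

big = [[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9],
       [10, 11, 12, 13, 14]]          # 3 x 5 large-template contents
for unit, tpl in enumerate(split_preset_template(big), start=1):
    print("calculation unit", unit, tpl)
```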
  • After step S902-2, the method may further include S902-3: moving the multiple original templates continuously in the specified direction by the preset step size and, after each move, simultaneously obtaining from the data blocks the new template data required by each calculation unit for its current convolution operation and transmitting it to the corresponding calculation unit, until all the data blocks have been read.
  • In an optional embodiment, while the template data currently required by the multiple calculation units is being fetched, the new template data they will need for the next template calculation is obtained from the input feature map and stored in a preset data buffer; when the data buffer is full, the preset template moves by the preset step size.
  • The hatched part of Figure 12 (data blocks 25, 26 and 27) is the buffer.
  • That is, the embodiment of the present invention can also add p × stride_x buffer entries at the end of the Linebuffer.
  • While the templates in the preset template are fetching the data required by each calculation unit for its convolution calculation, the Linebuffer buffer continuously reads the data produced by the preceding data layer, reducing the time the templates spend acquiring data and thereby improving calculation efficiency.
  • When the Linebuffer buffer is full, the Linebuffer can perform one move of the preset template comprising the multiple original templates.
  • A Linebuffer extended with this buffer can send multiple sets of template data at the same time, so that its multiple consumers can compute in parallel simultaneously.
  • As shown in Figure 13, when the Linebuffer has received all the data of the first template, it sends three sets of template data to the three calculation units simultaneously, and the three calculation units can start computing at the same time. The Linebuffer then keeps receiving the data produced by Layer0 and stores it in the buffer; when the buffer is full, the Linebuffer sends the next three sets of template data, and each calculation unit immediately starts the next round of calculation on receiving its template data.
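  • A minimal single-row sketch of this synchronous behaviour, assuming a horizontal stride of 1, three calculation units and 3-wide original templates (the function and variable names are illustrative, not from the patent):

```python
# Minimal single-row sketch of the synchronous behaviour described above
# (assumptions: horizontal stride 1, p = 3 units, 3-wide original templates).
def synchronous_row(row, k_x=3, stride_x=1, p=3):
    preset_width = k_x + (p - 1) * stride_x
    buf_size = p * stride_x
    window = row[:preset_width]          # registers holding the large template
    pos, buffer = preset_width, []
    while True:
        # send p original templates to the p consumers at the same time
        yield [window[i * stride_x: i * stride_x + k_x] for i in range(p)]
        # meanwhile the buffer keeps reading values produced by the front layer
        while len(buffer) < buf_size and pos < len(row):
            buffer.append(row[pos]); pos += 1
        if len(buffer) < buf_size:
            return                       # row exhausted
        # buffer full: shift the whole large template by p * stride_x
        window = window[buf_size:] + buffer
        buffer = []

row0 = list(range(8))                    # 8 columns, as in FIG. 12
for batch in synchronous_row(row0):
    print(batch)                         # one batch = 3 templates sent at once
```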
  • In practice, the underlying hardware sometimes only offers relatively coarse-grained control, for example control at row granularity; in that case the Linebuffer adopts a row-pipelining technique.
  • When the control granularity is multiple rows, i.e. in the multi-row pipelined case, the line buffer still supports the parallel method described here.
  • The Linebuffer's buffer then becomes stride_y rows, i.e. it buffers stride_y rows of data, as shown in FIG. 14.
  • Optionally, when performing calculations based on the Linebuffer, the Linebuffer is implemented by a set of registers; each original template in the preset template comprises multiple registers that read the data blocks of the input feature map and write the template data required for each template calculation into the corresponding calculation unit.
  • one register can correspondingly read the data of one data block.
  • As shown in Figure 15, registers R00 to R24 are similar to those in Figure 4 and Figure 5: they hold three 3 × 3 templates, which are sent to the three calculation units respectively. While the calculation units compute these templates, the synchronous Linebuffer continuously obtains new template data through read-in port 2 for the calculation units' subsequent template calculations, and continuously reads new data (data 25, 26 and 27 in Figure 12) through read-in port 1, storing it in the buffer.
  • the buffer is composed of three shift registers B00, B01, and B02.
  • The write controller continuously steers the data arriving from read-in port 1, writing it cyclically into B00, B01, B02, B00, B01, B02, and so on.
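  • A tiny sketch of such a cyclic write controller (an assumption about its behaviour, not the patent's circuit): values arriving on read-in port 1 are written into B00, B01 and B02 in round-robin order:

```python
# Tiny sketch of the cyclic write controller assumed above: values arriving
# on read-in port 1 are written into the buffer registers B00, B01, B02 in
# round-robin order.
import itertools

buffers = {"B00": None, "B01": None, "B02": None}
targets = itertools.cycle(["B00", "B01", "B02"])
for value in [25, 26, 27]:               # data read in while the units compute
    buffers[next(targets)] = value
print(buffers)                            # {'B00': 25, 'B01': 26, 'B02': 27}
```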
  • When the buffer is full, the Linebuffer can perform one move of the large template.
  • At that point, all shift registers in the Linebuffer (including the buffer) shift left by 3 positions, and the state of the Linebuffer becomes as shown in Figure 16.
  • Registers R00 to R24 then send new templates to the calculation units, and buffers B00, B01 and B02 wait to read in the new data 30, 31 and 32.
  • Eventually the Linebuffer reaches the position where it needs to wrap to the next row.
  • The Linebuffer then performs a line-feed operation: all registers move 3 positions to the left and 3 new values are read in.
  • The Linebuffer then reaches the state shown in Figure 17, with buffers B00, B01 and B02 waiting to read in the new data 35, 36 and 37.
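  • The overall scan order implied by these moves and line feeds can be sketched as follows (a hypothetical enumeration of the large template's top-left positions, assuming rectangular templates and the strides shown):

```python
# Hypothetical sketch of the scan order implied above: the large template
# sweeps each row in steps of p * stride_x and, on wrapping, moves down by
# stride_y and restarts from the left edge.
def preset_template_positions(width, height, k_x=3, k_y=3,
                              stride_x=1, stride_y=1, p=3):
    preset_w = k_x + (p - 1) * stride_x
    for y in range(0, height - k_y + 1, stride_y):
        for x in range(0, width - preset_w + 1, p * stride_x):
            yield (y, x)

# 8 x 6 input as in FIG. 11: two large-template positions per row, four rows.
print(list(preset_template_positions(width=8, height=6)))
```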
  • an embodiment of the present invention also provides a computing device, including a processor, configured to execute the parallel computing method based on the line buffer in any of the above embodiments.
  • the computing device may further include: a storage device for storing a computer program, which is loaded and executed by the processor when the computer program runs in the computing device.
  • the embodiment of the present invention provides a more efficient synchronization calculation method based on line buffer Linebuffer.
  • For a neural network, the network layer that needs parallel computation is first determined and multiple calculation units are allocated to it; a preset template of the line buffer (Linebuffer) is then constructed from that network layer's template parameters and the number of calculation units, the template data is transmitted through this preset template to the multiple calculation units at the same time, and the calculations are then performed in parallel by the multiple calculation units.
  • the method provided by the embodiment of the present invention can be implemented on most common storage architectures, such as a register bank or RAM.
  • This Linebuffer-based synchronous calculation method solves the problem that neural-network algorithms, multi-step image-processing algorithms and similar algorithms become unsynchronized after being split for parallel execution. The synchronous Linebuffer can therefore be widely applied to hardware architectures such as many-core neural-network accelerator architectures and many-core image-processor architectures, and is especially suitable for hardware architectures that require strong synchronization.
  • The modules or units or components in the embodiments may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

Abstract

A parallel computing method and computing device based on a line buffer (Linebuffer), applied to a template computing structure. The method includes: determining a template calculation object; constructing a preset template of the line buffer (Linebuffer) according to the template parameters of the template calculation object and the number of calculation units; and transmitting template data simultaneously to the multiple calculation units through the preset template of the line buffer, each calculation unit processing its own calculation task in parallel. With this method, the preset template of the line buffer simultaneously obtains the template data required by the multiple calculation units to perform their calculations, so that the multiple calculation units execute the calculations synchronously, with higher calculation efficiency and greater speed.

Description

一种基于行缓冲Linebuffer的并行计算方法及计算设备 技术领域
本发明涉及卷积神经领域,特别是涉及一种基于行缓冲Linebuffer的并行计算方法及计算设备。
背景技术
近年来,随着人工智能的发展,卷积神经网络得到越来越多的应用,专为卷积神经网络设计加速器架构也不断涌现。
目前,卷积神经网络每次执行计算时需要为计算单元提供每次计算所需要的数据,而传统方式要么将全部输入数据存入片上存储,要么不断访问片外存储获取输入数据,采用第一种方式将增大片上存储压力,而采用第二种方式则会增大I/O访存压力。这时,一般会利用一种Linebuffer结构实现片上中间数据的缓冲,但是传统Linebuffer并不支持消费者的并行同步执行。
发明内容
鉴于上述问题,本发明提供了一种克服上述问题或至少部分地解决了上述问题的一种基于行缓冲Linebuffer的并行计算方法及计算设备。
根据本发明的一个方面,提供了一种基于行缓冲Linebuffer的并行计算方法,应用于模板计算结构,所述方法包括:
确定模板计算对象;
根据所述模板计算对象的模板参数和计算单元的个数建行缓冲Linebuffer的预设模板;
通过所述行缓冲Linebuffer的预设模板将模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理各自的计算任务。根据模板计算对象的模板参数和计算单元的个数建行缓冲Linebuffer的预设模板可以同时获取多个计算单元执行计算所需要的模板数据,进而由多个计算单元同步执行计算,相较于传统的方案来讲计算效率更高,且速度更快。
可选地,所述模板计算对象为卷积神经网络时,所述方法包括:
确定需并行处理的网络层;
为所述网络层分配多个计算单元;
根据所述网络层的模板参数与计算单元的个数构建行缓冲Linebuffer的预设模板;
通过所述行缓冲Linebuffer的预设模板将模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理所述网络层的任务,其中,所述模板数据是所述模板参数限定的原模板数据。
在卷积神经网络中选取需并行处理的网络层时,可以根据各网络层的计算量在所有网络层中选取一层或多层网络层,通过所选取网络层的模板参数与计算单元的个数构建行缓冲Linebuffer的预设模板,进而基于该预设模板模板数据同时传输至多个计算单元,由每一个计算单元并行处理所述网络层的任务,相较于传统的方案来讲具备更高的计算效率和更快的计算速度。
可选地,所述Linebuffer的预设模板由多个指定大小的原模板组成,且所述原模板的个数与所述计算单元的数量相等;
其中,所述多个原模板在所述预设模板中依次连接,且至少部分重叠。通过将多个原模板组合到一起扩大模板的大小,以同时获取各计算单元所需数据,进而实现多个计算单元的并行计算。通过将多个原模板组合到一起构成一个扩大的预设模板,以各计算单元同时获取到所需处理数据,进而实现多个计算单元的并行计算。
可选地,所述通过所述行缓冲Linebuffer的预设模板将模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理所述网络层的任务,包括:
通过所述行缓冲Linebuffer的多个原模板将各原模板的模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理所述网络层的任务。
可选地,所述通过所述行缓冲Linebuffer的多个原模板将各原模板的模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理所述网络层的任务,包括:
预先将所述卷积神经网络的输入特征图平均划分为多个数据图块;
利用所述多个原模板基于所述多个数据图块同时获取每个计算单元执行卷积运算所需的模板数据,并将所获取的模板数据传输至对应的计算单元;
将所述多个原模板按照指定方向连续移动预设步长,并在所述多个原模板每次移动后同时获取每个计算单元当前执行卷积运算所需的新模板数据,将所述新模板数据传输至对应的计算单元,直到所述多个数据图块全部读取完成。
可选地,所述方法还包括:
在获取所述多个计算单元执行当前所需的模板数据时,基于所述输入特征图获取所述多个计算单元执行下一次模板计算所需的新模板数据;
将所述新模板数据存入预设的数据缓冲区。当预设模板中的多个模板获取各计算单元执行卷积计算所需的数据的同时,Linebuffer缓冲区不断读取上一层数据层产生的数据,可以做到同时发送多个模板数据,使其多个消费者可以同时并行计算,以减少模板获取数据的时间,进而提升计算效率。
可选地,当所述数据缓冲区读满时,所述预设模板移动预设步长。
可选地,所述预设步长为p×stride x,其中,p表示所述计算单元的数量;stride x表示所述原模板的水平方向步长。
可选地,所述Linebuffer通过一组寄存器实现;
所述预设模板中的各原模板均包括多个寄存器,以基于所述输入特征图中的数据图块读取并将每次执行模板计算所需的模板数据写入所述计算单元每次执行模板计算所需的模板数据。
根据本发明的另一个方面,还提供了一种计算设备,包括:
处理器,用于执行通过上述任一项所述的基于行缓冲Linebuffer的并行计算方法。
可选地,所述计算设备还包括:
存储设备,用于存储计算机程序,所述计算机程序在所述计算设备中运行时由所述处理器加载并执行。
本发明提供了一种更加高效的基于行缓冲Linebuffer的同步计算方法,确定需要执行并行计算的网络层之后会为其分配多个计算单元,并根据该网络层的模板参数与计算单元的个数构建行缓冲Linebuffer的预设模板,通过该预设模板将模板数据同时传输至多个计算单元,进而由多个计算单元同步执行计算,相较于传统的方案来讲计算效率更高,且速度更快。
上述说明仅是本发明技术方案的概述,为了能够更清楚了解本发明的技术手段,而可依照说明书的内容予以实施,并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂,以下特举本发明的具体实施方式。
根据下文结合附图对本发明具体实施例的详细描述,本领域技术人员将会更加明了本发明的上述以及其他目的、优点和特征。
附图说明
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本发明的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:
图1示出了Linebuffer的工作原理示意图;
图2示出了基于Linebuffer的卷积计算示意图;
图3示出了Linebuffer在卷积神经网络的层之间保存中间计算结果示意图;
图4示出了利用移位寄存器的Linebuffer实现示意图;
图5示出了图4所示Linebuffer的换行示意图;
图6示出了对神经网络层进行拆分示意图;
图7示出了图6所示各计算单元分配到的模板计算示意图;
图8示出了传统各计算单元计算时间示意图;
图9示出了本发明实施例的基于Linebuffer的并行计算方法流程示意图;
图10示出了Linebuffer原模板示意图;
图11示出了基于多个原模板组成预设模板的示意图;
图12示出了实施例一的缓冲区设置示意图;
图13示出了实施例一的各计算单元同步计算时间示意图;
图14示出了实施例二的缓冲区设置示意图;
图15示出了同步Linebuffer工作示意图;
图16示出了同步Linebuffer移动工作示意图;以及
图17示出了同步Linebuffer换行工作示意图。
具体实施方式
下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。
Linebuffer,也叫行缓冲,是一种广泛应用于模板计算的技术,而图像处理、人工智能等领域都大量使用了模板计算。一般来说,Linebuffer可以减少访存次数,减少片上存储,是在流水式模板计算中较为常见的结构。而对于卷积神经网络中的卷积运算也是一种的模板计算,所以一些卷积加速器架构中也经常使用Linebuffer技术,这使得Linebuffer这一技术在近年来再次被大量 应用。
图1展示了Linebuffer的工作原理示意图。图中,输入特征图大小为5×5,一个模板(划窗)不断在输入特征图上滑动。每一次滑动,模板所包含数据都会进行一次模板计算。图1中非白色部分(01~21)表示Linebuffer所存储的数据,其中深灰色部分(01、10、11、12、21)为此次模板计算的模板,即本次计算所涉及的输入数据。每次模板计算时,Linebuffer需要为计算单元提供此次计算需要的数据。在完成一次模板计算后,Linebuffer需要进行更新,需要读入新的数据并舍弃不会再次利用的数据。在本例中,计算单元完成第一次计算后,以步长为1向水平方向移动,Linebuffer抛弃数据01,读入数据22。如果不使用Linebuffer,则要么将全部输入数据存入片上存储,这将增大片上存储压力,要么不断访问片外存储获取输入数据,增大I/O访存压力。使用Linebuffer,极大的减少了片上存储压力或外部访存压力。图1展示了一种模板计算的示例中的模板为十字形,实际应用中模板可以是任意形状的。在一般的卷积神经网络中,模板的形状优选为矩形。
图2示出了一种传统的基于Linebuffer的卷积计算示意图,其中模板大小为3×3,步长为1。在流水线式的模板计算,比如卷积神经网络计算中,Linebuffer往往作为层与层之间的缓冲,以在最小存储代价的情况下保存层之间的中间结果。在流水过程中,前层与后层往往采用生产者、消费者模式,即前层计算得出后层一次计算所需的所有数据后,后层立刻开始一次计算。所以Linebuffer会在收到此次计算所需的所有数据后,将模板数据发送给之后的网络层,由之后的网络层开始计算。图3所示实施例中,Linebuffer主要实现Layer0(网络层0)~Layer1(网络层1)、Layer1(网络层1)~Layer2(网络层2)、Layer2(网络层2)~Layer3(网络层3)之间的数据传递。
在硬件上,Linebuffer可以使用一段内存实现,也可以通过一组寄存器实现,图4示出了通过移位寄存器构建Linebuffer的示意图。以图2的Linebuffer为例,其中一行内每运行一个模板,寄存器按照黑线向左移位1次(模板在水平方向的步长大小),参见图4,寄存器R00舍弃一个数据,寄存器R22读入新的数据。寄存器R00至R22会输出此模板包含的数据。每换一行,寄存器就向左移位3位(模板在水平方向的宽度),并读入三个新数字,如图5所示。
类似卷积神经网络的流水线模板计算中,不同层的计算量可能有很大差异,这样造成计算慢的层往往需要等待前层的计算,形成整个网络计算的瓶颈。
这种情况下,可以对计算慢的层进行并行。以图3中的Layer0,Layer1 为例,假设Layer0没有并行,在一个计算单元上计算,Layer1分裂成三份,即分为计算单元1、计算单元2和计算单元3,在三个计算单元上并行计算,如图6所示。
在这种情况下,三个计算单元将平均分配Layer1的卷积计算。假设三个计算单元的计算分配如图7所示,即Layer1共需进行9次模板计算,每个计算单元负责三次模板计算。记每个模板计算为stencil[i][j],其所需的数据记为data[i][j]。
Layer1拆分后,Linebuffer需要为每个计算单元提供数据,以第一行为例展示这个过程。当Layer0计算得出数据00到数据22后,Linebuffer发送layer0计算得到的数据00-22即data[0][0]到计算单元1,并开始计算stencil[0][0]。但此时,计算单元2与计算单元3并不能开始计算,因为数据23与数据24仍没有被Layer0计算得出。当Layer0又完成一个模板的计算,Linebuffer得到数据23后,Linebuffer进行一次更新,并将data[0][1]发送到计算单元2,开始计算stencil[0][1]。同理,Linebuffer得到数据24后,发送data[0][2]到计算单元3,开始计算stencil[0][2]。可以发现,三个计算单元无法同时开始计算,即它们无法同步。我们假设Layer0计算一个模板,得出一个数的时间为S_0,Layer1计算一个模板的时间为S_1,忽略Linebuffer的读取数据,更新,发送数据的时间。该过程如图8所示。
可以发现计算单元2需要等待S 0才可以开始计算,而计算单元3需要等待2S 0才可以开始计算。3个计算单元的计算是不同步的,且在将来的计算中都不会同步。如果底层的硬件架构为强同步架构,这种非同步的运算会给算法调度带来很大的麻烦,甚至硬件架构根本不支持这样的运行。
解决这个问题可以采用另三个计算单元之间协商进行同步的方法,但这无疑会增加三个计算单元之间的通信成本,也会使同步逻辑变得复杂。
本发明实施例提供了一种基于行缓冲Linebuffer的并行计算方法,可应用于模板计算,使得Linebuffer具有同步调节的能力,进而使得Linebuffer的消费者可以进行同步计算。可选地,具体方法可以包括:首先,确定模板计算对象;其次,根据所述模板计算对象的模板参数和计算单元的个数建行缓冲Linebuffer的预设模板;最后,通过所述行缓冲Linebuffer的预设模板将模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理各自的计算任务。以卷积神经网络为模板计算对象为例,参见图9可知,本发明实施例提供的基于Linebuffer的并行计算方法可以包括:
步骤S901,确定需并行处理的网络层;
步骤S902,为所述网络层分配多个计算单元。以图6所示为例,为Layer1分配三个计算单元,分别为计算单元1、计算单元2和计算单元3。在卷积神经网络中选取需并行处理的网络层时,可以根据各网络层的计算量在所有网络层中选取一层或多层网络层,本发明不做限定。
步骤S903,根据所述网络层的模板参数与计算单元的个数构建行缓冲Linebuffer的预设模板。
步骤S904,通过所述行缓冲Linebuffer的预设模板将模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理所述网络层的任务,其中,所述模板数据是所述模板参数限定的原模板数据。
图10示出了传统Linebuffer技术示意图,其在图6的基础上扩大了输入特征图。参考图6对图10进行分析可知,传统方案会对每个计算单元采用3×3的模板,且步长为1。
本发明实施例可以通过所述行缓冲Linebuffer的预设模板将模板数据同时传输至为需要并行处理的网络层分配的多个计算单元,可选地,本发明实施例中的预设模板是由多个指定大小的原模板组成,计算单元执行卷积运算所需的模板数据位于原模板中。其中,原模板的个数与计算单元的数量相等,多个原模板在预设模板中依次连接,且至少部分重叠。所述原模板大小可以相同,也可以不同,本发明对此不做限定。
也就是说,传统方案采用一个原模板依次获取各计算单元执行卷积计算所需的数据,本发明实施例通过将多个原模板组合到一起构成一个扩大的预设模板,以各计算单元同时获取到所需处理数据,进而实现多个计算单元的并行计算。实际应用中,优先在水平方向扩展Linebuffer的计算模板。
假设Linebuffer消费者的并行度为p,则水平方向上连续的p个模板构成一个新模板,称之为大模板(即上述实施例中的预设模板)。大模板的水平方向步长为p×stride x。图11表示了该过程(以p=3为例)。参见图11可知,原模板的形状为矩形,且大小为3×3,通过在水平方向对原模板进行扩展,将连续的三个原模板可以构成了一个大模板,其中连续的三个原模板可以具有重叠部分。上述步骤S902可进一步包括:通过所述行缓冲Linebuffer的多个原模板将各原模板的模板数据同时传输至所述多个计算单元,由每一个所述计算单元并行处理所述网络层的任务。可选地,具体包括:
S902-1,预先将卷积神经网络的输入特征图平均划分为多个数据图块,如 图11所示的8×6数据图块;
S902-2,利用多个原模板同时获取每个计算单元执行卷积运算所需的模板数据,并将所获取的模板数据传输至对应的计算单元执行计算。
在工作过程中,Linebuffer会在根据大模板包含的数据得到p个原模板数据,并将其同时发送给p个计算单元进行模板计算。
在上述步骤S902-2之后,还可以包括:S902-3,将多个原模板按照指定方向连续移动预设步长,并在多个原模板每次移动后基于所述多个数据图块同时获取每个计算单元当前执行卷积运算所需的新模板数据,将上述新模板数据传输至对应的计算单元,直到多个数据图块全部读取完成。
在本发明一可选实施例中,在获取多个计算单元执行当前所需的模板数据时,还可以基于输入特征图获取所述多个计算单元执行下一次模板计算所需的新模板数据;将所述新模板数据存入预设的数据缓冲区;其中,当所述数据缓冲区读满时,所述预设模板移动预设步长。图12斜线部分(数据图块25、26、27)即为缓冲区。
也就是说,本发明实施例还可以在Linebuffer末尾增加p×stride x个缓冲区,当预设模板中的多个模板获取各计算单元执行卷积计算所需的数据的同时,Linebuffer缓冲区不断读取上一层数据层产生的数据,以减少模板获取数据的时间,进而提升计算效率。当Linebuffer缓冲区读满时,Linebuffer可以进行一次包括多个原模板的预设模板的移动。增加缓冲区的Linebuffer可以做到同时发送多个模板数据,使其多个消费者可以同时并行计算。如图13所示,当Linebuffer获得到第一个模板的所有数据时,将同时发送三个模板数据给三个计算单元,三个计算单元可以同步开始计算。此时,Linebuffer将不断接收Layer0产生的数据,并将其存入缓冲区。当缓冲区存满时,Linebuffer发送之后的3个模板数据,计算单元接收到模板数据后立刻开始下一轮的计算。
图13中所示的实施例中,3S 0>S 1,所以计算单元在计算完一轮后会等待一段时间。如果并行数恰巧等于前层计算速度与后层计算速度的倍数,则后层计算单元不必等待,可以直接开始计算。此时如果忽略Linebuffer的开销和通信、控制的开销,两个两层的计算利用率都会是100%,且所有并行计算都会同步开始。综上所述,同步Linebuffer通过占用多一点的存储,达到了使并行计算可以同步的目的。
上述实施例中基于Linebuffer的并行计算方法的适用于控制粒度较为精细的情况。实际应用中,有时底层硬件只能提供较粗粒度的控制,比如以行为 单位的控制,这时Linebuffer采用行流水技术。当在控制粒度为多行的情况下,即多行流水情况下,行缓冲区支持并行的方法。此时,Linebuffer的缓冲区变为stride y行,即缓冲stride y行的数据,如图14所示。
可选地,本发明实施例在基于Linebuffer执行计算时,Linebuffer通过一组寄存器实现;所述预设模板中的各原模板均包括多个寄存器,以基于所述输入特征图中的数据图块读取并将每次执行模板计算所需的模板数据写入所述计算单元。可选地,一个寄存器可对应读取一个数据图块的数据。
如图15所示,寄存器R00至R24部分与图4、如5所示类似,为3个3×3的模板,将它们分别发送给3个计算单元。计算单元计算此模板的同时,同步Linebuffer通过读入2不断获取新的模板数据,以供各计算单元执行模板计算。同步Linebuffer通过读入1,不断读入新的数据(图12中的数据25、26、27),并存到缓冲区中,缓冲区由B00、B01、B02三个移位寄存器组成。写控制器会不断控制从读入1进入的数据,将其依次循环写入B00、B01、B02、B00、B01、B02等等。
当缓冲区存满,Linebuffer可以进行一次大模板的移动,此时Linebuffer中所有移位寄存器(包括缓冲区)向左移3位,Linebuffer的状态变为图16所示。寄存器R00至R24将发送新的模板至计算单元,缓冲区B00、B01、B02等待读入新的数据30、31、32。
之后,Linebuffer达到需要换行的位置。Linebuffer进行换行操作,所有寄存器向左移动3位,并读入3个新数据。之后Linebuffer达到图17所示状态,缓冲区B00、B01、B02等待读入新的数据35、36、37。
基于同一发明构思,本发明实施例还提供了一种计算设备,包括:处理器,用于执行通过上述任一实施例所述的基于行缓冲Linebuffer的并行计算方法。另外,计算设备还可以包括:存储设备,用于存储计算机程序,所述计算机程序在所述计算设备中运行时由所述处理器加载并执行。
本发明实施例提供了一种更加高效的基于行缓冲Linebuffer的同步计算方法,对于神经网络来讲,首先确定需要执行并行计算的网络层之后会为其分配多个计算单元,并根据该网络层的模板参数与计算单元的个数构建行缓冲Linebuffer的预设模板,通过该预设模板将模板数据同时传输至多个计算单元,进而由多个计算单元并行执行计算,相较于传统的方案来讲计算效率更高,且速度更快。本发明实施例提供的方法可以在大多通用的存储架构上实现,比如寄存器组或者RAM。该基于Linebuffer的同步计算方法可以解决神经网络算法、 多步图像处理算法等算法并行拆分后不同步的问题,所以该同步Linebuffer可以广泛的应用到众核神经网络加速器架构,众核图像处理器架构等等硬件架构上,尤其适用于需要强同步的硬件架构。
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本发明的实施例可以在没有这些具体细节的情况下实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。
类似地,应当理解,为了精简本公开并帮助理解各个发明方面中的一个或多个,在上面对本发明的示例性实施例的描述中,本发明的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而,并不应将该公开的方法解释成反映如下意图:即所要求保护的本发明要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说,如下面的权利要求书所反映的那样,发明方面在于少于前面公开的单个实施例的所有特征。因此,遵循具体实施方式的权利要求书由此明确地并入该具体实施方式,其中每个权利要求本身都作为本发明的单独实施例。
本领域那些技术人员可以理解,可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件,以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。
此外,本领域的技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如,在权利要求书中,所要求保护的实施例的任意之一都可以以任意的组合方式来使用。
应该注意的是上述实施例对本发明进行说明而不是对本发明进行限制,并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本发明可以借助于包括有若 干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。
至此,本领域技术人员应认识到,虽然本文已详尽示出和描述了本发明的多个示例性实施例,但是,在不脱离本发明精神和范围的情况下,仍可根据本发明公开的内容直接确定或推导出符合本发明原理的许多其他变型或修改。因此,本发明的范围应被理解和认定为覆盖了所有这些其他变型或修改。

Claims (10)

  1. A parallel computing method based on a line buffer (Linebuffer), applied to a template computing structure, the method comprising:
    determining a template calculation object;
    constructing a preset template of the line buffer (Linebuffer) according to template parameters of the template calculation object and the number of calculation units; and
    transmitting template data simultaneously to the multiple calculation units through the preset template of the line buffer, each of the calculation units processing its own calculation task in parallel.
  2. The method according to claim 1, wherein, when the template computing structure is a convolutional neural network, the method comprises:
    determining a network layer to be processed in parallel;
    allocating multiple calculation units to the network layer;
    constructing the preset template of the line buffer (Linebuffer) according to template parameters of the network layer and the number of calculation units; and
    transmitting the template data simultaneously to the multiple calculation units through the preset template of the line buffer, each of the calculation units processing the tasks of the network layer in parallel, wherein the template data is the original template data defined by the template parameters.
  3. The method according to claim 2, wherein the preset template of the Linebuffer is composed of multiple original templates of a specified size, and the number of the original templates is equal to the number of the calculation units;
    wherein the multiple original templates are connected in sequence within the preset template and at least partially overlap.
  4. The method according to claim 2 or 3, wherein transmitting the template data simultaneously to the multiple calculation units through the preset template of the line buffer, with each of the calculation units processing the tasks of the network layer in parallel, comprises:
    transmitting the template data of the multiple original templates simultaneously to the multiple calculation units through the multiple original templates of the line buffer, each of the calculation units processing the tasks of the network layer in parallel.
  5. The method according to any one of claims 2 to 4, wherein transmitting the template data of the multiple original templates simultaneously to the multiple calculation units through the multiple original templates of the line buffer, with each of the calculation units processing the tasks of the network layer in parallel, comprises:
    dividing the input feature map of the convolutional neural network evenly into multiple data blocks in advance;
    using the multiple original templates to simultaneously obtain, from the multiple data blocks, the template data each calculation unit requires to perform a convolution operation, and transmitting the obtained template data to the corresponding calculation unit; and
    moving the multiple original templates continuously in a specified direction by a preset step size and, after each move of the multiple original templates, simultaneously obtaining the new template data each calculation unit requires for its current convolution operation and transmitting the new template data to the corresponding calculation unit, until all of the multiple data blocks have been read.
  6. The method according to claim 5, wherein the method further comprises:
    while obtaining the template data currently required by the multiple calculation units, obtaining from the input feature map the new template data required by the multiple calculation units for the next template calculation; and
    storing the new template data in a preset data buffer.
  7. The method according to claim 6, wherein, when the data buffer is full, the preset template moves by the preset step size.
  8. The method according to any one of claims 1 to 7, wherein the Linebuffer is implemented by a set of registers;
    each original template in the preset template comprises multiple registers, which read, from the data blocks of the input feature map, the template data required for each template calculation and write that template data into the corresponding calculation unit.
  9. A computing device, comprising:
    a processor configured to execute the parallel computing method based on a line buffer (Linebuffer) according to any one of claims 1 to 8.
  10. The computing device according to claim 9, wherein the computing device further comprises:
    a storage device for storing a computer program which, when run on the computing device, is loaded and executed by the processor.
PCT/CN2020/082960 2019-04-19 2020-04-02 Parallel computing method and computing device based on a line buffer (Linebuffer) WO2020211654A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910317455.9A CN111832713A (zh) 2019-04-19 2019-04-19 Parallel computing method and computing device based on a line buffer (Linebuffer)
CN201910317455.9 2019-04-19

Publications (1)

Publication Number Publication Date
WO2020211654A1 true WO2020211654A1 (zh) 2020-10-22

Family

ID=72838012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/082960 WO2020211654A1 (zh) 2019-04-19 2020-04-02 Parallel computing method and computing device based on a line buffer (Linebuffer)

Country Status (2)

Country Link
CN (1) CN111832713A (zh)
WO (1) WO2020211654A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842048A (zh) * 2011-06-20 2012-12-26 苏州科雷芯电子科技有限公司 一种图像识别中群相关并行计算的硬件实现方法
CN107862650A (zh) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 加速计算二维图像cnn卷积的方法
CN108229645A (zh) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 卷积加速和计算处理方法、装置、电子设备及存储介质
US20180189643A1 (en) * 2017-01-05 2018-07-05 Electronics And Telecommunications Research Institute Convolution circuit, application processor including the same, and operating method thereof
CN108388537A (zh) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 一种卷积神经网络加速装置和方法
CN108764182A (zh) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 一种优化的用于人工智能的加速方法和装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346622A (zh) * 2013-07-31 2015-02-11 富士通株式会社 卷积神经网络分类器及其分类方法和训练方法
TWI634490B (zh) * 2016-11-14 2018-09-01 美商耐能股份有限公司 卷積運算裝置及卷積運算方法
CN108182471B (zh) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 一种卷积神经网络推理加速器及方法
CN109165728B (zh) * 2018-08-06 2020-12-18 浪潮集团有限公司 一种卷积神经网络的基本计算单元及计算方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842048A (zh) * 2011-06-20 2012-12-26 苏州科雷芯电子科技有限公司 一种图像识别中群相关并行计算的硬件实现方法
US20180189643A1 (en) * 2017-01-05 2018-07-05 Electronics And Telecommunications Research Institute Convolution circuit, application processor including the same, and operating method thereof
CN108229645A (zh) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 卷积加速和计算处理方法、装置、电子设备及存储介质
CN107862650A (zh) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 加速计算二维图像cnn卷积的方法
CN108388537A (zh) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 一种卷积神经网络加速装置和方法
CN108764182A (zh) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 一种优化的用于人工智能的加速方法和装置

Also Published As

Publication number Publication date
CN111832713A (zh) 2020-10-27

Similar Documents

Publication Publication Date Title
CN109375951B (zh) 一种用于执行全连接层神经网络正向运算的装置和方法
WO2017124642A1 (zh) 用于执行人工神经网络正向运算的装置和方法
WO2017124641A1 (zh) 用于执行人工神经网络反向训练的装置和方法
WO2020073211A1 (zh) 运算加速器、处理方法及相关设备
US9953003B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
US11294599B1 (en) Registers for restricted memory
US20200327079A1 (en) Data processing method and device, dma controller, and computer readable storage medium
WO2019127838A1 (zh) 卷积神经网络实现方法及装置、终端、存储介质
US11379713B2 (en) Neural network processing
JP2019036298A (ja) 知能型高帯域幅メモリシステム及びそのための論理ダイ
US11544525B2 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
WO2017185336A1 (zh) 用于执行pooling运算的装置和方法
EP3724822A1 (en) On-chip computational network
CN111008040A (zh) 缓存装置及缓存方法、计算装置及计算方法
US20210303988A1 (en) Multi-model training pipeline in distributed systems
WO2022179074A1 (zh) 数据处理装置、方法、计算机设备及存储介质
CN110100274A (zh) 具有降低功率渲染的混合现实系统
CN112905530A (zh) 片上架构、池化计算加速器阵列、单元以及控制方法
WO2022007265A1 (zh) 一种膨胀卷积加速计算方法及装置
WO2020211654A1 (zh) 一种基于行缓冲Linebuffer的并行计算方法及计算设备
JPWO2020003345A1 (ja) 演算処理装置
CN115345285B (zh) 基于gpu的时序图神经网络训练方法、系统及电子设备
WO2020134927A1 (zh) 一种数据处理方法及装置
WO2020238106A1 (zh) 一种数据处理方法、电子装置及计算机可读存储介质
CN107329733B (zh) 用于执行pooling运算的装置和方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20792093

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20792093

Country of ref document: EP

Kind code of ref document: A1