WO2022095632A1 - Method, device and medium for implementing data convolution operation based on FPGA - Google Patents

Method, device and medium for implementing data convolution operation based on FPGA

Info

Publication number
WO2022095632A1
WO2022095632A1 · PCT/CN2021/121220 · CN2021121220W
Authority
WO
WIPO (PCT)
Prior art keywords
convolution operation
data
fifo queue
target
operation unit
Prior art date
Application number
PCT/CN2021/121220
Other languages
English (en)
French (fr)
Inventor
葛海亮
李仁刚
阚宏伟
郝锐
宿栋栋
赵坤
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2022095632A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations

Definitions

  • the present application relates to the technical field of data processing, and in particular, to a method, device and computer-readable storage medium for implementing data convolution operation based on FPGA.
  • CNN: Convolutional Neural Network
  • in the prior art, the convolution operation is generally implemented through a sliding-window multiply-add method. If the convolution operation is performed on a central processing unit (Central Processing Unit, CPU) or a graphics processing unit (Graphics Processing Unit, GPU), it is generally implemented in a language such as C.
  • CPU: Central Processing Unit
  • GPU: Graphics Processing Unit
  • the disadvantage is that the convolution operation process can only be performed sequentially, and the delay overhead is large.
  • the purpose of the embodiments of the present application is to provide a method, a device and a computer-readable storage medium for implementing a data convolution operation based on an FPGA, which can improve the processing efficiency of the convolution operation.
  • an embodiment of the present application provides a method for implementing a data convolution operation based on an FPGA, including:
  • when a convolution operation instruction is obtained, the data to be processed is stored in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each of the convolution operation units includes a plurality of FIFO queues;
  • judging whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition, wherein the target convolution operation unit is any one of the convolution operation units; and if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, performing the convolution operation on the data to be processed stored in all the target registers.
  • sequentially storing the data to be processed in the FIFO queue of each convolution operation unit and the corresponding register includes:
  • serially inputting the data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
  • judging whether the number of data items stored in the first FIFO queue reaches a preset threshold;
  • if the number of data items stored in the first FIFO queue reaches the preset threshold, shifting the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
  • taking the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shifting the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
  • the judging whether all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition includes: judging whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value; and if so, performing the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
  • the shifting the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue includes: in each clock cycle, shifting the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively.
  • after the earliest-stored item of data to be processed in the first FIFO queue is shifted to the second FIFO queue and to the register corresponding to the first FIFO queue in each clock cycle, the method further includes: judging whether the number of data items stored in the register reaches a preset upper limit, the preset upper limit being set according to the size of the convolution kernel; and if the preset upper limit is reached, deleting the earliest-stored item of data to be processed from the register.
  • the embodiment of the present application also provides a device for implementing data convolution operation based on FPGA, including a storage unit, a judgment unit and an operation unit;
  • the storage unit is configured to, when a convolution operation instruction is obtained, store the data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each of the convolution operation units includes a plurality of FIFO queues;
  • the judging unit is configured to judge whether all target registers corresponding to the target convolution operation unit satisfy the full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
  • the operation unit is configured to perform the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit if all those target registers satisfy the full-load read condition.
  • the storage unit includes an input subunit, a judgment subunit, a shift subunit, and a designation subunit;
  • the input subunit is configured to serially input data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
  • the judgment subunit is configured to judge whether the number of data items stored in the first FIFO queue reaches a preset threshold;
  • the shift subunit is configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
  • the designation subunit is configured to take the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shift the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
  • the judging unit is specifically configured to judge whether all target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value; if all target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit is performed.
  • the shift subunit is specifically configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue in each clock cycle.
  • the first judging unit is configured to, after the earliest-stored item of data to be processed in the first FIFO queue is shifted to the second FIFO queue and to the register corresponding to the first FIFO queue in each clock cycle, judge whether the number of data items stored in the register reaches a preset upper limit; wherein the preset upper limit is set according to the size of the convolution kernel;
  • the deletion unit is configured to delete the earliest-stored item of data to be processed from the register if the number of data items stored in the register reaches the preset upper limit.
  • the embodiment of the present application also provides a device for implementing a data convolution operation based on an FPGA, including: a memory for storing a computer program; and a processor configured to execute the computer program to implement the steps of the method for implementing a data convolution operation based on an FPGA according to any one of the above.
  • Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for implementing a data convolution operation based on an FPGA according to any one of the above are implemented.
  • each convolution operation unit includes multiple FIFO queues.
  • the data processing method of each convolution operation unit is the same. Taking any one of the convolution operation units, i.e. the target convolution operation unit, as an example, it is judged whether all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition.
  • if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, each register in the target convolution operation unit has been filled with the data to be processed for the convolution operation that currently needs to be performed.
  • at this point, the convolution operation is performed on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
  • the characteristic of the FPGA that multiple tasks can be executed concurrently in each cycle is fully utilized, and the RTL-level convolution operation is implemented in a pipelined manner, which effectively improves the processing efficiency of the convolution operation.
  • FIG. 1 is a flowchart of a method for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application;
  • FIG. 2 is an architectural diagram of a single convolution operation provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a device for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the hardware structure of a device for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a method for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application, and the method includes:
  • the parallel processing characteristic of field-programmable gate array (FPGA) hardware can be used to implement the processing of the data convolution operation.
  • the FPGA is provided with multiple convolution operation units, and each convolution operation unit includes multiple FIFO queues.
  • the data to be processed may be serially input from the data layer to the first FIFO queue of the FPGA in each clock cycle; it is determined whether the number of data stored in the first FIFO queue reaches a preset threshold.
  • the preset threshold may be used as the upper limit of the number of data that can be stored in the FIFO queue.
  • if the number of data items stored in the first FIFO queue reaches the preset threshold, the first FIFO queue is already full; to make room for subsequent new data, the data to be processed stored in the first FIFO queue can be shifted to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue.
  • by analogy, when the number of data items stored in the second FIFO queue reaches the preset threshold, the data to be processed stored in the second FIFO queue can be shifted to the third FIFO queue and to the register corresponding to the second FIFO queue, respectively.
  • the FIFO queue that has most recently received shifted data is taken as the current FIFO queue; in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, the data to be processed stored in the current FIFO queue is shifted to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
  • for the last FIFO queue, the data to be processed stored in it can be shifted directly into the register corresponding to the last FIFO queue.
  • in a specific implementation, the data transfer can be carried out one data item at a time: if the number of data items stored in the first FIFO queue reaches the preset threshold, the earliest-stored item of data to be processed in the first FIFO queue can be shifted, in each clock cycle, to the second FIFO queue and to the register corresponding to the first FIFO queue.
  • when storing data to be processed into a FIFO queue, the data can be input serially, one item per clock cycle.
  • in practical applications, the arrival of the rising (or falling) edge of the first clock can be defined as time tclk0, the arrival of the rising (or falling) edge of the second clock as time tclk1, and so on, with the arrival of the rising (or falling) edge of the n-th clock defined as time tclkn-1.
  • when serially inputting the data to be processed, one data item can be input into FIFO_1 at each clock edge; for example, data D00 is input at time tclk0, D01 at tclk1, D02 at tclk2, D03 at tclk3, and D04 at tclk4.
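  • The data movement described above can be illustrated with a small behavioral sketch in Python (used here purely as a simulation language; the application itself targets RTL on an FPGA and contains no code). The function simulate_shift_pipeline and its parameters num_fifos, fifo_depth and reg_width are illustrative assumptions, not names from the application:

    from collections import deque

    def simulate_shift_pipeline(stream, num_fifos=4, fifo_depth=3, reg_width=3):
        # Behavioral approximation of the cascaded FIFO/register shift described
        # above: one item enters FIFO_1 per clock edge; a FIFO that already holds
        # fifo_depth items shifts its earliest-stored item both to the next FIFO
        # and to its paired register (the last FIFO shifts only to its register).
        # Real RTL performs all moves concurrently at the clock edge; here they
        # are applied from a snapshot taken at the start of the cycle.
        fifos = [deque() for _ in range(num_fifos)]
        regs = [deque(maxlen=reg_width) for _ in range(num_fifos)]  # oldest item evicted when full
        for item in stream:
            ready = [len(f) >= fifo_depth for f in fifos]  # sample state at the clock edge
            fifos[0].append(item)                          # serial input of one data item
            for i in range(num_fifos):
                if ready[i]:
                    moved = fifos[i].popleft()             # earliest-stored item
                    regs[i].append(moved)                  # copy into the paired register
                    if i + 1 < num_fifos:
                        fifos[i + 1].append(moved)         # cascade into the next FIFO
        return fifos, regs

    # Example: feed items D00..D19 and inspect what each register ends up holding.
    fifos, regs = simulate_shift_pipeline([f"D{n:02d}" for n in range(20)])
    for i, r in enumerate(regs, start=1):
        print(f"register {i}: {list(r)}")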
  • each convolution operation unit in the FPGA processes data in the same manner.
  • for ease of description, any one of the convolution operation units, i.e. the target convolution operation unit, is taken as an example below.
  • the full-load read condition refers to the condition in which the registers are full of data to be processed.
  • the maximum number of data items that can be stored in each register is a known quantity.
  • the clock count value can be used to characterize the time taken for all the registers in the target convolution operation unit to be filled with valid data to be processed.
  • counting can be started when the first data item is written into the first FIFO queue of the target convolution operation unit; the count value is incremented by one after each clock cycle, and the current number of clock cycles is the value of this counter.
  • when all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, S103 may be executed.
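  • As a compact illustration, the full-load read condition described above amounts to checking that every target register is full and that the clock counter has reached the preset value; a minimal sketch in Python (the function and argument names are illustrative assumptions, not from the application):

    def full_load_read_ready(target_registers, reg_capacity, clk_cycles, preset_clk_count):
        # True when every target register is full and the clock-cycle counter has
        # reached the preset value, i.e. the registers now hold a valid window of
        # data to be processed rather than the partial values seen while the
        # pipeline is still filling.
        return (all(len(reg) == reg_capacity for reg in target_registers)
                and clk_cycles >= preset_clk_count)

    print(full_load_read_ready([[1, 2, 3]] * 3, 3, clk_cycles=17, preset_clk_count=15))  # True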
  • Registers have limited storage capacity.
  • the upper limit corresponding to a register may be set according to the size of the convolution kernel, so that the data to be processed in the registers can be conveniently multiply-added with the convolution kernel.
  • after each shift, it is judged whether the number of data items stored in the register reaches the preset upper limit. If the number of data items stored in the register reaches the preset upper limit, the register is already full; to ensure that valid data to be processed can be successfully stored in the register, the earliest-stored item of data to be processed in the register can be deleted at this point.
  • FIG. 2 is an architectural diagram of a single convolution operation provided by an embodiment of the present application.
  • four FIFO queues are used as an example, and each FIFO queue corresponds to one register.
  • the input of FIFO_1 is the serial data input from data layer a; FIFO_1 has two outputs, one feeding FIFO_2 and the other feeding register 1. The same transfer scheme applies to every FIFO queue up to FIFO_y.
  • assume the convolution kernel is of size y*y. In practical applications, z FIFO queues are designed in each convolution operation unit, where z is not less than y, and y of the FIFO queues are selected to participate in the data-flow shift.
  • the purpose of the shift is to select y*y data items from data layer a to multiply-add with the convolution kernel, so as to output the convolution result.
  • the registers corresponding to the y FIFO queues selected in the target convolution operation unit may be referred to as the target registers corresponding to the target convolution operation unit.
  • storing the data to be processed in the FIFO queues of the convolution operation unit and the corresponding registers means storing the data to be processed in the selected y FIFO queues and in the registers corresponding to each of those y FIFO queues.
  • when data is serially transmitted to the convolution operation unit, the number of data items that each FIFO queue can store is limited; for example, each FIFO queue stores only M data items. Once M items are stored, the earliest-stored item in the current FIFO queue is shifted to the next FIFO queue and to the register corresponding to the current FIFO queue, so that the current FIFO queue can accept newly arriving data.
  • the bit width of each register can be set to y*K bits, that is, each register can hold y data items of K bits each.
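  • As an illustration of the y*K-bit register layout, with y = 4 and K = 8 one 32-bit register word can hold a four-item row slice; a rough packing sketch in Python (the LSB-first packing order is an assumption made only for illustration):

    K = 8  # bit width of one data item
    y = 4  # number of data items held per register (kernel width)

    def pack(values, k=K):
        # Pack y k-bit values into a single y*k-bit register word (LSB-first).
        word = 0
        for i, v in enumerate(values):
            word |= (v & ((1 << k) - 1)) << (i * k)
        return word

    def unpack(word, count=y, k=K):
        # Recover the y k-bit fields from a packed register word.
        return [(word >> (i * k)) & ((1 << k) - 1) for i in range(count)]

    row = [0x11, 0x22, 0x33, 0x44]
    assert unpack(pack(row)) == row  # round-trip check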
  • when the y fields of every register in the register group are filled and the current number of clock cycles reaches the preset clock count value, the serial input of data from data layer a into FIFO_1 can be stopped; when the clock edge arrives, the y*y data items of the register array and the convolution kernel data are used to perform the convolution operation, and the convolution result is output.
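  • Once the register array holds a full y*y window, the convolution itself is a single multiply-accumulate of that window against the kernel; a minimal sketch in Python (the row-list layout of the window is an assumption made for illustration):

    def conv_window(window, kernel):
        # Multiply-accumulate a y*y data window against a y*y kernel; window and
        # kernel are lists of y rows of y values, e.g. the y fields read out of
        # the y target registers once the full-load read condition holds.
        return sum(w * k for w_row, k_row in zip(window, kernel)
                         for w, k in zip(w_row, k_row))

    window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
    print(conv_window(window, kernel))  # 0 for this example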
  • the convolution kernel is generally small in dimension, so it can be stored directly in the FPGA, including but not limited to in the form of registers.
  • data layer a is generally relatively large in dimension; it is input to the FPGA from the outside and provided to the convolution operation unit in a serial manner.
  • after the convolution operation is performed using the y*y data items of the register array and the convolution kernel data and the convolution result is output, it can be determined whether all the data of data layer a has entered the FIFO queues and participated in the convolution operation. If so, this operation can be ended and the next convolution operation instruction awaited. If not all of the data of data layer a has entered the FIFO queues, the flow can return to S101 to continue storing the data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers according to the set data transfer rule.
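  • For reference, the result the pipelined architecture is expected to reproduce is an ordinary sliding-window convolution of data layer a with the y*y kernel; a plain golden-model sketch in Python (for comparison only, it does not reflect the RTL structure):

    def reference_convolution(layer_a, kernel):
        # Naive sliding-window convolution used as a golden model: every y*y
        # window of layer_a is multiply-accumulated against the kernel.
        y = len(kernel)
        rows, cols = len(layer_a), len(layer_a[0])
        return [[sum(layer_a[r + i][c + j] * kernel[i][j]
                     for i in range(y) for j in range(y))
                 for c in range(cols - y + 1)]
                for r in range(rows - y + 1)]

    layer_a = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5x5 input data layer
    kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]                # 3x3 kernel
    for out_row in reference_convolution(layer_a, kernel):
        print(out_row)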
  • FIG. 3 is a schematic structural diagram of a device for implementing data convolution operation based on an FPGA provided by an embodiment of the application, including a storage unit 31, a judgment unit 32, and an operation unit 33;
  • the storage unit 31 is configured to, when a convolution operation instruction is obtained, store the data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to the set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each convolution operation unit includes a plurality of FIFO queues;
  • the judging unit 32 is configured to judge whether all target registers corresponding to the target convolution operation unit satisfy the full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
  • the operation unit 33 is configured to perform the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit if all those target registers satisfy the full-load read condition.
  • the storage unit includes an input subunit, a judgment subunit, a shift subunit, and a designation subunit;
  • the input subunit is configured to serially input the data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
  • the judgment subunit is configured to judge whether the number of data items stored in the first FIFO queue reaches a preset threshold;
  • the shift subunit is configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
  • the designation subunit is configured to take the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shift the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
  • the judging unit is specifically configured to judge whether all target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches the preset clock count value; if so, the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit is performed.
  • the shift subunit is specifically configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue in each clock cycle.
  • the first judging unit is configured to, after the earliest-stored item of data to be processed in the first FIFO queue is shifted to the second FIFO queue and to the register corresponding to the first FIFO queue in each clock cycle, judge whether the number of data items stored in the register reaches the preset upper limit; wherein the preset upper limit is set according to the size of the convolution kernel;
  • the deletion unit is configured to delete the earliest-stored item of data to be processed from the register if the number of data items stored in the register reaches the preset upper limit.
  • FIG. 4 is a schematic diagram of the hardware structure of a device 40 for implementing data convolution operation based on FPGA provided by an embodiment of the application, including:
  • a memory 41 for storing a computer program;
  • a processor 42 for executing the computer program to implement the steps of the method for implementing a data convolution operation based on an FPGA according to any of the foregoing embodiments.
  • Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for implementing a data convolution operation based on an FPGA according to any of the foregoing embodiments are implemented.
  • a software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
  • RAM: random access memory
  • ROM: read-only memory

Abstract

A method, device and medium for implementing a data convolution operation based on an FPGA. The method includes: when a convolution operation instruction is obtained, storing data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule (S101), wherein a plurality of convolution operation units are provided in the FPGA and each convolution operation unit includes a plurality of FIFO queues; judging whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition (S102), wherein the target convolution operation unit is any one of the convolution operation units; and if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit (S103). The characteristic of the FPGA that multiple tasks can be executed concurrently in each cycle is fully utilized, and the RTL-level convolution operation is implemented in a pipelined manner, which effectively improves the processing efficiency of the convolution operation.

Description

Method, device and medium for implementing data convolution operation based on FPGA
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 06, 2020, with application No. 202011229940.X and entitled "Method, device and medium for implementing data convolution operation based on FPGA", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of data processing, and in particular to a method, a device and a computer-readable storage medium for implementing a data convolution operation based on an FPGA.
Background Art
In recent years, convolutional neural networks (Convolutional Neural Network, CNN) have been widely used in the field of artificial intelligence. Convolution operations are inevitably required in a CNN, and the convolution operation is very important to the implementation of a CNN.
In the prior art, the convolution operation is generally implemented through a sliding-window multiply-add method. If the convolution operation is performed on a central processing unit (Central Processing Unit, CPU) or a graphics processing unit (Graphics Processing Unit, GPU), it is generally implemented in a language such as C. The disadvantage is that the convolution operation can only be executed sequentially, and the latency overhead is large.
It can be seen that how to improve the processing efficiency of the convolution operation is a problem to be solved by those skilled in the art.
Summary
The purpose of the embodiments of the present application is to provide a method, a device and a computer-readable storage medium for implementing a data convolution operation based on an FPGA, which can improve the processing efficiency of the convolution operation.
To solve the above technical problem, an embodiment of the present application provides a method for implementing a data convolution operation based on an FPGA, including:
when a convolution operation instruction is obtained, storing data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each of the convolution operation units includes a plurality of FIFO queues;
judging whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
Optionally, the storing data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to the set data transfer rule includes:
serially inputting data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
judging whether the number of data items stored in the first FIFO queue reaches a preset threshold;
if the number of data items stored in the first FIFO queue reaches the preset threshold, shifting the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
taking the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shifting the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
Optionally, the judging whether all target registers corresponding to the target convolution operation unit satisfy the full-load read condition includes:
judging whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value;
if all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, performing the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
Optionally, the shifting the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue if the number of data items stored in the first FIFO queue reaches the preset threshold includes:
if the number of data items stored in the first FIFO queue reaches the preset threshold, shifting, in each clock cycle, the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively.
Optionally, after the shifting, in each clock cycle, the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, the method further includes:
judging whether the number of data items stored in the register reaches a preset upper limit; wherein the preset upper limit is set according to the size of the convolution kernel;
if the number of data items stored in the register reaches the preset upper limit, deleting the earliest-stored item of data to be processed from the register.
An embodiment of the present application also provides a device for implementing a data convolution operation based on an FPGA, including a storage unit, a judging unit and an operation unit;
the storage unit is configured to, when a convolution operation instruction is obtained, store data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each of the convolution operation units includes a plurality of FIFO queues;
the judging unit is configured to judge whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
the operation unit is configured to, if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, perform the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
Optionally, the storage unit includes an input subunit, a judgment subunit, a shift subunit and a designation subunit;
the input subunit is configured to serially input data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
the judgment subunit is configured to judge whether the number of data items stored in the first FIFO queue reaches a preset threshold;
the shift subunit is configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
the designation subunit is configured to take the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shift the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
Optionally, the judging unit is specifically configured to judge whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value; if all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit is performed.
Optionally, the shift subunit is specifically configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift, in each clock cycle, the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively.
Optionally, the device further includes a first judging unit and a deletion unit;
the first judging unit is configured to, after the earliest-stored item of data to be processed in the first FIFO queue is shifted, in each clock cycle, to the second FIFO queue and to the register corresponding to the first FIFO queue, judge whether the number of data items stored in the register reaches a preset upper limit; wherein the preset upper limit is set according to the size of the convolution kernel;
the deletion unit is configured to, if the number of data items stored in the register reaches the preset upper limit, delete the earliest-stored item of data to be processed from the register.
An embodiment of the present application also provides a device for implementing a data convolution operation based on an FPGA, including:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the method for implementing a data convolution operation based on an FPGA according to any one of the above.
An embodiment of the present application also provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the method for implementing a data convolution operation based on an FPGA according to any one of the above.
It can be seen from the above technical solution that, when a convolution operation instruction is obtained, data to be processed is stored in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule, wherein a plurality of convolution operation units are provided in the FPGA and each convolution operation unit includes a plurality of FIFO queues. The data processing method of each convolution operation unit is the same; taking any one of the convolution operation units, i.e. the target convolution operation unit, as an example, it is judged whether all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition. If all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, each register in the target convolution operation unit has been filled with the data to be processed for the convolution operation that currently needs to be performed, and at this point the convolution operation can be performed on the data to be processed stored in all the target registers corresponding to the target convolution operation unit. In this technical solution, the characteristic of the FPGA that multiple tasks can be executed concurrently in each cycle is fully utilized, and the RTL-level convolution operation is implemented in a pipelined manner, which effectively improves the processing efficiency of the convolution operation.
Brief Description of the Drawings
In order to explain the embodiments of the present application more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of a method for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application;
FIG. 2 is an architectural diagram of a single convolution operation provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a device for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the hardware structure of a device for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
In order to enable those skilled in the art to better understand the solution of the present application, the present application is further described in detail below with reference to the drawings and specific implementations.
Next, a method for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application is described in detail. FIG. 1 is a flowchart of a method for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application, and the method includes:
S101: when a convolution operation instruction is obtained, storing data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule.
In the embodiments of the present application, the parallel processing characteristic of field-programmable gate array (Field Programmable Gate Array, FPGA) hardware can be used to implement the processing of the data convolution operation.
A plurality of convolution operation units are provided in the FPGA, and each convolution operation unit includes a plurality of FIFO queues.
In the embodiments of the present application, data to be processed can be serially input from the data layer into the first FIFO queue of the FPGA in each clock cycle, and it is judged whether the number of data items stored in the first FIFO queue reaches a preset threshold. The preset threshold can serve as the upper limit of the number of data items that a FIFO queue can store.
If the number of data items stored in the first FIFO queue reaches the preset threshold, the first FIFO queue is already full; to facilitate the storage of subsequent new data, the data to be processed stored in the first FIFO queue can be shifted to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively, where the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue. By analogy, when the number of data items stored in the second FIFO queue reaches the preset threshold, the data to be processed stored in the second FIFO queue can be shifted to the third FIFO queue and to the register corresponding to the second FIFO queue, respectively. The FIFO queue that has most recently received shifted data is taken as the current FIFO queue; in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, the data to be processed stored in the current FIFO queue is shifted to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
For the last FIFO queue, when the number of data items stored in it reaches the preset threshold, the data to be processed stored in the last FIFO queue is simply shifted directly into the register corresponding to the last FIFO queue.
In a specific implementation, the data transfer can be carried out one data item at a time. If the number of data items stored in the first FIFO queue reaches the preset threshold, then in each clock cycle the earliest-stored item of data to be processed in the first FIFO queue can be shifted to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively.
When storing data to be processed into a FIFO queue, the data can be input serially, one item per clock cycle. In practical applications, the arrival of the rising (or falling) edge of the first clock can be defined as time tclk0, the arrival of the rising (or falling) edge of the second clock as time tclk1, and so on, with the arrival of the rising (or falling) edge of the n-th clock defined as time tclkn-1. When serially inputting the data to be processed, one data item can be input into FIFO_1 at each clock edge; for example, data D00 is input at time tclk0, D01 at tclk1, D02 at tclk2, D03 at tclk3, and D04 at tclk4.
S102: judging whether all target registers corresponding to the target convolution operation unit satisfy the full-load read condition.
In the embodiments of the present application, each convolution operation unit in the FPGA processes data in the same way; for ease of description, any one of the convolution operation units, i.e. the target convolution operation unit, is taken as an example in the following.
The full-load read condition refers to the condition in which the registers are full of data to be processed.
If all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, each register in the target convolution operation unit has been filled with the data to be processed for the convolution operation that currently needs to be performed, and at this point S103 can be executed.
In a specific implementation, it can be judged whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value.
In the embodiments of the present application, the maximum number of data items that each register can store is a known quantity; the clock count value can be used to characterize the time taken for all the registers in the target convolution operation unit to be filled with valid data to be processed.
In practical applications, counting can start when the first data item is written into the first FIFO queue of the target convolution operation unit; the count value is incremented by one after each clock cycle, and the current number of clock cycles is the value of this counter.
When all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, S103 can be executed.
The storage capacity of a register is limited. In the embodiments of the present application, in order to facilitate the multiply-add operation between the data to be processed in the registers and the convolution kernel, the upper limit corresponding to a register can be set according to the size of the convolution kernel.
In practical applications, after the earliest-stored item of data to be processed in the first FIFO queue is shifted, in each clock cycle, to the second FIFO queue and to the register corresponding to the first FIFO queue, it can be judged whether the number of data items stored in the register reaches a preset upper limit. If the number of data items stored in the register reaches the preset upper limit, the register is already full; to ensure that valid data to be processed can be successfully stored in the register, the earliest-stored item of data to be processed in the register can be deleted at this point.
S103: performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
FIG. 2 is an architectural diagram of a single convolution operation provided by an embodiment of the present application; four FIFO queues are taken as an example in FIG. 2, and each FIFO queue corresponds to one register. The input of FIFO_1 is the serial data input from data layer a; FIFO_1 has two outputs, one feeding FIFO_2 and the other feeding register 1, and this transfer scheme applies to every FIFO queue up to FIFO_y. Assuming the convolution kernel is of size y*y, in practical applications z FIFO queues are designed in each convolution operation unit, where z is not less than y, and y of them are selected to participate in the data-flow shift. The purpose of the shift is to select y*y data items from data layer a to multiply-add with the convolution kernel, so as to output the convolution result.
Taking the target convolution operation unit as an example, in the embodiments of the present application, the registers corresponding to the y FIFO queues selected in the target convolution operation unit may be called the target registers corresponding to the target convolution operation unit. Storing the data to be processed in the FIFO queues of the convolution operation unit and the corresponding registers means storing the data to be processed in the selected y FIFO queues and in the registers corresponding to each of those y FIFO queues.
When data is serially transmitted to the convolution operation unit, the number of data items that each FIFO queue can store is limited; for example, each FIFO queue stores only M data items. Once M items are stored, the earliest-stored item in the current FIFO queue is shifted to the next FIFO queue and to the register corresponding to the current FIFO queue, so that the current FIFO queue can store newly arriving data.
The bit width of each register can be set to y*K bits, that is, each register can hold y data items of K bits each. When the y fields of the register group are filled and the current number of clock cycles reaches the preset clock count value, the serial input of data from data layer a into FIFO_1 can be stopped; when the clock edge arrives, the y*y data items of the register array and the convolution kernel data are used to perform the convolution operation, and the convolution result is output.
The convolution kernel is generally small in dimension, so it can be stored directly in the FPGA, including but not limited to in the form of registers. Data layer a is generally relatively large in dimension; it is input to the FPGA from the outside and provided to the convolution operation unit in a serial manner.
In the embodiments of the present application, after the convolution operation is performed using the y*y data items of the register array and the convolution kernel data and the convolution result is output, it can be judged whether all the data of data layer a has entered the FIFO queues and participated in the convolution operation. If all the data of data layer a has entered the FIFO queues and participated in the convolution operation, this operation can be ended and the next convolution operation instruction awaited. If not all of the data of data layer a has entered the FIFO queues, the flow can return to S101 to perform the step of storing the data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to the set data transfer rule.
It can be seen from the above technical solution that, when a convolution operation instruction is obtained, data to be processed is stored in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule, wherein a plurality of convolution operation units are provided in the FPGA and each convolution operation unit includes a plurality of FIFO queues. The data processing method of each convolution operation unit is the same; taking any one of the convolution operation units, i.e. the target convolution operation unit, as an example, it is judged whether all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition. If all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, each register in the target convolution operation unit has been filled with the data to be processed for the convolution operation that currently needs to be performed, and at this point the convolution operation can be performed on the data to be processed stored in all the target registers corresponding to the target convolution operation unit. In this technical solution, the characteristic of the FPGA that multiple tasks can be executed concurrently in each cycle is fully utilized, and the RTL-level convolution operation is implemented in a pipelined manner, which effectively improves the processing efficiency of the convolution operation.
FIG. 3 is a schematic structural diagram of a device for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application, including a storage unit 31, a judging unit 32 and an operation unit 33;
the storage unit 31 is configured to, when a convolution operation instruction is obtained, store data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each convolution operation unit includes a plurality of FIFO queues;
the judging unit 32 is configured to judge whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
the operation unit 33 is configured to, if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, perform the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
Optionally, the storage unit includes an input subunit, a judgment subunit, a shift subunit and a designation subunit;
the input subunit is configured to serially input data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
the judgment subunit is configured to judge whether the number of data items stored in the first FIFO queue reaches a preset threshold;
the shift subunit is configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
the designation subunit is configured to take the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shift the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
Optionally, the judging unit is specifically configured to judge whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value; if all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit is performed.
Optionally, the shift subunit is specifically configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift, in each clock cycle, the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively.
Optionally, the device further includes a first judging unit and a deletion unit;
the first judging unit is configured to, after the earliest-stored item of data to be processed in the first FIFO queue is shifted, in each clock cycle, to the second FIFO queue and to the register corresponding to the first FIFO queue, judge whether the number of data items stored in the register reaches a preset upper limit; wherein the preset upper limit is set according to the size of the convolution kernel;
the deletion unit is configured to, if the number of data items stored in the register reaches the preset upper limit, delete the earliest-stored item of data to be processed from the register.
For the description of the features in the embodiment corresponding to FIG. 3, reference may be made to the related description of the embodiment corresponding to FIG. 1, which is not repeated here.
It can be seen from the above technical solution that, when a convolution operation instruction is obtained, data to be processed is stored in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule, wherein a plurality of convolution operation units are provided in the FPGA and each convolution operation unit includes a plurality of FIFO queues. The data processing method of each convolution operation unit is the same; taking any one of the convolution operation units, i.e. the target convolution operation unit, as an example, it is judged whether all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition. If all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, each register in the target convolution operation unit has been filled with the data to be processed for the convolution operation that currently needs to be performed, and at this point the convolution operation can be performed on the data to be processed stored in all the target registers corresponding to the target convolution operation unit. In this technical solution, the characteristic of the FPGA that multiple tasks can be executed concurrently in each cycle is fully utilized, and the RTL-level convolution operation is implemented in a pipelined manner, which effectively improves the processing efficiency of the convolution operation.
FIG. 4 is a schematic diagram of the hardware structure of a device 40 for implementing a data convolution operation based on an FPGA provided by an embodiment of the present application, including:
a memory 41 for storing a computer program;
a processor 42 for executing the computer program to implement the steps of the method for implementing a data convolution operation based on an FPGA according to any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the method for implementing a data convolution operation based on an FPGA according to any of the above embodiments.
The method, device and computer-readable storage medium for implementing a data convolution operation based on an FPGA provided by the embodiments of the present application have been described above in detail. The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. As the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and reference may be made to the description of the method part where relevant. It should be noted that, for those of ordinary skill in the art, several improvements and modifications can also be made to the present application without departing from the principles of the present application, and these improvements and modifications also fall within the scope of protection of the claims of the present application.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Claims (10)

  1. A method for implementing a data convolution operation based on an FPGA, characterized by comprising:
    when a convolution operation instruction is obtained, storing data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each of the convolution operation units comprises a plurality of FIFO queues;
    judging whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
    if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
  2. The method for implementing a data convolution operation based on an FPGA according to claim 1, characterized in that the storing data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to the set data transfer rule comprises:
    serially inputting data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
    judging whether the number of data items stored in the first FIFO queue reaches a preset threshold;
    if the number of data items stored in the first FIFO queue reaches the preset threshold, shifting the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively;
    taking the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shifting the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
  3. The method for implementing a data convolution operation based on an FPGA according to claim 2, characterized in that the judging whether all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition comprises:
    judging whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value;
    if all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, performing the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
  4. The method for implementing a data convolution operation based on an FPGA according to claim 2, characterized in that the shifting the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue if the number of data items stored in the first FIFO queue reaches the preset threshold comprises:
    if the number of data items stored in the first FIFO queue reaches the preset threshold, shifting, in each clock cycle, the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively.
  5. The method for implementing a data convolution operation based on an FPGA according to claim 4, characterized in that after the shifting, in each clock cycle, the earliest-stored item of data to be processed in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, the method further comprises:
    judging whether the number of data items stored in the register reaches a preset upper limit; wherein the preset upper limit is set according to the size of the convolution kernel;
    if the number of data items stored in the register reaches the preset upper limit, deleting the earliest-stored item of data to be processed from the register.
  6. A device for implementing a data convolution operation based on an FPGA, characterized by comprising a storage unit, a judging unit and an operation unit;
    the storage unit is configured to, when a convolution operation instruction is obtained, store data to be processed in the FIFO queues of each convolution operation unit and the corresponding registers in turn according to a set data transfer rule; wherein a plurality of convolution operation units are provided in the FPGA, and each of the convolution operation units comprises a plurality of FIFO queues;
    the judging unit is configured to judge whether all target registers corresponding to a target convolution operation unit satisfy a full-load read condition; wherein the target convolution operation unit is any one of the convolution operation units;
    the operation unit is configured to, if all the target registers corresponding to the target convolution operation unit satisfy the full-load read condition, perform the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
  7. The device for implementing a data convolution operation based on an FPGA according to claim 6, characterized in that the storage unit comprises an input subunit, a judgment subunit, a shift subunit and a designation subunit;
    the input subunit is configured to serially input data to be processed from the data layer into the first FIFO queue of the FPGA in each clock cycle;
    the judgment subunit is configured to judge whether the number of data items stored in the first FIFO queue reaches a preset threshold;
    the shift subunit is configured to, if the number of data items stored in the first FIFO queue reaches the preset threshold, shift the data to be processed stored in the first FIFO queue to the second FIFO queue and to the register corresponding to the first FIFO queue, respectively; wherein the second FIFO queue is the next FIFO queue adjacent to the first FIFO queue;
    the designation subunit is configured to take the FIFO queue that has most recently received shifted data as the current FIFO queue, and in each clock cycle, if the number of data items stored in the current FIFO queue reaches the preset threshold, shift the data to be processed stored in the current FIFO queue to the next adjacent FIFO queue and to the register corresponding to the current FIFO queue, respectively.
  8. The device for implementing a data convolution operation based on an FPGA according to claim 7, characterized in that the judging unit is specifically configured to judge whether all the target registers corresponding to the target convolution operation unit are fully loaded and whether the current number of clock cycles reaches a preset clock count value; and if all the target registers corresponding to the target convolution operation unit are fully loaded and the current number of clock cycles reaches the preset clock count value, perform the step of performing the convolution operation on the data to be processed stored in all the target registers corresponding to the target convolution operation unit.
  9. A device for implementing a data convolution operation based on an FPGA, characterized by comprising:
    a memory for storing a computer program;
    a processor for executing the computer program to implement the steps of the method for implementing a data convolution operation based on an FPGA according to any one of claims 1 to 5.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for implementing a data convolution operation based on an FPGA according to any one of claims 1 to 5 are implemented.
PCT/CN2021/121220 2020-11-06 2021-09-28 Method, device and medium for implementing data convolution operation based on FPGA WO2022095632A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011229940.XA CN112464150A (zh) 2020-11-06 2020-11-06 Method, device and medium for implementing data convolution operation based on FPGA
CN202011229940.X 2020-11-06

Publications (1)

Publication Number Publication Date
WO2022095632A1 true WO2022095632A1 (zh) 2022-05-12

Family

ID=74826366

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121220 WO2022095632A1 (zh) 2020-11-06 2021-09-28 Method, device and medium for implementing data convolution operation based on FPGA

Country Status (2)

Country Link
CN (1) CN112464150A (zh)
WO (1) WO2022095632A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309676A (zh) * 2022-10-12 2022-11-08 浪潮电子信息产业股份有限公司 一种异步fifo读写控制方法、系统及电子设备
CN115983337A (zh) * 2022-12-14 2023-04-18 北京登临科技有限公司 卷积计算单元、ai运算阵列及相关设备

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464150A (zh) * 2020-11-06 2021-03-09 苏州浪潮智能科技有限公司 一种基于fpga实现数据卷积运算的方法、装置和介质
CN113706366B (zh) * 2021-07-30 2024-02-27 浪潮电子信息产业股份有限公司 一种图像特征数据的提取方法、系统及相关装置
CN114528111B (zh) * 2022-02-17 2023-06-16 北京有竹居网络技术有限公司 用于数据召回的fpga芯片和数据召回方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170236053A1 (en) * 2015-12-29 2017-08-17 Synopsys, Inc. Configurable and Programmable Multi-Core Architecture with a Specialized Instruction Set for Embedded Application Based on Neural Networks
CN108595379A (zh) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 一种基于多级缓存的并行化卷积运算方法及系统
CN109416756A (zh) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 卷积器及其所应用的人工智能处理装置
CN109711533A (zh) * 2018-12-20 2019-05-03 西安电子科技大学 基于fpga的卷积神经网络模块
CN110414672A (zh) * 2019-07-23 2019-11-05 江苏鼎速网络科技有限公司 卷积运算方法、装置及系统
CN112464150A (zh) * 2020-11-06 2021-03-09 苏州浪潮智能科技有限公司 一种基于fpga实现数据卷积运算的方法、装置和介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170236053A1 (en) * 2015-12-29 2017-08-17 Synopsys, Inc. Configurable and Programmable Multi-Core Architecture with a Specialized Instruction Set for Embedded Application Based on Neural Networks
CN109416756A (zh) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 卷积器及其所应用的人工智能处理装置
CN108595379A (zh) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 一种基于多级缓存的并行化卷积运算方法及系统
CN109711533A (zh) * 2018-12-20 2019-05-03 西安电子科技大学 基于fpga的卷积神经网络模块
CN110414672A (zh) * 2019-07-23 2019-11-05 江苏鼎速网络科技有限公司 卷积运算方法、装置及系统
CN112464150A (zh) * 2020-11-06 2021-03-09 苏州浪潮智能科技有限公司 一种基于fpga实现数据卷积运算的方法、装置和介质

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309676A (zh) * 2022-10-12 2022-11-08 浪潮电子信息产业股份有限公司 一种异步fifo读写控制方法、系统及电子设备
CN115309676B (zh) * 2022-10-12 2023-02-28 浪潮电子信息产业股份有限公司 一种异步fifo读写控制方法、系统及电子设备
CN115983337A (zh) * 2022-12-14 2023-04-18 北京登临科技有限公司 卷积计算单元、ai运算阵列及相关设备

Also Published As

Publication number Publication date
CN112464150A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
WO2022095632A1 (zh) 一种基于fpga实现数据卷积运算的方法、装置和介质
WO2018120989A1 (zh) 卷积运算芯片和通信设备
US7873817B1 (en) High speed multi-threaded reduced instruction set computer (RISC) processor with hardware-implemented thread scheduler
KR20200138339A (ko) 멀티 스레드, 자체 스케줄링 재구성 가능한 컴퓨팅 패브릭에 대한 조건부 브랜칭 제어
TWI827792B (zh) 多路徑神經網路、資源配置的方法及多路徑神經網路分析器
CN112487750B (zh) 一种基于存内计算的卷积加速计算系统及方法
US20230196500A1 (en) Image data storage method, image data processing method and system, and related apparatus
US20080282062A1 (en) Method and apparatus for loading data and instructions into a computer
WO2018129930A1 (zh) 快速傅里叶变换处理方法、装置和计算机存储介质
US20070052557A1 (en) Shared memory and shared multiplier programmable digital-filter implementation
US20110179251A1 (en) Power saving asynchronous computer
CN106325812A (zh) 一种针对乘累加运算的处理方法及装置
Singh et al. Applying real-time scheduling theory to the synchronous data flow model of computation
CN109800867B (zh) 一种基于fpga片外存储器的数据调用方法
WO2024066259A1 (zh) 一种指令调度方法、芯片及电子设备
CN114528526B (zh) 卷积数据处理方法、装置、卷积运算加速器和存储介质
Pang et al. Self-timed meshes are faster than synchronous
US20210303267A1 (en) Method of data processing, corresponding mac circuit, dsp system and computer program product
Xiao et al. A mobilenet accelerator with high processing-element-efficiency on fpga
Huang et al. MALMM: A multi-array architecture for large-scale matrix multiplication on FPGA
CN113610221A (zh) 一种基于fpga的可变膨胀卷积运算硬件系统
Kohutka et al. Heap queue: a novel efficient hardware architecture of MIN/MAX queues for real-time systems
US10387155B2 (en) Controlling register bank access between program and dedicated processors in a processing system
Zhang et al. Design of multifunctional convolutional neural network accelerator for iot endpoint soc
JP5544856B2 (ja) 調停装置、調停方法及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888329

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888329

Country of ref document: EP

Kind code of ref document: A1