WO2020143236A1 - Acceleration method, apparatus, device and storage medium for a convolutional neural network - Google Patents

Acceleration method, apparatus, device and storage medium for a convolutional neural network

Info

Publication number
WO2020143236A1
Authority: WO — WIPO (PCT)
Prior art keywords: cnn, accelerated, preset, acceleration, action timing
Application number: PCT/CN2019/103637
Other languages: English (en), French (fr)
Inventors: 王丽 (Wang Li), 曹芳 (Cao Fang), 郭振华 (Guo Zhenhua)
Original assignee: Guangdong Inspur Big Data Research Co., Ltd. (广东浪潮大数据研究有限公司)
Priority date: 2019-01-08 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2019-08-30
Application filed by Guangdong Inspur Big Data Research Co., Ltd.
Publication of WO2020143236A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology

Definitions

  • The invention relates to the field of algorithm acceleration, and in particular to an acceleration method for a convolutional neural network. The invention also relates to an acceleration apparatus, device and storage medium for a convolutional neural network.
  • CNN: Convolutional Neural Network.
  • In the prior art, an accelerator card is used to accelerate the CNN computing process, but there are many different types of CNN, and each type requires its own dedicated accelerator card.
  • An acceleration method for a convolutional neural network, including: receiving in advance a plurality of preset types of calculation operation models in a preset CNN;
  • obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models;
  • controlling the field-programmable gate array (FPGA) of the accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated.
  • Obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated is specifically: converting the CNN to be accelerated into a CNN to be accelerated of a preset deep learning framework, and obtaining the action timing parameters from that framework.
  • The preset deep learning framework is caffe or TensorFlow.
  • Controlling the FPGA of the accelerator card to compile the kernel program for executing the CNN to be accelerated according to the to-be-used calculation operation models is specifically: controlling the FPGA to compile the kernel program through its own hardware compilation platform.
  • Receiving in advance the plurality of preset types of calculation operation models in the preset CNN is specifically: receiving in advance a plurality of preset types of calculation operation models built in the Open Computing Language (OpenCL).
  • The preset types include a convolution operation, a pooling operation, the rectified linear unit (ReLU) function, and the Norm function.
  • The present invention also provides an acceleration apparatus for a convolutional neural network, including:
  • a receiving module, configured to receive in advance a plurality of preset types of calculation operation models in a preset CNN;
  • a first obtaining module, configured to obtain, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models;
  • a first control module, configured to control the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
  • a second obtaining module, configured to obtain action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated;
  • a second control module, configured to control the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
  • The second obtaining module includes:
  • a conversion module, configured to convert the CNN to be accelerated into a CNN to be accelerated of a preset deep learning framework;
  • an obtaining submodule, configured to obtain the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated of the preset deep learning framework.
  • An acceleration device for a convolutional neural network, including:
  • a memory, configured to store a computer program;
  • a processor, configured to implement the steps of the acceleration method for a convolutional neural network described in any one of the above items when executing the computer program.
  • The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the acceleration method for a convolutional neural network described in any one of the above items.
  • The invention provides an acceleration method for a convolutional neural network, which includes receiving in advance a plurality of preset types of calculation operation models in a preset CNN;
  • obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models;
  • controlling the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated;
  • obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated, and controlling the FPGA to execute the kernel program according to that action timing while operating on preset data, so as to achieve acceleration.
  • With the invention, any accelerator card can perform the acceleration operation on any CNN to be accelerated; there is no need to develop multiple kinds of accelerator cards, which gives strong flexibility and saves R&D costs.
  • The invention also provides an acceleration apparatus, device and storage medium for a convolutional neural network, which have the same beneficial effects as the above acceleration method.
  • FIG. 1 is a schematic flowchart of an acceleration method for a convolutional neural network provided by the present invention;
  • FIG. 2 is a schematic structural diagram of an acceleration apparatus for a convolutional neural network provided by the present invention;
  • FIG. 3 is a schematic structural diagram of an acceleration device for a convolutional neural network provided by the present invention.
  • Referring to FIG. 1, FIG. 1 is a schematic flowchart of an acceleration method for a convolutional neural network provided by the present invention, including:
  • Step S1: receive in advance a plurality of preset types of calculation operation models in the preset CNN.
  • The plurality of preset types of calculation operation models may be calculation operation models capable of implementing the calculation operations commonly used in various convolutional neural networks.
  • For example, calculation operation model A may implement calculation operation A, and calculation operation model B may implement calculation operation B; the number of models can be set as required, and the embodiment of the present invention is not limited herein.
  • The execution subject in the embodiment of the present invention may be a CPU. This step may specifically be that a storage module in the CPU receives in advance the plurality of preset types of calculation operation models in the preset CNN, or that the CPU stores the models in the storage module after receiving them. Once each calculation operation model is available in the storage module, the subsequent steps can be performed so as to accelerate various algorithms.
  • Step S2: from the plurality of calculation operation models, obtain the calculation operation models that can realize each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models.
  • The CNN to be accelerated may be any one of various CNNs, which is not limited in this embodiment of the present invention.
  • Step S3: control the FPGA (Field-Programmable Gate Array) of the accelerator card to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated.
  • For the FPGA to accelerate the CNN to be accelerated smoothly, it can compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated; the FPGA can then execute the kernel program and, in cooperation with the subsequent steps, accelerate the CNN to be accelerated.
  • The acceleration implemented by the FPGA in the embodiment of the present invention may be heterogeneous acceleration and may be applicable to various types of CNN; the embodiment of the present invention is not limited herein.
  • After the kernel program is compiled, it can be loaded into the FPGA for subsequent execution.
  • Step S4: obtain the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated.
  • The action timing parameters of the CNN to be accelerated can be obtained in multiple ways; for example, the CNN to be accelerated can be parsed directly, or the parameters can be obtained from a pre-stored database, etc. The embodiment of the present invention is not limited herein.
  • The action timing parameters can include the action timing of the CNN to be accelerated, for example, action B is executed after action A is completed, and action D is executed after action B is completed. The specific form of the action timing corresponds to the type of the CNN to be accelerated; the embodiment of the present invention is not limited herein.
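  • The patent leaves the concrete encoding of the action timing parameters open. Purely as an illustration, a host-side layout for such parameters might look like the C sketch below; every name and field here is a hypothetical assumption, not part of the invention:

    /* Hypothetical host-side layout for the action timing parameters.
     * Each entry names one calculation operation of the CNN to be
     * accelerated and the order in which it must run; the FPGA kernel
     * would walk the entries in sequence. Names are illustrative only. */
    enum op_type { OP_CONV = 0, OP_POOL = 1, OP_RELU = 2, OP_NORM = 3 };

    struct op_step {
        enum op_type type;   /* which preset calculation operation model to invoke */
        int layer_index;     /* position of the operation in the CNN to be accelerated */
        int param_offset;    /* offset of this layer's weights in global memory */
    };

    /* Example timing for a tiny CNN: conv -> relu -> pool -> norm. */
    static const struct op_step action_timing[] = {
        { OP_CONV, 0, 0 },
        { OP_RELU, 1, 0 },
        { OP_POOL, 2, 0 },
        { OP_NORM, 3, 0 },
    };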
  • Step S5: control the FPGA to execute the kernel program according to the action timing in the action timing parameters, and operate on the preset data, so as to achieve acceleration.
  • After the above steps are completed, the FPGA can be controlled to execute the kernel program according to the action timing in the action timing parameters and, during this process, operate on the preset data, which accelerates the CNN to be accelerated and improves the computing speed.
  • The preset data may be data of various types, such as face data obtained during face recognition; the preset data may be input into the FPGA from global memory for computation under the control of the CPU. The embodiments are not limited herein.
  • After the computation ends, the CPU can also obtain the operation results of the FPGA: the FPGA can be controlled to store the operation results, and the CPU then fetches them from storage and outputs them in various forms, such as charts or voice prompts; the embodiment of the present invention is not limited herein.
  • Deep learning is one of the rapidly developing fields in artificial intelligence; it can help computers understand large amounts of data in the form of images, sounds, and text.
  • As deep learning open-source tools such as caffe (Convolutional Architecture for Fast Feature Embedding) have matured, deep learning technology has developed rapidly, and deep learning is being widely applied in face recognition, speech recognition, precision medicine, and autonomous driving.
  • CNN is a type of artificial neural network and the first deep learning algorithm to successfully train a multi-layer network structure. Developers use computationally intensive algorithms to create CNNs and implement them on various platforms.
  • CNNs are used in check reading systems, OCR (Optical Character Recognition) and handwriting recognition systems, face recognition and license plate recognition in street view, and face recognition in the France Telecom video conference system.
  • FPGA acceleration maps the algorithm onto parallel hardware on the FPGA: each hardware module designed on the FPGA can execute in parallel, and the interconnection of module inputs and outputs, together with the pipeline structure provided by the FPGA, matches the CNN algorithm well, fully exploiting the parallelism within the network structure and reducing energy consumption while increasing operation speed.
  • Scholars have implemented CNNs of different structures on FPGAs for simple real-time image recognition or classification, but most of these implementations only target the more computation-heavy convolutional layers or are based on one specific neural network; for example, Aydonat et al. proposed a new CNN implementation framework to complete heterogeneous acceleration of the AlexNet network on FPGA.
  • When R&D personnel need to apply FPGA heterogeneous acceleration to a new convolutional neural network, they must redesign the FPGA implementation architecture according to the specific network structure of the new network, which gives poor versatility and flexibility.
  • Obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated is specifically: converting the CNN to be accelerated into a CNN to be accelerated of a preset deep learning framework, and obtaining the action timing parameters from it.
  • The CNN to be accelerated may come from any of multiple types of deep learning frameworks. To obtain its action timing parameters from different frameworks, the various frameworks would all have to be pre-built in the CPU; to save resources, only one preset deep learning framework can be built in the CPU.
  • In that case, the CNN to be accelerated only needs to be converted into the CNN to be accelerated of the preset framework, and the CPU can then obtain the action timing parameters from it, which saves resources.
  • Of course, the action timing parameters may also be obtained in other ways; for example, multiple deep learning frameworks may be pre-built in the CPU and the action timing parameters of the CNN to be accelerated acquired directly, etc. The embodiments of the present invention are not limited herein.
  • The preset deep learning framework is caffe or TensorFlow.
  • Both caffe and TensorFlow are commonly used deep learning frameworks; if the CNN to be accelerated is already a caffe or TensorFlow model, no framework conversion is needed, which further saves computing resources.
  • Of course, the preset deep learning framework may also be of other types, which is not limited in this embodiment of the present invention.
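  • As an illustration of why a single preset framework simplifies this step: once the network is expressed in caffe's prototxt format, the layer sequence of a linear network (and hence its action timing) can be read off the type: fields in file order. The C sketch below is a naive scan under that assumption (the file name is hypothetical); a real implementation would use the protobuf text-format parser instead:

    #include <stdio.h>
    #include <string.h>

    /* Naive sketch: print the layer types of a caffe prototxt in file
     * order, which for a linear network corresponds to the action timing.
     * Assumes one "type:" field per line. */
    int main(void)
    {
        FILE *f = fopen("net.prototxt", "r"); /* hypothetical file name */
        char line[256];
        if (!f)
            return 1;
        while (fgets(line, sizeof line, f)) {
            char *p = strstr(line, "type:");
            if (p)
                printf("op: %s", p + 5); /* e.g. "Convolution", "ReLU", "Pooling" */
        }
        fclose(f);
        return 0;
    }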
  • Controlling the FPGA of the accelerator card to compile the kernel program for executing the CNN to be accelerated according to the to-be-used calculation operation models is specifically: controlling the FPGA to compile the kernel program through its own hardware compilation platform.
  • Receiving in advance the plurality of preset types of calculation operation models in the preset CNN is specifically: receiving models built with OpenCL.
  • OpenCL has the advantages of a simple structure and convenient use.
  • Because the calculation modules within a CNN network layer are independent and uncorrelated, the network layer calculation modules commonly used in CNNs can each be implemented in the FPGA high-level programming language OpenCL, with the OpenCL parallel optimization design completed, thereby constructing a plurality of preset types of calculation operation models; all the calculation operation models can be built into a network layer calculation library.
  • Of course, the calculation operation models can also be implemented using other programming languages; the embodiment of the present invention is not limited herein.
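  • For instance, the ReLU module of such a network layer calculation library could be as small as the following OpenCL C kernel. This is only a sketch of what one preset calculation operation model might look like, not the patent's actual implementation; the kernel and buffer names are assumptions:

    /* Sketch of a ReLU calculation operation model in OpenCL C.
     * One work-item rectifies one element; on an FPGA the compiler can
     * pipeline this simple kernel across the NDRange. */
    __kernel void relu(__global const float *restrict in,
                       __global float *restrict out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }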
  • The preset types include a convolution operation, a pooling operation, ReLU (Rectified Linear Unit), and the Norm function.
  • Convolution, pooling, ReLU and Norm are calculation operations commonly used in various CNNs and can implement many different types of CNN well.
  • Of course, the preset types may also include various other types, which is not limited in this embodiment of the present invention.
  • Referring to FIG. 2, FIG. 2 shows an acceleration apparatus for a convolutional neural network provided by the present invention, including:
  • a receiving module 1, configured to receive in advance a plurality of preset types of calculation operation models in a preset CNN;
  • a first obtaining module 2, configured to obtain, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models;
  • a first control module 3, configured to control the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
  • a second obtaining module 4, configured to obtain action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated;
  • a second control module 5, configured to control the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
  • The second obtaining module 4 includes:
  • a conversion module, configured to convert the CNN to be accelerated into a CNN to be accelerated of a preset deep learning framework;
  • an obtaining submodule, configured to obtain the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated of the preset deep learning framework.
  • Referring to FIG. 3, FIG. 3 shows an acceleration device for a convolutional neural network provided by the present invention, including:
  • a memory 6, configured to store a computer program;
  • a processor 7, configured to implement the steps of the acceleration method for a convolutional neural network of the foregoing embodiments when executing the computer program.
  • The present invention also provides a computer-readable storage medium storing a computer program; when the computer program is executed by the processor 7, the steps of the acceleration method for a convolutional neural network of the foregoing embodiments are implemented.


Abstract

An acceleration method, apparatus, device and storage medium for a convolutional neural network, including: receiving in advance a plurality of preset types of calculation operation models in a preset convolutional neural network (CNN); obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models; controlling a field-programmable gate array (FPGA) of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated; obtaining action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated; and controlling the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration. With the present invention, any accelerator card can perform the acceleration operation on any CNN to be accelerated; there is no need to develop multiple kinds of accelerator cards, which gives strong flexibility and saves R&D costs.

Description

Acceleration method, apparatus, device and storage medium for a convolutional neural network
This application claims priority to Chinese patent application No. 201910016345.9, entitled "Acceleration method, apparatus, device and storage medium for a convolutional neural network", filed with the Chinese Patent Office on January 8, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of algorithm acceleration, and in particular to an acceleration method for a convolutional neural network. The present invention also relates to an acceleration apparatus, device and storage medium for a convolutional neural network.
Background
A CNN (Convolutional Neural Network) is a type of artificial neural network. To meet requirements such as computing speed, an accelerator card is usually used to accelerate the computing process of a CNN. However, there are many different types of CNN, and in the prior art, accelerating the computing process of a CNN requires an accelerator card dedicated to that type of CNN; in other words, each type of CNN needs its own dedicated accelerator card before it can be accelerated. This gives poor flexibility, and developing multiple types of accelerator cards incurs high R&D costs.
Therefore, how to provide a solution to the above technical problem is a problem that those skilled in the art currently need to solve.
Summary of the Invention
The object of the present invention is to provide an acceleration method for a convolutional neural network that is highly flexible and saves R&D costs; another object of the present invention is to provide an acceleration apparatus, device and storage medium for a convolutional neural network that are highly flexible and save R&D costs.
To solve the above technical problem, the present invention provides an acceleration method for a convolutional neural network, including:
receiving in advance a plurality of preset types of calculation operation models in a preset convolutional neural network (CNN);
obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models;
controlling a field-programmable gate array (FPGA) of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
obtaining action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated;
controlling the FPGA to execute the kernel program according to the action timing in the action timing parameters and operating on preset data, so as to achieve acceleration.
Preferably, obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated is specifically:
converting the CNN to be accelerated into the CNN to be accelerated of a preset deep learning framework;
obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated of the preset deep learning framework.
Preferably, the preset deep learning framework is caffe or TensorFlow.
Preferably, controlling the field-programmable gate array (FPGA) of the accelerator card to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated is specifically:
controlling the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models and through its own hardware compilation platform, the kernel program for executing the CNN to be accelerated.
Preferably, receiving in advance the plurality of preset types of calculation operation models in the preset convolutional neural network (CNN) is specifically:
receiving in advance a plurality of preset types of calculation operation models in the preset CNN built with the Open Computing Language (OpenCL).
Preferably, the preset types include a convolution operation, a pooling operation, the rectified linear unit (ReLU) function, and the Norm function.
To solve the above technical problem, the present invention also provides an acceleration apparatus for a convolutional neural network, including:
a receiving module, configured to receive in advance a plurality of preset types of calculation operation models in a preset convolutional neural network (CNN);
a first obtaining module, configured to obtain, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models;
a first control module, configured to control a field-programmable gate array (FPGA) of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
a second obtaining module, configured to obtain action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated;
a second control module, configured to control the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
Preferably, the second obtaining module includes:
a conversion module, configured to convert the CNN to be accelerated into the CNN to be accelerated of a preset deep learning framework;
an obtaining submodule, configured to obtain the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated of the preset deep learning framework.
To solve the above technical problem, the present invention also provides an acceleration device for a convolutional neural network, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the acceleration method for a convolutional neural network described in any one of the above items when executing the computer program.
To solve the above technical problem, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the acceleration method for a convolutional neural network described in any one of the above items.
The present invention provides an acceleration method for a convolutional neural network, including receiving in advance a plurality of preset types of calculation operation models in a preset convolutional neural network (CNN); obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models; controlling a field-programmable gate array (FPGA) of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated; obtaining action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated; and controlling the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
It can be seen that, in the present invention, whenever an acceleration operation is performed on any CNN to be accelerated, the calculation operation models capable of realizing each calculation operation of that CNN can be obtained from the plurality of preset types of calculation operation models in the preset CNN as to-be-used calculation operation models; the FPGA in the accelerator card can then be controlled to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated; next, the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated can be obtained, and the FPGA can be controlled to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration. With the present invention, any accelerator card can perform the acceleration operation on any CNN to be accelerated; there is no need to develop multiple kinds of accelerator cards, which gives strong flexibility and saves R&D costs.
The present invention also provides an acceleration apparatus, device and storage medium for a convolutional neural network, which have the same beneficial effects as the above acceleration method.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the prior art and the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of an acceleration method for a convolutional neural network provided by the present invention;
FIG. 2 is a schematic structural diagram of an acceleration apparatus for a convolutional neural network provided by the present invention;
FIG. 3 is a schematic structural diagram of an acceleration device for a convolutional neural network provided by the present invention.
Detailed Description
The core of the present invention is to provide an acceleration method for a convolutional neural network that is highly flexible and saves R&D costs; another core of the present invention is to provide an acceleration apparatus, device and storage medium for a convolutional neural network that are highly flexible and save R&D costs.
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of an acceleration method for a convolutional neural network provided by the present invention, including:
Step S1: receiving in advance a plurality of preset types of calculation operation models in a preset CNN.
Specifically, the plurality of preset types of calculation operation models may be calculation operation models capable of implementing the calculation operations commonly used in various convolutional neural networks; for example, calculation operation model A may implement calculation operation A, and calculation operation model B may implement calculation operation B. Their number can be set as required, and the embodiment of the present invention is not limited herein.
Specifically, the execution subject in the embodiment of the present invention may be a CPU. This step may specifically be that a storage module in the CPU receives in advance the plurality of preset types of calculation operation models in the preset CNN, or that the CPU stores the models in the storage module after receiving them. In either case, once each calculation operation model is available in the storage module, the subsequent steps can be performed so as to accelerate various algorithms.
Step S2: obtaining, from the plurality of calculation operation models, the calculation operation models that can realize each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models.
Specifically, the CNN to be accelerated may be any one of various CNNs, which is not limited in this embodiment of the present invention.
Here, each calculation operation in the CNN to be accelerated may first be obtained, that is, it is determined what each calculation operation in the CNN to be accelerated is; then, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of the CNN to be accelerated are taken as the to-be-used calculation operation models, so that the subsequent steps can be performed.
Step S3: controlling the FPGA (Field-Programmable Gate Array) of the accelerator card to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated.
Specifically, for the FPGA to accelerate the CNN to be accelerated smoothly, the kernel program for executing the CNN to be accelerated can be compiled according to the to-be-used calculation operation models; the FPGA can then execute the kernel program and, in cooperation with the subsequent steps, accelerate the CNN to be accelerated.
Here, the acceleration implemented by the FPGA in the embodiment of the present invention may be heterogeneous acceleration and may be applicable to various types of CNN; the embodiment of the present invention is not limited herein.
Here, after the kernel program is compiled, it can be controlled to be loaded into the FPGA for subsequent execution.
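On FPGA OpenCL toolchains (for example the Intel FPGA SDK for OpenCL), the hardware compilation platform typically runs offline and produces a bitstream image that the host later loads with clCreateProgramWithBinary. The following C sketch illustrates only that loading step under those assumptions; the file name "cnn.aocx" and kernel name "cnn_top" are hypothetical, and error handling is abbreviated:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* Sketch: load an offline-compiled FPGA kernel image (hypothetical
     * file "cnn.aocx") and create the kernel for the CNN to be
     * accelerated. */
    static unsigned char *read_file(const char *path, size_t *len)
    {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        *len = (size_t)ftell(f);
        rewind(f);
        unsigned char *buf = malloc(*len);
        fread(buf, 1, *len, f);
        fclose(f);
        return buf;
    }

    cl_kernel load_cnn_kernel(cl_context ctx, cl_device_id dev)
    {
        size_t len;
        const unsigned char *img = read_file("cnn.aocx", &len);
        cl_int err;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                                                    &img, NULL, &err);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL); /* finalizes the image */
        return clCreateKernel(prog, "cnn_top", &err);  /* "cnn_top" assumed */
    }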
Step S4: obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated.
Specifically, the action timing parameters of the CNN to be accelerated can be obtained in multiple ways; for example, the CNN to be accelerated can be parsed directly, or the parameters can be obtained from a pre-stored database, etc. The embodiment of the present invention is not limited herein.
Here, the action timing parameters can include the action timing of the CNN to be accelerated, for example, action B is executed after action A is completed, and action D is executed after action B is completed. The specific form of the action timing corresponds to the type of the CNN to be accelerated, and the embodiment of the present invention is not limited herein.
Step S5: controlling the FPGA to execute the kernel program according to the action timing in the action timing parameters, and operating on the preset data, so as to achieve acceleration.
Specifically, after all the above steps are completed, the FPGA can be controlled to execute the kernel program according to the action timing in the action timing parameters and, during this process, operate on the preset data, which accelerates the CNN to be accelerated and improves the computing speed.
Here, the preset data may be data of various types, such as face data obtained during face recognition; the preset data may be input into the FPGA from global memory for computation under the control of the CPU, and the embodiment of the present invention is not limited herein.
Here, the action timing parameters may be saved into a data array, the read and write operations on the array controlled, and the data transferred into the global memory of the FPGA; the FPGA kernel program is then started so that it reads the input data, including the action timing parameters and the preset data, from global memory and accelerates the algorithm.
In addition, after the computation ends, the CPU can also obtain the operation results of the FPGA. The process of obtaining the operation results may be to control the FPGA to store the operation results; the CPU then fetches the results from storage and outputs them in various forms, such as charts or voice prompts, and the embodiment of the present invention is not limited herein.
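A hedged host-side sketch of this step follows: it creates global-memory buffers, writes the action timing parameters and the preset data, starts the kernel program, and reads the operation result back for the CPU to output. The queue and kernel handles are assumed to have been created as in the loading sketch above, and all sizes and argument positions are illustrative:

    #include <CL/cl.h>

    /* Sketch of step S5 on the host: move the action timing parameters
     * and the preset input data into the FPGA's global memory, start
     * the kernel program, and read the result back. */
    void run_accelerated_cnn(cl_context ctx, cl_command_queue queue,
                             cl_kernel kernel,
                             const void *timing, size_t timing_bytes,
                             const float *input, size_t n_in,
                             float *output, size_t n_out)
    {
        cl_int err;
        cl_mem d_timing = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                         timing_bytes, NULL, &err);
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                      n_in * sizeof(float), NULL, &err);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                      n_out * sizeof(float), NULL, &err);

        /* Write action timing parameters and preset data to global memory. */
        clEnqueueWriteBuffer(queue, d_timing, CL_TRUE, 0, timing_bytes,
                             timing, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, n_in * sizeof(float),
                             input, 0, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_timing);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_in);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_out);

        /* Start the kernel program. */
        size_t gws = n_out;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);

        /* The CPU fetches the operation result from device storage. */
        clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n_out * sizeof(float),
                            output, 0, NULL, NULL);

        clReleaseMemObject(d_timing);
        clReleaseMemObject(d_in);
        clReleaseMemObject(d_out);
    }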
It should be noted that deep learning, as a branch of machine learning, is one of the rapidly developing fields in artificial intelligence and can help computers understand large amounts of data in the form of images, sounds, and text. In recent years, as deep learning open-source tools such as caffe (Convolutional Architecture for Fast Feature Embedding) have matured, deep learning technology has developed rapidly, and deep learning is now widely applied in fields such as face recognition, speech recognition, precision medicine, and autonomous driving. CNN is a type of artificial neural network and the first deep learning algorithm to truly succeed in training a multi-layer network structure. Developers use computationally intensive algorithms to create CNNs and implement them on various platforms. Because a CNN processes data with multiple interconnected layers of neurons and can imitate the behavior of biological visual nerves to achieve high recognition accuracy, it has become a research hotspot in speech analysis and image recognition. Check reading systems, OCR (Optical Character Recognition) and handwriting recognition systems, face recognition and license plate recognition in street view, and face recognition in the France Telecom video conference system all use CNNs.
Most existing CNN implementations are based on general-purpose CPUs. In a CNN network structure, the computations within a layer are independent and uncorrelated, while the structure between layers can be understood as a pipeline. Because of the special computing pattern of CNNs, a general-purpose CPU cannot, due to its own characteristics, fully exploit the parallelism inside a CNN, so implementing a CNN on a CPU is inefficient and can hardly meet performance requirements. Recently, accelerators based on FPGAs, GPUs (Graphics Processing Units) and even ASICs (Application-Specific Integrated Circuits) have been proposed one after another to improve CNN performance. Among these solutions, FPGA-based accelerators have attracted more and more researchers' attention due to their better performance, high energy efficiency, fast development cycle and reconfigurability. As a compute-intensive acceleration component, an FPGA accelerates by mapping the algorithm onto parallel hardware: the hardware modules designed on the FPGA can execute in parallel, and the interconnection of module inputs and outputs, together with the pipeline structure provided by the FPGA, matches the CNN algorithm well, fully exploiting the parallelism inside the network structure of the algorithm and reducing energy consumption while increasing operation speed. Scholars have previously implemented CNNs of different structures on FPGAs for simple real-time image recognition or classification, but most of these implementations target only the computation-heavy convolutional layers or are based on a specific neural network; for example, Aydonat et al. proposed a brand-new CNN implementation framework to complete FPGA heterogeneous acceleration of the AlexNet network. When R&D personnel need to apply FPGA heterogeneous acceleration to a new convolutional neural network, they have to redesign the FPGA implementation architecture according to the specific network structure of the new network, which gives poor versatility and flexibility.
The present invention provides an acceleration method for a convolutional neural network, including receiving in advance a plurality of preset types of calculation operation models in a preset convolutional neural network (CNN); obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models; controlling the field-programmable gate array (FPGA) of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated; obtaining action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated; and controlling the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
It can be seen that, in the present invention, whenever an acceleration operation is performed on any CNN to be accelerated, the calculation operation models capable of realizing each calculation operation of that CNN can be obtained from the plurality of preset types of calculation operation models in the preset CNN as to-be-used calculation operation models; the FPGA in the accelerator card can then be controlled to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated; next, the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated can be obtained, and the FPGA can be controlled to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration. With the present invention, any accelerator card can perform the acceleration operation on any CNN to be accelerated; there is no need to develop multiple kinds of accelerator cards, which gives strong flexibility and saves R&D costs.
On the basis of the above embodiments:
As a preferred embodiment, obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated is specifically:
converting the CNN to be accelerated into the CNN to be accelerated of a preset deep learning framework;
obtaining the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated of the preset deep learning framework.
Specifically, considering that the CNN to be accelerated may come from any of multiple types of deep learning frameworks, obtaining its action timing parameters from different types of frameworks would require pre-building all such frameworks in the CPU. In the embodiment of the present invention, to save resources, only one preset deep learning framework may be built in the CPU; in this case, the CNN to be accelerated only needs to be converted into the CNN to be accelerated of the preset framework, and the CPU can then obtain the action timing parameters of the CNN to be accelerated, which saves resources.
Of course, the action timing parameters may also be obtained in other ways; for example, multiple deep learning frameworks may be built in the CPU in advance and the action timing parameters of the CNN to be accelerated obtained directly, etc. The embodiment of the present invention is not limited herein.
As a preferred embodiment, the preset deep learning framework is caffe or TensorFlow.
Specifically, caffe and TensorFlow are both commonly used deep learning frameworks; in this case, if the CNN to be accelerated is already a caffe or TensorFlow model, no deep learning framework conversion is needed, which further saves computing resources.
Of course, besides caffe and TensorFlow, the preset deep learning framework may also be of other types, which is not limited in this embodiment of the present invention.
As a preferred embodiment, controlling the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated is specifically:
controlling the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models and through its own hardware compilation platform, the kernel program for executing the CNN to be accelerated.
Specifically, compiling the kernel program with the FPGA's own hardware compilation platform saves costs and avoids exporting the data, improving work efficiency.
Of course, besides compiling the kernel program with the FPGA's own hardware compilation platform, other ways may also be used; the embodiment of the present invention is not limited herein.
As a preferred embodiment, receiving in advance the plurality of preset types of calculation operation models in the preset convolutional neural network (CNN) is specifically:
receiving in advance a plurality of preset types of calculation operation models in the preset CNN built with OpenCL (Open Computing Language).
Specifically, OpenCL has the advantages of a simple structure and convenient use.
Here, the embodiment of the present invention can take advantage of the fact that the calculation modules within a CNN network layer are independent and uncorrelated: the network layer calculation modules commonly used in CNNs can each be implemented in the FPGA high-level programming language OpenCL, the OpenCL parallel optimization design completed, and a plurality of preset types of calculation operation models constructed; all the calculation operation models can be built into a network layer calculation library.
Of course, besides OpenCL, the calculation operation models may also be implemented in other programming languages; the embodiment of the present invention is not limited herein.
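As a second illustration of such a network layer module (alongside the ReLU sketch in the earlier section), a 2x2, stride-2 max-pooling calculation operation model might be written in OpenCL C as follows; the kernel name, argument layout and dimensions are assumptions rather than the patent's code:

    /* Sketch of a 2x2, stride-2 max-pooling calculation operation model
     * in OpenCL C. One work-item produces one output element; w and h
     * are the input width and height of one feature map. */
    __kernel void maxpool2x2(__global const float *restrict in,
                             __global float *restrict out,
                             const int w, const int h)
    {
        int ox = get_global_id(0);           /* output column */
        int oy = get_global_id(1);           /* output row    */
        int ix = ox * 2, iy = oy * 2;

        float m = in[iy * w + ix];
        if (ix + 1 < w) m = fmax(m, in[iy * w + ix + 1]);
        if (iy + 1 < h) m = fmax(m, in[(iy + 1) * w + ix]);
        if (ix + 1 < w && iy + 1 < h)
            m = fmax(m, in[(iy + 1) * w + ix + 1]);

        out[oy * (w / 2) + ox] = m;
    }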
As a preferred embodiment, the preset types include a convolution operation, a pooling operation, ReLU (Rectified Linear Unit), and the Norm function.
Specifically, the convolution operation, the pooling operation, ReLU and the Norm function are all calculation operations commonly used in various CNNs and can implement many different types of CNN well.
Of course, the preset types may also include various other types, which is not limited in this embodiment of the present invention.
Referring to FIG. 2, FIG. 2 shows an acceleration apparatus for a convolutional neural network provided by the present invention, including:
a receiving module 1, configured to receive in advance a plurality of preset types of calculation operation models in a preset convolutional neural network (CNN);
a first obtaining module 2, configured to obtain, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of the CNN to be accelerated, as to-be-used calculation operation models;
a first control module 3, configured to control the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
a second obtaining module 4, configured to obtain action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated;
a second control module 5, configured to control the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
As a preferred embodiment, the second obtaining module 4 includes:
a conversion module, configured to convert the CNN to be accelerated into the CNN to be accelerated of a preset deep learning framework;
an obtaining submodule, configured to obtain the action timing parameters containing the action timing of each calculation operation of the CNN to be accelerated of the preset deep learning framework.
For an introduction to the acceleration apparatus for a convolutional neural network provided by the present invention, reference may be made to the foregoing embodiments of the acceleration method, and details are not repeated here.
Referring to FIG. 3, FIG. 3 shows an acceleration device for a convolutional neural network provided by the present invention, including:
a memory 6, configured to store a computer program;
a processor 7, configured to implement the steps of the acceleration method for a convolutional neural network of the foregoing embodiments when executing the computer program.
For an introduction to the acceleration device for a convolutional neural network provided by the present invention, reference may be made to the foregoing embodiments of the acceleration method, and details are not repeated here.
The present invention also provides a computer-readable storage medium storing a computer program; when the computer program is executed by the processor 7, the steps of the acceleration method for a convolutional neural network of the foregoing embodiments are implemented.
For an introduction to the computer-readable storage medium provided by the present invention, reference may be made to the foregoing embodiments of the acceleration method, and details are not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and reference may be made to the method part where relevant.
It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article or device that includes the element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. An acceleration method for a convolutional neural network, characterized by comprising:
    receiving in advance a plurality of preset types of calculation operation models in a preset convolutional neural network CNN;
    obtaining, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models;
    controlling a field-programmable gate array FPGA of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
    obtaining action timing parameters containing the action timing of each of the calculation operations of the CNN to be accelerated;
    controlling the FPGA to execute the kernel program according to the action timing in the action timing parameters, and operating on preset data, so as to achieve acceleration.
  2. The acceleration method according to claim 1, characterized in that obtaining the action timing parameters containing the action timing of each of the calculation operations of the CNN to be accelerated is specifically:
    converting the CNN to be accelerated into the CNN to be accelerated of a preset deep learning framework;
    obtaining the action timing parameters containing the action timing of each of the calculation operations of the CNN to be accelerated of the preset deep learning framework.
  3. The acceleration method according to claim 2, characterized in that the preset deep learning framework is caffe or TensorFlow.
  4. The acceleration method according to claim 2, characterized in that controlling the field-programmable gate array FPGA of the accelerator card to compile, according to the to-be-used calculation operation models, the kernel program for executing the CNN to be accelerated is specifically:
    controlling the FPGA of the accelerator card to compile, according to the to-be-used calculation operation models and through its own hardware compilation platform, the kernel program for executing the CNN to be accelerated.
  5. The acceleration method according to claim 4, characterized in that receiving in advance the plurality of preset types of calculation operation models in the preset convolutional neural network CNN is specifically:
    receiving in advance a plurality of preset types of calculation operation models in the preset convolutional neural network CNN built with the Open Computing Language OpenCL.
  6. The acceleration method according to any one of claims 1 to 5, characterized in that the preset types include a convolution operation, a pooling operation, the rectified linear unit function ReLU, and the Norm function.
  7. An acceleration apparatus for a convolutional neural network, characterized by comprising:
    a receiving module, configured to receive in advance a plurality of preset types of calculation operation models in a preset convolutional neural network CNN;
    a first obtaining module, configured to obtain, from the plurality of calculation operation models, the calculation operation models capable of realizing each calculation operation of a CNN to be accelerated, as to-be-used calculation operation models;
    a first control module, configured to control a field-programmable gate array FPGA of an accelerator card to compile, according to the to-be-used calculation operation models, a kernel program for executing the CNN to be accelerated;
    a second obtaining module, configured to obtain action timing parameters containing the action timing of each of the calculation operations of the CNN to be accelerated;
    a second control module, configured to control the FPGA to execute the kernel program according to the action timing in the action timing parameters and to operate on preset data, so as to achieve acceleration.
  8. The acceleration apparatus according to claim 7, characterized in that the second obtaining module comprises:
    a conversion module, configured to convert the CNN to be accelerated into the CNN to be accelerated of a preset deep learning framework;
    an obtaining submodule, configured to obtain the action timing parameters containing the action timing of each of the calculation operations of the CNN to be accelerated of the preset deep learning framework.
  9. An acceleration device for a convolutional neural network, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the acceleration method for a convolutional neural network according to any one of claims 1 to 6 when executing the computer program.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the acceleration method for a convolutional neural network according to any one of claims 1 to 6 are implemented.
PCT/CN2019/103637 2019-01-08 2019-08-30 Acceleration method, apparatus, device and storage medium for a convolutional neural network WO2020143236A1

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910016345.9 2019-01-08
CN201910016345.9A (CN109858610A) 2019-01-08 Acceleration method, apparatus, device and storage medium for a convolutional neural network

Publications (1)

Publication Number Publication Date
WO2020143236A1

Family

ID=66894174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103637 (WO2020143236A1) 2019-01-08 2019-08-30 Acceleration method, apparatus, device and storage medium for a convolutional neural network

Country Status (2)

Country Link
CN (1) CN109858610A
WO (1) WO2020143236A1

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858610A * 2019-01-08 2019-06-07 Guangdong Inspur Big Data Research Co., Ltd. — Acceleration method, apparatus, device and storage medium for a convolutional neural network
CN114365148A * 2019-10-22 2022-04-15 Shenzhen Kunyun Information Technology Co., Ltd. — Neural network operation system and method
CN110929860B * 2019-11-07 2020-10-23 Shenzhen Yuntian Lifei Technology Co., Ltd. — Convolution acceleration operation method and apparatus, storage medium and terminal device
CN115829064B * 2023-02-17 2023-05-05 Shandong Inspur Science Research Institute Co., Ltd. — Federated learning acceleration method, apparatus, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657581A * 2017-09-28 2018-02-02 National University of Defense Technology — Convolutional neural network CNN hardware accelerator and acceleration method
US20180300556A1 (en) * 2017-04-17 2018-10-18 Intel Corporation Person tracking and privacy and acceleration of data using autonomous machines
CN108764466A * 2018-03-07 2018-11-06 Southeast University — Convolutional neural network hardware based on a field-programmable gate array and acceleration method therefor
CN109858610A * 2019-01-08 2019-06-07 Guangdong Inspur Big Data Research Co., Ltd. — Acceleration method, apparatus, device and storage medium for a convolutional neural network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726328B2 (en) * 2015-10-09 2020-07-28 Altera Corporation Method and apparatus for designing and implementing a convolution neural net accelerator
CN107463990A * 2016-06-02 2017-12-12 National Computer Network and Information Security Management Center — FPGA parallel acceleration method for a convolutional neural network
US10656962B2 (en) * 2016-10-21 2020-05-19 International Business Machines Corporation Accelerate deep neural network in an FPGA
CN107992299B * 2017-11-27 2021-08-10 Zhengzhou Yunhai Information Technology Co., Ltd. — Neural network hyperparameter extraction and conversion method, system, apparatus and storage medium
CN108710941A * 2018-04-11 2018-10-26 Hangzhou Feishu Technology Co., Ltd. — Hard acceleration method and apparatus for a neural network model of an electronic device


Also Published As

Publication number Publication date
CN109858610A 2019-06-07

Similar Documents

Publication Publication Date Title
WO2020143236A1 — Acceleration method, apparatus, device and storage medium for a convolutional neural network
Gschwend Zynqnet: An fpga-accelerated embedded convolutional neural network
Guan et al. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates
CN111967468 — Implementation method of a lightweight object detection neural network based on FPGA
WO2022068627A1 (zh) 一种数据处理方法及相关设备
Wang et al. Fpga implementation of object detection accelerator based on vitis-ai
Verma et al. Performance evaluation of deep learning compilers for edge inference
CN113157917 — OpenCL-based establishment of an optimized classification model, and optimized classification method and system
Pham et al. AIoT solution survey and comparison in machine learning on low-cost microcontroller
Wang et al. Briefly Analysis about CNN Accelerator based on FPGA
Qian et al. R-cnn object detection inference with deep learning accelerator
Liang Ascend AI Processor Architecture and Programming: Principles and Applications of CANN
Gao et al. Optimized parallel implementation of face detection based on embedded heterogeneous many-core architecture
CN111831285 — Code conversion method, system and application for an in-memory computing platform
Luo et al. ML-CGRA: an integrated compilation framework to enable efficient machine learning acceleration on CGRAs
CN111143208 — Verification method for implementing AI algorithms on FPGA assisted by processor technology
Tapiador et al. Comprehensive evaluation of opencl-based convolutional neural network accelerators in xilinx and altera fpgas
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
Yu et al. INCAME: Interruptible CNN accelerator for multirobot exploration
Palkowski et al. Parallel tiled code generation with loop permutation within tiles
Zhou et al. Parallelizing convolutional neural network for the handwriting recognition problems with different architectures
Li et al. Accelerating gpu computing at runtime with binary optimization
Sun Deployment of neural networks through PYNQ
CN113673704 — Relational network inference optimization method based on software-hardware co-acceleration
Rakhimov et al. The possibility of CUDA technology in deep learning processes

Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 19909099; country of ref document: EP; kind code of ref document: A1)
NENP — Non-entry into the national phase (ref country code: DE)
122 — EP: PCT application non-entry in European phase (ref document number: 19909099; country of ref document: EP; kind code of ref document: A1)