WO2020118608A1 - Hardware acceleration method and apparatus for a deconvolution neural network, and electronic device - Google Patents

Hardware acceleration method and apparatus for a deconvolution neural network, and electronic device

Info

Publication number
WO2020118608A1
Authority
WO
WIPO (PCT)
Prior art keywords
network layer
memory
current
input data
calculation result
Prior art date
Application number
PCT/CN2018/120861
Other languages
English (en)
French (fr)
Inventor
刘双龙
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司
Priority to CN201880083893.7A priority Critical patent/CN111542839B/zh
Priority to PCT/CN2018/120861 priority patent/WO2020118608A1/zh
Publication of WO2020118608A1 publication Critical patent/WO2020118608A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • the invention relates to the field of artificial intelligence, in particular to a hardware acceleration method, device, electronic equipment and storage medium of a deconvolution neural network.
  • Generative adversarial networks (GAN) are one such deep learning model.
  • the GAN network is mainly composed of two modules: a generative model (Generative Model) and a discriminative model (Discriminative Model). The two modules compete with each other to perform unsupervised or semi-supervised learning.
  • the generative model is used to generate data in the GAN network.
  • in scenarios where large amounts of training data are already available, the generative model can perform unsupervised learning to simulate the distribution of this high-dimensional data; in scenarios where data is scarce, the generative model can help generate data to increase the amount of data, so that semi-supervised learning can be used to improve learning efficiency. The GAN network therefore has wide auxiliary applications in many scenarios such as machine translation, image deblurring, image inpainting, text-to-image conversion, and other research fields.
  • unlike the discriminative model, the generative model is generally a deconvolution neural network, composed of a series of deconvolution layers.
  • existing generative models, i.e. deconvolution neural networks, are mainly implemented through convolution operations on a CPU or GPU, so the computational efficiency is extremely low and the data utilization rate is also very low; existing accelerators for GAN generative models on field programmable gate arrays (FPGA) do not take the differences between deconvolution networks and convolution networks into account, so the efficiency improvement is limited.
  • Embodiments of the present invention provide a hardware acceleration method, device, electronic device, and storage medium of a deconvolution neural network, which can improve the efficiency of data transmission and utilization.
  • an embodiment of the present invention provides a hardware acceleration method for a deconvolution neural network, including:
  • the next network layer includes the weights of that layer
  • the input data includes initial data stored in an off-chip memory
  • the acquiring input data of the current network layer includes:
  • the performing of deconvolution calculation on the input data in the current network layer to obtain the current calculation result and the writing of the current calculation result into the second memory, the current network layer including the weights of that layer, includes:
  • the calculation result is input to the second memory.
  • the acquiring the weight of the current network layer and inputting the current network layer includes:
  • the performing matrix operation on the input data and the weights of the current network layer to obtain an operation result includes:
  • the input data and the weight matrix of the current network layer are multiplied and accumulated to obtain a calculation result.
  • the acquiring the input data of the next network layer based on the current calculation result in the second memory includes:
  • the next network layer reads the operation result from the second memory as input data.
  • the first memory and the second memory are on-chip memories of a field programmable gate array (FPGA).
  • an embodiment of the present invention provides a hardware acceleration device for a deconvolution neural network, including:
  • the first obtaining module is used to obtain the input data of the current network layer, where the input data is the calculation result of the previous network layer stored in the first memory;
  • a first calculation module configured to perform deconvolution calculation on the input data in the current network layer to obtain a current calculation result, and input the current calculation result into a second memory, where the current network layer includes the weight of the layer;
  • a second obtaining module configured to obtain input data of the next network layer based on the current calculation result in the second memory
  • the second calculation module is used to perform deconvolution calculation on the input data of the next network layer in the next network layer to obtain the current calculation result, and to input the current calculation result into the first memory, the next network layer including the weights of that layer;
  • the repeating module is used to repeatedly call the above modules until the last layer of the deconvolution neural network and output the result.
  • an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the steps in the hardware acceleration method of the deconvolution neural network provided by the embodiments of the present invention are implemented.
  • an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in the hardware acceleration method of the deconvolution neural network are implemented.
  • S1, input data of the current network layer is obtained, the input data being the calculation result of the previous network layer stored in the first memory;
  • S2, deconvolution calculation is performed on the input data in the current network layer to obtain the current calculation result, and the current calculation result is input into the second memory, the current network layer including the weights of the current network layer;
  • S3, based on the current calculation result in the second memory, the input data of the next network layer is obtained;
  • S4, deconvolution calculation is performed on the input data of the next network layer in the next network layer to obtain the current calculation result, and the current calculation result is input into the first memory, the next network layer including the weights of the next network layer;
  • Steps S1, S2, S3 and S4 are repeated until the last layer of the deconvolution neural network is reached and the result is output. Since the current network layer reads the calculation result of the previous network layer from the on-chip first memory, and after calculation by the deconvolution module the result is input into the on-chip second memory and used as the input data of the next network layer, and that input data is in turn calculated by the deconvolution module with the result returned to the on-chip first memory, fusion between the network layers is achieved, repeated reading and writing of off-chip data is effectively avoided, the efficiency of data transfer and utilization is improved, and the calculation speed of the deconvolution neural network is thereby increased.
  • FIG. 1 is a hardware accelerated network architecture diagram of a deconvolution neural network that may be used in an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a hardware acceleration method of a deconvolution neural network provided by an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a hardware acceleration device for a deconvolution neural network provided by an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • the above network architecture includes two parts: a host computer module (HOST) and an FPGA acceleration module.
  • the host computer module includes a CPU and a DDR Memory (double data rate memory), where the CPU can be used to provide a clock source to the FPGA acceleration module, and can send control instructions to read the data stored in the DDR memory into the FPGA acceleration module, or to write the output data of the FPGA acceleration module into the DDR memory.
  • the FPGA acceleration module includes a control unit (Control Unit), a direct memory access unit (Direct Memory Access, DMA), an on-chip buffer A (Buffer A), an on-chip buffer B (Buffer B), a deconvolution operation unit (DeConv) and a layer counting unit (Layer Count). The control unit is used to control the matrix size and number of channels of the input and output data and the weight input of each layer; the direct memory access unit connects the DDR memory directly to on-chip buffer A and on-chip buffer B, so that data in memory can be manipulated directly to improve read/write speed; on-chip buffer A and on-chip buffer B are used to temporarily store the data input by the host computer module or the output results of the deconvolution operation unit; the deconvolution operation unit is used to perform deconvolution calculation on the data in on-chip buffer A or on-chip buffer B together with the weights of each layer; the layer counting unit is used to notify the control unit which of on-chip buffer A and on-chip buffer B serves as the data input of the deconvolution operation unit and which serves as the output for the calculation result in a given deconvolution calculation, so that the weight data can be delivered to the data input side.
  • an embodiment of the present invention provides a hardware acceleration method of a deconvolution neural network, including the following steps:
  • the first memory may be the on-chip buffer A or the on-chip buffer B in the network architecture; the input data of the current network layer may be read from the first memory that temporarily stores the calculation result output by the previous network layer; if the current network layer is the first layer, it can also be read from the DDR memory of the host computer module.
  • the input data may be two-dimensional matrix data or high-dimensional data.
  • the second memory may be the one that is not used to cache the input data in the on-chip buffer A or the on-chip buffer B in the network architecture.
  • the current calculation result stored in the second memory in step S2 is read into the deconvolution operation unit in the network architecture as the input data of the next network layer; the input data is matrix data, and before it is input into the deconvolution operation unit for deconvolution calculation, the matrix data can also be processed by the control unit, such as by padding and cropping.
  • the next network layer includes the weights of that layer.
  • each layer sequentially performs the above steps of obtaining input data, performing deconvolution calculation and outputting the calculation result, until the last layer of the deconvolution neural network is reached; the calculation result of the last layer is then processed by the control unit and output to the DDR memory.
  • the current network layer, the next network layer and the previous network layer are relative, specifically determined by the layer counting unit in the above network architecture.
  • in one deconvolution calculation, the layer counting unit takes the first-memory side, from which the input data is read this time, as the current network layer, and the second-memory side, to which the result is output after calculation, as the next network layer, so the current network layer is the previous network layer relative to the next network layer.
  • the input data includes initial data stored in an off-chip memory
  • the acquiring input data of the current network layer includes:
  • the off-chip memory can be the DDR memory in the host computer module of the above network architecture, which supports read and write operations, and the initial data can be pixel data of an image, speech data, or semantic data of text; the initial data stored in the off-chip memory is read into the first memory, and the initial data is read only once.
  • further, the first memory is on-chip buffer A or on-chip buffer B of the FPGA acceleration module in the network architecture, which speeds up data transfer.
  • the performing of deconvolution calculation on the input data in the current network layer to obtain the current calculation result and the writing of the current calculation result into the second memory, the current network layer including the weights of that layer, includes:
  • the calculation result is input to the second memory.
  • further, the first memory is on-chip buffer A or on-chip buffer B of the FPGA acceleration module in the network architecture, which speeds up data transfer.
  • the input data and weights of the current network layer can be a square matrix, that is, the matrix has the same number of rows and columns.
  • for example, the input data of the current network layer is 5×5 and the weights are 3×3; the shape of the square matrices can be controlled by the control unit in the above network architecture.
  • for example, the control unit can pad the input data matrix of the current network layer with zeros so that the matrix becomes 7×7, and so on.
  • the acquiring the weight of the current network layer and inputting the current network layer includes:
  • the off-chip memory is the DDR memory of the above network architecture, and the weights of the current network layer are the above-mentioned square weight matrix; when the deconvolution calculation is performed for the current network layer, the direct memory access unit directly transfers the weight matrix of the current network layer stored in the DDR memory to the buffer where the current network layer is located.
  • the performing matrix operation on the input data and the weights of the current network layer to obtain an operation result includes:
  • the input data and the weight matrix of the current network layer are multiplied and accumulated to obtain a calculation result.
  • the input data and weight matrix of the current network layer stored in the same buffer are input to the above-mentioned deconvolution operation unit to perform matrix multiplication and accumulation to obtain the above calculation result.
  • the acquiring the input data of the next network layer based on the current calculation result in the second memory includes:
  • the next network layer reads the operation result from the second memory as input data.
  • the input data and the weight matrix of the current network layer are deconvolved, and the calculation result is obtained and input into the second memory; when the next network layer starts its deconvolution calculation, it reads the above calculation result from the second memory and uses it as the input data of that layer.
  • for example, the input data of the current network layer is 2×2, which can be padded to 6×6, and the weight matrix of this layer is 3×3; after deconvolution calculation of the two matrices, a 4×4 result can be obtained.
  • the calculation result is input into the above-mentioned second memory and used as input data of the next network layer.
  • the above calculation result is matrix data, and the above calculation result can also be processed by the control unit, such as padding and cropping, before being input to the next network layer.
  • the first memory and the second memory are on-chip memories of a field programmable gate array (FPGA).
  • the first memory and the second memory are the on-chip buffer A and the on-chip buffer B provided on the FPGA acceleration module, and are connected to the off-chip memory through a direct memory access unit to improve the efficiency of data transfer and utilization.
  • the current network layer, the next network layer and the previous network layer are relative, specifically determined by the layer counting unit in the above network architecture.
  • in one deconvolution calculation, the layer counting unit takes the on-chip buffer A side, from which the input data is read this time, as the current network layer, and the on-chip buffer B side, to which the result is output after calculation, as the next network layer, so the current network layer is the previous network layer relative to the next network layer; the layer counting unit then informs the control unit to read the weight data of the current network layer from the direct memory access unit into on-chip buffer A, where the current network layer is located.
  • the control unit of the FPGA acceleration module can be configured as needed to meet the requirements of different deconvolution neural networks for different parameters, such as matrix padding of the input data, cropping of the output data, the sliding stride of the deconvolution calculation, and the number of channels.
  • for example, the control unit can pad the input data with 0 or other values, the sliding stride of the deconvolution calculation can be set to 1 or 2, and the number of channels of the input data can be 3 for RGB color pixels or 1 for grayscale pixels; this improves the versatility of the network architecture and meets the needs of different scenarios.
  • the above optional embodiments are supplementary embodiments of the hardware acceleration method of the deconvolution neural network in FIG. 2, and performing the methods in the above optional embodiments can achieve the corresponding beneficial effects; to avoid repetition, they are not repeated here.
  • FIG. 3 is a schematic structural diagram of a hardware acceleration device for a deconvolution neural network according to an embodiment of the present invention. As shown in FIG. 3, the device includes:
  • the first obtaining module 201 is configured to obtain input data of a current network layer, where the input data is a calculation result of a previous network layer stored in the first memory;
  • the first calculation module 202 is configured to perform deconvolution calculation on the input data in the current network layer to obtain the current calculation result, and input the current calculation result into the second memory, where the current network layer includes the weight of the layer ;
  • the second obtaining module 203 is configured to obtain the input data of the next network layer based on the current calculation result in the second memory;
  • the second calculation module 204 is configured to perform deconvolution calculation on the input data of the next network layer in the next network layer to obtain the current calculation result, and input the current calculation result into the first memory, the The next network layer includes the weight of this layer;
  • the repeating module 205 is used to repeatedly call the above modules until the last layer of the deconvolution neural network and output the result.
  • the first calculation module includes:
  • the obtaining unit 2021 is used to obtain the weight of the current network layer and input the current network layer;
  • the operation unit 2022 is configured to perform a matrix operation on the input data and the weights of the current network layer to obtain an operation result
  • the input unit 2023 is configured to input the calculation result into the second memory.
  • An embodiment of the present invention provides a hardware acceleration device for a deconvolution neural network that can implement various implementation methods in the method embodiment of FIG. 2 and corresponding beneficial effects. To avoid repetition, details are not described herein again.
  • FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 4, it includes: a memory 402, a processor 401, and a computer program stored on the memory 402 and executable on the processor 401, wherein:
  • the processor 401 is used to call the computer program stored in the memory 402 and perform the following steps:
  • the next network layer includes the weights of that layer
  • the input data includes initial data stored in an off-chip memory
  • the processor 401 executes the acquiring input data of the current network layer, including:
  • the processor 401 performs the deconvolution calculation on the input data in the current network layer to obtain the current calculation result and inputs the current calculation result into the second memory; the current network layer includes the weights of that layer, and the calculation includes:
  • the calculation result is input to the second memory.
  • the process of obtaining the weight of the current network layer by the processor 401 and inputting the current network layer includes:
  • the matrix operation performed by the processor 401 on the input data and the weights of the current network layer to obtain an operation result includes:
  • the input data and the weight matrix of the current network layer are multiplied and accumulated to obtain a calculation result.
  • the process performed by the processor 401 to obtain input data of the next network layer based on the current calculation result in the second memory includes:
  • the next network layer reads the operation result from the second memory as input data.
  • the first memory and the second memory are on-chip memories of a field programmable gate array (FPGA).
  • the processor 401 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip.
  • since the processor 401 can realize the steps of the hardware acceleration method of the deconvolution neural network when executing the computer program stored in the memory 402, all the embodiments of the hardware acceleration method of the deconvolution neural network are applicable to the above electronic device, and all can achieve the same or similar beneficial effects.
  • specific embodiments of the present invention also provide a computer-readable storage medium 402 that stores a computer program which, when executed by a processor, implements the steps of the hardware acceleration method of the above-described deconvolution neural network.
  • the steps of the hardware acceleration method of the above-described deconvolution neural network can be implemented, which can increase the speed of image processing.
  • the computer program of the computer-readable storage medium includes computer program code
  • the computer program code may be in a source code form, an object code form, an executable file, or some intermediate form, and the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals and software distribution media, etc.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a division by logical function.
  • in actual implementation there may be other ways of division; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or software program modules.
  • if the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
  • the technical solution of the present application in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • the program may be stored in a computer-readable memory, and the memory may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Semiconductor Memories (AREA)

Abstract

A hardware acceleration method, apparatus, electronic device and storage medium for a deconvolution neural network. The method includes: acquiring input data of a current network layer (S1), the input data being the calculation result of a previous network layer stored in a first memory; performing deconvolution calculation on the input data in the current network layer to obtain a current calculation result, and writing the current calculation result into a second memory (S2), the current network layer including the weights of that layer; acquiring input data of a next network layer on the basis of the current calculation result in the second memory (S3); performing deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result, and writing the current calculation result into the first memory (S4), the next network layer including the weights of that layer; and repeating the above steps until the last layer of the deconvolution neural network is reached and outputting the result (S5). The method improves the efficiency of data transfer and utilization.

Description

Hardware acceleration method and apparatus for a deconvolution neural network, and electronic device
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a hardware acceleration method, apparatus, electronic device and storage medium for a deconvolution neural network.
Background
In recent years, with the rapid growth of computing power and data in the field of artificial intelligence, various deep learning models based on neural networks have emerged and attracted extensive attention and research; generative adversarial networks (GAN, Generative Adversarial Networks) are one of them. A GAN is mainly composed of two modules: a generative model (Generative Model) and a discriminative model (Discriminative Model); the two modules compete with each other to perform unsupervised or semi-supervised learning.
In a GAN, the generative model is used to generate data. In scenarios where large amounts of training data such as image, speech or text data are already available, the generative model can perform unsupervised learning to simulate the distribution of this high-dimensional data; in scenarios where data is scarce, the generative model can help generate data and increase the amount of data, so that semi-supervised learning can be used to improve learning efficiency. GANs therefore have wide auxiliary applications in many research fields such as machine translation, image deblurring, image inpainting, and text-to-image conversion.
However, unlike the discriminative model, which is usually a convolutional neural network, the generative model is generally a deconvolution neural network, that is, it consists of a series of deconvolution layers. Existing generative models, i.e. deconvolution neural networks, are mainly implemented on a CPU or GPU through convolution operations, so the computational efficiency is extremely low and the data utilization rate is also very low; existing accelerators for GAN generative models on field programmable gate arrays (FPGA) do not take into account the differences between deconvolution networks and convolution networks, so the efficiency improvement is limited.
Summary of the Invention
Embodiments of the present invention provide a hardware acceleration method, apparatus, electronic device and storage medium for a deconvolution neural network, which can improve the efficiency of data transfer and utilization.
In a first aspect, an embodiment of the present invention provides a hardware acceleration method for a deconvolution neural network, including:
S1: acquiring input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory;
S2: performing deconvolution calculation on the input data in the current network layer to obtain a current calculation result, and writing the current calculation result into a second memory, the current network layer including the weights of that layer;
S3: acquiring input data of a next network layer on the basis of the current calculation result in the second memory;
S4: performing deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result, and writing the current calculation result into the first memory, the next network layer including the weights of that layer;
S5: repeating steps S1, S2, S3 and S4 until the last layer of the deconvolution neural network is reached, and outputting the result.
Optionally, the input data includes initial data stored in an off-chip memory, and the acquiring of the input data of the current network layer includes:
reading the initial data stored in the off-chip memory into the first memory.
Optionally, the performing of deconvolution calculation on the input data in the current network layer to obtain the current calculation result and the writing of the current calculation result into the second memory, the current network layer including the weights of that layer, includes:
acquiring the weights of the current network layer and feeding them into the current network layer;
performing a matrix operation on the input data and the weights of the current network layer to obtain an operation result;
writing the operation result into the second memory.
Optionally, the acquiring of the weights of the current network layer and the feeding of them into the current network layer includes:
reading the weight matrix of the current network layer stored in the off-chip memory into the current network layer.
Optionally, the performing of a matrix operation on the input data and the weights of the current network layer to obtain an operation result includes:
multiplying the input data by the weight matrix of the current network layer and accumulating the products to obtain a calculation result.
Optionally, the acquiring of the input data of the next network layer on the basis of the current calculation result in the second memory includes:
the next network layer reading the operation result written into the second memory as its input data.
Optionally, the first memory and the second memory are on-chip memories of a field programmable gate array (FPGA).
In a second aspect, an embodiment of the present invention provides a hardware acceleration apparatus for a deconvolution neural network, including:
a first acquisition module, configured to acquire input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory;
a first calculation module, configured to perform deconvolution calculation on the input data in the current network layer to obtain a current calculation result and write the current calculation result into a second memory, the current network layer including the weights of that layer;
a second acquisition module, configured to acquire input data of a next network layer on the basis of the current calculation result in the second memory;
a second calculation module, configured to perform deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result and write the current calculation result into the first memory, the next network layer including the weights of that layer;
a repetition module, configured to repeatedly invoke the above modules until the last layer of the deconvolution neural network is reached, and to output the result.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the steps in the hardware acceleration method for a deconvolution neural network provided by the embodiments of the present invention are implemented.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the hardware acceleration method for a deconvolution neural network provided by the embodiments of the present invention are implemented.
In the embodiments of the present invention: S1, input data of a current network layer is acquired, the input data being the calculation result of a previous network layer stored in a first memory; S2, deconvolution calculation is performed on the input data in the current network layer to obtain a current calculation result, and the current calculation result is written into a second memory, the current network layer including the weights of the current network layer; S3, input data of a next network layer is acquired on the basis of the current calculation result in the second memory; S4, deconvolution calculation is performed on the input data of the next network layer in the next network layer to obtain a current calculation result, and the current calculation result is written into the first memory, the next network layer including the weights of the next network layer; S5, steps S1, S2, S3 and S4 are repeated until the last layer of the deconvolution neural network is reached and the result is output. Because the current network layer reads the calculation result of the previous network layer from the on-chip first memory, and after calculation by the deconvolution module the result is written into the on-chip second memory and used as the input data of the next network layer, and the input data of that next network layer is then calculated by the deconvolution module with the result written back into the on-chip first memory, fusion between the network layers is achieved, repeated reading and writing of off-chip data is effectively avoided, the efficiency of data transfer and utilization is improved, and the calculation speed of the deconvolution neural network is thereby increased.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from the structures shown in these drawings without creative effort.
FIG. 1 is a diagram of a network architecture for hardware acceleration of a deconvolution neural network that may be used in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a hardware acceleration method for a deconvolution neural network provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hardware acceleration apparatus for a deconvolution neural network provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
A network architecture for hardware acceleration of a deconvolution neural network that may be used in an embodiment of the present invention is first introduced with reference to the relevant drawings. As shown in FIG. 1, the network architecture consists of two main parts: a host computer module (HOST) and an FPGA acceleration module.
The host computer module includes a CPU and a DDR Memory (double data rate memory). The CPU can be used to provide a clock source to the FPGA acceleration module, and can issue control instructions to read data stored in the DDR memory into the FPGA acceleration module, or to write the output data of the FPGA acceleration module into the DDR memory.
The FPGA acceleration module includes a control unit (Control Unit), a direct memory access unit (Direct Memory Access, DMA), an on-chip buffer A (Buffer A), an on-chip buffer B (Buffer B), a deconvolution operation unit (DeConv) and a layer counting unit (Layer Count). The control unit is used to control the matrix size and the number of channels of the input and output data, and the weight input of each layer. The direct memory access unit connects the DDR memory directly to on-chip buffer A and on-chip buffer B, so that the data in memory can be manipulated directly and the read/write speed is improved. On-chip buffer A and on-chip buffer B are used to temporarily store the data input by the host computer module or the output results of the deconvolution operation unit. The deconvolution operation unit is used to perform deconvolution calculation on the data in on-chip buffer A or on-chip buffer B together with the weights of each layer. The layer counting unit is used to notify the control unit which of on-chip buffer A and on-chip buffer B serves as the data input of the deconvolution operation unit and which serves as the output for the calculation result in a given deconvolution calculation, so that the weight data can be delivered to the data input side.
As shown in FIG. 2, an embodiment of the present invention provides a hardware acceleration method for a deconvolution neural network, including the following steps:
S1: acquiring input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory.
The first memory may be on-chip buffer A or on-chip buffer B in the above network architecture. The input data of the current network layer may be read from the first memory that temporarily stores the calculation result output by the previous network layer; if the current network layer is the first layer, it may also be read from the DDR memory of the host computer module.
It should be noted that the input data may be two-dimensional matrix data or higher-dimensional data.
S2: performing deconvolution calculation on the input data in the current network layer to obtain a current calculation result, and writing the current calculation result into a second memory, the current network layer including the weights of that layer.
The acquired input data of the current network layer is fed into the deconvolution operation unit of the above network architecture, the weights of the current network layer are then read into the deconvolution operation unit, and the matrix data and the weights are matrix-multiplied and then summed to obtain the current calculation result, which is stored in the second memory. The second memory may be whichever of on-chip buffer A and on-chip buffer B in the network architecture is not being used to buffer the input data.
S3: acquiring input data of a next network layer on the basis of the current calculation result in the second memory.
The current calculation result stored in the second memory in step S2 is read into the deconvolution operation unit of the above network architecture as the input data of the next network layer. This input data is matrix data, and before it is fed into the deconvolution operation unit for deconvolution calculation, the matrix data may also be processed by the control unit, for example by padding or cropping.
S4: performing deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result, and writing the current calculation result into the first memory, the next network layer including the weights of that layer.
The input data of the next network layer acquired in step S3 is fed into the deconvolution operation unit of the above network architecture, the weights of that layer are then read into the deconvolution operation unit, and the input data and the weights are matrix-multiplied and then summed to obtain the current calculation result, which is stored in the first memory.
S5: repeating steps S1, S2, S3 and S4 until the last layer of the deconvolution neural network is reached, and outputting the result.
Starting from the first layer of the deconvolution neural network, each layer in turn performs the above steps of acquiring input data, performing the deconvolution calculation and outputting the calculation result, until the last layer of the deconvolution neural network is reached; the calculation result of the last layer may then be processed by the control unit and output to the DDR memory.
It is worth noting that "current network layer", "next network layer" and "previous network layer" are relative terms and are determined by the layer counting unit in the above network architecture. For example, in one deconvolution calculation, the layer counting unit treats the first-memory side, from which the input data is taken this time, as the current network layer, and the second-memory side, to which the result is output after calculation, as the next network layer; the current network layer is therefore the previous network layer relative to the next network layer.
In the embodiments of the present invention: S1, input data of a current network layer is acquired, the input data being the calculation result of a previous network layer stored in a first memory; S2, deconvolution calculation is performed on the input data in the current network layer to obtain a current calculation result, and the current calculation result is written into a second memory, the current network layer including the weights of the current network layer; S3, input data of a next network layer is acquired on the basis of the current calculation result in the second memory; S4, deconvolution calculation is performed on the input data of the next network layer in the next network layer to obtain a current calculation result, and the current calculation result is written into the first memory, the next network layer including the weights of the next network layer; S5, steps S1, S2, S3 and S4 are repeated until the last layer of the deconvolution neural network is reached and the result is output. Because the current network layer reads the calculation result of the previous network layer from the on-chip first memory, and after calculation by the deconvolution module the result is written into the on-chip second memory and used as the input data of the next network layer, and the input data of that next network layer is then calculated by the deconvolution module with the result written back into the on-chip first memory, fusion between the network layers is achieved, repeated reading and writing of off-chip data is effectively avoided, the efficiency of data transfer and utilization is improved, and the calculation speed of the deconvolution neural network is thereby increased.
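As a software-level illustration of the layer fusion described above, the following is a minimal Python sketch under assumptions, not the patent's implementation: run_network, deconv, buf_a and buf_b are hypothetical stand-ins for the FPGA blocks (the DeConv unit, Buffer A and Buffer B), and the even/odd buffer-swapping rule is one natural reading of how the layer counting unit could alternate the two on-chip buffers.

    def run_network(ddr_input, weights, deconv):
        # The initial data is read from the off-chip DDR memory only once.
        buf_a, buf_b = ddr_input, None
        for layer, w in enumerate(weights):
            if layer % 2 == 0:
                # Even layer: Buffer A feeds the DeConv unit, Buffer B receives the result.
                buf_b = deconv(buf_a, w)
            else:
                # Odd layer: the buffer roles are swapped, so the previous result becomes
                # the next layer's input without travelling back to off-chip memory.
                buf_a = deconv(buf_b, w)
        # Only the final layer's result is written back to the DDR memory.
        return buf_b if len(weights) % 2 == 1 else buf_a

Under this assumed scheme, intermediate results stay on chip; only the initial data, the per-layer weights and the final result cross the boundary to the off-chip DDR memory.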
Optionally, the input data includes initial data stored in an off-chip memory, and the acquiring of the input data of the current network layer includes:
reading the initial data stored in the off-chip memory into the first memory.
The off-chip memory may be the DDR memory in the host computer module of the above network architecture, which supports read and write operations, and the initial data may be pixel data of an image, speech data, semantic data of text, and so on. The initial data stored in the off-chip memory is read into the first memory, and this initial data is read only once; furthermore, the first memory is on-chip buffer A or on-chip buffer B of the FPGA acceleration module in the network architecture, which speeds up data transfer.
Optionally, the performing of deconvolution calculation on the input data in the current network layer to obtain the current calculation result and the writing of the current calculation result into the second memory, the current network layer including the weights of that layer, includes:
acquiring the weights of the current network layer and feeding them into the current network layer;
performing a matrix operation on the input data and the weights of the current network layer to obtain an operation result;
writing the operation result into the second memory.
The weights of the current network layer are read through the direct memory access unit of the above network architecture and placed into the buffer that holds the input data of the current network layer; the weights and the input data of the current network layer are then matrix-multiplied and summed in the deconvolution operation unit to obtain the calculation result, which is written into the second memory. Furthermore, the first memory is on-chip buffer A or on-chip buffer B of the FPGA acceleration module in the network architecture, which speeds up data transfer.
It is worth noting that the input data and the weights of the current network layer may be square matrices, i.e. matrices with the same number of rows and columns; for example, the input data of the current network layer is 5×5 and the weights are 3×3. The shape of the square matrices can be controlled by the control unit in the above network architecture; for example, the control unit can pad the input data matrix of the current network layer with zeros so that the matrix becomes 7×7, and so on.
Optionally, the acquiring of the weights of the current network layer and the feeding of them into the current network layer includes:
reading the weight matrix of the current network layer stored in the off-chip memory into the current network layer.
The off-chip memory is the DDR memory of the above network architecture, and the weights of the current network layer are the above-mentioned square weight matrix. When the deconvolution calculation is performed for the current network layer, the weight matrix of the current network layer stored in the DDR memory is transferred by the direct memory access unit directly into the buffer where the current network layer resides.
Optionally, the performing of a matrix operation on the input data and the weights of the current network layer to obtain an operation result includes:
multiplying the input data by the weight matrix of the current network layer and accumulating the products to obtain a calculation result.
The input data and the weight matrix of the current network layer, which are stored in the same buffer, are fed into the deconvolution operation unit for matrix multiplication followed by accumulation, yielding the above calculation result.
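As an illustration of this multiply-and-accumulate step, the minimal numpy sketch below slides a square weight matrix over an already padded input and accumulates the element-wise products into each output element. The function name mac_deconv and the plain nested-loop formulation are assumptions made for clarity; they are not taken from the patent's hardware implementation.

    import numpy as np

    def mac_deconv(padded_input, weights):
        # padded_input: input matrix after zero-padding by the control unit
        # weights: square k x k weight matrix of the current network layer
        k = weights.shape[0]
        rows = padded_input.shape[0] - k + 1
        cols = padded_input.shape[1] - k + 1
        out = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                window = padded_input[i:i + k, j:j + k]
                out[i, j] = np.sum(window * weights)   # multiply, then accumulate
        return out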
Optionally, the acquiring of the input data of the next network layer on the basis of the current calculation result in the second memory includes:
the next network layer reading the operation result written into the second memory as its input data.
Deconvolution calculation is performed on the input data and the weight matrix of the current network layer, and the calculation result is obtained and written into the second memory; when the next network layer starts its deconvolution calculation, the calculation result is read from the second memory and used as the input data of that layer. For example, if the input data of the current network layer is 2×2, it can be padded to 6×6, and the weight matrix of this layer is 3×3; after deconvolution calculation of the two matrices, a 4×4 calculation result can be obtained, which is written into the second memory and used as the input data of the next network layer.
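The sizes quoted in this example can be checked with a few lines of numpy. The snippet below is only a sanity check of the shape arithmetic (2×2 input, zero-padding of two elements per side giving 6×6, a 3×3 weight matrix, 4×4 output); symmetric zero-padding is assumed, which matches the padding the control unit is described as applying.

    import numpy as np

    x = np.arange(4).reshape(2, 2)            # 2x2 input data of the current layer
    k = 3                                     # 3x3 weight matrix
    padded = np.pad(x, pad_width=2)           # two zeros on every side -> 6x6
    out_size = padded.shape[0] - k + 1        # valid sliding of a 3x3 window
    print(padded.shape, (out_size, out_size)) # (6, 6) (4, 4)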
It is worth mentioning that the above calculation result is matrix data, and before being fed into the next network layer it may also be processed by the control unit, for example by padding or cropping.
Optionally, the first memory and the second memory are on-chip memories of a field programmable gate array (FPGA).
The first memory and the second memory are on-chip buffer A and on-chip buffer B provided on the FPGA acceleration module, and are connected to the off-chip memory through the direct memory access unit, so as to improve the efficiency of data transfer and utilization.
It is worth noting that "current network layer", "next network layer" and "previous network layer" are relative terms and are determined by the layer counting unit in the above network architecture. For example, in one deconvolution calculation, the layer counting unit treats the on-chip buffer A side, from which the input data is taken this time, as the current network layer, and the on-chip buffer B side, to which the result is output after calculation, as the next network layer; the current network layer is therefore the previous network layer relative to the next network layer. The layer counting unit then notifies the control unit to read the weight data of the current network layer from the direct memory access unit into on-chip buffer A, where the current network layer resides.
Further, the control unit of the FPGA acceleration module can be configured as needed to meet the requirements of different deconvolution neural networks for different parameters, such as matrix padding of the input data, cropping of the output data, the sliding stride of the deconvolution calculation, and the number of channels. For example, the control unit can pad the input data with 0 or other values, the sliding stride of the deconvolution calculation can be set to 1 or 2, and the number of channels of the input data can be 3 for RGB color pixels or 1 for grayscale pixels, and so on. This improves the versatility of the network architecture and meets the requirements of different scenarios.
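As a sketch of what such a per-layer configuration could look like at the software level, the configurable parameters listed above can be grouped as follows; the type name and field names are illustrative assumptions rather than registers defined by the patent.

    from dataclasses import dataclass

    @dataclass
    class DeconvLayerConfig:
        pad: int = 2          # elements added around the input matrix before the calculation
        fill_value: int = 0   # value used for padding, 0 or another value
        stride: int = 1       # sliding stride of the deconvolution calculation, 1 or 2
        channels: int = 3     # 3 for RGB color pixel data, 1 for grayscale pixel data
        crop: int = 0         # elements cropped from the output by the control unit

    rgb_layer = DeconvLayerConfig(stride=2, channels=3)
    gray_layer = DeconvLayerConfig(stride=1, channels=1)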
The above optional embodiments are supplementary embodiments of the hardware acceleration method for a deconvolution neural network of FIG. 2. Performing the methods in the above optional embodiments can achieve the corresponding beneficial effects; to avoid repetition, they are not described again here.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a hardware acceleration apparatus for a deconvolution neural network provided by an embodiment of the present invention. As shown in FIG. 3, the apparatus includes:
a first acquisition module 201, configured to acquire input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory;
a first calculation module 202, configured to perform deconvolution calculation on the input data in the current network layer to obtain a current calculation result and write the current calculation result into a second memory, the current network layer including the weights of that layer;
a second acquisition module 203, configured to acquire input data of a next network layer on the basis of the current calculation result in the second memory;
a second calculation module 204, configured to perform deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result and write the current calculation result into the first memory, the next network layer including the weights of that layer;
a repetition module 205, configured to repeatedly invoke the above modules until the last layer of the deconvolution neural network is reached, and to output the result.
Optionally, the first calculation module includes:
an acquisition unit 2021, configured to acquire the weights of the current network layer and feed them into the current network layer;
an operation unit 2022, configured to perform a matrix operation on the input data and the weights of the current network layer to obtain an operation result;
an input unit 2023, configured to write the operation result into the second memory.
The hardware acceleration apparatus for a deconvolution neural network provided by the embodiments of the present invention can implement each of the implementations in the method embodiment of FIG. 2 and achieve the corresponding beneficial effects; to avoid repetition, they are not described again here.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 4, the electronic device includes a memory 402, a processor 401, and a computer program stored in the memory 402 and executable on the processor 401, wherein:
the processor 401 is configured to invoke the computer program stored in the memory 402 and perform the following steps:
S1: acquiring input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory;
S2: performing deconvolution calculation on the input data in the current network layer to obtain a current calculation result, and writing the current calculation result into a second memory, the current network layer including the weights of that layer;
S3: acquiring input data of a next network layer on the basis of the current calculation result in the second memory;
S4: performing deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result, and writing the current calculation result into the first memory, the next network layer including the weights of that layer;
S5: repeating steps S1, S2, S3 and S4 until the last layer of the deconvolution neural network is reached, and outputting the result.
Optionally, the input data includes initial data stored in an off-chip memory, and the acquiring of the input data of the current network layer performed by the processor 401 includes:
reading the initial data stored in the off-chip memory into the first memory.
The performing, by the processor 401, of deconvolution calculation on the input data in the current network layer to obtain the current calculation result and the writing of the current calculation result into the second memory, the current network layer including the weights of that layer, includes:
acquiring the weights of the current network layer and feeding them into the current network layer;
performing a matrix operation on the input data and the weights of the current network layer to obtain an operation result;
writing the operation result into the second memory.
The acquiring, by the processor 401, of the weights of the current network layer and the feeding of them into the current network layer includes:
reading the weight matrix of the current network layer stored in the off-chip memory into the current network layer.
The performing, by the processor 401, of a matrix operation on the input data and the weights of the current network layer to obtain an operation result includes:
multiplying the input data by the weight matrix of the current network layer and accumulating the products to obtain a calculation result.
The acquiring, by the processor 401, of the input data of the next network layer on the basis of the current calculation result in the second memory includes:
the next network layer reading the operation result written into the second memory as its input data.
Optionally, the first memory and the second memory are on-chip memories of a field programmable gate array (FPGA).
In some embodiments, the processor 401 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
It should be noted that, since the steps of the above hardware acceleration method for a deconvolution neural network can be implemented when the processor 401 executes the computer program stored in the memory 402, all the embodiments of the hardware acceleration method for a deconvolution neural network are applicable to the above electronic device and can achieve the same or similar beneficial effects.
In addition, specific embodiments of the present invention also provide a computer-readable storage medium 402, which stores a computer program; when the computer program is executed by a processor, the steps of the above hardware acceleration method for a deconvolution neural network are implemented.
That is, in specific embodiments of the present invention, when the computer program of the computer-readable storage medium is executed by a processor, the steps of the above hardware acceleration method for a deconvolution neural network are implemented, which can increase the speed of image processing.
Exemplarily, the computer program of the computer-readable storage medium includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on.
It should be noted that, since the steps of the above hardware acceleration method for a deconvolution neural network are implemented when the computer program of the computer-readable storage medium is executed by a processor, all the embodiments of the hardware acceleration method for a deconvolution neural network are applicable to the computer-readable storage medium and can achieve the same or similar beneficial effects.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions; however, a person skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, a person skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other ways of division in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on such an understanding, the technical solution of the present application in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.
A person of ordinary skill in the art will understand that all or part of the steps in the various methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, and so on.
What is disclosed above is only the preferred embodiments of the present invention and certainly cannot be used to limit the scope of the rights of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (10)

  1. A hardware acceleration method for a deconvolution neural network, characterized by comprising:
    S1: acquiring input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory;
    S2: performing deconvolution calculation on the input data in the current network layer to obtain a current calculation result, and writing the current calculation result into a second memory, the current network layer comprising the weights of that layer;
    S3: acquiring input data of a next network layer on the basis of the current calculation result in the second memory;
    S4: performing deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result, and writing the current calculation result into the first memory, the next network layer comprising the weights of that layer;
    S5: repeating steps S1, S2, S3 and S4 until the last layer of the deconvolution neural network is reached, and outputting the result.
  2. The method according to claim 1, characterized in that the input data comprises initial data stored in an off-chip memory, and the acquiring of the input data of the current network layer comprises:
    reading the initial data stored in the off-chip memory into the first memory.
  3. The method according to claim 2, characterized in that the performing of deconvolution calculation on the input data in the current network layer to obtain the current calculation result and the writing of the current calculation result into the second memory, the current network layer comprising the weights of that layer, comprises:
    acquiring the weights of the current network layer and feeding them into the current network layer;
    performing a matrix operation on the input data and the weights of the current network layer to obtain an operation result;
    writing the operation result into the second memory.
  4. The method according to claim 3, characterized in that the acquiring of the weights of the current network layer and the feeding of them into the current network layer comprises:
    reading the weight matrix of the current network layer stored in the off-chip memory into the current network layer.
  5. The method according to claim 4, characterized in that the performing of a matrix operation on the input data and the weights of the current network layer to obtain an operation result comprises:
    multiplying the input data by the weight matrix of the current network layer and accumulating the products to obtain a calculation result.
  6. The method according to claim 5, characterized in that the acquiring of the input data of the next network layer on the basis of the current calculation result in the second memory comprises:
    the next network layer reading the operation result written into the second memory as its input data.
  7. The method according to claim 6, characterized in that the first memory and the second memory are on-chip memories of a field programmable gate array.
  8. A hardware acceleration apparatus for a deconvolution neural network, characterized by comprising:
    a first acquisition module, configured to acquire input data of a current network layer, the input data being the calculation result of a previous network layer stored in a first memory;
    a first calculation module, configured to perform deconvolution calculation on the input data in the current network layer to obtain a current calculation result and write the current calculation result into a second memory, the current network layer comprising the weights of that layer;
    a second acquisition module, configured to acquire input data of a next network layer on the basis of the current calculation result in the second memory;
    a second calculation module, configured to perform deconvolution calculation on the input data of the next network layer in the next network layer to obtain a current calculation result and write the current calculation result into the first memory, the next network layer comprising the weights of that layer;
    a repetition module, configured to repeatedly invoke the above modules until the last layer of the deconvolution neural network is reached, and to output the result.
  9. An electronic device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the steps in the hardware acceleration method for a deconvolution neural network according to any one of claims 1 to 7 are implemented.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the hardware acceleration method for a deconvolution neural network according to any one of claims 1 to 7 are implemented.
PCT/CN2018/120861 2018-12-13 2018-12-13 Hardware acceleration method and apparatus for a deconvolution neural network, and electronic device WO2020118608A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880083893.7A CN111542839B (zh) 2018-12-13 2018-12-13 一种反卷积神经网络的硬件加速方法、装置和电子设备
PCT/CN2018/120861 WO2020118608A1 (zh) 2018-12-13 2018-12-13 一种反卷积神经网络的硬件加速方法、装置和电子设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/120861 WO2020118608A1 (zh) 2018-12-13 2018-12-13 一种反卷积神经网络的硬件加速方法、装置和电子设备

Publications (1)

Publication Number Publication Date
WO2020118608A1 true WO2020118608A1 (zh) 2020-06-18

Family

ID=71075902

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120861 WO2020118608A1 (zh) 2018-12-13 2018-12-13 Hardware acceleration method and apparatus for a deconvolution neural network, and electronic device

Country Status (2)

Country Link
CN (1) CN111542839B (zh)
WO (1) WO2020118608A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308762A (zh) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 Data processing method and device
CN112613605A (zh) * 2020-12-07 2021-04-06 深兰人工智能(深圳)有限公司 Neural network acceleration control method and apparatus, electronic device, and storage medium
CN113673701A (zh) * 2021-08-24 2021-11-19 安谋科技(中国)有限公司 Method for running a neural network model, readable medium, and electronic device
CN116681604A (zh) * 2023-04-24 2023-09-01 吉首大学 Qin bamboo slip text restoration method based on a conditional generative adversarial network

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860771B (zh) * 2020-06-19 2022-11-25 苏州浪潮智能科技有限公司 Convolutional neural network computation method applied to edge computing
CN112712174B (zh) * 2020-12-31 2022-04-08 湖南师范大学 Hardware accelerator, acceleration method and image classification method for a full-frequency-domain convolutional neural network
CN112749799B (zh) * 2020-12-31 2022-04-12 湖南师范大学 Hardware accelerator, acceleration method and image classification method for an adaptive-ReLU-based full-frequency-domain convolutional neural network
CN113592066B (zh) * 2021-07-08 2024-01-05 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, apparatus, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060724A1 (en) * 2016-08-25 2018-03-01 Microsoft Technology Licensing, Llc Network Morphism
CN108133265A (zh) * 2016-12-01 2018-06-08 阿尔特拉公司 Method and apparatus for performing different types of convolution operations with the same processing unit
CN108765282A (zh) * 2018-04-28 2018-11-06 北京大学 FPGA-based real-time super-resolution method and system
CN108875915A (zh) * 2018-06-12 2018-11-23 辽宁工程技术大学 Deep adversarial network optimization method for embedded applications
CN108876833A (zh) * 2018-03-29 2018-11-23 北京旷视科技有限公司 Image processing method, image processing apparatus, and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10706348B2 (en) * 2016-07-13 2020-07-07 Google Llc Superpixel methods for convolutional neural networks
CN108062780B (zh) * 2017-12-29 2019-08-09 百度在线网络技术(北京)有限公司 Image compression method and apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060724A1 (en) * 2016-08-25 2018-03-01 Microsoft Technology Licensing, Llc Network Morphism
CN108133265A (zh) * 2016-12-01 2018-06-08 阿尔特拉公司 Method and apparatus for performing different types of convolution operations with the same processing unit
CN108876833A (zh) * 2018-03-29 2018-11-23 北京旷视科技有限公司 Image processing method, image processing apparatus, and computer-readable storage medium
CN108765282A (zh) * 2018-04-28 2018-11-06 北京大学 FPGA-based real-time super-resolution method and system
CN108875915A (zh) * 2018-06-12 2018-11-23 辽宁工程技术大学 Deep adversarial network optimization method for embedded applications

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308762A (zh) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 Data processing method and device
CN112613605A (zh) * 2020-12-07 2021-04-06 深兰人工智能(深圳)有限公司 Neural network acceleration control method and apparatus, electronic device, and storage medium
CN113673701A (zh) * 2021-08-24 2021-11-19 安谋科技(中国)有限公司 Method for running a neural network model, readable medium, and electronic device
CN116681604A (zh) * 2023-04-24 2023-09-01 吉首大学 Qin bamboo slip text restoration method based on a conditional generative adversarial network
CN116681604B (zh) * 2023-04-24 2024-01-02 吉首大学 Qin bamboo slip text restoration method based on a conditional generative adversarial network

Also Published As

Publication number Publication date
CN111542839A (zh) 2020-08-14
CN111542839B (zh) 2023-04-04

Similar Documents

Publication Publication Date Title
WO2020118608A1 (zh) Hardware acceleration method and apparatus for a deconvolution neural network, and electronic device
US11928595B2 (en) Method of managing data representation for deep learning, method of processing data for deep learning and deep learning system performing the same
US11593594B2 (en) Data processing method and apparatus for convolutional neural network
CN109670575B (zh) Method and apparatus for simultaneously performing activation and convolution operations, and learning method and learning apparatus therefor
CN109670574B (zh) Method and apparatus for simultaneously performing activation and convolution operations, and learning method and learning apparatus therefor
US20220335272A1 (en) Fast sparse neural networks
US20220083857A1 (en) Convolutional neural network operation method and device
CN109416755B (zh) Artificial intelligence parallel processing method and apparatus, readable storage medium, and terminal
CN109978139B (zh) Method and system for automatically generating descriptions for pictures, electronic apparatus, and storage medium
WO2019001323A1 (zh) Signal processing system and method
CN111325332B (zh) Processing method and apparatus for convolutional neural network
CN113222159B (zh) Quantum state determination method and apparatus
CN114117992B (zh) Serialization and deserialization method and apparatus, and electronic device
KR20210014561A (ko) Method, apparatus, device and computer-readable storage medium for extracting image data from multiple convolution windows
CN111340173A (zh) Generative adversarial network training method and system for high-dimensional data, and electronic device
US20200349433A1 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN112966729A (zh) Data processing method and apparatus, computer device, and storage medium
CN117435855A (zh) Method for performing convolution operation, electronic device, and storage medium
CN113222151B (zh) Quantum state transformation method and apparatus
KR20210097448A (ko) Image data processing method and sensor device performing the image data processing method
CN107894957B (zh) Memory data access and zero-insertion method and apparatus for convolutional neural networks
CN107977923B (zh) Image processing method and apparatus, electronic device, and computer-readable storage medium
CN115688917A (zh) Neural network model training method and apparatus, electronic device, and storage medium
US20220391761A1 (en) Machine learning device, information processing method, and recording medium
CN110929854A (zh) Data processing method and apparatus, and hardware accelerator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18943216

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18943216

Country of ref document: EP

Kind code of ref document: A1