WO2022057813A1 - Accelerator parameter determination method and apparatus, and computer-readable storage medium - Google Patents

Accelerator parameter determination method and apparatus, and computer-readable storage medium (加速器参数确定方法及装置、计算机可读存储介质)

Info

Publication number
WO2022057813A1
WO2022057813A1 PCT/CN2021/118418 CN2021118418W WO2022057813A1 WO 2022057813 A1 WO2022057813 A1 WO 2022057813A1 CN 2021118418 W CN2021118418 W CN 2021118418W WO 2022057813 A1 WO2022057813 A1 WO 2022057813A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameters
accelerator
neural network
memory
network
Prior art date
Application number
PCT/CN2021/118418
Other languages
English (en)
French (fr)
Inventor
熊先奎
朱炫鹏
徐东
王晓星
谢帅
蒋力
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2022057813A1 publication Critical patent/WO2022057813A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • G06F3/0607Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a method for determining accelerator parameters, a device for determining accelerator parameters, and a computer-readable storage medium.
  • Acceleration methods at the software level at this stage have shortcomings such as requiring an exploration process or entailing complex hardware implementation.
  • Acceleration methods at the hardware level also have various problems; for example, accelerators run neural networks inefficiently, performance/resource evaluation models depend too heavily on a particular infrastructure or algorithm, and the results obtained from an evaluation model are used only to optimize the hardware architecture design and are not coordinated with the software acceleration process.
  • An embodiment of the present disclosure provides a method for determining accelerator parameters, including: compressing a neural network according to preset compression parameters, and obtaining network parameters of the compressed neural network; generating architecture parameters of the accelerator according to the network parameters, wherein the accelerator is used to accelerate the operation of the neural network; and evaluating the architecture parameters according to preset resource requirements and performance requirements, and outputting the network parameters and the architecture parameters after the evaluation is passed.
  • Embodiments of the present disclosure also provide an apparatus for determining accelerator parameters, including: one or more processors; and a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the accelerator parameter determination method according to the present disclosure.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it causes the processor to implement the accelerator parameter determination method according to the present disclosure.
  • FIG. 1 is a flowchart of a method for determining an accelerator parameter according to an embodiment of the present disclosure
  • FIG. 2 is another flowchart of a method for determining an accelerator parameter according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a specific implementation of step S3 in the accelerator parameter determination method according to an embodiment of the present disclosure;
  • FIG. 4 is a structural block diagram of a field programmable gate array accelerator according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a specific implementation method of step S301 in the accelerator parameter determination method according to an embodiment of the present disclosure.
  • The accelerator parameter determination method, accelerator parameter determination apparatus, and computer-readable storage medium provided by the present disclosure compress a neural network, generate the accelerator architecture parameters according to the network parameters of the compressed neural network, and evaluate those architecture parameters; this makes it possible to determine the architecture parameters without actual running tests, effectively shortens the design cycle of the accelerator, and yields an approximate optimal solution under the given hardware resources and performance requirements.
  • FIG. 1 is a flowchart of a method for determining accelerator parameters according to an embodiment of the present disclosure.
  • the method for determining accelerator parameters according to an embodiment of the present disclosure includes steps S1 to S3.
  • In step S1, the neural network is compressed according to the preset compression parameters, and the network parameters of the compressed neural network are obtained.
  • The compression parameters can include the compression ratio and the accuracy of the neural network model, among others.
  • the step of compressing the neural network may include compressing the neural network by means of network pruning and weight quantization, that is, compressing the weight parameters of the neural network.
  • network pruning is used to remove unimportant weight parameters, which has better robustness and supports pre-training.
  • Network pruning is divided into structured pruning and unstructured pruning. Structured pruning is hardware-friendly, but the accuracy loss is large. Unstructured pruning is not hardware-friendly, but the accuracy loss is small.
  • Weight quantization includes quantization, sharing and encoding processes and, similarly to pruning, turns the weights into a structured, regular form. A simple quantization, sharing and encoding process is more conducive to hardware implementation but incurs a larger precision loss; a complex process incurs a smaller precision loss, meaning it can provide a better compression effect, but its hardware control is more complicated.
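  • As a generic illustration of the two compression techniques mentioned above — magnitude-based pruning and weight quantization with sharing — the following sketch shows one possible realization; the pruning ratio, cluster count and function names are illustrative assumptions, not values specified by the disclosure.

```python
import numpy as np

def prune_by_magnitude(weights, ratio=0.3):
    """Unstructured pruning: zero out the smallest-magnitude fraction of the weights."""
    threshold = np.quantile(np.abs(weights), ratio)   # remove the `ratio` smallest weights
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def quantize_and_share(weights, n_clusters=16, iters=20):
    """Weight quantization with sharing: cluster the weights (1-D k-means) and replace each
    weight by its cluster centroid; the cluster indices could then be encoded compactly."""
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return centroids[idx].reshape(weights.shape), idx.reshape(weights.shape)
```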
  • In step S2, the architecture parameters of the accelerator are generated according to the network parameters.
  • the architecture parameters of the accelerator can be generated according to the network parameters and in combination with the hardware environment and hardware characteristics of the corresponding accelerator.
  • the accelerator is used to accelerate the operation of the neural network, and the generation principle of the architecture parameters is to maximize the utilization of the logic unit array and the cache of the accelerator.
  • For a convolutional neural network (CNN), the operation includes convolution calculation, which involves three parts: the input data, the convolution kernels and the output data, all of which are three-dimensional arrays.
  • the input data has the attributes of rows and columns, and has multiple layers, corresponding to multiple channels;
  • the convolution kernel is also called weight, the number of layers of the convolution kernel is the same as the number of layers of the input data, and there are multiple convolution kernels;
  • the output data is the result of the convolution calculation;
  • the number of layers (channels) of the output data depends on the number of convolution kernels.
  • During convolution, a convolution kernel slides over the input data; at each position, the data points of the convolution kernel and the covered input data points are multiplied one by one, all the products are accumulated, and a bias is added to obtain one data point of the output data.
  • After a convolution kernel has slid over all positions of the input data, one channel of the output data has been computed; multiple convolution kernels repeat this process to compute the multiple channels of the output data.
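  • To make the convolution loop structure concrete, here is a minimal NumPy sketch of the six-level loop nest described above (output channel, input channel, output data row/column, kernel row/column); the array shapes, stride handling and names are illustrative only.

```python
import numpy as np

def conv_layer(inputs, kernels, bias, stride=1):
    """Direct convolution: inputs (C_in, H, W), kernels (C_out, C_in, K, K), bias (C_out,)."""
    c_in, h_in, w_in = inputs.shape
    c_out, _, k, _ = kernels.shape
    h_out = (h_in - k) // stride + 1
    w_out = (w_in - k) // stride + 1
    outputs = np.zeros((c_out, h_out, w_out))

    for oc in range(c_out):                  # output channel: one kernel per output channel
        for ic in range(c_in):               # input channel: kernel has as many layers as the input
            for oy in range(h_out):          # output data row
                for ox in range(w_out):      # output data column
                    for ky in range(k):      # convolution kernel row
                        for kx in range(k):  # convolution kernel column
                            outputs[oc, oy, ox] += (
                                inputs[ic, oy * stride + ky, ox * stride + kx]
                                * kernels[oc, ic, ky, kx]
                            )
    return outputs + bias[:, None, None]     # add the bias to every point of each output channel
```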
  • In step S3, the architecture parameters are evaluated according to the preset resource requirements and performance requirements, and the network parameters and the architecture parameters are output after the evaluation is passed.
  • Evaluating the architecture parameters means evaluating the accelerator corresponding to those architecture parameters.
  • When the resources occupied by the accelerator and the performance it achieves both satisfy the corresponding resource requirements and performance requirements, the evaluation is passed.
  • the method for determining accelerator parameters according to an embodiment of the present disclosure may further include: configuring system resources to generate accelerators according to the output network parameters and architecture parameters.
  • Unlike traditional accelerator design methods, which must wait for the accelerator hardware to be developed and actually run before performance and resources can be measured, the embodiments of the present disclosure provide a method for determining accelerator parameters.
  • By compressing a neural network, generating the accelerator architecture parameters according to the network parameters of the compressed neural network, and evaluating those architecture parameters, the architecture parameters are determined without actual running tests, which effectively shortens the design cycle of the accelerator and yields an approximate optimal solution under the given hardware resources and performance requirements.
  • FIG. 2 is another flowchart of a method for determining an accelerator parameter according to an embodiment of the present disclosure. As shown in FIG. 2 , the method is an optional embodiment based on the method shown in FIG. 1 . Specifically, step S1 shown in FIG. 1 may include step S101.
  • step S101 each layer of the neural network is compressed according to the compression parameters, and the network parameters of each layer of the compressed neural network are obtained.
  • For a convolutional neural network, each convolutional layer can be compressed according to the compression parameters, and the network parameters of each layer after compression can be obtained; the compression parameters corresponding to each layer can be set to be the same or different.
  • FIG. 3 is a flowchart of a specific implementation method of step S3 in the method for determining accelerator parameters according to an embodiment of the present disclosure.
  • the accelerator may be a Field Programmable Gate Array (FPGA) accelerator
  • the architectural parameters may include memory parameters.
  • the memory parameters may include the buffer capacity and the number of memory read and write cycles corresponding to each layer of the neural network.
  • the number of memory read/write cycles is related to the loop levels covered by the memory read/write operations; it is a measured value that needs to be calculated during the evaluation test.
  • the cache capacity is likewise affected by the loop levels covered by the memory read/write operations.
  • the loop levels include convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel.
  • step S3 may include step S301.
  • In step S301, the cache capacity is evaluated according to the resource requirements, and the number of memory read/write cycles is evaluated according to the performance requirements.
  • the number of memory read and write cycles may be calculated by dividing the product of the amount of data read at one time and the number of reads by the memory bandwidth.
  • the amount of data read at one time is the amount of data required by the inner loop levels covered by the memory read/write operation; correspondingly, the number of reads is the number of iterations of the outer loop levels not covered by that operation.
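  • The formula above (cycles = amount of data read at one time × number of reads ÷ memory bandwidth) can be sketched as follows; the tile sizes, element width and bandwidth are illustrative assumptions, not figures from the disclosure.

```python
def memory_rw_cycles(inner_data_elems, outer_iterations, bytes_per_elem, bytes_per_cycle):
    """Estimate memory read/write cycles: the data needed by the covered inner loop levels is
    transferred once per iteration of the uncovered outer loop levels, at the given bandwidth."""
    return inner_data_elems * bytes_per_elem * outer_iterations / bytes_per_cycle

# Example with made-up numbers: the read-input-data operation covers the inner four loop levels
# (kernel column/row, output column/row), so one read fetches the 34x34 input tile they need,
# and the read is repeated for every iteration of the two uncovered outer loops.
cycles = memory_rw_cycles(inner_data_elems=34 * 34,   # input tile for a 32x32 output with a 3x3 kernel
                          outer_iterations=256 * 8,   # input-channel x output-channel-tile iterations
                          bytes_per_elem=2,           # 16-bit data
                          bytes_per_cycle=16)         # assumed memory bandwidth per clock cycle
print(f"estimated read cycles: {cycles:.0f}")
```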
  • architectural parameters may also include hierarchical loop order and tiling parameters.
  • FIG. 4 is a structural block diagram of an FPGA accelerator according to an embodiment of the present disclosure.
  • the FPGA accelerator includes memory, input data cache, weight cache, output data cache and logic cell array.
  • the memory can be a double-rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM, also known as DDR), which exists in the form of an external connection and plays the role of memory.
  • the FPGA accelerator is used to accelerate the inference of the neural network with lower power consumption and obtain higher performance.
  • the workflow for accelerating one convolutional layer of a convolutional neural network is: store the input data and convolution kernels in the memory; read (part of) the input data and convolution kernels from the memory into the input data cache and the weight cache; feed the input data and convolution kernels into the logic unit array, perform the multiply-accumulate calculations, and buffer the results in the output data cache, repeating this step until the read input data and convolution kernels are used up; save the computed (partial) output data to the memory; and repeat the above steps until all input data and convolution kernels in the memory have been used.
  • The acceleration effect of the FPGA accelerator on a neural network mainly depends on its large-scale parallel computing capability.
  • A single logic unit in the logic unit array can perform one multiply-accumulate per clock cycle; when computing in parallel, as many multiply-accumulates can be performed per cycle as there are logic units.
  • corresponding to the tiling parameters among the architecture parameters, the design options of the FPGA accelerator can include: in each loop, slices can be taken out and gathered together for parallel computation, a process called tiling (blocking); each loop level corresponds to one tiling parameter.
  • The value of a tiling parameter is an integer greater than or equal to 1. If the tiling parameter equals 1, that loop level is not tiled; if it is greater than 1, that loop level is tiled.
  • the design options of the FPGA accelerator may further include: setting the sequence of the multi-layer loops and the loop level covered by the read and write memory operations.
  • for example, with six loop levels in total, the loop order is convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel;
  • if the read-input-data operation is set so that four loop levels lie inside it, arranged as convolution kernel column, convolution kernel row, output data column and output data row, this means that the input data is read from the memory once and stored in the on-chip cache for use by those four loop levels, i.e., the operation covers the inner four loop levels; in this case, the amount of data read at one time is the amount of data required by the inner four loop levels, and the number of reads is the number of iterations of the outer two loop levels.
  • FIG. 5 is a flowchart of a specific implementation method of step S301 in the accelerator parameter determination method according to an embodiment of the present disclosure.
  • In step S301, the evaluation of the number of memory read/write cycles is performed after the evaluation of the cache capacity has passed.
  • In step S301, the step of evaluating the cache capacity according to the resource requirements may include steps S301a to S303a, the step of evaluating the number of memory read/write cycles according to the performance requirements may include steps S301b to S303b, and step S301 may further include step S301c.
  • In step S301a, it is compared whether all cache capacities are less than or equal to their respective resource thresholds.
  • In step S301a, the cache capacity of each layer can be compared with its corresponding resource threshold.
  • The resource thresholds are determined according to the corresponding resource requirements. If all cache capacities are less than or equal to their respective resource thresholds, step S302a is executed; if at least one cache capacity is greater than its resource threshold, step S303a is executed.
  • In some embodiments, the resource thresholds corresponding to the cache capacities of the different layers may be different, or they may be set to the same value.
  • In step S302a, the evaluation result of the cache capacity is a pass.
  • In step S303a, the architecture parameters are adjusted, and it is determined whether a preset loop exit condition is met.
  • If at least one cache capacity is greater than its corresponding resource threshold, the architecture parameters are adjusted. In some embodiments, if no loop exit condition is set, the process returns directly to step S301a and compares all cache capacities with their corresponding resource thresholds.
  • In step S303a, if it is determined that the preset loop exit condition is not met, the process returns to step S301a to compare whether all cache capacities are less than or equal to their respective resource thresholds; if it is determined that the loop exit condition is met, step S301c is executed.
  • In some embodiments, the loop exit condition may be set as the number of iterations being greater than or equal to a preset threshold, or the exit may be triggered by console behavior or user behavior.
  • In step S301b, it is compared whether the sum of all memory read/write cycle counts is less than or equal to a preset performance threshold.
  • In step S301b, the sum of all memory read/write cycle counts may be compared with the preset performance threshold.
  • The performance threshold is determined according to the corresponding performance requirement. If the sum of all memory read/write cycle counts is less than or equal to the performance threshold, step S302b is executed; if the sum is greater than the performance threshold, step S303b is executed.
  • In step S302b, the evaluation result of the number of memory read/write cycles is a pass.
  • In step S303b, the architecture parameters are adjusted, and it is determined whether the preset loop exit condition is met.
  • If the sum of all memory read/write cycle counts is greater than the performance threshold, the architecture parameters are adjusted. In some embodiments, if no loop exit condition is set, the process returns directly to step S301b and compares the sum of all memory read/write cycle counts with the preset performance threshold.
  • In step S303b, if it is determined that the preset loop exit condition is not met, the process returns to step S301b to compare whether the sum of all memory read/write cycle counts is less than or equal to the preset performance threshold; if it is determined that the loop exit condition is met, step S301c is executed.
  • Meeting the preset loop exit condition means, in practice, that no adjustment of the architecture parameters can make the sum of all memory read/write cycle counts less than or equal to the performance threshold.
  • In step S301c, the compression parameters are adjusted, and the neural network is re-compressed according to the adjusted compression parameters; the process then returns, on the basis of the re-compressed neural network, to the step of obtaining the network parameters of the compressed neural network, and the evaluation continues in a loop until the network parameters and architecture parameters are output.
  • Steps S301a to S303a may be executed before steps S301b to S303b, after steps S301b to S303b, or interleaved with steps S301b to S303b, all of which fall within the protection scope of the present disclosure.
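  • A minimal sketch of the evaluation loop of steps S301a to S301c: per-layer cache capacities are checked against a resource threshold, the total memory read/write cycle count against a performance threshold, the architecture parameters are adjusted on failure, and the compression parameters are adjusted once the loop exit condition is reached. The data structures, the single shared resource threshold and the adjustment callbacks are placeholders, not part of the disclosure.

```python
def determine_parameters(compress, gen_arch_params, adjust_arch, adjust_compression,
                         compression, resource_threshold, performance_threshold,
                         max_attempts=100):
    """Iterate: compress the network (S1), generate architecture parameters (S2), evaluate (S3)."""
    while True:
        net_params = compress(compression)                    # step S1 (or re-compression in S301c)
        arch = gen_arch_params(net_params)                    # step S2
        for _ in range(max_attempts):                         # loop exit condition: attempt budget
            if any(c > resource_threshold for c in arch["cache_capacities"]):
                arch = adjust_arch(arch)                      # step S303a: resource check failed
            elif sum(arch["rw_cycles"]) > performance_threshold:
                arch = adjust_arch(arch)                      # step S303b: performance check failed
            else:
                return net_params, arch                       # both evaluations passed: output
        compression = adjust_compression(compression)         # step S301c: adjust compression, retry
```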
  • The embodiments of the present disclosure provide a method for determining accelerator parameters: the neural network is compressed, the accelerator architecture parameters are generated according to the network parameters of the compressed neural network, and the architecture parameters are evaluated. During the evaluation, multiple rapid alternating software/hardware iterations reduce the parameter testing period, and within each iteration it can quickly be determined whether the corresponding parameters hit resource or performance limits, or mismatch the software neural network model, so that software and hardware constrain and approach each other until a balance point is reached and an overall approximate optimal solution is obtained.
  • parameters are determined for an FPGA accelerator, which is implemented on a Xilinx ZCU102 hardware platform and includes a memory, an input data cache, a weight cache, an output data cache, and a logic cell array.
  • the neural network model object to be accelerated is the YOLOV3-Tiny model.
  • each convolutional layer of the model is compressed according to preset compression parameters, a first compression model is generated, and network parameters of each layer of the first compression model are obtained.
  • The compression parameters can include a compression rate and a model accuracy threshold; here the compression rate of each convolutional layer is 30%, and the model accuracy threshold is 95%.
  • the architecture parameters of the accelerator are generated according to the acquired network parameters, and the architecture parameters may include hierarchical cycle order, block parameters and memory parameters.
  • the memory parameters may include the cache capacity corresponding to each layer and the number of memory read/write cycles; the number of memory read/write cycles is a measured value and is reflected by the loop levels covered by the memory read/write operations.
  • the loop levels may include convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel.
  • the process of generating the architecture parameters is as follows:
  • First, the hierarchical loop order is determined. The 3×3 convolution kernel is computed fully in parallel, so the kernel-column and kernel-row loops need no further ordering and form the innermost levels; the remaining order is determined by the data reuse levels of the input data, weights and output data, which are chosen according to the data amounts: the reuse level of the input data is the output channel, the reuse level of the weights is the output data column and output data row, and the reuse level of the output data is the input channel, kernel column and kernel row.
  • For the YOLOV3-Tiny model, the Conv12 layer has the largest amount of data, with 86528 input data values, 4718592 weights and 173056 output data values. The weights and output data dominate, so their reuse requirements are satisfied first: the reuse levels of the weights (output data column and output data row) are placed at the inner levels, the reuse level of the output data (input channel) is placed further out, and the reuse level of the input data (output channel) is placed outermost.
  • The resulting loop order is convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel.
  • Second, the tiling parameters are determined. They are mainly related to the number of logic units, which is determined by the parallelism of each loop level; specifically, the number of logic units is obtained by multiplying together the parallelism of every loop level. With the 3×3 kernel fully parallel, trying a parallelism of 16 for both the input and output channels gives 16*16*3*3 = 2304 logic units, which exceeds the 1728 available on the chip; trying a parallelism of 16 for the input channel and 8 for the output channel gives 16*8*3*3 = 1152, which satisfies the logic unit budget and completes the parallelism setting.
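  • Since the number of logic units is the product of the per-level parallelism, a candidate parallelism setting can be checked against the available logic units as follows; the 1728-unit budget follows the ZCU102 example in the text, the rest is illustrative.

```python
from math import prod

def logic_units_needed(parallelism):
    """Number of logic units = product of the parallelism of every loop level."""
    return prod(parallelism.values())

available = 1728                                   # logic units available on the target device
candidate = {"kernel_col": 3, "kernel_row": 3,     # 3x3 kernel fully parallel
             "in_channel": 16, "out_channel": 16}
print(logic_units_needed(candidate))               # 2304 > 1728: reduce out_channel to 8 -> 1152
```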
  • Then, the loop levels covered by the memory read/write operations are determined; these levels affect the cache capacity and the number of memory accesses. If only minimizing the number of memory accesses were considered, it would be best to place the memory read/write operations outside all six loop levels so that memory is read and written only once, but this would require a cache capacity larger than the on-chip cache; the read/write operations are therefore placed within some loop level, as far as possible above the data reuse levels.
  • The candidate levels for the three data read/write operations are: input data at the input channel or output channel level, weights at the input channel or output channel level, and output data at the output channel level. For these candidate positions, the input data cache capacity, weight cache capacity and output data cache capacity are calculated separately; their sum must be less than the on-chip cache capacity. If the cache capacities calculated for all positions are too large, the tiling parameters should be reduced or the hierarchical loop order adjusted.
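  • The check that the three cache capacities fit the on-chip budget can be sketched as below; the tile shapes, element width and on-chip capacity are illustrative assumptions that depend on the chosen coverage levels and tiling parameters.

```python
def buffer_capacities(in_tile, weight_tile, out_tile, bytes_per_elem=2):
    """Per-buffer capacity in bytes for one candidate setting of the covered loop levels."""
    input_buf  = in_tile["channels"] * in_tile["rows"] * in_tile["cols"] * bytes_per_elem
    weight_buf = (weight_tile["out_ch"] * weight_tile["in_ch"]
                  * weight_tile["k"] * weight_tile["k"] * bytes_per_elem)
    output_buf = out_tile["channels"] * out_tile["rows"] * out_tile["cols"] * bytes_per_elem
    return input_buf, weight_buf, output_buf

on_chip_capacity = 4 * 1024 * 1024                  # assumed total on-chip cache budget (bytes)
bufs = buffer_capacities({"channels": 16, "rows": 34, "cols": 34},
                         {"out_ch": 8, "in_ch": 16, "k": 3},
                         {"channels": 8, "rows": 32, "cols": 32})
if sum(bufs) > on_chip_capacity:                    # if too large, shrink tiling or reorder loops
    print("reduce the tiling parameters or adjust the hierarchical loop order")
```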
  • the cache capacity needs to be evaluated according to the resource requirements, and the number of memory read and write cycles needs to be evaluated according to the performance requirements.
  • the resource threshold corresponding to the cache capacity of each convolutional layer is preset to be 500KB. If there is at least one cache capacity greater than 500KB, the architecture parameters are adjusted. If the architecture parameters cannot be adjusted so that all cache capacities are less than or equal to 500KB, then on the premise that the model accuracy is not lower than the model accuracy threshold, adjust the compression rate, generate a second compression model, and repeat the above steps until all the cache capacity is less than or equal to 500KB.
  • The evaluation of the number of memory read/write cycles is performed after the evaluation of the cache capacity has passed. The performance threshold for the sum of all memory read/write cycle counts is preset to 100M; if the sum exceeds 100M, the architecture parameters are adjusted, and if no adjustment of the architecture parameters can bring the sum to 100M or below, the compression rate is adjusted (provided the model accuracy stays at or above the model accuracy threshold), a third compressed model is generated, and the above steps are repeated until the sum of all memory read/write cycle counts is less than or equal to 100M. After the evaluations against the resource and performance requirements are complete, the final network parameters and the finally determined architecture parameters are output.
  • Embodiments of the present disclosure also provide an apparatus for determining accelerator parameters, including: one or more processors; and a storage apparatus for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the accelerator parameter determination method according to various embodiments of the present disclosure.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the processor implements the method for determining accelerator parameters according to the embodiments of the present disclosure.
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media, as is well known to those of ordinary skill in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure provides an accelerator parameter determination method, including: compressing a neural network according to preset compression parameters, and obtaining network parameters of the compressed neural network; generating architecture parameters of an accelerator according to the network parameters; and evaluating the architecture parameters according to preset resource requirements and performance requirements, and outputting the network parameters and the architecture parameters after the evaluation is passed. The present disclosure further provides an accelerator parameter determination apparatus and a computer-readable storage medium.

Description

Accelerator parameter determination method and apparatus, and computer-readable storage medium
This application claims priority to Chinese patent application No. 202010967711.1, filed with the Chinese Patent Office on September 15, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an accelerator parameter determination method, an accelerator parameter determination apparatus, and a computer-readable storage medium.
Background
As neural networks surpass human-level performance in the execution of various tasks, demand for neural network applications keeps rising across industries. However, neural networks have complex structures and enormous numbers of parameters and computations, and deployment on devices and at the edge runs into major obstacles because these platforms cannot provide computing power comparable to that of a graphics processing unit (GPU). To solve this problem, schemes for accelerating neural network computation at the software and hardware levels have been proposed.
However, current software-level acceleration methods have shortcomings such as requiring an exploration process or entailing complex hardware implementation, and hardware-level acceleration methods likewise suffer from various problems: for example, accelerators run neural networks inefficiently; performance/resource evaluation models depend too heavily on a particular infrastructure or algorithm; and the results of an evaluation model are used only to optimize the hardware architecture design and are not coordinated with the software acceleration process.
Summary
An embodiment of the present disclosure provides an accelerator parameter determination method, including: compressing a neural network according to preset compression parameters, and obtaining network parameters of the compressed neural network; generating architecture parameters of an accelerator according to the network parameters, wherein the accelerator is used to accelerate the operation of the neural network; and evaluating the architecture parameters according to preset resource requirements and performance requirements, and outputting the network parameters and the architecture parameters after the evaluation is passed.
An embodiment of the present disclosure further provides an accelerator parameter determination apparatus, including: one or more processors; and a storage apparatus configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the accelerator parameter determination method according to the present disclosure.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the accelerator parameter determination method according to the present disclosure.
Brief Description of the Drawings
FIG. 1 is a flowchart of an accelerator parameter determination method according to an embodiment of the present disclosure;
FIG. 2 is another flowchart of an accelerator parameter determination method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a specific implementation of step S3 in the accelerator parameter determination method according to an embodiment of the present disclosure;
FIG. 4 is a structural block diagram of a field programmable gate array accelerator according to an embodiment of the present disclosure; and
FIG. 5 is a flowchart of a specific implementation of step S301 in the accelerator parameter determination method according to an embodiment of the present disclosure.
Detailed Description
To enable those skilled in the art to better understand the technical solutions of the present disclosure, the accelerator parameter determination method, accelerator parameter determination apparatus, and computer-readable medium provided by the present disclosure are described in detail below with reference to the accompanying drawings.
Example embodiments are described more fully below with reference to the accompanying drawings, but the example embodiments may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a/an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will further be understood that when the terms "comprise" and/or "made of" are used in this specification, they specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. Thus, a first element, first component or first module discussed below could be termed a second element, second component or second module without departing from the teachings of the present disclosure.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will further be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The accelerator parameter determination method, accelerator parameter determination apparatus, and computer-readable storage medium provided by the present disclosure compress a neural network, generate the accelerator architecture parameters according to the network parameters of the compressed neural network, and evaluate the architecture parameters, so that the architecture parameters are determined without actual running tests; this effectively shortens the accelerator design cycle and yields an approximate optimal solution under the given hardware resource and performance requirements.
FIG. 1 is a flowchart of an accelerator parameter determination method according to an embodiment of the present disclosure.
As shown in FIG. 1, the accelerator parameter determination method according to an embodiment of the present disclosure includes steps S1 to S3.
In step S1, the neural network is compressed according to preset compression parameters, and the network parameters of the compressed neural network are obtained.
The compression parameters may include the compression ratio and the accuracy of the neural network model, among others. Compressing the neural network may include compressing it by network pruning, weight quantization and similar means, i.e., compressing the weight parameters of the neural network. Specifically, network pruning removes unimportant weight parameters; it offers good robustness and supports pre-training. Network pruning is divided into structured pruning and unstructured pruning: structured pruning is hardware-friendly but causes a larger accuracy loss, whereas unstructured pruning is not hardware-friendly but causes a smaller accuracy loss. Weight quantization includes quantization, sharing and encoding processes and, similarly to pruning, turns the weights into a structured, regular form. A simple quantization, sharing and encoding process is more conducive to hardware implementation but incurs a larger precision loss; a complex process incurs a smaller precision loss, meaning a better compression effect, but its hardware control is more complicated.
In step S2, the architecture parameters of the accelerator are generated according to the network parameters.
The architecture parameters of the accelerator may be generated according to the network parameters in combination with the hardware environment and hardware characteristics of the corresponding accelerator. The accelerator is used to accelerate the operation of the neural network, and the architecture parameters are generated so as to maximize utilization of the accelerator's logic unit array and caches.
In particular, for a convolutional neural network (CNN), the operation includes convolution calculation, which involves three parts: the input data, the convolution kernels and the output data, all of which are three-dimensional arrays. The input data has row and column attributes and has multiple layers corresponding to multiple channels; a convolution kernel, also called a weight, has the same number of layers as the input data, and there are multiple convolution kernels; the output data is the result of the convolution calculation, and its number of layers (channels) depends on the number of convolution kernels. During convolution, a convolution kernel slides over the input data; at each position, the data points of the kernel and the covered input data points are multiplied one by one, all the products are accumulated, and a bias is added to obtain one data point of the output data. After a convolution kernel has slid over all positions of the input data, one channel of the output data has been computed; multiple convolution kernels repeat this process to compute the multiple channels of the output data.
In step S3, the architecture parameters are evaluated according to preset resource requirements and performance requirements, and the network parameters and architecture parameters are output after the evaluation is passed.
Evaluating the architecture parameters means evaluating the accelerator corresponding to the architecture parameters. When the resources occupied by the accelerator and the performance it achieves both satisfy the corresponding resource requirements and performance requirements, the evaluation is passed.
In some embodiments, the accelerator parameter determination method according to the embodiments of the present disclosure may further include: configuring system resources according to the output network parameters and architecture parameters to generate the accelerator.
Unlike traditional accelerator design methods, which must wait for the accelerator hardware to be developed and actually run before performance and resources can be measured, which is very time-consuming, the embodiments of the present disclosure provide an accelerator parameter determination method that compresses the neural network, generates the accelerator architecture parameters according to the network parameters of the compressed neural network, and evaluates the architecture parameters, so that the architecture parameters are determined without actual running tests; this effectively shortens the accelerator design cycle and yields an approximate optimal solution under the given hardware resource and performance requirements.
FIG. 2 is another flowchart of an accelerator parameter determination method according to an embodiment of the present disclosure. As shown in FIG. 2, this method is an optional, more specific embodiment based on the method shown in FIG. 1. Specifically, step S1 shown in FIG. 1 may include step S101.
In step S101, each layer of the neural network is compressed according to the compression parameters, and the network parameters of each layer of the compressed neural network are obtained.
Specifically, for a convolutional neural network, each convolutional layer may be compressed according to the compression parameters and the network parameters of each compressed layer obtained; the compression parameters of the layers may be set to be the same or different.
FIG. 3 is a flowchart of a specific implementation of step S3 in the accelerator parameter determination method according to an embodiment of the present disclosure.
Specifically, the accelerator may be a field programmable gate array (FPGA) accelerator, and the architecture parameters may include memory parameters. The memory parameters may include the buffer (cache) capacity corresponding to each layer of the neural network and the number of memory read/write cycles. The number of memory read/write cycles is related to the loop levels covered by the memory read/write operations; it is a measured value that needs to be calculated during the evaluation test. The cache capacity is likewise affected by the loop levels covered by the memory read/write operations, and the loop levels include convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel.
As shown in FIG. 3, step S3 may include step S301.
In step S301, the cache capacity is evaluated according to the resource requirements, and the number of memory read/write cycles is evaluated according to the performance requirements.
In some embodiments, the number of memory read/write cycles may be calculated by dividing the product of the amount of data read at one time and the number of reads by the memory bandwidth. The amount of data read at one time is the amount of data required by the inner loop levels covered by the memory read/write operation, and correspondingly, the number of reads is the number of iterations of the outer loop levels not covered by the memory read/write operation.
In some embodiments, the architecture parameters may further include the hierarchical loop order and tiling parameters.
FIG. 4 is a structural block diagram of an FPGA accelerator according to an embodiment of the present disclosure.
As shown in FIG. 4, the FPGA accelerator includes a memory, an input data cache, a weight cache, an output data cache and a logic unit array. The memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM, or simply DDR), which is attached externally and serves as main memory.
The FPGA accelerator is used to accelerate neural network inference at lower power consumption and obtain higher performance. The workflow for accelerating one convolutional layer of a convolutional neural network is: store the input data and convolution kernels in the memory; read (part of) the input data and convolution kernels from the memory into the input data cache and the weight cache; feed the input data and convolution kernels into the logic unit array, perform multiply-accumulate calculations, and buffer the results in the output data cache, repeating this step until the read input data and convolution kernels are used up; save the computed (partial) output data to the memory; and repeat the above steps until all input data and convolution kernels in the memory have been used.
The acceleration that the FPGA accelerator provides for a neural network mainly relies on large-scale parallel computing capability. A single logic unit in the logic unit array can perform one multiply-accumulate per clock cycle, and during parallel computation, as many multiply-accumulates can be performed as there are logic units. Therefore, corresponding to the tiling parameters among the architecture parameters, the design options of the FPGA accelerator may include: in each loop, slices may be taken out and gathered for parallel computation, a process called tiling; each loop level corresponds to one tiling parameter, whose value is an integer greater than or equal to 1. If the tiling parameter equals 1, that loop level is not tiled; if it is greater than 1, that loop level is tiled.
In addition, corresponding to the hierarchical loop order and the memory parameters among the architecture parameters, the design options of the FPGA accelerator may further include: setting the order of the multiple loop levels and the loop levels covered by the memory read/write operations. For example, with six loop levels in total, the loop order may be convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel; if the read-input-data operation is set so that four loop levels lie inside it, arranged as convolution kernel column, convolution kernel row, output data column and output data row, this means the input data is read from the memory once and stored in the on-chip cache for use by those four loop levels, i.e., the operation covers the inner four loop levels; in this case, the amount of data read at one time is the amount of data required by the inner four loop levels, and the number of reads is the number of iterations of the outer two loop levels.
FIG. 5 is a flowchart of a specific implementation of step S301 in the accelerator parameter determination method according to an embodiment of the present disclosure.
Specifically, in step S301, the evaluation of the number of memory read/write cycles is performed after the evaluation of the cache capacity has passed.
As shown in FIG. 5, in step S301, evaluating the cache capacity according to the resource requirements may include steps S301a to S303a, evaluating the number of memory read/write cycles according to the performance requirements may include steps S301b to S303b, and step S301 may further include step S301c.
In step S301a, it is compared whether all cache capacities are less than or equal to their respective resource thresholds.
In step S301a, the cache capacity of each layer may be compared with its corresponding resource threshold.
The resource thresholds are determined according to the corresponding resource requirements. If all cache capacities are less than or equal to their respective resource thresholds, step S302a is executed; if at least one cache capacity is greater than its resource threshold, step S303a is executed.
In some embodiments, the resource thresholds corresponding to the cache capacities of the layers are different, or they may be set to the same value.
In step S302a, the evaluation result of the cache capacity is a pass.
If all cache capacities are less than or equal to their respective resource thresholds, the evaluation of the cache capacity is passed.
In step S303a, the architecture parameters are adjusted, and it is determined whether a preset loop exit condition is met.
If at least one cache capacity is greater than its corresponding resource threshold, the architecture parameters are adjusted. In some embodiments, if no loop exit condition is set, the process returns directly to step S301a and compares all cache capacities with their corresponding resource thresholds.
In step S303a, if it is determined that the preset loop exit condition is not met, the process returns to step S301a to compare whether all cache capacities are less than or equal to their respective resource thresholds; if it is determined that the preset loop exit condition is met, step S301c is executed.
Meeting the preset loop exit condition means, in practice, that no adjustment of the architecture parameters can make all cache capacities less than or equal to their respective resource thresholds.
In some embodiments, the loop exit condition may be set as the number of iterations being greater than or equal to a preset threshold, or the exit may be triggered by console behavior or user behavior.
In step S301b, it is compared whether the sum of all memory read/write cycle counts is less than or equal to a preset performance threshold.
In step S301b, the sum of all memory read/write cycle counts may be compared with the preset performance threshold.
The performance threshold is determined according to the corresponding performance requirement. If the sum of all memory read/write cycle counts is less than or equal to the performance threshold, step S302b is executed; if the sum is greater than the performance threshold, step S303b is executed.
In step S302b, the evaluation result of the number of memory read/write cycles is a pass.
If the sum of all memory read/write cycle counts is less than or equal to the performance threshold, the evaluation of the number of memory read/write cycles is passed.
In step S303b, the architecture parameters are adjusted, and it is determined whether the preset loop exit condition is met.
If the sum of all memory read/write cycle counts is greater than the performance threshold, the architecture parameters are adjusted. In some embodiments, if no loop exit condition is set, the process returns directly to step S301b and compares the sum of all memory read/write cycle counts with the preset performance threshold.
In step S303b, if it is determined that the preset loop exit condition is not met, the process returns to step S301b to compare whether the sum of all memory read/write cycle counts is less than or equal to the preset performance threshold; if it is determined that the preset loop exit condition is met, step S301c is executed.
Meeting the preset loop exit condition means, in practice, that no adjustment of the architecture parameters can make the sum of all memory read/write cycle counts less than or equal to the performance threshold.
In step S301c, the compression parameters are adjusted, and the neural network is re-compressed according to the adjusted compression parameters.
When the preset loop exit condition is met, the compression parameters are adjusted, the neural network is re-compressed according to the adjusted compression parameters, and the process returns, on the basis of the re-compressed neural network, to the step of obtaining the network parameters of the compressed neural network in step S1. The above evaluation steps then continue in a loop until the network parameters and architecture parameters are output.
It should be noted that the execution order described above for the evaluation of the cache capacity and the evaluation of the number of memory read/write cycles is only one optional implementation of the present disclosure and does not limit its technical solution: steps S301a to S303a may be executed before steps S301b to S303b, after them, or interleaved with them, all of which fall within the protection scope of the present disclosure.
The embodiments of the present disclosure provide an accelerator parameter determination method that compresses the neural network, generates the accelerator architecture parameters according to the network parameters of the compressed neural network, and evaluates the architecture parameters. During the evaluation, multiple rapid alternating software/hardware iterations shorten the parameter testing period, and in each iteration it can be quickly determined whether the corresponding parameters hit resource or performance limits, or mismatch the software neural network model, so that software and hardware constrain and approach each other until a balance point is reached and an overall approximate optimal solution is obtained.
The accelerator parameter determination method provided by the present disclosure is described in detail below in connection with a practical application.
Specifically, parameters are determined for an FPGA accelerator implemented on a Xilinx ZCU102 hardware platform, which includes a memory, an input data cache, a weight cache, an output data cache and a logic unit array. The neural network model to be accelerated is the YOLOV3-Tiny model.
First, each convolutional layer of the model is compressed according to preset compression parameters to generate a first compressed model, and the network parameters of each layer of the first compressed model are obtained. The compression parameters may include a compression rate and a model accuracy threshold; the compression rate of each convolutional layer is 30%, and the model accuracy threshold is 95%.
Next, the accelerator architecture parameters are generated according to the obtained network parameters. The architecture parameters may include the hierarchical loop order, the tiling parameters and the memory parameters. The memory parameters may include the cache capacity corresponding to each layer and the number of memory read/write cycles; the latter is a measured value, reflected by the loop levels covered by the memory read/write operations.
Specifically, the loop levels may include convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel, and the flow for generating the architecture parameters is as follows:
First, the hierarchical loop order is determined. The 3×3 convolution kernel is set to be fully parallel, so the kernel column and kernel row loops are moved into the parallel computation, require no further ordering, and form the innermost levels; the remaining order is determined by the data reuse levels. For the input data, weights and output data involved, the reuse levels are determined according to the data amounts: the reuse level of the input data is the output channel, the reuse level of the weights is the output data column and output data row, and the reuse level of the output data is the input channel, kernel column and kernel row. For the YOLOV3-Tiny model, the Conv12 layer has the largest amount of data, with 86528 input data values, 4718592 weights and 173056 output data values. The weights and output data dominate, so their reuse requirements are satisfied first: the reuse levels of the weights (output data column and output data row) are placed at the inner levels, the reuse level of the output data (input channel) is placed further out, and the reuse level of the input data (output channel) is placed outermost. The resulting loop order is convolution kernel column, convolution kernel row, output data column, output data row, input channel and output channel.
Second, the tiling parameters are determined. The tiling parameters are mainly related to the number of logic units, which is determined by the parallelism of each loop level; specifically, the number of logic units is the product of the parallelism of all loop levels. The 3×3 kernel has already been set fully parallel; for the input and output channels, since the number of convolution channels of a neural network model is generally a multiple of 16, a parallelism of 16 is first tried for both input and output channels, giving 16*16*3*3 = 2304 logic units, which exceeds the 1728 logic units available on the chip, so this setting does not satisfy the logic unit budget. A parallelism of 16 for the input channel and 8 for the output channel is then tried, giving 16*8*3*3 = 1152 logic units, which satisfies the budget, completing the parallelism setting.
Then, the loop levels covered by the memory read/write operations are determined. These loop levels affect the cache capacity and the number of memory accesses. If only minimizing the number of memory accesses were considered, it would be best to place the memory read/write operations outside all six loop levels, so that memory is read and written only once; however, this would require a very large cache capacity exceeding the on-chip cache, so the read/write operations must be placed within some loop level, as far as possible above the data reuse levels. The candidate levels for the three data read/write operations are: input data at the input channel or output channel level, weights at the input channel or output channel level, and output data at the output channel level. For these positions, the input data cache capacity, weight cache capacity and output data cache capacity are calculated separately; their sum must be less than the on-chip cache capacity. If the cache capacities calculated for all positions are too large, the tiling parameters should be reduced or the hierarchical loop order adjusted.
After the architecture parameters have been generated, the cache capacity is evaluated against the resource requirements and the number of memory read/write cycles against the performance requirements. The cache capacity is evaluated first. The resource threshold for the cache capacity of each convolutional layer is preset to 500KB; if at least one cache capacity exceeds 500KB, the architecture parameters are adjusted, and if no adjustment of the architecture parameters can make all cache capacities less than or equal to 500KB, the compression rate is adjusted (provided that the model accuracy is not below the model accuracy threshold), a second compressed model is generated, and the above steps are repeated until all cache capacities are less than or equal to 500KB. The number of memory read/write cycles is evaluated after the cache capacity evaluation has passed. The performance threshold for the sum of all memory read/write cycle counts is preset to 100M; if the sum exceeds 100M, the architecture parameters are adjusted, and if no adjustment of the architecture parameters can make the sum less than or equal to 100M, the compression rate is adjusted (provided that the model accuracy is not below the model accuracy threshold), a third compressed model is generated, and the above steps are repeated until the sum of all memory read/write cycle counts is less than or equal to 100M.
After the evaluation against the resource requirements and performance requirements is complete, the final network parameters and the finally determined architecture parameters are output.
An embodiment of the present disclosure further provides an accelerator parameter determination apparatus, including: one or more processors; and a storage apparatus configured to store one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the accelerator parameter determination method according to the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the accelerator parameter determination method according to the embodiments of the present disclosure.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods and the functional modules/units of the apparatus disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and are to be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, as will be apparent to those skilled in the art, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (10)

  1. An accelerator parameter determination method, comprising:
    compressing a neural network according to preset compression parameters, and obtaining network parameters of the compressed neural network;
    generating architecture parameters of an accelerator according to the network parameters, wherein the accelerator is used to accelerate the operation of the neural network; and
    evaluating the architecture parameters according to preset resource requirements and performance requirements, and outputting the network parameters and the architecture parameters after the evaluation is passed.
  2. The accelerator parameter determination method according to claim 1, wherein compressing the neural network according to the preset compression parameters and obtaining the network parameters of the compressed neural network comprises:
    compressing each layer of the neural network according to the compression parameters, and obtaining the network parameters of each layer of the compressed neural network.
  3. The accelerator parameter determination method according to claim 2, wherein the accelerator is a field programmable gate array accelerator, the architecture parameters comprise memory parameters, and the memory parameters comprise a cache capacity corresponding to each layer of the neural network and a number of memory read/write cycles, and
    evaluating the architecture parameters according to the preset resource requirements and performance requirements comprises:
    evaluating the cache capacity according to the resource requirements; and
    evaluating the number of memory read/write cycles according to the performance requirements.
  4. The accelerator parameter determination method according to claim 3, wherein the architecture parameters further comprise a hierarchical loop order and tiling parameters.
  5. The accelerator parameter determination method according to claim 3, wherein evaluating the cache capacity according to the resource requirements comprises:
    comparing all of the cache capacities with their respective resource thresholds, wherein the resource thresholds are determined according to the resource requirements;
    in response to all of the cache capacities being less than or equal to their respective resource thresholds, determining that the evaluation result of the cache capacity is a pass; and
    in response to at least one of the cache capacities being greater than its corresponding resource threshold, adjusting the architecture parameters and returning to the step of comparing all of the cache capacities with their respective resource thresholds.
  6. The accelerator parameter determination method according to claim 3, wherein evaluating the number of memory read/write cycles according to the performance requirements comprises:
    comparing the sum of all of the numbers of memory read/write cycles with a preset performance threshold, wherein the performance threshold is determined according to the performance requirements;
    in response to the sum of all of the numbers of memory read/write cycles being less than or equal to the performance threshold, determining that the evaluation result of the number of memory read/write cycles is a pass; and
    in response to the sum of all of the numbers of memory read/write cycles being greater than the performance threshold, adjusting the architecture parameters and returning to the step of comparing the sum of all of the numbers of memory read/write cycles with the preset performance threshold.
  7. The accelerator parameter determination method according to claim 5 or 6, wherein adjusting the architecture parameters comprises:
    when a preset loop exit condition is met, adjusting the compression parameters, re-compressing the neural network according to the adjusted compression parameters, and obtaining the network parameters of the compressed neural network.
  8. The accelerator parameter determination method according to any one of claims 3 to 6, wherein the evaluation of the number of memory read/write cycles is performed after the evaluation of the cache capacity has passed.
  9. An accelerator parameter determination apparatus, comprising:
    one or more processors; and
    a storage apparatus configured to store one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the accelerator parameter determination method according to any one of claims 1 to 8.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the accelerator parameter determination method according to any one of claims 1 to 8.
PCT/CN2021/118418 2020-09-15 2021-09-15 Accelerator parameter determination method and apparatus, and computer-readable storage medium WO2022057813A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010967711.1 2020-09-15
CN202010967711.1A CN114186677A (zh) 2020-09-15 2020-09-15 Accelerator parameter determination method and apparatus, and computer-readable medium

Publications (1)

Publication Number Publication Date
WO2022057813A1 true WO2022057813A1 (zh) 2022-03-24

Family

ID=80539129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118418 WO2022057813A1 (zh) 2020-09-15 2021-09-15 Accelerator parameter determination method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114186677A (zh)
WO (1) WO2022057813A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236187A (zh) * 2023-09-28 2023-12-15 中国科学院大学 Parameterized design method and system for a deep learning accelerator chip

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090560A (zh) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 Design method for an FPGA-based LSTM recurrent neural network hardware accelerator
CN108280514A (zh) * 2018-01-05 2018-07-13 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN109740731A (zh) * 2018-12-15 2019-05-10 华南理工大学 Adaptive convolutional layer hardware accelerator design method
CN110378468A (zh) * 2019-07-08 2019-10-25 浙江大学 Neural network accelerator based on structured pruning and low-bit quantization
US20200193274A1 (en) * 2018-12-18 2020-06-18 Microsoft Technology Licensing, Llc Training neural network accelerators using mixed precision data formats

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090560A (zh) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 Design method for an FPGA-based LSTM recurrent neural network hardware accelerator
CN108280514A (zh) * 2018-01-05 2018-07-13 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN109740731A (zh) * 2018-12-15 2019-05-10 华南理工大学 Adaptive convolutional layer hardware accelerator design method
US20200193274A1 (en) * 2018-12-18 2020-06-18 Microsoft Technology Licensing, Llc Training neural network accelerators using mixed precision data formats
CN110378468A (zh) * 2019-07-08 2019-10-25 浙江大学 Neural network accelerator based on structured pruning and low-bit quantization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236187A (zh) * 2023-09-28 2023-12-15 中国科学院大学 Parameterized design method and system for a deep learning accelerator chip
CN117236187B (zh) * 2023-09-28 2024-03-19 中国科学院大学 Parameterized design method and system for a deep learning accelerator chip

Also Published As

Publication number Publication date
CN114186677A (zh) 2022-03-15

Similar Documents

Publication Publication Date Title
CN110378468B (zh) Neural network accelerator based on structured pruning and low-bit quantization
Sheng et al. Flexgen: High-throughput generative inference of large language models with a single gpu
US20180204110A1 (en) Compressed neural network system using sparse parameters and design method thereof
US11816574B2 (en) Structured pruning for machine learning model
Xia et al. SparkNoC: An energy-efficiency FPGA-based accelerator using optimized lightweight CNN for edge computing
CN109472361B (zh) Neural network optimization method
CN111831254A (zh) Image processing acceleration method, image processing model storage method, and corresponding apparatus
CN110738316B (zh) Neural-network-based operation method and apparatus, and electronic device
CN110275733A (zh) GPU parallel acceleration method for solving the phonon Boltzmann equation based on the finite volume method
JP2021532437A (ja) Improving machine learning models to improve locality
WO2022057813A1 (zh) Accelerator parameter determination method and apparatus, and computer-readable storage medium
CN111079923A (zh) Spark convolutional neural network system for edge computing platforms and circuit therefor
CN112598129A (zh) Tunable hardware-aware pruning and mapping framework based on a ReRAM neural network accelerator
CN108615254A (zh) Point cloud rendering method, system and apparatus based on tree-structured grid vector quantization
CN113361695A (zh) Convolutional neural network accelerator
CN116720549A (zh) FPGA multi-core two-dimensional convolution acceleration optimization method based on full buffering of CNN input
TW202338668A Sparsity masking method for neural network training
CN109918281B (zh) Accelerator performance test method for multiple bandwidth targets
Peng et al. Cmq: Crossbar-aware neural network mixed-precision quantization via differentiable architecture search
CN108038304A (zh) Parallel acceleration method for the lattice Boltzmann method exploiting temporal locality
US20240004718A1 (en) Compiling tensor operators for neural network models based on tensor tile configurations
CN115983366A (zh) Model pruning method and system for federated learning
Hemmat et al. Airnn: A featherweight framework for dynamic input-dependent approximation of cnns
Li et al. A high-performance inference accelerator exploiting patterned sparsity in CNNs
TW202230225A Calibration method and apparatus for analog circuits performing neural network computations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868635

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 010823)

122 Ep: pct application non-entry in european phase

Ref document number: 21868635

Country of ref document: EP

Kind code of ref document: A1