WO2021259106A1 - Optimization method, system, device and storage medium for neural network chip - Google Patents

Optimization method, system, device and storage medium for neural network chip

Info

Publication number
WO2021259106A1
WO2021259106A1 PCT/CN2021/100375
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
chip
running time
intermediate expression
network chip
Prior art date
Application number
PCT/CN2021/100375
Other languages
English (en)
French (fr)
Inventor
邹伟
熊超
蔡权雄
牛昕宇
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Publication of WO2021259106A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the embodiments of the present application relate to neural network technology, for example, to an optimization method, system, device, and storage medium of a neural network chip.
  • AI Artificial Intelligence
  • Inference chip application scenarios include cloud and terminal scenarios. Cloud scenarios do not place harsh requirements on the power consumption or performance of the chip, but terminal scenarios such as autonomous driving and security require real-time response and low chip power consumption, which has given rise to inference chips customized for specific scenarios.
  • the current approach to customized AI chips basically uses hardware to accelerate certain operators and makes the application scenario adapt to the AI chip. In practice, however, there is usually an application scenario first; a neural network model is implemented for that scenario, and this neural network model then requires an AI chip that can deliver extreme performance. However, a current customized AI chip only performs hardware acceleration for a certain algorithm, divorced from the actual application scenario, and the model does not fit the customized AI chip well; although it improves on a general-purpose chip, it cannot realize the full potential of the chip.
  • the embodiments of the present application provide an optimization method, system, device, and storage medium of a neural network chip, so as to optimize the neural network chip so that it has a high degree of compatibility with a neural network model for a certain application scenario.
  • the embodiment of the present application provides a method for optimizing a neural network chip, the method including:
  • the chip parameters of the neural network chip are adjusted according to the total running time to optimize the neural network chip.
  • the compiling the neural network model into a first intermediate expression includes:
  • the second intermediate expression is converted into a first intermediate expression based on the data flow architecture.
  • the converting the second intermediate expression into a first intermediate expression based on a data flow architecture includes:
  • the step of compiling the neural network model into the first intermediate expression includes:
  • the first intermediate expression is optimized based on the chip parameters of the neural network chip.
  • the optimizing the first intermediate expression based on the chip parameters of the neural network chip includes:
  • the one or more first calculation units are classified and packaged according to the calculation sequence.
  • the adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip includes:
  • the input of the chip running time formula is the chip parameter of the neural network chip, and the output of the chip running time formula is the total running time;
  • the chip parameters of the neural network chip are adjusted multiple times according to the chip running time formula until the total running time is minimized.
  • the adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip includes:
  • the neural network model is adjusted according to the total running time.
  • the embodiment of the present application also provides a neural network chip optimization system, which includes:
  • the model acquisition module is configured to acquire the preset neural network model;
  • a model compilation module configured to compile the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units;
  • a time determining module configured to determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first computing units on the neural network chip;
  • the chip optimization module is configured to adjust the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
  • an embodiment of the present application also provides an optimization device for a neural network chip.
  • the device includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method provided in any embodiment of the present application.
  • an embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, a method as provided in any embodiment of the present application is implemented.
  • the embodiment of the present application obtains a preset neural network model; compiles the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units; determines the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip; and adjusts the chip parameters of the neural network chip according to the total running time to optimize the neural network chip. This solves the problem that the neural network model does not fit the AI chip well and the full potential of the chip cannot be used, and achieves the effect of optimizing the neural network chip so that it fits closely with the neural network model for a given application scenario.
  • FIG. 1 is a schematic flowchart of a method for optimizing a neural network chip according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic flowchart of a method for optimizing a neural network chip according to Embodiment 2 of the present application;
  • FIG. 3 is a schematic flowchart of step S230 in a method for optimizing a neural network chip provided in the second embodiment of the present application;
  • FIG. 4 is a schematic flowchart of step S240 in a method for optimizing a neural network chip provided in the second embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a neural network chip optimization system provided in the third embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a neural network chip optimization device provided in the fourth embodiment of the application.
  • Some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential processing, many of the steps can be implemented in parallel, concurrently, or simultaneously. In addition, the order of the steps can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the drawings. A process can correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
  • the terms “first”, “second”, etc. may be used herein to describe various directions, actions, steps or elements, but these directions, actions, steps or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element.
  • the first module may be referred to as the second module, and similarly, the second module may be referred to as the first module. Both the first module and the second module are modules, but they are not the same module.
  • the terms “first”, “second”, etc. cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features.
  • features defined with “first” and “second” may explicitly or implicitly include one or more of those features.
  • “a plurality of” means at least two, such as two, three, etc., unless otherwise specifically defined.
  • the first embodiment of the present application provides a neural network chip optimization method, system, device, and storage medium.
  • the method includes:
  • the parameters of the AI chip can be adjusted based on the neural network model during the design of the AI chip.
  • Both the neural network model and the AI chip serve a specific application scenario, so once the application scenario is determined, the first thing that can be fixed is the neural network model suitable for that scenario; the preset neural network model is therefore obtained first.
  • the preset neural network model can be a neural network model for a certain application scenario provided by the user.
  • specifically, it may be a model implemented with TensorFlow (a symbolic mathematics system based on dataflow programming), Caffe (Convolutional Architecture for Fast Feature Embedding, a convolutional neural network framework), or PyTorch (a Python-based deep learning framework).
  • S130 Determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first computing units on the neural network chip.
  • the neural network model needs to be compiled into the first intermediate expression to facilitate the analysis of the neural network model.
  • the first intermediate expression includes one or more first calculation units. An arbitrary fixed amount of data is then selected as the data input of the first intermediate expression, the first intermediate expression is run on the designed neural network chip, and the running time of the one or more first calculation units on the neural network chip, i.e. the calculation time for that fixed amount of data, is determined; the total running time of the first intermediate expression on the neural network chip is then determined according to the running time of the one or more first calculation units on the neural network chip.
  • a set of data with a length of 10 is used as the data input. If the running time of one first calculation unit on the neural network chip is 1 ms and the running time of each first calculation unit is the same, then executing the calculation of this set of data takes 10 ms, which is the total running time.
  • S140 Adjust the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
  • the chip parameters of the neural network chip can be adjusted, the same data recalculated to obtain a new total running time, and the above steps repeated; the optimal chip parameters are determined through multiple calculations and comparisons, completing the optimization of the neural network chip.
  • the embodiment of the application obtains a preset neural network model; compiles the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units; determines the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip; and adjusts the chip parameters of the neural network chip according to the total running time to optimize the neural network chip. This solves the problem that the neural network model does not fit the AI chip well and the full potential of the chip cannot be used, and achieves optimization of the neural network chip so that it fits closely with the neural network model for a given application scenario.
  • the second embodiment of the present application provides a method for optimizing a neural network chip.
  • the second embodiment of the present application is explained and explained on the basis of the first embodiment of the present application, as shown in Figure 2,
  • the method includes:
  • S220 Convert the neural network model into a second intermediate expression based on the instruction set architecture.
  • at present, neural network models are all trained on instruction-set-based GPU/CPU chips.
  • the second intermediate expression includes one or more second calculation units.
  • the second intermediate expression can be optionally converted to the first intermediate expression based on the data flow architecture.
  • the first intermediate expression includes one or more first calculation units. The difference between the first calculation unit and the second calculation unit is that the first calculation unit has a coarser granularity, is more concise, and is suitable for classification and calculation of a large amount of data. This step can be directly implemented by the preset AI compiler.
  • the first intermediate expression can also be optimized based on the chip parameters of the neural network chip.
  • for example, computing-node fusion and sorting are performed on the first intermediate expression, and cache allocation is simulated in software.
  • the calculation method of the neural network chip is determined by the chip parameters of the neural network chip, thereby adaptively optimizing the first intermediate expression.
  • S250 Determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first computing units on the neural network chip.
  • S270 Adjust the chip parameters of the neural network chip multiple times according to the chip running time formula until the total running time is minimized.
  • a chip running time formula can be defined, the chip parameters of the neural network chip are used as the input of the formula, and the total running time is used as the output of the formula, that is, an automation model is established to be suitable for multiple chip tests.
  • according to the chip running time formula, the chip parameters of the neural network chip are adjusted several times until the total running time is minimized. Each trial only requires inputting the adjusted parameters; after the total running time is obtained, a corresponding model report can also be generated for the user to analyze, so that the user can determine the next parameter adjustment.
  • for example, if the neural network model is relatively small and not much data needs to be stored or computed during calculation, optimization is performed by reducing the internal buffer size of the neural network chip and increasing the clock frequency at which the neural network chip reads and writes data.
  • the neural network model can also be adjusted according to the total running time. It should be noted that the adjustment of the neural network model must not affect the accuracy of the neural network model; fine-tuning is performed to achieve the optimal balance between the neural network model and the neural network chip, making the neural network model closer to the design principles of the underlying neural network chip.
  • step S230 in the embodiment of the present application specifically includes:
  • S232 Match and map one or more second calculation units to one or more first calculation units based on a preset data flow architecture.
  • S233 Obtain a first intermediate expression based on the data flow architecture according to one or more first calculation units.
  • the second intermediate expression is first analyzed to obtain one or more second calculation units.
  • step S240 in the embodiment of the present application specifically includes:
  • S242 Classify and pack the one or more first calculation units according to the calculation sequence.
  • computation on the neural network chip proceeds according to the time at which data flows in; however, if for a period of time the incoming data is all of the type computed by the same calculation unit, the chip's computation will be blocked, greatly reducing the computational efficiency of the neural network chip. Therefore, the calculation order of the neural network chip is first determined according to the chip parameters of the neural network chip, and the one or more first calculation units are then classified and packed according to this calculation order into a DataPath (data channel), so that the received intermediate expression does not need to be classified again at the AI chip level.
  • if the calculation sequence of the neural network chip is calculation unit A - calculation unit B - calculation unit C, then all the calculation units in the neural network model are classified accordingly, and the classified A, B, and C calculation units are packed into one continuously computing DataPath.
  • the second intermediate expression is converted into a first intermediate expression based on a data flow architecture, the first intermediate expression including one or more first calculation units; the first intermediate expression is optimized based on the chip parameters of the neural network chip; and the neural network model is adjusted according to the total running time, so that the neural network model, the AI chip, and the running performance form a closed-loop positive-feedback mechanism. This improves the customization capability of the AI chip; simulating the performance of the AI chip during the design stage feeds back into the AI chip design or the AI chip resource parameters for optimization, reducing R&D costs.
  • the third embodiment of the present application provides a neural network chip optimization system 100.
  • the neural network chip optimization system 100 provided in the third embodiment of the present application can execute the neural network chip optimization method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method.
  • the optimization system 100 of the neural network chip includes a model acquisition module 200, a model compilation module 300, a time determination module 400, and a chip optimization module 500.
  • the model acquisition module 200 is used to obtain a preset neural network model; the model compilation module 300 is used to compile the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units; the time determination module 400 is used to determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip; and the chip optimization module 500 is used to adjust the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
  • the model compilation module 300 is specifically configured to convert the neural network model into a second intermediate expression based on an instruction set architecture; and convert the second intermediate expression into a first intermediate expression based on a data flow architecture.
  • the model compilation module 300 is also specifically used to parse the second intermediate expression to obtain one or more second calculation units; match and map the one or more second calculation units to one or more first calculation units based on a preset data flow architecture; and obtain the first intermediate expression based on the data flow architecture according to the one or more first calculation units.
  • the optimization system 100 of the neural network chip further includes a model optimization module 600, and the model optimization module 600 is used to optimize the first intermediate expression based on the chip parameters of the neural network chip.
  • the model optimization module 600 is also specifically configured to determine the calculation sequence of the neural network chip according to the chip parameters of the neural network chip; and classify and pack one or more first calculation units according to the calculation sequence.
  • the chip optimization module 500 is specifically used to define a chip running time formula, where the input of the chip running time formula is the chip parameters of the neural network chip and the output of the chip running time formula is the total running time, and to adjust the chip parameters of the neural network chip multiple times according to the chip running time formula until the total running time is minimized.
  • the model optimization module 600 is also used to adjust the neural network model according to the total running time.
  • FIG. 6 is a schematic structural diagram of a computer device 12 for optimizing a neural network chip provided in the fourth embodiment of the present application.
  • FIG. 6 shows a block diagram of an exemplary computer device 12 suitable for implementing the embodiments of the present application.
  • the computer device 12 shown in FIG. 6 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the computer device 12 is represented in the form of a general-purpose computing device.
  • the components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • the bus 18 represents one or more of several types of bus structures, including a memory bus or a memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any bus structure among multiple bus structures.
  • these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • ISA industry standard architecture
  • MCA Micro Channel Architecture
  • VESA Video Electronics Standards Association
  • PCI peripheral component interconnection
  • the computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and nonvolatile media, removable and non-removable media.
  • the system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • the computer device 12 may include other removable/non-removable, volatile/nonvolatile computer system storage media.
  • the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 6 and generally referred to as a "hard drive").
  • a disk drive for reading and writing to removable non-volatile disks such as "floppy disks”
  • a removable non-volatile optical disk such as CD-ROM, DVD-ROM
  • each drive can be connected to the bus 18 through one or more data media interfaces.
  • the memory 28 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.
  • a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the program module 42 usually executes the functions and/or methods in the embodiments described in this application.
  • the computer device 12 can also communicate with one or more external devices 14 (such as keyboards, pointing devices, displays 24, etc.), with one or more devices that enable users to interact with the computer device 12, and/or with any device (such as a network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 22.
  • the computer device 12 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with other modules of the computer device 12 through the bus 18.
  • LAN local area network
  • WAN wide area network
  • public network such as the Internet
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, to implement the methods provided in the embodiments of the present application.
  • the fifth embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method provided in the embodiments of the present application is implemented.
  • the computer storage medium of the embodiment of the present application may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • more specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • RAM random access memory
  • ROM read-only memory
  • EPROM or flash memory Erasable programmable read-only memory
  • CD-ROM compact disk read-only memory
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
  • the computer program code used to perform the operations of this application can be written in one or more programming languages or a combination thereof.
  • the programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the “C” language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer, or entirely executed on the remote computer or server.
  • the remote computer can be connected to the user’s computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • LAN local area network
  • WAN wide area network

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Stored Programmes (AREA)

Abstract

The embodiments of the present application disclose an optimization method, system, device and storage medium for a neural network chip. The method includes: acquiring a preset neural network model; compiling the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units; determining the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip; and adjusting chip parameters of the neural network chip according to the total running time to optimize the neural network chip.

Description

Optimization method, system, device and storage medium for a neural network chip
This application claims priority to the Chinese patent application No. 202010574428.2 filed with the Chinese Patent Office on June 22, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to neural network technology, for example, to an optimization method, system, device and storage medium for a neural network chip.
Background
In recent years, with the rise of neural networks, AI (Artificial Intelligence) chips dedicated to neural networks have flourished, and the birth of every chip embodies the day-and-night efforts of researchers.
At present, artificial intelligence chips can be divided by function into training and inference chips, and the current state of development is dominated by training. However, as neural network applications keep being deployed in the field, the demand for inference chips is gradually rising. Inference chip application scenarios include cloud and terminal scenarios. Cloud scenarios impose no harsh requirements on chip power consumption or performance, but terminal scenarios such as autonomous driving and security require real-time response and low chip power consumption, which has given rise to inference chips customized for specific scenarios.
However, how to deeply integrate current customized AI chips with application scenarios remains a difficult problem. The current approach to customized AI chips is basically to use hardware to accelerate certain operators and let the application scenario adapt to the AI chip. In practice, however, there is usually an application scenario first; a neural network model is implemented for that scenario, and this neural network model then requires an AI chip capable of extreme performance. Current customized AI chips merely accelerate a particular algorithm in hardware, divorced from the actual application scenario, and the model does not fit the customized AI chip well; although this is an improvement over general-purpose chips, it cannot bring out the chip's full potential.
Summary
The embodiments of the present application provide an optimization method, system, device and storage medium for a neural network chip, so as to optimize the neural network chip such that it fits closely with the neural network model for a given application scenario.
An embodiment of the present application provides a method for optimizing a neural network chip, the method including:
acquiring a preset neural network model;
compiling the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units;
determining the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip;
adjusting chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
Optionally, compiling the neural network model into the first intermediate expression includes:
converting the neural network model into a second intermediate expression based on an instruction set architecture;
converting the second intermediate expression into a first intermediate expression based on a dataflow architecture.
Optionally, converting the second intermediate expression into the first intermediate expression based on the dataflow architecture includes:
parsing the second intermediate expression to obtain one or more second calculation units;
matching and mapping the one or more second calculation units to one or more first calculation units based on a preset dataflow architecture;
obtaining the first intermediate expression based on the dataflow architecture according to the one or more first calculation units.
Optionally, after compiling the neural network model into the first intermediate expression, the method includes:
optimizing the first intermediate expression based on chip parameters of the neural network chip.
Optionally, optimizing the first intermediate expression based on the chip parameters of the neural network chip includes:
determining the computation order of the neural network chip according to the chip parameters of the neural network chip;
classifying and packing the one or more first calculation units according to the computation order.
Optionally, adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip includes:
defining a chip running time formula, where the input of the chip running time formula is the chip parameters of the neural network chip, and the output of the chip running time formula is the total running time;
adjusting the chip parameters of the neural network chip multiple times according to the chip running time formula until the total running time is minimized.
Optionally, after adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip, the method includes:
adjusting the neural network model according to the total running time.
In one aspect, an embodiment of the present application further provides an optimization system for a neural network chip, the system including:
a model acquisition module configured to acquire a preset neural network model;
a model compilation module configured to compile the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units;
a time determination module configured to determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip;
a chip optimization module configured to adjust chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
In another aspect, an embodiment of the present application further provides an optimization device for a neural network chip, the device including: one or more processors; and a storage apparatus configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method provided in any embodiment of the present application.
In yet another aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the method provided in any embodiment of the present application.
In the embodiments of the present application, a preset neural network model is acquired; the neural network model is compiled into a first intermediate expression, the first intermediate expression including one or more first calculation units; the total running time of the first intermediate expression on the neural network chip is determined according to the running time of the one or more first calculation units on the neural network chip; and chip parameters of the neural network chip are adjusted according to the total running time to optimize the neural network chip. This solves the problem that the neural network model does not fit the AI chip well and the chip's full potential cannot be exploited, and achieves the effect of optimizing the neural network chip so that it fits closely with the neural network model for a given application scenario.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for optimizing a neural network chip according to Embodiment 1 of the present application;
FIG. 2 is a schematic flowchart of a method for optimizing a neural network chip according to Embodiment 2 of the present application;
FIG. 3 is a schematic flowchart of step S230 in the method for optimizing a neural network chip according to Embodiment 2 of the present application;
FIG. 4 is a schematic flowchart of step S240 in the method for optimizing a neural network chip according to Embodiment 2 of the present application;
FIG. 5 is a schematic structural diagram of an optimization system for a neural network chip according to Embodiment 3 of the present application;
FIG. 6 is a schematic structural diagram of an optimization device for a neural network chip according to Embodiment 4 of the present application.
Detailed Description
The present application is described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve to explain the present application rather than to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure.
Some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential processing, many of the steps may be implemented in parallel, concurrently, or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
In addition, the terms "first", "second", etc. may be used herein to describe various directions, actions, steps or elements, but these directions, actions, steps or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another. For example, without departing from the scope of the present application, a first module may be referred to as a second module, and similarly a second module may be referred to as a first module; both are modules, but they are not the same module. The terms "first", "second", etc. shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "a plurality of" means at least two, such as two, three, etc., unless specifically defined otherwise.
Embodiment 1
As shown in FIG. 1, Embodiment 1 of the present application provides an optimization method, system, device and storage medium for a neural network chip; the method includes:
S110: Acquire a preset neural network model.
In this embodiment, to solve the problem that the neural network model does not fit the AI chip well and the chip's full potential cannot be exploited, the parameters of the AI chip can be adjusted based on the neural network model during the design of the AI chip. Since both the neural network model and the AI chip serve a specific application scenario, once the application scenario is determined, the first thing that can be fixed is the neural network model suitable for that scenario; the preset neural network model is therefore acquired first. The preset neural network model may be a neural network model for a certain application scenario provided by the user; specifically, it may be a model implemented with TensorFlow (a symbolic mathematics system based on dataflow programming), Caffe (Convolutional Architecture for Fast Feature Embedding, a convolutional neural network framework), or PyTorch (a Python-based deep learning framework).
S120: Compile the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units.
S130: Determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip.
In this embodiment, after the neural network model is obtained, it needs to be compiled into the first intermediate expression to facilitate analysis of the neural network model; the first intermediate expression includes one or more first calculation units. An arbitrary fixed amount of data is then selected as the data input of the first intermediate expression, the first intermediate expression is run on the designed neural network chip, and the running time of the one or more first calculation units on the neural network chip, i.e. the calculation time for that fixed amount of data, is determined; the total running time of the first intermediate expression on the neural network chip is then determined from the running time of the one or more first calculation units on the neural network chip.
For example, a set of data with a length of 10 is used as the data input. If the running time of one first calculation unit on the neural network chip is 1 ms, and the running time of each first calculation unit is the same, then computing this set of data takes 10 ms, which is the total running time.
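To make the arithmetic of this example concrete, the following is a minimal Python sketch of accumulating the total running time from per-unit running times; the CalcUnit structure, its field names, and the example timings are illustrative assumptions rather than part of the disclosed method.

```python
# Minimal sketch: accumulate the total running time of a first intermediate
# expression from the measured running time of each first calculation unit.
# The CalcUnit structure and the example timings are hypothetical.
from dataclasses import dataclass

@dataclass
class CalcUnit:
    name: str
    runtime_ms: float  # measured running time on the neural network chip

def total_running_time(units: list[CalcUnit]) -> float:
    """Total running time of the first intermediate expression on the chip."""
    return sum(u.runtime_ms for u in units)

# A group of 10 data items, each taking 1 ms in its calculation unit,
# yields a total running time of 10 ms, as in the example above.
units = [CalcUnit(f"unit_{i}", 1.0) for i in range(10)]
print(total_running_time(units))  # 10.0
```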
S140: Adjust the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
In this embodiment, after the total running time of the first intermediate expression on the neural network chip is determined, the chip parameters of the neural network chip can be adjusted, the same data recomputed to obtain a new total running time, and the above steps repeated; through multiple calculations and comparisons, the optimal chip parameters are determined and the optimization of the neural network chip is completed.
In the embodiment of the present application, a preset neural network model is acquired; the neural network model is compiled into a first intermediate expression including one or more first calculation units; the total running time of the first intermediate expression on the neural network chip is determined according to the running time of the one or more first calculation units on the neural network chip; and the chip parameters of the neural network chip are adjusted according to the total running time to optimize the neural network chip. This solves the problem that the neural network model does not fit the AI chip well and the chip's full potential cannot be exploited, achieving the effect of optimizing the neural network chip so that it fits closely with the neural network model for a given application scenario.
Embodiment 2
As shown in FIGS. 2-4, Embodiment 2 of the present application provides a method for optimizing a neural network chip. Embodiment 2 is explained on the basis of Embodiment 1 of the present application. As shown in FIG. 2, the method includes:
S210: Acquire a preset neural network model.
S220: Convert the neural network model into a second intermediate expression based on an instruction set architecture.
S230: Convert the second intermediate expression into a first intermediate expression based on a dataflow architecture, the first intermediate expression including one or more first calculation units.
In this embodiment, after the neural network model is obtained, it first needs to be converted into a second intermediate expression based on an instruction set architecture, since at present neural network models are all trained on instruction-set-based GPU/CPU chips; the second intermediate expression includes one or more second calculation units. To optimize computation speed, the second intermediate expression is optionally converted into a first intermediate expression based on a dataflow architecture, the first intermediate expression including one or more first calculation units. The first calculation unit differs from the second calculation unit in that it has coarser granularity and is more concise, making it suitable for classification and for computing large amounts of data. This step can be implemented directly by a preset AI compiler.
S240: Optimize the first intermediate expression based on chip parameters of the neural network chip.
In this embodiment, after the first intermediate expression is obtained, it can also be optimized based on the chip parameters of the neural network chip; for example, computing-node fusion and sorting are performed on the first intermediate expression, and cache allocation is simulated in software. Specifically, the computation mode of the neural network chip is determined from the chip parameters of the neural network chip, and the first intermediate expression is optimized adaptively accordingly.
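As an illustration of the node fusion mentioned above, the following minimal Python sketch merges adjacent element-wise nodes of the first intermediate expression into fused nodes; the node names and the set of fusible types are hypothetical assumptions, not the compiler of the present application.

```python
# Sketch: a simple computing-node fusion pass over the first intermediate
# expression, merging adjacent element-wise nodes into one fused node so
# that the chip can execute them in a single pass. Names are hypothetical.
ELEMENTWISE = {"add", "mul", "relu"}  # hypothetical fusible node types

def fuse_nodes(nodes: list[str]) -> list[str]:
    fused, group = [], []
    for n in nodes:
        if n in ELEMENTWISE:
            group.append(n)  # extend the current run of fusible nodes
        else:
            if group:
                fused.append("fused(" + "+".join(group) + ")")
                group = []
            fused.append(n)
    if group:
        fused.append("fused(" + "+".join(group) + ")")
    return fused

print(fuse_nodes(["conv", "add", "relu", "pool", "mul"]))
# ['conv', 'fused(add+relu)', 'pool', 'fused(mul)']
```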
S250: Determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip.
S260: Define a chip running time formula, where the input of the chip running time formula is the chip parameters of the neural network chip and the output of the chip running time formula is the total running time.
S270: Adjust the chip parameters of the neural network chip multiple times according to the chip running time formula until the total running time is minimized.
In this embodiment, a chip running time formula can be defined with the chip parameters of the neural network chip as its input and the total running time as its output; that is, an automated model is established that is suitable for repeated chip trials. The chip parameters of the neural network chip are adjusted multiple times according to the chip running time formula until the total running time is minimized. Each trial only requires inputting the adjusted parameters; after the total running time is obtained, a corresponding model report can also be generated for the user to analyze, so that the user can determine the next parameter adjustment.
For example, default values are first set for the parameters of the neural network model, and the chip running time formula is run. After the model report is obtained, it is analyzed together with the specific neural network model and the parameters of the neural network chip to judge which parameter values affect running performance and leave resources underutilized; after multiple trials, a balance is finally reached. For instance, if the neural network model is relatively small and not much data needs to be stored or computed during calculation, optimization is performed by reducing the internal buffer size of the neural network chip and increasing the clock frequency at which the neural network chip reads and writes data.
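The following minimal Python sketch illustrates such an automated tuning loop. The chip running time formula here is a toy stand-in, and the parameter names (buffer_size_kb, clock_freq_mhz) are assumptions suggested by the buffer-size and clock-frequency example above.

```python
# Hypothetical sketch of the chip running time formula: chip parameters in,
# total running time out, adjusted repeatedly until the time is minimized.
from itertools import product

def chip_running_time(params: dict) -> float:
    # Toy stand-in for the formula: a higher read/write clock frequency
    # shortens compute time, while an oversized internal buffer adds a
    # small fixed overhead for a small model. Both terms are assumptions.
    compute = 1000.0 / params["clock_freq_mhz"]
    overhead = 0.001 * params["buffer_size_kb"]
    return compute + overhead

candidates = {
    "buffer_size_kb": [64, 128, 256],
    "clock_freq_mhz": [200, 400, 800],
}

best = None
for buf, freq in product(candidates["buffer_size_kb"],
                         candidates["clock_freq_mhz"]):
    params = {"buffer_size_kb": buf, "clock_freq_mhz": freq}
    t = chip_running_time(params)
    # Each trial could also emit a model report for the user to analyze.
    if best is None or t < best[1]:
        best = (params, t)

print(best)  # ({'buffer_size_kb': 64, 'clock_freq_mhz': 800}, 1.314)
```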
S280: Adjust the neural network model according to the total running time.
In this embodiment, the neural network model can also be adjusted according to the total running time. It should be noted that the adjustment of the neural network model must not affect the accuracy of the neural network model; fine-tuning is performed so as to reach the optimal balance between the neural network model and the neural network chip, making the neural network model closer to the design principles of the underlying neural network chip.
Optionally, as shown in FIG. 3, step S230 in the embodiment of the present application specifically includes:
S231: Parse the second intermediate expression to obtain one or more second calculation units.
S232: Match and map the one or more second calculation units to one or more first calculation units based on a preset dataflow architecture.
S233: Obtain the first intermediate expression based on the dataflow architecture according to the one or more first calculation units.
In this embodiment, the second intermediate expression is first parsed to obtain one or more second calculation units. For example, take an operator of any neural network model whose mathematical formula is y = x/sqrt(max(sum(x**2), epsilon)). The second intermediate expression represents it as four second calculation units: x1 = x**2; x1 = max(x1, epsilon); x1 = sqrt(x1); y = x/x1. The preset dataflow architecture, however, defines this mathematical formula directly as a single first calculation unit, the L1 calculation unit, and expresses it directly as y = L1(x, epsilon); whenever this formula structure appears later, it can again be matched to the L1 calculation unit. In this way, the one or more second calculation units are matched and mapped to one or more first calculation units based on the preset dataflow architecture, with as many first calculation units mapped as there are such mathematical formulas; finally, the first intermediate expression based on the dataflow architecture is obtained from the one or more first calculation units.
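The following minimal Python sketch pictures this match-and-map step as pattern matching over the parsed second calculation units; the flat operator sequence and the operator names in the pattern are simplifying assumptions.

```python
# Sketch: match a run of fine-grained second calculation units against a
# preset dataflow pattern and map it to one coarse first calculation unit.
# The pattern mirrors y = x / sqrt(max(sum(x**2), epsilon)).
L1_PATTERN = ["square", "max", "sqrt", "div"]  # hypothetical op names

def map_to_first_units(second_units: list[str]) -> list[str]:
    first_units, i = [], 0
    while i < len(second_units):
        if second_units[i:i + len(L1_PATTERN)] == L1_PATTERN:
            # The whole formula collapses into the single L1 calculation
            # unit, expressed as y = L1(x, epsilon).
            first_units.append("L1")
            i += len(L1_PATTERN)
        else:
            first_units.append(second_units[i])  # pass unmatched ops through
            i += 1
    return first_units

print(map_to_first_units(["square", "max", "sqrt", "div", "relu"]))
# ['L1', 'relu']
```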
Optionally, as shown in FIG. 4, step S240 in the embodiment of the present application specifically includes:
S241: Determine the computation order of the neural network chip according to the chip parameters of the neural network chip.
S242: Classify and pack the one or more first calculation units according to the computation order.
In this embodiment, the neural network chip computes in the order in which data flows in; however, if for a period of time the incoming data is all of the type computed by the same calculation unit, the computation of the neural network chip will be blocked, greatly reducing its computational efficiency. Therefore, the computation order of the neural network chip is first determined according to the chip parameters of the neural network chip, and the one or more first calculation units are then classified and packed according to this computation order into a DataPath (data channel), so that the received intermediate expression no longer needs to be classified at the AI chip level. Specifically, if the computation order of the neural network chip is calculation unit A - calculation unit B - calculation unit C, then all the calculation units in the neural network model are classified accordingly, and the classified A, B, and C calculation units are packed into one continuously computing DataPath.
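The following minimal Python sketch illustrates the classify-and-pack step, assuming calculation units are represented by their type names and using the computation order A-B-C of the example above; both are simplifying assumptions.

```python
# Sketch: classify first calculation units by type and pack them into a
# DataPath that interleaves them in the chip's computation order (A, B, C),
# so the chip never stalls on a long run of units of the same type.
from collections import defaultdict

def pack_datapath(units: list[str], order: list[str]) -> list[str]:
    buckets = defaultdict(list)
    for u in units:              # classify by calculation unit type
        buckets[u].append(u)
    datapath = []
    while any(buckets[t] for t in order):
        for t in order:          # emit in the chip's computation order
            if buckets[t]:
                datapath.append(buckets[t].pop())
    return datapath

print(pack_datapath(["A", "A", "B", "C", "B", "A"], ["A", "B", "C"]))
# ['A', 'B', 'C', 'A', 'B', 'A']
```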
In the embodiment of the present application, the second intermediate expression is converted into a first intermediate expression based on a dataflow architecture, the first intermediate expression including one or more first calculation units; the first intermediate expression is optimized based on the chip parameters of the neural network chip; and the neural network model is adjusted according to the total running time. The neural network model, the AI chip, and the running performance thus form a closed-loop positive-feedback mechanism, which improves the customization capability of the AI chip; simulating the performance of the AI chip during the design stage feeds back into the AI chip design or the AI chip resource parameters for optimization, reducing R&D costs.
Embodiment 3
As shown in FIG. 5, Embodiment 3 of the present application provides an optimization system 100 for a neural network chip. The optimization system 100 provided in Embodiment 3 can execute the optimization method for a neural network chip provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The optimization system 100 includes a model acquisition module 200, a model compilation module 300, a time determination module 400, and a chip optimization module 500.
Specifically, the model acquisition module 200 is configured to acquire a preset neural network model; the model compilation module 300 is configured to compile the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units; the time determination module 400 is configured to determine the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip; and the chip optimization module 500 is configured to adjust the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
In this embodiment, the model compilation module 300 is specifically configured to convert the neural network model into a second intermediate expression based on an instruction set architecture, and to convert the second intermediate expression into a first intermediate expression based on a dataflow architecture. The model compilation module 300 is further configured to parse the second intermediate expression to obtain one or more second calculation units; match and map the one or more second calculation units to one or more first calculation units based on a preset dataflow architecture; and obtain the first intermediate expression based on the dataflow architecture from the one or more first calculation units.
Optionally, the optimization system 100 further includes a model optimization module 600 configured to optimize the first intermediate expression based on the chip parameters of the neural network chip. The model optimization module 600 is further configured to determine the computation order of the neural network chip according to the chip parameters of the neural network chip, and to classify and pack the one or more first calculation units according to the computation order.
In this embodiment, the chip optimization module 500 is specifically configured to define a chip running time formula, where the input of the chip running time formula is the chip parameters of the neural network chip and the output is the total running time, and to adjust the chip parameters of the neural network chip multiple times according to the chip running time formula until the total running time is minimized. The model optimization module 600 is also configured to adjust the neural network model according to the total running time.
Embodiment 4
FIG. 6 is a schematic structural diagram of a computer device 12 for optimizing a neural network chip provided in Embodiment 4 of the present application. FIG. 6 shows a block diagram of an exemplary computer device 12 suitable for implementing the embodiments of the present application. The computer device 12 shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in FIG. 6, the computer device 12 is embodied in the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media accessible to the computer device 12, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may include other removable/non-removable, volatile/non-volatile computer system storage media. Merely by way of example, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the present application.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example, implementing the method provided in the embodiments of the present application:
acquiring a preset neural network model;
compiling the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units;
determining the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip;
adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
Embodiment 5
Embodiment 5 of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method provided in the embodiments of the present application:
acquiring a preset neural network model;
compiling the neural network model into a first intermediate expression, the first intermediate expression including one or more first calculation units;
determining the total running time of the first intermediate expression on the neural network chip according to the running time of the one or more first calculation units on the neural network chip;
adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
The computer storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wireline, optical cable, RF, etc., or any suitable combination of the above.
The computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof; the programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Claims (10)

  1. A method for optimizing a neural network chip, comprising:
    acquiring a preset neural network model;
    compiling the neural network model into a first intermediate expression, the first intermediate expression comprising one or more first calculation units;
    determining a total running time of the first intermediate expression on a neural network chip according to a running time of the one or more first calculation units on the neural network chip;
    adjusting chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
  2. The method according to claim 1, wherein compiling the neural network model into the first intermediate expression comprises:
    converting the neural network model into a second intermediate expression based on an instruction set architecture;
    converting the second intermediate expression into a first intermediate expression based on a dataflow architecture.
  3. The method according to claim 2, wherein converting the second intermediate expression into the first intermediate expression based on the dataflow architecture comprises:
    parsing the second intermediate expression to obtain one or more second calculation units;
    matching and mapping the one or more second calculation units to one or more first calculation units based on a preset dataflow architecture;
    obtaining the first intermediate expression based on the dataflow architecture according to the one or more first calculation units.
  4. The method according to claim 1, wherein after compiling the neural network model into the first intermediate expression, the method comprises:
    optimizing the first intermediate expression based on chip parameters of the neural network chip.
  5. The method according to claim 4, wherein optimizing the first intermediate expression based on the chip parameters of the neural network chip comprises:
    determining a computation order of the neural network chip according to the chip parameters of the neural network chip;
    classifying and packing the one or more first calculation units according to the computation order.
  6. The method according to claim 1, wherein adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip comprises:
    defining a chip running time formula, wherein an input of the chip running time formula is the chip parameters of the neural network chip and an output of the chip running time formula is the total running time;
    adjusting the chip parameters of the neural network chip multiple times according to the chip running time formula until the total running time is minimized.
  7. The method according to claim 1, wherein after adjusting the chip parameters of the neural network chip according to the total running time to optimize the neural network chip, the method comprises:
    adjusting the neural network model according to the total running time.
  8. An optimization system for a neural network chip, comprising:
    a model acquisition module configured to acquire a preset neural network model;
    a model compilation module configured to compile the neural network model into a first intermediate expression, the first intermediate expression comprising one or more first calculation units;
    a time determination module configured to determine a total running time of the first intermediate expression on a neural network chip according to a running time of the one or more first calculation units on the neural network chip;
    a chip optimization module configured to adjust chip parameters of the neural network chip according to the total running time to optimize the neural network chip.
  9. An optimization device for a neural network chip, comprising:
    one or more processors;
    a storage apparatus configured to store one or more programs;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-7.
  10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
PCT/CN2021/100375 2020-06-22 2021-06-16 Optimization method, system, device and storage medium for neural network chip WO2021259106A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010574428.2A CN111753973A (zh) 2020-06-22 2020-06-22 Optimization method, system, device and storage medium for neural network chip
CN202010574428.2 2020-06-22

Publications (1)

Publication Number Publication Date
WO2021259106A1 (zh) 2021-12-30

Family

ID=72675561

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100375 WO2021259106A1 (zh) 2020-06-22 2021-06-16 Optimization method, system, device and storage medium for neural network chip

Country Status (2)

Country Link
CN (1) CN111753973A (zh)
WO (1) WO2021259106A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753973A (zh) 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Optimization method, system, device and storage medium for neural network chip
CN112529175B (zh) * 2020-11-05 2022-03-18 上海交通大学 Neural network compilation method and system, computer storage medium, and compilation device
CN112328674B (zh) * 2020-11-17 2024-05-14 深圳力维智联技术有限公司 Cross-data-format model conversion acceleration method and apparatus
CN114328098B (zh) * 2021-12-23 2023-04-18 北京百度网讯科技有限公司 Slow node detection method and apparatus, electronic device, and storage medium
CN115437642B (zh) * 2022-11-07 2024-05-14 深圳鲲云信息科技有限公司 Model compilation method and apparatus, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180349189A1 (en) * 2017-06-03 2018-12-06 Apple Inc. Dynamic task allocation for neural networks
CN109299780A (zh) * 2018-09-05 2019-02-01 深圳灵图慧视科技有限公司 Neural network model compression method and apparatus, and computer device
CN110515739A (zh) * 2019-10-23 2019-11-29 上海燧原智能科技有限公司 Deep learning neural network model load calculation method, apparatus, device, and medium
CN111210005A (zh) * 2019-12-31 2020-05-29 Oppo广东移动通信有限公司 Device operation method and apparatus, storage medium, and electronic device
CN111753973A (zh) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Optimization method, system, device and storage medium for neural network chip

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11003984B2 (en) * 2016-05-31 2021-05-11 Samsung Electronics Co., Ltd. Timing sequence for digital STDP synapse and LIF neuron-based neuromorphic system
CN106650922B (zh) * 2016-09-29 2019-05-03 清华大学 Hardware neural network conversion method, computing apparatus, and software-hardware cooperation system
KR102208989B1 (ko) * 2017-03-24 2021-01-28 Google LLC Device placement optimization through reinforcement learning
CN107958285A (zh) * 2017-11-21 2018-04-24 深圳普思英察科技有限公司 Mapping method and apparatus for neural networks oriented to embedded systems
CN108596331A (zh) * 2018-04-16 2018-09-28 浙江大学 Optimization method for a cellular neural network hardware architecture
CN111160515B (zh) * 2019-12-09 2023-03-21 中山大学 Running time prediction method, model search method, and system


Also Published As

Publication number Publication date
CN111753973A (zh) 2020-10-09

Similar Documents

Publication Publication Date Title
WO2021259106A1 (zh) 神经网络芯片的优化方法、系统、设备和存储介质
CN110852438B (zh) 模型生成方法和装置
US6816814B2 (en) Method and apparatus for decomposing and verifying configurable hardware
WO2021139633A1 (zh) 深度学习模型的转化方法、装置、服务器及存储介质
CN109491494B (zh) 功率参数的调整方法、装置及强化学习模型训练方法
WO2019001418A1 (zh) 数据共享系统及其数据共享方法
US11657305B2 (en) Multi-method system for optimal predictive model selection
WO2020000689A1 (zh) 基于迁移学习的智能投顾策略生成方法及装置、电子设备、存储介质
CN114118433A (zh) 一种设备的配置参数的推荐方法及装置
CN102981827A (zh) 一种基于中间件的显示界面数据处理方法及平台
Erbas System-level modelling and design space exploration for multiprocessor embedded system-on-chip architectures
US20220206770A1 (en) Using artificial intelligence to optimize software to run on heterogeneous computing resource
WO2022012233A1 (zh) 一种量化校准方法、计算装置和计算机可读存储介质
CN114648103A (zh) 用于处理深度学习网络的自动多目标硬件优化
WO2023193547A1 (zh) 用于生成和存储电路仿真过程中的波形数据的方法、电子设备和存储介质
CN111552652A (zh) 基于人工智能芯片的数据处理方法、装置和存储介质
WO2022028224A1 (zh) 数据存储方法、装置、设备和存储介质
CN114968585A (zh) 资源配置方法、装置、介质和计算设备
CN111309382B (zh) 基于神经网络的指令推送方法、系统、设备及存储介质
CN114253550A (zh) 优化策略生成方法和算子构建方法
WO2021077282A1 (zh) 神经网络模型转化方法、装置、服务器及存储介质
WO2024046463A1 (zh) 模型构建方法、装置、平台、电子设备及存储介质
WO2022042519A1 (zh) 资源分配方法和装置、计算机设备、计算机可读存储介质
US20220292361A1 (en) Method, electronic device, and computer program product for data processing
WO2020073874A1 (zh) 机器学习运算的分配系统及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21829295

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21829295

Country of ref document: EP

Kind code of ref document: A1