WO2021077281A1 - 深度学习框架的调整方法、装置、服务器及存储介质 - Google Patents

深度学习框架的调整方法、装置、服务器及存储介质 Download PDF

Info

Publication number
WO2021077281A1
WO2021077281A1 · PCT/CN2019/112463 · CN2019112463W
Authority
WO
WIPO (PCT)
Prior art keywords
operator
data flow
flow calculation
calculation graph
target
Prior art date
Application number
PCT/CN2019/112463
Other languages
English (en)
French (fr)
Inventor
邹伟
熊超
牛昕宇
蔡权雄
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Priority to PCT/CN2019/112463 priority Critical patent/WO2021077281A1/zh
Priority to CN201980100791.6A priority patent/CN114514506A/zh
Priority to US17/771,035 priority patent/US20220366249A1/en
Publication of WO2021077281A1 publication Critical patent/WO2021077281A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks

Definitions

  • the embodiments of the present application relate to the field of deep learning technology, for example, to a method, device, server, and storage medium for adjusting a deep learning framework.
  • the data format of the deep learning framework is designed for the instruction set architecture.
  • the feature of the instruction set architecture is that the data format can be correspondingly split into a single instruction form, the granularity of the computing unit is small, and the computing units can be combined arbitrarily.
  • the data format running on the data flow architecture has a larger granularity of computing units, and the supported combinations of computing units are also limited.
  • the corresponding form is a data path instead of an instruction unit.
  • a data path is often composed of multiple complex calculation units.
  • the embodiments of the present application provide a method, device, server, and storage medium for adjusting a deep learning framework, so as to achieve the effect of improving the calculation efficiency of a deep learning framework based on the data flow architecture.
  • the embodiment of the present application provides a method for adjusting a deep learning framework, including:
  • obtaining an initial data flow calculation graph, the initial data flow calculation graph including a first operator that calculates an initial constant expression;
  • a target data flow calculation graph is obtained according to the parameters in the initial constant expression, the target data flow calculation graph includes a second operator, and the target data flow calculation graph is used to control the deep learning framework chip to perform data calculations.
  • the granularity of the second operator is greater than the granularity of the first operator to adjust the calculation amount of the deep learning framework chip.
  • An embodiment of the present application provides an adjustment device for a deep learning framework, including:
  • An obtaining module configured to obtain an initial data flow calculation graph, the initial data flow calculation graph including a first operator that calculates an initial constant expression
  • the optimization module is configured to obtain a target data flow calculation graph according to the parameters in the initial constant expression, where the target data flow calculation graph includes a second operator, the target data flow calculation graph is used to control the deep learning framework chip to perform data calculation, and the granularity of the second operator is greater than the granularity of the first operator so as to adjust the calculation amount of the deep learning framework chip.
  • the embodiment of the present application provides a server, including:
  • one or more processors;
  • a storage device, configured to store one or more programs;
  • wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for adjusting a deep learning framework provided in any embodiment of the present application.
  • the embodiment of the present application provides a computer-readable storage medium storing a computer program, and when the program is executed by a processor, the method for adjusting the deep learning framework as provided in any embodiment of the present application is realized.
  • FIG. 1 is a schematic flowchart of a method for adjusting a deep learning framework provided in Embodiment 1 of this application;
  • FIG. 2 is a schematic flowchart of another method for adjusting a deep learning framework provided in Embodiment 2 of the application;
  • FIG. 3 is a schematic flowchart of another method for adjusting a deep learning framework provided in Embodiment 2 of the application;
  • FIG. 4 is a schematic structural diagram of a deep learning framework adjustment device provided in Embodiment 3 of this application.
  • FIG. 5 is a schematic structural diagram of a server provided in Embodiment 4 of the present application.
  • Some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes multiple steps as sequential processing, many of the steps herein can be implemented in parallel, concurrently, or simultaneously. In addition, the order of the steps can be rearranged. The processing may be terminated when its operations are completed, but there may also be additional steps not included in the drawings. The processing can correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
  • For example, without departing from the scope of the present application, the granularity of the first operator may be referred to as the granularity of the second operator, and similarly, the granularity of the second operator may be referred to as the granularity of the first operator.
  • the first operator granularity and the second operator granularity are both operator granularity, but the first operator granularity and the second operator granularity are not the same operator granularity.
  • the terms “first”, “second”, etc. cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, the features defined with “first” and “second” may explicitly or implicitly include one or more of these features. In the description of this application, “plurality” means at least two, such as two or three, unless otherwise defined.
  • Figure 1 is a schematic flow diagram of a method for adjusting a deep learning framework provided by Embodiment 1 of the application, which can be applied to a scenario where a deep learning framework developed based on a data flow architecture is optimized.
  • The method can be executed by a deep learning framework adjustment device, which can be implemented in software and/or hardware and can be integrated on a server.
  • the method for adjusting the deep learning framework provided by the first embodiment includes:
  • the data flow calculation graph is a directed graph used to represent data-driven calculations.
  • each node represents an operator.
  • the first operator refers to the operator that calculates the initial constant expression in the initial data flow calculation graph.
  • the initial data flow calculation graph refers to the data flow calculation graph that has not been optimized.
  • Constant expression means that there are only constant values in the expression, and initial constant expression means the constant expression that needs to be calculated in the initial data flow calculation graph.
  • the initial constant expression can be a+b or a*b, and the first operator is used to calculate a+b or a*b, which is not limited herein.
  • both a and b are constants, for example, a is 1, and b is 2. This text does not limit the value of the constant.
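  • As a concrete illustration of the data flow calculation graph described above, a minimal Python sketch is given below; the class names, fields, and constant values used in it are assumptions of this example rather than part of the application.

```python
# A minimal, illustrative encoding of a data flow calculation graph in which
# every node is an operator; the class names, fields, and constant values are
# assumptions made for this sketch and are not taken from the application.
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str           # e.g. "A1"
    op: str             # "add" or "mul"
    inputs: list        # constant values or names of upstream operators
    granularity: int = 1

@dataclass
class DataflowGraph:
    operators: list = field(default_factory=list)

# Initial data flow calculation graph: A1 is a first operator computing the
# constant expression a*b (a=1, b=2), and A2 computes n+c (c=3) on its result.
initial = DataflowGraph([
    Operator("A1", "mul", [1, 2]),
    Operator("A2", "add", ["A1", 3]),
])
print([op.name for op in initial.operators])   # ['A1', 'A2']
```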
  • the parameter refers to the constant value in the initial constant expression.
  • for example, if the initial constant expression is a*b, the parameters a and b are constant values.
  • the target data flow calculation graph is obtained by optimizing the parameters of the initial constant expression, and the target data flow calculation graph is used to control the deep learning framework chip for data calculation.
  • in the initial data flow calculation graph, a calculation with another constant value can only be performed after two constant values have been calculated and one result has been output.
  • for example, there are three constants a, b, and c, and the final result to be calculated is a*b+c; in the initial data flow calculation graph, a*b=n must be calculated first and then n+c in order to output the result.
  • optimizing the parameters refers to performing the required parameter calculations in a single pass, for example directly calculating a*b+c, so as to output the result directly.
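  • The parameter optimization described above behaves like classic constant folding; the sketch below shows how the chain a*b=n, n+c can be evaluated in a single pass, with the function name and the expression encoding being assumptions of this example.

```python
# A sketch of the parameter optimisation (constant folding) described above:
# the chained constant expressions a*b = n and n + c are evaluated in a single
# pass so that only the final value remains. The function name and the tuple
# encoding of expressions are assumptions of this example.
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def fold_constants(exprs):
    """exprs: list of (name, op, inputs); inputs are constants or earlier names."""
    values = {}
    for name, op, inputs in exprs:
        args = [values.get(i, i) for i in inputs]   # resolve earlier results
        values[name] = OPS[op](*args)
    return values

# Initial constant expressions with a=1, b=2, c=3: first a*b = n, then n + c.
folded = fold_constants([("n", "mul", [1, 2]), ("out", "add", ["n", 3])])
print(folded["out"])   # 5 -- the same value a*b+c would give in one step
```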
  • the target data flow calculation graph includes a second operator for calculating an expression after optimizing the parameters in the initial constant expression.
  • the granularity of an operator affects the calculation amount of the deep learning framework. Since the calculation is more complicated after parameter optimization, the granularity of the second operator is greater than the granularity of the first operator, so as to adjust the calculation amount of the deep learning framework chip.
  • the technical solution of the embodiments of the present application obtains an initial data flow calculation graph that includes a first operator for calculating an initial constant expression, and obtains a target data flow calculation graph according to the parameters in the initial constant expression. This optimizes the initial data flow calculation graph into the target data flow calculation graph, allows the parameter calculation in the data flow calculation graph inside the neural network chip to be completed in one step, and reduces the time the neural network chip needs to compute the deep learning framework. At the same time, the granularity of the second operator in the target data flow calculation graph is greater than the granularity of the first operator in the initial data flow calculation graph, so the calculation amount of the second operator in the target data flow calculation graph is also greater. The problem of low computational efficiency of a deep learning framework based on the data flow architecture is thereby solved, and the technical effect of improving the computational efficiency of the deep learning framework is achieved.
  • FIG. 2 is a schematic flowchart of another method for adjusting a deep learning framework provided in Embodiment 2 of the present application. This embodiment is described on the basis of the above technical solution and is suitable for a scenario in which the target data flow calculation graph is further optimized.
  • the method can be executed by an adjustment device of a deep learning framework, which can be implemented in software and/or hardware, and can be integrated on a server.
  • the method for adjusting the deep learning framework provided in the second embodiment of the present application includes:
  • the data flow calculation graph is a directed graph used to represent data-driven calculations.
  • each node represents an operator.
  • the first operator refers to the operator that calculates the initial constant expression in the initial data flow calculation graph.
  • the initial data flow calculation graph refers to the data flow calculation graph that has not been optimized.
  • Constant expression means that there are only constant values in the expression, and initial constant expression means the constant expression that needs to be calculated in the initial data flow calculation graph.
  • the initial constant expression can be a+b or a*b, and the first operator is used to calculate a+b or a*b, which is not limited here.
  • both a and b are constants, for example, a is 1, and b is 2. This text does not limit the value of the constant.
  • the parameter refers to the constant value in the initial constant expression.
  • for example, if the initial constant expression is a*b, the parameters a and b are constant values.
  • the target data flow calculation graph is obtained by optimizing the parameters of the initial constant expression, and the target data flow calculation graph is used to control the deep learning framework chip for data calculation.
  • the second operator in the target data flow calculation graph is used to calculate a target expression, and the target expression is obtained by optimizing the parameters of the initial constant expressions.
  • in the initial data flow calculation graph, a calculation with another constant value can only be performed after two constant values have been calculated and one result has been output.
  • for example, there are three constants a, b, and c, and the final result to be calculated is a*b+c; in the initial data flow calculation graph, a*b=n must be calculated first and then n+c in order to output the result. The initial constant expressions are therefore a*b and n+c.
  • an initial constant expression can only operate on two parameters at a time, whereas the target expression is obtained by optimizing the parameters of the initial constant expressions.
  • for example, the target expression is a*b+c, in which the parameters of the initial constant expressions are merged, and a, b, and c are all constant values.
  • a target expression can calculate multiple constants at once. For example, to output the result of a*b+c+d, the initial constant expressions are a*b=n1, n1+c=n2, and n2+d=n3, and n3 is output last, whereas the target expression a*b+c+d=n3 outputs n3 after a single calculation, which greatly improves calculation efficiency.
  • because the first operator calculates an initial constant expression and the second operator calculates the target expression obtained by optimizing it, the second operator is obtained by fusing at least two first operators.
  • for example, the first operators are A1 and A2, where A1 calculates a*b=n and A2 calculates n+c to output the result; the A1 and A2 operators can be fused to obtain a B1 operator that calculates a*b+c.
  • the granularity of the second operator obtained by fusing first operators is greater than the granularity of a first operator; for example, if the granularity of A1 is 1 and the granularity of A2 is 1, then the granularity of B1 is 2.
  • in an embodiment, the second operator is a combined addition-and-multiplication operator.
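  • A minimal sketch of this fusion of first operators is given below; the dictionary-based operator representation and the summing of granularities are assumptions of the example.

```python
# A minimal sketch of fusing two first operators into one second operator.
# The dictionary representation and the rule that granularities add up are
# assumptions of this example; the application only states that the fused
# operator's granularity is larger than that of a first operator.
def fuse_first_operators(op_a, op_b, fused_name, fused_expr):
    return {
        "name": fused_name,
        "expr": fused_expr,
        "granularity": op_a["granularity"] + op_b["granularity"],
    }

A1 = {"name": "A1", "expr": "a*b", "granularity": 1}   # computes a*b = n
A2 = {"name": "A2", "expr": "n+c", "granularity": 1}   # computes n + c
B1 = fuse_first_operators(A1, A2, "B1", "a*b+c")       # B1 computes a*b+c
print(B1["granularity"])                               # 2
```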
  • in this embodiment, there are multiple target expressions, and each second operator can only calculate one target expression; obtaining at least two second operators that calculate the same target expression means identifying the second operators whose target expressions are identical.
  • in an embodiment, there are three second operators B1, B2, and B3; the target expression calculated by B1 is Y1=a*X+b, the target expression calculated by B2 is Y2=a*X+c, and the target expression calculated by B3 is Y3=a*X+b, where Y1, Y2, and Y3 are output results, a, b, and c are constants, and X is a constant or a variable: if no value has been input for X it is a variable, and if a value has been input it is a constant. Because the target expressions calculated by the B1 operator and the B3 operator are identical, the B1 operator and the B3 operator are obtained. There can also be more operators that calculate the same target expression.
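  • The identification step can be pictured as grouping second operators by their target expression, as in the sketch below; the operator names follow the B1/B2/B3 example above, while the encoding is an assumption of this illustration.

```python
# A sketch of identifying second operators that calculate the same target
# expression, following the B1/B2/B3 example above; the dictionary encoding
# of operators is an assumption of this illustration.
from collections import defaultdict

second_operators = {
    "B1": "a*X + b",   # Y1 = a*X + b
    "B2": "a*X + c",   # Y2 = a*X + c
    "B3": "a*X + b",   # Y3 = a*X + b
}

groups = defaultdict(list)
for name, expr in second_operators.items():
    groups[expr].append(name)

fusable = {expr: ops for expr, ops in groups.items() if len(ops) >= 2}
print(fusable)   # {'a*X + b': ['B1', 'B3']} -- candidates for fusion
```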
  • S240 Perform fusion on at least two of the second operators to obtain a third operator.
  • at least two operators that calculate the same target expression can be fused. For example, the B1 operator and the B3 operator calculate the same target expression Y=a*X+b, so the B1 and B3 operators can be fused to obtain a third operator C1 that calculates the target expression Y=a*X+b.
  • the granularity of the third operator is greater than the granularity of the second operator, and the granularity of the third operator is determined according to the granularities of the fused second operators.
  • for example, the granularity of each of the B1 and B3 operators is 2, so the granularity of the fused C1 operator is 4, which increases the calculation amount of the operator.
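  • A possible sketch of the fusion into a third operator is shown below; the function used and the summing of granularities follow the example above and are assumptions of this illustration.

```python
# A sketch of fusing second operators that share a target expression into a
# third operator; summing the granularities (2 + 2 = 4, as in the example
# above) is an assumption of this illustration.
def fuse_same_expression(ops, fused_name):
    expr = ops[0]["expr"]
    assert all(o["expr"] == expr for o in ops)   # must compute the same target
    return {"name": fused_name, "expr": expr,
            "granularity": sum(o["granularity"] for o in ops)}

B1 = {"name": "B1", "expr": "a*X + b", "granularity": 2}
B3 = {"name": "B3", "expr": "a*X + b", "granularity": 2}
C1 = fuse_same_expression([B1, B3], "C1")
print(C1["granularity"])   # 4
```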
  • S250 Obtain a final data flow calculation graph based on the unfused second operator and the third operator in the target data flow calculation graph.
  • second operators that do not share the same target expression with any other operator cannot be fused, and their granularity is retained.
  • the final data flow calculation graph is obtained by optimizing the target data flow calculation graph, and the deep learning architecture is calculated by the second operator and/or the third operator in the final data flow calculation graph.
  • in this embodiment, the third operator is obtained by fusing second operators that have the same target expression, which increases the granularity of the operators in the data flow calculation graph and improves the computing capability and computational efficiency of the neural network architecture.
  • step S250 includes:
  • correlation means that the input of the current operator needs to be determined according to the output result of the previous operator, and the output result of the current operator is used as the input of the next operator.
  • for example, the target expression calculated by the second operator may be Y1=a*X1+c, and the target expression calculated by the third operator may be Y2=Y1*X2+d, where a, c, and d are constants and X1 and X2 are variables whose values can only be determined once data is input. Because the second operator's target expression contains variables, it cannot be merged; and because the third operator needs the calculation result of the second operator as its input, the second operator and the third operator can be combined into one data path. The connections between operators are determined according to the correlation of the target expressions.
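  • The dependency-driven grouping into a data path can be sketched as follows; the operator names and the inputs encoding are hypothetical, and the expressions mirror the example above.

```python
# A sketch of grouping related operators into one data path: an operator is
# related to another when its output is consumed as the other's input. The
# operator names and the "inputs" encoding are hypothetical; the expressions
# mirror the Y1 = a*X1 + c and Y2 = Y1*X2 + d example above.
operators = {
    "B_y1": {"expr": "a*X1 + c", "inputs": ["X1"]},           # second operator
    "C_y2": {"expr": "Y1*X2 + d", "inputs": ["B_y1", "X2"]},  # third operator
}

def build_data_path(ops, start):
    path, current = [start], start
    while True:                                   # follow output -> input links
        nxt = next((n for n, o in ops.items() if current in o["inputs"]), None)
        if nxt is None:
            return path
        path.append(nxt)
        current = nxt

print(build_data_path(operators, "B_y1"))   # ['B_y1', 'C_y2'] -- one data path
```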
  • a data path includes a head operator, a successor operator, and an output operator.
  • the head operator is used to perform the initialization of all parameters, and a successor operator is used to obtain the output of its preceding operator.
  • the output operator is used to output data.
  • the head operator refers to the first operator to perform calculations, and the output operator refers to the operator that outputs the final result.
  • a successor operator refers to an operator that takes the calculation result of the previous operator as its input, and a predecessor operator refers to an operator whose output result is passed to the next operator.
  • for example, if there are four operators A, B, C, and D and the calculation order is A, B, C, D, then A is the head operator, D is the output operator, A, B, and C are predecessor operators, and B, C, and D are successor operators.
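  • The operator roles within a data path can be illustrated with the A, B, C, D example above; the dictionary layout below is an assumption of this sketch.

```python
# An illustrative labelling of operator roles within a data path for the
# calculation order A -> B -> C -> D described above; the dictionary layout
# is an assumption of this sketch.
path = ["A", "B", "C", "D"]

roles = {
    "head": path[0],            # performs the initialization of all parameters
    "output": path[-1],         # outputs the final data
    "predecessors": path[:-1],  # each passes its result to the next operator
    "successors": path[1:],     # each consumes its predecessor's result
}
print(roles)
```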
  • related operators are connected to form a data path, and irrelevant operators are not in this data path, so there is at least one data path. All the data paths are combined into the final data flow calculation graph, so as to perform the calculation of the deep learning framework.
  • the sorting between operators follows the design of the underlying cache, which greatly reduces the time for the previous operator to input the calculation result to the next operator, and improves the efficiency of calculation.
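  • Assembling the final data flow calculation graph from the individual data paths might look like the following sketch; keeping each path contiguous in the execution order is an assumption intended only to hint at the cache-aware ordering mentioned above.

```python
# A sketch of assembling the final data flow calculation graph from all data
# paths; unrelated operators stay in their own path. Keeping each path
# contiguous in the execution order is only meant to hint at the cache-aware
# ordering mentioned above and is an assumption of this example.
data_paths = [
    ["B_y1", "C_y2"],   # related operators combined into one data path
    ["B2"],             # an unrelated operator forms its own data path
]

final_graph = {
    "paths": data_paths,
    "execution_order": [op for path in data_paths for op in path],
}
print(final_graph["execution_order"])   # ['B_y1', 'C_y2', 'B2']
```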
  • the technical solution of the embodiments of the present application obtains an initial data flow calculation graph that includes a first operator for calculating an initial constant expression, and obtains a target data flow calculation graph according to the parameters in the initial constant expression.
  • the optimization of the initial data flow calculation graph into the target data flow calculation graph is thus realized, the parameter calculation in the data flow calculation graph can be completed in one step, and the calculation time of the deep learning framework is reduced.
  • the granularity of the second operator in the target data flow calculation graph is greater than the granularity of the first operator in the initial data flow calculation graph, so the calculation amount of the second operator in the target data flow calculation graph is also greater.
  • the technical effect of improving the computational efficiency of the deep learning framework is achieved.
  • FIG. 4 is a schematic structural diagram of a deep learning framework adjustment device provided in Embodiment 3 of the application. This embodiment can be applied to a scenario in which a deep learning framework developed based on a data flow architecture is optimized.
  • the device can be implemented in software and/or hardware, and can be integrated on a server.
  • the device for adjusting the deep learning framework includes: an obtaining module 410 and an adjustment module 420.
  • the obtaining module 410 is configured to obtain an initial data flow calculation graph, and the initial data flow calculation graph includes a first operator for calculating an initial constant expression.
  • the adjustment module 420 is configured to obtain a target data flow calculation graph according to the parameters in the initial constant expression, the target data flow calculation graph includes a second operator, and the target data flow calculation graph is used to control the deep learning framework chip Data calculation is performed, and the granularity of the second operator is greater than the granularity of the first operator to adjust the calculation amount of the deep learning framework chip.
  • the second operator is obtained by fusing the at least two first operators.
  • the second operator is used to calculate a target expression, and the target expression is obtained based on the parameters of the initial constant expression.
  • the number of the target expressions and the number of the second operators are both multiple, and the obtaining module 410 is further configured to obtain at least two of the second operators that calculate the same target expression.
  • the device further includes a fusion module, and the fusion module is configured to fuse at least two of the second operators to obtain a third operator, and to obtain a final data flow calculation graph based on the unfused second operators in the target data flow calculation graph and the third operator.
  • the fusion module is configured to obtain the final data flow calculation graph based on the unfused second operators in the target data flow calculation graph and the third operator in the following manner: combining the second operators and the third operators that calculate a plurality of related target expressions into one data path, and obtaining the final data flow calculation graph based on all the data paths, where the plurality of related target expressions means that the output of an operator used to calculate one of the related target expressions is the input of an operator used to calculate another of the related target expressions.
  • the data path includes a head operator, a successor operator, and an output operator.
  • the head operator is used to initialize all parameters, and the successor operator is used to obtain the output of the previous operator.
  • the output operator is used to output data.
  • the granularity of the third operator is greater than the granularity of the second operator.
  • the device for adjusting the deep learning framework provided by the embodiment of the present application can execute the method for adjusting the deep learning framework provided by any embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method.
  • For content not described in detail in this embodiment, reference may be made to the description in any method embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a server provided in Embodiment 4 of the present application.
  • Figure 5 shows a block diagram of an exemplary server 612 suitable for implementing embodiments of the present application.
  • the server 612 shown in FIG. 5 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the server 612 is represented in the form of a general server.
  • the components of the server 612 may include, but are not limited to: one or more processors 616, a storage device 628, and a bus 618 connecting different system components (including the storage device 628 and the processor 616).
  • the bus 618 represents one or more of several types of bus structures, including a storage device bus or a storage device controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any bus structure among multiple bus structures.
  • these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • the server 612 includes a variety of computer system readable media. These media may be any available media that can be accessed by the server 612, including volatile and non-volatile media, removable and non-removable media.
  • the storage device 628 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (RAM) 630 and/or a cache memory 632.
  • the server 612 may include other removable/non-removable, volatile/non-volatile computer system storage media.
  • the storage system 634 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 5, usually called a "hard drive”).
  • although not shown in FIG. 5, a disk drive configured to read from and write to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive configured to read from and write to a removable non-volatile optical disc, such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM), or other optical media, can be provided.
  • each drive can be connected to the bus 618 through one or more data media interfaces.
  • the storage device 628 may include at least one program product, and the program product has a set of (for example, at least one) program modules, and these program modules are configured to perform the functions of the embodiments of the present application.
  • a program/utility tool 640 having a set of (at least one) program module 642 may be stored in, for example, the storage device 628.
  • such program modules 642 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, and each or some combination of these examples may include an implementation of a network environment.
  • the program module 642 usually executes the functions and/or methods in the embodiments described in this application.
  • the server 612 can also communicate with one or more external devices 614 (such as a keyboard, a pointing terminal, a display 624, etc.), with one or more terminals that enable users to interact with the server 612, and/or with any terminal (such as a network card, a modem, etc.) that enables the server 612 to communicate with one or more other computing terminals. This communication can be performed through an input/output (I/O) interface 622.
  • the server 612 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 620.
  • as shown in FIG. 5, the network adapter 620 communicates with the other modules of the server 612 through the bus 618.
  • although not shown in the figure, other hardware and/or software modules can be used in conjunction with the server 612, including but not limited to: microcode, terminal drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
  • the processor 616 executes a variety of functional applications and data processing by running programs stored in the storage device 628, for example, to implement a deep learning framework adjustment method provided by any embodiment of the present application.
  • the method may include: obtaining an initial data flow calculation graph, where the initial data flow calculation graph includes a first operator for calculating an initial constant expression; and obtaining a target data flow calculation graph according to the parameters in the initial constant expression, where the target data flow calculation graph includes a second operator, the target data flow calculation graph is used to control the deep learning framework chip to perform data calculations, and the granularity of the second operator is greater than the granularity of the first operator so as to adjust the calculation amount of the deep learning framework chip.
  • the technical solution of the embodiments of the present application obtains an initial data flow calculation graph that includes a first operator for calculating an initial constant expression, and obtains a target data flow calculation graph according to the parameters in the initial constant expression.
  • the optimization of the initial data flow calculation graph into the target data flow calculation graph is thus realized, the parameter calculation in the data flow calculation graph can be completed in one step, and the calculation time of the deep learning framework is reduced.
  • the granularity of the second operator in the target data flow calculation graph is greater than the granularity of the first operator in the initial data flow calculation graph, so the calculation amount of the second operator in the target data flow calculation graph is also greater.
  • the technical effect of improving the computational efficiency of the deep learning framework is achieved.
  • the fifth embodiment of the present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method for adjusting a deep learning framework provided in any embodiment of the present application. The method may include: obtaining an initial data flow calculation graph, where the initial data flow calculation graph includes a first operator for calculating an initial constant expression; and obtaining a target data flow calculation graph according to the parameters in the initial constant expression, where the target data flow calculation graph includes a second operator, the target data flow calculation graph is used to control the deep learning framework chip to perform data calculations, and the granularity of the second operator is greater than the granularity of the first operator so as to adjust the calculation amount of the deep learning framework chip.
  • the computer storage media in the embodiments of the present application may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above.
  • Examples of computer-readable storage media (a non-exhaustive list) include: electrical connections with one or more wires, portable computer disks, hard disks, RAM, read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, CD-ROM, optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and the computer-readable signal medium carries computer-readable program code. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • the computer program code used to perform the operations of this application can be written in one or more programming languages or a combination thereof.
  • the programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer, or entirely executed on the remote computer or terminal.
  • the remote computer may be connected to the user computer through any kind of network including LAN or WAN, or may be connected to an external computer (for example, using an Internet service provider to connect through the Internet).
  • the technical solution of the embodiments of the present application obtains an initial data flow calculation graph that includes a first operator for calculating an initial constant expression, and obtains a target data flow calculation graph according to the parameters in the initial constant expression. This optimizes the initial data flow calculation graph into the target data flow calculation graph, allows the parameter calculation in the data flow calculation graph to be completed in one step, and reduces the calculation time of the deep learning framework. At the same time, the granularity of the second operator in the target data flow calculation graph is greater than the granularity of the first operator in the initial data flow calculation graph, so the calculation amount of the second operator in the target data flow calculation graph is also greater. The technical effect of improving the computational efficiency of the deep learning framework is achieved.

Abstract

The embodiments of the present application disclose a method, device, server, and storage medium for adjusting a deep learning framework. The method includes: obtaining an initial data flow calculation graph, the initial data flow calculation graph including a first operator that calculates an initial constant expression; and obtaining a target data flow calculation graph according to parameters in the initial constant expression, the target data flow calculation graph including a second operator, the target data flow calculation graph being used to control a deep learning framework chip to perform data calculation, where the granularity of the second operator is greater than the granularity of the first operator so as to adjust the calculation amount of the deep learning framework chip.

Description

深度学习框架的调整方法、装置、服务器及存储介质 技术领域
本申请实施例涉及深度学习技术领域,例如涉及一种深度学习框架的调整方法、装置、服务器及存储介质。
背景技术
随着数据流架构的发展,数据格式的优化对提升数据流架构的效率越来越重要。
深度学习框架的数据格式都是针对指令集架构设计,指令集架构的特点是数据格式能够对应的拆分成单个指令形式,计算单元颗粒度小,计算单元间可以任意组合。然而,运行在数据流架构上的数据格式,相比指令集架构而言,计算单元颗粒度大,支持的计算单元组合也有限制,对应的形式是数据通路,而不是指令单元,一条数据通路往往由多个复杂计算单元组成。为了解决数据流架构的数据格式优化问题,研究人员研究了一种通用的基于数据流的数据格式设计。
然而,该通用的基于数据流的数据格式设计受限于数据流架构的设计,计算效率低下。
发明内容
本申请实施例提供一种深度学习框架的调整方法、装置、服务器及存储介质,以实现提高数据流架构的深度学习框架计算效率的效果。
本申请实施例提供一种深度学习框架的调整方法,包括:
获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子;
根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整深度学习框架芯片的计算量。
本申请实施例提供一种深度学习框架的调整装置,包括:
获取模块,设置为获取初始数据流计算图,所述初始数据流计算图包括计 算初始常量表达式的第一算子;
优化模块,设置为根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整深度学习框架芯片的计算量。
本申请实施例提供一种服务器,包括:
一个或多个处理器;
存储装置,设置为存储一个或多个程序;
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如本申请任意实施例所提供的深度学习框架的调整方法。
本申请实施例提供一种计算机可读存储介质,存储有计算机程序,该程序被处理器执行时实现如本申请任意实施例所提供的深度学习框架的调整方法。
附图说明
图1为本申请实施例一提供的一种深度学习框架的调整方法的流程示意图;
图2为本申请实施例二提供的另一种深度学习框架的调整方法的流程示意图;
图3为本申请实施例二提供的另一种深度学习框架的调整方法的流程示意图;
图4为本申请实施例三提供的一种深度学习框架的调整装置的结构示意图;
图5是本申请实施例四提供的一种服务器的结构示意图。
具体实施方式
下面结合附图和实施例对本申请进行说明。本文所描述的具体实施例仅仅用于解释本申请,而非对本申请的限定。为了便于描述,附图中仅示出了与本申请相关的部分而非全部结构。
一些示例性实施例被描述成作为流程图描绘的处理或方法。虽然流程图将多个步骤描述成顺序的处理,但是本文中的许多步骤可以被并行地、并发地或者同时实施。此外,多个步骤的顺序可以被重新安排。当多个步骤操作完成时处理可以被终止,但是还可以具有未包括在附图中的附加步骤。处理可以对应于方法、函数、规程、子例程、子程序等等。
术语“第一”、“第二”等可在本文中用于描述多种方向、动作、步骤或 元件等,但这些方向、动作、步骤或元件不受这些术语限制。这些术语仅用于将第一个方向、动作、步骤或元件与另一个方向、动作、步骤或元件区分。举例来说,在不脱离本申请的范围的情况下,可以将第一算子颗粒度称为第二算子颗粒度,且类似地,可将第二算子颗粒度称为第一算子颗粒度。第一算子颗粒度和第二算子颗粒度两者都是算子颗粒度,但第一算子颗粒度和第二算子颗粒度不是同一算子颗粒度。术语“第一”、“第二”等而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有限定。
实施例一
图1为本申请实施例一提供的一种深度学习框架的调整方法的流程示意图,可适用于对基于数据流架构开发的深度学习框架进行优化的场景,该方法可以由深度学习框架的调整装置来执行,该装置可以采用软件和/或硬件的方式实现,并可集成在服务器上。
如图1所示,本实施例一提供的深度学习框架的调整方法包括:
S110、获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子。
本实施例中,数据流计算图是一种有向图,用来表示数据驱动计算。在数据流计算图中,每个节点表示一个算子。第一算子是指在初始数据流计算图中计算初始常量表达式的算子。初始数据流计算图是指未被优化的数据流计算图。常量表达式是指表达式里面只有常量值,初始常量表达式是指初始数据流计算图中需要计算的常量表达式。一实施例中,初始常量表达式可以是a+b,也可以是a*b,而第一算子用来计算a+b或a*b,本文不作限制。本实施例中,a、b都是常量,例如a是1、b是2,本文对于常量的数值不作限制。
S120、根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整深度学习框架芯片的计算量。
本实施例中,参数是指初始常量表达式中的常量值。例如,初始常量表达式为a*b,则参数值a和b为常量。目标数据流计算图是对初始常量表达式的参数优化得到的,目标数据流计算图用于控制深度学习框架芯片进行数据计算。
一实施例中,在初始数据流计算图中,只能在计算两个常量值输出一个结果后,再和另一个常量值进行计算。示例性的,有a、b、c三个常量,要计算的 最终结果是a*b+c,而在初始数据流计算图中,则要先计算a*b=n,再计算n+c从而输出结果。对参数进行优化是指将需要计算的参数一次计算,例如直接计算a*b+c,从而直接输出结果。本实施例中,目标数据流计算图中包括第二算子,用于计算对初始常量表达式中的参数优化后的表达式。一实施例中,算子的颗粒度影响深度学习框架的计算量,由于对参数优化后计算更复杂,因此第二算子的颗粒度大于第一算子的颗粒度,以调整深度学习框架芯片的计算量。
本申请实施例的技术方案,通过获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子;根据所述初始常量表达式中的参数得到目标数据流计算图。实现了将初始数据流计算图优化成目标数据流计算图,对于神经网络芯片内的数据流计算图中的参数计算可以一步到位,提高了神经网络芯片对于深度学习框架的计算时间。同时目标数据流计算图中的第二算子的颗粒度大于初始数据流计算图中第一算子的颗粒度,因此在目标数据流计算图中的第二算子的计算量也更大,解决了基于数据流架构的深度学习框架计算效率低下的问题,达到了提高深度学习框架计算效率的技术效果。
实施例二
图2是本申请实施例二提供的另一种深度学习框架的调整方法的流程示意图。本实施例是在上述技术方案的基础上进行说明,适用于对目标数据流计算图进行优化的场景。该方法可以由深度学习框架的调整装置来执行,该装置可以采用软件和/或硬件的方式实现,并可集成在服务器上。
如图2所示,本申请实施例二提供的深度学习框架的调整方法包括:
S210、获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子。
本实施例中,数据流计算图是一种有向图,用来表示数据驱动计算。在数据流计算图中,每个节点表示一个算子。第一算子是指在初始数据流计算图中计算初始常量表达式的算子。初始数据流计算图是指未被优化的数据流计算图。常量表达式是指表达式里面只有常量值,初始常量表达式是指初始数据流计算图中需要计算的常量表达式。一实施例中,初始常量表达式可以是a+b,也可以是a*b,而第一算子用来计算a+b或a*b,此处不作限制。本实施例中,a、b都是常量,例如a是1、b是2,本文对于常量的数值不作限制。
S220、根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整 深度学习框架芯片的计算量。
本实施例中,参数是指初始常量表达式中的常量值。例如初始常量表达式为a*b,则参数值a和b为常量值。目标数据流计算图是对初始常量表达式的参数优化得到的,目标数据流计算图用于控制深度学习框架芯片进行数据计算。
目标数据流计算图中的第二算子用于计算目标表达式,所述目标表达式基于所述初始常量表达式的参数优化得到。一实施例中,在初始数据流计算图中,只能在计算两个常量值输出一个结果后,再和另一个常量值进行计算。示例性的,有a、b、c三个常量,要计算的最终结果是a*b+c,而在初始数据流计算图中,则要先计算a*b=n,再计算n+c从而输出结果。初始常量表达式可以是a*b、n+c。初始常量表达式一次只能计算两个参数。而目标表达式是对初始常量表达式的参数优化得到。示例性的,目标表达式是a*b+c,对初始常量表达式的参数进行合并,a、b、c都是常量值。
目标表达式可以一次计算多个常量。例如,需要输出a*b+c+d的计算结果,初始常量表达式则是a*b=n1,n1+c=n2,n2+d=n3,最后输出n3的结果。而目标表达式则为a*b+c+d=n3,经过一次计算直接输出n3的结果,计算效率大大提升。
由于第一算子用于计算初始常量表达式,而第二算子是计算经初始常量表达式优化得到的目标表达式,因此第二算子是通过至少两个第一算子融合得到的。示例性的,第一算子有A1和A2,A1计算a*b=n,A2算子计算n+c从而输出结果,则可以融合A1算子和A2算子,得到B1算子计算a*b+c。本实施例中,经第一算子融合得到的第二算子的颗粒度大于第一算子的颗粒度。示例性的,A1的颗粒度为1,A2的颗粒度为1,则B1的颗粒度为2。一实施例中,第二算子为加法乘法组合算子。
S230、获取计算相同的所述目标表达式的至少两个所述第二算子。
本实施例中,目标表达式有多个,每一个第二算子只能计算一个目标表达式,获取计算相同的目标表达式的至少两个第二算子是指对计算相同的目标表达式的第二算子识别。一实施例中,有B1、B2和B3三个第二算子,B1算子计算的目标表达式为Y1=a*X+b,B2算子计算的目标表达式为Y2=a*X+c,B3算子计算的目标表达式为Y3=a*X+b,Y1、Y2和Y3为输出的计算结果,a、b、c都是常量,X是常量或者变量,如果X的值未输入数值则为变量,如果X的值已输入数值则为常量。由于B1算子和B3算子计算的目标表达式一致,因此获取B1算子和B3算子。还可以有更多的算子计算相同的目标表达式。
S240、对至少两个所述第二算子进行融合得到第三算子。
本实施例中,对于计算相同的目标表达式的至少两个算子,可以进行融合。示例性的,B1算子和B3算子计算相同的目标表达式Y=a*X+b,因此可以对B1算子和B3算子融合得到第三算子C1,从而对目标表达式Y=a*X+b进行计算。本实施例中,第三算子的颗粒度大于第二算子的颗粒度,第三算子的颗粒度根据融合的第二算子的颗粒度确定。示例性的,B1算子和B2算子的颗粒度为2,则融合后的C1算子的颗粒度为4,提高了算子的计算量。
S250、基于所述目标数据流计算图中未融合的第二算子和所述第三算子得到最终数据流计算图。
本实施例中,对于没有相同的目标表达式的第二算子则无法融合,保留第二算子的颗粒度。最终数据流计算图是对目标数据流计算图优化得到的,最终数据流计算图中通过第二算子和/或第三算子对深度学习架构进行计算。
在本实施例中,通过对具有相同的目标表达式的第二算子融合得到第三算子,增大了数据流计算图中算子的颗粒度,提高了神经网络架构的计算能力和计算效率。
参考图3,在一实施例中,步骤S250包括:
S2510、将计算相关的多个目标表达式的第二算子和第三算子组合成一个数据通路。
本实施例中,相关是指当前算子的输入需要根据上一个算子的输出结果确定,当前算子的输出结果作为下一个算子的输入。示例性的,第二算子计算的目标表达式可以为Y1=a*X1+c,第三算子计算的目标表达式可以为Y2=Y1*X2+d,a、c、d为常量,而X1和X2为变量,X1和X2的数值需要等待数据输入才能确定。由于第二算子计算的目标表达式中存在变量,因此不能合并。而且第三算子需要第二算子的计算结果作为数据,因此可以将第二算子和第三算子组合成一个数据通路。算子之间的连接根据目标表达式的相关性确定。
一实施例中,在一个数据通路中,包括头部算子、后继算子和输出算子,所述头部算子用于承担所有参数初始化,所述后继算子用于获取前继算子的输出,所述输出算子用于输出数据。头部算子是指进行计算的第一个算子,输出算子是指输出最终结果的算子。后继算子是指根据上一个算子计算结果作为输入的算子,前继算子是指向下一个算子输出结果的算子。示例性的,有A、B、C、D四个算子,计算的顺序为A、B、C、D,则A为头部算子,D为输出算子,A、B、C为前继算子,B、C、D为后继算子。
S2520、基于所有数据通路得到最终数据流计算图。
本实施例中,相关的算子连接形成一个数据通路,不相关的算子不在此数据通路中,因此数据通路至少为一个。所有的数据通路组合成最终数据流计算图,从而进行深度学习框架的计算。一实施例中,算子间的排序遵循底层缓存设计,大大减少了上一个算子将计算结果输入给下一个算子的时间,提高了计算的效率。
本申请实施例的技术方案,通过获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子;根据所述初始常量表达式中的参数得到目标数据流计算图。实现了将初始数据流计算图优化成目标数据流计算图,对于数据流计算图中的参数计算可以一步到位,提高了深度学习框架的计算时间。同时目标数据流计算图中的第二算子的颗粒度大于初始数据流计算图中第一算子的颗粒度,因此在目标数据流计算图中的第二算子的计算量也更大,达到了提高深度学习框架计算效率的技术效果。
实施例三
图4为本申请实施例三提供的一种深度学习框架的调整装置的结构示意图,本实施例可适用于将基于数据流架构开发的深度学习框架进行优化的场景。该装置可以采用软件和/或硬件的方式实现,并可集成在服务器上。
如图4所示,本申请实施例三提供的深度学习框架的调整装置包括:获取模块410和调整模块420。
获取模块410,设置为获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子。
调整模块420,设置为根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整深度学习框架芯片的计算量。
一实施例中,所述第二算子通过所述至少两个所述第一算子融合得到。
一实施例中,所述第二算子用于计算目标表达式,所述目标表达式基于所述初始常量表达式的参数得到。
一实施例中,所述目标表达式的个数和所述第二算子的个数均为多个,获取模块410还设置为获取计算相同的所述目标表达式的至少两个所述第二算子;所述装置还包括融合模块,融合模块设置为对至少两个所述第二算子进行融合得到第三算子;基于所述目标数据流计算图中未融合的第二算子和所述第三算子得到最终数据流计算图。
一实施例中,融合模块是设置为通过如下方式基于所述目标数据流计算图中未融合的第二算子和所述第三算子得到最终数据流计算图:将计算相关的所述多个目标表达式的第二算子和第三算子组合成一个数据通路;基于所有数据通路得到最终数据流计算图,其中,相关的多个目标表达式是指用于计算所述相关的多个目标表达式中一个目标表达式的算子的输出是用于计算所述相关的多个目标表达式中另一个目标表达式的算子的输入。
一实施例中,所述数据通路包括头部算子、后继算子和输出算子,所述头部算子用于承担所有参数初始化,所述后继算子用于获取前继算子的输出,所述输出算子用于输出数据。
一实施例中,所述第三算子颗粒度大于所述第二算子颗粒度。
本申请实施例所提供的深度学习框架的调整装置可执行本申请任意实施例所提供的深度学习框架的调整方法,具备执行方法相应的功能模块和有益效果。本实施例中未详尽描述的内容可以参考本申请任意方法实施例中的描述。
实施例四
图5是本申请实施例四提供的一种服务器的结构示意图。图5示出了适于用来实现本申请实施方式的示例性服务器612的框图。图5显示的服务器612仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图5所示,服务器612以通用服务器的形式表现。服务器612的组件可以包括但不限于:一个或者多个处理器616,存储装置628,连接不同系统组件(包括存储装置628和处理器616)的总线618。
总线618表示几类总线结构中的一种或多种,包括存储装置总线或者存储装置控制器,外围总线,图形加速端口,处理器或者使用多种总线结构中的任意总线结构的局域总线。举例来说,这些体系结构包括但不限于工业标准体系结构(Industry Subversive Alliance,ISA)总线,微通道体系结构(Micro Channel Architecture,MAC)总线,增强型ISA总线、视频电子标准协会(Video Electronics Standards Association,VESA)局域总线以及外围组件互连(Peripheral Component Interconnect,PCI)总线。
一实施例中,服务器612包括多种计算机系统可读介质。这些介质可以是任何能够被服务器612访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。
存储装置628可以包括易失性存储器形式的计算机系统可读介质,例如随机存取存储器(Random Access Memory,RAM)630和/或高速缓存存储器632。 一实施例中,终端612可以包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。仅作为举例,存储系统634可以设置为读写不可移动的、非易失性磁介质(图5未显示,通常称为“硬盘驱动器”)。尽管图5中未示出,可以提供设置为对可移动非易失性磁盘(例如“软盘”)读写的磁盘驱动器,以及对可移动非易失性光盘,例如只读光盘(Compact Disc Read-Only Memory,CD-ROM),数字视盘(Digital Video Disc-Read Only Memory,DVD-ROM)或者其它光介质)读写的光盘驱动器。在这些情况下,每个驱动器可以通过一个或者多个数据介质接口与总线618相连。存储装置628可以包括至少一个程序产品,该程序产品具有一组(例如至少一个)程序模块,这些程序模块被配置以执行本申请实施例的功能。
具有一组(至少一个)程序模块642的程序/实用工具640,可以存储在例如存储装置628中,这样的程序模块642包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或一种组合中可能包括网络环境的实现。程序模块642通常执行本申请所描述的实施例中的功能和/或方法。
服务器612也可以与一个或多个外部设备614(例如键盘、指向终端、显示器624等)通信,还可与一个或者多个使得用户能与该服务器612交互的终端通信,和/或与使得该服务器612能与一个或多个其它计算终端进行通信的任何终端(例如网卡,调制解调器等等)通信。这种通信可以通过输入/输出(Input/Output,I/O)接口622进行。并且,服务器612还可以通过网络适配器620与一个或者多个网络(例如局域网(Local Area Network,LAN),广域网(Wide Area Network,WAN)和/或公共网络,例如因特网)通信。如图5所示,网络适配器620通过总线618与服务器612的其它模块通信。尽管图中未示出,可以结合服务器612使用其它硬件和/或软件模块,包括但不限于:微代码、终端驱动器、冗余处理器、外部磁盘驱动阵列、磁盘阵列(Redundant Arrays of Independent Disks,RAID)系统、磁带驱动器以及数据备份存储系统等。
处理器616通过运行存储在存储装置628中的程序,从而执行多种功能应用以及数据处理,例如实现本申请任意实施例所提供的一种深度学习框架的调整方法,该方法可以包括:获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子;根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整深度学习框架芯片的计算量。
本申请实施例的技术方案,通过获取初始数据流计算图,所述初始数据流 计算图包括计算初始常量表达式的第一算子;根据所述初始常量表达式中的参数得到目标数据流计算图。实现了将初始数据流计算图优化成目标数据流计算图,对于数据流计算图中的参数计算可以一步到位,提高了深度学习框架的计算时间。同时目标数据流计算图中的第二算子的颗粒度大于初始数据流计算图中第一算子的颗粒度,因此在目标数据流计算图中的第二算子的计算量也更大,达到了提高深度学习框架计算效率的技术效果。
实施例五
本申请实施例五还提供了一种计算机可读存储介质,存储有计算机程序,该程序被处理器执行时实现如本申请任意实施例所提供的深度学习框架的调整方法,该方法可以包括:获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子;根据所述初始常量表达式中的参数得到目标数据流计算图,所述目标数据流计算图包括第二算子,所述目标数据流计算图用于控制深度学习框架芯片进行数据计算,所述第二算子的颗粒度大于所述第一算子的颗粒度以调整深度学习框架芯片的计算量。
本申请实施例的计算机存储介质,可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、RAM、只读存储器(Read-Only Memory,ROM)、可擦式可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM或闪存)、光纤、CD-ROM、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,计算机可读的信号介质中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括——但不限于无线、电线、光缆、射频(Radio Frequency,RF)等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言或其组合来编写用于执行本申请操作的计算机程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或终端上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括LAN或WAN—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
本申请实施例的技术方案,通过获取初始数据流计算图,所述初始数据流计算图包括计算初始常量表达式的第一算子;根据所述初始常量表达式中的参数得到目标数据流计算图。实现了将初始数据流计算图优化成目标数据流计算图,对于数据流计算图中的参数计算可以一步到位,提高了深度学习框架的计算时间。同时目标数据流计算图中的第二算子的颗粒度大于初始数据流计算图中第一算子的颗粒度,因此在目标数据流计算图中的第二算子的计算量也更大,达到了提高深度学习框架计算效率的技术效果。

Claims (10)

  1. A method for adjusting a deep learning framework, comprising:
    obtaining an initial data flow calculation graph, the initial data flow calculation graph comprising a first operator that calculates an initial constant expression; and
    obtaining a target data flow calculation graph according to parameters in the initial constant expression, wherein the target data flow calculation graph comprises a second operator, the target data flow calculation graph is used to control a deep learning framework chip to perform data calculation, and a granularity of the second operator is greater than a granularity of the first operator so as to adjust a calculation amount of the deep learning framework chip.
  2. The method of claim 1, wherein the second operator is obtained by fusing at least two first operators.
  3. The method of claim 1 or 2, wherein the second operator is used to calculate a target expression, and the target expression is obtained based on the parameters of the initial constant expression.
  4. The method of claim 3, wherein the number of the target expressions and the number of the second operators are both multiple;
    after the obtaining a target data flow calculation graph according to the parameters in the initial constant expression, the method further comprises:
    obtaining at least two second operators that calculate the same target expression;
    fusing the at least two second operators to obtain a third operator; and
    obtaining a final data flow calculation graph based on the unfused second operators in the target data flow calculation graph and the third operator.
  5. The method of claim 4, wherein the obtaining a final data flow calculation graph based on the unfused second operators in the target data flow calculation graph and the third operator comprises:
    combining the second operators and third operators that calculate a plurality of related target expressions into one data path, wherein the related target expressions means that the output of an operator used to calculate one of the related target expressions is the input of an operator used to calculate another of the related target expressions; and
    obtaining the final data flow calculation graph based on all data paths.
  6. The method of claim 5, wherein the data path comprises a head operator, successor operators, and an output operator, the head operator is used to perform the initialization of all parameters, a successor operator is used to obtain the output of its preceding operator, and the output operator is used to output data.
  7. The method of claim 4, wherein the granularity of the third operator is greater than the granularity of the second operator.
  8. A device for adjusting a deep learning framework, comprising:
    an obtaining module, configured to obtain an initial data flow calculation graph, the initial data flow calculation graph comprising a first operator that calculates an initial constant expression; and
    an optimization module, configured to obtain a target data flow calculation graph according to parameters in the initial constant expression, wherein the target data flow calculation graph comprises a second operator, the target data flow calculation graph is used to control a deep learning framework chip to perform data calculation, and a granularity of the second operator is greater than a granularity of the first operator so as to adjust a calculation amount of the deep learning framework chip.
  9. A server, comprising:
    at least one processor; and
    a storage device, configured to store at least one program,
    wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method for adjusting a deep learning framework of any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein, when the program is executed by a processor, the method for adjusting a deep learning framework of any one of claims 1-7 is implemented.
PCT/CN2019/112463 2019-10-22 2019-10-22 深度学习框架的调整方法、装置、服务器及存储介质 WO2021077281A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2019/112463 WO2021077281A1 (zh) 2019-10-22 2019-10-22 深度学习框架的调整方法、装置、服务器及存储介质
CN201980100791.6A CN114514506A (zh) 2019-10-22 2019-10-22 深度学习框架的调整方法、装置、服务器及存储介质
US17/771,035 US20220366249A1 (en) 2019-10-22 2019-10-22 Method and device for adjusting deep learning network, server, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/112463 WO2021077281A1 (zh) 2019-10-22 2019-10-22 深度学习框架的调整方法、装置、服务器及存储介质

Publications (1)

Publication Number Publication Date
WO2021077281A1 true WO2021077281A1 (zh) 2021-04-29

Family

ID=75619589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/112463 WO2021077281A1 (zh) 2019-10-22 2019-10-22 深度学习框架的调整方法、装置、服务器及存储介质

Country Status (3)

Country Link
US (1) US20220366249A1 (zh)
CN (1) CN114514506A (zh)
WO (1) WO2021077281A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167437A (zh) * 2023-04-18 2023-05-26 之江实验室 一种芯片管理系统、方法、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102656554A (zh) * 2009-09-16 2012-09-05 起元技术有限责任公司 映射数据集元素
CN105426504A (zh) * 2015-11-27 2016-03-23 陕西艾特信息化工程咨询有限责任公司 一种基于内存计算的分布式数据分析处理方法
CN106547522A (zh) * 2015-09-17 2017-03-29 华为技术有限公司 一种流应用优化的方法及装置
CN109325069A (zh) * 2018-09-07 2019-02-12 腾讯科技(深圳)有限公司 业务处理方法、装置及网络设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102656554A (zh) * 2009-09-16 2012-09-05 起元技术有限责任公司 映射数据集元素
CN106547522A (zh) * 2015-09-17 2017-03-29 华为技术有限公司 一种流应用优化的方法及装置
CN105426504A (zh) * 2015-11-27 2016-03-23 陕西艾特信息化工程咨询有限责任公司 一种基于内存计算的分布式数据分析处理方法
CN109325069A (zh) * 2018-09-07 2019-02-12 腾讯科技(深圳)有限公司 业务处理方法、装置及网络设备

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167437A (zh) * 2023-04-18 2023-05-26 之江实验室 一种芯片管理系统、方法、设备及存储介质

Also Published As

Publication number Publication date
US20220366249A1 (en) 2022-11-17
CN114514506A (zh) 2022-05-17

Similar Documents

Publication Publication Date Title
US10656909B2 (en) Learning intended user actions
KR102484617B1 (ko) 이종 그래프 노드를 표현하는 모델 생성 방법, 장치, 전자 기기, 저장 매체 및 프로그램
US20190073197A1 (en) Chatbot development and deployment platform
WO2021129645A1 (zh) 数据并行化处理方法、系统、设备和存储介质
JP2023520420A (ja) チャットボットのために不均衡なトレーニングデータを取り扱うためのバッチング技術
CN109376852B (zh) 运算装置及运算方法
JP2022018095A (ja) マルチモーダル事前訓練モデル取得方法、装置、電子デバイス及び記憶媒体
US8570905B2 (en) Adaptive enterprise service bus (ESB) runtime system and method
WO2021218069A1 (zh) 基于场景动态配置的交互处理方法、装置、计算机设备
WO2021228264A1 (zh) 一种应用机器学习的方法、装置、电子设备及存储介质
US11030035B2 (en) Preventing cascade failures in computer systems
US20230139106A1 (en) Conversion method and apparatus for deep learning model, server, and storage medium
WO2021259106A1 (zh) 神经网络芯片的优化方法、系统、设备和存储介质
WO2021259041A1 (zh) Ai计算图的排序方法、装置、设备及存储介质
CN111985831A (zh) 云计算资源的调度方法、装置、计算机设备及存储介质
US20220044678A1 (en) Speech processing method and method for generating speech processing model
CN114528044B (zh) 一种接口调用方法、装置、设备及介质
JP2022067639A (ja) プロセッサを備えるシステム、コンピュータ実装方法、プログラム(合成システム障害生成)
WO2021225901A1 (en) Techniques for converting natural speech to programming code
WO2021077281A1 (zh) 深度学习框架的调整方法、装置、服务器及存储介质
US20180095865A1 (en) Event-driven software test sequence determination
WO2022028224A1 (zh) 数据存储方法、装置、设备和存储介质
CN107766944B (zh) 一种利用api分析进行系统功能流优化的系统和方法
JP7331178B2 (ja) シャーシシミュレーション方法、装置、サーバ、記憶媒体及びプログラム
WO2021077282A1 (zh) 神经网络模型转化方法、装置、服务器及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19949796

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 27/09/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19949796

Country of ref document: EP

Kind code of ref document: A1