WO2021035397A1 - Method and apparatus for optimizing data-move tasks - Google Patents

Method and apparatus for optimizing data-move tasks

Info

Publication number
WO2021035397A1
WO2021035397A1 · PCT/CN2019/102268 · CN2019102268W
Authority
WO
WIPO (PCT)
Prior art keywords
data
move
identified
task
move task
Prior art date
Application number
PCT/CN2019/102268
Other languages
English (en)
Inventor
Yuqing Wang
Youben YE
Weiming Zhao
Peng Zhou
Weifeng Zhang
Original Assignee
Alibaba Group Holding Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to CN201980098172.8A priority Critical patent/CN114041116A/zh
Priority to PCT/CN2019/102268 priority patent/WO2021035397A1/fr
Publication of WO2021035397A1 publication Critical patent/WO2021035397A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/44 Encoding
    • G06F 8/443 Optimisation
    • G06F 8/4441 Reducing the execution time required by the program code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure is generally related to the field of computation optimization, and in particular, to methods and apparatus for data-move task optimizing.
  • Embodiments of the present disclosure provide a method for optimizing data-move tasks associated with a machine learning model.
  • the method can include identifying a data-move task in a program executable in a device, identifying one or more operations that are associated with the data-move task, determining hardware capabilities of the device and requirements of accelerating data-move tasks, and in response to a determination that the hardware capabilities of the device satisfy the requirements, mapping the identified data-move task to data writing of the identified one or more operations that are associated with the data-move task.
  • Embodiments of the present disclosure also provide an apparatus for optimizing data-move tasks associated with a machine learning model.
  • the apparatus can comprise a memory storing a set of instructions, and one or more processors configured to execute the set of instructions to cause the apparatus to perform: identifying a data-move task in a program executable in a device, identifying one or more operations that are associated with the data-move task, determining hardware capabilities of the device and requirements of accelerating data-move tasks, and in response to a determination that the hardware capabilities of the device satisfy the requirements, mapping the identified data-move task to data writing of the identified one or more operations that are associated with the data-move task.
  • Embodiments of the present disclosure also provide a non-transitory computer readable medium that stores a set of instructions that is executable by at least one processor of a computer to cause the computer to perform a method for optimizing data-move tasks associated with a machine learning model.
  • the method can comprise identifying a data-move task in a program executable in a device, identifying one or more operations that are associated with the data-move task, determining hardware capabilities of the device and requirements of accelerating data-move tasks, and in response to a determination that the hardware capabilities of the device satisfy the requirements, mapping the identified data-move task to data writing of the identified one or more operations that are associated with the data-move task.
  • FIG. 1 illustrates a schematic diagram of an exemplary deep learning accelerator system, consistent with embodiments of the present disclosure.
  • FIG. 2 illustrates a block diagram of exemplary components of a system including an apparatus for optimizing data-move tasks, consistent with embodiments of the present disclosure.
  • FIG. 3A illustrates a schematic diagram of an exemplary calculation model before being optimized, consistent with embodiments of the present disclosure.
  • FIG. 3B illustrates a schematic diagram of an exemplary calculation model after being optimized, consistent with embodiments of the present disclosure.
  • FIG. 4 illustrates a flow chart of an exemplary method of optimizing data-move tasks, consistent with embodiments of the present disclosure.
  • FIG. 5 illustrates a flow chart of an exemplary method of optimizing data-move tasks, consistent with embodiments of the present disclosure.
  • FIG. 6A illustrates a schematic diagram of an exemplary data move optimizing of merged calculation processes, consistent with embodiments of the present disclosure.
  • FIG. 6B illustrates a schematic diagram of data move optimizing of interleaved calculation processes, consistent with embodiments of the present disclosure.
  • some current solutions focus on improving processes of data-move tasks, for example, by using manual assembler code and copying data with multiple threads.
  • the present disclosure addresses the problem by merging the data-move tasks into operations (e.g., calculation operations) that are performed before the data-move tasks and redirecting output operands of the data-move tasks to offsets generated by a function such that the data-move tasks can be eliminated.
  • the optimization of data-move tasks can be used in machine learning systems in a Point-of-Sale (POS) machine that accelerates model learning using field programmable gate arrays (FPGAs) .
  • the embodiments of the present disclosure can be used in many systems, including autonomous driving systems, voice recognition systems and identity recognition systems, which benefit from computation acceleration.
  • the optimization can be applied in heterogeneous devices such as FPGAs and neural processing units (NPUs) .
  • the optimization can be used in any type of compiler and computation framework.
  • the embodiments can be used in a CPU and any type of accelerator (e.g., FPGAs, NPU, GPU, GRU and ASIC) .
  • FIG. 1 illustrates a block diagram of an exemplary deep learning accelerator system 100, according to embodiments of the disclosure.
  • Deep learning accelerator system 100 may include a neural network processing unit (NPU) 102, a NPU memory 104, a host CPU 108, a host memory 110 associated with host CPU 108, and a disk 112.
  • Accelerator system 100 can perform data-move optimizing. It is appreciated that while FIG. 1 shows the accelerator system as using an NPU, any type of accelerator can be used.
  • NPU 102 may be connected to host CPU 108 through a peripheral interface (not shown) .
  • NPU 102 may be configured to be used as a co-processor of host CPU 108.
  • NPU 102 may comprise a compiler (not shown) .
  • the compiler may be a program or a computer software that transforms computer code written in one programming language into NPU instructions to create an executable program.
  • a compiler may perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, code optimization, code generation, or combinations thereof.
  • the compiler may be on a host unit (e.g., host CPU 108 or host memory 110 of FIG. 1) , configured to push one or more commands to NPU 102. Based on these commands, a task manager (not shown) of NPU 102 may assign any number of tasks to one or more processing elements of NPU 102. Some of the commands may instruct a DMA unit to load instructions and data from host memory into a global memory. The loaded instructions may then be distributed to each processing element assigned with the corresponding task, and the one or more processing elements may process these instructions.
  • the first few instructions received by the processing element may instruct the processing element to load/store data from the global memory into one or more local memories of the processing element (e.g., a memory of the processing element or a local memory for each active processing element) .
  • Each processing element may then initiate the instruction pipeline, which involves fetching the instruction (e.g., via a fetch unit) from the local memory, decoding the instruction (e.g., via an instruction decoder) and generating local memory addresses (e.g., corresponding to an operand) , reading the source data, executing or loading/storing operations, and then writing back results.
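  • As an informal illustration (not part of the original disclosure), the Python sketch below walks one instruction stream through the fetch, decode, operand-read, execute, and write-back stages described above. The instruction tuple format, the operation names, and the dictionary standing in for local memory are assumptions made for this sketch, not the NPU's actual ISA.

```python
# Minimal sketch of the per-processing-element pipeline described above.
# The instruction format and stage comments are illustrative assumptions.

def run_pipeline(local_memory, instructions):
    """Fetch, decode, read operands, execute, and write back each instruction."""
    for instr in instructions:                 # fetch (from local memory in hardware)
        op, src_a, src_b, dst = instr          # decode and generate local addresses
        a = local_memory[src_a]                # read the source data
        b = local_memory[src_b]
        if op == "add":                        # execute
            result = a + b
        elif op == "mul":
            result = a * b
        else:
            raise ValueError(f"unknown op {op}")
        local_memory[dst] = result             # write back the result

mem = {0: 2, 1: 3, 2: 0, 3: 0}
run_pipeline(mem, [("mul", 0, 1, 2), ("add", 2, 1, 3)])
print(mem[3])  # 9
```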
  • Host CPU 108 may be associated with host memory 110 and disk 112.
  • host memory 110 may be an integral memory or an external memory associated with host CPU 108.
  • Host memory 110 may be a local or a global memory.
  • disk 112 may comprise an external memory configured to provide additional memory for host CPU 108.
  • FIG. 2 illustrates a block diagram of exemplary components of a system including an apparatus for optimizing data-move tasks, consistent with embodiments of the present disclosure.
  • Apparatus 200 for optimizing data-move tasks can be implemented within a system.
  • the system can be a neural network accelerator system 100 of FIG. 1.
  • apparatus 200 can be a compiler on host CPU 108 or host memory 110 shown in FIG. 1.
  • the compiler is used to perform data-move optimizing
  • optimizing data-move tasks can also be implemented in any framework that employs optimization (e.g., TensorFlow for deep learning) .
  • the framework can be an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thus providing application-specific software.
  • Apparatus 200 can include a data-move task finder 210, a task analyzer 220, a storage allocator 230, an address redirector 240 and a data-move task remover 250.
  • Data-move task finder 210 can identify a data-move task (DMT) in a program executable in a device (e.g., deep learning accelerator system 100 as shown in FIG. 1) .
  • the data-move task can contain an operation of moving or copying data.
  • Data-move task finder 210 can also identify one or more operations that are associated with the data-move task.
  • the associated operations include operations that are performed before the data-move task.
  • the operations can be calculation operations such as multiplication and addition.
  • input of the compiler can be model definitions and model data files.
  • the model definition describes primary inputs of the model, intermediate computations and outputs of the model.
  • the model data files can include weights, biases, parameters and coefficients.
  • the compiler can first analyze structure and operations of the model, recognize the part that can be optimized and conduct optimization, convert the model definitions to Intermediate Representation (IR) , and generate binary instructions according to the instruction set architecture (ISA) of a device.
  • Input and output of the model can be tensors (e.g., input tensors and output tensors shown in FIG. 3A and FIG. 3B) .
  • Data-move task finder 210 shown in FIG. 2 can recognize operations of a data-move type such as concatenate operations (e.g., Concat as shown in FIG. 3A) , and Split operations when analyzing the operations of the model.
  • the hardware may be required to concatenate two 3-dimensional inputs along one dimension. For instance, data matrix [10] [20] [30] and data matrix [10] [40] [30] are concatenated along the second dimension to generate a result of [10] [60] [30].
  • multi-dimensional data can be split along one dimension to generate multiple output data. For instance, data matrix [2] [3] [4] can be split along the third dimension to generate four outputs of [2] [3].
  • the above exemplary operations can be identified as data-move types.
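  • As a hedged, framework-agnostic sketch (not part of the original disclosure) of why such operations are pure data movement, the NumPy example below reproduces the concatenation and split shapes from the two examples above. NumPy is used only for illustration; axis numbering is zero-based here, and np.split keeps a trailing singleton dimension on each [2] [3] slice.

```python
import numpy as np

# Concatenate [10][20][30] and [10][40][30] along the second dimension.
a = np.zeros((10, 20, 30))
b = np.ones((10, 40, 30))
concatenated = np.concatenate([a, b], axis=1)
print(concatenated.shape)          # (10, 60, 30)

# Split [2][3][4] along the third dimension into four outputs.
m = np.arange(2 * 3 * 4).reshape(2, 3, 4)
parts = np.split(m, 4, axis=2)
print(len(parts), parts[0].shape)  # 4 (2, 3, 1)
```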
  • A schematic diagram of an exemplary calculation model before being optimized is illustrated in FIG. 3A, consistent with embodiments of the present disclosure, and a schematic diagram of an exemplary calculation model after being optimized is illustrated in FIG. 3B, consistent with embodiments of the present disclosure.
  • blocks of various sizes represent the exemplary input tensors and output tensors. It is appreciated that while the optimization can be used in deep learning, computation using any data block can employ the data-move task optimization.
  • the input is model structures and definitions.
  • a data block as an input can be a matrix, a vector, an integer, or a float.
  • the data block can be defined in granularity.
  • the data block as an input can have larger granularity, whereas when a CPU or GPU is used, the data block as an input can represent a single integer.
  • the exemplary calculation model shown in FIG. 3A can involve three operations including a matrix-multiplication (MatMul) operation, an addition (Add) operation, and a concatenation (Concat) operation.
  • four input tensors are calculated to generate one output tensor.
  • the compiler can identify that the Concat operation concatenates two data operands along the second axis.
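  • As a hedged illustration (not part of the original disclosure), the NumPy sketch below mimics this unoptimized model, assuming a 2*4 MatMul result and a 2*1 Add result (the Temporary Tensor shapes described for FIG. 3A); the explicit np.concatenate call stands in for the Concat data-move task that the compiler will later eliminate. The input shapes and variable names are assumptions for illustration.

```python
import numpy as np

x, w = np.random.rand(2, 3), np.random.rand(3, 4)      # MatMul input tensors
u, v = np.random.rand(2, 1), np.random.rand(2, 1)       # Add input tensors

tmp_matmul = x @ w                                       # 2*4 Temporary Tensor
tmp_add = u + v                                          # 2*1 Temporary Tensor
output = np.concatenate([tmp_matmul, tmp_add], axis=1)   # Concat: an explicit data-move task
print(output.shape)                                      # (2, 5)
```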
  • task analyzer 220 can determine whether hardware capabilities of the device meet requirements associated with the data-move task.
  • Task analyzer 220 can also determine sizes of output operands (e.g. output tensors shown in FIG. 3A) of the identified operations that are associated with the data-move task. Not all data-move type operations can be optimized, because patterns of some operations cannot be extracted or the hardware is not able to optimize the operations. The requirements are met when a minimum size of data that the hardware is configured to write is equal to or smaller than the minimum size of the output operands.
  • an accelerator such as an NPU can be limited in hardware capabilities, as the ISA of the NPU may not support such optimization.
  • for example, the minimum size that the NPU is configured to write may be ten integers, while the optimization may require writing one integer each time when the data-move task is merged into the identified operations. Since the minimum write size is ten integers, the NPU is not capable of writing one integer, which is less than ten, or of skipping one integer to map offsets for the optimization.
  • the accelerator can be limited by the hardware capabilities so that the accelerator cannot write results of operations in a certain way that may be required for optimization. If the hardware is capable of optimizing the operations, the elimination of data-move operations can be performed. The data-move task can be merged into a calculation operation that is performed prior to the data-move operation.
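  • A hedged sketch of this capability check is given below (an editorial illustration, not the patent's implementation): the merge is allowed only when the smallest unit the hardware can write is no larger than the smallest output operand of the associated operations. The function name and the integer-counted sizes are illustrative assumptions.

```python
def can_merge_data_move(min_hw_write_size: int, output_operand_sizes: list[int]) -> bool:
    """True when the hardware's minimum write size fits the smallest output operand."""
    return min_hw_write_size <= min(output_operand_sizes)

# An NPU whose minimum write is ten integers cannot merge a data-move task
# whose smallest output operand is a single integer (example from the text).
print(can_merge_data_move(10, [1, 8]))   # False -> keep the explicit data-move task
print(can_merge_data_move(1, [4, 8]))    # True  -> merge and eliminate the data-move task
```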
  • Address redirector 240 can map the identified data-move task to data writing of the identified operations that are associated with the data-move task when it is determined that the hardware of the device (e.g., deep learning accelerator system 100 as shown in FIG. 1) meets the requirements of optimizing the data-move task.
  • Address redirector 240 can generate a function that maps input offsets and output offsets.
  • An offset can be an integer indicating distance between a base address and a given operand.
  • the function can infer the output offset of the data-move task when the input offset of the data-move task is given.
  • an output offset generator can be used to write results of the calculation operation to an output offset.
  • the output offset generator can be adjusted by the function.
  • Address redirector 240 can merge the data-move task into the calculation operation by redirecting the results of the calculation operation to the output offset of the data-move task by adjusting the output offset generator of the calculation operation with the function that infers the output offset of the data-move task.
  • address redirector 240 in the exemplary compiler can mark the matrix-multiplication (MatMul) operation and the addition (Add) operation to instruct writing their results at distance-based intervals, and can adjust the instructions for the matrix-multiplication operation and the addition operation based on the offset change of the results.
  • the result of the matrix-multiplication operation is a 2*4 sized Temporary Tensor (shown as 2*4 blocks in FIG. 3A)
  • the result of the addition operation is a 2*1 sized Temporary Tensor (shown as 2*1 blocks in FIG. 3A).
  • the compiler can generate instructions to merge the calculation operations and the data-move operation.
  • the optimized calculation model can perform two calculation operations without performing a data-move operation. Accordingly, when the compiler employs the optimization of data-move operations by combining the calculation operations with the data-move task, computation performance can be improved.
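  • Continuing the earlier NumPy sketch (an illustrative analogy, not the compiler's actual output), the optimized model of FIG. 3B can be pictured as below: storage for the final tensor is allocated up front, and the MatMul and Add results are written directly at their redirected column offsets, so no Concat operation runs.

```python
import numpy as np

x, w = np.random.rand(2, 3), np.random.rand(3, 4)
u, v = np.random.rand(2, 1), np.random.rand(2, 1)

output = np.empty((2, 5))          # pre-allocated concatenated output
output[:, 0:4] = x @ w             # MatMul writes at column offset 0
output[:, 4:5] = u + v             # Add writes at the redirected column offset 4
print(output.shape)                # (2, 5) -- same result, no Concat operation
```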
  • storage allocator 230 is provided ahead of the operations (e.g., the calculation operations) , and storage allocator 230 can allocate storage that is determined to be sufficient to store the results of the operations before the operations are performed.
  • Storage allocator 230 can work in combination with address redirector 240 for optimizing the data-move tasks.
  • Data-move task remover 250 can eliminate the data-move task that has been mapped.
  • performing the data-move task is replaced by redirecting and writing the results to the output offset generated by the function based on sizes of output operands of the operations.
  • the results of MatMul operation and Add operation are concatenated without performing a concatenate operation. Accordingly, the data-move task can be eliminated. Therefore, resources that are dedicated for the data-move tasks before optimization can be released after optimization. The overall computation performance of the accelerator is improved.
  • FIG. 4 illustrates a flow chart of an exemplary method 400 of optimizing data-move tasks, consistent with embodiments of the present disclosure.
  • Method 400 can include the following steps.
  • Method 400 can be performed by a compiler on a processing device (e.g., host CPU 108 or NPU 102 in FIG. 1) .
  • In step 401, a data-move task is identified in a program executable in a device such as an accelerator.
  • the data-move task can be an operation of moving or copying data.
  • a compiler can determine if an operation is a data-move task when the compiler can analyze behaviors of the operation.
  • the compiler can determine how input operands are concatenated along a dimension to generate output operands for the concatenate operation and how input operands are split along a dimension to generate output operands for the split operation.
  • the compiler can recognize that the concatenate operation and the split operation are data-move tasks.
  • the number of the input operands of the data-move task can vary.
  • the number of the output operands of the data-move task can also vary.
  • In step 402, one or more operations that are associated with the data-move task are identified.
  • the one or more operations that are associated with the data-move task include operations that are performed prior to the data-move task.
  • the operations can be calculation operations such as multiplication and addition.
  • the compiler can identify the operations that are immediately prior to the data-move tasks for merging the data-move tasks and the operations.
  • In step 403, hardware capabilities of the device and requirements of optimizing data-move tasks are determined. If the hardware capabilities satisfy the requirements, method 400 proceeds to step 404. If the hardware capabilities do not satisfy the requirements, method 400 ends.
  • Step 403 can include determining sizes of output operands (e.g. output tensors shown in FIG. 3A) of the identified operations that are associated with the data-move task.
  • the compiler can determine whether optimization can be performed on the operations based on the definitions of the operations and capabilities of the hardware.
  • the requirements may be met when a minimum size of data that the hardware is configured to write is equal to or less than the minimum size of the output operands.
  • an accelerator such as an NPU can be limited in hardware capabilities, as the ISA of the NPU may not support optimization.
  • for example, the minimum size that the NPU is configured to write can be ten integers, while the optimization may require writing one integer, which is less than ten, and skipping one integer when the data-move task is merged into the identified operations. Since the minimum write size is ten integers, the NPU is not capable of writing one integer or skipping one integer to map offsets for the optimization.
  • In step 404, if the hardware capabilities of the device meet the requirements of optimizing the data-move task, the identified data-move task is mapped to data writing of one or more operations that are associated with the data-move task.
  • the operations can include calculation operations such as matrix-multiplication and addition. Mapping the identified data-move task to data writing of the identified one or more operations that are associated with the data-move task is further described in FIG. 5.
  • Step 404 can further include step 501 and step 502 shown in FIG. 5.
  • FIG. 5 illustrates a flow chart of an exemplary method 500 of optimizing data-move tasks, consistent with embodiments of the present disclosure. Method 500 can include the following steps.
  • In step 501, a function that maps an output offset of the identified data-move task in response to an input offset of the identified data-move task is generated.
  • An offset can be an integer indicating distance between a base address and a given operand.
  • the base address can be an address serving as a reference point for other addresses.
  • the base address can represent a starting address of a storage space.
  • An output offset generator can be used to write results of the identified operations associated with the data-move task to a certain offset.
  • the function can map offset addresses using the following approaches.
  • the function can modify a page table of an operating system.
  • the page table is a data structure mapping between virtual addresses used in a program and physical addresses used in hardware.
  • the function can construct and maintain a lookup table of offset mapping.
  • the function can call a designed formula.
  • the function can also directly modify how the offset changes, e.g., from increasing by 1 each time to increasing by 2, 2, and 5 periodically.
  • the increments can be determined based on the sizes of the output operands of the operations associated with the data-move task. Any of the above approaches can be used to convert an old address to a new address (e.g., address 100 to address 300) for writing results of operations.
  • the physical address for writing the results is obtained by adding the offset to the base address.
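  • As one hedged example of the formula approach above (an editorial sketch, not the patent's implementation), the code below maps a producer's row-major offset inside its own output operand to the redirected offset inside the concatenated destination; the column counts correspond to the 2*4 and 2*1 temporaries of FIG. 3A being placed in a 2*5 output, and the physical write address would be the base address plus the mapped offset. Function names and parameters are assumptions.

```python
def make_offset_map(src_cols: int, dst_cols: int, col_base: int):
    """Return a function mapping an input offset to its redirected output offset."""
    def map_offset(input_offset: int) -> int:
        row, col = divmod(input_offset, src_cols)
        return row * dst_cols + col_base + col
    return map_offset

# MatMul writes a 2x4 temporary into columns 0..3 of a 2x5 output,
# and Add writes a 2x1 temporary into column 4.
matmul_map = make_offset_map(src_cols=4, dst_cols=5, col_base=0)
add_map = make_offset_map(src_cols=1, dst_cols=5, col_base=4)
print([matmul_map(i) for i in range(8)])   # [0, 1, 2, 3, 5, 6, 7, 8]
print([add_map(i) for i in range(2)])      # [4, 9]
```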
  • In step 502, the output offset of the identified data-move task is redirected using the function based on the sizes of the output operands of the identified operations associated with the identified data-move task.
  • the function can be used to adjust the output offset generator such that the results of the identified operation are redirected and written to the output offset generated by the function.
  • the output offset can be adjusted to obtain an actual address for writing the results of the data-move task.
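  • A small sketch of this adjustment is shown below (illustrative only): the producer's sequential element offsets are passed through the mapping function, and each write lands at the base address plus the redirected offset. The generator, the mapping helper, and the base address of 100 are assumptions restated here so the example runs on its own.

```python
def offset_generator(num_elements, offset_map, base_address):
    """Yield redirected physical write addresses for each result element."""
    for input_offset in range(num_elements):
        yield base_address + offset_map(input_offset)

# A 2x4 result redirected into a 2x5 destination starting at base address 100
# (same mapping as the sketch above, restated so this example is self-contained).
def map_offset(i, src_cols=4, dst_cols=5, col_base=0):
    row, col = divmod(i, src_cols)
    return row * dst_cols + col_base + col

print(list(offset_generator(8, map_offset, 100)))
# [100, 101, 102, 103, 105, 106, 107, 108]
```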
  • In step 405, the data-move task that has been mapped is eliminated.
  • the data-move task is merged into the associated operations.
  • the Concat operation of the two temporary tensors in FIG. 3A is eliminated.
  • the results of MatMul operation and Add operation are concatenated without performing a concatenate operation. Therefore, the mapped data-move task can be removed such that the computation is optimized.
  • In step 406, it is determined whether there are any more identified data-move tasks to be processed. If another identified data-move task exists to be processed, method 400 proceeds to step 403. If there are no other identified data-move tasks to be processed, method 400 ends.
  • FIG. 6A illustrates a schematic diagram of an exemplary data move optimizing of merged calculation processes, consistent with embodiments of the present disclosure.
  • calculation process 1 and calculation process 2 are merged sequentially by a merge operation.
  • results of calculation process 1 and calculation process 2 are merged sequentially without performing a merge operation. Accordingly, using the optimization techniques consistent with the disclosed embodiments, the data-move task associated with the calculation processes is eliminated.
  • FIG. 6B illustrates a schematic diagram of data move optimizing of interleaved calculation processes, consistent with embodiments of the present disclosure.
  • calculation process 1 and calculation process 2 are interleaved by an interleave operation.
  • the results of calculation process 1 and calculation process 2 are interleaved without performing an interleave operation.
  • the data-move task is eliminated.
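  • A hedged NumPy sketch of both layouts is given below (an editorial illustration, not the patent's implementation): two calculation processes write into one pre-allocated buffer, back to back for the merged case of FIG. 6A and at alternating offsets for the interleaved case of FIG. 6B, with no separate merge or interleave data-move pass. The buffer size and the element-wise calculations are assumptions.

```python
import numpy as np

n = 4
in1, in2 = np.arange(n), np.arange(n) * 10.0

# Merged layout (FIG. 6A): process 1 writes [0, n), process 2 writes [n, 2n).
merged = np.empty(2 * n)
merged[0:n] = in1 + 1            # calculation process 1
merged[n:2 * n] = in2 + 1        # calculation process 2
print(merged)

# Interleaved layout (FIG. 6B): process 1 writes even offsets, process 2 odd offsets.
interleaved = np.empty(2 * n)
interleaved[0::2] = in1 + 1      # calculation process 1
interleaved[1::2] = in2 + 1      # calculation process 2
print(interleaved)
```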
  • calculation process 1 and calculation process 2 as shown in FIG. 6A and FIG. 6B can run on different hardware (e.g., on two separate GPUs, or calculation process 1 on a CPU while calculation process 2 runs on a GPU) , provided that the CPU and the GPU can access shared memory to which both can write results at address offsets reserved based on the result sizes of calculation process 1 and calculation process 2.
  • Calculation process 1 and calculation process 2 can also be executed by one accelerator such as a GPU.
  • the CPU and the GPU are capable of optimizing the operations according to the embodiments of the present disclosure.
  • when any process runs on FPGAs or NPUs, there may be a determination of whether the FPGA or NPU hardware is capable of performing such optimization, because the ISA of the hardware may not support such optimization.
  • a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by an apparatus (such as a compiler on CPU 108 or NPU 102 in FIG. 1) , for performing the above-described methods.
  • Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same.
  • the device may include one or more processors (such as CPUs and processing accelerators) , an input/output interface, a network interface, and/or a memory.
  • the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
  • the above described embodiments can be implemented by hardware, or software (program codes) , or a combination of hardware and software. If implemented by software, it may be stored in the above-described computer-readable media. The software, when executed by the processor can perform the disclosed methods.
  • the computing units and other functional units described in this disclosure can be implemented by hardware, or software, or a combination of hardware and software.
  • One of ordinary skill in the art will also understand that multiple ones of the above described modules/units may be combined as one module/unit, and each of the above described modules/units may be further divided into a plurality of sub-modules/sub-units.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

Methods and apparatus for optimizing data-move tasks associated with a machine learning model are provided. The method can include identifying a data-move task in a program executable in a device, identifying one or more operations that are associated with the data-move task, determining hardware capabilities of the device and requirements of optimizing data-move tasks, and, in response to a determination that the hardware capabilities of the device satisfy the requirements, mapping the identified data-move task to data writing of the identified one or more operations that are associated with the data-move task.
PCT/CN2019/102268 2019-08-23 2019-08-23 Procédé et appareil d'optimisation de tâches de déplacement de données WO2021035397A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980098172.8A CN114041116A (zh) 2019-08-23 2019-08-23 数据移动任务优化的方法和装置
PCT/CN2019/102268 WO2021035397A1 (fr) 2019-08-23 2019-08-23 Procédé et appareil d'optimisation de tâches de déplacement de données

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/102268 WO2021035397A1 (fr) 2019-08-23 2019-08-23 Procédé et appareil d'optimisation de tâches de déplacement de données

Publications (1)

Publication Number Publication Date
WO2021035397A1 true WO2021035397A1 (fr) 2021-03-04

Family

ID=74683730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102268 WO2021035397A1 (fr) 2019-08-23 2019-08-23 Procédé et appareil d'optimisation de tâches de déplacement de données

Country Status (2)

Country Link
CN (1) CN114041116A (fr)
WO (1) WO2021035397A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2609702A (en) * 2021-04-26 2023-02-15 Nvidia Corp Acceleration of operations

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018129327A1 (fr) * 2017-01-06 2018-07-12 Google Llc Loop and library fusion
CN108491359A (zh) * 2016-04-22 2018-09-04 北京中科寒武纪科技有限公司 Submatrix operation device and method
US20180322390A1 (en) * 2017-05-05 2018-11-08 Intel Corporation Optimized compute hardware for machine learning operations
US20190065146A1 (en) * 2017-08-31 2019-02-28 Qualcomm Incorporated Providing efficient floating-point operations using matrix processors in processor-based systems

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2023548A1 (fr) * 2007-08-09 2009-02-11 Nokia Siemens Networks Oy Mobile communication terminal, communication system, communication network and communication method
CN109144916A (zh) * 2017-06-16 2019-01-04 深圳市中兴微电子技术有限公司 Method and device for processing data packets, and chip
CN109618399A (zh) * 2018-12-26 2019-04-12 东华大学 Distributed energy management optimization method in a multi-user mobile edge computing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491359A (zh) * 2016-04-22 2018-09-04 北京中科寒武纪科技有限公司 Submatrix operation device and method
WO2018129327A1 (fr) * 2017-01-06 2018-07-12 Google Llc Loop and library fusion
US20180322390A1 (en) * 2017-05-05 2018-11-08 Intel Corporation Optimized compute hardware for machine learning operations
US20190065146A1 (en) * 2017-08-31 2019-02-28 Qualcomm Incorporated Providing efficient floating-point operations using matrix processors in processor-based systems

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2609702A (en) * 2021-04-26 2023-02-15 Nvidia Corp Acceleration of operations

Also Published As

Publication number Publication date
CN114041116A (zh) 2022-02-11

Similar Documents

Publication Publication Date Title
US11893414B2 (en) Operation method, device and related products
US10268454B2 (en) Methods and apparatus to eliminate partial-redundant vector loads
US7793278B2 (en) Systems and methods for affine-partitioning programs onto multiple processing units
US8997065B2 (en) Automatic modularization of source code
US20180203673A1 (en) Execution of computation graphs
US9990186B2 (en) Determination of branch convergence in a sequence of program instruction
US20080082969A1 (en) Software Testing Technique Supporting Dynamic Data Structures
US11132196B2 (en) Apparatus and method for managing address collisions when performing vector operations
US10372430B2 (en) Method of compiling a program
US11694075B2 (en) Partitioning control dependency edge in computation graph
CN104641351A (zh) 部分向量化编译系统
US8276111B2 (en) Providing access to a dataset in a type-safe manner
US20230350673A1 (en) Microkernel-based software optimization of neural networks
US9921838B2 (en) System and method for managing static divergence in a SIMD computing architecture
WO2021035397A1 (fr) Procédé et appareil d'optimisation de tâches de déplacement de données
US20200012250A1 (en) Program editing device, program editing method, and computer readable medium
US20170269931A1 (en) Method and Computing System for Handling Instruction Execution Using Affine Register File on Graphic Processing Unit
US10496433B2 (en) Modification of context saving functions
US9910650B2 (en) Method and apparatus for approximating detection of overlaps between memory ranges
KR102594770B1 (ko) 데이터 처리장치에서의 연속값들의 매칭
KR20150040663A (ko) 소프트웨어 파이프라이닝을 이용한 명령어 스케줄링 방법 및 장치
JP2019164704A (ja) コンパイラ
US8290917B2 (en) Reordering of data elements in a data parallel system
CN117742787A (zh) 一种指令数据冲突诊断方法、电子设备及存储介质
Jin Extending the SYCL Joint Matrix for Binarized Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19942976

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19942976

Country of ref document: EP

Kind code of ref document: A1