WO2022057468A1 - Method, system, device and medium for accelerating deep learning model inference - Google Patents

Method, system, device and medium for accelerating deep learning model inference (一种深度学习模型推理加速的方法、系统、设备及介质)

Info

Publication number
WO2022057468A1
Authority
WO
WIPO (PCT)
Prior art keywords
deep learning
loss function
learning model
model
trimming
Prior art date
Application number
PCT/CN2021/109609
Other languages
English (en)
French (fr)
Inventor
刘姝
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Publication of WO2022057468A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models

Definitions

  • the present invention relates to the field of deep learning, and more particularly, to a method, system, computer device and computer-readable storage medium for accelerating inference of a deep learning model.
  • optimization methods for model inference in recent years include model compression, software library optimization, heterogeneous computing, hardware acceleration and other technologies.
  • existing inference optimization software such as TVM (an open-source inference optimizer for implementing deep learning on CPUs) and TensorRT (a deep learning inference optimizer from NVIDIA) performs deep inference optimization on deep learning models: on the one hand, computational optimization is carried out at the compiler level; on the other hand, techniques such as operator fusion and parameter quantization are applied to the computational features of deep learning to accelerate the inference and deployment of deep learning on hardware platforms. Another approach exploits the sparsity of the deep learning model and compresses it by reducing its computation or parameter count, which lowers the model's memory or bandwidth footprint, allows more convenient deployment to the inference platform, and at the same time achieves inference acceleration.
  • models produced by unstructured trimming or low-bit (the number of bits in computer storage) quantization cannot achieve acceleration on traditional software and hardware because of the irregularity of their structural changes; special software and hardware support is required to complete inference deployment and acceleration, which increases deployment cost. Meanwhile, the compressed model generally needs to be retrained, and improper retraining leads to some loss of model accuracy.
  • the purpose of the embodiments of the present invention is to propose a method, system, computer equipment and computer-readable storage medium for accelerating the inference of a deep learning model. By structurally trimming the deep learning model, the trimmed model is not limited by the software and hardware platform and can be deployed directly to the same inference platform as the model before trimming; an optimized training method based on model distillation is used to retrain the trimmed model, and this training method can keep the accuracy from degrading while doubling the performance of the trimmed model.
  • one aspect of the embodiments of the present invention provides a method for accelerating inference of a deep learning model, including the following steps: trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; calculating a first loss function of the deep learning model before trimming, and calculating a second loss function of the trimmed deep learning model; adding the first loss function to the second loss function to update the second loss function; and training the trimmed deep learning model with the updated second loss function.
  • the trimming of the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score.
  • the trimming of the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: calculating the performance values of different candidate trimming structures on the inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value.
  • the calculating of the first loss function of the deep learning model before trimming includes: applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution.
  • the calculating of the second loss function of the trimmed deep learning model includes: obtaining the predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution.
  • the adding of the first loss function to the second loss function to update the second loss function includes: assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights.
  • the training of the trimmed deep learning model with the updated second loss function includes: successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update.
  • a deep learning model inference acceleration system is provided, including: a trimming module configured to trim the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; a calculation module configured to calculate a first loss function of the deep learning model before trimming and calculate a second loss function of the trimmed deep learning model; an update module configured to add the first loss function to the second loss function to update the second loss function; and a training module configured to train the trimmed deep learning model with the updated second loss function.
  • a computer device is provided, comprising: at least one processor; and a memory, where the memory stores computer instructions executable on the processor, and the instructions, when executed by the processor, implement the steps of the above method.
  • a computer-readable storage medium stores a computer program that implements the above method steps when executed by a processor.
  • the invention has the following beneficial technical effects: by structurally trimming the deep learning model, the trimmed model is not limited by the software and hardware platform and can be deployed directly to the same inference platform as the model before trimming; an optimized training method based on model distillation is used to retrain the trimmed model, and this training method can keep the accuracy from degrading while doubling the performance of the trimmed model.
  • FIG. 1 is a schematic diagram of an embodiment of a method for accelerating inference of a deep learning model provided by the present invention
  • FIG. 2 is a schematic diagram of the hardware structure of an embodiment of a computer device for accelerating inference of a deep learning model provided by the present invention
  • FIG. 3 is a schematic structural diagram of an embodiment of a computer-readable storage medium for accelerating inference of a deep learning model provided by the present invention
  • FIG. 4 is a schematic structural diagram of a system for accelerating inference of a deep learning model provided by an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of an embodiment of a method for accelerating inference of a deep learning model provided by the present invention.
  • the embodiment of the present invention includes the following steps:
  • model trimming removes parameters from the model by certain technical means and includes structured pruning and unstructured pruning. Structured pruning is coarse-grained pruning, for example at the kernel or channel (a channel in a neural network) level, and the pruned model can be deployed to the same platform as the original model. Unstructured pruning is fine-grained pruning, for example at the level of individual weight parameters, and the pruned model requires special software and hardware platform support; otherwise, the inference acceleration effect cannot be achieved.
  • model quantization represents the weight parameters in the model with fewer bits.
  • for example, reducing parameters represented in float32 (32-bit floating-point data) to float16 (16-bit floating-point data) can halve the memory footprint.
  • the quantized model needs the support of a specific software and hardware platform; otherwise, it is difficult to achieve the effect of inference acceleration.
  • the deep learning model is structurally trimmed, and the trimmed model is not limited by the hardware and software platform and can be deployed directly to the same inference platform as the model before trimming; at the same time, during the model trimming process, the trimming criterion is guided by the actual performance improvement of the trimmed model on the inference platform, which can greatly improve the deployment efficiency and running efficiency of the trimmed model on the inference platform, whereas traditional model trimming is often guided only by the model itself, so the efficiency gain on the inference platform is limited.
  • the optimized training method of model distillation is used to retrain the trimmed model, and this training method can keep the accuracy from degrading while doubling the performance of the trimmed model.
  • the deep learning model is trimmed according to the comprehensive improvement in performance and accuracy before and after trimming.
  • This embodiment of the present invention uses structured pruning to prune the deep learning model.
  • the structured pruning is channel-level pruning, and the pruned model can be directly deployed to the same software and hardware inference platform as the model before pruning, without customizing special software and hardware.
  • deep learning models such as neural network models contain multiple convolutional layers, each composed of multiple channels, with the number of channels per layer generally ranging from tens to thousands; for example, in resnet50 (Residual Network 50), the first convolutional layer has 64 channels and the last convolutional layer has 2048. Appropriately trimming the model's channels can reduce model redundancy and increase the model's running speed.
  • the model trimming in the embodiments of the present invention is based on the following rule: based on the structural rules of the model itself, some convolutional layers are trimmed more and others less, so as to preserve the model's own structure and the accuracy of the trimmed model to the greatest extent.
  • the trimming of the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score.
  • an improvement in performance tends to bring a reduction in accuracy, so weights can be assigned to performance and accuracy according to need: for better performance, assign more weight to performance; for better accuracy, assign more weight to accuracy; and to balance performance and accuracy, assign the same weight to both.
  • the trimming of the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: calculating the performance values of different candidate trimming structures on the inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value.
  • the actual performance improvement of the trimmed model on the inference platform is considered during the model trimming process, guided by the model's actual latency on the inference platform: the latency of different candidate trimming structures on the inference platform is computed first, and the trimming structure that most effectively improves actual latency is selected as the final trimming target. The trimmed model obtained in this way maximizes the actual running efficiency on the inference platform, thereby improving the inference speed.
  • the trimmed model needs to be retrained to restore accuracy, and it is often difficult to restore the trimmed model to the same accuracy as the untrimmed model using traditional training methods. The embodiments of the present invention use knowledge distillation to retrain the trimmed model, that is, the large untrimmed model is used to guide the training of the trimmed model, transferring the generalization knowledge of the untrimmed complex model into the network of the trimmed model.
  • a first loss function of the deep learning model before clipping is calculated, and a second loss function of the deep learning model after clipping is calculated.
  • the calculating of the first loss function of the deep learning model before trimming includes: applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution.
  • the prediction output of the untrimmed deep learning model is transformed through the preset strategy to obtain a softened probability distribution, and the loss function of the untrimmed deep learning model (i.e., the soft target loss) is calculated.
  • the preset strategy can be to divide the predicted probabilities of the untrimmed deep learning model by a fixed parameter.
  • the calculating of the second loss function of the trimmed deep learning model includes: obtaining the predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution. That is, the predicted probability distribution of the trimmed deep learning model is obtained, and the loss function of the trimmed deep learning model (i.e., the hard target loss) is calculated.
  • the first loss function is added to the second loss function to update the second loss function. That is, the soft target loss is added to the hard target loss to guide the calculation and update of the hard target loss; in other words, the training knowledge of the untrimmed model is used to guide the training of the trimmed model, compensating for the accuracy drop caused by model trimming.
  • this method keeps the accuracy from dropping while cutting the model in half.
  • the adding of the first loss function to the second loss function to update the second loss function includes: assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights. For example, a weight of 0.3 may be assigned to the first loss function and a weight of 0.7 to the second loss function, and the second loss function is updated according to these weights.
  • the cropped deep learning model is trained by the updated second loss function.
  • the training of the trimmed deep learning model with the updated second loss function includes: successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update.
  • continuing the example above, the weight assigned to the second loss function is reduced after each round of training; for instance, a weight of 0.35 may be assigned to the first loss function and a weight of 0.65 to the second loss function, the second loss function is updated according to these weights, and the model is trained again. The second loss function assigned the weight of 0.65 can be either the original second loss function or the updated second loss function, chosen according to the specific situation.
  • the embodiments of the present invention can achieve streamlined compression of large-scale deep learning models and reduce the computation and parameter counts of the model; at the same time, the compressed model suffers little accuracy loss and places few restrictions on the hardware platform, and can be used to quickly deploy deep learning models to inference platforms with constrained memory, bandwidth and other resources, improving the speed and efficiency of online inference for deep learning applications and thereby advancing the inference deployment and rapid development of deep learning applications.
  • a system 500 for accelerating inference of a deep learning model is provided, including: a trimming module 501 configured to trim the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; a calculation module 502 configured to calculate a first loss function of the deep learning model before trimming and calculate a second loss function of the trimmed deep learning model; an update module 503 configured to add the first loss function to the second loss function to update the second loss function; and a training module 504 configured to train the trimmed deep learning model with the updated second loss function.
  • the trimming module 501 is configured to: assign weights to the performance improvement and the accuracy improvement, calculate an improvement score according to the weights, and use the trimming scheme with the highest score to trim the deep learning model.
  • the trimming module 501 is configured to: calculate the performance values of different candidate trimming structures on the inference platform, and use the candidate trimming structure with the largest performance value to trim the deep learning model.
  • the calculation module 502 is configured to: apply a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculate the first loss function of the deep learning model according to the softened probability distribution.
  • the calculation module 502 is configured to: obtain the predicted probability distribution of the trimmed deep learning model, and calculate the second loss function of the deep learning model according to the predicted probability distribution.
  • the update module 503 is configured to: assign weights to the first loss function and the second loss function, and replace the second loss function with the result computed from the weights.
  • the training module 504 is configured to: successively reduce the weight of the second loss function, and train the trimmed deep learning model according to the second loss function after each update.
  • a computer device is provided, including: at least one processor; and a memory, where the memory stores computer instructions executable on the processor, and the instructions, when executed by the processor, implement the following steps: S1, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; S2, calculating a first loss function of the deep learning model before trimming, and calculating a second loss function of the trimmed deep learning model; S3, adding the first loss function to the second loss function to update the second loss function; and S4, training the trimmed deep learning model with the updated second loss function.
  • the trimming of the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score.
  • the trimming of the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: calculating the performance values of different candidate trimming structures on the inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value.
  • the calculating of the first loss function of the deep learning model before trimming includes: applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution.
  • the calculating of the second loss function of the trimmed deep learning model includes: obtaining the predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution.
  • the adding of the first loss function to the second loss function to update the second loss function includes: assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights.
  • the training of the trimmed deep learning model with the updated second loss function includes: successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update.
  • FIG. 2 is a schematic diagram of the hardware structure of an embodiment of the computer device for accelerating the inference of the deep learning model provided by the present invention.
  • the device includes a processor 301 and a memory 302 , and may further include an input device 303 and an output device 304 .
  • the processor 301 , the memory 302 , the input device 303 and the output device 304 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 2 .
  • the memory 302, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for accelerating inference of a deep learning model in the embodiments of the present application.
  • the processor 301 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 302, that is, implementing the method for accelerating the inference of the deep learning model in the above method embodiments.
  • the memory 302 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created by use of the method for inference acceleration of the deep learning model, and the like. Additionally, memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 302 may optionally include memory located remotely from processor 301, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 303 can receive input information such as user name and password.
  • the output device 304 may include a display device such as a display screen.
  • the program instructions/modules corresponding to one or more deep learning model inference acceleration methods are stored in the memory 302, and when executed by the processor 301, execute the deep learning model inference acceleration method in any of the above method embodiments.
  • Any embodiment of the computer device that executes the above-mentioned method for accelerating inference of a deep learning model can achieve the same or similar effects as any of the foregoing method embodiments corresponding to it.
  • the present invention further provides a computer-readable storage medium 400 , where the computer-readable storage medium 400 stores a computer program 402 for executing the above method when executed by the processor 401 .
  • the storage medium can be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a method, system, device and storage medium for accelerating deep learning model inference. The method includes: trimming a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; calculating a first loss function of the deep learning model before trimming, and calculating a second loss function of the trimmed deep learning model; adding the first loss function to the second loss function to update the second loss function; and training the trimmed deep learning model with the updated second loss function. The present invention achieves streamlined compression of large-scale deep learning models and reduces the computation and parameter counts of the model; at the same time, the compressed model suffers little accuracy loss and places few restrictions on the hardware platform, improving the speed and efficiency of online inference for deep learning applications and thereby advancing the inference deployment and rapid development of deep learning applications.

Description

Method, system, device and medium for accelerating deep learning model inference
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 18, 2020, with application number 202010985523.1 and entitled "Method, system, device and medium for accelerating deep learning model inference" (一种深度学习模型推理加速的方法、系统、设备及介质), the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of deep learning, and more particularly to a method, system, computer device and computer-readable storage medium for accelerating deep learning model inference.
Background Art
With the rapid development of deep learning technology in recent years, deep learning has been increasingly applied in industry, for example in image recognition, autonomous driving and automatic translation systems based on deep learning. Owing to their high computational complexity and parameter redundancy, current deep learning models place high demands on the memory, bandwidth and other resources of the hardware platform, which restricts inference deployment in some scenarios or on some devices. Optimization methods for model inference in recent years include model compression, software library optimization, heterogeneous computing, hardware acceleration and other technologies.
Existing inference optimization software such as TVM (an open-source inference optimizer for implementing deep learning on CPUs) and TensorRT (a deep learning inference optimizer from NVIDIA) performs deep inference optimization on deep learning models: on the one hand, computational optimization is carried out at the compiler level; on the other hand, techniques such as operator fusion and parameter quantization are applied to the computational features of deep learning to accelerate the inference and deployment of deep learning on hardware platforms. Another approach exploits the sparsity of deep learning models and compresses the model by reducing its computation or parameter count, which lowers the model's memory or bandwidth footprint, allows more convenient deployment to the inference platform, and at the same time achieves inference acceleration.
However, in current model compression techniques, models produced by unstructured trimming or low-bit (the number of bits in computer storage) quantization cannot achieve acceleration on traditional software and hardware because of the irregularity of their structural changes; special software and hardware support is required to complete inference deployment and acceleration, which increases deployment cost. Meanwhile, the compressed model generally needs to be retrained, and improper retraining leads to some loss of model accuracy.
Summary of the Invention
In view of this, an object of the embodiments of the present invention is to propose a method, system, computer device and computer-readable storage medium for accelerating deep learning model inference. By structurally trimming the deep learning model, the trimmed model is not restricted by the software and hardware platform and can be deployed directly to the same inference platform as the model before trimming; an optimized training method based on model distillation is used to retrain the trimmed model, and this training method can keep the accuracy from degrading while doubling the performance of the trimmed model.
Based on the above object, one aspect of the embodiments of the present invention provides a method for accelerating deep learning model inference, including the following steps: trimming a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; calculating a first loss function of the deep learning model before trimming, and calculating a second loss function of the trimmed deep learning model; adding the first loss function to the second loss function to update the second loss function; and training the trimmed deep learning model with the updated second loss function.
In some embodiments, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score.
In some embodiments, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: calculating the performance values of different candidate trimming structures on the inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value.
In some embodiments, calculating the first loss function of the deep learning model before trimming includes: applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution.
In some embodiments, calculating the second loss function of the trimmed deep learning model includes: obtaining the predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution.
In some embodiments, adding the first loss function to the second loss function to update the second loss function includes: assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights.
In some embodiments, training the trimmed deep learning model with the updated second loss function includes: successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update.
In another aspect of the embodiments of the present invention, a system for accelerating deep learning model inference is further provided, including: a trimming module configured to trim a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; a calculation module configured to calculate a first loss function of the deep learning model before trimming and calculate a second loss function of the trimmed deep learning model; an update module configured to add the first loss function to the second loss function to update the second loss function; and a training module configured to train the trimmed deep learning model with the updated second loss function.
In yet another aspect of the embodiments of the present invention, a computer device is further provided, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of the above method.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, storing a computer program that implements the steps of the above method when executed by a processor.
The present invention has the following beneficial technical effects: by structurally trimming the deep learning model, the trimmed model is not restricted by the software and hardware platform and can be deployed directly to the same inference platform as the model before trimming; an optimized training method based on model distillation is used to retrain the trimmed model, and this training method can keep the accuracy from degrading while doubling the performance of the trimmed model.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment of the method for accelerating deep learning model inference provided by the present invention;
FIG. 2 is a schematic diagram of the hardware structure of an embodiment of the computer device for accelerating deep learning model inference provided by the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of the computer-readable storage medium for accelerating deep learning model inference provided by the present invention;
FIG. 4 is a schematic structural diagram of the system for accelerating deep learning model inference provided by an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are intended to distinguish two entities or parameters that share the same name but are not the same; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention, and subsequent embodiments will not explain this again.
Based on the above object, a first aspect of the embodiments of the present invention proposes an embodiment of a method for accelerating deep learning model inference. FIG. 1 shows a schematic diagram of an embodiment of the method for accelerating deep learning model inference provided by the present invention. As shown in FIG. 1, the embodiment of the present invention includes the following steps:
S1. Trim a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming;
S2. Calculate a first loss function of the deep learning model before trimming, and calculate a second loss function of the trimmed deep learning model;
S3. Add the first loss function to the second loss function to update the second loss function; and
S4. Train the trimmed deep learning model with the updated second loss function.
In model inference optimization, model compression has gradually come into use because of its low cost and few software/hardware restrictions. Current model compression techniques include model trimming and model quantization. Model trimming removes parameters from the model by certain technical means and includes structured pruning and unstructured pruning. Structured pruning is coarse-grained pruning, for example at the kernel or channel (a channel in a neural network) level, and the pruned model can be deployed to the same platform as the original model. Unstructured pruning is fine-grained pruning, for example at the level of individual weight parameters, and the pruned model requires special software and hardware platform support; otherwise, the inference acceleration effect cannot be achieved. Model quantization represents the weight parameters in the model with fewer bits; for example, reducing parameters represented in float32 (32-bit floating-point data) to float16 (16-bit floating-point data) can halve the memory footprint. As with unstructured pruning, the quantized model requires support from a specific software and hardware platform; otherwise, it is difficult to achieve inference acceleration.
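As a purely illustrative aside (not part of the patent text): in a framework such as PyTorch, which this sketch assumes, the float32-to-float16 reduction described above is a single cast, though the note about platform support still applies.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)   # any float32 module
layer_fp16 = layer.half()       # cast parameters to float16

# 1024*1024 weights: 4 bytes each in float32, 2 bytes in float16 -> half
print(layer_fp16.weight.dtype)  # torch.float16
```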
The embodiments of the present invention apply structured trimming to the deep learning model. The trimmed model is not restricted by the software and hardware platform and can be deployed directly to the same inference platform as the model before trimming. Meanwhile, during the model trimming process, the trimming criterion is guided by the actual performance improvement of the trimmed model on the inference platform, which can greatly improve the deployment efficiency and running efficiency of the trimmed model on the inference platform, whereas traditional model trimming is often guided only by the model itself, so the efficiency gain on the inference platform is limited. The embodiments of the present invention use an optimized training method based on model distillation to retrain the trimmed model, and this training method can keep the accuracy from degrading while doubling the performance of the trimmed model.
The deep learning model is trimmed according to the comprehensive improvement in performance and accuracy before and after trimming. The embodiments of the present invention trim the deep learning model by structured pruning. The structured pruning is channel-level pruning, and the pruned model can be deployed directly to the same software/hardware inference platform as the model before pruning, without customizing special software or hardware. Deep learning models such as neural network models contain multiple convolutional layers, each composed of multiple channels, with the number of channels per layer generally ranging from tens to thousands; for example, in resnet50 (Residual Network 50), the first convolutional layer has 64 channels and the last has 2048. Appropriately trimming the model's channels can reduce model redundancy and increase the model's running speed. In the structured trimming process, the model trimming in the embodiments of the present invention is based on the following rule: based on the structural rules of the model itself, some convolutional layers are trimmed more and others less, so as to preserve the model's own structure and the accuracy of the trimmed model to the greatest extent. Traditional model trimming considers only the reduction in the model's computation, but a reduction in computation does not represent the model's true performance improvement on the inference platform.
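To make the channel-level trimming concrete, here is a minimal, hypothetical sketch in PyTorch (the patent does not prescribe a framework or a channel-selection criterion; the L1-norm ranking below is a common stand-in used only for illustration). A full implementation would also have to shrink the input channels of the following layer to match; that bookkeeping is omitted here.

```python
import torch
import torch.nn as nn

def trim_conv_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Channel-level structured trimming: keep the output channels whose
    filters have the largest L1 norms. The result is still a dense Conv2d,
    so the trimmed model runs on the same platform as the original."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # L1 per output channel
    keep = torch.argsort(norms, descending=True)[:n_keep]
    trimmed = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                        stride=conv.stride, padding=conv.padding,
                        bias=conv.bias is not None)
    trimmed.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        trimmed.bias.data = conv.bias.data[keep].clone()
    return trimmed
```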
In some embodiments, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score. In most cases, a gain in performance tends to bring a loss in accuracy, so weights can be assigned to performance and accuracy as required: for better performance, assign more weight to performance; for better accuracy, assign more weight to accuracy; and to balance performance and accuracy, assign the same weight to both.
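The weighted scoring can be read as a simple linear combination. The sketch below is our illustration only; the candidate schemes, gain numbers and weight values are hypothetical, not taken from the patent.

```python
def improvement_score(perf_gain, acc_gain, w_perf=0.5, w_acc=0.5):
    """Comprehensive improvement score: weighted sum of the measured
    performance gain and accuracy gain of a candidate trimming scheme."""
    return w_perf * perf_gain + w_acc * acc_gain

# Hypothetical measurements relative to the untrimmed model.
candidates = {
    "trim_30pct_channels": {"perf_gain": 0.35, "acc_gain": -0.01},
    "trim_50pct_channels": {"perf_gain": 0.55, "acc_gain": -0.04},
}
best = max(candidates, key=lambda k: improvement_score(**candidates[k]))
print(best)  # the scheme with the highest weighted score is used for trimming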
In some embodiments, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: calculating the performance values of different candidate trimming structures on the inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value. The embodiments of the present invention take the trimmed model's actual performance improvement on the inference platform into account during trimming, guided by the model's actual latency on the inference platform: the latency of different candidate trimming structures on the inference platform is computed first, and the trimming structure that most effectively improves actual latency is selected as the final trimming target. A trimmed model obtained in this way maximizes the actual running efficiency on the inference platform and thus increases inference speed.
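A minimal sketch of the latency-guided selection follows; `build_trimmed_model` and the candidate structures are hypothetical placeholders, and plain wall-clock timing stands in for whatever profiler the target inference platform provides.

```python
import time

def measure_latency(model_fn, sample_input, warmup=10, runs=100):
    """Average wall-clock latency of one forward pass on the platform."""
    for _ in range(warmup):                  # warm up caches / lazy init
        model_fn(sample_input)
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(sample_input)
    return (time.perf_counter() - start) / runs

# Hypothetical usage: pick the candidate structure with the lowest latency,
# i.e. the largest actual performance value on the inference platform.
# best = min(candidate_structures,
#            key=lambda s: measure_latency(build_trimmed_model(s), x))
```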
The trimmed model needs to be retrained to restore accuracy, and it is often difficult to restore the trimmed model to the same accuracy as the untrimmed model using traditional training methods. The embodiments of the present invention retrain the trimmed model by knowledge distillation, that is, the untrimmed large model is used to guide the training of the trimmed model, transferring the generalization knowledge of the untrimmed complex model into the network of the trimmed model.
Calculate the first loss function of the deep learning model before trimming, and calculate the second loss function of the trimmed deep learning model.
In some embodiments, calculating the first loss function of the deep learning model before trimming includes: applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution. The prediction output of the untrimmed deep learning model is transformed by the preset strategy to obtain a softened probability distribution, and the loss function of the untrimmed deep learning model (i.e., the soft target loss) is calculated. The preset strategy may be to divide the predicted probabilities of the untrimmed deep learning model by a fixed parameter.
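The sketch below shows one standard knowledge-distillation reading of this step, with the fixed parameter treated as a softmax temperature applied to the untrimmed (teacher) model's outputs; PyTorch and the KL-divergence form are our assumptions, not stated in the patent.

```python
import torch.nn.functional as F

def soft_target_loss(teacher_logits, student_logits, T=4.0):
    """Soft target loss: the teacher's outputs are divided by a fixed
    parameter T ('softened'), and the softened teacher distribution
    supervises the student's distribution at the same temperature."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The T*T factor rescales gradients to the magnitude of the T=1 loss.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T
```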
In some embodiments, calculating the second loss function of the trimmed deep learning model includes: obtaining the predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution. The predicted probability distribution of the trimmed deep learning model is obtained, and the loss function of the trimmed deep learning model (i.e., the hard target loss) is calculated.
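Correspondingly, the hard target loss is an ordinary supervised loss on the trimmed model's own predicted distribution; a minimal sketch, again assuming PyTorch and cross-entropy as the concrete loss:

```python
import torch.nn.functional as F

def hard_target_loss(student_logits, labels):
    """Hard target loss: cross-entropy between the trimmed (student)
    model's predicted distribution and the ground-truth labels."""
    return F.cross_entropy(student_logits, labels)
```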
Add the first loss function to the second loss function to update the second loss function. That is, the soft target loss is added to the hard target loss to guide the calculation and update of the hard target loss; in other words, the training knowledge of the untrimmed model is used to guide the training of the trimmed model, compensating for the accuracy drop caused by model trimming. This method keeps the accuracy from dropping while cutting the model in half.
In some embodiments, adding the first loss function to the second loss function to update the second loss function includes: assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights. For example, a weight of 0.3 may be assigned to the first loss function and a weight of 0.7 to the second loss function, and the second loss function is updated according to these weights.
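Putting the two together with the example weights, the updated second loss function can be sketched as a weighted sum, reusing the `soft_target_loss` and `hard_target_loss` helpers sketched above; the 0.3/0.7 split mirrors the example in the text.

```python
def updated_second_loss(teacher_logits, student_logits, labels,
                        w_soft=0.3, w_hard=0.7, T=4.0):
    """Replace the second loss with a weighted combination of the soft
    target loss (first loss) and the hard target loss (second loss)."""
    return (w_soft * soft_target_loss(teacher_logits, student_logits, T)
            + w_hard * hard_target_loss(student_logits, labels))
```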
Train the trimmed deep learning model with the updated second loss function.
In some embodiments, training the trimmed deep learning model with the updated second loss function includes: successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update. Continuing the example above, the weight assigned to the second loss function is reduced after each round of training; for instance, a weight of 0.35 may be assigned to the first loss function and a weight of 0.65 to the second loss function, the second loss function is updated according to these weights, and the deep learning model is trained again. The second loss function assigned the weight of 0.65 may be either the original second loss function or the updated second loss function, which can be chosen according to the specific situation.
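Finally, the retraining schedule can be sketched as below, lowering the hard-loss weight after each round as in the 0.7 to 0.65 example; the round count, weight schedule, optimizer and data-loading details are all hypothetical placeholders, and `updated_second_loss` is the helper sketched above.

```python
import torch

def retrain_trimmed(student, teacher, loader,
                    w_hard_schedule=(0.7, 0.65, 0.6), T=4.0, lr=1e-3):
    """Retrain the trimmed model, reducing the second-loss weight after
    each round of training as described in the embodiment."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    teacher.eval()
    for w_hard in w_hard_schedule:           # one training round per weight
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)        # untrimmed model guides training
            loss = updated_second_loss(t_logits, student(x), y,
                                       w_soft=1.0 - w_hard, w_hard=w_hard, T=T)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```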
The embodiments of the present invention can achieve streamlined compression of large-scale deep learning models and reduce the computation and parameter counts of the model; at the same time, the compressed model suffers little accuracy loss and places few restrictions on the hardware platform, and can be used to quickly deploy deep learning models to inference platforms with constrained memory, bandwidth and other resources, improving the speed and efficiency of online inference for deep learning applications and thereby advancing the inference deployment and rapid development of deep learning applications.
It should be particularly noted that the steps in the embodiments of the above method for accelerating deep learning model inference can be interleaved, replaced, added and deleted with respect to one another; therefore, such reasonable permutations, combinations and transformations of the method shall also fall within the protection scope of the present invention, and the protection scope of the present invention shall not be limited to the embodiments.
As shown in FIG. 4, based on the above object, a second aspect of the embodiments of the present invention proposes a system 500 for accelerating deep learning model inference, including: a trimming module 501 configured to trim a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; a calculation module 502 configured to calculate a first loss function of the deep learning model before trimming and calculate a second loss function of the trimmed deep learning model; an update module 503 configured to add the first loss function to the second loss function to update the second loss function; and a training module 504 configured to train the trimmed deep learning model with the updated second loss function.
In some embodiments, the trimming module 501 is configured to: assign weights to the performance improvement and the accuracy improvement, calculate an improvement score according to the weights, and use the trimming scheme with the highest score to trim the deep learning model.
In some embodiments, the trimming module 501 is configured to: calculate the performance values of different candidate trimming structures on the inference platform, and use the candidate trimming structure with the largest performance value to trim the deep learning model.
In some embodiments, the calculation module 502 is configured to: apply a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculate the first loss function of the deep learning model according to the softened probability distribution.
In some embodiments, the calculation module 502 is configured to: obtain the predicted probability distribution of the trimmed deep learning model, and calculate the second loss function of the deep learning model according to the predicted probability distribution.
In some embodiments, the update module 503 is configured to: assign weights to the first loss function and the second loss function, and replace the second loss function with the result computed from the weights.
In some embodiments, the training module 504 is configured to: successively reduce the weight of the second loss function, and train the trimmed deep learning model according to the second loss function after each update.
Based on the above object, a third aspect of the embodiments of the present invention proposes a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: S1, trimming a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming; S2, calculating a first loss function of the deep learning model before trimming, and calculating a second loss function of the trimmed deep learning model; S3, adding the first loss function to the second loss function to update the second loss function; and S4, training the trimmed deep learning model with the updated second loss function.
In some embodiments, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score.
In some embodiments, trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming includes: calculating the performance values of different candidate trimming structures on the inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value.
In some embodiments, calculating the first loss function of the deep learning model before trimming includes: applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution.
In some embodiments, calculating the second loss function of the trimmed deep learning model includes: obtaining the predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution.
In some embodiments, adding the first loss function to the second loss function to update the second loss function includes: assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights.
In some embodiments, training the trimmed deep learning model with the updated second loss function includes: successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update.
As shown in FIG. 2, a schematic diagram of the hardware structure of an embodiment of the above computer device for accelerating deep learning model inference provided by the present invention is illustrated.
Taking the apparatus shown in FIG. 2 as an example, the apparatus includes a processor 301 and a memory 302, and may further include an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 2.
As a non-volatile computer-readable storage medium, the memory 302 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for accelerating deep learning model inference in the embodiments of the present application. The processor 301 runs the non-volatile software programs, instructions and modules stored in the memory 302 to execute the various functional applications and data processing of the server, that is, to implement the method for accelerating deep learning model inference of the above method embodiments.
The memory 302 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created by use of the method for accelerating deep learning model inference, and the like. In addition, the memory 302 may include high-speed random access memory and may further include non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, and such remote memories may be connected to the local module via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input device 303 can receive input information such as a user name and a password. The output device 304 may include a display device such as a display screen.
The program instructions/modules corresponding to one or more methods for accelerating deep learning model inference are stored in the memory 302 and, when executed by the processor 301, perform the method for accelerating deep learning model inference of any of the above method embodiments.
Any embodiment of the computer device that performs the above method for accelerating deep learning model inference can achieve effects that are the same as or similar to those of any corresponding method embodiment described above.
As shown in FIG. 3, the present invention further provides a computer-readable storage medium 400, and the computer-readable storage medium 400 stores a computer program 402 that performs the above method when executed by a processor 401.
Finally, it should be noted that a person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware; the program of the method for accelerating deep learning model inference can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. The storage medium of the program may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the above computer program can achieve effects that are the same as or similar to those of any corresponding method embodiment described above.
The above are exemplary embodiments disclosed by the present invention, but it should be noted that various changes and modifications can be made without departing from the scope of the disclosure of the embodiments of the present invention as defined by the claims. The functions, steps and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although the elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be construed as plural unless expressly limited to the singular.
It should be understood that, as used herein, the singular form "a/an" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed above are for description only and do not represent the superiority or inferiority of the embodiments.
A person of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be completed by hardware, or by a program instructing related hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc or the like.
A person of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples. Under the idea of the embodiments of the present invention, the technical features of the above embodiments or of different embodiments can also be combined, and there are many other variations of the different aspects of the embodiments of the present invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement and the like made within the spirit and principles of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.

Claims (10)

  1. A method for accelerating deep learning model inference, characterized by comprising the following steps:
    trimming a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming;
    calculating a first loss function of the deep learning model before trimming, and calculating a second loss function of the trimmed deep learning model;
    adding the first loss function to the second loss function to update the second loss function; and
    training the trimmed deep learning model with the updated second loss function.
  2. The method according to claim 1, characterized in that trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming comprises:
    assigning weights to the performance improvement and the accuracy improvement, calculating an improvement score according to the weights, and trimming the deep learning model with the trimming scheme that has the highest score.
  3. The method according to claim 1, characterized in that trimming the deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming comprises:
    calculating performance values of different candidate trimming structures on an inference platform, and trimming the deep learning model with the candidate trimming structure that has the largest performance value.
  4. The method according to claim 1, characterized in that calculating the first loss function of the deep learning model before trimming comprises:
    applying a preset strategy to the prediction output of the deep learning model before trimming to obtain a softened probability distribution, and calculating the first loss function of the deep learning model according to the softened probability distribution.
  5. The method according to claim 4, characterized in that calculating the second loss function of the trimmed deep learning model comprises:
    obtaining a predicted probability distribution of the trimmed deep learning model, and calculating the second loss function of the deep learning model according to the predicted probability distribution.
  6. The method according to claim 5, characterized in that adding the first loss function to the second loss function to update the second loss function comprises:
    assigning weights to the first loss function and the second loss function, and replacing the second loss function with the result computed from the weights.
  7. The method according to claim 6, characterized in that training the trimmed deep learning model with the updated second loss function comprises:
    successively reducing the weight of the second loss function, and training the trimmed deep learning model according to the second loss function after each update.
  8. A system for accelerating deep learning model inference, characterized by comprising:
    a trimming module configured to trim a deep learning model according to the comprehensive improvement in performance and accuracy before and after trimming;
    a calculation module configured to calculate a first loss function of the deep learning model before trimming, and calculate a second loss function of the trimmed deep learning model;
    an update module configured to add the first loss function to the second loss function to update the second loss function; and
    a training module configured to train the trimmed deep learning model with the updated second loss function.
  9. A computer device, characterized by comprising:
    at least one processor; and
    a memory storing computer instructions executable on the processor, wherein the instructions, when executed by the processor, implement the steps of the method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
PCT/CN2021/109609 2020-09-18 2021-07-30 Method, system, device and medium for accelerating deep learning model inference WO2022057468A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010985523.1 2020-09-18
CN202010985523.1A CN112200313A (zh) 2020-09-18 2020-09-18 Method, system, device and medium for accelerating deep learning model inference

Publications (1)

Publication Number Publication Date
WO2022057468A1 true WO2022057468A1 (zh) 2022-03-24

Family

ID=74015452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109609 WO2022057468A1 (zh) 2020-09-18 2021-07-30 Method, system, device and medium for accelerating deep learning model inference

Country Status (2)

Country Link
CN (1) CN112200313A (zh)
WO (1) WO2022057468A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200313A (zh) * 2020-09-18 2021-01-08 苏州浪潮智能科技有限公司 Method, system, device and medium for accelerating deep learning model inference
CN114444658A (zh) * 2021-12-31 2022-05-06 苏州浪潮智能科技有限公司 Deep learning model inference method, system, device and computer medium
CN114861890B (zh) * 2022-07-05 2022-09-09 深圳比特微电子科技有限公司 Method and apparatus for constructing neural network, computing device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091177A (zh) * 2019-11-12 2020-05-01 腾讯科技(深圳)有限公司 Model compression method and apparatus, electronic device and storage medium
CN111126573A (zh) * 2019-12-27 2020-05-08 深圳力维智联技术有限公司 Model distillation improvement method based on individual learning, device and storage medium
CN111461226A (zh) * 2020-04-01 2020-07-28 深圳前海微众银行股份有限公司 Adversarial sample generation method and apparatus, terminal and readable storage medium
CN111488990A (zh) * 2020-04-17 2020-08-04 苏州浪潮智能科技有限公司 Performance-aware model pruning method, apparatus, device and medium
CN112200313A (zh) * 2020-09-18 2021-01-08 苏州浪潮智能科技有限公司 Method, system, device and medium for accelerating deep learning model inference

Also Published As

Publication number Publication date
CN112200313A (zh) 2021-01-08

Similar Documents

Publication Publication Date Title
WO2022057468A1 (zh) Method, system, device and medium for accelerating deep learning model inference
US11069345B2 (en) Speech recognition using convolutional neural networks
US11886998B2 (en) Attention-based decoder-only sequence transduction neural networks
US11144831B2 (en) Regularized neural network architecture search
CN109978142B (zh) 神经网络模型的压缩方法和装置
WO2021208612A1 (zh) 数据处理的方法与装置
WO2015089148A2 (en) Reducing dynamic range of low-rank decomposition matrices
AU2023202949B2 (en) Two-pass end to end speech recognition
CN111651207B (zh) 一种神经网络模型运算芯片、方法、装置、设备及介质
CN113392962A (zh) 对神经网络的权重进行解码的方法、设备及电路
CN111612134A (zh) 神经网络结构搜索方法、装置、电子设备及存储介质
US11676078B2 (en) Neural trees
CN113449859A (zh) 一种数据处理方法及其装置
US20200151623A1 (en) N- best softmax smoothing for minimum bayes risk training of attention based sequence-to-sequence models
CN113837376A (zh) 基于动态编码卷积核融合的神经网络剪枝方法
WO2023071592A1 (zh) 面向超大搜索空间的网络结构搜索方法、系统及介质
CN112036564A (zh) 神经网络的剪枝方法、装置、设备及存储介质
CN112101547A (zh) 一种对网络模型的剪枝方法、装置、电子设备及存储介质
CN110297894B (zh) 一种基于辅助网络的智能对话生成方法
CN115329744A (zh) 一种自然语言处理方法、系统、设备及存储介质
CN112925894B (zh) 对话中标问匹配方法、系统及装置
US20230056315A1 (en) Computer-implemented methods and systems for compressing recurrent neural network (rnn) models and accelerating rnn execution in mobile devices to achieve real-time inference
CN112132281B (zh) 一种基于人工智能的模型训练方法、装置、服务器及介质
CN112633516B (zh) 性能预测和机器学习编译优化方法及装置
KR102393761B1 (ko) 이미지 처리를 위한 인공 신경망 모델 학습 방법 및 시스템

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868297

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868297

Country of ref document: EP

Kind code of ref document: A1