WO2021253857A1 - Model compression method and system fusing pruning and quantization - Google Patents

Model compression method and system fusing pruning and quantization

Info

Publication number
WO2021253857A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
weight parameter
parameter space
compressed
quantization
Prior art date
Application number
PCT/CN2021/076975
Other languages
English (en)
French (fr)
Inventor
刘姝 (Liu Shu)
Original Assignee
苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Publication of WO2021253857A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • This application relates to the technical field of deep compression of neural network models, and in particular to a model compression method and system fusing pruning and quantization.
  • Model compression has become an important goal for accelerating deep learning applications. How to compress a model so as to effectively reduce parameter redundancy, storage footprint, communication bandwidth, and computational complexity, while reducing latency in the model application stage, is a key technical problem for accelerating the deployment and development of deep learning applications.
  • Current CNN (Convolutional Neural Network) model compression methods usually adopt pruning and quantization. Specifically, model pruning is first completed in conjunction with the full-precision parameter space, and low-bit (binary digit) quantization is then applied to the fixed, pruned model parameters, thereby achieving model compression.
  • However, because pruning and quantization are executed separately and sequentially, the two stages of the compression process are relatively independent, and the interaction between quantization and the model structure is not considered. In some cases, the channel settings of specific layers in the model greatly affect the quantization result, which limits the model compression space and in turn leads to insufficient compression accuracy and a suboptimal compression effect.
  • This application provides a model compression method and system fusing pruning and quantization, to solve the problem that prior-art compression methods yield insufficient model compression accuracy and a suboptimal compression effect.
  • A model compression method fusing pruning and quantization, comprising:
  • generating a supernetwork based on the model to be compressed; training the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed;
  • quantizing the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model;
  • performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
  • performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model includes: searching the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints; evaluating the accuracy of each pruned model; determining, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization; and determining the optimal pruned model according to that structure.
  • the constraints include: computation amount and latency.
  • the method of searching the second weight parameter space for all pruned models that satisfy the set constraints is specifically: according to the set constraints, an AutoML-based automated search method is used to search the second weight parameter space for all pruned models that satisfy the constraints.
  • before generating the supernetwork based on the model to be compressed, the method further includes: defining the dimensions of model compression, the dimensions including a structural dimension and a parameter space dimension; and determining the manner of model compression according to the dimensions.
  • determining the manner of model compression according to the dimensions includes: compressing the structural dimension by means of model pruning, and compressing the parameter space dimension by means of quantization.
  • the model to be compressed includes: a CNN model, an object detection model, and a natural language processing model.
  • A model compression system fusing pruning and quantization, comprising:
  • a supernetwork generation module, used to generate the supernetwork according to the model to be compressed;
  • a training module, used to train the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed;
  • a quantization module, used to quantize the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model;
  • a pruning module, used to perform model pruning and accuracy evaluation of the compressed model in the second weight parameter space and obtain the optimal compressed model.
  • the system further includes:
  • a compression dimension definition module, used to define the dimensions of model compression, the dimensions including a structural dimension and a parameter space dimension;
  • a compression manner determination module, used to determine the manner of model compression according to the dimensions.
  • the pruning module includes:
  • a search unit, used to search the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints;
  • an accuracy evaluation unit, used to evaluate the accuracy of each pruned model;
  • a pruned model structure determination unit, used to determine, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization;
  • an optimal pruned model determination unit, used to determine the optimal pruned model according to the pruned model structure.
  • This application provides a model compression method fusing pruning and quantization.
  • The method first generates a supernetwork based on the model to be compressed, then trains the supernetwork to generate the first weight parameter space of the model to be compressed, then quantizes the first weight parameter space to form the second weight parameter space, and finally performs model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
  • The first weight parameter space is represented in float32 and the second weight parameter space is represented in low-bit form.
  • By adding the low-bit quantization constraint to model pruning, and performing the search for pruned models and their accuracy evaluation in the low-bit-quantized parameter space, the pruned model structure whose per-layer channel settings match low-bit quantization can be determined; this effectively combines model pruning with model quantization, yields a deeply compressed model that is adaptively optimized in both the pruning and quantization dimensions, and helps improve both the accuracy and the overall effect of model compression.
  • In addition, an AutoML-based automated search method is adopted for model pruning, so the search space is more flexible and channel-level pruning can be achieved, which helps further improve the accuracy of model compression.
  • This application also provides a model compression system fusing pruning and quantization.
  • The system mainly includes: a supernetwork generation module, a training module, a quantization module, and a pruning module.
  • The first weight parameter space is generated by the training module, the second weight parameter space is generated by the quantization module, and the pruning module performs model pruning and accuracy evaluation of the compressed model in the low-bit second weight parameter space, finally obtaining the optimal compressed model.
  • In this way, model pruning and model quantization are fused, fully accounting for the significant influence of different model structures on the quantization result during low-bit quantization, such as the differing effects of each layer's channel setting, and a deeply compressed model that is adaptively optimized in both the pruning and quantization dimensions is finally obtained.
  • FIG. 1 is a schematic flowchart of a model compression method fusing pruning and quantization provided by an embodiment of this application;
  • FIG. 2 is a schematic diagram of the model compression principle when the method of this embodiment is applied with 4-bit quantization;
  • FIG. 3 is a schematic structural diagram of a model compression system fusing pruning and quantization provided by an embodiment of this application.
  • FIG. 1 is a schematic flowchart of a model compression method fusing pruning and quantization provided by an embodiment of this application. As shown in FIG. 1, the model compression method fusing pruning and quantization in this embodiment mainly includes the following steps:
  • The supernetwork is usually a supernetwork represented in float32 (32-bit floating point) full precision.
  • The model to be compressed in this embodiment includes: a CNN model, an object detection model, and a natural language processing model. The following description mainly takes the CNN model as an example.
  • The first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed.
  • The model compression method adopted in this embodiment includes model pruning and model quantization, and fuses the two.
  • After the supernetwork is generated, it is trained in step S4 to generate the first weight parameter space.
  • The first weight parameter space being represented in float32 means that the weight values of the neural network are represented in float32, whose bit width is larger than that of the second weight parameter space. Both the first and second weight parameters are used to evaluate model accuracy; the first weight parameters are used to evaluate the accuracy of the model to be compressed.
  • Step S5 is then executed: quantize the first weight parameter space to form a second weight parameter space.
  • The second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model.
  • The quantized supernetwork can generate weight parameters represented in a low-bit range, i.e., the second weight parameters, replacing the float32 full-precision model parameter space with a low-bit parameter space and providing the conditions for subsequent model compression.
  • In step S6, model pruning and accuracy evaluation of the compressed model are performed in the second weight parameter space to obtain the optimal compressed model.
  • Specifically, step S6 includes the following steps:
  • The constraints in this embodiment include: computation amount and latency. That is, model compression can be performed under a set computation budget and a set latency constraint, finally compressing the computation amount of the model to be compressed down to the set budget and its latency down to the set latency.
  • Step S61 may be carried out as follows:
  • according to the set constraints, an AutoML-based automated search method is used to search the second weight parameter space for all pruned models that satisfy the constraints.
  • With this approach, the pruning dimension of each layer in the deep learning model can be set flexibly, so that pruned and optimized models satisfying the conditions can be found as exhaustively as possible within the given search space. This search method therefore has a more flexible search space, can achieve channel-level pruning, and helps further improve the accuracy of model compression.
  • Step S62 is then executed: evaluate the accuracy of each pruned model.
  • Each time a pruned model satisfying the constraints is found, its accuracy can be evaluated, until all pruned models satisfying the constraints have been searched.
  • After all pruned models satisfying the constraints have been searched and evaluated one by one, the accuracy evaluation results are obtained and step S63 is executed: according to the accuracy evaluation results of all pruned models, determine the pruned model structure whose per-layer channel settings match low-bit quantization.
  • This embodiment fuses automated pruning with quantization, which can effectively address the influence of the channel-count settings in the model structure on the quantization result, thereby avoiding the problem that, when pruning and quantization are performed independently, the fixed model obtained after pruning does not fit the quantization space. A pruned model structure whose per-layer channel settings suit low-bit quantization is obtained, which helps improve the accuracy of the compressed model and the performance of the model.
  • Step S1: Define the dimensions of model compression.
  • The dimensions include: a structural dimension and a parameter space dimension.
  • Step S2 includes the following steps:
  • Compression of the model's structural dimension is achieved through model pruning, i.e., a specific number of channels in each layer of the CNN model are pruned away.
  • Compression of the model's parameter space dimension is achieved through model quantization, i.e., the parameter space represented in float32 is quantized down to a low-bit representation.
  • For a schematic diagram of the principle of the model compression method fusing pruning and quantization in this embodiment, refer to FIG. 2, where the low-bit case is exemplified by 4-bit quantization.
  • FIG. 2 shows, in order, the model to be compressed, the process of compressing the model through pruning and quantization, and the compressed model.
  • The weight parameter space first generated by the supernetwork is represented in float32 and is the first weight parameter space. After it is quantized, the second weight parameter space represented in 4 bits is obtained; model pruning is then performed in the second weight parameter space, and the compressed model is finally obtained, where the dotted lines in the compressed model denote the pruned-away parts.
  • FIG. 3 is a schematic structural diagram of a model compression system fusing pruning and quantization provided by an embodiment of this application. As shown in FIG. 3, the model compression system fusing pruning and quantization in this embodiment mainly includes: a supernetwork generation module, a training module, a quantization module, and a pruning module.
  • The supernetwork generation module is used to generate a supernetwork according to the model to be compressed.
  • The training module is used to train the supernetwork to generate the first weight parameter space of the model to be compressed, where the first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed.
  • The quantization module is used to quantize the first weight parameter space to form the second weight parameter space, where the second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model.
  • The pruning module is used to perform model pruning and accuracy evaluation of the compressed model in the second weight parameter space and obtain the optimal compressed model.
  • The system also includes: a compression dimension definition module and a compression manner determination module.
  • The compression dimension definition module is used to define the dimensions of model compression, the dimensions including a structural dimension and a parameter space dimension; the compression manner determination module is used to determine the manner of model compression according to the dimensions.
  • The pruning module includes: a search unit, an accuracy evaluation unit, a pruned model structure determination unit, and an optimal pruned model determination unit.
  • The search unit is used to search the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints.
  • The accuracy evaluation unit is used to evaluate the accuracy of each pruned model.
  • The pruned model structure determination unit is used to determine, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization.
  • The optimal pruned model determination unit is used to determine the optimal pruned model according to the pruned model structure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A model compression method and system fusing pruning and quantization. The method includes: generating a supernetwork based on the model to be compressed (S3); training the supernetwork to generate a first weight parameter space of the model to be compressed (S4); quantizing the first weight parameter space to form a second weight parameter space (S5); and performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model (S6). The system includes: a supernetwork generation module, a training module, a quantization module, and a pruning module. With this method and system, model pruning and model quantization can be fused, the significant influence of different model structures on the quantization result during low-bit quantization can be handled effectively, and a deeply compressed model that is adaptively optimized in both the pruning and quantization dimensions is finally obtained.

Description

Model compression method and system fusing pruning and quantization
This application claims priority to the Chinese patent application filed with the China Patent Office on June 18, 2020, with application number CN202010558278.6 and invention title "Model compression method and system fusing pruning and quantization", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of deep compression of neural network models, and in particular to a model compression method and system fusing pruning and quantization.
Background
With the development of deep learning technology, neural network models have been designed to be increasingly complex, which makes them difficult to deploy on hardware platforms or mobile devices with limited memory, bandwidth, and other resources. Moreover, for real-time applications such as online learning, incremental learning, and autonomous driving, complex models with tens of millions or even hundreds of millions of parameters and operations can hardly meet real-time requirements. Model compression has therefore become an important goal for accelerating deep learning applications. How to compress a model so as to effectively reduce parameter redundancy, storage footprint, communication bandwidth, and computational complexity, while reducing latency in the model application stage, is a key technical problem for accelerating the deployment and development of deep learning applications.
Current CNN (Convolutional Neural Network) model compression methods usually adopt pruning and quantization. Specifically, model pruning is first completed in conjunction with the full-precision parameter space, and low-bit (binary digit) quantization is then applied to the fixed, pruned model parameters, thereby achieving model compression.
However, in current CNN model compression methods, pruning and quantization are performed separately and sequentially, so the compression process is relatively independent in its two stages and the interaction between quantization and the model structure is not considered. In some cases, the channel settings of specific layers in the model greatly affect the quantization result, which limits the model compression space and in turn leads to insufficient compression accuracy and a suboptimal compression effect.
Summary
This application provides a model compression method and system fusing pruning and quantization, to solve the problem that prior-art compression methods yield insufficient model compression accuracy and a suboptimal compression effect.
To solve the above technical problem, the embodiments of this application disclose the following technical solutions:
A model compression method fusing pruning and quantization, the method comprising:
generating a supernetwork based on the model to be compressed;
training the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed;
quantizing the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model;
performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
Optionally, performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model includes:
searching the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints;
evaluating the accuracy of each of the pruned models;
determining, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization;
determining the optimal pruned model according to the pruned model structure.
Optionally, the constraints include: computation amount and latency.
Optionally, the method of searching the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints is specifically:
according to the set constraints, using an AutoML-based automated search method to search the second weight parameter space for all pruned models that satisfy the constraints.
Optionally, before generating the supernetwork based on the model to be compressed, the method further includes:
defining the dimensions of model compression, the dimensions including: a structural dimension and a parameter space dimension;
determining the manner of model compression according to the dimensions.
Optionally, determining the manner of model compression according to the dimensions includes:
compressing the model in the structural dimension by means of model pruning;
compressing the model in the parameter space dimension by means of quantization.
Optionally, the model to be compressed includes: a CNN model, an object detection model, and a natural language processing model.
A model compression system fusing pruning and quantization, the system comprising:
a supernetwork generation module, configured to generate a supernetwork according to the model to be compressed;
a training module, configured to train the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed;
a quantization module, configured to quantize the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model;
a pruning module, configured to perform model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
Optionally, the system further includes:
a compression dimension definition module, configured to define the dimensions of model compression, the dimensions including: a structural dimension and a parameter space dimension;
a compression manner determination module, configured to determine the manner of model compression according to the dimensions.
Optionally, the pruning module includes:
a search unit, configured to search the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints;
an accuracy evaluation unit, configured to evaluate the accuracy of each of the pruned models;
a pruned model structure determination unit, configured to determine, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization;
an optimal pruned model determination unit, configured to determine the optimal pruned model according to the pruned model structure.
The technical solutions provided by the embodiments of this application may include the following beneficial effects:
This application provides a model compression method fusing pruning and quantization. The method first generates a supernetwork based on the model to be compressed, then trains the supernetwork to generate the first weight parameter space of the model to be compressed, then quantizes the first weight parameter space to form the second weight parameter space, and finally performs model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model. In this embodiment, the first weight parameter space is represented in float32 and the second weight parameter space is represented in low-bit form. By adding the low-bit quantization constraint to model pruning, and performing the search for pruned models and their accuracy evaluation in the low-bit-quantized parameter space, the pruned model structure whose per-layer channel settings match low-bit quantization can be determined. This effectively combines model pruning with model quantization, yields a deeply compressed model that is adaptively optimized in both the pruning and quantization dimensions, and helps improve both the accuracy and the overall effect of model compression.
In addition, this embodiment adopts an AutoML-based automated search method for model pruning, so the search space is more flexible and channel-level pruning can be achieved, which helps further improve the accuracy of model compression.
This application also provides a model compression system fusing pruning and quantization. The system mainly includes: a supernetwork generation module, a training module, a quantization module, and a pruning module. The first weight parameter space is generated by the training module, the second weight parameter space is generated by the quantization module, and the pruning module then performs model pruning and accuracy evaluation of the compressed model in the low-bit second weight parameter space, finally obtaining the optimal compressed model. Through these four modules, this embodiment fuses model pruning with model quantization and fully accounts for the significant influence of different model structures on the quantization result during low-bit quantization, for example the differing effects of each layer's channel setting on the quantization result, finally obtaining a deeply compressed model that is adaptively optimized in both the pruning and quantization dimensions.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a model compression method fusing pruning and quantization provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the model compression principle when the method of this embodiment is applied with 4-bit quantization;
FIG. 3 is a schematic structural diagram of a model compression system fusing pruning and quantization provided by an embodiment of this application.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
For a better understanding of this application, its implementations are explained in detail below with reference to the accompanying drawings.
Embodiment 1
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a model compression method fusing pruning and quantization provided by an embodiment of this application. As shown in FIG. 1, the model compression method fusing pruning and quantization in this embodiment mainly includes the following steps:
S3: Generate a supernetwork based on the model to be compressed.
The supernetwork is usually a supernetwork represented in float32 (32-bit floating point) full precision. The model to be compressed in this embodiment includes: a CNN model, an object detection model, and a natural language processing model. The following description mainly takes the CNN model as an example.
S4: Train the supernetwork to generate a first weight parameter space of the model to be compressed.
The first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed.
The model compression method adopted in this embodiment includes model pruning and model quantization, and fuses the two. After the supernetwork is generated, it is trained in step S4 to generate the first weight parameter space. The first weight parameter space being represented in float32 means that the weight values of the neural network are represented in float32, whose bit width is larger than that of the second weight parameter space. Both the first and second weight parameters are used to evaluate model accuracy; the first weight parameters are used to evaluate the accuracy of the model to be compressed.
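The patent does not disclose how the supernetwork is built or trained. As a minimal sketch of one plausible reading, the weight-sharing (slimmable) supernetwork below keeps a single float32 weight tensor per layer, standing in for the first weight parameter space, and slices it to realize any candidate channel width; all class names, layer sizes, and candidate widths are illustrative assumptions, not taken from the patent.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableConv2d(nn.Conv2d):
    """Conv layer that can run with any number of active output channels
    by slicing one shared float32 weight tensor."""

    def forward(self, x, out_channels):
        in_channels = x.shape[1]
        weight = self.weight[:out_channels, :in_channels]
        bias = self.bias[:out_channels] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding)


class SuperNet(nn.Module):
    """Supernetwork whose shared float32 weights form the first weight
    parameter space; `widths` selects one candidate (pruned) sub-network."""

    def __init__(self, num_classes=10, max_widths=(32, 64)):
        super().__init__()
        self.conv1 = SlimmableConv2d(3, max_widths[0], 3, padding=1)
        self.conv2 = SlimmableConv2d(max_widths[0], max_widths[1], 3, padding=1)
        self.head = nn.Linear(max_widths[1], num_classes)

    def forward(self, x, widths):
        x = F.relu(self.conv1(x, widths[0]))
        x = F.relu(self.conv2(x, widths[1]))
        x = x.mean(dim=(2, 3))  # global average pooling
        # Slice the classifier weights to the active channel count.
        return x @ self.head.weight[:, :widths[1]].t() + self.head.bias


def train_supernet(net, loader, steps, candidate_widths=((16, 32), (32, 64))):
    """One-shot training: sample a random sub-network each step so every
    slice of the shared weight space gets trained."""
    opt = torch.optim.SGD(net.parameters(), lr=0.05, momentum=0.9)
    data = iter(loader)
    for _ in range(steps):
        images, labels = next(data)
        widths = tuple(random.choice(c) for c in candidate_widths)
        loss = F.cross_entropy(net(images, widths), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```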
After the first weight parameter space is obtained, step S5 is executed: quantize the first weight parameter space to form a second weight parameter space.
The second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model.
By quantizing the first weight parameter space, a second weight parameter space with a lower bit width is obtained. The quantized supernetwork can generate weight parameters represented in a low-bit range, i.e., the second weight parameters, replacing the float32 full-precision model parameter space with a low-bit parameter space and providing the conditions for subsequent model compression.
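The patent likewise does not specify a quantizer. The sketch below uses uniform symmetric per-tensor quantization, one common choice, to turn float32 weights into a low-bit second weight parameter space; `quantize_weights` is a hypothetical helper, not an API named in the patent.

```python
import torch


def quantize_weights(w: torch.Tensor, num_bits: int = 4):
    """Uniform symmetric per-tensor quantization: map float32 weights onto
    signed integer levels and return both the integer codes and their
    dequantized float values (a stand-in for the second weight parameters)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 7 for 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax   # guard against all-zero w
    codes = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return codes.to(torch.int8), codes * scale


# Forming the second weight parameter space from the trained supernetwork:
# for p in supernet.parameters():
#     _, p.data = quantize_weights(p.data, num_bits=4)
```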
Continuing with FIG. 1, after the second weight parameter space is obtained, step S6 is executed: perform model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
Specifically, step S6 includes the following steps:
S61: According to set constraints, search the second weight parameter space for all pruned models that satisfy the constraints.
The constraints in this embodiment include: computation amount and latency. That is, model compression can be performed under a set computation budget and a set latency constraint, finally compressing the computation amount of the model to be compressed down to the set budget and its latency down to the set latency.
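As a sketch of how such constraints might be checked, the helper below reads "computation amount" as the multiply-accumulate count of the two convolutions in the toy supernetwork sketched above, and probes latency with a single timed forward pass. A real measurement would average many warmed-up runs; all budgets, shapes, and function names here are assumptions.

```python
import time

import torch


def conv_flops(c_in, c_out, k, h, w):
    """Multiply-accumulate count of one k x k convolution over an
    h x w output feature map."""
    return c_in * c_out * k * k * h * w


def meets_constraints(net, widths, flops_budget, latency_budget_s,
                      input_shape=(1, 3, 32, 32)):
    h, w = input_shape[2], input_shape[3]  # padding=1 preserves h and w
    flops = (conv_flops(3, widths[0], 3, h, w)
             + conv_flops(widths[0], widths[1], 3, h, w))
    if flops > flops_budget:
        return False
    x = torch.randn(input_shape)
    start = time.perf_counter()
    with torch.no_grad():
        net(x, widths)  # one timed forward pass as a crude latency probe
    return time.perf_counter() - start <= latency_budget_s
```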
Specifically, step S61 may be carried out as follows:
according to the set constraints, an AutoML-based automated search method is used to search the second weight parameter space for all pruned models that satisfy the constraints.
With an AutoML-based automated search method, for a given model, the pruning dimension of each layer in the deep learning model can be set flexibly, so that pruned and optimized models satisfying the conditions can be found as exhaustively as possible within the given search space. This search method therefore has a more flexible search space, can achieve channel-level pruning, and helps further improve the accuracy of model compression.
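The patent names an AutoML-style automated search but no concrete algorithm. For the tiny search space of the sketches above, exhaustive enumeration of per-layer channel choices is enough to stand in for it; a real AutoML search would more likely use reinforcement learning or an evolutionary strategy.

```python
import itertools


def search_pruned_models(net, candidate_widths, flops_budget, latency_budget_s):
    """Enumerate per-layer channel choices (step S61) and keep every pruned
    model that satisfies the set constraints."""
    feasible = []
    for widths in itertools.product(*candidate_widths):
        if meets_constraints(net, widths, flops_budget, latency_budget_s):
            feasible.append(widths)
    return feasible
```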
After the pruned models satisfying the constraints are found, step S62 is executed: evaluate the accuracy of each pruned model.
In this embodiment, each time a pruned model satisfying the constraints is found, its accuracy can be evaluated, until all pruned models satisfying the constraints have been searched.
After all pruned models satisfying the constraints have been searched and their accuracy evaluated one by one, the accuracy evaluation results are obtained, and step S63 is executed: according to the accuracy evaluation results of all pruned models, determine the pruned model structure whose per-layer channel settings match low-bit quantization.
This embodiment fuses automated pruning with quantization, which can effectively address the influence of the channel-count settings in the model structure on the quantization result, thereby avoiding the problem that, when pruning and quantization are performed independently, the fixed model obtained after pruning does not fit the quantization space. A pruned model structure whose per-layer channel settings suit low-bit quantization is obtained, which helps improve the accuracy of the compressed model and the performance of the model.
S64: Determine the optimal pruned model according to the pruned model structure.
As can be seen from steps S61-S64, under specific constraints such as computation amount and latency, all possible pruned models are searched in the search space, the accuracy of each pruned model is evaluated with the second weight parameters, and the optimal pruned model meeting the requirements is finally obtained based on the accuracy evaluation results.
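A minimal sketch of steps S62-S64, under the same assumptions as the sketches above: each feasible pruned model is scored with the quantized (second) weight parameters, and the most accurate one is returned as the optimal pruned model.

```python
import torch


@torch.no_grad()
def accuracy(net, widths, loader):
    """Top-1 accuracy of one pruned model, evaluated with the quantized
    weights held by `net` (step S62)."""
    correct = total = 0
    for images, labels in loader:
        predictions = net(images, widths).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / total


def select_optimal_model(feasible, quantized_net, val_loader):
    """Pick the pruned structure that best matches low-bit quantization
    (steps S63-S64)."""
    scores = {widths: accuracy(quantized_net, widths, val_loader)
              for widths in feasible}
    return max(scores, key=scores.get)
```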
Further, this embodiment also includes steps S1 and S2 before step S3. Step S1: Define the dimensions of model compression; the dimensions include a structural dimension and a parameter space dimension.
That is, the model compression dimensions are defined, and the model is compressed in both the structural dimension and the parameter space dimension.
S2: Determine the manner of model compression according to the dimensions.
Specifically, step S2 includes the following steps:
S21: Compress the model in the structural dimension by means of model pruning.
Compression of the model's structural dimension is achieved through model pruning, i.e., a specific number of channels in each layer of the CNN model are pruned away.
S22: Compress the model in the parameter space dimension by means of quantization.
Compression of the model's parameter space dimension is achieved through model quantization, i.e., the parameter space represented in float32 is quantized down to a low-bit representation.
For a schematic diagram of the principle of the model compression method fusing pruning and quantization in this embodiment, refer to FIG. 2, where the low-bit case is exemplified by 4-bit quantization. FIG. 2 shows, in order, the model to be compressed, the process of compressing the model through pruning and quantization, and the compressed model. The weight parameter space first generated by the supernetwork is represented in float32 and is the first weight parameter space. After the float32 weight parameter space is quantized, the second weight parameter space represented in 4 bits is obtained; model pruning is then performed in the second weight parameter space, and the compressed model is finally obtained, where the dotted lines in the compressed model denote the pruned-away parts.
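As a concrete illustration of the 4-bit case in FIG. 2, the quantizer sketched earlier maps a handful of example float32 weights (illustrative numbers, not taken from the patent) onto 16 signed levels with scale 0.91 / 7 = 0.13:

```python
import torch

w = torch.tensor([0.42, -0.07, 0.91, -0.88])
codes, dequantized = quantize_weights(w, num_bits=4)
print(codes)        # tensor([ 3, -1,  7, -7], dtype=torch.int8)
print(dequantized)  # tensor([ 0.3900, -0.1300,  0.9100, -0.9100])
```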
Embodiment 2
On the basis of the embodiments shown in FIG. 1 and FIG. 2, referring to FIG. 3, FIG. 3 is a schematic structural diagram of a model compression system fusing pruning and quantization provided by an embodiment of this application. As shown in FIG. 3, the model compression system fusing pruning and quantization in this embodiment mainly includes: a supernetwork generation module, a training module, a quantization module, and a pruning module.
The supernetwork generation module is configured to generate a supernetwork according to the model to be compressed. The training module is configured to train the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed. The quantization module is configured to quantize the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model. The pruning module is configured to perform model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
Further, the system also includes: a compression dimension definition module and a compression manner determination module. The compression dimension definition module is configured to define the dimensions of model compression, the dimensions including a structural dimension and a parameter space dimension; the compression manner determination module is configured to determine the manner of model compression according to the dimensions.
The pruning module includes: a search unit, an accuracy evaluation unit, a pruned model structure determination unit, and an optimal pruned model determination unit. The search unit is configured to search the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints. The accuracy evaluation unit is configured to evaluate the accuracy of each pruned model. The pruned model structure determination unit is configured to determine, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization. The optimal pruned model determination unit is configured to determine the optimal pruned model according to the pruned model structure.
The working principle and operation of the model compression system fusing pruning and quantization in this embodiment have been described in detail in the embodiments shown in FIG. 1 and FIG. 2, and are not repeated here.
The above are only specific implementations of this application, enabling those skilled in the art to understand or implement it. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application shall not be limited to the embodiments shown herein, but shall accord with the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A model compression method fusing pruning and quantization, characterized in that the method comprises:
    generating a supernetwork based on the model to be compressed;
    training the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, the first weight parameter space contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed;
    quantizing the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, the second weight parameter space contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model;
    performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
  2. The model compression method fusing pruning and quantization according to claim 1, characterized in that performing model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model comprises:
    searching the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints;
    evaluating the accuracy of each of the pruned models;
    determining, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization;
    determining the optimal pruned model according to the pruned model structure.
  3. The model compression method fusing pruning and quantization according to claim 2, characterized in that the constraints include: computation amount and latency.
  4. The model compression method fusing pruning and quantization according to claim 2, characterized in that the method of searching the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints is specifically:
    according to the set constraints, using an AutoML-based automated search method to search the second weight parameter space for all pruned models that satisfy the constraints.
  5. The model compression method fusing pruning and quantization according to claim 1, characterized in that before generating the supernetwork based on the model to be compressed, the method further comprises:
    defining the dimensions of model compression, the dimensions including: a structural dimension and a parameter space dimension;
    determining the manner of model compression according to the dimensions.
  6. The model compression method fusing pruning and quantization according to claim 5, characterized in that determining the manner of model compression according to the dimensions comprises:
    compressing the model in the structural dimension by means of model pruning;
    compressing the model in the parameter space dimension by means of quantization.
  7. The model compression method fusing pruning and quantization according to any one of claims 1-6, characterized in that the model to be compressed includes: a CNN model, an object detection model, and a natural language processing model.
  8. A model compression system fusing pruning and quantization, characterized in that the system comprises:
    a supernetwork generation module, configured to generate a supernetwork according to the model to be compressed;
    a training module, configured to train the supernetwork to generate a first weight parameter space of the model to be compressed, wherein the first weight parameter space is represented in float32, the first weight parameter space contains a plurality of first weight parameters, and the first weight parameters are used to evaluate the accuracy of the model to be compressed;
    a quantization module, configured to quantize the first weight parameter space to form a second weight parameter space, wherein the second weight parameter space is represented in low-bit form, the second weight parameter space contains a plurality of second weight parameters, and the second weight parameters are used to evaluate the accuracy of the compressed model;
    a pruning module, configured to perform model pruning and accuracy evaluation of the compressed model in the second weight parameter space to obtain the optimal compressed model.
  9. The model compression system fusing pruning and quantization according to claim 8, characterized in that the system further comprises:
    a compression dimension definition module, configured to define the dimensions of model compression, the dimensions including: a structural dimension and a parameter space dimension;
    a compression manner determination module, configured to determine the manner of model compression according to the dimensions.
  10. The model compression system fusing pruning and quantization according to claim 8, characterized in that the pruning module comprises:
    a search unit, configured to search the second weight parameter space, according to set constraints, for all pruned models that satisfy the constraints;
    an accuracy evaluation unit, configured to evaluate the accuracy of each of the pruned models;
    a pruned model structure determination unit, configured to determine, according to the accuracy evaluation results of all pruned models, the pruned model structure whose per-layer channel settings match low-bit quantization;
    an optimal pruned model determination unit, configured to determine the optimal pruned model according to the pruned model structure.
PCT/CN2021/076975 2020-06-18 2021-02-20 Model compression method and system fusing pruning and quantization WO2021253857A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010558278.6A 2020-06-18 Model compression method and system fusing pruning and quantization
CN202010558278.6 2020-06-18

Publications (1)

Publication Number Publication Date
WO2021253857A1 (zh)

Family

ID=72986272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/076975 WO2021253857A1 (zh) Model compression method and system fusing pruning and quantization

Country Status (2)

Country Link
CN (1) CN111860770A (zh)
WO (1) WO2021253857A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860770A (zh) 2020-06-18 2020-10-30 苏州浪潮智能科技有限公司 Model compression method and system fusing pruning and quantization
CN112950221B (zh) * 2021-03-26 2022-07-26 支付宝(杭州)信息技术有限公司 Method and apparatus for building a risk control model, and risk control method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635936A (zh) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 Neural network pruning and quantization method based on retraining
CN110222820A (zh) * 2019-05-28 2019-09-10 东南大学 Convolutional neural network compression method based on weight pruning and quantization
CN110782396A (zh) * 2019-11-25 2020-02-11 武汉大学 Lightweight image super-resolution reconstruction network and reconstruction method
CN111160524A (zh) * 2019-12-16 2020-05-15 北京时代民芯科技有限公司 Two-stage convolutional neural network model compression method
CN111860770A (zh) * 2020-06-18 2020-10-30 苏州浪潮智能科技有限公司 Model compression method and system fusing pruning and quantization


Also Published As

Publication number Publication date
CN111860770A (zh) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2021253857A1 (zh) Model compression method and system fusing pruning and quantization
DE102018010463B3 (de) Portable device, computer-readable storage medium, method and apparatus for energy-efficient and low-power distributed automatic speech recognition
CN111612147A (zh) Quantization method for deep convolutional networks
US11403528B2 (en) Self-tuning incremental model compression solution in deep neural network with guaranteed accuracy performance
DE602004003512T2 (de) Compression of Gaussian models
CN111260022A (zh) Method for full-int8 fixed-point quantization of convolutional neural networks
CN111275187A (zh) Compression method and apparatus for deep neural network models
CN111860771B (zh) Convolutional neural network computation method applied to edge computing
CN109978144B (zh) Model compression method and system
CN111178514A (zh) Neural network quantization method and system
CN110674924B (zh) Automatic quantization method and apparatus for deep learning inference
CN110837890A (zh) Fixed-point weight quantization method for lightweight convolutional neural networks
CN111126595A (zh) Method and device for model compression of neural networks
US20230252294A1 (en) Data processing method, apparatus, and device, and computer-readable storage medium
CN112861996A (zh) Deep neural network model compression method and apparatus, electronic device, and storage medium
CN112465140A (zh) Convolutional neural network model compression method based on grouped channel fusion
CN114528987A (zh) Edge-cloud collaborative computing partition and deployment method for neural networks
US10592799B1 (en) Determining FL value by using weighted quantization loss values to thereby quantize CNN parameters and feature values to be used for optimizing hardware applicable to mobile devices or compact networks with high precision
CN117521752A (zh) FPGA-based neural network acceleration method and system
CN116187420B (zh) Training method, system, device and medium for lightweight deep neural networks
CN117151178A (zh) FPGA-oriented CNN custom network quantization acceleration method
WO2023078051A1 (zh) Quantization-aware training method, apparatus, device, medium, and convolutional neural network
CN115310607A (zh) Attention-map-based pruning method for vision Transformer models
CN114742036A (zh) Combined model compression method and system for pre-trained language models
CN112116089A (zh) Deep learning network pruning method for video processing on resource-constrained devices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21826496

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21826496

Country of ref document: EP

Kind code of ref document: A1