CN114781615A - Two-stage quantization implementation method and device based on compressed neural network - Google Patents

Two-stage quantization implementation method and device based on compressed neural network

Info

Publication number
CN114781615A
Authority
CN
China
Prior art keywords
neural network
layer
stage
clustering
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210458582.2A
Other languages
Chinese (zh)
Inventor
杨文鑫
支小莉
童维勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210458582.2A priority Critical patent/CN114781615A/en
Publication of CN114781615A publication Critical patent/CN114781615A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a two-stage quantization implementation method and device based on a compressed neural network. The method comprises the following steps: (1) train the neural network model to convergence on a CPU or GPU according to the target task of the neural network; (2) set a target sparsity for each layer of the neural network according to the compression ratio required by the target task, and prune the weights stage by stage; (3) quantize the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, where the first stage is clustering, in which the weights of each layer are clustered with a clustering algorithm, and the second stage is scaling, in which the obtained cluster centers are scaled to fixed-point numbers according to the preset quantization bit width, and the final result stores the weight matrix of each layer in triplet form. The method reduces the computation and storage costs required by the model, improves the model's inference speed, and effectively compensates for the accuracy loss of conventional linear quantization at low bit widths.

Description

Two-stage quantization implementation method and device based on compressed neural network
Technical Field
The invention relates to the technical field of neural networks, in particular to a two-stage quantization implementation method and device based on a compressed neural network.
Background
Deep neural networks achieve strong discriminative performance in complex applications such as image classification, object detection, speech synthesis and semantic segmentation. However, these models incur significant computational and storage costs, which makes them difficult to deploy on edge devices. For projects that can reach powerful computing resources over a network connection, deploying a large neural network may not strain resources. For edge computing on embedded hardware platforms, however, inference must be performed locally or at the network edge, mainly for reasons of security, privacy and latency (e.g., smart sensors, wearable devices, autonomous driving and drone tracking), and such computation is subject to strict area and power constraints, so the available computational resources are limited.
To address the computational and storage costs of neural networks, researchers have proposed compressing and quantizing them. Clustering is a common neural-network compression technique aimed mainly at compressing the model as much as possible: researchers store cluster labels in the network's weight matrix to convert a floating-point weight matrix into fixed-point form, but the cluster labels cannot be used for computation, and the floating-point cluster centers are still used during network inference. The idea of converting floating-point weights into fixed-point weights by quantization was proposed in the 1990s, and a quantized neural network can use fixed-point arithmetic to accelerate inference. Linear quantization is a commonly used quantization method; it does not cause a significant drop in the inference accuracy of a neural network model at an 8-bit quantization bit width, but a model quantized below 6 bits with linear quantization loses its inference capability. Current edge devices such as FPGAs and ASICs can customize the bit width of adders, multipliers and memories so that a model can use low-bit-width weights for accelerated inference. Therefore, how to compress and quantize a model at the same time, and how to compensate for the accuracy loss of low-bit-width quantization, are difficulties in current research.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a two-stage quantization implementation method and device based on a compressed neural network, so as to improve the inference accuracy of the neural network model under low-bit-width quantization while still compressing the network.
To achieve this purpose, the technical solution of the invention is realized as follows:
a two-stage quantization implementation method based on a compressed neural network comprises the following steps:
S1, training the neural network model to convergence on a CPU or GPU according to the target task of the neural network.
S2, setting a target sparsity for each layer of the neural network according to the compression ratio required by the target task, and pruning the weights stage by stage.
S3, quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, wherein the first stage is clustering, in which the weights of each layer are clustered with a clustering algorithm; and the second stage is scaling, in which the obtained cluster centers are scaled to fixed-point numbers according to the preset quantization bit width, and the final result stores the weight matrix of each layer in triplet form.
Preferably, setting the target sparsity of each layer of the neural network according to the required compression ratio in step S2 specifically includes: setting a target sparsity for every layer of the neural network except the first layer, according to the compression ratio required by the target task; the target sparsity of a layer is determined by the type of the layer and by the depth at which it sits in the network.
Preferably, pruning the weights stage by stage in step S2 specifically includes: determining the number M of weights to be pruned in each layer at the current stage according to the layer's target sparsity, initial sparsity and a preset pruning frequency, where M is a positive integer; selecting the M unpruned weights with the smallest magnitudes as the weights to be pruned at the current stage; and pruning the weights of each layer of the neural network stage by stage.
Preferably, quantizing the weights of each layer of the neural network in two stages in step S3 specifically includes:
the first stage: determining the number of cluster centers according to the quantization bit width required by the target task, clustering the pruned weight matrix with a particle swarm clustering algorithm to obtain cluster centers and cluster labels, and storing the obtained cluster centers in the weight matrix of each layer of the neural network;
the second stage: quantizing the cluster centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, and storing the weight matrix of each layer in triplet form as the final result.
Preferably, the first stage specifically includes: obtaining the number of cluster centers X from the quantization bit width n preset by the target task, where X = 2^n - 1; and, since the number of cluster centers is known, obtaining initial cluster centers for each layer's weights with a particle swarm algorithm, and then clustering each layer's weights with the clustering algorithm to obtain the cluster centers and cluster labels.
Preferably, the second stage specifically includes: the input of the linear quantization algorithm is the X cluster centers of each layer of the neural network, so the whole weight matrix does not need to be input; and the weight matrix of each layer is stored in triplet form, recording for each weight its row, its column, and the cluster center obtained by clustering.
The invention also provides a two-stage quantization device based on a compressed neural network, comprising:
a training module: used to select a neural network model according to the target task and train it to convergence on a CPU or GPU;
a pruning module: used to set a target sparsity for each layer, for layers of different types and depths in the network, according to the compression ratio required by the neural network model, and to prune the weights of each layer stage by stage;
a clustering module: used to determine the number of cluster centers according to the quantization bit width required by the target task, cluster the weights of each layer of the pruned neural network model with a particle swarm clustering algorithm to obtain cluster centers and cluster labels, and store the cluster centers in the weight matrix of each layer of the neural network;
a quantization module: used to quantize the cluster centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, store the weight matrix of each layer in triplet form as the final result, and thereby quantize and compress the neural network.
Compared with the prior art, the invention has the following advantages:
the method and the device effectively reduce the storage cost and the calculation cost of the neural network model through pruning, particle swarm clustering and linear quantization. The invention provides a method for separating the quantization process into two stages, wherein the first stage uses a particle swarm clustering algorithm to cluster the weight of each layer of a neural network, and the second stage uses a linear quantization algorithm to scale the weight of each layer of the neural network to a fixed point number, so that the linear quantization only plays a scaling role, and the precision loss of the traditional linear quantization in the low-bit-width quantization process can be effectively compensated through the two-stage quantization of the weight of the neural network. And the pruned and quantized sparse matrix is compressed and stored in a triple form, so that the memory space occupied by the model can be effectively reduced.
Drawings
Fig. 1 is a flowchart of a two-stage quantization implementation method based on a compressed neural network according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a two-stage quantization implementation apparatus based on a compressed neural network according to an embodiment of the present invention.
Fig. 3 is a flowchart of a clustering algorithm in a two-stage quantization implementation method based on a compressed neural network according to an embodiment of the present invention.
Fig. 4 is a flowchart of a particle group algorithm in a two-stage quantization implementation method based on a compressed neural network according to an embodiment of the present invention.
Detailed Description
The technical solutions and features in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the embodiments described below are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
As shown in fig. 1, a two-stage quantization implementation method based on a compressed neural network provided by an embodiment of the present invention includes the following steps:
S1, training the neural network model to convergence on a CPU or GPU according to the target task of the neural network.
In this embodiment, the neural network model needs to be trained to convergence for the target task, so that the model has the performance and accuracy required by the target task before compression and quantization.
S2, setting a target sparsity for each layer of the neural network according to the compression ratio required by the target task, and pruning the weights stage by stage.
In this embodiment, the neural network model may contain multiple layers of different types. According to the compression ratio required by the target task, a separate target sparsity is set for every layer except the first layer.
In one possible implementation, layers of different types have different target sparsities. For example, the target sparsity of a fully connected layer is 0.8 and that of a convolutional layer is 0.6.
In one possible implementation, convolutional layers at different depths have different target sparsities. For example, the target sparsity of the first convolutional layer is 0.4 and that of the second convolutional layer is 0.5.
In this embodiment, pruning of the weights is divided into multiple stages, and the weights to be pruned at each stage are removed stage by stage.
In one possible implementation, the target sparsity of each layer differs from stage to stage. S_p is the target sparsity at stage p; S_0 is the initial sparsity, typically 0; S_f is the final target sparsity; Q is the preset total number of stages. The target sparsity of each layer at each stage is computed as follows:
[Formula image in the original filing: S_p expressed in terms of S_0, S_f, p and Q, raising the sparsity gradually from S_0 to S_f over the Q stages.]
In one possible implementation, at each stage each layer may or may not be pruned, according to the preset pruning frequency.
In one possible implementation, the number of weights to prune differs for each stage of each layer. The weights to be pruned at the current stage are obtained from the target sparsity, the initial sparsity, the preset pruning frequency and the portion of the layer's weights that has not yet been pruned.
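As an illustrative aid only (not part of the original filing), the following Python/NumPy sketch shows one way the staged magnitude pruning described above could be realized. The per-layer target sparsity, the number of stages and the cubic ramp of the schedule are assumptions chosen for the example; the filing only states that a staged formula is used.

```python
import numpy as np

def stage_sparsity(s0, sf, p, Q):
    # Assumed schedule: sparsity ramps from s0 to sf over Q stages.
    # The cubic exponent follows a common gradual-pruning schedule and
    # is an assumption; the filing does not reproduce the exact formula.
    return sf + (s0 - sf) * (1.0 - p / Q) ** 3

def prune_stage(weights, mask, target_sparsity):
    """Zero out the smallest-magnitude unpruned weights until the layer
    reaches the target sparsity for the current stage."""
    total = weights.size
    already_pruned = int(total - mask.sum())
    need_pruned = int(round(target_sparsity * total))
    m = need_pruned - already_pruned          # weights to prune this stage
    if m <= 0:
        return weights, mask
    alive = np.flatnonzero(mask.ravel())
    order = np.argsort(np.abs(weights.ravel()[alive]))
    flat_mask = mask.ravel().copy()
    flat_mask[alive[order[:m]]] = 0
    mask = flat_mask.reshape(mask.shape)
    return weights * mask, mask

# Example: prune one convolutional layer to the 0.6 target sparsity over 4 stages.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32, 3, 3)).astype(np.float32)
mask = np.ones_like(w)
for p in range(1, 5):
    sp = stage_sparsity(s0=0.0, sf=0.6, p=p, Q=4)
    w, mask = prune_stage(w, mask, sp)
print("final sparsity:", 1.0 - mask.mean())
```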
S3, quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, wherein the first stage is clustering, in which the weights of each layer are clustered with a clustering algorithm; and the second stage is scaling, in which the obtained cluster centers are scaled to fixed-point numbers according to the preset quantization bit width, and the final result stores the weight matrix of each layer in triplet form.
In this embodiment, the number of cluster centers is determined from the quantization bit width required by the target task, and the weights of the pruned neural network model are clustered to obtain cluster centers and cluster labels.
In one possible implementation, the weights of the neural network are to be quantized to n bits, and the number of cluster centers is 2^n - 1. For example, if the target task requires 8-bit quantization, the number of cluster centers is 255.
In this embodiment, the weights of each layer are clustered with a particle swarm clustering algorithm according to the number of cluster centers. This effectively speeds up the convergence of the clustering algorithm and also avoids the case where a conventional clustering algorithm falls into a local optimum and the final result fails to converge.
In one possible implementation, a particle swarm algorithm is applied to the weights of each layer to obtain the initial cluster centers for the clustering algorithm. Here t denotes the current iteration and t+1 the next one; X_i and V_i are the position and velocity of particle i; ω is the inertia factor; c_1 and c_2 are learning factors, typically 2; pb_i is the individual best position and gb the historical (global) best position. The update is computed as follows:
V_i(t+1) = ω·V_i(t) + c_1·random(0,1)·(pb_i(t) - X_i(t)) + c_2·random(0,1)·(gb - X_i(t))
X_i(t+1) = X_i(t) + V_i(t+1)
In one possible implementation, as shown in fig. 4, the iteration ends when the number of particle swarm iterations exceeds the maximum number of iterations or the average change of the centroid update is smaller than a preset threshold, and the initial cluster centers for the weights of each layer are output.
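For illustration, a minimal Python/NumPy sketch of using a particle swarm to search for the K initial cluster centers of one layer's weights is given below. The particle count, the fitness function (within-cluster sum of squared distances) and the stopping threshold are assumptions; the filing only specifies the velocity and position updates, the inertia factor and the learning factors.

```python
import numpy as np

def pso_initial_centers(weights, k, n_particles=20, max_iter=50,
                        omega=0.5, c1=2.0, c2=2.0, tol=1e-6, seed=0):
    """Search for k initial cluster centers of a flattened weight vector.
    Each particle encodes one candidate set of k centers; fitness is the
    within-cluster sum of squared distances (an assumed criterion)."""
    rng = np.random.default_rng(seed)
    w = weights.ravel()
    lo, hi = w.min(), w.max()

    def fitness(centers):
        d = np.abs(w[:, None] - centers[None, :])   # distance of each weight to each center
        return np.sum(d.min(axis=1) ** 2)

    x = rng.uniform(lo, hi, size=(n_particles, k))  # particle positions
    v = np.zeros_like(x)                            # particle velocities
    pb = x.copy()                                   # personal best positions
    pb_fit = np.array([fitness(p) for p in x])
    gb = pb[pb_fit.argmin()].copy()                 # global best position
    gb_fit = pb_fit.min()

    for _ in range(max_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = omega * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        x = x + v
        fit = np.array([fitness(p) for p in x])
        improved = fit < pb_fit
        pb[improved], pb_fit[improved] = x[improved], fit[improved]
        if pb_fit.min() < gb_fit:
            prev = gb.copy()
            gb, gb_fit = pb[pb_fit.argmin()].copy(), pb_fit.min()
            if np.mean(np.abs(gb - prev)) < tol:    # average centroid change small enough
                break
    return np.sort(gb)

# Usage sketch: 15 centers, as for a 4-bit bit width (2^4 - 1).
centers0 = pso_initial_centers(np.random.default_rng(1).normal(size=4096), k=15)
print(centers0[:5])
```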
In this embodiment, the linear quantization algorithm is split into two stages. The first stage is clustering, in which the weights of each layer of the neural network are clustered with a clustering algorithm; the second stage is scaling, in which the obtained cluster centers are scaled to fixed-point numbers according to the quantization bit width preset by the target task. Splitting the linear quantization algorithm into two stages compensates for the accuracy loss of the linear quantization process, so that the neural network model still reaches the accuracy required by the target task after being converted to fixed point.
In this embodiment, the weights of each layer are clustered from the obtained initial cluster centers of that layer, yielding the final cluster centers and the cluster labels of the layer's weights.
In one possible implementation, the initial cluster centers of each layer obtained by the particle swarm algorithm are used as the initial centers of the clustering algorithm, and the Euclidean distance d is used as the distance measure between weights, where (M_1, M_2, ..., M_K) is the set of centroids, C is the dimensionality of the input data and c is the current dimension; the data set is divided into K clusters and the centroids are updated iteratively, so that the points within each cluster move closer together and the points in different clusters move further apart. The computation is as follows:
d(Z_m, M_j) = sqrt( Σ_{c=1..C} (Z_mc - M_jc)^2 )
M_j = (1/|R_j|) · Σ_{x∈R_j} x
where Z_m denotes the mth data vector, M_j denotes the jth centroid vector, R_j denotes the set of samples assigned to the jth centroid, and x denotes each sample in the current centroid's set.
In one possible implementation, the iteration proceeds as shown in fig. 3 until the maximum number of iterations is reached or the centroids are no longer updated; the cluster centers and cluster labels of each layer's weights are then output, and the cluster centers are stored in the neural network weight matrix.
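A minimal sketch of this first-stage clustering, assuming standard Lloyd-style k-means iterations seeded with the centers produced by the particle swarm (the function names and the one-dimensional treatment of the weights are illustrative assumptions):

```python
import numpy as np

def kmeans_weights(weights, init_centers, max_iter=100):
    """Cluster a layer's (flattened) weights around the given initial
    centers; returns the final centers and a label per weight."""
    w = weights.ravel()
    centers = np.asarray(init_centers, dtype=np.float64).copy()
    labels = np.zeros(w.size, dtype=np.int64)
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        # Update step: each center becomes the mean of its members.
        new_centers = centers.copy()
        for j in range(centers.size):
            members = w[labels == j]
            if members.size:
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):   # centroids no longer move
            break
        centers = new_centers
    return centers, labels.reshape(weights.shape)

# Usage sketch: cluster the weights of one layer around 15 centers.
rng = np.random.default_rng(2)
w = rng.normal(size=(256, 128)).astype(np.float32)
init = np.linspace(w.min(), w.max(), 15)   # stands in for the particle swarm output
centers, labels = kmeans_weights(w, init)
w_clustered = centers[labels]              # every weight snaps to its cluster center
```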
In this embodiment, the cluster centers of each layer are quantized with a linear quantization algorithm according to the number of bits required by the target task.
In one possible implementation, the cluster centers are quantized to fixed-point numbers using the maximum value max and the minimum value min of the cluster centers stored in each layer's weight matrix and a fixed-point scaling range determined by the target task. The input of the linear quantization algorithm is the X cluster centers of each layer of the neural network, so the whole weight matrix does not need to be input, which effectively improves quantization efficiency. For example, with a target quantization bit width of 8 bits the fixed-point scaling range is -127 to 127. The computation is as follows:
[Formula images in the original filing: the scaling coefficient S is derived from the matrix minimum a, the maximum b and the bit width n, and the floating-point matrix R is mapped to fixed point using S and Z.]
where R denotes the floating-point matrix; a denotes the minimum and b the maximum of the current matrix; n is the quantization bit width of the target task; S denotes the scaling coefficient, and Z denotes a constant relating the values before and after quantization.
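A hedged sketch of the second (scaling) stage is shown below: only the cluster centers are passed through a min-max linear quantizer onto an n-bit fixed-point range. The exact rounding and zero-point handling are assumptions, since the formula images in the filing are not reproduced here.

```python
import numpy as np

def quantize_centers(centers, n_bits=8):
    """Scale the cluster centers of one layer to n-bit fixed-point values.
    Only the centers are quantized, never the whole weight matrix."""
    a, b = centers.min(), centers.max()        # min / max of the centers
    qmax = 2 ** (n_bits - 1) - 1               # e.g. 127 for 8 bits
    scale = (b - a) / (2 * qmax) if b > a else 1.0
    zero_point = np.round(-a / scale) - qmax   # assumed affine zero point
    q = np.clip(np.round(centers / scale) + zero_point, -qmax, qmax)
    return q.astype(np.int32), scale, zero_point

# Usage sketch: quantize 15 centers to a 4-bit range (-7..7).
centers = np.linspace(-0.8, 0.9, 15)
q_centers, scale, zp = quantize_centers(centers, n_bits=4)
dequant = (q_centers - zp) * scale             # approximate recovery for inference
```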
In this embodiment, the matrix of the pruned neural network is sparse; the quantized cluster centers are stored in the weight matrix of each layer, and the sparse weight matrix is finally stored as triplets, which improves the compression rate of the neural network model.
In one possible implementation, the weight matrix of each layer of the neural network is stored in triplet form, recording for each weight its row, its column, and the cluster center obtained by clustering.
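For illustration, a small sketch of the triplet storage described above, recording (row, column, quantized cluster center) for every unpruned weight; the dtypes and the two-dimensional layer shape are assumptions.

```python
import numpy as np

def to_triplets(weight_matrix, labels, q_centers):
    """Store a pruned layer as (row, col, quantized center) triplets.
    Zero (pruned) weights are simply not stored."""
    rows, cols = np.nonzero(weight_matrix)
    values = q_centers[labels[rows, cols]]     # each weight's quantized cluster center
    return np.stack([rows, cols, values], axis=1).astype(np.int32)

def from_triplets(triplets, shape, scale, zero_point):
    """Rebuild a dense, dequantized matrix from the stored triplets."""
    dense = np.zeros(shape, dtype=np.float32)
    r, c, v = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    dense[r, c] = (v - zero_point) * scale
    return dense
```

In an actual layer, `labels` would be the cluster labels produced in the first stage and `q_centers`, `scale`, `zero_point` the outputs of the second stage.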
As shown in fig. 2, an embodiment of the present invention further provides a two-stage quantization apparatus based on a compressed neural network, including:
the training module 21: and the neural network model is trained to be in a convergence state by using the CPU or the GPU according to the target task.
Pruning module 22: and the target sparsity of each layer is respectively set aiming at layers of different types and depths in the network according to the compression ratio required by the neural network model, and the weight of each layer is pruned stage by stage.
The clustering module 23: the method is used for determining the number of clustering centers according to the quantization bit width required by a target task, clustering the weight of the pruned neural network model to obtain the clustering centers and clustering labels, and storing the clustering centers in the weight matrix of each layer of the neural network.
The quantization module 24: and quantizing the clustering center of each layer by using a linear quantization algorithm according to the quantization bit width required by the target task, and storing the weight matrix of each layer in a triple form as a final result.
The invention provides a two-stage quantization implementation method and device based on a compressed neural network, which effectively reduce the computation cost and storage cost of the neural network. In view of the problems that the cluster labels stored in the weight matrix by conventional clustering methods only compress the network and cannot be used for actual computation, and that low-bit-width linear quantization destroys the model's inference capability, the invention proposes a two-stage quantization method combining clustering and linear quantization.
In the first stage, a particle swarm clustering algorithm clusters the weights of each layer of the neural network; in the second stage, a linear quantization algorithm scales the weights of each layer to fixed-point numbers, so that the linear quantization only plays a scaling role. Splitting the linear quantization algorithm into two stages effectively compensates for the accuracy loss of the linear quantization process, so that the neural network model still reaches the accuracy required by the target task after being converted to fixed point.
For the redundant weights of the neural network, the method uses multi-stage unstructured pruning, and the resulting sparse matrix is compressed and stored as triplets, reducing the computation cost and storage cost of the neural network model at the same time.
In the two-stage quantization implementation method and device based on a compressed neural network of this embodiment, the neural network model is trained to convergence on a CPU or GPU according to the target task; a target sparsity is set for each layer of the neural network according to the compression ratio required by the target task, and the weights are pruned stage by stage; the weights of each layer are quantized in two stages according to the quantization bit width preset by the target task, where the first stage is clustering, in which the weights of each layer are clustered with a clustering algorithm, and the second stage is scaling, in which the obtained cluster centers are scaled to fixed-point numbers and the weight matrix of each layer is finally stored in triplet form. The embodiment of the invention reduces the computation and storage costs required by the model, improves the model's inference speed, and effectively compensates for the accuracy loss of conventional linear quantization at low bit widths.
Finally, it should be noted that the foregoing is merely a preferred embodiment of the present invention and is not exhaustive; the described technical solutions are for illustration only and do not limit the scope of the present invention, and the terminology used herein was chosen in order to best explain the principles of the embodiments. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A two-stage quantization implementation method based on a compressed neural network is characterized by comprising the following steps:
s1, training the neural network model to convergence on a CPU or GPU according to the target task of the neural network;
s2, setting a target sparsity for each layer of the neural network according to the compression ratio required by the target task, and pruning the weights stage by stage;
s3, quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task; the first stage is clustering, in which the weights of each layer of the neural network are clustered with a clustering algorithm; and the second stage is scaling, in which the obtained cluster centers are scaled to fixed-point numbers according to the preset quantization bit width, and the final result stores the weight matrix of each layer in triplet form.
2. The two-stage quantization implementation method based on a compressed neural network according to claim 1, wherein setting the target sparsity of each layer of the neural network according to the required compression ratio in step S2 specifically includes: setting a target sparsity for every layer of the neural network except the first layer, according to the compression ratio required by the target task; the target sparsity of a layer is determined by the type of the layer and by the depth at which it sits in the network.
3. The two-stage quantization implementation method based on a compressed neural network according to claim 1, wherein pruning the weights stage by stage in step S2 specifically includes: determining the number M of weights to be pruned in each layer at the current stage according to the layer's target sparsity, initial sparsity and a preset pruning frequency, where M is a positive integer; selecting the M unpruned weights with the smallest magnitudes as the weights to be pruned at the current stage; and pruning the weights of each layer of the neural network stage by stage.
4. The two-stage quantization implementation method based on a compressed neural network according to claim 1, wherein quantizing the weights of each layer of the neural network in two stages in step S3 specifically includes:
the first stage: determining the number of cluster centers according to the quantization bit width required by the target task, clustering the pruned weight matrix with a particle swarm clustering algorithm to obtain cluster centers and cluster labels, and storing the obtained cluster centers in the weight matrix of each layer of the neural network;
the second stage: quantizing the cluster centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, and storing the weight matrix of each layer in triplet form as the final result.
5. The two-stage quantization implementation method based on a compressed neural network according to claim 4, wherein the first stage specifically includes: obtaining the number of cluster centers X from the quantization bit width n preset by the target task, where X = 2^n - 1; and, since the number of cluster centers is known, obtaining initial cluster centers for each layer's weights with a particle swarm algorithm, and then clustering each layer's weights with the clustering algorithm to obtain the cluster centers and cluster labels.
6. The two-stage quantization implementation method based on a compressed neural network according to claim 4, wherein the second stage specifically includes: the input of the linear quantization algorithm is the X cluster centers of each layer of the neural network, so the whole weight matrix does not need to be input; and the weight matrix of each layer is stored in triplet form, recording for each weight its row, its column, and the cluster center obtained by clustering.
7. A two-stage quantization apparatus based on a compressed neural network, comprising:
a training module: used to select a neural network model according to the target task and train it to convergence on a CPU or GPU;
a pruning module: used to set a target sparsity for each layer, for layers of different types and depths in the network, according to the compression ratio required by the neural network model, and to prune the weights of each layer stage by stage;
a clustering module: used to determine the number of cluster centers according to the quantization bit width required by the target task, cluster the weights of each layer of the pruned neural network model with a particle swarm clustering algorithm to obtain cluster centers and cluster labels, and store the cluster centers in the weight matrix of each layer of the neural network;
a quantization module: used to quantize the cluster centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, store the weight matrix of each layer in triplet form as the final result, and thereby quantize and compress the neural network.
CN202210458582.2A 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network Pending CN114781615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210458582.2A CN114781615A (en) 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210458582.2A CN114781615A (en) 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network

Publications (1)

Publication Number Publication Date
CN114781615A true CN114781615A (en) 2022-07-22

Family

ID=82433128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210458582.2A Pending CN114781615A (en) 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network

Country Status (1)

Country Link
CN (1) CN114781615A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357554A (en) * 2022-10-24 2022-11-18 浪潮电子信息产业股份有限公司 Graph neural network compression method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination