WO2022103291A1 - Method and system for quantizing a neural network - Google Patents

Method and system for quantizing a neural network Download PDF

Info

Publication number
WO2022103291A1
WO2022103291A1 PCT/RU2020/000601 RU2020000601W
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
vector
quantization parameters
quantized
basis
Prior art date
Application number
PCT/RU2020/000601
Other languages
English (en)
Inventor
Vladimir Maximovich CHIKIN
Kirill Igorevich SOLODSKIKH
Anna Dmitrievna TELEGINA
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/RU2020/000601 priority Critical patent/WO2022103291A1/fr
Priority to EP20851354.9A priority patent/EP4196919A1/fr
Priority to CN202080104047.6A priority patent/CN116472538A/zh
Publication of WO2022103291A1 publication Critical patent/WO2022103291A1/fr

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3082Vector coding
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70Type of the data to be coded, other than image and sound

Definitions

  • the present disclosure relates to a system and method for determining mixed-precision quantization parameters for a neural network.
  • Neural networks have delivered impressive results across a wide range of applications in recent years. This has led to widespread adoption across many different hardware platforms including mobile devices and embedded devices. In these types of devices hardware constraints may limit the usefulness of neural networks where high accuracy cannot be achieved efficiently.
  • Quantization methods may reduce the memory footprint and inference time in neural networks. Quantization compresses data in a neural network from large floating point representations to smaller fixed-point representations. Lower bit-width quantization permits greater optimization. However, lowering the bit width to too great an extent may reduce the accuracy too much.
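  • As a concrete illustration of this trade-off (a generic uniform-quantization sketch in Python, not the specific scheme claimed in this disclosure), the example below maps floating point weights onto a signed t-bit integer grid using a single scale factor; the rule used to choose the scale is an assumption made for the example only.

```python
import numpy as np

def quantize_uniform(w, t=8):
    """Map float values onto a signed t-bit integer grid with one scale factor.

    Generic illustrative sketch; not the mixed-precision method described here.
    """
    qmin, qmax = -2 ** (t - 1), 2 ** (t - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0        # assumed scale rule
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int32)
    return q, scale                                       # t-bit integers plus one float scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                   # approximate reconstruction

w = np.random.randn(1000).astype(np.float32)
q4, s4 = quantize_uniform(w, t=4)                         # aggressive 4-bit quantization
q8, s8 = quantize_uniform(w, t=8)                         # milder 8-bit quantization
print(np.max(np.abs(w - dequantize(q4, s4))))             # larger error at lower bit-width
print(np.max(np.abs(w - dequantize(q8, s8))))
```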
  • a method for determining mixed-precision quantization parameters to quantize a neural network is provided. The neural network comprises a plurality of layers, and each layer is associated with a weight vector comprising multiple floating point data values.
  • the neural network is trained on a training dataset and the weight vectors are selected to minimise a first loss function associated to the neural network.
  • the method comprises determining a vector of quantization parameters on the basis of a size of the weight vectors; and, for each one of multiple training vectors of the training dataset, evaluating a second loss function on the basis of the training vector and the vector of quantization parameters and modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function.
  • Each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.
  • the method according to the first aspect provides a general purpose method for determining quantization parameters to quantize a neural network to mixed-precision.
  • the method may be, for example, implemented or executed by one or more processors.
  • the data used for training (and later for inference, also referred to as operation) may be pictures, e.g. still picture or video pictures, or respective picture data, audio data, any other measured or captured physical data or numerical data.
  • the second loss function comprises the first loss function, a first regularization component and a second regularization component.
  • the first regularization component is selected to constrain the quantized weight vector for each layer to a pre-determined range of values.
  • the second regularization component is selected to constrain quantized input data values to each layer of the corresponding quantized neural network to a pre-determined range of values.
  • first and/or second regularization components comprise functions that depend continuously on the data values of the vector of quantization parameters.
  • modifying the weight vectors and the vector of quantization parameters comprises determining a local minimum of the second loss function.
  • the local minimum is determined according to a gradient descent method.
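  • A minimal sketch of this kind of joint gradient-descent update is shown below in PyTorch. The two-layer network, the sinusoidal form of the regularizer, and all hyperparameters (the coefficient lam_w, the learning rate, the bit-width initialisation) are assumptions chosen purely for illustration and are not the formulas claimed in this disclosure.

```python
import torch
import torch.nn as nn

# Illustrative regularizer: small when the scaled weights w / s lie near a signed
# t-bit integer grid. The sin^2 form follows the sinusoidal regularizers in the
# cited literature and is an assumption here.
def weight_regularizer(w, s, t):
    x = w / s
    on_grid = torch.sin(torch.pi * x) ** 2                      # pulls x toward integers
    in_range = torch.relu(x.abs() - (2.0 ** (t - 1) - 1)) ** 2  # pulls x into the t-bit range
    return torch.mean(on_grid + in_range)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
t_w = nn.Parameter(torch.full((2,), 6.0))   # trainable bit-width variables, one per weight layer
s_w = nn.Parameter(torch.full((2,), 0.1))   # trainable scale factors, one per weight layer
lam_w = 0.1                                 # assumed regularization coefficient
criterion = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(model.parameters()) + [t_w, s_w], lr=1e-2)

for step in range(100):                     # stand-in for iterating over the training dataset
    x = torch.randn(64, 16)                 # synthetic training vectors
    y = torch.randint(0, 4, (64,))
    loss = criterion(model(x), y)           # first (task) loss of the neural network
    for i, layer in enumerate([model[0], model[2]]):
        loss = loss + lam_w * weight_regularizer(layer.weight, s_w[i], t_w[i])
    opt.zero_grad()
    loss.backward()                         # gradients flow to weights, scales and bit-widths
    opt.step()                              # one gradient-descent step toward a local minimum
```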
  • the method comprises accessing a validation dataset; and evaluating, for each one of multiple validation vectors in the validation dataset, the quantized neural network on the basis of the quantization parameters.
  • determining the vector of quantization parameters on the basis of a size of the weight vectors comprises: determining a sum of the sizes of the multiple floating point data values of the weight vectors; and generating a parameter surface on the basis of the determination.
  • the parameter surface comprises a portion of an ellipsoidal surface.
  • the method comprises generating a set of size parameters; and selecting the first regularization component to train the quantization parameters to the set of size parameters.
  • the second regularization component is selected to constrain the quantized input data values to each layer of the corresponding quantized neural network to the set of bit-width parameters associated to the first regularization component.
  • the method comprises quantizing input data values to each layer of the corresponding quantized neural network to a pre-determined bit-width.
  • the method comprises determining a further vector of quantization parameters on the basis of a size of the input data values of each layer of the neural network, evaluating the second loss function on the basis of the further vector and modifying the further vector on the basis of the evaluation.
  • the method comprises performing inference based on an input to a neural network.
  • the fourteenth implementation form comprises determining quantization parameters for the neural network according to the method according to the first aspect, determining a quantized neural network corresponding to the neural network on the basis of the quantization parameters and evaluating the quantized neural network on the basis of the input to infer an output.
  • a method for operating a neural network comprising: obtaining data to be processed by the neural network; and processing the data by the neural network, wherein the neural network is configured by quantization parameters obtained or obtainable according to any of the methods described above and herein.
  • the processing of the neural network may comprise processing of pictures or other data for signal enhancement (e.g. picture enhancement, e.g. for super resolution), denoising (e.g. still or video picture denoising), speech and audio processing (e.g. natural language processing, NLP) or other purposes.
  • a computer program to perform the method according to the first or second aspect is provided.
  • a non-transitory computer readable medium comprises instructions that, when executed by a processor, cause the processor to perform the method according to the first or second aspect.
  • a computing system for determining mixed-precision quantization parameters to quantize a neural network comprises at least one processor and at least one memory including program code which, when executed by the at least one processor, provides instructions to determine a vector of quantization parameters on the basis of a size of the weight vectors; and, for each one of multiple training vectors of the training dataset: evaluate a second loss function on the basis of the training vector and the vector of quantization parameters; and modify the weight vectors and the vector of quantization parameters to minimize an output of the second loss function.
  • Each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.
  • Figure 1 shows a schematic diagram of a neural network, according to an example.
  • Figure 2 shows a diagram of a portion of a neural network, according to an example.
  • Figure 3 shows a graph of a regularizer function, according to an example.
  • Figure 4 shows a diagram of a parameter surface, according to an example.
  • Figure 5 shows a flow diagram of a method for determining quantization parameters, according to an example.
  • Figure 6 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision.
  • Figure 7 shows a diagram of mixed precision bit-widths for layers of a neural network, according to an example.
  • Figure 8 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision.
  • Figure 9 shows a diagram of mixed-precision bit-widths for layers of a neural network, according to an example.
  • Figure 10 shows a table comparing quantization methods for quantization of MobileNet_v2 on Imagenet.
  • Figure 11 shows a table of distribution of bit-widths, according to an example.
  • Figure 12 is a block diagram of a computing system that may be used for implementing the devices and methods disclosed herein.
  • Quantization methods may reduce the memory footprint and inference time in neural networks.
  • a neural network may comprise several computing blocks of different sizes, each of which can be quantized into any bit-width. By quantizing different blocks of a neural network into different bit-widths, neural networks with different degrees of compression, acceleration and quality may be achieved.
  • an inference model represented by a neural network may comprise three blocks that fill 20%, 30% and 50% of the model size respectively. All weights of the full precision model use a 32-bit floating point representation.
  • the accuracy of the full precision model may be equal to 99% for a compression ratio of 1
  • the accuracy of the full 8-bit model with all weights quantized to 8 bits may be equal to 98% with a compression ratio of 4
  • the accuracy of the model in which the first block (20%) is quantized to a 4-bit representation and the rest to an 8-bit representation may be equal to 97% with a compression ratio of 4.44
  • the accuracy of the model in which the second block (30%) is quantized to a 4-bit representation and the rest to 8-bit may be equal to 65% with a compression ratio of 4.71
  • the accuracy of the model in which the third block (50%) is quantized to a 4-bit representation and the rest to 8 bits may be equal to 55% with a compression ratio of 5.33.
  • a user that requires higher than 65% accuracy may select the quantized model from among the quantized models above that meets this requirement; the compression ratios quoted are reproduced in the sketch that follows.
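  • The compression ratios in the example above follow directly from the bit-widths; the short Python sketch below reproduces them (block fractions and bit assignments exactly as stated above).

```python
# A full-precision model stores every weight in 32 bits, so the compression
# ratio is 32 divided by the average number of quantized bits per weight.
def compression_ratio(fractions, bits, full_bits=32):
    avg_bits = sum(f * b for f, b in zip(fractions, bits))
    return full_bits / avg_bits

blocks = [0.2, 0.3, 0.5]                       # block sizes as fractions of the model
print(compression_ratio(blocks, [8, 8, 8]))    # 4.0   (all blocks 8-bit)
print(compression_ratio(blocks, [4, 8, 8]))    # ~4.44 (first block 4-bit)
print(compression_ratio(blocks, [8, 4, 8]))    # ~4.71 (second block 4-bit)
print(compression_ratio(blocks, [8, 8, 4]))    # ~5.33 (third block 4-bit)
```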
  • the problem of training the network, E_X[Loss(X, W)] -> min, may be reduced to solving the problem E_X[Loss(X, W^)] + Lambda_w * sum_i phi(W_i / s_w_i) + Lambda_a * sum_i phi(A_i / s_a_i) -> min (equation (1)) in the domain of definition of the parameters (W, s_w, s_a), for some numbers Lambda_w and Lambda_a.
  • in equation (1), X is the input data distribution; W^_i and A^_i are the quantized weights and quantized inputs to layers of the neural network, and s_w_i and s_a_i are the corresponding scale factors of the weights and inputs.
  • FIG. 1 is a simplified schematic diagram of a neural network architecture 100, according to an example.
  • the neural network architecture 100 may be used to train a modified neural network that is constructed from a neural network.
  • Each layer of the neural network may comprise a number of nodes which are interconnected with nodes of the preceding and subsequent layer of the neural network.
  • the nodes may be associated with data values referred to herein as weights.
  • Each layer may be associated with a weight vector comprising the vector of weights for the nodes of the layer.
  • the weights of the neural network are specified by floating-point data values.
  • training data from a training data set 110 is fed into the neural network architecture 100 as input data 120.
  • the neural network architecture 100 comprises a modified neural network 130, which is generated from the underlying neural network, according to the methods described herein.
  • the modified neural network 130 comprises a plurality of blocks of layers, where blocks are quantized to a specific size.
  • references to the “size” of a data value herein may refer to the bit-width of a data value.
  • the “model size” may refer to the sum of the bit-widths of weight vectors of a neural network.
  • a user specifies a required model size.
  • training the modified neural network 130 comprises initialising a set of trainable variables. These trainable variables optimize the sizes of quantized weights for the model size specified by the user.
  • the neural network architecture 100 comprises a regularization computation 140.
  • the regularization computation 140 is an accumulation of computations from each layer of the modified neural network 130. The computations are generated from evaluating functions in the family of regularizer functions φ previously described.
  • the block 150 comprises an output of the modified neural network 130.
  • Block 160 comprises a loss function computation that is determined from the output 150 of the modified neural network 130 and the regularization computation 140.
  • the parameters of the modified neural network 130, including the size parameters for quantizing weights are updated on the basis of the computation represented by block 160.
  • the data used for training may be pictures, e.g. still picture or video pictures, or respective picture data, audio data, any other measured or captured physical data or numerical data.
  • Figure 2 is a simplified schematic diagram 200 of an intermediate layer of the modified neural network 130 shown in Figure 1.
  • the block 210 shown in Figure 2 comprises input values to an intermediate layer 220 of the modified neural network 130.
  • the block 230 comprises a quantization layer that is applied to the input values 210.
  • Block 240 comprises a local computation of regularizer functions φ, which may be generated on the basis of the quantization parameters, the input data and the weight vector for the intermediate layer.
  • the block 250 comprises a global accumulator of regularizer terms which are input to the loss function computation 160 in Figure 1.
  • Figure 3 shows a graph of a regularizer function φ, according to an example.
  • the family of regularizer functions φ(x, t) is defined in such a way as to depend smoothly on a parameter t and, for a given t, to have minima on the corresponding grid of 2^t integer points.
  • Functions from this class are constructed in such a way that if φ(x, t) is close to 0, then the components of x are close to a grid of integers from the segment [-2^(t-1), 2^(t-1) - 1] in the case of φ_int, or alternatively from the segment [0, 2^t - 1] in the case of φ_uint.
  • Examples of the functions may be defined as follows:
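  • Purely as an illustration of functions with the stated property (small exactly when the components of x lie on the signed or unsigned t-bit integer grid), a sinusoidal construction in the spirit of the cited SinReQ and WaveQ regularizers is sketched below in Python; the exact functions defined in this disclosure may differ.

```python
import numpy as np

# Illustrative only: one plausible smooth construction with the stated property.
def phi_int(x, t):
    lo, hi = -2 ** (t - 1), 2 ** (t - 1) - 1                    # signed t-bit grid
    outside = np.maximum(0.0, x - hi) ** 2 + np.maximum(0.0, lo - x) ** 2
    return np.mean(np.sin(np.pi * x) ** 2 + outside)            # ~0 only on the grid

def phi_uint(x, t):
    hi = 2 ** t - 1                                             # unsigned t-bit grid
    outside = np.maximum(0.0, x - hi) ** 2 + np.maximum(0.0, -x) ** 2
    return np.mean(np.sin(np.pi * x) ** 2 + outside)

print(phi_int(np.array([-8.0, -1.0, 0.0, 3.0, 7.0]), t=4))        # ~0: on the 4-bit signed grid
print(phi_int(np.array([-8.0, -1.0, 0.0, 3.0, 7.0]) + 0.5, t=4))  # > 0: between grid points
print(phi_uint(np.array([0.0, 3.0, 15.0]), t=4))                  # ~0: on the 4-bit unsigned grid
```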
  • the loss function computation 160 for the neural network architecture 100 comprises minimising a loss function, referred to herein as equation (2), that combines the loss of the modified neural network with the accumulated regularizer terms.
  • without an explicit constraint on the total model size, minimising this loss may allow the bit-widths of all layers to become large.
  • N - 1 independent variables in [0, 1] may be defined that parameterize the first quadrant (x_i > 0) of the surface of the ellipsoid defined by the constraint on the total model size.
  • Figure 4 shows a diagram of a quadrant 400 of an ellipsoid parameterized by these variables, which are related to the variables x_i via corresponding equations.
  • a sinusoidal regularizer defined by sin^2(π x_i^2) may be added to the loss function in equation (2).
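  • One way to make the preceding description concrete, under assumptions that go beyond what is stated in this text, is sketched below in Python: each layer's bit-width is written as b_i = x_i^2, so a fixed total weight-memory budget sum_i n_i * b_i = B becomes the ellipsoid sum_i n_i * x_i^2 = B, whose first quadrant is parameterized by N - 1 variables in [0, 1] via generalized spherical coordinates; a sin^2(π b_i) term then attracts the resulting bit-widths toward integers. The b_i = x_i^2 relation and the exact parameterization are assumptions for this illustration only.

```python
import numpy as np

# Assumption-laden sketch (see lead-in): bit-widths b_i = x_i**2 on the ellipsoid
# sum_i n_i * x_i**2 = B, parameterized by N-1 variables u_j in [0, 1].
def bitwidths_from_angles(u, layer_sizes, budget_bits):
    n = np.asarray(layer_sizes, dtype=np.float64)
    theta = np.asarray(u, dtype=np.float64) * (np.pi / 2)   # map [0, 1] -> [0, pi/2]
    d = np.ones(len(n))                                     # direction in the first quadrant
    for j, th in enumerate(theta):                          # generalized spherical coordinates
        d[j] *= np.cos(th)
        d[j + 1:] *= np.sin(th)
    x = d * np.sqrt(budget_bits / np.sum(n * d ** 2))       # rescale onto the ellipsoid
    return x ** 2                                           # bit-widths b_i = x_i**2

def integer_bit_regularizer(b):
    return np.sum(np.sin(np.pi * b) ** 2)                   # 0 exactly at integer bit-widths

layer_sizes = [1000, 5000, 2000]                            # weights per layer (hypothetical)
budget_bits = 6 * sum(layer_sizes)                          # e.g. an average of 6 bits per weight
b = bitwidths_from_angles([0.4, 0.6], layer_sizes, budget_bits)
print(b, np.dot(layer_sizes, b))                            # total size always equals the budget
print(integer_bit_regularizer(b))                           # penalty pushing b toward integers
```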
  • a special regularizer function may be added to L_Q to attract the bit-width values of the weights and activations to a specific set.
  • A user or other entity may define a set of required bit-width values such as {4, 8, 16}. As a result, the bit-widths of the weights of the layers of the resulting model are equal to 4, 8 or 16 bits only. One possible form of such a regularizer is sketched below.
  • Figure 5 is a block diagram of a method 500 for determining mixed-precision quantization parameters to quantize a neural network according to an example.
  • the method 500 may be used with the other methods and examples described herein.
  • the method 500 may be used with any neural network comprising a plurality of layers where each layer is associated with a weight vector comprising multiple floating point data values, and where the weight vectors are selected to minimize a first loss function associated to the neural network.
  • the method 500 comprises determining a vector of quantization parameters on the basis of a size of the weight vectors.
  • the vector of quantization parameters comprises the vector of trainable variables previously defined.
  • each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.
  • Determining the vector of quantization parameters on the basis of a size of the weight vectors may comprise determining a sum of the sizes of the multiple floating point data values of the weight vectors and generating a parameter surface, such as the ellipsoidal parameter surface 400 shown in Figure 4, on the basis of the determination.
  • the method 500 comprises, for each training vector in the training data set, evaluating a second loss function on the basis of the training vector and the vector of quantization parameters.
  • the second loss function may be the loss function defined in equation (2).
  • the second loss function may comprise the first loss function associated to the neural network, a first regularization component and a second regularization component.
  • the first regularization component may be selected to constrain the quantized weight vector for each layer to a pre-determined range of values.
  • the second regularization component may also be selected to constrain quantized input data values to each layer of the corresponding quantized neural network to a pre-determined range of values.
  • the first and/or second regularization components comprise functions that depend continuously on the data values of the vector of quantization parameters. These properties are achieved according to the functions φ(x, t) previously defined.
  • the method 500 comprises modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function.
  • minimizing an output of the second loss function may comprise determining a local minimum of the second loss function. This may be performed using a gradient descent method.
  • the method 500 may further comprise accessing a validation dataset and evaluating, for each one of multiple validation vectors in the validation dataset, the quantized neural network on the basis of the quantization parameters.
  • the method 500 may further comprise applying quantization to input vectors (also referred to as activations) of respective layers of the neural network.
  • the method of determining quantization parameters for activations may be similar to the method 500 for determining quantization parameters for the weights.
  • the method 500 may further comprise determining a further vector of quantization parameters on the basis of a size of the input data values of each layer of the neural network, evaluating the second loss function on the basis of the further vector and modifying the further vector on the basis of the evaluation.
  • the inputs may be quantized using the same quantization parameters as the weights, or to a pre-determined bit-width, as in the sketch below.
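  • A minimal sketch of quantizing a layer's input activations to a fixed, pre-determined bit-width with a single scale factor is given below; the choice of an unsigned grid (e.g. for post-ReLU inputs) and of the scale value are assumptions made for the example.

```python
import numpy as np

def quantize_activations(a, s_a, t=4):
    hi = 2 ** t - 1                               # unsigned t-bit grid [0, 2^t - 1]
    q = np.clip(np.round(a / s_a), 0, hi)
    return q * s_a                                # dequantized activations fed to the next layer

a = np.maximum(0.0, np.random.randn(8, 16)).astype(np.float32)   # post-ReLU activations
print(quantize_activations(a, s_a=0.05, t=4)[:1])
```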
  • Figure 6 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision. All weights are quantized except the first and last layers, so that 0.91% of the model remains 32-bit.
  • the table shows a comparison of different quantization methods with non-quantized activations, including SinReQ and quantization using a smooth regularizer with fixed bit-width (QSin). In this case, model accuracy using the method described herein is 91.83%.
  • Figure 7 shows a diagram of mixed-precision bit-widths for layers of ResNet-20, quantized using the mixed-precision quantization method described herein. The majority of layers are quantized to 4 bits.
  • Figure 8 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision with 4-bit activations.
  • the method described herein is compared with DoReFa, PACT, SinReQ and QSin.
  • the method described herein demonstrates superior accuracy for mixed-precision quantized models over quantized 4-bit models of the same total model size. Full precision model accuracy is 91.73%.
  • Figure 9 shows a diagram of mixed precision bit-widths for layers of ResNet-20, with 4-bit quantization of activations.
  • Figure 10 shows a table comparing quantization methods for quantization of MobileNet_v2 on Imagenet. Weights of all model layers are quantized and 1% of the model remains 32-bit, namely biases and batch norms. Activations are quantized to 8-bit. The method is compared with two quantization methods: TensorFlow 8-bit quantization using the straight-through estimator, and a mixed precision DNN method. Full precision model accuracy is 71.88%.
  • Figure 11 shows a table of distributions of bit-widths, using the method described herein.
  • the methods and systems described provide a general approach to achieve mixed-precision quantization of any neural network architecture, independent of layer types, activation functions or network topology. Furthermore, the method described shows improved results in classification, regression and image enhancement tasks.
  • the method is memory efficient: full precision weights are used in both forward and backward propagation during training, without rounded weights, and there is no need to store multiple instances of models with different bit-widths. Model training may be performed using gradient descent, giving an improved convergence rate.
  • the method provides the ability to explicitly set constraints on the overall model size.
  • the model may be trained to give quantization parameters to a specific set of bit-widths such as special hardware-specific bit-widths.
  • Figure 12 is a block diagram of a computing system 1200 that may be used for implementing the methods disclosed herein.
  • the computing system 1200 includes a processing unit 1202.
  • the processing unit includes a central processing unit (CPU) 1214, memory 1208, and may further include a mass storage device 1204, a video adapter 1210, and an I/O interface 1212 connected to a bus 1220.
  • the bus 1220 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or a video bus.
  • the CPU 1214 may comprise any type of electronic data processor.
  • the memory 1208 may comprise any type of non-transitory system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), or a combination thereof.
  • the memory 1208 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
  • the mass storage 1204 may comprise any type of non-transitory storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1220.
  • the mass storage 1204 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, or an optical disk drive.
  • the video adapter 1210 and the I/O interface 1212 provide interfaces to couple external input and output devices to the processing unit 1202.
  • input and output devices include a display 1218 coupled to the video adapter 1210 and a mouse, keyboard, or printer 1216 coupled to the I/O interface 1212.
  • Other devices may be coupled to the processing unit 1202, and additional or fewer interface cards may be utilized.
  • a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for an external device.
  • the processing unit 1202 also includes one or more network interfaces 1206, which may comprise wired links, such as an Ethernet cable, or wireless links to access nodes or different networks.
  • the network interfaces 1206 allow the processing unit 1202 to communicate with remote units via the networks.
  • the network interfaces 1206 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas.
  • the processing unit 1202 is coupled to a local-area network 1222 or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, or remote storage facilities.
  • a signal may be transmitted by a transmitting unit or a transmitting module.
  • the respective units or modules may be hardware, software, or a combination thereof.
  • one or more of the units or modules may be an integrated circuit, such as field programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method and a computing system for determining mixed-precision quantization parameters for quantizing a neural network are provided. The method comprises determining a vector of quantization parameters on the basis of a size of the weight vectors of the neural network and, for each one of multiple training vectors of a training dataset, evaluating a second loss function on the basis of the training vector and the vector of quantization parameters and modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function. Each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.
PCT/RU2020/000601 2020-11-13 2020-11-13 Method and system for quantizing a neural network WO2022103291A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/RU2020/000601 WO2022103291A1 (fr) 2020-11-13 2020-11-13 Method and system for quantizing a neural network
EP20851354.9A EP4196919A1 (fr) 2020-11-13 2020-11-13 Method and system for quantizing a neural network
CN202080104047.6A CN116472538A (zh) 2020-11-13 2020-11-13 Method and system for quantizing a neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2020/000601 WO2022103291A1 (fr) 2020-11-13 2020-11-13 Method and system for quantizing a neural network

Publications (1)

Publication Number Publication Date
WO2022103291A1 true WO2022103291A1 (fr) 2022-05-19

Family

ID=74561976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2020/000601 WO2022103291A1 (fr) 2020-11-13 2020-11-13 Method and system for quantizing a neural network

Country Status (3)

Country Link
EP (1) EP4196919A1 (fr)
CN (1) CN116472538A (fr)
WO (1) WO2022103291A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114075A (zh) * 2023-10-19 2023-11-24 湖南苏科智能科技有限公司 神经网络模型量化方法、装置、设备及介质

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Elthakeb, Ahmed T., et al., "SinReQ: Generalized Sinusoidal Regularization for Low-Bitwidth Deep Quantized Training", arXiv, 4 May 2019 (2019-05-04), XP081542586 *
Elthakeb, Ahmed T., et al., "WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization", arXiv, 29 February 2020 (2020-02-29), XP081651983 *
Naumov, Maxim, et al., "On Periodic Functions as Regularizers for Quantization of Neural Networks", arXiv, 24 November 2018 (2018-11-24), XP081040812 *
Uhlich, Stefan, et al., "Mixed Precision DNNs: All you need is a good parametrization", arXiv, 27 May 2019 (2019-05-27), XP081663581 *
Esser, Steven K., et al., "Learned Step Size Quantization", arXiv, 7 May 2020 (2020-05-07), XP081663151 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114075A (zh) * 2023-10-19 2023-11-24 湖南苏科智能科技有限公司 神经网络模型量化方法、装置、设备及介质
CN117114075B (zh) * 2023-10-19 2024-01-26 湖南苏科智能科技有限公司 神经网络模型量化方法、装置、设备及介质

Also Published As

Publication number Publication date
EP4196919A1 (fr) 2023-06-21
CN116472538A (zh) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111652368B (zh) 一种数据处理方法及相关产品
US11275986B2 (en) Method and apparatus for quantizing artificial neural network
US10726336B2 (en) Apparatus and method for compression coding for artificial neural network
US20200264876A1 (en) Adjusting activation compression for neural network training
TW201918939A (zh) 用於學習低精度神經網路的方法及裝置
WO2020142192A1 (fr) Compression d'activation de réseau neuronal dotée d'une virgule flottante de bloc étroit
WO2020142183A1 (fr) Compression d'activation de réseau neuronal avec virgule flottante de bloc aberrant
US20230196202A1 (en) System and method for automatic building of learning machines using learning machines
EP3915056A1 (fr) Compression d'activation de réseau neuronal avec des mantisses non uniformes
CN112955907A (zh) 量化训练的长短期记忆神经网络
TWI744724B (zh) 處理卷積神經網路的方法
CN114781618A (zh) 一种神经网络量化处理方法、装置、设备及可读存储介质
CN112884146A (zh) 一种训练基于数据量化与硬件加速的模型的方法及系统
WO2022103291A1 (fr) Procédé et système permettant de quantifier un réseau neuronal
TWI758223B (zh) 具有動態最小批次尺寸之運算方法,以及用於執行該方法之運算系統及電腦可讀儲存媒體
CN112561050B (zh) 一种神经网络模型训练方法及装置
US20220405561A1 (en) Electronic device and controlling method of electronic device
CN114970822A (zh) 一种神经网络模型量化方法、系统、设备及计算机介质
WO2021232907A1 (fr) Appareil et procédé de formation de modèle de réseau neuronal, et dispositif associé
US11861452B1 (en) Quantized softmax layer for neural networks
CN114065913A (zh) 模型量化方法、装置及终端设备
CN114580625A (zh) 用于训练神经网络的方法、设备和计算机可读存储介质
Zhen et al. A Secure and Effective Energy-Aware Fixed-Point Quantization Scheme for Asynchronous Federated Learning.
US20240143326A1 (en) Kernel coefficient quantization
JP7506276B2 (ja) 半導体ハードウェアにおいてニューラルネットワークを処理するための実装および方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20851354

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080104047.6

Country of ref document: CN

ENP Entry into the national phase

Ref document number: 2020851354

Country of ref document: EP

Effective date: 20230314

NENP Non-entry into the national phase

Ref country code: DE