CN116644796B - Network model quantization method, voice data processing method, device and chip

Info

Publication number: CN116644796B
Authority: CN (China)
Prior art keywords: operation layer, network model, target operation, scaling factor, data
Legal status: Active
Application number: CN202310927992.1A
Other languages: Chinese (zh)
Other versions: CN116644796A (en)
Inventors: 唐剑, 赵东宇, 丁维浩, 夏立超, 张法朝, 牟小峰
Current Assignee: Midea Robozone Technology Co Ltd
Original Assignee: Midea Robozone Technology Co Ltd
Application filed by Midea Robozone Technology Co Ltd
Priority to CN202310927992.1A
Publication of CN116644796A
Application granted
Publication of CN116644796B

Classifications

    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Neural network learning methods
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use for comparison or discrimination

Abstract

The application relates to the technical field of data processing, and provides a network model quantization method together with a voice data processing method, apparatus and chip. The network model quantization method comprises the following steps: determining the input data and output data of each operation layer of the network model; determining an input scaling factor for each operation layer according to the input data; determining an output scaling factor for each operation layer according to the output data; and determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor. The network model generated by the method reduces precision loss and preserves calculation accuracy while achieving, through quantization, the goals of reducing the size of the network model, lowering its memory consumption and accelerating its computation.

Description

Network model quantization method, voice data processing method, device and chip
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a quantization method of a network model, and a processing method, apparatus and chip of voice data.
Background
In the related art, the computational load of a network model can be reduced by quantizing the trained model, thereby reducing the model size, lowering memory consumption and accelerating model inference.
However, in a quantized network model, errors arise during the quantized data calculations, which lowers the accuracy of the network model so that data processing requirements can no longer be met.
Therefore, how to guarantee the operation precision of a model while accelerating its operation speed through quantization has become an urgent technical problem.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art.
To this end, a first aspect of the application provides a quantization method of a network model.
A second aspect of the present application provides a method of processing voice data.
A third aspect of the present application provides a quantization apparatus for a network model.
A fourth aspect of the present application provides a quantization apparatus of a network model.
A fifth aspect of the present application provides a readable storage medium.
A sixth aspect of the application proposes an electronic device.
A seventh aspect of the application proposes a computer program product.
An eighth aspect of the present application proposes a chip.
The first aspect of the present application provides a quantization method of a network model, including: determining input data and output data of each operation layer of the network model; determining an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and quantizing the second target operation layer to generate a quantized network model.
The quantization method of the network model provided by the application can be used to quantize a deep neural network model, that is, to convert the trained weights, activation values and the like of the deep neural network from high-precision to low-precision representations. For example, the output data of the network model before quantization may be 32-bit floating point data, and after quantization the output data can be converted into 8-bit integers, thereby achieving the goals of reducing the model size, lowering the memory consumption of the model and accelerating its calculation speed.
The quantization method of the network model provided by the application first determines the input data and output data of each operation layer of the network model, then determines the input scaling factor and the output scaling factor of each operation layer from these data, and finally determines the first target operation layer from all operation layers of the network model according to the input scaling factor or the output scaling factor. The first target operation layer is an operation layer unsuitable for quantization.
Because models differ in structure, data distribution and processing logic, the first target operation layer cannot be identified directly from the absolute values of the input and output scaling factors. Instead, the input and output scaling factors of each operation layer are compared with those of the other operation layers, and an operation layer whose scaling factors differ greatly from the others' is determined as the first target operation layer.
Specifically, the ratio of the input scaling factor of the first target operation layer to the maximum input scaling factor of the second target operation layers, i.e., the operation layers other than the first target operation layer, is greater than a first preset value. That is, the maximum input scaling factor among all operation layers is divided by the maximum input scaling factor among the remaining operation layers; if the result exceeds the first preset value, the input scaling factor of the first target operation layer is much larger than those of the other operation layers, so the first target operation layer would lose too much precision during quantization and is therefore unsuitable for quantization.
Accordingly, the ratio of the output scaling factor of the first target operation layer to the maximum output scaling factor of the second target operation layers is greater than a second preset value. That is, the output scaling factor of the first target operation layer differs greatly from those of the second target operation layers, so precision would be lost during quantization and the layer is unsuitable for quantization. The first preset value and the second preset value may be equal, or may differ according to parameters such as the structure, data distribution and processing logic of the network model.
After the first target operation layer unsuitable for quantization is determined, the first target operation layer can be left unquantized during data processing, and quantization calculation performed only on the second target operation layers. In this way, the goals of reducing the size of the network model, lowering its memory consumption and accelerating its calculation speed are achieved through quantization, while the precision loss of the network model is reduced and its calculation accuracy is guaranteed.
According to the quantization method of the network model, test data are input into the network model to generate the input data and output data of each operation layer; the input scaling factor and output scaling factor of each operation layer are determined from these data; and whether each operation layer is suitable for quantization is judged from the scaling factors, i.e., the first target operation layer unsuitable for quantization is determined from all operation layers according to the input scaling factor or the output scaling factor. During operation of the network model, only the second target operation layers are quantized, which reduces the precision loss of the network model and guarantees its calculation accuracy on top of the benefits of quantization.
According to a second aspect of the present application, a method for processing voice data is provided, comprising: obtaining the frequency domain features of the voice data and a network model; and processing the frequency domain features through the network model to obtain the wake-up result probability of the voice data, wherein the network model is a quantized network model generated by the quantization method of any one of the above technical solutions.
In this voice data processing method, because the quantized network model is used to process the frequency domain features of the voice data, the calculation speed of the network model is increased and the processing efficiency of the frequency domain features is improved, while the calculation accuracy of the network model is guaranteed, thereby ensuring the accuracy of the generated wake-up result probability of the voice data.
According to a third aspect of the present application, there is provided a quantization apparatus of a network model, comprising: a determining unit for determining the input data and output data of each operation layer of the network model, determining an input scaling factor of each operation layer according to the input data, determining an output scaling factor of each operation layer according to the output data, and determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor, wherein the ratio of the input scaling factor of the first target operation layer to the maximum input scaling factor of the second target operation layers other than the first target operation layer is greater than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum output scaling factor of the second target operation layers is greater than a second preset value; and a quantization unit for quantizing the second target operation layers to generate a quantized network model.
According to a fourth aspect of the present application, there is provided a quantization apparatus of a network model, comprising: a processor and a memory storing programs or instructions executable on the processor, which when executed by the processor implement the steps of: obtaining test data; inputting the test data into a network model, and generating input data and output data of each operation layer of the network model; determining an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and quantizing the second target operation layer to generate a quantized network model.
According to a fifth aspect of the present application, there is provided a readable storage medium having stored thereon a program or instructions which when executed by a processor performs the steps of: obtaining test data; inputting the test data into a network model, and generating input data and output data of each operation layer of the network model; determining an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and quantizing the second target operation layer to generate a quantized network model.
In a sixth aspect, the present application provides an electronic device comprising a processor and a memory, the memory storing a program or instructions executable on the processor, the program or instructions when executed by the processor performing the steps of: obtaining test data; inputting the test data into a network model, and generating input data and output data of each operation layer of the network model; determining an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and quantizing the second target operation layer to generate a quantized network model.
In a seventh aspect, the present application provides a computer program product comprising a computer program or instructions which, when executed by a processor, performs the steps of: obtaining test data; inputting the test data into a network model, and generating input data and output data of each operation layer of the network model; determining an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and quantizing the second target operation layer to generate a quantized network model.
An eighth aspect of the present application proposes a chip comprising a program or instructions for implementing the following steps when the chip is running: obtaining test data; inputting the test data into a network model, and generating input data and output data of each operation layer of the network model; determining an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and quantizing the second target operation layer to generate a quantized network model.
Additional aspects and advantages of the application will be set forth in part in the description which follows, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a quantization method of a network model according to one embodiment of the application;
FIG. 2 is a schematic flow chart of a quantization method of a network model according to another embodiment of the application;
FIG. 3 is a schematic flow chart of a quantization method of a network model according to yet another embodiment of the application;
FIG. 4 is a schematic flow chart of a quantization method of a network model according to yet another embodiment of the application;
FIG. 5 is a schematic flow chart of a quantization method of a network model according to yet another embodiment of the application;
FIG. 6 is a block diagram of an operation layer after quantization in a network model according to an embodiment of the present application;
FIG. 7 is a flow chart illustrating a method for processing voice data according to one embodiment of the present application;
FIG. 8 shows a distribution diagram of weight values of each operation layer of a network model according to an embodiment of the present application;
FIG. 9 shows a distribution diagram of input data and output data for each operation layer of a network model provided by an embodiment of the present application;
FIG. 10 shows a distribution diagram of another form of input data and output data for each operation layer of the network model provided by the embodiment of the present application;
FIG. 11 shows another view of the distribution graph of input data and output data of FIG. 10;
FIG. 12 is a schematic bar graph showing the input scaling factor, output scaling factor, and weight scaling factor for each operational layer of the present application;
FIG. 13 is a schematic bar graph showing the order factor of each operational layer of the present application;
fig. 14 is a block diagram showing a structure of a quantization apparatus of a network model provided according to an embodiment of the present application;
fig. 15 is a flowchart illustrating a quantization method of a network model according to still another embodiment of the present application.
The correspondence between the reference numerals and the component names in fig. 14 is:
600 network model quantization means, 606 determination unit, 608 quantization unit.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced otherwise than as described herein, and therefore the scope of the present application is not limited to the specific embodiments disclosed below.
A quantization method, a quantization apparatus, and a readable storage medium of a network model according to some embodiments of the present application are described below with reference to fig. 1 to 15.
As shown in fig. 1, according to one embodiment of the present application, a quantization method of a network model is provided, including:
s102, determining input data and output data of each operation layer of a network model;
s104, determining an input scaling factor of each operation layer according to the input data;
s106, determining an output scaling factor of each operation layer according to the output data;
s108, determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor;
the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value;
s110, quantizing the second target operation layer to generate a quantized network model.
The quantization method of the network model provided by the application can be used to quantize a deep neural network model, that is, to convert the trained weights, activation values and the like of the deep neural network from high-precision to low-precision representations. For example, the output data of the network model before quantization may be 32-bit floating point data, and after quantization the output data can be converted into 8-bit integers, thereby achieving the goals of reducing the model size, lowering the memory consumption of the model and accelerating its calculation speed.
The network model quantization method may employ static offline quantization (Post Training Quantization Static, PTQ Static). Taking a convolution layer as an example, the input data is first quantized by a quantization function, converting high-precision input data into low-precision input data; the quantized data then enters the convolution layer for calculation; and the result computed by the convolution layer passes through an inverse quantization function to yield the final output data of the convolution layer, which has the same data form as the input data. In this process, the relation between the input data and the output data is as follows:
Output_f = DeQuant(Quant(Input_f) × Quant(weight_f))
where Input_f is the input data, Output_f is the output data, and weight_f is the weight value.
Further, the quantization function Quant(r) is:
Quant(r) = round(r / scale) - Z
where r is the input of the quantization function (i.e., Input_f or weight_f), scale is a scaling factor, and Z is an integer zero point; the round function maps a real value to an integer value by a rounding operation.
Further, the inverse quantization function is DeQuant(r) = scale × (Quant(r) + Z), by which the output data is restored to the same data form as the input data.
Through the quantization function and the inverse quantization function, the relation between the input data and the output data can be rewritten as:
Output_q = (scale_in × scale_we / scale_out) × (Input_q + Z_in) × (weight_q + Z_we) - Z_out
where Output_q, Input_q and weight_q are the quantized output data, input data and weight value; scale_out is the output scaling factor, scale_in is the input scaling factor, and scale_we is the weight scaling factor; Z_in is the integer zero point of the input data, Z_out is the integer zero point of the output data, and Z_we is the integer zero point of the weight value.
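To make the above relations concrete, the following is a minimal sketch of the quantization and inverse quantization functions, written in Python for illustration only; it assumes the symmetric int8 range of -127 to 127 used later in the description and the zero-point convention of the formulas above.
```python
import numpy as np

def quant(r, scale, zero_point):
    """Quant(r) = round(r / scale) - Z, clipped to the int8 range."""
    q = np.round(r / scale) - zero_point
    return np.clip(q, -127, 127).astype(np.int8)

def dequant(q, scale, zero_point):
    """DeQuant(q) = scale * (q + Z): restores the original data form."""
    return scale * (q.astype(np.float32) + zero_point)
```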
In the process of processing data by a quantized convolution layer, the errors come from three sources: (1) the error of the input data in the quantization function calculation; (2) the error of the weight value in the quantization function calculation; and (3) the error of the output data in the inverse quantization function calculation, which is mainly caused by the rounding operation of the round function.
The larger the scaling factor scale, the larger the error introduced while the quantized convolution layer processes data. The scaling factors include an input scaling factor for the input data, an output scaling factor for the output data, and a weight scaling factor for the weight value of each operation layer of the network model.
The quantization method of the network model provided by the application first determines the input data and output data of each operation layer of the network model, then determines the input scaling factor and the output scaling factor of each operation layer from these data, and finally determines the first target operation layer, i.e., an operation layer unsuitable for quantization, from all operation layers according to the input scaling factor or the output scaling factor.
In some embodiments, considering that models differ in structure, data distribution and processing logic, the first target operation layer is not identified directly from the absolute values of the input and output scaling factors; instead, the input and output scaling factors of each operation layer are compared with those of the other operation layers, and an operation layer whose scaling factors differ greatly from the others' is determined as the first target operation layer.
Specifically, the ratio of the input scaling factor of the first target operation layer to the maximum input scaling factor of the second target operation layers other than the first target operation layer is greater than a first preset value. In some embodiments, the maximum input scaling factor among all operation layers is divided by the maximum input scaling factor among the remaining operation layers; if the result exceeds the first preset value, the input scaling factor of the first target operation layer is much larger than those of the other operation layers, so the first target operation layer would lose too much precision during quantization and is unsuitable for quantization.
Accordingly, the ratio of the output scaling factor of the first target operation layer to the maximum output scaling factor of the second target operation layers is greater than a second preset value. In some embodiments, the output scaling factor of the first target operation layer differs greatly from those of the second target operation layers, so the precision loss during quantization would be high and the layer is unsuitable for quantization. The first preset value and the second preset value may be equal, or may differ according to parameters such as the structure, data distribution and processing logic of the network model.
After the first target operation layer unsuitable for quantization is determined, the first target operation layer can be left unquantized during data processing of the network model, and quantization calculation performed only on the second target operation layers, thereby reducing precision loss and guaranteeing calculation accuracy while achieving the goals of reducing model size, lowering memory consumption and accelerating calculation through quantization.
According to the quantization method of the network model, test data are input into the network model to generate the input data and output data of each operation layer, the input and output scaling factors of each operation layer are determined from these data, and the first target operation layer unsuitable for quantization is determined from all operation layers according to the input scaling factor or the output scaling factor. During operation of the network model, only the second target operation layers are quantized, which reduces the precision loss of the network model and guarantees its calculation accuracy.
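As an illustrative sketch only, the ratio test described above for selecting the first target operation layer might look as follows; the threshold values and function names are assumptions, not part of the patent.
```python
def find_first_target_layers(in_scales, out_scales, thr_in=10.0, thr_out=10.0):
    """Flag layers whose input or output scaling factor dwarfs all others'.

    in_scales / out_scales map layer name -> scaling factor; thr_in and
    thr_out stand in for the first and second preset values.
    """
    targets = set()
    for name in in_scales:
        rest_in = max(v for k, v in in_scales.items() if k != name)
        rest_out = max(v for k, v in out_scales.items() if k != name)
        if in_scales[name] / rest_in > thr_in or out_scales[name] / rest_out > thr_out:
            targets.add(name)  # unsuitable for quantization
    return targets
```
The layers not flagged by such a routine form the second target operation layers that proceed to quantization.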
According to an embodiment of the present application, as shown in fig. 2, a quantization method of a network model is provided, including:
s202, acquiring test data;
s204, inputting test data into the network model, and generating input data and output data of each operation layer of the network model;
S206, obtaining the maximum value of the input data and the minimum value of the input data;
s208, determining an input scaling factor according to the maximum value and the minimum value of the input data and the data capacity required by the quantized operation layer;
s210, obtaining the maximum value and the minimum value of output data;
s212, determining an output scaling factor according to the maximum value and the minimum value of the output data and the data capacity required by the quantized operation layer;
s214, determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor;
the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value;
s216, quantizing the second target operation layer to generate a quantized network model.
In this embodiment, test data are first obtained and input into the network model; through the operation of each operation layer of the network model, the input data of each operation layer are generated, and the input scaling factor of each operation layer can then be calculated from the input data.
In particular, the input scaling factor of an operation layer may be determined from the data range of its input data. First, the maximum value and the minimum value of the input data are obtained; the input scaling factor is then determined from these values and the data capacity required by the quantized operation layer, i.e., the capacity of the data form produced by quantization. Taking 8-bit integers (int8) as an example, after quantization calculation the input data are converted into 8-bit integers ranging from -127 to 127, giving a data capacity of 254.
Specifically, the input scaling factor may be determined according to the preset formula
scale_in = (x_max - x_min) / Q
where scale_in is the input scaling factor, x_max is the maximum value of the input data, x_min is the minimum value of the input data, and Q is the data capacity, specifically Q = Q_max - Q_min, where Q_max is the maximum fixed-point value of the quantized output data of the operation layer and Q_min is the minimum fixed-point value.
Further, after the test data are input into the network model, the output data of each operation layer are generated through the operation of each operation layer, and the output scaling factor of each operation layer can then be calculated from the output data.
Specifically, the output scaling factor of an operation layer may be determined from the data range of its output data: first the maximum value and the minimum value of the output data are obtained, and the output scaling factor is then determined from these values and the data capacity required by the quantized operation layer, which for int8 is again the range -127 to 127, i.e., a capacity of 254.
Specifically, the output scaling factor may be determined according to the preset formula
scale_out = (y_max - y_min) / Q
where scale_out is the output scaling factor, y_max is the maximum value of the output data, y_min is the minimum value of the output data, and Q is the data capacity, Q = Q_max - Q_min, defined as above.
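A minimal sketch of these range-based formulas, assuming the symmetric int8 fixed-point range described above (the function name is illustrative):
```python
import numpy as np

def scaling_factor(values, q_max=127, q_min=-127):
    """scale = (max - min) / Q, with Q = Q_max - Q_min (254 for int8)."""
    capacity = q_max - q_min
    return (float(np.max(values)) - float(np.min(values))) / capacity

# Per operation layer, for example:
# input_scale = scaling_factor(layer_input_data)
# output_scale = scaling_factor(layer_output_data)
```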
According to an embodiment of the present application, as shown in fig. 3, a quantization method of a network model is provided, including:
s302, test data are obtained;
S304, inputting test data into a network model, and generating input data and output data of each operation layer of the network model;
s306, determining an input scaling factor of each operation layer according to the input data;
s308, determining an output scaling factor of each operation layer according to the output data;
s310, determining a first target operation layer from all operation layers according to an input scaling factor or an output scaling factor;
the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value;
s312, quantifying each second target operation layer in turn according to a preset sequence;
s314, after quantification of any one of the second target operation layers is completed, acquiring the precision of the network model;
s316, restoring the second target operation layer after the last quantization into the second target operation layer before the quantization under the condition that the precision of the network model is smaller than the preset precision;
and S318, determining the current network model as the quantized network model.
In this embodiment, after the first target operation layer unsuitable for quantization is determined, operation layers unsuitable for quantization may additionally be identified among the second target operation layers during their quantization, according to the input scaling factor and the weight scaling factor of each second target operation layer.
Specifically, each second target operation layer may be quantized in turn according to a preset order, and the operation precision of the network model obtained after each quantization. That is, after a second target operation layer is quantized, test data for evaluating model accuracy are input into the network model, test results are generated, and the precision of the network model is determined from the test results. If the precision of the network model is greater than or equal to a preset precision, the precision loss caused by quantizing that second target operation layer is small, i.e., the layer is suitable for quantization, and the quantized layer can be retained.
Accordingly, if the precision of the network model determined after quantizing a second target operation layer is smaller than the preset precision, quantizing that layer causes a large precision loss, i.e., the layer is unsuitable for quantization, and the quantized layer is restored to its unquantized form.
In addition, because second target operation layers earlier in the preset order cause smaller precision loss when quantized, once the precision of the quantized network model falls below the preset precision, the remaining unquantized second target operation layers would cause an even greater loss. They therefore need not be quantized, and the quantization of the network model is complete.
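The quantize-evaluate-rollback loop described above can be sketched as follows; quantize_layer, restore_layer and eval_fn are hypothetical helpers, not APIs defined by the patent.
```python
def quantize_in_order(model, ordered_layers, eval_fn, preset_precision):
    """Quantize second target layers one by one in the preset order;
    roll back and stop once accuracy drops below the preset precision."""
    for layer in ordered_layers:
        model.quantize_layer(layer)
        if eval_fn(model) < preset_precision:
            model.restore_layer(layer)  # undo the last quantization
            break  # later layers would lose even more precision
    return model
```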
According to an embodiment of the present application, as shown in fig. 4, a quantization method of a network model is provided, including:
s402, acquiring test data;
s404, inputting test data into the network model, and generating input data and output data of each operation layer of the network model;
s406, determining an input scaling factor of each operation layer according to the input data;
s408, determining an output scaling factor of each operation layer according to the output data;
s410, determining a first target operation layer from all operation layers according to an input scaling factor or an output scaling factor;
the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value;
S412, traversing each second target operation layer, and determining a weight value of each second target operation layer;
s414, determining the sequence factor of each second target operation layer according to the weight value and the input scaling factor of the second target operation layer;
s416, determining a preset sequence for quantizing the second target operation layer according to the sequence factors;
the preset sequence is from small to large according to the sequence factor;
s418, quantifying each second target operation layer in turn according to a preset sequence;
s420, after quantification of any one second target operation layer is completed, obtaining the precision of the network model;
s422, restoring the second target operation layer after the last quantization into the second target operation layer before the quantization under the condition that the precision of the network model is smaller than the preset precision;
s424, determining the current network model as the quantized network model.
In this embodiment, before the second target operation layers are quantized in turn, the order in which they are quantized, i.e., the preset order, must first be determined, so that the precision of the network model decreases monotonically as the layers are quantized one by one. In this way, once the precision of the network model falls below the preset precision, all remaining unquantized second target operation layers can be judged unsuitable for quantization, which improves the efficiency of identifying such layers and hence of quantizing the network model.
Specifically, as the relation between input data and output data above shows, the input scaling factor of each operation layer enters the quantization calculation as a product with the weight scaling factor of that layer, so the possible precision loss of each second target operation layer during quantization can be estimated from its input scaling factor and weight value. In some embodiments, each second target operation layer is traversed, the weight value of each second target operation layer is determined, and an order factor is then determined for each second target operation layer from its weight value and input scaling factor. Finally, the preset order, i.e., the order in which the second target operation layers are quantized, is determined from the order factors.
Specifically, because the input scaling factor multiplies the weight scaling factor, the order factor determined from the input scaling factor and the weight value indicates the precision loss of the network model after the second target operation layer is quantized: the smaller the order factor, the smaller the precision loss and the greater the remaining precision. The preset order is therefore ascending order of the order factor, i.e., second target operation layers with smaller order factors are quantized first, which guarantees that the precision of the network model decreases as the layers are quantized in turn and that the layers unsuitable for quantization are identified efficiently.
Further, determining the order factor of each second target operation layer from its weight value and input scaling factor comprises: obtaining the maximum value and the minimum value of the weight value of each second target operation layer; determining a weight scaling factor from the maximum weight value, the minimum weight value and the data capacity required by the quantized operation layer; and determining the order factor from the weight scaling factor and the input scaling factor.
Specifically, the maximum value and the minimum value of the weight value of each second target operation layer are first obtained, and the weight scaling factor of the second target operation layer is then determined according to the formula
scale_we = (W_max - W_min) / Q
where scale_we is the weight scaling factor, W_max is the maximum value of the weight values, W_min is the minimum value of the weight values, and Q is the data capacity, Q = Q_max - Q_min, defined as above.
Further, after the weight scaling factor is determined, the order factor of the second target operation layer may be determined from the weight scaling factor and the input scaling factor, and the preset order of the quantization process is determined from the order factors of all second target operation layers.
Specifically, the order factor is determined according to the preset formula
scale = scale_in × scale_we
where scale is the order factor, scale_in is the input scaling factor, and scale_we is the weight scaling factor.
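Under the same assumptions as the earlier sketches, deriving the preset order from the order factors can be illustrated as:
```python
def quantization_order(layers, input_scales, weight_scales):
    """Sort second target layers by order factor scale_in * scale_we,
    ascending, so the least lossy layers are quantized first."""
    return sorted(layers, key=lambda l: input_scales[l] * weight_scales[l])
```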
In a specific embodiment, quantization is performed on a network model comprising six operation layers, fc1, fc2, fc3, fc4, fc5 and fc6. The quantization calculation process of each operation layer is shown in fig. 6: a 32-bit floating point number is converted into an 8-bit integer (int8) by the quantization function Quant, the convolution operation is performed by the Conv2D function in the convolution layer, and the result is restored to a 32-bit floating point number by the inverse quantization function DeQuant, which completes the quantization calculation of the operation layer.
In the quantization process, the six operation layers are first traversed to obtain the weight value of each operation layer, and the weight scaling factor (weight scale) of each operation layer is determined from its weight value. Test data are then input into the network model to generate the input data and output data of the six operation layers, and the input scaling factor (input scale) and output scaling factor (output scale) of each operation layer are determined from them, as shown in Table 1 and figs. 8 to 12. Fig. 8 is a distribution diagram of the weight values of each operation layer, with the weight value on the abscissa and the number of occurrences of each weight value on the ordinate. Fig. 9 is a distribution diagram of the input data and output data of each operation layer, with the data value on the abscissa and the number of occurrences on the ordinate. Fig. 10 shows the same distributions in another form, and fig. 11 is another view of fig. 10. Fig. 12 is a bar graph of the input scaling factor, output scaling factor and weight scaling factor of each operation layer.
Table 1 (the input scaling factor, output scaling factor and weight scaling factor of each operation layer; values not reproduced in this text)
As can be seen from Table 1 and fig. 12, the input scaling factor of fc1 differs greatly from those of the other operation layers, so quantizing fc1 would cause a great loss of precision, i.e., fc1 is unsuitable for quantization. Accordingly, the output scaling factor of fc6 differs greatly from those of the other operation layers, so fc6 is likewise unsuitable for quantization.
Further, for fc2 to fc5, the order factor of each operation layer is determined from its input scaling factor and weight scaling factor, i.e., input scale × weight scale, as shown in fig. 13, and the quantization order of fc2 to fc5 is then determined from the order factors.
As can be seen from fig. 13, the order factors of fc2 to fc5 satisfy fc4 < fc5 < fc3 < fc2 from small to large, so quantization proceeds in the order fc4, fc5, fc3, fc2. After each operation layer is quantized, the precision of the network model is obtained; once the precision falls below the preset precision, the last-quantized operation layer and the remaining unquantized operation layers are judged unsuitable for quantization, and the quantization of the network model is complete. For example, if the precision of the network model falls below the preset precision after fc3 is quantized, fc3 and fc2 are unsuitable for quantization, so only fc4 and fc5 are quantized. After fc4 and fc5 are quantized, the precision of the model is essentially unchanged, yet the number of parameters is greatly reduced and the size of the system firmware shrinks (for example, from 4.7 MB to 3.7 MB) through this partial quantization, which greatly reduces the memory occupation and upgrade time of the system.
According to an embodiment of the present application, as shown in fig. 5, a quantization method of a network model is provided, including:
s502, test data are obtained;
s504, inputting test data into a network model, and generating input data and output data of each operation layer of the network model;
s506, determining an input scaling factor of each operation layer according to the input data;
s508, determining an output scaling factor of each operation layer according to the output data;
s510, determining a first target operation layer from all operation layers according to an input scaling factor or an output scaling factor;
s512, quantizing the second target operation layer to generate a quantized network model;
s514, adding a normalization layer before each first target operation layer of the network model;
s516, quantifying each first target operation layer;
s518, after quantifying any one of the first target operation layers, obtaining the precision of the network model;
s520, under the condition that the accuracy of the network model is not reduced, reserving a normalization layer before the first target operation layer, and determining the first target operation layer as a second target operation layer;
s522, in the case where the accuracy of the network model is reduced, the quantized first target operation layer is restored to the first target operation layer before quantization, and the normalization layer before the first target operation layer is deleted.
In this embodiment, after the first target operation layer unsuitable for quantization is determined, the first target operation layer unsuitable for quantization may be further adjusted by adding a normalization layer, so that after the first target operation layer is quantized, the operation precision of the network model may be ensured, so that the targets of reducing the size of the model, reducing the memory consumption of the model, accelerating the calculation speed of the model, and the like may be achieved by quantizing the first target operation layer, and the operation precision of the network model may be ensured.
Specifically, after the first target operation layers unsuitable for quantization are determined, a normalization layer may be added before each first target operation layer, and then the first target operation layers may be quantized. That is, the input data is normalized by the normalization layer before quantization; normalization narrows the range of the input data, which reduces the precision loss introduced by the first target operation layer, so the accuracy of its output data can be ensured after quantization.
It should be noted that adding a normalization layer before a first target operation layer helps to reduce the precision loss after that layer is quantized, but it does not guarantee that the loss falls within the predetermined bound. Therefore, when quantizing the first target operation layers after adding the normalization layers, the precision of the network model must still be checked: after each first target operation layer is quantized, the precision of the network model is obtained. If the precision is not reduced, the first target operation layer may be determined as a quantifiable second target operation layer, and the normalization layer before it is retained. If the precision is reduced, adding a normalization layer before the first target operation layer cannot preserve accuracy after quantization; in that case the normalization layer is deleted, the first target operation layer remains an operation layer unsuitable for quantization, and it is not quantized during network model quantization.
It can be understood that, while the second target operation layers are being quantized in the preset order, once the model precision falls below the preset precision, the remaining unquantized second target operation layers are determined to be unsuitable for quantization. At this point, adding a normalization layer before each such unquantized layer can improve the precision of the quantized model; the model precision is obtained throughout the quantization process, so that after the normalization layers are added, precision is ensured while the remaining layers are quantized.
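A minimal sketch of this fallback procedure (steps S514-S522) follows, with all framework-specific operations abstracted as hypothetical caller-supplied callables; this is an illustration of the control flow, not an implementation provided by the application.

```python
def quantize_first_targets(first_targets, add_norm, remove_norm, quantize,
                           restore, evaluate, baseline_accuracy):
    """Sketch of S514-S522. The callables are hypothetical stand-ins:
    add_norm/remove_norm insert or delete a normalization layer before one
    first target operation layer, quantize/restore quantize a layer or
    revert it to its pre-quantization state, and evaluate() returns the
    current accuracy of the network model."""
    second_targets = []
    for layer in first_targets:
        add_norm(layer)                      # S514: normalization layer before the layer
        quantize(layer)                      # S516: quantize the first target layer
        if evaluate() >= baseline_accuracy:  # S518/S520: accuracy not reduced
            second_targets.append(layer)     # keep the normalization layer
        else:                                # S522: accuracy reduced
            restore(layer)                   # revert to the pre-quantization layer
            remove_norm(layer)               # delete the normalization layer
    return second_targets
```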
As shown in fig. 7, according to an embodiment of the present application, there is provided a method for processing voice data, including:
s702, acquiring frequency domain characteristics and a network model of voice data;
s704, processing the frequency domain features through the network model to obtain the wake-up result probability of the voice data.
In this embodiment, a method for processing voice data is provided for determining the wake-up result probability of the voice data. First, the frequency domain features of the voice data are obtained; these are computed from the original voice signal by pre-emphasis, framing, windowing, short-time Fourier transform, and Mel filtering. The frequency domain features are then processed by the network model to obtain the wake-up result probability of the voice data, where the network model is a quantized network model generated by the quantization method of any one of the above embodiments.
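A minimal sketch of such a feature pipeline is given below, assuming numpy and librosa are available; the sample rate, FFT size, hop length, Mel filter count, and pre-emphasis coefficient are illustrative assumptions, not parameters specified by the application.

```python
import numpy as np
import librosa

def mel_features(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Sketch of the frequency-domain feature extraction described above;
    sr, n_fft, hop, and n_mels are assumed values for illustration."""
    # Pre-emphasis: boost high frequencies (0.97 is a common coefficient).
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing + windowing + short-time Fourier transform (Hann window),
    # then the power spectrum.
    spec = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop,
                               window="hann")) ** 2
    # Mel filtering: project the power spectrum onto a Mel filter bank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ spec + 1e-6)  # log-Mel features, (n_mels, frames)
```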
Specifically, a temporal convolutional network model is used to determine the wake-up result probability of the voice data. Because a temporal convolutional network traces back historical information while running, quantization errors tend to accumulate: a quantization error produced in one computation affects subsequent computations. Temporal convolutional network models are therefore difficult to quantize, and their accuracy easily degrades after quantization. The quantization method of the network model provided by the application quantizes the temporal convolutional network model after determining which of its layers are unsuitable for quantization, so that only the other layers are quantized. On the basis of reducing the model size, reducing memory consumption, and accelerating computation through quantization calculation, the precision loss is reduced and the calculation precision of the model is ensured. As a result, processing voice data through the quantized temporal convolutional network model improves the processing efficiency of the frequency domain features while guaranteeing the calculation accuracy of that processing, which in turn guarantees the accuracy of the generated wake-up result probability.
In one embodiment of the present application, a processing apparatus for voice data is provided, including: a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method for processing speech data as provided in the above embodiments.
In one embodiment of the present application, as shown in fig. 14, a quantization apparatus 600 of a network model is provided, including: a determining unit 606 for determining input data and output data of each operation layer of the network model; the determining unit 606 is further configured to determine an input scaling factor of each operation layer according to the input data; determining an output scaling factor of each operation layer according to the output data; determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor; the ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value; and a quantization unit 608, configured to quantize the second target operational layer, and generate a quantized network model.
The quantization device 600 of the network model provided by the application inputs the test data into the network model to generate the input data and output data of each operation layer, determines the input scaling factor and output scaling factor of each operation layer from those data, and then determines whether each operation layer is suitable for quantization calculation, that is, determines the first target operation layers unsuitable for quantization from all operation layers according to the input scaling factor or the output scaling factor. In this way, during network model operation, only the second target operation layers other than the first target operation layers are quantized; on the basis of reducing the model size, reducing memory consumption, and accelerating computation through quantization calculation, the precision loss of the network model is reduced and its calculation precision is ensured.
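The selection rule stated above (a layer is a first target when the ratio of its input or output scaling factor to the maximum among the remaining layers exceeds a preset value) can be sketched as follows; the data layout and the preset ratio values are assumptions for illustration.

```python
def find_first_target_layers(scales, ratio_in=2.0, ratio_out=2.0):
    """Sketch of the first-target selection rule: a layer is deemed
    unsuitable for quantization when its input (or output) scaling factor
    is disproportionately large relative to the remaining layers.
    `scales` maps layer name -> (input_scale, output_scale); the preset
    ratio values 2.0 are assumptions."""
    first_targets = []
    for name, (s_in, s_out) in scales.items():
        max_in_others = max(v[0] for k, v in scales.items() if k != name)
        max_out_others = max(v[1] for k, v in scales.items() if k != name)
        if s_in / max_in_others > ratio_in or s_out / max_out_others > ratio_out:
            first_targets.append(name)
    return first_targets
```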
Further, the quantization apparatus 600 of the network model further includes an obtaining unit 602, where the obtaining unit 602 is further configured to obtain a maximum value of the input data and a minimum value of the input data; the determining unit 606 is specifically configured to determine the input scaling factor according to the maximum value and the minimum value of the input data and the data capacity required by the quantized operation layer.
Further, the obtaining unit 602 is further configured to obtain a maximum value of the output data and a minimum value of the output data; the determining unit 606 is specifically configured to determine the output scaling factor according to the maximum value and the minimum value of the output data and the data capacity required by the quantized operation layer.
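As a concrete illustration of deriving a scaling factor from a data range and the data capacity of the quantized operation layer, a minimal sketch follows; the asymmetric-range formula and the 8-bit capacity are assumptions, since the exact expression is not spelled out here.

```python
def scaling_factor(data_min, data_max, n_bits=8):
    # Spread the observed data range over the 2**n_bits quantization levels;
    # the same form applies to input data, output data, and weight values.
    # This asymmetric-range formula is an assumption for illustration.
    return (data_max - data_min) / (2 ** n_bits - 1)

# e.g. an operation layer whose inputs span [-4.0, 4.0], quantized to 8 bits:
print(scaling_factor(-4.0, 4.0))  # 8.0 / 255 ≈ 0.0314
```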
Further, the quantization unit 608 is specifically configured to sequentially quantize each of the second target operation layers according to a preset sequence; the obtaining unit 602 is further configured to obtain accuracy of the network model after quantization of any one of the second target operation layers is completed; the quantization unit 608 is further configured to restore the second target operation layer after the last quantization to the second target operation layer before quantization if the precision of the network model is less than the preset precision; and determining the current network model as the quantized network model.
Further, the determining unit 606 is further configured to traverse each second target operation layer, and determine a weight value of each second target operation layer; determining the sequence factor of each second target operation layer according to the weight value and the input scaling factor of the second target operation layer; determining a preset sequence for quantizing the second target operation layer according to the sequence factors; the preset sequence is the sequence from small to large according to the sequence factor.
Further, the obtaining unit 602 is further configured to obtain a maximum value of the weight value and a minimum value of the weight value of each second target operation layer; the determining unit 606 is specifically further configured to determine a weight scaling factor according to a maximum value of the weight value, a minimum value of the weight value, and a data capacity required by the operation layer after quantization; the order factor is determined based on the weight scaling factor and the input scaling factor.
Further, the determining unit 606 is specifically configured to determine the order factor according to the formula scale = scale_in × scale_we, wherein scale is the order factor, scale_in is the input scaling factor, and scale_we is the weight scaling factor.
Further, the quantization device further comprises an adding unit for adding a normalization layer before each first target operation layer of the network model; the quantization unit 608 is further configured to quantize each of the first target operational layers; the obtaining unit 602 is further configured to obtain accuracy of the network model after quantizing any one of the first target operation layers; the determining unit 606 is further configured to, in a case where the accuracy of the network model is not reduced, reserve a normalization layer before the first target operation layer, and determine the first target operation layer as the second target operation layer; and under the condition that the accuracy of the network model is reduced, the quantized first target operation layer is restored to the first target operation layer before quantization, and the normalization layer before the first target operation layer is deleted.
In one embodiment of the present application, a quantization apparatus for a network model is provided, including: a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the quantization method of a network model as provided in any one of the above embodiments.
The network model quantization device provided by the application comprises a memory and a processor, and further comprises a program or instructions stored in the memory, wherein the program or instructions can realize the steps of the network model quantization method according to any one of the above embodiments when being executed by the processor, so that the network model quantization device has all the beneficial effects of the network model quantization method, and is not repeated herein.
In one embodiment of the present application, a readable storage medium is provided, on which a program or instructions are stored, which when executed by a processor, implement the steps of the quantization method of a network model or the processing method of speech data as in any of the above embodiments.
The readable storage medium provided by the present application has a program or an instruction stored thereon, and when the program or the instruction is executed by a processor, the method for quantizing a network model or the method for processing voice data according to any one of the embodiments can be implemented, so that the readable storage medium has all the advantages of the method for quantizing a network model or the method for processing voice data, which are not described herein.
In one embodiment of the present application, an electronic device is provided that includes a processor and a memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the quantization method of the network model or the processing method of the speech data as in any of the above embodiments when executed by the processor.
The electronic device provided by the application comprises a memory and a processor, and further comprises a program or instructions stored on the memory, wherein the program or instructions can realize the steps of the quantization method of the network model or the processing method of the voice data of any one of the above embodiments when executed by the processor, so that the electronic device has all the beneficial effects of the quantization method of the network model or the processing method of the voice data, which are not repeated herein.
In an embodiment of the application, a computer program product is proposed, comprising a computer program or instructions which, when executed by a processor, implement the steps of the quantization method of the network model or the processing method of the speech data of any of the above embodiments. Therefore, the computer program product has all the advantages of the quantization method of the network model or the processing method of the voice data, which are not described herein.
In an embodiment of the application, a chip is provided, which includes a program or instructions for implementing the steps of the quantization method of the network model or the processing method of the voice data of any one of the above embodiments when the chip is running. Therefore, the chip has all the beneficial effects of the quantization method of the network model or the processing method of the voice data, and will not be described herein.
Model quantization is the process of converting the weights, activation values, and the like of a trained deep neural network from high precision to low precision, for example converting 32-bit floating point numbers into 8-bit integers, while keeping the accuracy of the converted model close to that before conversion. This achieves the goals of reducing the model size, reducing the model's memory consumption, and accelerating the model's inference speed.
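As a worked illustration of this conversion, a minimal asymmetric int8 quantize/dequantize round trip in Python follows; this is a generic textbook scheme, not necessarily the exact mapping used by the application.

```python
import numpy as np

def quantize_int8(x):
    """Map float32 values onto 8-bit integers and back; a generic
    asymmetric scheme for illustration only."""
    scale = (x.max() - x.min()) / 255.0        # 256 levels -> 255 steps
    zero_point = np.round(-x.min() / scale)    # offset so x.min() maps to 0
    q = np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
    dq = (q.astype(np.float32) - zero_point) * scale  # dequantized approximation
    return q, dq

x = np.random.randn(4, 4).astype(np.float32)
q, dq = quantize_int8(x)
print(np.abs(x - dq).max())  # quantization error is small relative to the range
```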
For a model, some layers are suitable for quantization, while many layers lose considerable precision once quantized; quantizing all layers of the model directly, without analysis, causes severe accuracy loss. The application analyzes the input data distribution range, output data distribution range, and weight parameter distribution range of each layer to evaluate the quantization scale of the input data, the quantization scale of the output data, and the weight scale, thereby solving the problem of selecting layers suitable for quantization during partial quantization and reducing the model size as much as possible while ensuring precision. In some embodiments, the application reads the model data to obtain the weight data range, feeds a small amount of calibration data into the network to obtain the input/output data range of each layer, and evaluates the quantization capability of each layer by analyzing these ranges. The quantization capabilities of all layers in a model are screened and ordered, and quantization starts from the layer that is easiest to quantize, balancing model size against model precision. The technical scheme provided by the application can be applied to command word wake-up and offline command word tasks of air conditioner voice projects.
In one embodiment, as shown in fig. 15, the quantization method of the network model is as follows:
s802, reading a network model, and acquiring a weight value of each operation layer;
s804, inputting calibration data, and determining an input scaling factor and an output scaling factor of each operation layer;
s806, judging whether the input scaling factor of each operation layer is too large, if so, entering S818, otherwise, entering S808;
s808, judging whether the output scaling factor of each operation layer is too large, if so, entering S818, and if not, entering S810;
s810, sorting the remaining operation layers to be quantized according to the weight value and the input scaling factor, to obtain the quantization capability of each operation layer;
s812, quantizing each operation layer in turn according to the resulting order;
s814, judging whether the precision loss of the operation layer exceeds a preset value; if yes, entering S816, and if not, continuing to quantize the next operation layer;
s816, quantizing only the operation layers whose quantization precision loss is smaller than the preset value;
s818, determining that the operation layer does not quantize.
According to the quantization method of the network model, test data are input into the network model to generate the input data and output data of each operation layer, the input scaling factor and output scaling factor of each operation layer are determined from those data, and whether each operation layer is suitable for quantization calculation is then determined according to the input scaling factor or the output scaling factor, that is, the operation layers unsuitable for quantization are identified among all operation layers of the network model.
Further, before the remaining operation layers are quantized in sequence, the order in which they are quantized, that is, the preset order, must first be determined. Under this order, the precision of the network model decreases progressively as layers are quantized, so once the precision falls below the preset precision, the remaining unquantized operation layers can be determined to be unsuitable for quantization. This improves the efficiency of identifying the operation layers unsuitable for quantization among all remaining operation layers, that is, the efficiency of quantizing the network model.
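The S802-S818 flow can be sketched as follows; all callables are hypothetical stand-ins for framework-specific steps, not an implementation provided by the application.

```python
def partial_quantization(layers, order_factor, scale_too_large, quantize,
                         restore, evaluate, max_loss):
    """Sketch of the S802-S818 flow. All arguments are caller-supplied
    (hypothetical stand-ins): `layers` is the list of operation layers,
    `order_factor(l)` returns input scale x weight scale for layer l,
    `scale_too_large(l)` implements the S806/S808 checks, `quantize` and
    `restore` modify one layer in place, and `evaluate()` returns the
    current model accuracy."""
    candidates = [l for l in layers if not scale_too_large(l)]  # S806/S808/S818
    candidates.sort(key=order_factor)                           # S810: ascending order
    baseline = evaluate()
    for layer in candidates:                                    # S812
        quantize(layer)
        if baseline - evaluate() > max_loss:                    # S814: loss too large
            restore(layer)                                      # S816: keep only layers
            break                                               # within the loss budget
```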
The technical scheme provided by the application can be applied to different edge-side systems such as Linux/RTOS/Android/iOS, and provides instruction-level acceleration for different edge-side platforms such as armv7/v8 and DSP. The technical scheme of the application features lightweight deployment, strong universality, strong usability, and high-performance inference; it comprehensively addresses the low-resource bottleneck of intelligent devices, greatly shortens the AI model deployment cycle, and reaches an industry-leading level in the field of edge-side AI deployment. The technical scheme provided by the application can be applied to a self-developed chip, for example the FL119, the industry's first three-in-one chip supporting voice, connectivity, and display. The related achievements have comprehensively enabled the mass production and deployment of intelligent household appliances such as voice-controlled refrigerators, air conditioners, and robots.
In the description of the present application, the term "plurality" means two or more, unless explicitly defined otherwise, the orientation or positional relationship indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationship shown in the drawings, merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present application; the terms "coupled," "mounted," "secured," and the like are to be construed broadly, and may be fixedly coupled, detachably coupled, or integrally connected, for example; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present specification, the terms "one embodiment," "some embodiments," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present application, and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A method for processing voice data, comprising:
acquiring frequency domain characteristics and a network model of voice data;
processing the frequency domain features through the network model to obtain the wake-up result probability of the voice data, wherein the network model is a quantized network model generated through a quantization method of the network model;
the quantization method of the network model comprises the following steps:
determining input data and output data of each operation layer of the network model;
determining an input scaling factor of each operation layer according to the input data;
determining an output scaling factor of each operation layer according to the output data;
determining a first target operation layer from all operation layers according to the input scaling factor or the output scaling factor;
The ratio of the input scaling factor of the first target operation layer to the maximum value of the input scaling factor of the second target operation layer except the first target operation layer is larger than a first preset value, or the ratio of the output scaling factor of the first target operation layer to the maximum value of the output scaling factor of the second target operation layer is larger than a second preset value;
quantizing the second target operation layer to generate a quantized network model;
adding a normalization layer before each of the first target operational layers of the network model;
quantizing each of the first target operation layers;
after quantizing any one of the first target operation layers, acquiring the precision of the network model;
under the condition that the accuracy of the network model is not reduced, retaining the normalization layer before the first target operation layer, and determining the first target operation layer as the second target operation layer;
and under the condition that the precision of the network model is reduced, restoring the quantized first target operation layer into the first target operation layer before quantization, and deleting the normalization layer before the first target operation layer.
2. The method according to claim 1, wherein said determining an input scaling factor for each of said operation layers based on said input data comprises:
acquiring the maximum value of the input data and the minimum value of the input data;
and determining the input scaling factor according to the maximum value and the minimum value of the input data and the data capacity required by the operation layer after quantization.
3. The method according to claim 1, wherein determining an output scaling factor for each of the operation layers based on the output data comprises:
obtaining the maximum value of the output data and the minimum value of the output data;
and determining the output scaling factor according to the maximum value and the minimum value of the output data and the data capacity required by the quantized operation layer.
4. A method of processing speech data according to any one of claims 1 to 3, wherein said quantizing said second target operation layer, generating a quantized network model, comprises:
quantifying each second target operation layer in turn according to a preset sequence;
After the quantification of any one of the second target operation layers is completed, the precision of the network model is obtained;
restoring the second target operation layer after the last quantization to the second target operation layer before the quantization under the condition that the precision of the network model is smaller than the preset precision;
and determining the current network model as the quantized network model.
5. The method according to claim 4, wherein before the step of quantizing each of the second target operation layers in turn in a predetermined order, the method further comprises:
traversing each second target operation layer, and determining a weight value of each second target operation layer;
determining a sequence factor of each second target operation layer according to the weight value and the input scaling factor of the second target operation layer;
determining a preset sequence for quantizing the second target operation layer according to the sequence factor;
the preset sequence is from small to large according to the sequence factor.
6. The method according to claim 5, wherein determining the order factor of each of the second target operation layers based on the weight value and the input scaling factor of the second target operation layer comprises:
Obtaining the maximum value and the minimum value of the weight value of each second target operation layer;
determining a weight scaling factor according to the maximum value of the weight value, the minimum value of the weight value and the data capacity required by the operation layer after quantization;
and determining the sequence factor according to the weight scaling factor and the input scaling factor.
7. The method of claim 6, wherein said determining said order factor based on said weight scaling factor and said input scaling factor comprises:
determining the order factor according to a preset formula: scale = scale_in × scale_we;
wherein scale is the order factor, scale_in is the input scaling factor, and scale_we is the weight scaling factor.
8. The method according to any one of claims 1 to 4, characterized in that the determining input data and output data of each operation layer of the network model includes:
obtaining test data;
the test data is input to the network model to generate input data and output data for each operational layer of the network model.
9. A processing apparatus for voice data, comprising: a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method of processing speech data according to any one of claims 1 to 8.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the method of processing speech data according to any of claims 1 to 8.
11. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method of processing speech data according to any one of claims 1 to 8.
12. A computer program product comprising a computer program or instructions which, when executed by a processor, carries out the steps of the method of processing speech data according to any one of claims 1 to 8.
13. A chip comprising a program or instructions for implementing the steps of the method of processing speech data according to any one of claims 1 to 8 when said chip is running.