CN117787358A - Model quantization method, device and equipment based on resistive random access memory

Publication number: CN117787358A
Application number: CN202410064436.0A
Authority: CN (China)
Legal status: Pending
Applicant and assignee: Zhejiang Lab
Inventors: 高丽丽, 时拓, 张徽, 顾子熙
Original language: Chinese (zh)
Prior art keywords: model, bit width, weight, input bit, determining
Classification (landscape): Compression, Expansion, Code Conversion, And Decoders
Abstract

In the model quantization method, device and equipment based on the resistive random access memory, a compression model is constructed according to a model to be quantized, and the preset input bit width and preset weight bit width of each network layer of the compression model are determined. Training samples are then input into the compression model, an output result is determined, and a loss is determined according to the difference between the output result and the labels corresponding to the training samples. With minimizing the loss as the optimization target, the input bit width and weight bit width of each network layer of the compression model are adjusted, several groups of input bit widths and weight bit widths that meet the optimization condition are determined, and the input bit width and weight bit width corresponding to each network layer are determined according to the model precision of the compression model. On the premise that the precision of the quantized model meets the requirement, the weights of the quantized model can be accurately represented by the resistive random access memory, and the resource utilization of the resistive random access memory is increased.

Description

Model quantization method, device and equipment based on resistive random access memory
Technical Field
The present disclosure relates to the field of machine learning, and in particular, to a method, an apparatus, and a device for model quantization based on a resistive random access memory.
Background
With the rapid development of machine learning, neural network technology is widely used in various fields. The traditional architecture, in which chip memory and computation are separate, leads to high demands on interface resources in high-concurrency scenarios. At present, neural network models are therefore deployed on compute-in-memory devices to reduce the demand on interface resources. The resistive random access memory has the advantages of high working speed, controllable resistance and simple structure, and is used to alleviate the problems caused by the traditional chip architecture.
In the prior art, in order to facilitate the deployment of the neural network model, the neural network model needs to be compressed. In the model compression process, the wider the bit width of the quantized bit (bit) of the model parameters is, the higher the precision of the model is.
However, a resistive random access memory stores data by representing different values with different conductance states, so the bit width that each memory cell can store is limited. When a neural network model is deployed on a resistive random access memory, if the model parameters are mapped into the memory directly, parameters that differ from one another may be expressed as the same conductance state, which reduces the precision of the model. An urgent problem to be solved is therefore how to maximize the resource utilization of the resistive random access memory while keeping the loss of model precision small. This specification provides a model quantization method, device and equipment based on a resistive random access memory.
Disclosure of Invention
The present specification provides a model quantization method, apparatus, storage medium, and electronic device based on a resistive random access memory, so as to at least partially solve the above-mentioned problems in the prior art.
The technical solution adopted in the specification is as follows:
The specification provides a model quantization method based on a resistive random access memory, comprising:
determining a model to be quantized, a training sample, and a storage bit width of a resistive random access memory on which the model is to be deployed;
constructing a compression model according to the model to be quantized, and determining preset input bit widths and preset weight bit widths of each network layer of the compression model;
inputting the training sample into the compression model, determining an output result, and determining loss according to the difference between the output result and the corresponding label of the training sample;
adjusting the input bit width and the weight bit width of each network layer of the compression model by taking the minimum loss as an optimization target, and determining a plurality of groups of input bit widths and weight bit widths which meet an optimization condition, wherein the optimization condition comprises that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width;
calculating model precision of the model to be quantized by taking the weight bit width and the input bit width of each group as parameters according to the determined weight bit width and the input bit width;
And determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, determining the quantized model according to the determined input bit width and the weight bit width of each network layer, and deploying the quantized model on the resistive random access memory.
Optionally, constructing a compression model according to the model to be quantized specifically includes:
and determining each network layer of the model to be quantized, adding a quantization layer and an inverse quantization layer after each network layer, and adding the model to be quantized of the quantization layer and the inverse quantization layer as the compression model.
Optionally, adjusting the input bit width and the weight bit width of each network layer of the compression model, and determining a plurality of groups of input bit widths and weight bit widths meeting optimization conditions, including:
adjusting the input bit width and the weight bit width of each network layer of the compression model, and judging whether the input bit width and the weight bit width meet the optimization condition when the loss is smaller than or equal to a preset value;
if yes, storing the input bit width and the weight bit width;
if not, the input bit width and the weight bit width are not stored;
And continuously adjusting the input bit width and the weight bit width of each network layer of the compression model, and determining a plurality of groups of stored input bit widths and weight bit widths which meet the optimization conditions when training is finished.
Optionally, determining a plurality of groups of input bit widths and weight bit widths meeting the optimization condition specifically includes:
for each network layer, determining the preset model precision of the compression model and a plurality of weight bit widths meeting optimization conditions;
aiming at each weight bit width meeting the optimization condition, adjusting the input bit width corresponding to the weight bit width, and determining each input bit width;
according to the determined input bit widths, adjusting the compression model, and calculating model accuracy of the compression model corresponding to the input bit widths;
and screening the input bit width from the input bit widths with the corresponding model precision not smaller than the preset model precision, and taking the screened input bit width and the weight bit width as a group of input bit width and weight bit width which meet the optimization condition.
Optionally, determining, according to each model precision, an input bit width and a weight bit width corresponding to each network layer of the model to be quantized specifically includes:
determining the preset model precision of the compression model;
determining, from all groups of weight bit widths and input bit widths, the groups whose weight bit width is less than or equal to the storage bit width, and taking each such group of weight bit width and input bit width as a target parameter;
aiming at each target parameter, adjusting the compression model according to the target parameter, and calculating the model precision of the compression model corresponding to the target parameter;
when the corresponding target parameters with the model precision not smaller than the preset model precision exist, determining the target parameter with the minimum input bit width from the corresponding target parameters with the model precision not smaller than the preset model precision of the compression model as the corresponding input bit width and the weight bit width of each network layer;
and when no target parameter whose corresponding model precision is not smaller than the preset model precision exists, calculating the model precision of the compression model corresponding to each group of weight bit width and input bit width whose weight bit width is an integer multiple of the storage bit width, and determining the input bit width and weight bit width corresponding to each network layer according to the model precision of the compression model corresponding to each such group and the preset model precision.
Optionally, determining the quantization model according to the input bit width and the weight bit width specifically includes:
when no target parameter whose corresponding model precision is not smaller than the preset model precision exists, adjusting the input bit width and weight bit width of each network layer of the model to be quantized according to the determined input bit width and weight bit width corresponding to each network layer, and determining an intermediate model;
inputting the training sample into the intermediate model, determining an output result of the intermediate model, and determining the loss of the intermediate model according to the output result of the intermediate model and the difference between corresponding labels of the training sample;
and taking the minimum loss of the intermediate model as an optimization target, adjusting model parameters of the intermediate model according to the input bit width and the weight bit width of each network layer of the intermediate model, and determining the quantized model.
The specification provides a model quantization device based on a resistive random access memory, which comprises:
the determining module is used for determining a to-be-quantized model, a training sample and a storage bit width of a resistance random access memory of the to-be-deployed model;
the compression module is used for constructing a compression model according to the model to be quantized and determining preset input bit width and preset weight bit width of each network layer of the compression model;
The optimization module is used for inputting the training sample into the compression model, determining an output result, and determining loss according to the difference between the output result and the corresponding label of the training sample; adjusting the input bit width and the weight bit width of each network layer of the compression model by taking the minimum loss as an optimization target, and determining a plurality of groups of input bit widths and weight bit widths which meet an optimization condition, wherein the optimization condition comprises that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width;
the selection module is used for calculating the model precision of the model to be quantized by taking the weight bit width and the input bit width of each group of determined weight bit width and the input bit width as parameters; and determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, determining the quantized model according to the determined input bit width and the weight bit width of each network layer, and deploying the quantized model on the resistive random access memory.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above-described resistive random access memory-based model quantization method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described resistive memory based model quantization method when executing the program.
At least one of the above technical solutions adopted in this specification can achieve the following beneficial effects:
in the model quantization method based on the resistive random access memory, a compression model is constructed according to a model to be quantized, and the preset input bit width and preset weight bit width of each network layer of the compression model are determined. Training samples are then input into the compression model, an output result is determined, and a loss is determined according to the difference between the output result and the labels corresponding to the training samples. With minimizing the loss as the optimization target, the input bit width and weight bit width of each network layer of the compression model are adjusted, several groups of input bit widths and weight bit widths that meet the optimization condition are determined, and the input bit width and weight bit width corresponding to each network layer are determined according to the model precision of the compression model.
In this method, the adjusted input bit width and weight bit width are determined through the optimization condition, so that, on the premise that the model precision meets the requirement, the weight bit width of the quantized model is either smaller than the storage bit width of the resistive random access memory or an integer multiple of it. The weights of the model can therefore be represented by the resistive random access memory when the model is deployed on it, which increases the resource utilization of the resistive random access memory.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a schematic diagram of a resistive random access memory operating circuit according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a model quantization method flow based on a resistive random access memory according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a compression model constructed according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a model quantization apparatus based on a resistive random access memory according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of the electronic device corresponding to fig. 2 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present invention based on the embodiments herein.
With the rapid development of machine learning, the traditional architecture in which chip memory and computation are separate leads to high demands on interface resources in high-concurrency scenarios. At present, neural network models are therefore deployed on compute-in-memory devices to reduce the demand on interface resources. The resistive random access memory has the advantages of high working speed, controllable resistance and simple structure, and is used to alleviate the problems caused by the traditional chip architecture.
At present, when the model is deployed, the model is often required to be compressed first, and then the compressed model is deployed. Methods of model compression generally include weight clipping, quantization, knowledge distillation, etc., and in one or more embodiments of the present disclosure, the compression method employed is model quantization. It should be noted that, during the quantization process of the model, the bit width of the parameter of the model is quantized, that is, the bit width of the original parameter is reduced, and the bit width of the parameter often determines the precision of the quantized model, and the smaller the bit width of the parameter, the lower the precision of the model.
As shown in fig. 1, fig. 1 is a working circuit of a resistive random access memory provided in the present specification, where the straight lines are wires in the circuit, V0 to V2 are the voltage values of the corresponding wires, I0 to I2 are the current values of the corresponding wires, the arrows indicate the current and voltage directions, and the small rectangles are resistive cells. The working principle of the circuit is that the conductance state of the resistive random access memory is changed by changing conditions such as the voltage across it, thereby realizing data storage. The resistive random access memory described here is a nonvolatile memory realized by a phase-change material, which can switch between a crystalline state and an amorphous state under different current, voltage or temperature conditions, so that two conductance states exist: a high resistance in the amorphous state and a low resistance in the crystalline state. Therefore, the bit width that each memory cell can store is limited. If the model parameters are directly mapped into the resistive random access memory, partially different parameters will be expressed as the same conductance state, which reduces the precision of the model. The problem to be solved urgently is therefore how to maximize the resource utilization of the resistive random access memory on the premise that the loss of model precision is small. This specification provides a model quantization method, device and equipment based on a resistive random access memory. Fig. 2 is a schematic flow diagram of this method, which includes the following steps:
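To make the constraint concrete, the following Python sketch (an illustration of my own, not taken from the patent) shows why a weight bit width that does not exceed the storage bit width, or that is an integer multiple of it, can be represented exactly: such a weight code either fits in one memory cell or can be sliced into storage-bit-wide chunks, one chunk per cell. The function name and the low-bits-first slicing order are assumptions.

```python
def split_weight_for_rram(weight_bits: int, storage_bits: int, weight_code: int) -> list[int]:
    """Illustrative sketch: map one quantized weight code onto RRAM cells.

    Assumes (for illustration only) that a weight whose bit width is an integer
    multiple of the storage bit width is sliced into storage_bits-wide chunks,
    one chunk per cell; a narrower weight occupies a single cell.
    """
    if weight_bits <= storage_bits:
        return [weight_code]                      # one cell is enough
    if weight_bits % storage_bits != 0:
        raise ValueError("bit width violates the optimization condition")
    slices = []
    mask = (1 << storage_bits) - 1
    for _ in range(weight_bits // storage_bits):  # low slice first
        slices.append(weight_code & mask)
        weight_code >>= storage_bits
    return slices

# e.g. an 8-bit weight code 0b10110110 on 4-bit cells -> [0b0110, 0b1011]
print(split_weight_for_rram(8, 4, 0b10110110))
```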
S100: and determining the storage bit width of the resistive random access memory of the model to be quantized, the training sample and the model to be deployed.
In one or more embodiments of the present specification, there is no limitation on what specific apparatus performs the resistive memory-based model quantization method, for example, a personal computer, a mobile terminal, a server, and the like. However, since the subsequent steps involve operations such as model quantization, and the tasks that consume large computing resources are generally performed by a server, the present description will also be described later by taking the server as an example to perform the model quantization method based on the resistive random access memory. The server may be a single device or may be composed of multiple devices, for example, a distributed server, which is not limited in this specification.
After the server determines that the model to be deployed on the resistive random access memory is quantized, the model to be quantized, the training sample and the storage bit width of the resistive random access memory of the model to be deployed can be determined first, so that the bit width of parameters of the model to be quantized can be adjusted according to the storage bit width of the resistive random access memory of the model to be deployed later.
It should be noted that, in one or more embodiments of the present disclosure, the manner in which the server determines the model to be deployed on the resistive random access memory is not limited, and the subsequent operations may be started in response to a user instruction or when a preset time is reached. Of course, the manner by which the server determines the model to be quantized, the training samples, and the memory bit width of the resistive random access memory of the model to be deployed is not limited. The information may be obtained from a fixed storage or may be determined from an address contained in the user instruction.
Of course, in one or more embodiments of the present disclosure, the specific scenario of the model to be quantized is not limited, and the model may be a model for classifying pictures, a large language model, or a model for classifying social networks. For convenience of description, the model to be quantized is taken as a model for social network classification, and the corresponding training samples may be the contents of different types of social posts. The storage bit width may be determined by the resistive random access memory actually selected.
It should be noted that, the model to be quantized may be a model after training or a model after pre-training, and it is determined that the model to be quantized needs to be deployed on a resistive random access memory, which is not limited in this specification.
S102: and constructing a compression model according to the model to be quantized, and determining preset input bit widths and preset weight bit widths of each network layer of the compression model.
After the model to be quantized is determined, a corresponding compression model can be constructed according to the actual structure of the model to be quantized. Specifically, as shown in fig. 3, fig. 3 is a schematic diagram of constructing a compression model provided in the present disclosure: a quantization layer and an inverse quantization layer are added after each network layer of the original network, so as to obtain a quantization-aware model constructed from the model to be quantized. Of course, quantization of the model may also be implemented by other methods, such as channel-wise quantization, group-wise quantization, layer-wise quantization, or hybrid quantization; one or more embodiments of the present disclosure do not limit the specific quantization method of the model to be quantized.
Then, for each network layer of the model to be quantized, the input bit width and the weight bit width of each network layer are changed into a preset input bit width and a preset weight bit width respectively. Of course, the input bit width and the weight bit width of each network layer of the model to be quantized can be initialized, so that the stability and the accuracy of the model are improved. Of course, whether the preset value is determined or the input bit width and the weight bit width of each network layer of the model to be quantized are initialized, the input bit width and the weight bit width are determined according to the actual model.
For convenience of description, the network layers mentioned later refer to the network layers of the model to be quantized; the quantization layers and inverse quantization layers added in the compression model are not separately described in the embodiments of the present specification.
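The construction described above can be illustrated with a small PyTorch-style sketch. Everything below is an assumption made for illustration only: the class and function names, the restriction to Linear layers, the symmetric per-tensor quantization scheme, and the straight-through estimator; the patent itself does not prescribe an implementation.

```python
import torch
import torch.nn as nn

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize then immediately dequantize x to 'bits' bits (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    return x + (q - x).detach()   # straight-through estimator so gradients still flow

class QuantWrappedLayer(nn.Module):
    """A network layer followed by quantization/inverse quantization of its input and weight."""
    def __init__(self, layer: nn.Linear, input_bits: int = 8, weight_bits: int = 8):
        super().__init__()
        self.layer = layer
        self.input_bits = input_bits      # preset input bit width of this layer
        self.weight_bits = weight_bits    # preset weight bit width of this layer

    def forward(self, x):
        x = fake_quant(x, self.input_bits)
        w = fake_quant(self.layer.weight, self.weight_bits)
        return nn.functional.linear(x, w, self.layer.bias)

def build_compression_model(model_to_quantize: nn.Module) -> nn.Module:
    """Replace each Linear layer by its quant-wrapped counterpart (illustrative only)."""
    for name, child in model_to_quantize.named_children():
        if isinstance(child, nn.Linear):
            setattr(model_to_quantize, name, QuantWrappedLayer(child))
        else:
            build_compression_model(child)
    return model_to_quantize
```

A compression model built this way exposes input_bits and weight_bits as per-layer attributes, which is the handle the later steps need when adjusting bit widths.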
S104: and inputting the training sample into the compression model, determining an output result, and determining loss according to the difference between the output result and the corresponding label of the training sample.
S106: and adjusting the input bit width and the weight bit width of each network layer of the compression model by taking the minimum loss as an optimization target, and determining a plurality of groups of input bit widths and weight bit widths which meet an optimization condition, wherein the optimization condition comprises that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width.
After the compression model is determined, model training also needs to be performed on the compression model in order to ensure that the accuracy of the compressed model is greater than the preset model precision. Specifically, the training sample is input into the compression model, an output result is determined, and the loss is determined according to the difference between the output result and the label corresponding to the training sample.
With the minimum loss as the optimization target, the parameters of the compression model and the input bit width and weight bit width of each network layer are adjusted, so that the model to be quantized does not lose much model precision even though it is compressed.
Of course, in order that the parameters of the quantized model can be accurately represented by the resistive random access memory, when the input bit width and the weight bit width of each network layer are adjusted, the determined weight bit width should also satisfy that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width.
Specifically, parameters of the compression model, input bit width and weight bit width of each network layer are adjusted, and when the loss is smaller than a preset value, whether the weight bit width at the moment meets an optimization condition is judged, namely, the adjusted weight bit width is smaller than the storage bit width or the weight bit width is integral multiple of the storage bit width. If yes, the weight bit width at the moment is reserved or stored, and the input bit width is reserved or stored. If not, the weight bit width at the moment is not reserved. And continuously adjusting parameters of the compression model, input bit widths and weight bit widths of all network layers, and continuously storing the input bit widths and the weight bit widths which meet the optimization conditions under the condition that the loss is not more than a preset value. And determining a plurality of groups of input bit widths and weight bit widths meeting the optimization conditions.
In addition, when determining the input bit width, for each network layer, the preset model precision of the compression model and a plurality of weight bit widths meeting the optimization conditions are determined. Then, aiming at the weight bit width meeting the optimization condition, the weight bit width is kept unchanged, and the input bit width is adjusted. And determining the input bit width when the model precision is highest, and taking the input bit width and the weight bit width at the moment as a group.
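A minimal sketch of how qualifying groups might be collected is given below. It treats the adjustment as evaluating an externally supplied schedule of candidate per-layer settings rather than a learned search, and the helper names, the data structures and the loss threshold parameter are all assumptions for illustration; the patent leaves the concrete procedure open.

```python
import torch
import torch.nn.functional as F

def meets_condition(weight_bits: int, storage_bits: int) -> bool:
    """Optimization condition of S106: weight bit width below the storage bit width,
    or an integer multiple of it."""
    return weight_bits < storage_bits or weight_bits % storage_bits == 0

def collect_candidates(compression_model, quant_layers, train_loader,
                       storage_bits, loss_threshold, bit_width_schedule):
    """Try each candidate per-layer (input_bits, weight_bits) assignment and keep
    those whose loss stays within the preset value and whose weight bit widths
    satisfy the optimization condition (hypothetical search driver)."""
    candidates = []
    for setting in bit_width_schedule:             # e.g. [{'fc1': (8, 4), ...}, ...]
        for name, layer in quant_layers.items():   # quant_layers: name -> QuantWrappedLayer
            layer.input_bits, layer.weight_bits = setting[name]
        losses = []
        with torch.no_grad():
            for x, y in train_loader:
                losses.append(F.cross_entropy(compression_model(x), y))
        loss = torch.stack(losses).mean().item()
        ok = all(meets_condition(w_bits, storage_bits)
                 for (_, w_bits) in setting.values())
        if loss <= loss_threshold and ok:
            candidates.append(setting)             # store this qualifying group
    return candidates
```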
S108: and calculating the model precision of the model to be quantized by taking the weight bit width and the input bit width of each group as parameters aiming at the determined weight bit width and the input bit width of each group.
S110: and determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, determining the quantized model according to the determined input bit width and the weight bit width of each network layer, and deploying the quantized model on the resistive random access memory.
After multiple groups of input bit widths and weight bit widths are determined, in order to select the most suitable input bit width and weight bit width from all the groups, the precision of the compression model under each group is calculated, and the input bit width and weight bit width corresponding to the model to be quantized are then determined according to the model precision corresponding to each group of input bit width and weight bit width.
Specifically, after the preset model precision of the compression model is determined, the groups whose weight bit width is less than or equal to the storage bit width are determined from all the groups of weight bit widths and input bit widths, and each such group of weight bit width and input bit width is taken as a target parameter.
And aiming at each target parameter, adjusting the compression model according to the target parameter, and calculating the model precision of the compression model corresponding to the target parameter. When the corresponding target parameters with the model precision not smaller than the preset model precision exist, determining the target parameter with the minimum input bit width from the corresponding target parameters with the model precision not smaller than the preset model precision as the corresponding input bit width and weight bit width of each network layer;
When no target parameter whose corresponding model precision is not smaller than the preset model precision exists, the model precision of the compression model is calculated for each group of weight bit width and input bit width whose weight bit width is an integer multiple of the storage bit width, and the input bit width and weight bit width corresponding to each network layer are determined according to the model precision of the compression model corresponding to each such group and the preset model precision.
And then determining a quantized model according to the determined input bit width and the weight bit width corresponding to each network layer, and disposing the quantized model on a resistive random access memory.
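The selection logic of steps S108 to S110 could be sketched as follows. The representation of a group as a per-layer dictionary, the evaluate_accuracy callable and the use of the summed per-layer input bit widths as the "minimum input bit width" criterion are assumptions of this sketch, not details fixed by the patent.

```python
def total_input_bits(setting):
    """Sum of per-layer input bit widths, used here as the minimum-input-bit-width criterion."""
    return sum(i_bits for (i_bits, _) in setting.values())

def select_bit_widths(candidates, storage_bits, preset_accuracy, evaluate_accuracy):
    """Pick the per-layer (input_bits, weight_bits) assignment following S108-S110.

    candidates:        qualifying groups from the bit-width search
    evaluate_accuracy: callable(setting) -> model precision of the compression model
    """
    # Target parameters: groups whose every weight bit width fits in one memory cell.
    small = [s for s in candidates
             if all(w <= storage_bits for (_, w) in s.values())]
    good = [s for s in small if evaluate_accuracy(s) >= preset_accuracy]
    if good:
        return min(good, key=total_input_bits)

    # Fallback: groups whose weight bit widths are integer multiples of the storage bit width.
    multiple = [s for s in candidates
                if all(w % storage_bits == 0 for (_, w) in s.values())]
    good = [s for s in multiple if evaluate_accuracy(s) >= preset_accuracy]
    if good:
        return min(good, key=total_input_bits)
    return None   # no assignment reaches the preset model precision
```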
Based on the model quantization method based on the resistive random access memory shown in fig. 2, a compression model is constructed according to the model to be quantized, the preset input bit width and preset weight bit width of each network layer of the compression model are determined, training samples are then input into the compression model, an output result is determined, and a loss is determined according to the difference between the output result and the labels corresponding to the training samples. With minimizing the loss as the optimization target, the input bit width and weight bit width of each network layer of the compression model are adjusted, several groups of input bit widths and weight bit widths that meet the optimization condition are determined, and the input bit width and weight bit width corresponding to each network layer are determined according to the model precision of the compression model.
In this method, the adjusted input bit width and weight bit width are determined through the optimization condition, so that, on the premise that the model precision meets the requirement, the weight bit width of the quantized model is either smaller than the storage bit width of the resistive random access memory or an integer multiple of it. The weights of the model can therefore be represented by the resistive random access memory when the model is deployed on it, which increases the resource utilization of the resistive random access memory.
When determining the input bit width in step S106, the input bit width corresponding to each weight bit width meeting the optimization condition may be adjusted, each input bit width is determined, the compression model is adjusted according to each determined input bit width, and the model accuracy of the compression model corresponding to each input bit width is calculated. And screening the input bit width from the input bit widths with the corresponding model precision not smaller than the preset model precision, and taking the screened input bit width and the weight bit width as a group of input bit width and weight bit width which meet the optimization condition.
In step S108, if, among the determined groups of weight bit widths and input bit widths whose weight bit width is less than or equal to the storage bit width, there is no group whose model precision reaches the preset model precision, the model precision of the compression model is calculated for each group whose weight bit width is an integer multiple of the storage bit width, and the groups whose model precision is not smaller than the preset model precision are screened out from them. From the screened groups, the group with the smallest input bit width is determined and taken as the input bit width and weight bit width corresponding to each network layer.
It should be noted that, when no target parameter whose model precision is not smaller than the preset model precision exists, the determined weight bit width is an integer multiple of the storage bit width. In this case, to further ensure the precision of the compressed model, the input bit width and weight bit width of each network layer of the model to be quantized are adjusted according to the determined input bit width and weight bit width of each network layer, and an intermediate model is determined.
A training sample is then input into the intermediate model, the output result of the intermediate model is determined, and the loss of the intermediate model is determined according to the difference between the output result of the intermediate model and the label corresponding to the training sample. With the minimum loss of the intermediate model as the optimization target, the input bit width and weight bit width of each network layer of the intermediate model are kept fixed, and the values of the other model parameters of the intermediate model are adjusted to determine the quantized model.
That is, the training sample is input into the intermediate model, its output is determined, the loss is determined according to the difference between the output and the label corresponding to the training sample, and, with the minimum loss as the optimization target, the model parameters other than the input bit width and the weight bit width are adjusted to determine the quantized model.
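A hedged sketch of this fine-tuning stage is shown below: since the bit widths are plain attributes rather than trainable parameters, only the remaining model parameters are updated. The optimizer choice, learning rate and epoch count are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def finetune_intermediate_model(intermediate_model, train_loader,
                                epochs: int = 3, lr: float = 1e-4):
    """Keep every layer's input/weight bit width fixed and adjust only the
    remaining model parameters, with the loss as the optimization target."""
    optimizer = torch.optim.Adam(intermediate_model.parameters(), lr=lr)
    intermediate_model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(intermediate_model(x), y)   # difference to labels
            loss.backward()
            optimizer.step()
    return intermediate_model    # this becomes the quantized model
```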
In addition, based on this idea, after the compression model is constructed and before the input bit width and the weight bit width are determined, an optimization function can be determined according to the loss and the optimization condition, and the weight bit widths and input bit widths that meet the condition can be determined through the optimization function. For example:

$$\mathrm{Cost} = \mathrm{CrossEntropyLoss} + \sum_{l=1}^{L} \mathrm{bit}_{\mathrm{input}}^{l} + \sum_{l=1}^{L} P\big(\mathrm{bit}_{\mathrm{weight}}^{l}, \mathrm{bit}_{\mathrm{rram}}\big)$$

where $\mathrm{bit}_{\mathrm{input}}$ represents the input bit width, $\mathrm{bit}_{\mathrm{weight}}$ represents the weight bit width, $L$ represents the number of network layers, $\mathrm{bit}_{\mathrm{rram}}$ represents the storage bit width, $\mathrm{CrossEntropyLoss}$ is the cross entropy between the model output and the labels of the training samples, the second term drives the determination of the minimum input bit width, and the third term expresses that the determined weight bit width is less than or equal to the storage bit width or is an integer multiple of the storage bit width (its exact form is not limited here). With the minimum $\mathrm{Cost}$ as the optimization target, several groups of $\mathrm{bit}_{\mathrm{weight}}$ and $\mathrm{bit}_{\mathrm{input}}$ are determined. Furthermore, the optimization function can also be determined as

$$\mathrm{Cost} = \alpha\,\mathrm{CrossEntropyLoss} + \beta \sum_{l=1}^{L} \mathrm{bit}_{\mathrm{input}}^{l} + \gamma \sum_{l=1}^{L} P\big(\mathrm{bit}_{\mathrm{weight}}^{l}, \mathrm{bit}_{\mathrm{rram}}\big)$$

where $\alpha$, $\beta$ and $\gamma$ are three hyper-parameter weights, and the most suitable weights can be determined by grid search. Of course, the above formulas are only one embodiment, and this specification is not limited thereto.
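Read literally from the definitions in that paragraph, the weighted cost could be computed as in the following sketch; the concrete penalty term used for the bit-width condition is an assumption, since the specification leaves its exact form open.

```python
def bit_width_cost(ce_loss: float, input_bits: list, weight_bits: list,
                   storage_bits: int, alpha: float = 1.0, beta: float = 0.01,
                   gamma: float = 1.0) -> float:
    """Cost = alpha*CrossEntropyLoss + beta*sum of per-layer input bit widths
              + gamma*penalty on weight bit widths violating the condition."""
    bit_term = sum(input_bits)                       # drives input bit widths down
    penalty = sum(0.0 if (w <= storage_bits or w % storage_bits == 0) else 1.0
                  for w in weight_bits)              # zero when the condition holds
    return alpha * ce_loss + beta * bit_term + gamma * penalty
```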
The above method for quantifying a model based on a resistive random access memory according to one or more embodiments of the present disclosure is based on the same concept, and the present disclosure further provides a corresponding device for quantifying a model based on a resistive random access memory, as shown in fig. 4.
Fig. 4 is a schematic diagram of a model quantization apparatus based on a resistive random access memory provided in the present specification, where the apparatus is used to perform the model quantization method based on a resistive random access memory shown in fig. 2.
The determining module 400 is configured to determine a storage bit width of the resistive random access memory of the model to be quantized, the training sample, and the model to be deployed.
The compression module 401 is configured to construct a compression model according to the model to be quantized, and determine a preset input bit width and a preset weight bit width of each network layer of the compression model.
The optimizing module 402 is configured to input the training sample into the compression model, determine an output result, and determine a loss according to a difference between the output result and a label corresponding to the training sample; and adjusting the input bit width and the weight bit width of each network layer of the compression model by taking the minimum loss as an optimization target, and determining a plurality of groups of input bit widths and weight bit widths which meet an optimization condition, wherein the optimization condition comprises that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width.
A selection module 403, configured to calculate, for each determined set of weight bit width and input bit width, a model precision of the model to be quantized when the set of weight bit width and input bit width is used as parameters; and determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, determining the quantized model according to the determined input bit width and the weight bit width of each network layer, and deploying the quantized model on the resistive random access memory.
Optionally, the compression module 401 is specifically configured to determine each network layer of the model to be quantized, add a quantization layer and an inverse quantization layer after each network layer, and add the model to be quantized of the quantization layer and the inverse quantization layer as the compression model.
Optionally, the optimizing module 402 is specifically configured to adjust the input bit width and the weight bit width of each network layer of the compression model, and determine whether the input bit width and the weight bit width meet the optimizing condition when the loss is less than or equal to a preset value; if yes, storing the input bit width and the weight bit width; if not, the input bit width and the weight bit width are not stored; and continuously adjusting the input bit width and the weight bit width of each network layer of the compression model, and determining a plurality of groups of stored input bit widths and weight bit widths which meet the optimization conditions when training is finished.
Optionally, an optimization module 402, specifically configured to determine, for each network layer, a preset model precision of the compression model and a number of weight bit widths that meet an optimization condition; aiming at each weight bit width meeting the optimization condition, adjusting the input bit width corresponding to the weight bit width, and determining each input bit width; according to the determined input bit widths, adjusting the compression model, and calculating model accuracy of the compression model corresponding to the input bit widths; and screening the input bit width from the input bit widths with the corresponding model precision not smaller than the preset model precision, and taking the screened input bit width and the weight bit width as a group of input bit width and weight bit width which meet the optimization condition.
Optionally, the selection module 403 is specifically configured to determine a preset model precision of the compression model; determine, from all groups of weight bit widths and input bit widths, the groups whose weight bit width is less than or equal to the storage bit width, and take each such group of weight bit width and input bit width as a target parameter; for each target parameter, adjust the compression model according to the target parameter, and calculate the model precision of the compression model corresponding to the target parameter; when target parameters whose corresponding model precision is not smaller than the preset model precision exist, determine, from those target parameters, the target parameter with the minimum input bit width as the input bit width and weight bit width corresponding to each network layer; and when no target parameter whose corresponding model precision is not smaller than the preset model precision exists, calculate the model precision of the compression model corresponding to each group of weight bit width and input bit width whose weight bit width is an integer multiple of the storage bit width, and determine the input bit width and weight bit width corresponding to each network layer according to the model precision of the compression model corresponding to each such group and the preset model precision.
Optionally, the selection module 403 is specifically configured to calculate model precision of the compression model corresponding to each set of weight bit widths and input bit widths with weight bit widths that are integer multiples of the storage bit width; screening the weight bit width and the input bit width of which the model precision is not less than the preset model precision from the weight bit width and the input bit width of each group of which the weight bit width is integral multiple of the storage bit width; and determining a group of weight bit width and input bit width with the minimum input bit width from the filtered weight bit width and input bit width, and taking the group of weight bit width and input bit width as the input bit width and the weight bit width corresponding to each network layer.
Optionally, the apparatus further includes a quantization module 404, configured to, when no target parameter whose corresponding model precision is not smaller than the preset model precision exists, adjust the input bit width and weight bit width of each network layer of the model to be quantized according to the determined input bit width and weight bit width of each network layer, and determine an intermediate model; input the training sample into the intermediate model, determine the output result of the intermediate model, and determine the loss of the intermediate model according to the difference between the output result of the intermediate model and the label corresponding to the training sample; and, with the minimum loss of the intermediate model as the optimization target, adjust the model parameters of the intermediate model while keeping the input bit width and weight bit width of each network layer of the intermediate model fixed, and determine the quantized model.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 5. At the hardware level, as shown in fig. 5, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and of course may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the model quantization method based on the resistive random access memory shown in fig. 2 above. Of course, in addition to software implementations, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or logic devices.
In the 1990s, improvements to a technology could be clearly distinguished as improvements in hardware (for example, improvements to circuit structures such as diodes, transistors and switches) or improvements in software (improvements to a method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by hardware entity modules. For example, a programmable logic device (Programmable Logic Device, PLD) (such as a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained merely by slightly logically programming the method flow into an integrated circuit using the above hardware description languages.
The controller may be implemented in any suitable manner, for example, the controller may take the form of, for example, a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable logic controllers, and embedded microcontrollers, examples of which include, but are not limited to, the following microcontrollers: ARC 625D, atmel AT91SAM, microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and means for performing various functions included therein may also be regarded as structures within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner, and identical or similar parts of the embodiments may be referred to one another; each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, reference may be made to the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present invention.

Claims (10)

1. A resistive random access memory-based model quantization method, comprising:
determining a model to be quantized, a training sample, and a storage bit width of a resistive random access memory on which the model is to be deployed;
constructing a compression model according to the model to be quantized, and determining preset input bit widths and preset weight bit widths of each network layer of the compression model;
inputting the training sample into the compression model, determining an output result, and determining loss according to the difference between the output result and the corresponding label of the training sample;
adjusting the input bit width and the weight bit width of each network layer of the compression model by taking the minimum loss as an optimization target, and determining a plurality of groups of input bit widths and weight bit widths which meet an optimization condition, wherein the optimization condition comprises that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width;
for each determined group of weight bit width and input bit width, calculating a model precision of the model to be quantized with the weight bit width and the input bit width of the group as parameters;
and determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, determining the quantized model according to the determined input bit width and the weight bit width of each network layer, and deploying the quantized model on the resistive random access memory.
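As a rough, non-normative illustration of the optimization condition recited in claim 1, the Python predicate below checks whether a weight bit width is smaller than the storage bit width or an integer multiple of it; the function name and the example storage bit width of 4 are assumptions made for illustration only and are not taken from this disclosure.

    def meets_optimization_condition(weight_bits: int, storage_bits: int) -> bool:
        # The adjusted weight bit width is smaller than the storage bit width,
        # or it is an integer multiple of the storage bit width.
        return weight_bits < storage_bits or weight_bits % storage_bits == 0

    # With a storage bit width of 4: 2 and 3 qualify (smaller than 4), and 4, 8 and 12
    # qualify (integer multiples of 4), while 5, 6 and 7 do not.
    assert meets_optimization_condition(3, 4)
    assert meets_optimization_condition(8, 4)
    assert not meets_optimization_condition(6, 4)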
2. The method of claim 1, wherein constructing a compression model from the model to be quantized specifically comprises:
determining each network layer of the model to be quantized, adding a quantization layer and an inverse quantization layer after each network layer, and taking the model to be quantized with the added quantization layers and inverse quantization layers as the compression model.
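A minimal sketch of the structure described in claim 2, assuming a PyTorch-style implementation and a symmetric per-tensor quantization scheme (both assumptions of this sketch, not requirements of the claim): a quantization-plus-inverse-quantization step follows each network layer, and a straight-through estimator keeps the model trainable. Quantization of the layer weights themselves can be handled analogously and is omitted here for brevity.

    import torch
    import torch.nn as nn

    class QuantDequant(nn.Module):
        # Quantizes its input to a signed grid of `bits` bits and immediately
        # dequantizes back to floating point (quantization layer + inverse quantization layer).
        def __init__(self, bits: int):
            super().__init__()
            self.bits = bits

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            qmax = 2 ** (self.bits - 1) - 1
            scale = x.detach().abs().max().clamp(min=1e-8) / qmax
            q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
            # Straight-through estimator: quantized values in the forward pass,
            # identity gradient in the backward pass.
            return x + (q - x).detach()

    def build_compression_model(input_bits_per_layer):
        # Toy two-layer network; a QuantDequant step follows each network layer.
        return nn.Sequential(
            nn.Linear(16, 32), QuantDequant(input_bits_per_layer[0]), nn.ReLU(),
            nn.Linear(32, 10), QuantDequant(input_bits_per_layer[1]),
        )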
3. The method of claim 1, wherein adjusting the input bit width and the weight bit width of each network layer of the compression model and determining sets of input bit widths and weight bit widths that meet optimization conditions comprises:
adjusting the input bit width and the weight bit width of each network layer of the compression model, and judging whether the input bit width and the weight bit width meet the optimization condition when the loss is not more than a preset value;
if yes, storing the input bit width and the weight bit width;
if not, the input bit width and the weight bit width are not stored;
and continuously adjusting the input bit width and the weight bit width of each network layer of the compression model, and determining a plurality of groups of stored input bit widths and weight bit widths which meet the optimization conditions when training is finished.
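The bookkeeping of claim 3 can be pictured as follows. The claim does not specify how the bit widths are adjusted, so this sketch simply samples candidate per-layer (input bit width, weight bit width) pairs at random; the loss evaluation is a caller-supplied callback, and the loss threshold, candidate bit widths and step count are illustrative assumptions.

    import random

    def meets_optimization_condition(weight_bits, storage_bits):
        # Repeated from the sketch after claim 1.
        return weight_bits < storage_bits or weight_bits % storage_bits == 0

    def search_bit_widths(evaluate_loss, num_layers, storage_bits,
                          loss_threshold=0.5, steps=200, seed=0):
        rng = random.Random(seed)
        kept = []  # stored groups of per-layer (input_bits, weight_bits)
        for _ in range(steps):
            config = [(rng.choice([2, 4, 6, 8, 16]), rng.choice([2, 4, 6, 8, 16]))
                      for _ in range(num_layers)]
            loss = evaluate_loss(config)  # loss of the compression model under this config
            if loss <= loss_threshold and all(
                    meets_optimization_condition(w, storage_bits) for _, w in config):
                kept.append(config)  # store: loss small enough and condition met
        return kept  # at the end of training, the stored groups are returned

    # Example with a dummy loss that decreases as the total bit width grows:
    groups = search_bit_widths(lambda cfg: 1.0 / sum(i + w for i, w in cfg),
                               num_layers=2, storage_bits=4)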
4. The method of claim 1, wherein determining a plurality of sets of input bit widths and weight bit widths that meet an optimization condition, comprises:
for each network layer, determining the preset model precision of the compression model and a plurality of weight bit widths meeting optimization conditions;
for each weight bit width meeting the optimization condition, adjusting the input bit width corresponding to the weight bit width, and determining each input bit width;
according to the determined input bit widths, adjusting the compression model, and calculating the model precision of the compression model corresponding to each input bit width;
and screening the input bit width from the input bit widths with the corresponding model precision not smaller than the preset model precision, and taking the screened input bit width and the weight bit width as a group of input bit width and weight bit width which meet the optimization condition.
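A compact sketch of the per-layer screening in claim 4: for one weight bit width that already meets the optimization condition, several candidate input bit widths are tried, and only those whose resulting model precision is not smaller than the preset model precision are kept. The precision evaluation is a caller-supplied callback, and the candidate values and toy precision model are assumptions of this sketch.

    def screen_input_bit_widths(weight_bits, candidate_input_bits,
                                eval_precision, preset_precision):
        kept = []
        for input_bits in candidate_input_bits:
            # Adjust the compression model to (input_bits, weight_bits) and measure precision.
            precision = eval_precision(input_bits, weight_bits)
            if precision >= preset_precision:
                kept.append((input_bits, weight_bits))  # a group meeting the optimization condition
        return kept

    # Example with a toy precision model that improves with wider inputs:
    groups = screen_input_bit_widths(
        weight_bits=4, candidate_input_bits=[2, 4, 8],
        eval_precision=lambda i, w: 0.80 + 0.02 * i, preset_precision=0.90)
    # -> [(8, 4)], since only an 8-bit input reaches the 0.90 threshold here.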
5. The method of claim 1, wherein determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, specifically comprises:
determining the preset model precision of the compression model;
determining, from the groups of weight bit widths and input bit widths, each group whose weight bit width is smaller than or equal to the storage bit width, each such group serving as a target parameter;
for each target parameter, adjusting the compression model according to the target parameter, and calculating the model precision of the compression model corresponding to the target parameter;
when there is a target parameter for which the model precision of the compression model is not smaller than the preset model precision, determining, from among such target parameters, the target parameter with the minimum input bit width as the input bit width and the weight bit width corresponding to each network layer;
and when there is no target parameter for which the model precision is not smaller than the preset model precision, calculating the model precision of the compression model corresponding to each group of weight bit width and input bit width in which the weight bit width is an integer multiple of the storage bit width, and determining the input bit width and the weight bit width corresponding to each network layer according to the model precision of the compression model corresponding to each such group and the preset model precision.
6. The method of claim 5, wherein determining the input bit width and the weight bit width corresponding to each network layer according to the model precision of the compression model corresponding to each group of weight bit width and input bit width and the preset model precision specifically comprises:
calculating model precision of the compression model corresponding to each group of weight bit widths and input bit widths, wherein the weight bit widths are integer multiples of the storage bit width;
screening, from the groups whose weight bit width is an integer multiple of the storage bit width, the groups of weight bit width and input bit width whose model precision is not smaller than the preset model precision;
and determining, from the screened groups, the group of weight bit width and input bit width with the minimum input bit width, and taking that group as the input bit width and the weight bit width corresponding to each network layer.
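Taken together, claims 5 and 6 describe a two-stage selection, sketched below: first restrict attention to the groups whose weight bit width does not exceed the storage bit width and, among those reaching the preset model precision, take the one with the smallest input bit width; if none reaches it, fall back to the groups whose weight bit width is an integer multiple of the storage bit width and select in the same way. The precision evaluation is again a caller-supplied callback assumed by this sketch.

    def select_bit_widths(groups, storage_bits, eval_precision, preset_precision):
        # Stage 1 (claim 5): target parameters are groups with weight bit width <= storage bit width.
        targets = [(i, w) for i, w in groups if w <= storage_bits]
        qualified = [(i, w) for i, w in targets if eval_precision(i, w) >= preset_precision]
        if qualified:
            return min(qualified, key=lambda g: g[0])  # minimum input bit width

        # Stage 2 (claim 6): fall back to weight bit widths that are integer
        # multiples of the storage bit width.
        multiples = [(i, w) for i, w in groups if w % storage_bits == 0]
        qualified = [(i, w) for i, w in multiples if eval_precision(i, w) >= preset_precision]
        return min(qualified, key=lambda g: g[0]) if qualified else None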
7. The method of claim 5, wherein determining the quantization model based on the input bit width and the weight bit width, comprises:
when there is no target parameter for which the model precision is not smaller than the preset model precision, adjusting the input bit width and the weight bit width of each network layer of the compression model according to the determined input bit width and weight bit width corresponding to each network layer, and determining an intermediate model;
inputting the training sample into the intermediate model, determining an output result of the intermediate model, and determining the loss of the intermediate model according to the output result of the intermediate model and the difference between corresponding labels of the training sample;
and taking the minimum loss of the intermediate model as an optimization target, adjusting model parameters of the intermediate model according to the input bit width and the weight bit width of each network layer of the intermediate model, and determining the quantized model.
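The fine-tuning step of claim 7 corresponds to ordinary training with the chosen bit widths held fixed, roughly as sketched below in PyTorch; the optimizer, learning rate, epoch count and loss function are assumptions of this sketch, and intermediate_model stands for any module whose layers already apply the selected input and weight bit widths (for instance, the compression model from the sketch after claim 2).

    import torch
    import torch.nn as nn

    def finetune(intermediate_model: nn.Module, samples: torch.Tensor,
                 labels: torch.Tensor, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
        optimizer = torch.optim.Adam(intermediate_model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()  # loss from the difference between outputs and labels
        for _ in range(epochs):          # labels are expected as class-index integers
            optimizer.zero_grad()
            loss = loss_fn(intermediate_model(samples), labels)  # minimum-loss objective
            loss.backward()
            optimizer.step()
        return intermediate_model  # the quantized model after fine-tuning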
8. A resistive random access memory-based model quantization apparatus, comprising:
the determining module is used for determining a model to be quantized, a training sample, and a storage bit width of a resistive random access memory on which the model is to be deployed;
the compression module is used for constructing a compression model according to the model to be quantized and determining preset input bit width and preset weight bit width of each network layer of the compression model;
the optimization module is used for inputting the training sample into the compression model, determining an output result, and determining loss according to the difference between the output result and the corresponding label of the training sample; adjusting the input bit width and the weight bit width of each network layer of the compression model by taking the minimum loss as an optimization target, and determining a plurality of groups of input bit widths and weight bit widths which meet an optimization condition, wherein the optimization condition comprises that the adjusted weight bit width is smaller than the storage bit width or the weight bit width is an integer multiple of the storage bit width;
the selection module is used for calculating, for each determined group of weight bit width and input bit width, the model precision of the model to be quantized with the weight bit width and the input bit width of the group as parameters; and determining the input bit width and the weight bit width corresponding to each network layer of the model to be quantized according to each model precision, determining the quantized model according to the determined input bit width and weight bit width of each network layer, and deploying the quantized model on the resistive random access memory.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1-7 when executing the program.
CN202410064436.0A 2024-01-16 2024-01-16 Model quantization method, device and equipment based on resistive random access memory Pending CN117787358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410064436.0A CN117787358A (en) 2024-01-16 2024-01-16 Model quantization method, device and equipment based on resistive random access memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410064436.0A CN117787358A (en) 2024-01-16 2024-01-16 Model quantization method, device and equipment based on resistive random access memory

Publications (1)

Publication Number Publication Date
CN117787358A true CN117787358A (en) 2024-03-29

Family

ID=90385659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410064436.0A Pending CN117787358A (en) 2024-01-16 2024-01-16 Model quantization method, device and equipment based on resistive random access memory

Country Status (1)

Country Link
CN (1) CN117787358A (en)

Similar Documents

Publication Publication Date Title
CN111930486B (en) Task selection data processing method, device, equipment and storage medium
CN110134668B (en) Data migration method, device and equipment applied to blockchain
CN115828162B (en) Classification model training method and device, storage medium and electronic equipment
CN110033093A (en) Hyper parameter determines method, device and equipment
CN116821647B (en) Optimization method, device and equipment for data annotation based on sample deviation evaluation
CN115543945B (en) Model compression method and device, storage medium and electronic equipment
CN117787358A (en) Model quantization method, device and equipment based on resistive random access memory
CN109614609B (en) Model building method and device
CN116304212A (en) Data processing system, method, equipment and storage medium
CN116109008B (en) Method and device for executing service, storage medium and electronic equipment
CN116881724B (en) Sample labeling method, device and equipment
CN117407124B (en) Service execution method based on constructed data arrangement strategy generation model
CN116089434B (en) Data storage method and device, storage medium and electronic equipment
CN112434486B (en) Automatic paging method, device, equipment and medium for device in PCB design
CN117348999B (en) Service execution system and service execution method
CN117201334B (en) Multi-mode network traffic prediction method and device
CN118115288A (en) Anti-fraud prediction method, device and equipment
CN117933707A (en) Wind control model interpretation method and device, storage medium and electronic equipment
CN116363418A (en) Method and device for training classification model, storage medium and electronic equipment
CN117973869A (en) Business wind control method and device, electronic equipment and storage medium
CN117591130A (en) Model deployment method and device, storage medium and electronic equipment
CN105630754A (en) Method and device for controlling document conversion
CN115880527A (en) Model compression method and device, storage medium and electronic equipment
CN116091101A (en) Information recommendation method and device, storage medium and electronic equipment
CN116384472A (en) Data processing system, method, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination