CN111832709A - Mixed quantization method of operation data and related product - Google Patents

Mixed quantization method of operation data and related product

Info

Publication number
CN111832709A
CN111832709A (application CN201910306477.5A)
Authority
CN
China
Prior art keywords
data
quantization
neural network
groups
quantization precision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910306477.5A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910306477.5A priority Critical patent/CN111832709A/en
Priority to PCT/CN2020/084943 priority patent/WO2020211783A1/en
Publication of CN111832709A publication Critical patent/CN111832709A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/046Forward inferencing; Production systems

Abstract

The application provides a mixed quantization method of operation data and a related product. The method comprises performing mixed quantization processing on the data operated on by a neural network and performing operations on the processed data.

Description

Mixed quantization method of operation data and related product
Technical Field
The present application relates to the field of neural networks, and in particular, to a hybrid quantization method for operation data and a related product.
Background
Artificial Neural Networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is also often referred to directly as a neural network or neural-like network. A neural network is an operational model formed by connecting a large number of nodes (or neurons). Existing neural network operation relies on a central processing unit (CPU) or a graphics processing unit (GPU) to realize inference (i.e., the forward operation) of the neural network, and such inference involves a large amount of computation and high power consumption.
Disclosure of Invention
The embodiments of the present application provide a mixed quantization method of operation data and a related product, which can reduce the computational load of inference and training, increase the processing speed of a computing chip, and improve efficiency.
In a first aspect, a hybrid quantization method of operation data is provided. The method is applied to an artificial intelligence processor and comprises the following steps:
determining operation data; acquiring a quantization command, wherein the quantization command comprises a data type of quantization precision, and extracting a quantization function corresponding to the data type from a calculation library according to the data type; dividing the operation data into g groups of data, and performing mixed quantization operation on the g groups of data according to the data type of the quantization precision to obtain quantized data, so that the artificial intelligence processor executes operation according to the quantized data, wherein g is an integer greater than or equal to 2.
Optionally, the operation data includes one or any combination of: an input neuron A, an output neuron B, a weight W, an input neuron derivative, an output neuron derivative, and a weight derivative;
the data types of the quantization precision specifically include: discrete quantization precision or continuous quantization precision.
Optionally, the performing, according to the data type of the quantization precision, a mixed quantization operation on the g groups of data specifically includes:
and performing quantization operation on the g groups of data by adopting at least two data types of quantization precision according to the data type of the quantization precision, wherein the data types of the quantization precision of the single group of data in the g groups of data are the same.
Optionally, the dividing the operation data into g groups of data specifically includes:
dividing the operation data into g groups of data according to a network layer of the neural network;
or dividing the operation data into g groups of data according to the number of the cores of the artificial intelligence processor.
Optionally, the performing a quantization operation on the g groups of data by using at least two data types of quantization precision according to the data type of the quantization precision specifically includes:
a part of the g group data is quantized with discrete quantization precision, and another part of the g group data is quantized with continuous quantization precision.
Optionally, the extracting, according to the data type, a quantization function corresponding to the data type from a calculation library specifically includes:
if the data type of the quantization precision is an index s of discrete quantization precision, the quantized data is obtained by calculation according to the discrete quantization precision and the element value of the operation data;
if the data type of the quantization precision is continuous quantization precision, the quantized data is calculated according to the continuous quantization precision and the element value of the operation data.
Optionally, before dividing the operation data into g groups of data, the method further includes determining quantization precision of the operation data according to the operation data and the data type, where determining quantization precision of the operation data according to the operation data and the data type specifically includes:
determining s or f according to the maximum value of the absolute value of the operation data;
or determining s or f according to the minimum value of the absolute value of the operation data;
or determining s or f according to the quantization precision of different data;
or s or f may be determined according to an empirical constant.
Optionally, the determining s and f according to the maximum absolute value of the operation data specifically includes:
determining s by formula 1-1:
s = ⌈log2(amax)⌉ − bitnum + 1    (formula 1-1)
determining f by formula 2-1:
f = c·amax / (2^(bitnum−1) − 1)    (formula 2-1)
wherein c is a constant, bitnum is the number of bits of the quantized data, and amax is the maximum absolute value of the operation data.
Optionally, the determining s or f according to the minimum absolute value of the operation data specifically includes:
determining s by formula 1-2 (formula image not reproduced; s is determined from amin and the constant d);
alternatively,
determining the precision f by formula 2-2:
f = amin·d    (formula 2-2)
wherein d is a constant and amin is the minimum absolute value of the operation data.
Optionally, the manner of determining the maximum or minimum absolute value of the operation data specifically includes:
finding the maximum or minimum absolute value by category over all layers;
or finding the maximum or minimum absolute value layer by layer and by category;
or finding the maximum or minimum absolute value layer by layer, by category, and by group.
Optionally, the determining s and f according to the quantization precision of different data specifically includes:
determining the discrete quantization precision by formula 1-3:
s_a^(l) = Σ_{b≠a} (α_b·s_b^(l) + β_b)    (formula 1-3)
wherein s_b^(l) is the index of the discrete quantization precision of data b^(l) in the same layer as data a^(l), and s_b^(l) is known;
determining the continuous quantization precision by formula 2-3:
f_a^(l) = Σ_{b≠a} (α_b·f_b^(l) + β_b)    (formula 2-3)
wherein f_b^(l) is the continuous quantization precision of data b^(l) in the same layer as data a^(l) and is known; the superscript l denotes the l-th layer.
Optionally, the determining s and f according to the empirical constant specifically includes:
setting s_a^(l) = C, wherein C is an integer constant;
setting f_a^(l) = C, wherein C is a rational number constant;
wherein s_a^(l) is the index of the discrete quantization precision of the data a^(l) of the l-th layer, and f_a^(l) is the continuous quantization precision of the data a^(l) of the l-th layer.
Optionally, the method further includes:
and dynamically adjusting s or f.
Optionally, the dynamically adjusting s or f specifically includes:
adjusting s or f upwards in a single step according to the maximum absolute value of the data to be quantized;
or adjusting s or f upwards step by step according to the maximum absolute value of the data to be quantized;
or adjusting s or f upwards in a single step according to the distribution of the data to be quantized;
or adjusting s or f upwards step by step according to the distribution of the data to be quantized;
or adjusting s or f downwards according to the maximum absolute value of the data to be quantized.
Optionally, the method further includes:
and (5) dynamically adjusting the trigger frequency of s and f.
Optionally, the dynamically adjusting the trigger frequency for adjusting s and f specifically includes:
adjusting s and f once every training iteration, s and f remaining fixed in between;
or adjusting s and f once every training epoch, s and f remaining fixed in between;
or adjusting s and f once every step iterations or epochs, wherein step is increased by a factor of α after each adjustment and α is greater than 1;
or adjusting s and f once every certain number of training iterations or epochs, wherein the adjustment frequency gradually decreases as the number of training iterations increases.
In a second aspect, an artificial intelligence processor is provided, which includes a processing unit and a quantization unit configured to perform the method provided in the first aspect. In a third aspect, a neural network computing device is provided, which includes one or more artificial intelligence processors provided in the second aspect.
In a fourth aspect, there is provided a combined processing apparatus comprising: the neural network arithmetic device, the universal interconnection interface and the universal processing device are provided by the third aspect;
the neural network operation device is connected with the general processing device through the general interconnection interface.
In a fifth aspect, an electronic device is provided, which includes the artificial intelligence processor provided in the second aspect or the neural network operation apparatus provided in the third aspect.
In a sixth aspect, a computer-readable storage medium is provided, which is characterized by storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method provided in the first aspect.
In a seventh aspect, a computer program product is provided, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform the method provided by the first aspect.
Drawings
Fig. 1 is a schematic diagram of a training method of a neural network.
Fig. 2 is a flow chart of a hybrid quantization method of operation data.
Fig. 3a is a data representation of discrete quantization accuracy.
Fig. 3b is a data representation of the continuous quantization accuracy.
Fig. 4a is a schematic diagram of a chip.
Fig. 4b is another schematic of a chip.
Fig. 5a is a schematic structural diagram of a combined processing device according to the present disclosure.
Fig. 5b is a schematic view of another structure of a combined processing device disclosed in the present application.
Fig. 5c is a schematic structural diagram of a neural network processor board card according to an embodiment of the present application.
Fig. 5d is a schematic structural diagram of a neural network chip package structure according to the embodiment of the present disclosure.
Fig. 5e is a schematic structural diagram of a neural network chip according to the embodiment of the present application.
Fig. 6 is a schematic diagram of a neural network chip package structure according to an embodiment of the present disclosure.
Fig. 6a is a schematic diagram of another neural network chip package structure according to the embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The method for quantizing operation data provided by the present application can be run on a processor. The processor may be a general-purpose processor, such as a central processing unit (CPU), or a special-purpose processor, such as a graphics processing unit (GPU), and the method may also be implemented in an artificial intelligence processor. The processor may be embodied in any form, and the method may also be embodied on a computer-readable medium.
Referring to fig. 1, fig. 1 is a schematic diagram of training an operation layer of a neural network provided in the present application. As shown in fig. 1, the operation layer may be a fully-connected layer or a convolutional layer; a fully-connected layer corresponds to a fully-connected operation, such as the matrix multiplication operation shown in fig. 1, and a convolutional layer corresponds to a convolution operation. The training includes forward inference (inference for short) and reverse training: the solid lines in fig. 1 show the process of forward inference, and the dotted lines show the process of reverse training. In forward inference, the input data of the operation layer and the weight are operated on to obtain the output data of the operation layer, which may serve as the input data of the next layer. In the reverse training process shown in fig. 1, the output data gradient of the operation layer and the weight are operated on to obtain the input data gradient, the output data gradient and the input data are operated on to obtain the weight gradient, the weight gradient is used to update the weight of the operation layer, and the input data gradient serves as the output data gradient of the layer above the operation layer.
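For ease of understanding, the data flow described above for a fully-connected operation layer can be illustrated with the following non-limiting Python sketch (using NumPy); the function and variable names are illustrative assumptions and are not part of the disclosure:

import numpy as np

# Forward inference of a fully-connected operation layer: output = input x weight
def forward(input_data, weight):
    return input_data @ weight

# Reverse training of the same layer:
#   input data gradient = output data gradient x weight^T  (passed to the layer above)
#   weight gradient     = input^T x output data gradient   (used to update this layer's weight)
def backward(input_data, weight, output_grad, lr=0.01):
    input_grad = output_grad @ weight.T
    weight_grad = input_data.T @ output_grad
    new_weight = weight - lr * weight_grad
    return input_grad, new_weight

x = np.random.randn(4, 8)        # input neurons
w = np.random.randn(8, 3)        # weights
y = forward(x, w)                # forward inference
dy = np.random.randn(*y.shape)   # output data gradient from the next layer
dx, w = backward(x, w, dy)       # reverse training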
Referring to fig. 2, fig. 2 provides a hybrid quantization method of operation data, which may be applied to forward inference and/or reverse training. The method is performed by a computing chip, which may be a general-purpose processor, such as a central processing unit or a graphics processor, or a dedicated artificial intelligence processor (e.g., a dedicated neural network processor). Of course, the method may also be executed by an apparatus including a computing chip. As shown in fig. 2, the method includes the following steps:
step S201, a computing chip determines operation data;
optionally, the operation data in step S201 includes, but is not limited to, one or any combination of: an input neuron A, an output neuron B, a weight W, an input neuron derivative, an output neuron derivative, and a weight derivative.
Step S202, a calculation chip acquires a quantization command, the quantization command comprises a data type of quantization precision, and a quantization function corresponding to the data type is extracted from a calculation library according to the data type;
the quantization command in step S202 may be obtained from an external device, for example, in an alternative embodiment, the computing chip includes a CPU and a dedicated neural network chip, and the quantization command may be sent from the CPU to the dedicated neural network chip, or may be a quantization command input to the computing chip by a user through the external device. In practical applications, the quantization command may also be automatically generated by the computing chip, for example, the quantization command may be generated according to the number of elements of the operation data, specifically, if the number of elements is greater than a number threshold, the quantization command including discrete quantization precision may be generated, and conversely, if the number of elements is less than the number threshold, the quantization command including continuous quantization precision may be generated. Of course, in another alternative embodiment, the above strategy for generating the quantization command may also be to determine the span of the element values of the operation data, i.e. to calculate the difference between the maximum element value and the minimum element value in the operation data, if the difference is greater than the difference threshold, the quantization command including discrete quantization precision may be generated, and conversely, if the difference is less than the difference threshold, the quantization command including continuous quantization precision may be generated.
Step S203, dividing the operation data into g groups of data, and performing a mixed quantization operation on the g groups of data according to the data type of the quantization precision to obtain quantized data, so that the artificial intelligence processor performs an operation according to the quantized data, where g is an integer greater than or equal to 2.
The dividing of the operation data into g groups of data may specifically include:
in an optional embodiment, the operation data is divided into g groups of data according to a network layer of the neural network; for example, each layer of data of the neural network is divided into a group, and taking 10 layers of neural networks as an example, each layer of 10 layers of neural networks can be divided into a group of data to form 10 groups of data.
In another alternative embodiment, the operational data is divided into g groups of data based on the number of cores of the artificial intelligence processor. For example, if the number of cores of the artificial intelligence processor is 4, the operation data is divided into 4 groups of data. Of course, the g can be configured by the user.
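A minimal sketch of the two grouping strategies (by network layer, or by the number of processor cores) might look as follows; the data layout is an illustrative assumption:

def group_by_layer(layer_data):
    """layer_data: one list of operation data per network layer -> one group per layer."""
    return [layer for layer in layer_data]

def group_by_core_count(flat_data, num_cores):
    """Split a flat list of operation data into num_cores groups of roughly equal size."""
    g = max(2, num_cores)                         # g must be an integer greater than or equal to 2
    size = (len(flat_data) + g - 1) // g
    return [flat_data[i * size:(i + 1) * size] for i in range(g)]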
The data types of the quantization precision within each group of data need to be the same. If the quantization precision types within one group differ, the operation may fail; for example, computing discretely quantized data together with continuously quantized data may produce a calculation error or make the operation impossible. Therefore, the data type of the quantization precision of a single group of data in the g groups of data needs to be the same.
According to the above technical solution, after the neural network operation data are determined, the data type of the quantization precision is determined according to the quantization command, the quantization function and the quantization precision are then determined according to that data type, the operation data are divided into g groups of data, and the quantization operation is then executed in a mixed quantization manner.
Optionally, the data type of the quantization precision may specifically include: discrete quantization precision or continuous quantization precision.
Optionally, the performing, according to the data type of the quantization precision, a mixed quantization operation on the g groups of data specifically includes:
and performing quantization operation on the g groups of data by adopting at least two data types of quantization precision according to the data type of the quantization precision, wherein the data types of the quantization precision of the single group of data in the g groups of data are the same.
For example, taking g = 4, one part of the groups, for example groups 1 and 2, may be quantized with discrete quantization precision, and the other part, for example groups 3 and 4, may be quantized with continuous quantization precision.
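This assignment of one precision type per group could be expressed as in the following sketch; quantize_discrete and quantize_continuous stand for the quantization functions of formulas (1) and (2) described below, and their exact forms are assumptions:

def mixed_quantize(groups, precision_types, quantize_discrete, quantize_continuous, s=-4, f=0.05):
    """Apply one precision type per group; all data within a single group share the same type."""
    quantized = []
    for group, ptype in zip(groups, precision_types):
        if ptype == 'discrete':
            quantized.append([quantize_discrete(x, s) for x in group])
        else:
            quantized.append([quantize_continuous(x, f) for x in group])
    return quantized

# Usage with g = 4: groups 1 and 2 discrete, groups 3 and 4 continuous
# mixed_quantize(groups, ['discrete', 'discrete', 'continuous', 'continuous'], qd, qc)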
Optionally, the extracting, from the computation library, the quantization function corresponding to the data type according to the data type specifically includes:
if the data type of the quantization precision is an index s of discrete quantization precision, the quantized data is obtained by calculation according to the discrete quantization precision and the element value of the operation data;
if the data type of the quantization precision is continuous quantization precision f, the quantized data is calculated according to the continuous quantization precision and the element value of the operation data.
The quantization may be performed in various ways. For example, in an alternative embodiment, the quantized data X is calculated according to the quantization function of formula (1):
X = 2^s · σ(x / 2^s)    formula (1)
wherein X is the quantized discrete fixed-point data, x is an element value in the operation data, s is the index of the discrete quantization precision, and σ is a rounding function, which includes but is not limited to: a rounding-up function, a rounding-down function, a rounding-to-nearest function, a random rounding function, or other rounding functions.
In another alternative embodiment, the quantized data Y is calculated according to the quantization function of formula (2):
Y = f · σ(x / f)    formula (2)
wherein Y is the quantized continuous fixed-point data, x is an element value in the operation data, f is the continuous quantization precision, and σ is a rounding function, which includes but is not limited to: rounding up, rounding down, rounding to nearest, random rounding, or other rounding approaches.
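A possible Python sketch of formulas (1) and (2) is given below; it assumes the reconstructed forms X = 2^s·σ(x/2^s) and Y = f·σ(x/f) with σ taken as round-to-nearest, and other rounding functions (up, down, random) can be substituted:

def quantize_discrete(x, s):
    """Sketch of formula (1): discrete quantization with precision index s (step 2**s)."""
    step = 2.0 ** s
    return step * round(x / step)

def quantize_continuous(x, f):
    """Sketch of formula (2): continuous quantization with precision f (step f)."""
    return f * round(x / f)

# Example: quantize_discrete(3.37, s=-2) -> 3.25 ; quantize_continuous(3.37, f=0.1) -> approx. 3.4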
Optionally, the quantization precision may specifically be determined in any one of the following manners:
Manner A: determining the index s of the discrete quantization precision or the continuous quantization precision f according to the maximum absolute value of the operation data. The maximum absolute value of the operation data described in the present application may specifically be the maximum absolute value of all elements in the operation data, or the maximum absolute value of each type of operation data, such as the maximum absolute value of the elements of the input neuron A, of the output neuron B, of the weight W, of the input neuron derivative, of the output neuron derivative, or of the weight derivative.
Specifically, the computing chip determines the maximum absolute value amax of the operation data and determines s or f through the following formulas:
s can be determined by formula 1-1:
s = ⌈log2(amax)⌉ − bitnum + 1    (formula 1-1)
f can be determined by formula 2-1:
f = c·amax / (2^(bitnum−1) − 1)    (formula 2-1)
wherein c is a constant that can take any rational number; preferably, c takes a rational number in [1, 1.2], although c may also take a rational number outside this range. bitnum is the number of bits required by the quantized data; referring to fig. 3a and fig. 3b, fig. 3a is a schematic diagram of the data representation of the discrete quantization precision and fig. 3b is a schematic diagram of the data representation of the continuous quantization precision, and bitnum can be 8, 16, 24 or 32.
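Under the same reconstruction assumptions (formula 1-1 for s and formula 2-1 for f), manner A could be sketched as follows; the ceiling in formula 1-1 and the exact form of formula 2-1 are assumptions:

import math

def precision_from_amax(amax, bitnum=8, c=1.1):
    """Manner A sketch: derive s (discrete) and f (continuous) from the maximum absolute value."""
    s = math.ceil(math.log2(amax)) - bitnum + 1     # formula 1-1 (ceiling assumed)
    f = c * amax / (2 ** (bitnum - 1) - 1)          # formula 2-1 (assumed form)
    return s, f

# Example: precision_from_amax(6.0, bitnum=8) -> (-4, approx. 0.052)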
Optionally, various manners may be used to select amax. Specifically, amax can be found by data category; it can also be found layer by layer, by category, or by group.
A1: the computing chip may find the maximum absolute value over all layers, by category. The computing chip determines each element of the data to be operated on, wherein the element may be an element value of the input neuron A, the output neuron B, the weight W, the input neuron derivative, the output neuron derivative, or the weight derivative. All layers of the neural network are traversed, and the maximum absolute value of each category of data over all the layers is found.
A2: the computing chip may find the maximum absolute value layer by layer and by category. The computing chip determines each element of the data to be operated on, wherein the element may be an element value of the l-th layer of the input neuron A, the output neuron B, the weight W, the input neuron derivative, the output neuron derivative, or the weight derivative. Of course, in practical applications, the search need not be performed for every single layer; for example, the maximum absolute value of each category of data may be extracted every λ layers, wherein λ is an integer greater than or equal to 2.
A3: the computing chip may find the maximum absolute value layer by layer, by category, and further by group. The computing chip determines each element of the data to be operated on, wherein the element may be an element value of the l-th layer of the input neuron A, the output neuron B, the weight W, the input neuron derivative, the output neuron derivative, or the weight derivative. Each category of data of each layer is divided into E groups (E may be an empirical value or a value set by the user), all layers of the neural network are traversed, and the maximum absolute value of each of the E groups of each category of data of each layer is found. The input neuron A, the output neuron B, the weight W, the input neuron derivative, the output neuron derivative, and the weight derivative each represent one category of data.
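The three search manners A1 (over all layers, by category), A2 (per layer, by category) and A3 (per layer, by category and by group) might be sketched as below; the nested-dictionary data layout is an illustrative assumption:

def amax_all_layers(net_data):
    """A1: net_data maps category -> list of per-layer value lists; one amax per category over all layers."""
    return {cat: max(abs(v) for layer in layers for v in layer)
            for cat, layers in net_data.items()}

def amax_per_layer(net_data):
    """A2: one amax per category per layer."""
    return {cat: [max(abs(v) for v in layer) for layer in layers]
            for cat, layers in net_data.items()}

def amax_per_group(net_data, E=4):
    """A3: split each layer's data of each category into E groups and take one amax per group."""
    def split(layer):
        size = max(1, -(-len(layer) // E))          # ceiling division
        return [layer[i:i + size] for i in range(0, len(layer), size)]
    return {cat: [[max(abs(v) for v in grp) for grp in split(layer)] for layer in layers]
            for cat, layers in net_data.items()}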
Manner B: determining s and f according to the minimum absolute value of the operation data.
Specifically, the computing chip determines the minimum absolute value amin of the operation data and determines s or f according to amin.
The precision s can be determined by formula 1-2 (formula image not reproduced; s is determined from amin and the constant d);
the precision f can be determined by formula 2-2:
f = amin·d    (formula 2-2)
wherein d is a constant that can take any rational number.
The above amin may be found by data category, or layer by layer, by category, or by group. The specific finding manner may adopt the manner of A1, A2 or A3, with amax in A1, A2 or A3 replaced by amin.
Manner C: determining s and f according to the quantization precision of different data.
There is a correlation between the values of s of different data of the same layer. For example, for discrete quantization precision, the index of the discrete quantization precision s_a^(l) of the l-th layer data a^(l) can be obtained by calculation from the index of the discrete quantization precision s_b^(l) of another kind of data b^(l) of the l-th layer, as shown in formula 1-3:
s_a^(l) = Σ_{b≠a} (α_b·s_b^(l) + β_b)    (formula 1-3)
There is likewise a correlation between the values of f of different data of the same layer. For example, the continuous quantization precision f_a^(l) of the l-th layer data a^(l) can be obtained from the continuous quantization precision f_b^(l) of another kind of data b^(l) of the l-th layer, as determined according to formula 2-3:
f_a^(l) = Σ_{b≠a} (α_b·f_b^(l) + β_b)    (formula 2-3)
wherein α_b and β_b are constants; specifically, for formula 1-3, α_b and β_b are integer constants, and for formula 2-3, α_b and β_b are rational number constants.
The above data a^(l) may be the element values of the l-th layer of the input neuron A, the output neuron B, the weight W, the input neuron derivative, the output neuron derivative, or the weight derivative; the above data b^(l) may be the element values of another one of these kinds of data of the l-th layer.
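Manner C, which derives the precision of one kind of data from the known precisions of the other kinds of data of the same layer (formulas 1-3 and 2-3), could be sketched as follows; the reconstructed summation form and the coefficient values are assumptions:

def precision_from_other_data(known_precisions, alpha, beta):
    """known_precisions: dict mapping data kind b -> s_b (or f_b) of the same layer.
    alpha, beta: dicts of per-kind constants. Returns the derived precision of data kind a."""
    return sum(alpha[b] * p_b + beta[b] for b, p_b in known_precisions.items())

# Example with illustrative coefficients: derive s of the weight from s of the input/output neurons
# precision_from_other_data({'A': -4, 'B': -3}, alpha={'A': 1, 'B': 0}, beta={'A': 0, 'B': 0}) -> -4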
Manner D: the computing chip determines s and f according to an empirical constant.
Specifically, the index of the discrete quantization precision s_a^(l) of the l-th layer data type a^(l) can be set artificially as s_a^(l) = C, wherein C is an integer constant.
Specifically, the continuous quantization precision f_a^(l) of the l-th layer data type a^(l) can be set artificially as f_a^(l) = C, wherein C is a rational number constant.
Optionally, the above method may further comprise:
When the neural network operation is executed, s and f are dynamically adjusted.
The manners of adjusting s and f include, but are not limited to: adjusting s or f upward (s or f becomes larger) or adjusting s or f downward (s or f becomes smaller).
For the upward adjustment, s or f can be adjusted upward in a single step, or step by step.
Similarly, for the downward adjustment, s or f can be adjusted downward in a single step, or step by step.
The dynamically adjusting s and f according to the data to be quantized may specifically include:
a) The computing chip adjusts s or f upward in a single step according to the maximum absolute value of the data to be quantized:
If s is adjusted, the computing chip determines the index of the discrete quantization precision before the adjustment as s_old, and the fixed-point representable range is [neg, pos], wherein pos = (2^(bitnum−1) − 1)·2^s_old and neg = −2^(bitnum−1)·2^s_old. The maximum absolute value a_maxnew of the data to be quantized is extracted; when a_maxnew ≥ pos, s_new = ⌈log2(a_maxnew)⌉ − bitnum + 1; otherwise s is not adjusted, i.e. s_new = s_old is kept.
If f is adjusted, the computing chip determines the fixed-point precision before the adjustment as f_old, and the fixed-point representable range is [neg, pos], wherein pos = (2^(bitnum−1) − 1)·f_old and neg = −2^(bitnum−1)·f_old. When the maximum absolute value a_maxnew of the data to be quantized is greater than or equal to pos, f is adjusted according to formula 2-4; otherwise f_new = f_old is kept:
f_new = c·a_maxnew / (2^(bitnum−1) − 1)    (formula 2-4)
b) The computing chip adjusts s or f upward step by step according to the maximum absolute value of the data to be quantized:
If s is adjusted, the index of the discrete quantization precision before the adjustment is s_old, and the fixed-point representable range is [neg, pos]. When the maximum absolute value a_maxnew of the data to be quantized is greater than or equal to pos, s_new = s_old + η; otherwise s is not adjusted, i.e. s_new = s_old is maintained; wherein η is the step size of a single adjustment in the stepwise upward adjustment, and may be, for example, 1, or another value.
If f is adjusted, the continuous quantization precision before the adjustment is f_old, and the fixed-point representable range is [neg, pos]. When the maximum absolute value a_maxnew of the data to be quantized is greater than or equal to pos, f_new = f_old + η; otherwise f is not adjusted, i.e. f_new = f_old is maintained; wherein η is the step size of a single adjustment in the stepwise upward adjustment, and may be, for example, 1, or another value.
c) The computing chip adjusts s or f upward in a single step according to the distribution of the data to be quantized:
If s is adjusted, the computing chip determines the index of the discrete quantization precision before the adjustment as s_old, and the fixed-point representable range is [neg, pos]. A statistic of the absolute values of the data to be quantized is calculated; the statistic may be the sum of the mean a_mean of the absolute values and n times the standard deviation a_std of the absolute values, i.e. z_max = a_mean + n·a_std, wherein n may take a positive integer, for example n = 3. When z_max ≥ pos, s_new = ⌈log2(z_max)⌉ − bitnum + 1; otherwise s is not adjusted.
If f is adjusted, the continuous quantization precision before the adjustment is f_old, and the fixed-point representable range is [neg, pos]. The statistic may be z_max = a_mean + n·a_std, wherein n may take a positive integer, for example n = 3. When z_max ≥ pos, f_new is obtained by calculation according to formula 2-5; otherwise f is not adjusted:
f_new = c·z_max / (2^(bitnum−1) − 1)    (formula 2-5)
d) The computing chip adjusts s or f upward step by step according to the distribution of the data to be quantized:
If s is adjusted, the computing chip determines the index of the discrete quantization precision before the adjustment as s_old, and the fixed-point representable range is [neg, pos]. The statistic z_max = a_mean + n·a_std is calculated, wherein n may take a positive integer, for example n = 3. When z_max ≥ pos, s_new = s_old + η; otherwise s is not adjusted.
If f is adjusted, the computing chip determines the continuous quantization precision before the adjustment as f_old, and the fixed-point representable range is [neg, pos]. The statistic z_max = a_mean + n·a_std is calculated, wherein n may take a positive integer, for example n = 3. When z_max ≥ pos, f_new = f_old + η; otherwise f is not adjusted.
e) The computing chip adjusts s or f downward according to the maximum absolute value of the data to be quantized:
If s is adjusted, the index of the discrete quantization precision before the adjustment is s_old, and the fixed-point representable range is [neg, pos]. When the maximum absolute value a_max of the data to be quantized satisfies a_max < 2^(s_old + bitnum − n) and s_old ≥ s_min, then s_new = s_old − η, wherein n is an integer constant and s_min is the minimum value that s can take (typically a negative integer or negative infinity); preferably, n = 3 and s_min = −40.
If f is adjusted, the continuous quantization precision before the adjustment is f_old, and the fixed-point representable range is [neg, pos]. When the maximum absolute value a_max of the data to be quantized satisfies a_max < f_old·2^(bitnum − n) and f_old ≥ f_min, then f_new = f_old − η, wherein n is an integer constant and f_min is the minimum value that f can take.
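The adjustment manners a), b), d) and e) above could be sketched, for the index s of the discrete quantization precision, as in the following fragment; the reconstructed formulas and the default constants are assumptions:

import math

def adjust_s(s_old, data, bitnum=8, eta=1, n=3, s_min=-40, mode='single_up_max'):
    """Dynamically adjust the index s according to the data to be quantized (manners a, b, d, e)."""
    pos = (2 ** (bitnum - 1) - 1) * 2.0 ** s_old    # upper end of the fixed-point representable range
    a_max = max(abs(v) for v in data)
    if mode == 'single_up_max':                     # a) single-step upward by the maximum absolute value
        return math.ceil(math.log2(a_max)) - bitnum + 1 if a_max >= pos else s_old
    if mode == 'step_up_max':                       # b) stepwise upward by the maximum absolute value
        return s_old + eta if a_max >= pos else s_old
    if mode == 'step_up_stat':                      # d) stepwise upward by the distribution statistic z_max
        mean = sum(abs(v) for v in data) / len(data)
        std = (sum((abs(v) - mean) ** 2 for v in data) / len(data)) ** 0.5
        z_max = mean + n * std
        return s_old + eta if z_max >= pos else s_old
    if mode == 'down_max':                          # e) downward when the data occupy only a small part of the range
        if a_max < 2.0 ** (s_old + bitnum - n) and s_old >= s_min:
            return s_old - eta
        return s_old
    return s_old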
Optionally, the method may further include: and the computing chip dynamically adjusts the trigger frequency of s and f.
Dynamically adjusting the trigger frequency of s may include, but is not limited to, the following methods.
a) No adjustment is triggered ever, i.e. s, f are fixed.
b) The adjustment is performed once every training iteration, with s and f fixed in between; the adjustment manner may refer to the above adjustment manners of s and f. One iteration specifically means that one training sample completes one iterative operation, that is, one forward operation and one reverse training.
Preferably, the adjustment interval may be set differently depending on the kind of data; for example, it may be set to 100 for data types such as the input neurons, output neurons and weight data, and to 20 for the neuron derivative data types.
c) The adjustment is performed once every training epoch, with s and f fixed in between. One epoch specifically means that all samples in the training set complete one forward operation and one reverse training.
d) The adjustment is performed once every step training iterations or epochs, wherein step is increased by a factor of α after each adjustment and α is greater than 1.
e) The adjustment is performed once every certain number of training iterations or epochs, and the adjustment frequency gradually decreases as the number of training iterations increases, for example when the number of training iterations is 100, 180, 90, 260, 80.
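A training loop applying the trigger-frequency strategies above might look like the following sketch; the schedule values are illustrative assumptions and adjust_s stands for any of the adjustment manners sketched earlier:

def should_adjust(iteration, strategy='every_iter', step=100, alpha=2, schedule=(100, 180, 260)):
    """Decide whether s and f are re-adjusted at this iteration; returns (adjust_now, next_step)."""
    if strategy == 'never':                 # a) s and f stay fixed
        return False, step
    if strategy == 'every_iter':            # b) adjust once every iteration
        return True, step
    if strategy == 'growing_interval':      # d) adjust every `step` iterations, then multiply step by alpha
        if iteration > 0 and iteration % step == 0:
            return True, step * alpha
        return False, step
    if strategy == 'schedule':              # e) adjustment frequency decreases as training proceeds
        return iteration in schedule, step
    return False, step

# Usage inside a training loop:
#   adjust_now, step = should_adjust(it, strategy='growing_interval', step=step)
#   if adjust_now:
#       s = adjust_s(s, batch_data)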
The present application further provides an artificial intelligence processor, wherein the artificial intelligence processor comprises:
the processing unit is used for acquiring a quantization command, wherein the quantization command comprises a data type of quantization precision, and a quantization function corresponding to the data type is extracted from a calculation base according to the data type; dividing the operation data into g groups of data;
and the quantization unit is used for performing mixed quantization operation on the g groups of data according to the data type of the quantization precision to obtain quantized data so that the artificial intelligence processor executes operation according to the quantized data, wherein g is an integer greater than or equal to 2.
The refinement schemes of the processing unit and the quantization unit can be referred to the description of the method embodiment, and are not described herein again.
The above quantization unit, configured to quantize the operation data according to the quantization precision and the quantization function to obtain a refinement scheme of quantized data, may refer to the description of the method embodiment, and is not described herein again.
The present application also discloses a neural network computing device, which includes one or more chips as shown in fig. 4a or fig. 4b, and may also include one or more artificial intelligence processors. The device is used for acquiring data to be operated and control information from other processing devices, executing specified neural network operation, and transmitting an execution result to peripheral equipment through an I/O interface. Peripheral devices such as cameras, displays, mice, keyboards, network cards, wifi interfaces, servers. When more than one chip shown in fig. 4a or fig. 4b is included, the chips shown in fig. 4a or fig. 4b can be linked and transmit data through a specific structure, for example, a PCIE bus interconnects and transmits data, so as to support larger-scale operation of the neural network. At this time, the same control system may be shared, or there may be separate control systems; the memory may be shared or there may be separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The neural network arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device which comprises the neural network operation device, the universal interconnection interface and other processing devices (namely, a universal processing device). The neural network arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. FIG. 5a is a schematic view of a combined processing apparatus.
Other processing devices include one or more of general purpose/special purpose processors such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices are used as interfaces of the neural network arithmetic device and external data and control, and comprise data transportation to finish basic control of starting, stopping and the like of the neural network arithmetic device; other processing devices can cooperate with the neural network arithmetic device to complete the arithmetic task.
And the universal interconnection interface is used for transmitting data and control instructions between the neural network arithmetic device and other processing devices. The neural network arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the neural network arithmetic device chip; control instructions can be obtained from other processing devices and written into a control cache on a neural network arithmetic device chip; the data in the storage module of the neural network arithmetic device can also be read and transmitted to other processing devices.
As shown in fig. 5b, the structure may further include a storage device for storing data required by the present arithmetic unit/arithmetic device or other arithmetic units, and is particularly suitable for data that is required to be calculated and cannot be stored in the internal storage of the present neural network arithmetic device or other processing devices.
The combined processing device can be used as an SOC (system on chip) system of equipment such as a mobile phone, a robot, an unmanned aerial vehicle and video monitoring equipment, the core area of a control part is effectively reduced, the processing speed is increased, and the overall power consumption is reduced. In this case, the generic interconnect interface of the combined processing device is connected to some component of the apparatus. Some parts are such as camera, display, mouse, keyboard, network card, wifi interface.
Referring to fig. 5c, fig. 5c is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure. As shown in fig. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate (substrate) 13.
The present application does not limit the specific structure of the neural network chip package structure 11, and optionally, as shown in fig. 5d, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 related to the present application is not limited, and the neural network chip 111 includes, but is not limited to, a neural network wafer integrating a neural network processor, and the wafer may be made of silicon material, germanium material, quantum material, molecular material, or the like. The neural network chip can be packaged according to practical conditions (such as a severer environment) and different application requirements, so that most of the neural network chip is wrapped, and the pins on the neural network chip are connected to the outer side of the packaging structure through conductors such as gold wires and the like for circuit connection with a further outer layer.
The specific structure of the neural network chip 111 is not limited in the present application, and please refer to the apparatus shown in fig. 4a or fig. 4b as an alternative.
The type of the first substrate 13 and the second substrate 113 is not limited in this application, and may be a Printed Circuit Board (PCB) or a Printed Wiring Board (PWB), and may be other circuit boards. The material of the PCB is not limited.
The second substrate 113 according to the present invention is used for carrying the neural network chip 111, and the neural network chip package structure 11 obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used for protecting the neural network chip 111 and facilitating further packaging of the neural network chip package structure 11 and the first substrate 13.
The specific packaging method and the corresponding structure of the second electrical and non-electrical connecting device 112 are not limited; an appropriate packaging method can be selected and simply adapted according to actual conditions and different application requirements, for example: a Flip Chip Ball Grid Array (FCBGA) package, a Low-profile Quad Flat Package (LQFP), a Quad Flat Package with Heat sink (HQFP), a Quad Flat No-lead Package (QFN), or a Fine-pitch Ball Grid Array (FBGA) package.
The Flip Chip (Flip Chip) is suitable for the conditions of high requirements on the area after packaging or sensitivity to the inductance of a lead and the transmission time of a signal. In addition, a Wire Bonding (Wire Bonding) packaging mode can be used, so that the cost is reduced, and the flexibility of a packaging structure is improved.
Ball Grid Array (Ball Grid Array) can provide more pins, and the average wire length of the pins is short, and has the function of transmitting signals at high speed, wherein, the package can be replaced by Pin Grid Array Package (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA) and the like.
Optionally, the neural network Chip 111 and the second substrate 113 are packaged in a Flip Chip Ball Grid Array (Flip Chip Ball Grid Array) packaging manner, and a schematic diagram of a specific neural network Chip packaging structure may refer to fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, and the pin 26.
The bonding pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed between the bonding pads 22 and the connection points 25 on the second substrate 24 by soldering, so that the neural network chip 21 and the second substrate 24 are connected, that is, the package of the neural network chip 21 is realized.
The pins 26 are used for connecting with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), so as to realize transmission of external data and internal data, and facilitate processing of data by the neural network chip 21 or a neural network processor corresponding to the neural network chip 21. The type and number of the pins are not limited in the present application, and different pin forms can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip packaging structure further includes an insulating filler, which is disposed in a gap between the pad 22, the solder ball 23 and the connection point 25, and is used for preventing interference between the solder ball and the solder ball.
Wherein, the material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a heat sink, such as a fan.
For example, as shown in fig. 6a, the neural network chip package structure 11 includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, the pin 26, the insulating filler 27, the thermal grease 28 and the metal housing heat sink 29. The heat dissipation paste 28 and the metal case heat dissipation sheet 29 are used to dissipate heat generated during operation of the neural network chip 21.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected to the bonding pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the bonding pad 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The specific form of the first electrical and non-electrical device 12 is not limited in the present application, and reference may be made to the description of the second electrical and non-electrical device 112, that is, the neural network chip package structure 11 is packaged by soldering, or a connection wire connection or a plug connection may be adopted to connect the second substrate 113 and the first substrate 13, so as to facilitate subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit for expanding the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR SDRAM), etc.; expanding the memory improves the processing capability of the neural network processor.
The first substrate 13 may further include a Peripheral component interconnect Express (PCI-E or PCIe) interface, a Small Form-factor pluggable (SFP) interface, an ethernet interface, a Controller Area Network (CAN) interface, and the like on the first substrate, for data transmission between the package structure and the external circuit, which may improve the operation speed and the convenience of operation.
The neural network processor is packaged into a neural network chip 111, the neural network chip 111 is packaged into a neural network chip packaging structure 11, the neural network chip packaging structure 11 is packaged into a neural network processor board card 10, and data interaction is performed with an external circuit (for example, a computer motherboard) through an interface (a slot or a plug core) on the board card, that is, the function of the neural network processor is directly realized by using the neural network processor board card 10, and the neural network chip 111 is protected. And other modules can be added to the neural network processor board card 10, so that the application range and the operation efficiency of the neural network processor are improved.
In one embodiment, the present disclosure discloses an electronic device comprising the above neural network processor board card 10 or the neural network chip package 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, cameras, servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The above-mentioned embodiments are further described in detail for the purpose of illustrating the invention, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not to be construed as limiting the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A hybrid quantization method for operation data, the method being applied to an artificial intelligence processor, the method comprising the steps of:
determining operation data;
acquiring a quantization command, wherein the quantization command comprises a data type of quantization precision, and extracting a quantization function corresponding to the data type from a calculation library according to the data type;
dividing the operation data into g groups of data, and performing mixed quantization operation on the g groups of data according to the data type of the quantization precision to obtain quantized data, so that the artificial intelligence processor executes operation according to the quantized data, wherein g is an integer greater than or equal to 2.
2. The method of claim 1,
the operation data includes one or any combination of: an input neuron A, an output neuron B, a weight W, an input neuron derivative, an output neuron derivative, and a weight derivative;
the data types of the quantization precision specifically include: discrete quantization precision or continuous quantization precision.
3. The method according to claim 2, wherein said applying a hybrid quantization operation to the g groups of data according to the data type of the quantization precision specifically comprises:
and performing quantization operation on the g groups of data by adopting at least two data types of quantization precision according to the data type of the quantization precision, wherein the data types of the quantization precision of the single group of data in the g groups of data are the same.
4. The method according to any one of claims 1 to 3, wherein the dividing the operation data into g groups of data specifically comprises:
dividing the operation data into g groups of data according to a network layer of the neural network;
or dividing the operation data into g groups of data according to the number of the cores of the artificial intelligence processor.
5. The method according to claim 3, wherein the quantizing the g groups of data according to the data type of quantization precision with at least two data types of quantization precision comprises:
a part of the g group data is quantized with discrete quantization precision, and another part of the g group data is quantized with continuous quantization precision.
6. An artificial intelligence processor, comprising:
the processing unit is used for acquiring a quantization command, wherein the quantization command comprises a data type of quantization precision, and a quantization function corresponding to the data type is extracted from a calculation base according to the data type; dividing the operation data into g groups of data;
and the quantization unit is used for performing mixed quantization operation on the g groups of data according to the data type of the quantization precision to obtain quantized data so that the artificial intelligence processor executes operation according to the quantized data, wherein g is an integer greater than or equal to 2.
7. A neural network computing device, comprising one or more artificial intelligence processors as claimed in claim 6.
8. A combined processing apparatus, characterized in that the combined processing apparatus comprises: the neural network computing device, the universal interconnect interface, and the universal processing device of claim 7;
the neural network operation device is connected with the general processing device through the general interconnection interface.
9. An electronic device comprising the artificial intelligence processor of claim 6 or the neural network operation apparatus of claim 7.
10. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 1-5.
CN201910306477.5A 2019-04-16 2019-04-16 Mixed quantization method of operation data and related product Pending CN111832709A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910306477.5A CN111832709A (en) 2019-04-16 2019-04-16 Mixed quantization method of operation data and related product
PCT/CN2020/084943 WO2020211783A1 (en) 2019-04-16 2020-04-15 Adjusting method for quantization frequency of operational data and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910306477.5A CN111832709A (en) 2019-04-16 2019-04-16 Mixed quantization method of operation data and related product

Publications (1)

Publication Number Publication Date
CN111832709A true CN111832709A (en) 2020-10-27

Family

ID=72915443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910306477.5A Pending CN111832709A (en) 2019-04-16 2019-04-16 Mixed quantization method of operation data and related product

Country Status (1)

Country Link
CN (1) CN111832709A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112731161A (en) * 2021-02-08 2021-04-30 中南大学 Nonlinear data feature extraction and classification prediction method based on small amount of data mixed insertion

Similar Documents

Publication Publication Date Title
US11907844B2 (en) Processing method and accelerating device
CN109902811B (en) Neural network operation device and method
TWI793225B (en) Method for neural network training and related product
US11748601B2 (en) Integrated circuit chip device
TWI791725B (en) Neural network operation method, integrated circuit chip device and related products
TWI767098B (en) Method for neural network forward computation and related product
CN111832709A (en) Mixed quantization method of operation data and related product
CN109977446B (en) Integrated circuit chip device and related product
CN112308198A (en) Calculation method of recurrent neural network and related product
CN110490315B (en) Reverse operation sparse method of neural network and related products
CN111832711A (en) Method for quantizing operation data and related product
WO2020211783A1 (en) Adjusting method for quantization frequency of operational data and related product
CN111832712A (en) Method for quantizing operation data and related product
CN111832696A (en) Neural network operation method and related product
CN109978156B (en) Integrated circuit chip device and related product
CN109978157B (en) Integrated circuit chip device and related product
CN109978148B (en) Integrated circuit chip device and related product
CN111832695A (en) Method for adjusting quantization precision of operation data and related product
CN111832710A (en) Method for adjusting quantization frequency of operation data and related product
CN110490314B (en) Neural network sparseness method and related products
CN109978152B (en) Integrated circuit chip device and related product
CN111382864A (en) Neural network training method and device
CN109978158B (en) Integrated circuit chip device and related product
TWI768160B (en) Integrated circuit chip apparatus and related product
WO2019165940A1 (en) Integrated circuit chip apparatus, board card and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination