WO2022097954A1

WO2022097954A1 - Neural network computation method and neural network weight generation method

Info

Publication number: WO2022097954A1
Application number: PCT/KR2021/014367
Authority: WO
Inventors: 정태영
Original assignee: 오픈엣지테크놀로지 주식회사
Priority date: 2020-11-03
Filing date: 2021-10-15
Publication date: 2022-05-12
Also published as: KR102384588B1

Abstract

Disclosed is a neural network driving method comprising a shifting step of obtaining one set of first values by multiplying one set of weights by corresponding values of one set of input activations, respectively, obtaining a second value by adding the one set of first values to one another, and obtaining an output activation of a first node by right-shifting the second value by n bits. Each of the one set of input activations and the one set of weights corresponds to n-bit integer-type data.

Description

Neural network calculation method and neural network weight generation method

The present invention relates to computing technology, and more particularly, to an operation structure inside a neural network and a method of generating parameters provided therefor.

Korean Patent Laid-Open Publication No. 10-2019-0014900, "Method and apparatus for quantizing parameters of a neural network," discloses a method for quantizing the parameters of a neural network. In that method, the statistical distribution of the floating-point type parameter values is analyzed for each channel from the data of a pre-trained neural network, a fixed-point representation of the parameters is determined for each channel, the fraction lengths of the bias and of the per-channel weights are determined, and a fixed-point type quantized neural network having the bias and the per-channel weights of the determined fraction lengths is generated. That publication presents the same content as FIGS. 1, 13A, and 13B.

FIG. 1 is a diagram for explaining an operation performed in a neural network according to an embodiment. Referring to FIG. 1, the neural network 2 has a structure including an input layer, hidden layers, and an output layer, performs an operation based on received input data (I1 and I2), and may generate output data (O1 and O2) based on the result of the operation.

Each of the layers included in the neural network 2 may include a plurality of channels. A channel may correspond to a plurality of artificial nodes, known as neurons, processing elements (PEs), units, or similar terms.

Channels included in each of the layers of the neural network 2 may be connected to each other to process data. For example, one channel may perform an operation by receiving data from other channels, and may output an operation result to other channels.

The input and output, respectively, of each of the layers may be referred to as input activation and output activation. That is, activation may be an output of one channel and a parameter corresponding to an input of channels included in the next layer. Meanwhile, each of the channels may determine its own activation based on activations and weights received from channels included in the previous layer. The weight is a parameter used to calculate output activation in each layer, and may be a value assigned to a connection relationship between layers.

In general, activation of one input data can have c certain values for each coordinate (two-dimensional in x and y in the case of an image), and the axis for these c values can be expressed as a channel axis. In other words, one input may be composed of (coordinates*channel) data.

In order to increase the efficiency of parallel processing on GPUs and similar devices, several input data are bundled and processed at once. In general, an activation for image processing can be composed of four dimensions, such as (batch, x, y, channel). Here, the batch is the dimension that indexes input 0, input 1, input 2, and so on.

Each of the channels may be processed by a computational unit or processing element that receives an input and outputs an output activation, and the input and output of each channel may be mapped to each other. For example, when $\sigma$ is an activation function, $w^{i}_{jk}$ is the weight from the k-th channel included in the (i-1)-th layer to the j-th channel included in the i-th layer, $b^{i}_{j}$ is the bias of the j-th channel included in the i-th layer, and $a^{i}_{j}$ is the activation of the j-th channel in the i-th layer, the activation $a^{i}_{j}$ is given by the following Equation 1: $a^{i}_{j} = \sigma\left( \sum_{k} ( w^{i}_{jk} \cdot a^{i-1}_{k} ) + b^{i}_{j} \right)$.

Referring to FIG. 1, the activation of the first channel (CH 1) of the second layer (Layer 2) may be expressed as $a^{2}_{1}$. According to Equation 1, $a^{2}_{1} = \sigma( w^{2}_{1,1} \cdot a^{1}_{1} + w^{2}_{1,2} \cdot a^{1}_{2} + b^{2}_{1} )$. However, Equation 1 is only an example for explaining the activations and weights used to process data in the neural network 2, and the disclosure is not limited thereto. The activation may be a value obtained by passing, through a Rectified Linear Unit (ReLU), a value obtained by applying an activation function to the sum of the activations received from the previous layer.
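The per-channel computation of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration and not part of the cited publication: the function name `channel_activation`, the choice of ReLU as the activation function σ, and the numeric values are assumptions made for the example.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit, used here as an example of the activation function sigma."""
    return max(0.0, x)


def channel_activation(weights, prev_activations, bias, sigma=relu):
    """Compute a_j^i = sigma(sum_k(w_jk^i * a_k^(i-1)) + b_j^i) for one channel j."""
    weighted_sum = sum(w * a for w, a in zip(weights, prev_activations))
    return sigma(weighted_sum + bias)


# Illustrative values for a_1^2 = sigma(w_11 * a_1^1 + w_12 * a_2^1 + b_1^2).
a_2_1 = channel_activation(weights=[0.5, -0.25], prev_activations=[1.0, 2.0], bias=0.1)
```

Every multiplication in this form is a floating-point multiplication, which is the cost the remainder of the description aims to avoid.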

As described above, in the neural network 2, numerous data sets are exchanged between a plurality of interconnected channels and undergo numerous computational processes while passing through layers. Accordingly, there is a need for a technique capable of minimizing the loss of accuracy while reducing the amount of computation required to process complex input data.

The neural network quantization apparatus proposed in the above-mentioned prior art performs various processing functions, such as generating a neural network, training (or learning) a neural network, quantizing a floating-point type neural network into a fixed-point type neural network, and retraining a neural network. Specifically, a trained neural network can be generated by repeatedly training a given initial neural network. The initial neural network may have floating-point type parameters in order to secure the processing accuracy of the neural network. However, floating-point arithmetic requires a relatively large amount of computation and more frequent memory access than fixed-point arithmetic. Accordingly, a neural network having floating-point type parameters may not run smoothly on a mobile device, an embedded device, or a similar device with relatively low processing performance. Therefore, in order to drive the neural network within an acceptable loss of accuracy while reducing the amount of computation on such devices, the floating-point type parameters are preferably quantized. Here, parameter quantization means converting a floating-point type parameter into a fixed-point type parameter. Accordingly, the neural network quantization apparatus performs quantization by converting the parameters of a trained neural network into a fixed-point type with a predetermined number of bits, and transmits the quantized neural network to the device on which it is to be deployed.

The prior art described above is advantageous in that it uses fixed-point type calculations instead of floating-point types. However, there is still no approach to improving the computational structure inside each computational node of the neural network.

By improving the arithmetic structure inside each computational node, the performance or efficiency of a hardware accelerator may be improved.

An object of the present invention is to provide a technique capable of reducing hardware complexity of a computing device processing a neural network, increasing processing speed, or improving resource utilization.

Specifically, in the case of implementing a neural network in hardware, it is intended to provide a technique for reducing the complexity of the structure of each computation node in the neural network.

And, in the case of implementing the neural network as software, it is intended to provide a technique for simplifying the computation algorithm in each computation node in the neural network.

In addition, as an accompanying technique, it is intended to provide a technique for defining and calculating a weight parameter required for calculation.

In general, floating-point operations such as those shown in FIG. 2 are predominant in neural network inference. In FIG. 2, $I_A$ denotes an input activation, $O_A$ denotes an output activation, and $w$ denotes a weight parameter.

However, since a floating-point operator is more complicated than an integer operator, building a neural network accelerator with actual floating-point operators increases the hardware size.

To alleviate this problem, a hardware accelerator may be implemented to convert an input activation, an output activation, and a weight into integers in a fixed point form and process the operation through an integer operator.

When using a fixed-point operation, I _A , O _A , w may be defined as shown in the following equation and FIG. 3A .

[Equation]

q is n-bit integer

Thereafter, the scale value is determined so that the value q has sufficient resolution as an n-bit integer, and the floating-point value can be converted into the integer fixed-point format by dividing it by the scale and quantizing the result.
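The relationship between the real values and their fixed-point counterparts can be sketched as follows. This is an assumed form, chosen to be consistent with the quantization and restoration relations given later in this description ($q_I = \mathrm{quantize}(I_A / S_I)$ and $O_A = q_O \cdot S_O$); the symbol $q_w$ for the quantized weight and the signed n-bit range are assumptions introduced for illustration.

```latex
I_A \approx S_I \cdot q_I, \qquad
w \approx S_w \cdot q_w, \qquad
O_A \approx S_O \cdot q_O, \qquad
q_I,\, q_w,\, q_O \in \{-2^{n-1}, \ldots, 2^{n-1}-1\}
```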

Next, as shown in FIG. 3B , an operation is performed using the fixed-point value q.

In general, each of the values $S_I$, $S_w$, and $S_O$ is shared within one layer or within a specific channel of one activation, and $q_S$ is likewise shared within a layer or within a specific channel of one activation.

The present invention provides a fixed-point-based quantized arithmetic method of the form shown in FIG. 4 as a way to build simpler hardware while achieving an operation effect similar to that shown in FIGS. 2, 3A, and 3B. Since one multiplier can be eliminated for each multiplication, smaller hardware can be built.

The present invention includes a quantization algorithm for generating the weight parameter $q'_w$ used in the above quantized operation. This algorithm premultiplies the weight by $S_I / S_O$ and then converts the result to a fixed-point integer.
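Why premultiplying the weight by $S_I / S_O$ removes the separate rescaling multiplication can be seen from a short derivation. This is a sketch under stated assumptions: it ignores the bias and any activation function, it assumes $I_A \approx S_I \cdot q_I$ and $O_A \approx S_O \cdot q_O$, and the index $k$ over the inputs of one node is notation introduced for the example.

```latex
\begin{aligned}
O_A &= \sum_k I_{A,k}\, w_k \\
S_O\, q_O &\approx \sum_k \left( S_I\, q_{I,k} \right) w_k \\
q_O &\approx \sum_k q_{I,k} \left( w_k \frac{S_I}{S_O} \right) \\
    &\approx \left( \sum_k q_{I,k}\, q'_{w,k} \right) \gg n,
\qquad q'_{w,k} = \mathrm{quantize}\!\left( w_k \frac{S_I}{S_O}\, 2^{n} \right)
\end{aligned}
```

The factor $2^{n}$ folded into $q'_{w,k}$ is undone by the n-bit right shift, so the only operations left after the sum are a shift and no further multiplication.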

According to one aspect of the present invention, there may be provided a computing device comprising a data operation unit that executes an operation of a specific layer of an integer neural network including an input layer, an intermediate layer part, and an output layer, and an internal memory that provides data for the operation to the data operation unit. When the specific layer is a first layer belonging to the intermediate layer part or the output layer, a first node included in the first layer includes: a set of multipliers that multiply a set of input activations input to the first node by a set of weights corresponding to the set of input activations, respectively; an adder that adds the outputs of the set of multipliers to each other; and a shift unit that converts the output of the adder to generate the output activation of the first node. In addition, each of the set of input activations and the set of weights may be n-bit integer data, and the shift unit may be configured to right-shift the output of the adder by n bits.

In this case, when the specific layer is the input layer, a second node included in the input layer may be configured to convert an activation having a real value input to the second node into an activation having an integer value using a predetermined input scaling factor.

In this case, the computing device may further include a scaling unit that converts and outputs the activation output by the output layer using a predetermined output scaling factor.

In this case, the input scaling factor, the output scaling factor, and the set of weights may be provided to the computing device from another computing device. The other computing device may be configured to use information about an original neural network having a structure corresponding to the integer neural network, and to generate the input scaling factor assigned to the input layer of the original neural network, the output scaling factor assigned to the output layer of the original neural network, and scaling factors assigned to each of the intermediate layers defined between the input layer and the output layer of the original neural network. The other computing device may further be configured to: calculate a first value by dividing the L-th scaling factor, assigned to the L-th layer directly upstream of the (L+1)-th layer that includes the node corresponding to the first node in the original neural network, by the (L+1)-th scaling factor assigned to the (L+1)-th layer; calculate a second value by multiplying the weight ($w^{\mathrm{Layer\_L}}_{ab}$) assigned to the link ($\mathrm{link}^{\mathrm{Layer\_L}}_{ab}$) connecting the node having index b of the L-th layer to the node having index a of the (L+1)-th layer by the first value; calculate a third value by multiplying the second value by $2^{n}$; and generate a fourth value by approximating the third value to an integer value. The fourth value may be the weight ($q'^{\,\mathrm{Layer\_L}}_{w.ab}$) of the integer neural network corresponding to the weight ($w^{\mathrm{Layer\_L}}_{ab}$).

According to an aspect of the present invention, there may be provided a neural network driving method in which a terminal including a processing unit and a storage unit generates an output activation of a first node included in the intermediate layer part or the output layer of an integer neural network including an input layer, an intermediate layer part, and an output layer. The neural network driving method may include: an acquiring step in which the processing unit obtains a set of input activations from the storage unit and obtains a set of weights corresponding to the set of input activations from the storage unit; a multiplication step in which the processing unit multiplies mutually corresponding values of the set of weights and the set of input activations to calculate a set of first values; an adding step in which the processing unit adds the set of first values to each other to calculate a second value; and a shifting step in which the processing unit calculates the output activation of the first node by right-shifting the second value by n bits. In this case, each of the set of input activations and the set of weights may be n-bit integer data.

In this case, the neural network driving method may further include: converting, by the processing unit, an activation having a real value input to a second node belonging to the input layer into an activation having an integer value using a predetermined input scaling factor; and converting, by the processing unit, the activation output from the output layer using a predetermined output scaling factor.
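The multiplication, adding, and shifting steps for a single node can be sketched in Python as follows. This is an illustrative sketch rather than the claimed implementation: the function name `node_output` and the sample values are hypothetical, and Python's `>>` on integers is an arithmetic shift, so negative sums are rounded toward negative infinity.

```python
def node_output(input_activations, weights, n):
    """Integer-only node: multiply element-wise, add, then right-shift by n bits."""
    if len(input_activations) != len(weights):
        raise ValueError("each input activation needs exactly one weight")
    # Multiplication step: one integer product per (activation, weight) pair.
    first_values = [q * w for q, w in zip(input_activations, weights)]
    # Adding step: accumulate the products into a single integer.
    second_value = sum(first_values)
    # Shifting step: arithmetic right shift by n bits.
    return second_value >> n


# Hypothetical 8-bit example: (100*64 - 50*32 - 25*128) = 1600, and 1600 >> 8 = 6.
out = node_output([100, -50, 25], [64, 32, -128], n=8)
```

No multiplier is needed after the adder; the only downstream operation is the shift.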

In this case, the input scaling factor, the output scaling factor, and the set of weights may be provided to the terminal from another computing device. The other computing device may be configured to use information about an original neural network having a structure corresponding to the integer neural network, and to generate the input scaling factor assigned to the input layer of the original neural network, the output scaling factor assigned to the output layer of the original neural network, and scaling factors assigned to each of the intermediate layers defined between the input layer and the output layer of the original neural network. In the original neural network, the L-th scaling factor assigned to the L-th layer directly upstream of the (L+1)-th layer that includes the node corresponding to the first node is divided by the (L+1)-th scaling factor assigned to the (L+1)-th layer to calculate a first value. The weight ($w^{\mathrm{Layer\_L}}_{ab}$) assigned to the link ($\mathrm{link}^{\mathrm{Layer\_L}}_{ab}$) connecting the node having index b of the L-th layer to the node having index a of the (L+1)-th layer is multiplied by the first value to calculate a second value, the second value is multiplied by $2^{n}$ to calculate a third value, and the third value is approximated to an integer value to generate a fourth value. The fourth value may be the weight ($q'^{\,\mathrm{Layer\_L}}_{w.ab}$) of the integer neural network corresponding to the weight ($w^{\mathrm{Layer\_L}}_{ab}$).

According to an aspect of the present invention, an integer neural network information processing method may be provided. The integer neural network information processing method may include: a step in which a server generates an input scaling factor assigned to the input layer of a neural network having an input layer, an intermediate layer part, and an output layer, generates an output scaling factor assigned to the output layer, and generates scaling factors assigned to each layer of the intermediate layer part; and a step in which the server calculates a first value by dividing the L-th scaling factor, assigned to the L-th layer directly upstream of the (L+1)-th layer that includes a first node included in the intermediate layer part or the output layer, by the (L+1)-th scaling factor assigned to the (L+1)-th layer, calculates a second value by multiplying the weight ($w^{\mathrm{Layer\_L}}_{ab}$) assigned to the link ($\mathrm{link}^{\mathrm{Layer\_L}}_{ab}$) connecting the node having index b of the L-th layer to the node having index a of the (L+1)-th layer by the first value, calculates a third value by multiplying the second value by $2^{n}$, and generates a fourth value by approximating the third value to an integer value. In this case, the fourth value may be the weight ($q'^{\,\mathrm{Layer\_L}}_{w.ab}$) of the integer neural network corresponding to the weight ($w^{\mathrm{Layer\_L}}_{ab}$).

In this case, the integer neural network information processing method may further include a step in which the server provides the input scaling factor, the output scaling factor, and the weights of the integer neural network to a computing device that executes the operation of the integer neural network.
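The four-value weight generation procedure can be sketched in Python as follows. The function name `integer_weight` and the sample scaling factors are hypothetical. Rounding to the nearest integer with Python's built-in `round` is an assumption, since the text only says the third value is approximated to an integer; `round` resolves exact ties to the nearest even integer. The sketch does not check that the result fits in n bits.

```python
def integer_weight(w, scale_l, scale_l_plus_1, n):
    """Generate the integer weight q'_w.ab from a real-valued weight w.

    scale_l:        scaling factor of layer L (upstream layer)
    scale_l_plus_1: scaling factor of layer L+1 (layer containing the target node)
    n:              number of bits later removed by the node's right shift
    """
    first_value = scale_l / scale_l_plus_1   # ratio of the two scaling factors
    second_value = w * first_value           # weight premultiplied by the ratio
    third_value = second_value * (1 << n)    # multiplied by 2**n
    fourth_value = round(third_value)        # assumption: round to nearest integer
    return fourth_value


# Hypothetical values: ratio 0.25 / 0.5 = 0.5, so 0.5 * 0.5 * 256 = 64.
q_w = integer_weight(0.5, scale_l=0.25, scale_l_plus_1=0.5, n=8)
```

Because this conversion runs once on the server, the terminal never performs the division or the floating-point multiplications.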

According to the present invention, it is possible to provide a technology capable of reducing hardware complexity of a computing device processing a neural network, increasing processing speed, or improving resource utilization.

Specifically, when the neural network is implemented as hardware, it is possible to provide a technique for reducing the complexity of the structure of each computation node in the neural network.

In addition, when the neural network is implemented as software, it is possible to provide a technique for simplifying the computation algorithm in each computation node in the neural network.

In addition, it is possible to provide a technique for defining and calculating a weight parameter required for calculation as accompanying such a technique.

FIG. 1 is a diagram for explaining an operation performed in a neural network according to an embodiment.

FIG. 2 is a conceptual diagram presented to explain the creative process of the present invention.

FIGS. 3A and 3B are conceptual diagrams presented to explain the creative process of the present invention.

FIG. 4 is a conceptual diagram presented to explain the creative process of the present invention.

FIG. 5 illustrates an example of the structure of a neural network executed in a terminal provided according to an embodiment of the present invention.

FIG. 6 shows a method of generating an output activation at each node of the input layer of the neural network shown in FIG. 5 .

FIG. 7 shows a method of generating an output activation at each node of an intermediate layer (hidden layer) of the neural network shown in FIG. 5 .

FIG. 8 shows an output generation method at each node of the intermediate layer (hidden layer) of the neural network shown in FIG. 5 .

FIG. 9 shows an output generation method at each node of the output layer of the neural network shown in FIG. 5 .

FIG. 10 is a diagram illustrating a method of restoring a target value from an output integer value output from each node of the output layer of the neural network shown in FIG. 5.

FIG. 11 shows information provided by the server to the terminal.

FIG. 12 shows the structure of a neural network that has been trained by the server. The neural network shown in FIG. 12 has a structure corresponding to the neural network shown in FIG. 5.

FIG. 13A is a block diagram illustrating a hardware configuration of a server according to an embodiment of the present invention.

FIG. 13B is a diagram for explaining the quantization of a pre-trained neural network according to an embodiment of the present invention and its deployment in a terminal (hardware accelerator).

FIG. 14 is a diagram illustrating a neural network calculation method in a terminal provided according to a comparative embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. However, the present invention is not limited to the embodiments described herein and may be implemented in various other forms. The terminology used herein is for the purpose of helping the understanding of the embodiments, and is not intended to limit the scope of the present invention. Also, singular forms used hereinafter include plural forms unless the phrases clearly indicate the opposite.

The neural network 420 may be implemented as hardware or software within the terminal 200 .

The neural network 420 may include a plurality of layers (layer 1, layer 2, layer 3, layer 4). Each layer may include one or a plurality of operation nodes. In the example shown in FIG. 5 , the input layer includes two nodes and the output layer includes one node. In FIG. 5, each node is expressed as a rectangle including the letter N.

In an embodiment, a value input to each node of the input layer (layer 1) may be a real value in the form of a floating point or a fixed point. In this case, each node of the input layer (layer 1) may convert the real value into an integer value and output it.

In another embodiment, a value input to each node of the input layer (layer 1) may be an integer type.

However, the value output by each node of each layer is an integer value and can be expressed in integer or fixed-point format.

The weight assigned to the link connecting each node is also an integer value, and may be expressed in an integer or fixed-point format.

The integer value output from the output layer (layer 4) is not the target value itself that the neural network 420 should calculate from the real value provided to the input layer, but is proportional to the target value.

Accordingly, the scale unit 210 may restore the target value from the output integer value by applying an output scaling factor to the output integer value output from the output layer. The target value may be an integer or a real value. The target value may be expressed in a fixed-point form or in a floating-point form.

Although the structure of the neural network 420 is simply presented in FIG. 5 for convenience of explanation, it may be implemented more complexly than this according to an embodiment. However, from the structure of the neural network 420 shown in FIG. 5 , the more complex structures of the neural network 420 according to another embodiment can be fully understood.

Each of the nodes shown in FIG. 5 is configured to output an integer. In addition, in order to generate an output from the input of each node, a multiplier existing inside each node may be configured to perform only integer multiplication. Since the multiplier does not need to perform multiplication between real numbers, its complexity is relatively low.

The neural network 420 may be a DNN or n-layer neural network including two or more hidden layers. For example, the neural network 420 may be a DNN including an input layer (Layer 1), two hidden layers (Layer 2 and Layer 3), and an output layer (Layer 4). Meanwhile, although the neural network 420 is illustrated as including four layers, this is only an example and the neural network 420 may include fewer or more layers, or fewer or more channels. That is, the neural network 420 may include layers of various structures.

Each of the layers included in the neural network 420 may include a plurality of channels. A channel may correspond to a plurality of artificial nodes/computation nodes/nodes known as neurons, processing elements (PEs), units, or similar terms. For example, as shown in FIG. 5 , Layer 1 may include two channels (nodes), and Layer 2 and Layer 3 may each include three channels. However, this is only an example, and each of the layers included in the neural network 420 may include a variable number of channels (nodes).

Hereinafter, the operation structure of the neural network shown in FIG. 5 will be described with reference to FIGS. 6 to 10 .

Activation having a real value may be provided to each node (eg, N11 ) of the input layer (layer 1). The real value may have a floating-point or fixed-point format.

Each node of the input layer may convert the input real value into an output activation having an integer value.

When the real value is $I_A$ and the integer value is $q_I$, the real value and the integer value have the relationship of [Equation 1].

[Equation 1]

$$q_I = \mathrm{quantize}( I_A / S_I )$$

Here, $S_I$ is the input scaling factor given in advance, $I_A$ is the input activation (a real number), and $q_I$ is the output activation (an integer).

According to Equation 1, the output activation $q_I$ is generated by dividing the input activation $I_A$ by the previously given input scaling factor $S_I$ and approximating the quotient to an integer value by removing its fractional part. In Equation 1, quantize( ) is an operator that approximates a real number to a nearby integer.
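The input-layer conversion of Equation 1 can be sketched in Python as follows. The text describes quantize( ) both as removing the fractional part and as approximating to a nearby integer, so the sketch offers both behaviors; the function names, the `mode` parameter, and the sample values are hypothetical.

```python
import math


def quantize(x, mode="nearest"):
    """Approximate a real number by an integer.

    mode="nearest":  round half up, via floor(x + 0.5)
    mode="truncate": drop the fractional part (round toward zero)
    """
    if mode == "nearest":
        return math.floor(x + 0.5)
    if mode == "truncate":
        return math.trunc(x)
    raise ValueError(f"unknown mode: {mode}")


def input_node_output(i_a, s_i, mode="nearest"):
    """Equation 1: q_I = quantize(I_A / S_I)."""
    return quantize(i_a / s_i, mode)


# Hypothetical input scaling factor S_I = 0.125.
q_a = input_node_output(0.9, 0.125)                # 7.2 -> 7
q_b = input_node_output(0.95, 0.125)               # 7.6 -> 8
q_c = input_node_output(0.95, 0.125, "truncate")   # 7.6 -> 7
```

The two modes differ only when the fractional part of the quotient is 0.5 or larger (or when the quotient is negative), which bounds the extra quantization error of truncation at one integer step.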

The previously given input scaling factor $S_I$ ($= S^{1}$) may be applied in common to all nodes of the input layer.

Accordingly, in FIG. 6, $q^{1}_{I1} = \mathrm{quantize}( I1^{1}_{A} / S_I )$ and $q^{1}_{I2} = \mathrm{quantize}( I2^{1}_{A} / S_I )$ may be obtained, where $S^{1} = S_I$. The superscript attached to each term indicates the layer number '1' of the first layer, which is the input layer.

As shown in FIG. 6, the output of each node of the input layer is an integer.

As will be described later, the input scaling factor $S_I$ may be provided from a device other than the terminal 200, for example, a server.

In particular, FIG. 7 shows a method of generating output activation at each node of the second layer (layer 2) of the neural network shown in FIG. 5 .

For convenience of explanation, only the 21st node N21 among the nodes of the second layer shown in FIG. 5 is shown in FIG. 7.

As shown in FIG. 6, the nodes N11 and N12 of the input layer output the output activations $q^{1}_{I1}$ and $q^{1}_{I2}$, respectively.

The output activations output from the nodes of the input layer may be input, as input activations for the second layer, to each node of the second layer (layer 2), which is the layer downstream of the input layer (layer 1). In this case, pre-prepared weights $q'^{\,1}_{w.ab}$ corresponding to the respective input activations may be input to each node of the second layer.

The output activation $q^{2}_{Ia}$ of an arbitrary node (a) in the second layer may be expressed as in [Equation 2].

[Equation 2]

$$q^{2}_{Ia} = \mathrm{Right\_Shift}\{\, n2,\ ( q^{1}_{I1} \cdot q'^{\,1}_{w.a1} + q^{1}_{I2} \cdot q'^{\,1}_{w.a2} + b_{21} ) \,\}$$

Here, the superscript indicates the layer number, and Right_Shift{ n2, x } indicates an operation that shifts x to the right by n2 bits.

A multiplication operation is performed in Equation 2, and since the multiplication is multiplication between integers, an operation for multiplying two integers is sufficient. This has the advantage of consuming less hardware resources than an operator performing multiplication between real numbers.

Also, referring to Equation 2 and FIG. 7 , the multiplication operator only needs to be provided upstream of the summation operator, and does not need to be provided downstream of the summation operator. Only the Right_Shift operator needs to be provided downstream of the summation operator.

In $q'^{\,1}_{w.ab}$ of Equation 2, a denotes an index identifying a node of the second layer, b denotes an index identifying a node of the first layer upstream of the second layer, and the superscript indicates the layer number. That is, the weight assigned to each link of the neural network 420 shown in FIG. 5 may be a value provided independently according to the number of the layer to which the weight is provided and the source node and target node of the link.

As will be described later, each weight $q'^{\,\mathrm{Layer\_L}}_{w.ab}$ may be a value provided to the terminal 200 by a device other than the terminal 200, for example, a server.

It can be seen from FIG. 7 that both the input and the output of each node of the second layer are integers.

$b_{21}$ shown in FIG. 7 is a bias applied to the node N21 and may have an integer value.

In particular, FIG. 8 shows a method of generating output activation at each node of the third layer (layer 3) of the neural network 420 shown in FIG. 5 .

For convenience of explanation, only the 31st node N31 is shown in FIG. 8 among the nodes of the third layer shown in FIG. 5 .

Although not shown in FIG. 7, the 22nd node N22 of the second layer outputs the output activation $q^{2}_{I2}$, and the 23rd node N23 of the second layer outputs the output activation $q^{2}_{I3}$.

The output activations output from the nodes of the second layer may be input, as input activations for the third layer, to each node of the third layer, which is the layer downstream of the second layer. In this case, pre-prepared weights $q'^{\,2}_{w.ab}$ corresponding to the respective input activations may be input to each node of the third layer.

As shown in FIG. 8, the 31st node N31 of the third layer outputs the output activation $q^{3}_{I1}$.

The output activation $q^{3}_{Ia}$ of an arbitrary node (a) of the third layer may be expressed as in [Equation 3].

[Equation 3]

$$q^{3}_{Ia} = \mathrm{Right\_Shift}\{\, n3,\ ( q^{2}_{I1} \cdot q'^{\,2}_{w.a1} + q^{2}_{I2} \cdot q'^{\,2}_{w.a2} + q^{2}_{I3} \cdot q'^{\,2}_{w.a3} + b_{31} ) \,\}$$

Here, the superscript indicates the layer number, and Right_Shift{ n3, x } indicates an operation that shifts x to the right by n3 bits.

A multiplication operation is performed in Equation 3, and since the multiplication is multiplication between integers, an operation for multiplying two integers is sufficient.

In $q'^{\,2}_{w.ab}$ of Equation 3, a denotes an index identifying a node of the third layer, b denotes an index identifying a node of the second layer upstream of the third layer, and the superscript indicates the layer number.

$b_{31}$ shown in FIG. 8 is a bias applied to the node N31 and may have an integer value.

In particular, FIG. 9 shows a method of generating output activation at each node of the fourth layer (output layer) of the neural network 420 shown in FIG. 5 .

For convenience of explanation, only the 41st node N41 among the nodes of the output layer shown in FIG. 5 is shown in FIG. 9.

Although not shown in FIG. 8, the 32nd node N32 of the third layer outputs the output activation $q^{3}_{I2}$, and the 33rd node N33 of the third layer outputs the output activation $q^{3}_{I3}$.

The output activations output from the nodes of the third layer may be input, as input activations for the fourth layer, to each node of the fourth layer (the output layer), which is the layer downstream of the third layer. In this case, pre-prepared weights $q'^{\,3}_{w.ab}$ corresponding to the respective input activations may be input to each node of the fourth layer.

As shown in FIG. 9, the 41st node N41 of the fourth layer outputs the output activation $q^{4}_{I1}$ ($= q_{O1}$).

The output activation $q^{4}_{Ia}$ of an arbitrary node (a) of the fourth layer may be expressed as in [Equation 4].

[Equation 4]

$$q^{4}_{Ia} = \mathrm{Right\_Shift}\{\, n4,\ ( q^{3}_{I1} \cdot q'^{\,3}_{w.a1} + q^{3}_{I2} \cdot q'^{\,3}_{w.a2} + q^{3}_{I3} \cdot q'^{\,3}_{w.a3} + b_{41} ) \,\}$$

Here, the superscript indicates the layer number, and Right_Shift{ n4, x } indicates an operation that shifts x to the right by n4 bits.

A multiplication operation is performed in Equation 4, and since the multiplication is multiplication between integers, an operation for multiplying two integers is sufficient.

In q' _w.ab ³ of Equation 4, a is an index identifying a node of the fourth layer, b is an index identifying a node of the third layer upstream of the fourth layer, and the superscript indicates the layer number.

b ₄₁ shown in FIG. 9 is a bias applied to the node N41 and may have an integer value.
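The node computation of Equation 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `right_shift` and `node_output` and the sample values are hypothetical, and it assumes that the right shift is an arithmetic shift (Python's `>>` on integers), which the description does not specify.

```python
def right_shift(n: int, x: int) -> int:
    """Shift integer x right by n bits (arithmetic shift; an assumption)."""
    return x >> n


def node_output(inputs: list[int], weights: list[int], bias: int, n: int) -> int:
    """Integer-only output activation of one node, following Equation 4."""
    # Multiply each input activation by its weight and accumulate.
    acc = sum(q * w for q, w in zip(inputs, weights))
    # Add the integer bias, then rescale with a single right shift.
    return right_shift(n, acc + bias)


# Hypothetical values for the three input activations, weights, bias, and n4.
# (10*3 + 20*(-1) + 30*2 + 4) >> 3 = 74 >> 3 = 9
example = node_output([10, 20, 30], [3, -1, 2], bias=4, n=3)
```

Only integer multiplication, integer addition, and one shift are used, which matches the operators named in the description.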

Any two distinct numbers selected from n2, n3, and n4 shown in Equation 2, Equation 3, and Equation 4 may be different from or the same as each other.

From the above description referring to FIGS. 7 to 9, it can be understood that the output activation (q _Ia ^Layer_L+1 ) output by each node (a) of a layer (L+1), which is a hidden layer or the output layer of the neural network 420 of FIG. 5, can be expressed as [Equation 5].

[Equation 5]

q _Ia ^Layer_L+1 = Right_Shift { n , sum ( q _Ib ^Layer_L * q' _w.ab ^Layer_L , b=1, 2, 3, ... ) }, a=1, 2, 3, ...

Here, a is the index of a specific node of layer L+1, b is the index of a specific node of layer L, and q' _w.ab ^Layer_L is the weight assigned to the link from node b of layer L to node a of layer L+1.
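Equation 5 applied layer by layer can be sketched as follows. The names `layer_forward` and `network_forward`, the nested-list weight layout `q_w[a][b]`, and the numbers are assumptions made for illustration; as in Equation 5, the bias is omitted, and each layer carries its own shift amount because n2, n3, and n4 may differ.

```python
def layer_forward(q_in: list[int], q_w: list[list[int]], n: int) -> list[int]:
    """Output activations of layer L+1 from the activations of layer L.

    q_w[a][b] is the integer weight on the link from node b of layer L
    to node a of layer L+1.
    """
    return [sum(q * w for q, w in zip(q_in, row)) >> n for row in q_w]


def network_forward(q_in: list[int], layers: list[tuple[list[list[int]], int]]) -> list[int]:
    """Chain several layers; each entry is (weights, shift amount n)."""
    act = q_in
    for q_w, n in layers:
        act = layer_forward(act, q_w, n)
    return act


# Hypothetical two-layer example with different shift amounts per layer.
hidden = ([[4, 2], [1, -3]], 2)   # 2 inputs -> 2 nodes, n = 2
output = ([[2, 1]], 1)            # 2 inputs -> 1 node,  n = 1
result = network_forward([8, 16], [hidden, output])
# hidden layer: [64 >> 2, -40 >> 2] = [16, -10]; output layer: [22 >> 1] = [11]
```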

The restored value O _A, which the scale unit 210 restores from the output integer value q _O output by a node belonging to the output layer, satisfies the relationship of [Equation 6].

[Equation 6]

O _A = q _O · S _O

Here, S _O is a predetermined output layer scaling factor.

According to Equation 6, O1 _A = q _O1 · S _O and O2 _A = q _O2 · S _O hold in FIG. 10 .

Although the value (q _O ) output by a node belonging to the output layer is an integer, the restored value output by the scale unit may be a real number expressed in a fixed-point format or a floating-point format.
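A minimal sketch of the restoration of Equation 6 is shown below. The function name `restore_outputs` and the sample values are hypothetical; Python floats stand in for the real-valued format, which the description leaves open (fixed point or floating point).

```python
def restore_outputs(q_outputs: list[int], s_o: float) -> list[float]:
    """Restore real-valued outputs from integer outputs (Equation 6)."""
    return [q * s_o for q in q_outputs]


# Hypothetical integer outputs q_O1, q_O2 and output layer scaling factor S_O.
restored = restore_outputs([11, -4], s_o=0.25)   # [2.75, -1.0]
```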

As will be described later, S _O may be a value provided to the terminal 200 by a device other than the terminal, for example, a server.

The Right_Shift operation performed at each node of the hidden layers and the output layer of the neural network 420 shown in FIG. 5 may all be an n-bit shift.

The output values of each node of the neural network 420 shown in FIG. 5 are all n-bit integers, and the values of each weight are also all n-bit integers.

FIG. 11 shows information provided by the server to the terminal.

The server 100 may provide the above-described input layer scaling factor (S _I ) and output layer scaling factor (S _O ) to the terminal 200 .

Also, the server 100 may provide all the above-described weights to the terminal 200 .

In addition, the server 100 may provide the terminal 200 with an n value used for the n-bit Right-Shift operation.

The n value used for the right-shift operation may be independently set for each layer. Accordingly, the server 100 may provide the terminal 200 with n values to be assigned to each layer of the integer neural network provided according to an embodiment of the present invention.

To this end, the server 100 must be able to generate the above-described data. This method will be described with reference to FIG. 12 .

It can be assumed that the terminal 200 already has structural information about the neural network, such as the number of layers, the number of nodes, and the number of links. In this case, the server 100 may provide the terminal 200 with only the input layer scaling factor (S _I ), the output layer scaling factor (S _O ), the weights, and the n values used for the n-bit Right-Shift operation.

If the terminal 200 does not have the structure information of the neural network, the server 100 may have to provide not only the above-mentioned information but also the structure information of the neural network to the terminal 200 .
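The information exchanged between the server and the terminal could be grouped as in the sketch below. The class name `QuantizedModelPayload`, its field names, and the `validate` helper are all hypothetical; the description only lists what is provided (the two scaling factors, the weights, the per-layer n values, and optionally the structure information), not a concrete format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantizedModelPayload:
    """Data the server sends to the terminal (all field names are hypothetical)."""
    s_in: float                      # input layer scaling factor S_I
    s_out: float                     # output layer scaling factor S_O
    weights: List[List[List[int]]]   # weights[l][a][b]: integer weight per link
    shifts: List[int]                # n used by the Right_Shift of each layer
    layer_sizes: Optional[List[int]] = None  # structure info, sent only if needed

    def validate(self) -> None:
        """Check that the fields are mutually consistent."""
        if len(self.weights) != len(self.shifts):
            raise ValueError("one shift amount is required per weight layer")
        if self.layer_sizes is None:
            return  # the terminal is assumed to know the structure already
        if len(self.layer_sizes) != len(self.weights) + 1:
            raise ValueError("layer_sizes must list every layer once")
        for l, matrix in enumerate(self.weights):
            fan_in, fan_out = self.layer_sizes[l], self.layer_sizes[l + 1]
            if len(matrix) != fan_out or any(len(row) != fan_in for row in matrix):
                raise ValueError(f"weight shape mismatch at layer index {l}")


payload = QuantizedModelPayload(
    s_in=0.125,
    s_out=0.25,
    weights=[[[4, 2], [1, -3]], [[2, 1]]],
    shifts=[2, 1],
    layer_sizes=[2, 2, 1],
)
payload.validate()
```

Leaving `layer_sizes` as `None` corresponds to the case in which the terminal already holds the structure information.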

FIG. 12 shows the structure of a neural network that has been trained by the server. The neural network 410 shown in FIG. 12 has a structure corresponding to the neural network 420 shown in FIG. 5 .

The weight w.ab assigned to each link of the learned neural network 410 may be a real value. The real value may be expressed in a fixed-point method or a floating-point method.

The neural network 410 illustrated in FIG. 12 includes a total of four layers including an input layer and an output layer. Each layer is assigned a corresponding scaling factor. The value of the scaling factor may be determined as an arbitrary value, but in a preferred embodiment may be determined according to a well-designed method. For example, Patent Publication No. 10-2019-0014900 provides an example of generating a scaling factor assigned to each layer.

Among the determined scaling factors, the input layer scaling factor (S _I =S ¹ ) allocated to the input layer and the output layer scaling factor (S _O =S ⁴ ) allocated to the output layer may be provided to the terminal 200 .

The scaling factors S ² , S ³ allocated to the hidden layers do not need to be provided to the terminal 200 , but may be used in the process of calculating the respective weights q' _w.ab ^Layer_L to be provided to the terminal 200 .

The weight to be provided by the server 100 to the terminal 200 may be calculated by Equation 7.

[Equation 7]

q' _w.ab ^layer_L = quantize ((S ^Layer_L /S ^Layer_L+1 ) * w.ab ^Layer_L * 2 ⁿ )

Here, layer_L represents the L-th layer,

layer_L+1 represents the L+1-th layer,

S ^Layer_L represents the scaling factor given to the L-th layer,

S ^Layer_L+1 represents the scaling factor given to the L+1-th layer,

w.ab ^Layer_L represents the weight given to the link from node b of the Lth layer to the node a of the L+1th layer,

q' _w.ab ^layer_L is an n-bit integer, which is a quantized weight obtained from w.ab ^Layer_L for the server to provide to the terminal.
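Equation 7 can be sketched as below. The description does not define `quantize`, so this sketch assumes round-to-nearest (Python's built-in `round`, which rounds ties to even) followed by saturation to the signed n-bit range; the function name `quantize_weight` and all numbers are hypothetical. The one-node check at the end further assumes that a real value is approximated by the integer value multiplied by the scaling factor of its layer, in line with Equation 6.

```python
def quantize_weight(w: float, s_l: float, s_next: float, n: int) -> int:
    """Integer weight per Equation 7 (rounding and saturation are assumptions)."""
    scaled = (s_l / s_next) * w * (1 << n)
    q = round(scaled)                           # round to nearest integer
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, q))                  # saturate to signed n bits


# One-node consistency check with hypothetical numbers.
s_l, s_next, n = 0.125, 0.25, 8
x = [1.0, 2.0]                                  # real activations of layer L
w = [0.25, 0.125]                               # real weights of the trained network
q_in = [round(v / s_l) for v in x]              # [8, 16]
q_w = [quantize_weight(v, s_l, s_next, n) for v in w]   # [32, 16]
q_out = sum(a * b for a, b in zip(q_in, q_w)) >> n      # 512 >> 8 = 2
restored = q_out * s_next                       # 0.5
reference = sum(a * b for a, b in zip(w, x))    # 0.5 from the real-valued node
```

In this example the factor 2 ⁿ folded into the weight is cancelled by the n-bit right shift at the node, and the ratio S ^Layer_L /S ^Layer_L+1 moves the result onto the scale of the next layer, so no separate rescaling multiplier is needed.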

FIG. 13A is a block diagram illustrating a hardware configuration of a server according to an embodiment of the present invention. The server may be referred to as a neural network quantization device.

Referring to FIG. 13A , the server 100 includes a processor 110 and a memory 120 . In the server 100 shown in FIG. 13A, only the components related to the present embodiments are shown. Accordingly, it is apparent to those skilled in the art that other general-purpose components may be further included in the server 100 in addition to the components illustrated in FIG. 13A .

The server 100 corresponds to a computing device having various processing functions, such as generating a neural network 410, training (or learning) the neural network 410, quantizing the floating-point type neural network 410 into a fixed-point type neural network 420, or retraining the neural network 410 . For example, the server 100 may be implemented with various types of devices such as a personal computer (PC), a server device, and a mobile device.

The processor 110 serves to perform an overall function for controlling the server 100 . For example, the processor 110 generally controls the server 100 by executing programs stored in the memory 120 in the server 100 . The processor 110 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), etc. provided in the server 100 , but is not limited thereto.

The memory 120 is hardware for storing various types of data processed in the server 100 ; for example, the memory 120 may store data that has been processed and data to be processed by the server 100 . In addition, the memory 120 may store applications, drivers, and the like to be run by the server 100 . The memory 120 may be a DRAM, but is not limited thereto. The memory 120 may include at least one of a volatile memory and a nonvolatile memory. Nonvolatile memory includes ROM (Read Only Memory), PROM (Programmable ROM), EPROM (Electrically Programmable ROM), EEPROM (Electrically Erasable and Programmable ROM), flash memory, PRAM (Phase-change RAM), MRAM (Magnetic RAM), RRAM (Resistive RAM), FRAM (Ferroelectric RAM), and the like. Volatile memory includes DRAM (Dynamic RAM), SRAM (Static RAM), SDRAM (Synchronous DRAM), and the like. In an embodiment, the memory 120 may include at least one of a hard disk drive (HDD), a solid state drive (SSD), compact flash (CF), secure digital (SD), micro secure digital (Micro-SD), mini secure digital (Mini-SD), extreme digital (xD), or a Memory Stick.

The processor 110 may generate the trained neural network 410 by repeatedly training (learning) a given initial neural network. In this case, the initial neural network may have floating-point type parameters, for example, parameters of 32-bit floating point precision in order to secure processing accuracy of the neural network. Here, the parameters may include various types of data input/output to the neural network, such as input/output activations, weights, and biases of the neural network. As the neural network iteratively trains, the floating-point parameters of the neural network can be tuned to compute a more accurate output for a given input.

Floating-point arithmetic requires a relatively large amount of computation and more frequent memory access than fixed-point arithmetic. In particular, it is known that most of the computation required to process a neural network consists of convolution operations over various parameters. Accordingly, processing of the neural network 410 having floating-point type parameters may not run smoothly on devices with relatively low processing performance, such as smartphones, tablets, wearable devices, and embedded devices. Therefore, so that such devices can drive the neural network within an allowable loss of accuracy while sufficiently reducing the amount of computation, the server 100 according to an embodiment of the present invention may quantize the floating-point type parameters processed in the neural network 410 . Here, parameter quantization means converting a floating-point type parameter into a fixed-point type parameter having an integer value.

The server 100 quantizes the parameters of the trained neural network 410 into a fixed-point type of a predetermined number of bits, in consideration of the processing performance of the device (e.g., mobile device, embedded device, etc.) on which the neural network is to be deployed. After performing the quantization, the server 100 transmits the quantized neural network 420 to the device on which it is to be deployed. That device may be the aforementioned terminal 200 . Specific examples include, but are not limited to, autonomous vehicles, robotics, smartphones, tablet devices, augmented reality (AR) devices, and Internet of Things (IoT) devices that perform voice recognition and image recognition using neural networks.

The processor 110 obtains the data of the neural network 410, pre-trained using floating point, that is stored in the memory 120 . The pretrained neural network 410 may be data repeatedly trained with floating-point type parameters. The neural network may first be trained iteratively with training-set data as input and then trained iteratively again with test-set data, but the training is not necessarily limited thereto. The training-set data is input data for training the neural network, and the test-set data is input data that does not overlap with the training-set data and is used for training while measuring the performance of the neural network trained with the training-set data.

In an embodiment, the processor 110 may analyze the statistical distribution for each channel of the floating-point type parameter values used in each layer included in each of the feature maps and the kernels from the pre-trained neural network data. The processor 110 may determine the scaling factors corresponding to each of the above-described layers based on the analyzed statistical distribution for each layer. For example, the scaling factors may be determined so that the above-described q' _w.ab ^Layer_L value has sufficient resolution in the form of an n bit integer.
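One way to check whether a candidate pair of scaling factors leaves the integer weights with "sufficient resolution" is sketched below. This diagnostic is purely hypothetical and is not the method of the cited publication: the function name `resolution_report`, the rounding rule, the signed n-bit range, and the sample values are all assumptions.

```python
def resolution_report(weights: list[float], s_l: float, s_next: float, n: int) -> dict:
    """Count weights that vanish or overflow when mapped to n-bit integers."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    collapsed = saturated = 0
    for w in weights:
        q = round((s_l / s_next) * w * (1 << n))   # unclamped integer weight
        if q == 0 and w != 0.0:
            collapsed += 1                         # nonzero weight lost entirely
        elif q < lo or q > hi:
            saturated += 1                         # outside the signed n-bit range
    return {"collapsed_to_zero": collapsed, "saturated": saturated, "total": len(weights)}


# Hypothetical weights of one layer and candidate scaling factors.
report = resolution_report([0.001, 0.3, 1.5], s_l=0.25, s_next=0.5, n=8)
# 0.001 -> 0 (collapsed), 0.3 -> 38 (fine), 1.5 -> 192 (saturated)
```

A large count in either category would suggest trying different scaling factors for the two layers involved.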

Meanwhile, the memory 120 may store data sets related to the neural network that have been or are to be processed by the processor 110 , for example, untrained initial neural network data, neural network data generated during training, fully trained neural network data, and quantized neural network data. The memory 120 may also store various programs to be executed by the processor 110 , such as a training algorithm and a quantization algorithm for the neural network.

Referring to FIG. 13B, as described above, in the server (100 in FIG. 13A), such as a PC or a server device, the processor (110 in FIG. 13A) trains the neural network 410 of a floating-point type (e.g., 32-bit floating-point type). Since the pretrained neural network 410 itself may not be processed efficiently in a low-power or low-performance hardware accelerator because of its floating-point type parameters, the processor 110 of the server 100 quantizes the floating-point type neural network 410 into the neural network 420 of a fixed-point type (e.g., a fixed-point type of 16 bits or less). The terminal (hardware accelerator) is dedicated hardware for driving the neural network 420 and, because it is implemented with relatively low power or low performance, may be better suited to fixed-point operations than to floating-point operations. The hardware accelerator may correspond to, for example, a neural processing unit (NPU), a tensor processing unit (TPU), or a neural engine, which are dedicated modules for driving a neural network, but is not limited thereto.

The hardware accelerator for driving the quantized neural network 420 may be implemented in an independent device separate from the server 100 . However, the present invention is not limited thereto, and the hardware accelerator may be implemented in the same device as the server 100 .

In another embodiment, the terminal 200 may implement the quantized neural network 420 with a CPU and software rather than with a hardware accelerator.

Comparing FIG. 14 with FIG. 7, each node in the comparative embodiment of FIG. 14 further includes one arithmetic unit for multiplying integers. Accordingly, the embodiment of FIG. 7 has lower complexity than the embodiment of FIG. 14 .

In addition, in FIG. 14 , an additional parameter q _S ² for the added operator must also be provided. That is, the number of parameters that the server should provide to the terminal 200 is greater in the embodiment of FIG. 14 than in the embodiment of FIG. 7 .

As the number of nodes included in the neural network increases, the complexity of the neural network according to FIG. 14 increases according to the above-described phenomenon, and the amount of necessary parameters also increases.

In this aspect, the operation structure in each node according to FIG. 7 has a great advantage compared to the operation structure in each node according to FIG. 14 . As a result, the neural network of FIG. 5 having each node structure according to FIG. 7 can also enjoy improved technical effects.
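The difference in complexity can be illustrated with a simple count. This sketch rests on assumptions that go beyond the text: it assumes that the structure of FIG. 14 adds exactly one integer multiplier to every node and one extra parameter (q _S ) per layer, and the function name `layer_cost` and the layer size are hypothetical.

```python
def layer_cost(fan_in: int, nodes: int, extra_scale_multiplier: bool) -> dict:
    """Count integer multipliers and parameters needed for one layer."""
    multipliers = fan_in * nodes      # one multiplier per incoming link
    parameters = fan_in * nodes       # one integer weight per incoming link
    if extra_scale_multiplier:        # FIG. 14 style (assumed structure)
        multipliers += nodes          # one additional multiplier in each node
        parameters += 1               # assumed: one extra q_S value per layer
    return {"multipliers": multipliers, "parameters": parameters}


# Hypothetical layer with three inputs and three nodes.
fig7 = layer_cost(fan_in=3, nodes=3, extra_scale_multiplier=False)
fig14 = layer_cost(fan_in=3, nodes=3, extra_scale_multiplier=True)
# fig7:  9 multipliers,  9 parameters
# fig14: 12 multipliers, 10 parameters
```

Under these assumptions the gap in multipliers grows with the number of nodes, which is consistent with the statement that the complexity of the FIG. 14 structure increases as the network grows.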

In FIG. 14 , the display of the bias b ₃₁ applied to the node N31 is omitted for convenience of description.

The multiplier and the adder used in the embodiment of the present invention may be integer type operators.

By using the above-described embodiments of the present invention, those skilled in the art will be able to easily implement various changes and modifications within a scope that does not depart from the essential characteristics of the present invention. The content of each claim may be combined with other claims to which it has no citation relationship, within the scope understandable from this specification.

<Acknowledgment>

The present invention was developed by Open Edge Technology Co., Ltd. (the task execution organization) in the course of carrying out the research task of developing a complex sensory-based situational prediction type mobile artificial intelligence processor (task unique number 2020001310, task number 2020-0-01310, research period 2020.04.01 ~ 2024.12.31), as part of the next-generation intelligent semiconductor technology development (design) artificial intelligence processor project, a research project supported by the Ministry of Science and ICT and the Information and Communication Planning and Evaluation Institute affiliated with the National Research Foundation.

Claims

1. A computing device comprising: a data operation unit that executes an operation of a specific layer of an integer type neural network including an input layer, an intermediate layer part, and an output layer; and an internal memory that provides data for the operation to the data operation unit,

wherein, when the specific layer is a first layer belonging to the intermediate layer part or the output layer, a first node included in the first layer includes:

a set of multipliers for multiplying a set of input activations input to the first node by a set of weights corresponding to the set of input activations, respectively;

an adder for adding the outputs of the set of multipliers to each other; and

a shift unit for converting an output of the adder to generate an output activation of the first node,

wherein each of the set of input activations and the set of weights is n-bit integer data, and the shift unit is configured to right-shift the output of the adder by n bits.
2. The computing device of claim 1, wherein, when the specific layer is the input layer, a second node included in the input layer is adapted to convert an activation having a real value input to the second node into an activation having an integer value using a predetermined input scaling factor.
3. The computing device of claim 1, further comprising a scaling unit that converts the activation output from the output layer using a predetermined output scaling factor and outputs the result.
4. The computing device of claim 3,

wherein the input scaling factor, the output scaling factor, and the set of weights are provided to the computing device from another computing device,

wherein the other computing device is configured to, using information about an original neural network having a structure corresponding to the integer type neural network:

generate the input scaling factor assigned to an input layer of the original neural network;

generate the output scaling factor assigned to an output layer of the original neural network;

generate scaling factors assigned to each of the intermediate layers defined between the input layer and the output layer in the original neural network; and

in the original neural network, calculate a first value by dividing the L-th scaling factor, assigned to the L-th layer directly upstream of the (L+1)-th layer that includes the node corresponding to the first node, by the (L+1)-th scaling factor assigned to the (L+1)-th layer, calculate a second value by multiplying the weight ( w.ab ^Layer_L ) assigned to the link ( link ab ^Layer_L ) connecting the node having index b of the L-th layer to the node having index a of the (L+1)-th layer by the first value, calculate a third value by multiplying the second value by 2 ⁿ , and generate a fourth value by approximating the third value to an integer value,

wherein the fourth value is the weight ( q' _w.ab ^layer_L ) of the integer type neural network corresponding to the weight ( w.ab ^Layer_L ).
5. A method for driving a neural network, in which a terminal including a processing unit and a storage unit generates an output activation of a first node included in an intermediate layer or an output layer of an integer neural network including an input layer, the intermediate layer, and the output layer, the method comprising:

obtaining, by the processing unit, a set of input activations from the storage unit and a set of weights corresponding to the set of input activations from the storage unit;

a multiplication step in which the processing unit multiplies corresponding values of the set of weights and the set of input activations by each other to calculate a set of first values;

an adding step in which the processing unit adds the set of first values to each other to calculate a second value; and

a shifting step of calculating, by the processing unit, the output activation of the first node by right-shifting the second value by n bits,

wherein each of the set of input activations and the set of weights is n-bit integer data.
6. The method of claim 5, further comprising:

converting, by the processing unit, an activation having a real value input to a second node belonging to the input layer into an activation having an integer value using a predetermined input scaling factor; and

converting, by the processing unit, the activation output from the output layer using a predetermined output scaling factor.
7. The method of claim 5,

wherein the input scaling factor, the output scaling factor, and the set of weights are provided to the terminal from another computing device,

wherein the other computing device is configured to, using information about an original neural network having a structure corresponding to the integer type neural network:

generate the input scaling factor assigned to an input layer of the original neural network;

generate the output scaling factor assigned to an output layer of the original neural network;

generate scaling factors assigned to each of the intermediate layers defined between the input layer and the output layer in the original neural network; and

in the original neural network, calculate a first value by dividing the L-th scaling factor, assigned to the L-th layer directly upstream of the (L+1)-th layer that includes the node corresponding to the first node, by the (L+1)-th scaling factor assigned to the (L+1)-th layer, calculate a second value by multiplying the weight ( w.ab ^Layer_L ) assigned to the link ( link ab ^Layer_L ) connecting the node having index b of the L-th layer to the node having index a of the (L+1)-th layer by the first value, calculate a third value by multiplying the second value by 2 ⁿ , and generate a fourth value by approximating the third value to an integer value,

wherein the fourth value is the weight ( q' _w.ab ^layer_L ) of the integer type neural network corresponding to the weight ( w.ab ^Layer_L ).
8. An integer type neural network information processing method comprising:

generating, by a server, an input scaling factor assigned to the input layer of a neural network having an input layer, an intermediate layer part, and an output layer, an output scaling factor assigned to the output layer, and scaling factors assigned to each layer of the intermediate layer part; and

a step in which the server calculates a first value by dividing the L-th scaling factor, assigned to the L-th layer directly upstream of the (L+1)-th layer that includes a first node included in the intermediate layer part or the output layer, by the (L+1)-th scaling factor assigned to the (L+1)-th layer, calculates a second value by multiplying the weight ( w.ab ^Layer_L ) assigned to the link ( link ab ^Layer_L ) connecting the node having index b of the L-th layer to the node having index a of the (L+1)-th layer by the first value, calculates a third value by multiplying the second value by 2 ⁿ , and generates a fourth value by approximating the third value to an integer value,

wherein the fourth value is the weight ( q' _w.ab ^layer_L ) of the integer type neural network corresponding to the weight ( w.ab ^Layer_L ).
9. The integer type neural network information processing method of claim 8, further comprising providing, by the server, the input scaling factor, the output scaling factor, and the weight of the integer type neural network to a computing device that executes the operation of the integer type neural network.