WO2022244216A1 - Learning device, inference device, learning method, inference method, and program - Google Patents

Learning device, inference device, learning method, inference method, and program

Info

Publication number
WO2022244216A1
Authority
WO
WIPO (PCT)
Prior art keywords
ternarization
ternary
learning
parameter
neural network
Prior art date
Application number
PCT/JP2021/019268
Other languages
French (fr)
Japanese (ja)
Inventor
宗一郎 加来
京介 西田
仙 吉田
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/019268 (WO2022244216A1)
Priority to JP2023522144A (JPWO2022244216A1)
Publication of WO2022244216A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention relates to a learning device, an inference device, a learning method, an inference method, and a program.
  • a neural network model includes a large number of linear transformations, and in particular, matrix operations during these linear transformations affect the computation time.
  • quantization means approximating and expressing a float value, which is normally expressed by 32 bits, by a smaller number of bits (for example, 2 bits, 8 bits, etc.). Quantization is also called bit-lowering.
  • Activation is a vector input to each layer of the neural network model.
  • Non-Patent Document 1 and Non-Patent Document 2 are known as conventional methods related to quantization of weights and activations of neural network models.
  • Non-Patent Documents 1 and 2 both ternarize the weights of the language model called BERT (Bidirectional Encoder Representations from Transformers) and quantize its activations to 8 bits.
  • ternarization means approximating a real value by a product (or a sum of a plurality of products) of a scalar value and one of three integer values, as will be described later.
  • An embodiment of the present invention has been made in view of the above points, and aims to ternarize the activation of a neural network model with high accuracy.
  • A learning device according to one embodiment includes an inference unit that infers a predetermined task using a neural network model; a ternarization unit that uses a plurality of ternary vectors to ternarize an activation representing the input to each layer constituting the neural network model; and a learning unit that learns model parameters of the neural network model and ternarization parameters for expressing the activation with ternary values.
  • the activation of the neural network model can be ternarized with high accuracy.
  • FIG. 2 is a diagram showing an example of the functional configuration of the inference device at the time of inference.
  • FIG. 3 is a diagram showing an example of the functional configuration of the inference device during learning.
  • FIG. 4 is a flowchart showing an example of inference processing in Example 1.
  • FIG. 5 is a flowchart showing an example of learning processing in Example 1.
  • FIG. 6 is a flowchart showing an example of inference processing in Example 2, and FIG. 7 is a diagram showing a comparison with the ternarization vector of a conventional method.
  • FIG. 8 is a flowchart showing an example of learning processing in Example 2, and FIG. 9 is a diagram showing a comparison with the pseudo gradient of a conventional method.
  • an inference device 10 that ternarizes the activation of a neural network model with high accuracy and executes inference of a predetermined task using the neural network model will be described.
  • When quantizing the activations of a neural network model, it is common to also quantize its weights, so in the following the weights are assumed to have already been quantized by a known quantization technique (for example, binarization, ternarization, etc.).
  • Quantization of the weights has the effect of reducing model size, and when both weights and activations are quantized, a speedup can be expected by using a dedicated implementation of the matrix operations performed during linear transformations.
  • In particular, with binarization and ternarization, significantly faster matrix operations using logical operations can be expected.
  • Binarization refers to approximating a real value by a sum of n products of a scalar value and one of two integer values (for example, {-1, 1}), and ternarization refers to approximating a real value by a sum of n products of a scalar value and one of three integer values (for example, {-1, 0, 1}).
  • the scalar value is a real number greater than 0, and n is a natural number.
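  • As a concrete illustration of this definition (not a procedure from the embodiments), the following Python sketch approximates a real-valued vector by a single scalar-times-ternary-vector product (n = 1); the thresholding rule used to pick the ternary elements is only an assumption made for the example.

```python
import numpy as np

# A real-valued activation vector (hypothetical values).
x = np.array([0.8, -0.1, 0.05, -0.9])

# One simple ternarization with n = 1: threshold at half the mean magnitude.
# This particular thresholding rule is an illustrative assumption, not the
# procedure defined by the embodiments.
threshold = 0.5 * np.abs(x).mean()
B = np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))  # elements in {-1, 0, 1}
a = np.abs(x[B != 0]).mean() if np.any(B != 0) else 0.0          # scalar value a > 0

approx = a * B         # the product a*B approximates x with a single ternary term
residual = x - approx  # error that further terms a_2*B_2, ... could absorb
print(B, a, approx, residual)
```

  • With n > 1, the same idea is applied again to the remaining residual, which is what the procedures described below do.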
  • FIG. 1 is a diagram showing an example of the hardware configuration of an inference device 10 according to this embodiment.
  • The inference device 10 is realized by the hardware configuration of a general computer or computer system, and includes an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a processor 105, and a memory device 106. These pieces of hardware are communicably connected via a bus 107.
  • the input device 101 is, for example, a keyboard, mouse, touch panel, or the like.
  • the display device 102 is, for example, a display. Note that the inference device 10 does not have to have at least one of the input device 101 and the display device 102 .
  • the external I/F 103 is an interface with an external device such as the recording medium 103a.
  • the inference device 10 can perform reading, writing, etc. of the recording medium 103 a via the external I/F 103 .
  • Examples of the recording medium 103a include CD (Compact Disc), DVD (Digital Versatile Disk), SD memory card (Secure Digital memory card), USB (Universal Serial Bus) memory card, and the like.
  • the communication I/F 104 is an interface for connecting the inference device 10 to a communication network.
  • the processor 105 is, for example, various arithmetic units such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit).
  • the memory device 106 is, for example, various storage devices such as HDD (Hard Disk Drive), SSD (Solid State Drive), RAM (Random Access Memory), ROM (Read Only Memory), and flash memory.
  • the inference device 10 has the hardware configuration shown in FIG. 1, so that inference processing and learning processing, which will be described later, can be realized.
  • the hardware configuration shown in FIG. 1 is merely an example, and the inference device 10 may have other hardware configurations.
  • The inference device 10 may have multiple processors 105 and multiple memory devices 106.
  • the inference device 10 has two phases: learning and inference.
  • During learning, the parameters of the neural network model (hereinafter also referred to as "model parameters") and the parameters for ternarizing the activation (hereinafter also referred to as "ternarization parameters") are learned.
  • During inference, the learned model parameters and the learned ternarization parameters are used to ternarize the activations, and the neural network model performs inference.
  • the inference device 10 during learning may be called a "learning device” or the like. Also, the inference device 10 during learning and the inference device 10 during inference may be realized by different devices or systems.
  • FIG. 2 is a diagram showing an example of the functional configuration of the inference device 10 during inference.
  • the inference device 10 at the time of inference has a ternarization unit 201 and an inference unit 202 .
  • Each of these units is realized by processing that one or more programs installed in the inference apparatus 10 at the time of inference cause the processor 105 to execute.
  • The ternarization unit 201 uses the learned ternarization parameters to ternarize the activation of each layer of the neural network model. That is, the ternarization unit 201 uses the learned ternarization parameters to create a ternarization vector by ternarizing each element of the real-valued vector input to each layer of the neural network model, and outputs this ternarization vector to the neural network model as the activation.
  • the learned ternary parameter exists for each layer of the neural network model, and is stored in the memory device 106 or the like, for example.
  • The inference unit 202 is realized by the neural network model, and uses the learned model parameters to infer a given task. That is, in each layer of the neural network model, the inference unit 202 receives the ternarization vector created by the ternarization unit 201 and, using the learned model parameters, outputs a real-valued vector as the output vector of the layer. At this time, the output of the final layer (output layer) of the neural network model (or the result of performing predetermined processing on it) becomes the inference result.
  • the learned model parameters exist for each layer of the neural network model, and are stored in the memory device 106 or the like, for example.
  • a neural network model is generally composed of multiple layers, and each layer has neurons (also called units, nodes, etc.).
  • a weight representing the strength of connection between neurons exists between each layer.
  • When a layer of the neural network model is a linear layer (also called a fully connected layer), the following operations 1 to 3 are performed in the neurons of that layer.
  • Operation 1: Compute the weighted sum of the input vector (activation) given to the layer. Operation 2: Add the bias term. Operation 3: Apply the activation function (for example, ReLU).
  • The computation time of the neural network model is dominated by the time required to compute the weighted sum in Operation 1, that is, by the inner product w_1 v_1 + ... + w_dim v_dim of the weight vector w = (w_1, ..., w_dim) and the input vector v = (v_1, ..., v_dim).
  • the weights of the neural network model are quantized by a known quantization technique (in particular, quantization expressed with low bits such as binarization and ternarization), and activation is ternarized, it is possible to reduce the computation time of the neural network model.
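  • The benefit of this decomposition shows up in Operation 1: if the activation is expressed as a_1B_1 + ... + a_nB_n, the weighted sum W·x can be computed as the sum of a_i (W·B_i), where each W·B_i involves only values in {-1, 0, 1}. The sketch below illustrates this with ordinary dense numpy arrays; the actual speedup would come from a dedicated low-bit kernel, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, out_dim, n = 8, 4, 2

W = np.sign(rng.standard_normal((out_dim, dim)))      # low-bit weights (illustrative)
bias = rng.standard_normal(out_dim)
a = np.array([0.6, 0.2])                              # scalar values a_1, a_2
B = rng.integers(-1, 2, size=(n, dim)).astype(float)  # ternary vectors B_1, B_2

# Operation 1: the weighted sum, computed per ternary component.
# Each W @ B[i] involves only values in {-1, 0, 1}, so in a dedicated
# implementation it needs no floating-point multiplications.
weighted_sum = sum(a[i] * (W @ B[i]) for i in range(n))

# Operation 2: add the bias term; Operation 3: apply an activation function (ReLU here).
output = np.maximum(weighted_sum + bias, 0.0)
print(output)
```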
  • FIG. 3 is a diagram showing an example of the functional configuration of the inference device 10 during learning.
  • the inference device 10 during learning has a ternarization unit 201 , an inference unit 202 , and a learning unit 203 .
  • Each of these units is realized by processing that one or more programs installed in the inference apparatus 10 at the time of learning cause the processor 105 to execute.
  • the ternarization unit 201 and the inference unit 202 are the same as in inference. However, it differs from inference in that untrained ternarization parameters and untrained model parameters are used. Note that the ternarization parameter exists for each layer of the neural network model and is stored in the memory device 106 or the like, for example. Similarly, model parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106 or the like.
  • the learning unit 203 learns the ternarization parameters used when the ternarization unit 201 ternarizes the activation, and learns the model parameters of the neural network model that implements the inference unit 202 .
  • Example 1 of the present embodiment will be described below.
  • In Example 1, the activation X = (x_1, ..., x_dim) of a layer is approximated by the ternarization vector a_1 B_1 + ... + a_n B_n, where the ternarization parameter of the layer is A = (a_1, ..., a_n). Here, a_i (i = 1, ..., n) is a scalar value satisfying a_i > 0, and B_i (i = 1, ..., n) is a vector whose elements each take one of -1, 0, and 1.
  • FIG. 4 is a flowchart illustrating an example of inference processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
  • steps S101 to S106 in FIG. 4 are repeatedly executed for each layer of the neural network model. Steps S101 to S106 for a certain layer of the neural network model will be described below.
  • Step S101 The inference unit 202 inputs the real-valued vector given to the neural network model or the real-valued vector that is the output vector of the previous layer.
  • That is, the inference unit 202 inputs the real-valued vector given to the neural network model if the layer is the first layer (input layer), and otherwise inputs the real-valued vector that is the output vector of the previous layer. Note that the real-valued vector given to the neural network model is the inference target data of the task.
  • Step S102 Using the learned ternarization parameter, the ternarization unit 201 ternarizes the real-valued vector (that is, activation) input in step S101 to create a ternarization vector. At this time, the ternarization unit 201 creates a ternarization vector according to procedures 1-1 to 1-3 below.
  • Procedure 1-1: Using the learned scalar value a_1, the ternarization unit 201 applies the quantization function Q element-wise, that is, Q_a(X) = (Q_a(x_1), Q_a(x_2), ..., Q_a(x_dim)), and sets B_1 = Q_{a_1}(X). Here, a is a scalar value, and Q_a maps each real value to one of -1, 0, and 1.
  • The first ternary vector is a_1 B_1, and X_1 = X - a_1 B_1 can be regarded as the error that could not be approximated by the first ternary vector a_1 B_1.
  • Procedure 1-2: The ternarization unit 201 recursively and repeatedly calculates the i-th ternary vector a_i B_i (i ≥ 2). That is, when the ternary vectors up to a_1 B_1, ..., a_{i-1} B_{i-1} have been obtained, the ternarization unit 201 applies the quantization function with the learned scalar value a_i to the residual X_{i-1} = X - (a_1 B_1 + ... + a_{i-1} B_{i-1}) to obtain B_i, and sets the i-th ternary vector to a_i B_i.
  • Procedure 1-3: When the n ternary vectors a_1 B_1, ..., a_n B_n have been obtained, the ternarization unit 201 sets a_1 B_1 + ... + a_n B_n as the ternarization vector. As a result, the ternarized vector a_1 B_1 + ... + a_n B_n is obtained.
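  • A minimal sketch of procedures 1-1 to 1-3 is shown below, assuming that the quantization function Q_a maps each element to -1, 0, or 1 by thresholding at a/2; the exact form of Q is defined by the embodiment, so this threshold is only an assumption.

```python
import numpy as np

def Q(x, a):
    """Element-wise ternary quantization with scalar a (thresholding at a/2 is an assumption)."""
    return np.where(x > a / 2, 1.0, np.where(x < -a / 2, -1.0, 0.0))

def ternarize_activation(x, A):
    """Approximate the activation x by a_1*B_1 + ... + a_n*B_n using the learned scalars in A."""
    approx = np.zeros_like(x, dtype=float)
    for a_i in A:                    # procedure 1-1 for i = 1, procedure 1-2 for i >= 2
        residual = x - approx        # error not yet approximated
        B_i = Q(residual, a_i)       # ternary vector for the current residual
        approx = approx + a_i * B_i
    return approx                    # procedure 1-3: the ternarization vector

x = np.array([0.9, -0.4, 0.05, -1.2])
A = [0.8, 0.3]                       # learned ternarization parameters (hypothetical values)
print(ternarize_activation(x, A))
```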
  • Step S103: Using the trained model parameters, the inference unit 202 calculates the weighted sum of the ternarization vector created in step S102 for each neuron included in the layer. That is, for example, if the ternarization vector a_1 B_1 + ... + a_n B_n is written as v = (v_1, v_2, ..., v_dim) and the weight vector of the neuron is w = (w_1, w_2, ..., w_dim), the inference unit 202 calculates w_1 v_1 + ... + w_dim v_dim. Note that the weights are included in the trained model parameters.
  • Step S104 The inference unit 202 adds a bias term to the weighted sum calculated in step S103 above using the learned model parameters for each neuron included in the layer. That is, for example, if the bias term of the neuron is b, the inference unit 202 calculates w 1 v 1 + . . . +w dim v dim +b. Note that the bias term is included in the learned model parameters.
  • Step S105 The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S104. That is, for example, if the activation function of the neuron is ⁇ ( ⁇ ), the inference unit 202 calculates ⁇ (w 1 v 1 + . . . +w dim v dim +b).
  • the above steps S101 to S106 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) is the inference result.
  • FIG. 5 is a flowchart illustrating an example of learning processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
  • steps S201 to S207 in FIG. 5 are repeatedly executed for each layer of the neural network model. Steps S201 to S207 for a certain layer of the neural network model will be described below.
  • Step S201 The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
  • In step S202, the ternarization unit 201 creates a ternarization vector from the real-valued vector (activation) X input in step S201. First, it obtains the scalar value a and the ternary vector B that minimize ||X - aB||, where B is a vector whose elements are either -1, 0, or 1 and ||·|| is the L2 norm, and sets a_1 = a and B_1 = B.
  • Here, an approximate solution is obtained using Newton's method with reference to Algorithm 2 described in Reference 3. This takes advantage of the fact that when either a or B is fixed, the other value that minimizes ||X - aB|| can easily be obtained.
  • Note that an approximate solution may also be obtained by a technique other than Newton's method.
  • Next, the ternarization unit 201 recursively and repeatedly calculates the i-th ternary vector a_i B_i (i ≥ 2). That is, when the ternary vectors up to a_1 B_1, ..., a_{i-1} B_{i-1} have been obtained, the ternarization unit 201 obtains the scalar value a and the ternary vector B that minimize ||X - (a_1 B_1 + ... + a_{i-1} B_{i-1}) - aB||.
  • The ternarization unit 201 sets the scalar value a obtained in this manner to a_i and the ternary vector B to B_i.
  • When the n ternary vectors a_1 B_1, ..., a_n B_n have been obtained, the ternarization unit 201 sets a_1 B_1 + ... + a_n B_n as the ternarization vector.
  • As a result, the ternarized vector a_1 B_1 + ... + a_n B_n is obtained.
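  • The alternating idea behind this minimization (fix one of a and B and solve for the other) can be sketched as follows; the coordinate-descent sub-steps shown here replace the Newton iteration of Algorithm 2 in Reference 3 and are stated as assumptions, not as the patented procedure.

```python
import numpy as np

def solve_aB(x, iters=10):
    """Alternately update a and B to (approximately) minimize ||x - a*B||_2.

    For fixed a the optimal B thresholds x at +/- a/2, and for fixed B the
    optimal a is the least-squares coefficient (x . B) / (B . B); Newton's
    method as in Algorithm 2 of Reference 3 could be used instead.
    """
    a = float(np.abs(x).mean()) + 1e-12   # simple initial guess (assumption)
    B = np.zeros_like(x)
    for _ in range(iters):
        B = np.where(x > a / 2, 1.0, np.where(x < -a / 2, -1.0, 0.0))
        if not np.any(B):
            break
        a = float(x @ B) / float(B @ B)
    return a, B

x = np.array([1.1, -0.3, 0.02, -0.8])
a1, B1 = solve_aB(x)              # first term a_1 * B_1
a2, B2 = solve_aB(x - a1 * B1)    # second term fitted to the residual
print(a1, B1, a2, B2)
```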
  • In step S203, the learning unit 203 updates the ternarization parameter a'_i of the layer by a moving average with the scalar value a_i obtained above, where the moving-average coefficient is a parameter whose value lies between 0 and 1.
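  • A minimal sketch of such a moving-average update is shown below; the coefficient name `beta` and its default value are hypothetical.

```python
def update_ternarization_param(a_prime, a_new, beta=0.9):
    """Moving-average update of a learned ternarization scalar a'_i.

    The coefficient name `beta` (0 < beta < 1) and the blending direction are
    assumptions; the embodiment only states that a'_i is updated by a moving average.
    """
    return beta * a_prime + (1.0 - beta) * a_new

a_prime = 0.75   # current ternarization parameter a'_i
a_new = 0.68     # scalar a_i obtained for the current mini-batch
print(update_ternarization_param(a_prime, a_new))
```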
  • Step S204 Similar to step S103 in FIG. 4, the inference unit 202 uses the model parameters for each neuron included in the layer to calculate the weighted sum of the ternarized vectors created in step S202 above. do.
  • Step S205 Similar to step S104 in FIG. 4, the inference unit 202 adds a bias term to the weighted sum calculated in step S204 using model parameters for each neuron included in the layer. do.
  • Step S206 The inference unit 202 calculates an activation function using the calculation result of step S205 above, as in step S105 of FIG.
  • Step S207 As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
  • Step S208 When the above steps S201 to S207 are executed up to the final layer, the learning unit 203 updates the model parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters.
  • Note that the differential value of the quantization function Q is necessary to calculate the differential of the loss function by the error backpropagation method, but since Q is not differentiable, the error cannot be backpropagated as is. Therefore, in the present embodiment, the differential value of the quantization function Q is given in a pseudo manner by the STE (straight-through estimator) technique described in Reference 4.
  • STE is one of the basic techniques used in quantization learning.
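  • The STE idea can be sketched in plain numpy as follows: the quantization function is applied in the forward pass, and in the backward pass its derivative is treated as if it were the identity, optionally clipped to the representable range (the clipping range used here is an assumption).

```python
import numpy as np

def quantize_forward(x, a):
    """Forward pass: ternary quantization (same assumed thresholding rule as above)."""
    return a * np.where(x > a / 2, 1.0, np.where(x < -a / 2, -1.0, 0.0))

def quantize_backward_ste(x, a, grad_output):
    """Backward pass with STE: pass the incoming gradient straight through,
    zeroing it outside the representable range [-a, a] (the clipping is an assumption)."""
    pass_through = (np.abs(x) <= a).astype(float)
    return grad_output * pass_through

x = np.array([0.9, -0.4, 0.05, -1.6])
a = 0.8
y = quantize_forward(x, a)
grad_y = np.ones_like(y)                      # pretend dL/dy = 1 for illustration
grad_x = quantize_backward_ste(x, a, grad_y)  # pseudo dL/dx given by the STE
print(y, grad_x)
```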
  • model parameters may be updated for each batch, which is a set of learning data.
  • TernaryBERT ternarizes the weights of the BERT model and uses 8-bit activation; by applying this embodiment to ternarize the activation as well, a further speedup can be expected.
  • As will be described later, accuracy was improved compared to the conventional method.
  • The BERT parameters are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V of the query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th layer block; the linear transformation matrix W_l^O of the output applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
  • The weights ternarized by TernaryBERT are one embedding layer whose weight matrix is W_e (the embedding matrices for segment and position are excluded) and six linear layers whose weight matrices are {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}.
  • The TernaryBERT model code is not open to the public. Therefore, in this application example, an additional implementation was made based on the code of the RoBERTa model provided by Huggingface (for example, see Reference 5). However, in TernaryBERT the embedding matrix W_e is subjected to a special ternarization in which quantization coefficients are prepared for each dimension; in this application example, W_e is not quantized for the sake of simplicity.
  • For the six linear layers having the weight matrices {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}, ternarization is performed according to Example 1.
  • Pre-training of each model was performed using the Wikitext-103 dataset (for example, see Reference 6).
  • The models to be trained are the non-quantized RoBERTa model (for example, see Reference 7), the additionally implemented TernaryBERT model (8-bit activation), the TernaryBERT model with ternary activation, and models in which this embodiment is applied to the additionally implemented TernaryBERT with n = 1 and n = 2, respectively.
  • n is the number of ternary vectors when approximating activation with a ternary vector.
  • the batch size was 64, the number of epochs was 3, and one commercially available general GPU was used.
  • The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training.
  • For optimization, Adam was used, and the dropout rate was 0.1.
  • In TernaryBERT, distillation learning is performed using a real-valued model as the teacher model (for example, see Non-Patent Document 1). Therefore, no pre-training was performed for TernaryBERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 1, using Huggingface's RoBERTa model pre-trained on the Wikitext-103 dataset as the teacher model.
  • Table 1 below shows the results of evaluation experiments for each model under the above conditions.
  • ppl is word perplexity.
  • The ppl is a general evaluation index that qualitatively corresponds to the reciprocal of the "certainty of word prediction" of a language model; lower is better. In pre-training, the task of hiding some words in a sentence and predicting the hidden words from the surrounding words is solved.
  • The initial values of each model were obtained by the pre-training described above; the batch size was 16, the number of epochs was 3, and one commercially available general-purpose GPU was used. The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training. For optimization, Adam was used, and the dropout rate was 0.1.
  • In TernaryBERT, distillation learning is performed using a real-valued model as the teacher model (for example, see Non-Patent Document 1). Therefore, fine-tuning was not performed for TernaryBERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 1, using Huggingface's RoBERTa model fine-tuned on each task of the GLUE dataset as the teacher model.
  • Table 2 below shows the results of evaluation experiments for each model under the above conditions.
  • F1 was adopted for the QQP task, and accuracy was adopted for the SST-2 task.
  • Example 2 of the present embodiment will be described below.
  • In Example 2, the activation is approximated by the sum of two ternary vectors s_1 B_1 + s_2 B_2, where s_1 and s_2 are scalar values that satisfy s_1 > 2s_2 > 0.
  • FIG. 6 is a flowchart illustrating an example of inference processing according to the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
  • steps S301 to S306 in FIG. 6 are repeatedly executed for each layer of the neural network model. Steps S301 to S306 for a certain layer of the neural network model will be described below.
  • Step S301 The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is the inference target data of the task.
  • Step S302: Using the learned ternarization parameters, the ternarization unit 201 ternarizes the real-valued vector (activation) input in step S301. The first ternary vector is s_1 B_1, and the second ternary vector is s_2 B_2.
  • The ternarization unit 201 sets s_1 B_1 + s_2 B_2 as the ternarization vector.
  • Step S303 Similar to step S103 in FIG. 4, the inference unit 202 uses the trained model parameters for each neuron included in the layer to obtain a weighted sum of the ternary vectors created in step S302 above. to calculate
  • Step S304 Similar to step S104 in FIG. 4, the inference unit 202 uses the learned model parameters for each neuron included in the layer to apply the bias term to the weighted sum calculated in step S303. Add
  • Step S305 The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S304 above, as in step S105 of FIG.
  • Step S306 As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
  • the above steps S301 to S306 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) becomes the inference result.
  • Here, a conventional method called LSQ quantization (for example, see Reference 9) is compared with this embodiment.
  • In LSQ quantization, one ternary vector represents the activation; that is, in contrast to this embodiment, in LSQ quantization the activation X is represented by s_1 B_1 alone, whereas in this embodiment the ternary vector representing the activation X is s_1 B_1 + s_2 B_2.
  • The i-th element of the ternary vector of this embodiment and the i-th element of the LSQ-quantized ternary vector are as shown in FIG. 7, where the vertical axis is the i-th element of the ternary vector and the horizontal axis is the i-th element of the activation X.
  • In this embodiment, each of the three values given by s_1 is fine-tuned by s_2, thereby reducing the accuracy loss compared to LSQ quantization.
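  • The difference can be sketched as follows, assuming (as in LSQ-style quantizers) that each ternary vector is obtained by clipping and rounding the (residual) activation divided by the corresponding scale; the embodiment's exact quantizer may differ, so this is only an illustrative comparison.

```python
import numpy as np

def ternary(x, s):
    """One LSQ-style ternary component: round(clip(x / s, -1, 1)), with values in {-1, 0, 1}."""
    return np.rint(np.clip(x / s, -1.0, 1.0))

def lsq_quantize(x, s1):
    """Conventional LSQ-style ternarization: the activation is represented by s1 * B1 only."""
    return s1 * ternary(x, s1)

def two_scale_quantize(x, s1, s2):
    """Example 2: s1 * B1 plus a fine correction s2 * B2 computed on the residual."""
    B1 = ternary(x, s1)
    B2 = ternary(x - s1 * B1, s2)
    return s1 * B1 + s2 * B2

x = np.linspace(-1.5, 1.5, 7)
s1, s2 = 1.0, 0.3                    # hypothetical scales satisfying s1 > 2*s2 > 0
print(lsq_quantize(x, s1))
print(two_scale_quantize(x, s1, s2))
```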
  • FIG. 8 is a flowchart illustrating an example of learning processing in the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
  • steps S402 to S407 in FIG. 8 are repeatedly executed for each layer of the neural network model. Steps S402 to S407 relating to a certain layer of the neural network model will be described below. Note that the initialization in step S401 is executed only once. For example, when model parameters and ternarization parameters are repeatedly updated for each batch, the initialization in step S401 is performed only for the first batch, and not performed for the second and subsequent batches.
  • Step S401 The learning unit 203 initializes the ternarization parameter using the learning data first given to the neural network model. Specifically, the learning unit 203 initializes the ternarization parameter according to procedures 4-1 to 4-3 below. Let X be the real-valued vector represented by the learning data first given to the neural network model.
  • In procedure 4-1, the learning unit 203 obtains the scalar value s and the ternary vector B that minimize ||X - sB||, and sets s_1 = s and B_1 = B.
  • B is a vector whose elements are either -1, 0, or 1.
  • In procedure 4-2, the learning unit 203 obtains the scalar value s and the ternary vector B that minimize ||X - s_1 B_1 - sB||, and sets s_2 = s.
  • In procedure 4-3, the learning unit 203 sets (s_1, s_2) obtained in procedures 4-1 to 4-2 as the initial values of the ternarization parameter S.
  • Step S402 The inference unit 202 inputs the real-valued vector given to the neural network model or the real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
  • Step S403: Using the ternarization parameters, the ternarization unit 201 ternarizes the real-valued vector (activation) input in step S402. The first ternary vector is s_1 B_1, and the second ternary vector is s_2 B_2.
  • The ternarization unit 201 sets s_1 B_1 + s_2 B_2 as the ternarization vector.
  • Step S404 Similar to step S103 in FIG. 4, the inference unit 202 uses the model parameters for each neuron included in the layer to calculate the weighted sum of the ternary vectors created in step S403 above. do.
  • Step S405 Similar to step S104 in FIG. 4, the inference unit 202 adds a bias term to the weighted sum calculated in step S404 using model parameters for each neuron included in the layer. do.
  • Step S406 The inference unit 202 calculates an activation function using the calculation result of step S405 above, as in step S105 of FIG.
  • Step S407 As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
  • Step S408 When the above steps S402 to S407 are executed up to the final layer, the learning unit 203 updates the model parameters and the ternarization parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters and the ternarization parameters.
  • Note that the differential value of the quantization function Q is necessary in order to calculate the differential of the loss function by the error backpropagation method, but since Q is not differentiable, the error cannot be backpropagated as is. Therefore, in this embodiment, the differential value of the quantization function Q is given in a pseudo manner.
  • Specifically, the pseudo gradient of LSQ quantization (for example, see Reference 9), which expresses the activation by one ternary vector, is extended to the case where the activation is expressed by the sum of two ternary vectors.
  • In this case, the value of the k-th dimension of the activation after quantization is determined from the value of the k-th dimension of the activation before quantization.
  • round(x) is a function that rounds off the decimal point of x to the nearest integer value
  • Sign(x) is a sign function that returns 1 if 0 ⁇ x and -1 if x ⁇ 0.
  • The LSQ quantization pseudo gradient and the pseudo gradient provided in this embodiment are shown in FIG. 9, where the vertical axis is the pseudo derivative of the k-th element of the ternarized vector with respect to the ternarization parameter and the horizontal axis is the k-th element of the ternarized vector.
  • In this embodiment, each of the three values given by s_1 is fine-tuned by s_2; that is, the value is adjusted by s_2 in the vicinity of {s_1, 0, -s_1}.
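  • As a hedged sketch of what such a pseudo gradient can look like, the code below follows the standard LSQ rule for the derivative of the quantized value with respect to a step size (pass -x/s + round(x/s) inside the clipping range, and the clipped level outside), applied first to s_1 and then to s_2 on the residual; the embodiment's exact pseudo gradient may differ from this reconstruction.

```python
import numpy as np

def lsq_grad_wrt_scale(x, s):
    """LSQ-style pseudo derivative d q(x) / d s for q(x) = s * round(clip(x / s, -1, 1))."""
    v = x / s
    inside = (-1.0 < v) & (v < 1.0)
    return np.where(inside, -v + np.rint(v), np.sign(v))

def two_scale_pseudo_grads(x, s1, s2):
    """Pseudo gradients of the two-scale quantizer with respect to s1 and s2 (reconstruction)."""
    B1 = np.rint(np.clip(x / s1, -1.0, 1.0))
    residual = x - s1 * B1                   # part handled by the second ternary vector
    g_s1 = lsq_grad_wrt_scale(x, s1)         # coarse levels {-s1, 0, s1}
    g_s2 = lsq_grad_wrt_scale(residual, s2)  # fine adjustment near each coarse level
    return g_s1, g_s2

x = np.linspace(-1.5, 1.5, 7)
print(two_scale_pseudo_grads(x, s1=1.0, s2=0.3))
```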
  • model parameters and the ternarization parameters may be updated for each batch, which is a set of learning data.
  • KDLSQ-BERT uses ternary weights and 8-bit activation for the BERT model; by applying this embodiment to ternarize the activation as well, a speedup can be expected.
  • In KDLSQ-BERT, LSQ quantization is used to reduce the activation bit width; as will be described later, it has been confirmed that this embodiment improves accuracy compared to LSQ quantization.
  • The BERT parameters are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V of the query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th layer block; the linear transformation matrix W_l^O of the output applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
  • The KDLSQ-BERT model code has not been released. Therefore, in this application example, an additional implementation was made based on the code of the RoBERTa model provided by Huggingface (for example, see Reference 5).
  • In KDLSQ-BERT, a special ternarization is applied to the embedding matrix W_e by preparing quantization coefficients for each dimension; in this application example, W_e is not quantized for simplicity.
  • For the six linear layers having the weight matrices {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}, ternarization is performed according to Example 2.
  • Pre-training of each model was performed using the Wikitext-103 dataset (for example, see Reference 6).
  • The models to be trained are the non-quantized RoBERTa model (for example, see Reference 7), the additionally implemented KDLSQ-BERT model (8-bit activation), the KDLSQ-BERT model with ternary activation, and a model in which this embodiment is applied to the additional implementation.
  • In the following, these models are referred to as "RoBERTa", "KDLSQ-BERT (8-bit activation)", "KDLSQ-BERT (ternary activation)", and "this example", respectively.
  • the batch size was 64, the number of epochs was 12, and one commercially available general GPU was used.
  • The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training.
  • For optimization, Adam was used, and the dropout rate was 0.1.
  • KDLSQ-BERT performs distillation learning using a real-valued model as the teacher model (for example, see Non-Patent Document 2). Therefore, pre-training was not performed for KDLSQ-BERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 2, using Huggingface's RoBERTa model pre-trained on the Wikitext-103 dataset as the teacher model.
  • Table 3 below shows the results of evaluation experiments for each model under the above conditions.
  • The ppl is 17.24 for KDLSQ-BERT (ternary activation), whereas the ppl is 5.27 in this example; it can be seen that this example is superior to ternarization of the activation by LSQ quantization.
  • the evaluation targets were RoBERTa, KDLSQ-BERT (8-bit activation), KDLSQ-BERT (3-value activation), and this example.
  • The initial values of each model were obtained by the pre-training described above; the batch size was 16, the number of epochs was 3, and one commercially available general-purpose GPU was used. The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training. For optimization, Adam was used, and the dropout rate was 0.1.
  • KDLSQ-BERT performs distillation learning using a real-valued model as the teacher model (for example, see Non-Patent Document 2). Therefore, fine-tuning was not performed for KDLSQ-BERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 2, using Huggingface's RoBERTa model fine-tuned on each task of the GLUE dataset as the teacher model.
  • Table 4 below shows the results of evaluation experiments for each model under the above conditions.
  • this embodiment greatly suppresses the decrease in accuracy in each task.
  • The decrease in accuracy is suppressed to about 2 to 5% compared to RoBERTa; it can be said that this example succeeds in keeping the activation below 8 bits while largely maintaining the accuracy of the language model, which has been difficult until now.
  • Although this embodiment is inferior to KDLSQ-BERT (8-bit activation) in terms of accuracy, a speedup over KDLSQ-BERT (8-bit activation) can be expected because both the weights and the activations are ternary.
  • the inference device 10 can highly accurately ternarize the activation of a neural network model and perform inference of a predetermined task at high speed using the neural network model.
  • experiments were conducted with various natural language tasks using a Transformer language model represented by BERT as a neural network model, and it was confirmed that accuracy deterioration can be greatly suppressed compared to conventional methods.
  • (Appendix 1) A learning device including a memory and at least one processor connected to the memory, wherein the processor performs inference of a given task by a neural network model, ternarizes, using a plurality of ternary vectors, the activation representing the input to each layer constituting the neural network model, and learns model parameters of the neural network model and ternarization parameters for expressing the activation in three values.
  • (Appendix 2) The learning device according to Appendix 1, wherein the processor, for each layer, recursively and repeatedly calculates the scalar value and the ternary vector that minimize the distance between the activation and a sum of n products of a scalar value and a ternary vector (where n is a predetermined natural number), and learns the ternarization parameter by a moving average with the scalar value.
  • (Appendix 3) The learning device according to Appendix 1, wherein the processor, for each layer, ternarizes the activation representing the input to the layer by a quantization function having the ternarization parameter, and learns the ternarization parameter by error backpropagation using a pseudo gradient of the quantization function with respect to the ternarization parameter.
  • The ternarization parameter may be composed of a first ternarization parameter and a second ternarization parameter, in which case the processor obtains the first scalar value, the second scalar value, the first ternary vector, and the second ternary vector that minimize the distance between the activation and the sum of the product of a first scalar value and a first ternary vector having -1, 0, or +1 in each element and the product of a second scalar value and a second ternary vector having -1, 0, or +1 in each element; uses the first scalar value and the second scalar value as the initial values of the first ternarization parameter and the second ternarization parameter, respectively; ternarizes, for each layer, the activation representing the input to the layer by the sum of the product of the first ternarization parameter and the first ternary vector and the product of the second ternarization parameter and the second ternary vector; and learns the ternarization parameters using the pseudo gradient of the quantization function with respect to the first ternarization parameter and the pseudo gradient with respect to the second ternarization parameter.
  • (Appendix 5) An inference device including a memory and at least one processor connected to the memory, wherein the processor performs inference of a given task by a neural network model, and ternarizes, using a plurality of ternary vectors and learned ternarization parameters for expressing real values in ternary values, the activations representing the inputs to each layer constituting the neural network model.
  • A non-transitory storage medium storing a program executable by a computer to perform a learning process, wherein the learning process includes performing inference of a given task by a neural network model, ternarizing, using a plurality of ternary vectors, the activation representing the input to each layer constituting the neural network model, and learning model parameters of the neural network model and ternarization parameters for expressing the activation in three values.
  • A non-transitory storage medium storing a program executable by a computer to perform an inference process, wherein the inference process includes performing inference of a given task by a neural network model, and ternarizing, using a plurality of ternary vectors and learned ternarization parameters for expressing real values with ternary values, the activations representing the inputs to each layer constituting the neural network model.
  • Reference 1 Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, Heng Tao Shen. TBN: Convolutional Neural Network with Ternary Inputs and Binary Weights (ECCV 2018)
  • Reference 2 Fengfu Li, Bo Zhang, Bin Liu. Ternary Weight Networks (2016)
  • Reference 3 Lu Hou, James T. Kwok. LOSS-AWARE WEIGHT QUANTIZATION OF DEEP NETWORKS (2018)
  • Reference 4 Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Reference 5 huggingface, Internet ⁇ URL: https://github.com/huggingface/transformers>
  • Reference 6 Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. Pointer Sentinel Mixture Models (2016)
  • Reference 7 Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-anov, V.: RoBERTa: A Robustly Optimized BERT Pre-training Approach, CoRR, Vol.
  • 10 inference device, 101 input device, 102 display device, 103 external I/F, 103a recording medium, 104 communication I/F, 105 processor, 106 memory device, 107 bus, 201 ternarization unit, 202 inference unit, 203 learning unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

This learning device, according to one embodiment, includes: an inference unit that infers a prescribed task by a neural network model; a ternarization unit that uses a plurality of ternary vectors to ternarize activation representing input to each layer constituting the neural network model; and a learning unit that learns a model parameter of the neural network model and a ternarization parameter for expressing the activation with three values.

Description

Learning device, inference device, learning method, inference method, and program
 The present invention relates to a learning device, an inference device, a learning method, an inference method, and a program.
 In recent years, improvements in the performance of neural network models have attracted attention, but at the same time the number of parameters and the computational cost of neural network models have been increasing. For this reason, research aiming to make neural network models lighter and faster while suppressing deterioration in accuracy is attracting attention in both academia and industry. A neural network model includes a large number of linear transformations, and in particular the matrix operations performed during these linear transformations affect the computation time.
 As one line of research on making neural network models lighter and faster, methods that quantize the weights and activations of a neural network model are known. Here, quantization means approximating a float value, which is normally expressed by 32 bits, with a smaller number of bits (for example, 2 bits or 8 bits). Quantization is also called bit reduction. An activation is a vector input to each layer of the neural network model.
 For example, the methods described in Non-Patent Document 1 and Non-Patent Document 2 are known as conventional methods for quantizing the weights and activations of neural network models. Non-Patent Documents 1 and 2 both ternarize the weights of the language model called BERT (Bidirectional Encoder Representations from Transformers) and quantize its activations to 8 bits. Note that ternarization means approximating a real value by a product (or a sum of a plurality of products) of a scalar value and one of three integer values, as will be described later.
 If both the weights and the activations of a neural network model are ternarized (reduced to 2 bits), the model can be made even lighter and faster. However, in quantization there is a trade-off between maintaining accuracy and reducing size and increasing speed. In particular, it is difficult to maintain accuracy when the activation is expressed with low bits such as binary or ternary values.
 An embodiment of the present invention has been made in view of the above points, and aims to ternarize the activation of a neural network model with high accuracy.
 In order to achieve the above object, a learning device according to one embodiment includes an inference unit that infers a predetermined task using a neural network model; a ternarization unit that uses a plurality of ternary vectors to ternarize an activation representing the input to each layer constituting the neural network model; and a learning unit that learns model parameters of the neural network model and ternarization parameters for expressing the activation with ternary values.
 The activation of a neural network model can be ternarized with high accuracy.
 FIG. 1 is a diagram showing an example of the hardware configuration of an inference device according to this embodiment. FIG. 2 is a diagram showing an example of the functional configuration of the inference device at the time of inference. FIG. 3 is a diagram showing an example of the functional configuration of the inference device during learning. FIG. 4 is a flowchart showing an example of inference processing in Example 1. FIG. 5 is a flowchart showing an example of learning processing in Example 1. FIG. 6 is a flowchart showing an example of inference processing in Example 2. FIG. 7 is a diagram showing a comparison with the ternarization vector of a conventional method. FIG. 8 is a flowchart showing an example of learning processing in Example 2. FIG. 9 is a diagram showing a comparison with the pseudo gradient of a conventional method.
 An embodiment of the present invention will be described below. In this embodiment, an inference device 10 that ternarizes the activation of a neural network model with high accuracy and executes inference of a predetermined task using that neural network model will be described. When quantizing the activations of a neural network model, it is common to also quantize its weights, so in the following the weights are assumed to have already been quantized by a known quantization technique (for example, binarization, ternarization, etc.). Quantization of the weights has the effect of reducing model size, and when both weights and activations are quantized, a speedup can be expected by using a dedicated implementation of the matrix operations performed during linear transformations. In particular, with binarization and ternarization, significantly faster matrix operations using logical operations can be expected.
 Note that binarization refers to approximating a real value by a sum of n products of a scalar value and one of two integer values (for example, {-1, 1}), and ternarization refers to approximating a real value by a sum of n products of a scalar value and one of three integer values (for example, {-1, 0, 1}). Here, the scalar value is a real number greater than 0, and n is a natural number.
 <Hardware configuration>
 First, the hardware configuration of the inference device 10 according to this embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of the hardware configuration of the inference device 10 according to this embodiment.
 As shown in FIG. 1, the inference device 10 according to this embodiment is realized by the hardware configuration of a general computer or computer system, and includes an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a processor 105, and a memory device 106. These pieces of hardware are communicably connected via a bus 107.
 The input device 101 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 102 is, for example, a display. Note that the inference device 10 does not have to have at least one of the input device 101 and the display device 102.
 The external I/F 103 is an interface with an external device such as the recording medium 103a. The inference device 10 can read from and write to the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
 The communication I/F 104 is an interface for connecting the inference device 10 to a communication network. The processor 105 is, for example, one of various arithmetic units such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The memory device 106 is, for example, one of various storage devices such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory.
 The inference device 10 according to this embodiment has the hardware configuration shown in FIG. 1, and can thereby realize the inference processing and learning processing described later. Note that the hardware configuration shown in FIG. 1 is merely an example, and the inference device 10 may have another hardware configuration. For example, the inference device 10 may have multiple processors 105 and multiple memory devices 106.
 <Functional configuration>
 The inference device 10 according to this embodiment has two phases: learning and inference. During learning, the parameters of the neural network model (hereinafter also referred to as "model parameters") and the parameters for ternarizing the activation (hereinafter also referred to as "ternarization parameters") are learned. During inference, the learned model parameters and the learned ternarization parameters are used to ternarize the activations, and the neural network model performs inference.
 Note that the inference device 10 during learning may be called a "learning device" or the like. Also, the inference device 10 during learning and the inference device 10 during inference may be realized by different devices or systems.
 ≪During Inference≫
 The functional configuration of the inference device 10 at the time of inference will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the functional configuration of the inference device 10 during inference.
 As shown in FIG. 2, the inference device 10 at the time of inference has a ternarization unit 201 and an inference unit 202. Each of these units is realized by processing that one or more programs installed in the inference device 10 cause the processor 105 to execute.
 The ternarization unit 201 uses the learned ternarization parameters to ternarize the activation of each layer of the neural network model. That is, the ternarization unit 201 uses the learned ternarization parameters to create a ternarization vector by ternarizing each element of the real-valued vector input to each layer of the neural network model, and outputs this ternarization vector to the neural network model as the activation. Note that learned ternarization parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106.
 The inference unit 202 is realized by the neural network model, and uses the learned model parameters to infer a given task. That is, in each layer of the neural network model, the inference unit 202 receives the ternarization vector created by the ternarization unit 201 and, using the learned model parameters, outputs a real-valued vector as the output vector of the layer. At this time, the output of the final layer (output layer) of the neural network model (or the result of performing predetermined processing on it) becomes the inference result. Note that learned model parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106.
 ここで、ニューラルネットワークモデルは、一般に、複数の層で構成されており、各層にはニューロン(又は、ユニット、ノード等とも呼ばれる。)が存在する。また、各層の間にはニューロン同士の結合の強さを表す重みが存在する。ニューラルネットワークモデルの層が線形層(又は、全結合層)と呼ばれるものである場合、その層に存在するニューロンでは、以下の操作1~操作3が行われる。 Here, a neural network model is generally composed of multiple layers, and each layer has neurons (also called units, nodes, etc.). In addition, a weight representing the strength of connection between neurons exists between each layer. When the layer of the neural network model is called a linear layer (or a fully connected layer), the following operations 1 to 3 are performed in neurons existing in that layer.
 操作1:当該層に入力された入力ベクトル(活性化)の重み付き和を計算
 操作2:バイアス項を加算
 操作3:活性化関数(例えば、ReLu等)を計算
 このとき、ニューラルネットワークモデルの計算時間は、上記の操作1の重み付き和の計算に要する時間が支配的である。上記の操作1の重み付き和は、重みを表すベクトル(重みベクトル)をw=(w,w,・・・,wdim)、入力ベクトルをv=(v,v,・・・,vdim)とすると、vとwの内積、つまりw+・・・+wdimdimとなる。このため、大量の積演算w(k=1,・・・,dim)に要する時間を削減できれば、ニューラルネットワークモデルの計算時間を削減することが可能となる。なお、dimはベクトルの次元数である。
Operation 1: Calculate the weighted sum of the input vectors (activations) input to the layer Operation 2: Add the bias term Operation 3: Calculate the activation function (for example, ReLu, etc.) At this time, calculate the neural network model Time is dominated by the time required to compute the weighted sum in operation 1 above. The weighted sum of operation 1 above is obtained by w = (w 1 , w 2 , . , v dim ), the inner product of v and w, that is, w 1 v 1 + . . . +w dim v dim . Therefore, if the time required for a large number of multiplication operations w k v k (k=1, . . . , dim) can be reduced, the computation time of the neural network model can be reduced. Note that dim is the number of dimensions of the vector.
 例えば、参考文献1では、float値(32bit)同士の積演算wに比べて、2値と3値の積演算w(例えば、w∈{-1,1},v∈{-1,0,1})の方が理論的に40倍速く計算できることが示されている。なお、この例に限らず、重みと活性化の両方を低bit化した場合は操作1の積演算が高速化されることが期待できる。 For example, in Reference 1, compared to the product operation w k v k between float values (32 bits), the product operation w k v k between binary and ternary values (for example, w k ε{−1, 1}, v It has been shown that k ∈ {−1,0,1}) can be calculated theoretically 40 times faster. It should be noted that not only this example, but when both the weight and the activation are reduced in bits, it can be expected that the product calculation of the operation 1 will be speeded up.
 As described above, in the present embodiment the weights of the neural network model are quantized by a known quantization technique (in particular, quantization into a low-bit representation such as binarization or ternarization) and the activations are ternarized, so the computation time of the neural network model can be reduced.
  ≪学習時≫
 学習時における推論装置10の機能構成について、図3を参照しながら説明する。図3は、学習時における推論装置10の機能構成の一例を示す図である。
≪When learning≫
A functional configuration of the inference device 10 during learning will be described with reference to FIG. FIG. 3 is a diagram showing an example of the functional configuration of the inference device 10 during learning.
 図3に示すように、学習時における推論装置10は、3値化部201と、推論部202と、学習部203とを有する。これら各部は、学習時における推論装置10にインストールされた1以上のプログラムが、プロセッサ105に実行させる処理により実現される。 As shown in FIG. 3 , the inference device 10 during learning has a ternarization unit 201 , an inference unit 202 , and a learning unit 203 . Each of these units is realized by processing that one or more programs installed in the inference apparatus 10 at the time of learning cause the processor 105 to execute.
 3値化部201及び推論部202は、推論時と同様である。ただし、学習済みでない3値化パラメータ及び学習済みでないモデルパラメータをそれぞれ用いる点が推論時と異なる。なお、3値化パラメータはニューラルネットワークモデルの層毎に存在し、例えば、メモリ装置106等に格納されている。同様に、モデルパラメータはニューラルネットワークモデルの層毎に存在し、例えば、メモリ装置106等に格納されている。 The ternarization unit 201 and the inference unit 202 are the same as in inference. However, it differs from inference in that untrained ternarization parameters and untrained model parameters are used. Note that the ternarization parameter exists for each layer of the neural network model and is stored in the memory device 106 or the like, for example. Similarly, model parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106 or the like.
 学習部203は、3値化部201が活性化を3値化する際に用いる3値化パラメータの学習と、推論部202を実現するニューラルネットワークモデルのモデルパラメータの学習とを行う。 The learning unit 203 learns the ternarization parameters used when the ternarization unit 201 ternarizes the activation, and learns the model parameters of the neural network model that implements the inference unit 202 .
 [Example 1]
 Example 1 of the present embodiment will be described below. In this example, the activation is approximated by n ternary vectors. That is, for example, if the activation (real-valued vector) of a certain layer of the neural network model is X = (x_1, x_2, ..., x_dim) and the ternarization parameter of that layer is A = (a_1, a_2, ..., a_n), this real-valued vector is approximated by the ternarized vector a_1B_1 + a_2B_2 + ... + a_nB_n. Here, a_i (i = 1, ..., n) is a scalar value satisfying a_i > 0, and B_i (i = 1, ..., n) is a vector whose elements each take one of the values -1, 0, and 1.
 なお、nの値が大きい方が活性化をより高精度に近似することが可能であるが、より多くのメモリ量と計算量が必要となる。 It should be noted that the larger the value of n, the more accurately the activation can be approximated, but the larger the amount of memory and the amount of calculation required.
 <実施例1における推論処理>
 推論時における推論装置10が実行する推論処理について、図4を参照しながら説明する。図4は、実施例1における推論処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みであるものとする。
<Inference processing in the first embodiment>
The inference processing executed by the inference device 10 during inference will be described with reference to FIG. FIG. 4 is a flowchart illustrating an example of inference processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
 ここで、図4のステップS101~ステップS106はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS101~ステップS106について説明する。 Here, steps S101 to S106 in FIG. 4 are repeatedly executed for each layer of the neural network model. Steps S101 to S106 for a certain layer of the neural network model will be described below.
 Step S101: The inference unit 202 receives the real-valued vector given to the neural network model or the real-valued vector output by the previous layer. Specifically, if the layer is the first layer (input layer), the inference unit 202 receives the real-valued vector given to the neural network model; otherwise, it receives the real-valued vector output by the previous layer. Note that the real-valued vector given to the neural network model is the inference target data of the task.
 Step S102: The ternarization unit 201 uses the learned ternarization parameters to ternarize the real-valued vector (that is, the activation) input in step S101 and creates a ternarized vector. At this time, the ternarization unit 201 creates the ternarized vector by procedures 1-1 to 1-3 below. In the following, the real-valued vector input in step S101 is X = (x_1, x_2, ..., x_dim), and the learned ternarization parameter of the layer is A = (a_1, a_2, ..., a_n). It is also assumed that a_i > 0 for each i = 1, ..., n.
 Procedure 1-1) First, the ternarization unit 201 sets X_0 = X and uses the following function for quantization:

Figure JPOXMLDOC01-appb-M000001

so that Q_a(X) = (Q_a(x_1), Q_a(x_2), ..., Q_a(x_dim)), where a is a scalar value.
 Then, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000002

and takes a_1B_1 as the first ternary vector. The ternarization unit 201 also sets X_1 = X_0 - a_1B_1. Here, X_1 can be regarded as the error that could not be approximated by the first ternary vector a_1B_1.
 Procedure 1-2) Next, the ternarization unit 201 recursively computes the i-th ternary vector a_iB_i (i ≥ 2). That is, when the ternary vectors up to a_1B_1, ..., a_{i-1}B_{i-1} have been obtained, the ternarization unit 201 sets X_{i-1} = X_{i-2} - a_{i-1}B_{i-1} and then sets

Figure JPOXMLDOC01-appb-M000003

The i-th ternary vector is then a_iB_i. In this way, n ternary vectors a_1B_1, ..., a_nB_n are obtained.
 Procedure 1-3) Finally, the ternarization unit 201 takes a_1B_1 + ... + a_nB_n as the ternarized vector. As a result, a ternarized vector a_1B_1 + ... + a_nB_n that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of n ternary vectors is obtained.
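 A minimal sketch of procedures 1-1 to 1-3 at inference time is shown below. The exact elementwise quantizer Q_a is given by the math formulas above and is not reproduced in the text, so the sketch assumes the common choice of rounding each element to the nearest point of {-a, 0, +a}; all names are illustrative.

```python
import numpy as np

def quantize_ternary(x, a):
    """Assumed form of Q_a: map each element of x to the nearest point in {-a, 0, +a}."""
    b = np.clip(np.round(x / a), -1, 1)   # ternary pattern B with entries in {-1, 0, +1}
    return a * b                          # a * B

def ternarize_inference(X, scales):
    """Approximate the activation X by a_1*B_1 + ... + a_n*B_n using the learned scales (a_1, ..., a_n)."""
    residual = np.asarray(X, dtype=float)
    approx = np.zeros_like(residual)
    for a in scales:                      # procedures 1-1 and 1-2: one ternary term per learned scale
        term = quantize_ternary(residual, a)
        approx += term
        residual -= term                  # error not captured by the i-th ternary vector
    return approx                         # procedure 1-3: the ternarized vector
```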
 Step S103: For each neuron included in the layer, the inference unit 202 uses the learned model parameters to calculate the weighted sum of the ternarized vector created in step S102. That is, for example, if the ternarized vector a_1B_1 + ... + a_nB_n is written as (v_1, v_2, ..., v_dim) and the weight vector between a neuron in the layer and the neurons in the previous layer is (w_1, w_2, ..., w_dim), the inference unit 202 calculates w_1v_1 + ... + w_dimv_dim. Note that the weights are included in the learned model parameters.
 ステップS104:推論部202は、当該層に含まれるニューロン毎に、学習済みモデルパラメータを用いて、上記のステップS103で計算された重み付き和に対してバイアス項を加算する。すなわち、例えば、当該ニューロンのバイアス項をbとすれば、推論部202は、w+・・・+wdimdim+bを計算する。なお、バイアス項は学習済みモデルパラメータに含まれる。 Step S104: The inference unit 202 adds a bias term to the weighted sum calculated in step S103 above using the learned model parameters for each neuron included in the layer. That is, for example, if the bias term of the neuron is b, the inference unit 202 calculates w 1 v 1 + . . . +w dim v dim +b. Note that the bias term is included in the learned model parameters.
 ステップS105:推論部202は、当該層に含まれるニューロン毎に、上記のステップS104の計算結果を用いて活性化関数を計算する。すなわち、例えば、当該ニューロンの活性化関数をσ(・)とすれば、推論部202は、σ(w+・・・+wdimdim+b)を計算する。 Step S105: The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S104. That is, for example, if the activation function of the neuron is σ(·), the inference unit 202 calculates σ(w 1 v 1 + . . . +w dim v dim +b).
 ステップS106:推論部202は、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。すなわち、例えば、j番目のニューロンの活性化関数値をx'とすれば、推論部202は、実数値ベクトルX'=(x',x',・・・,x'dim')を次の層に出力する。なお、dim'は次の層に出力される実数値ベクトルの次元数であり、当該層に含まれるニューロン数である。 Step S106: The inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer. That is, for example, if the activation function value of the j -th neuron is x'j, the inference unit 202 generates a real-valued vector X'=(x' 1 , x' 2 , . . . , x'dim' ) to the next layer. Note that dim' is the number of dimensions of the real-valued vector output to the next layer, and is the number of neurons included in the layer.
 以上のステップS101~ステップS106が各層毎に繰り返し実行され、最終層の出力ベクトル(又は、当該出力ベクトルに対して、タスクに応じた所定の処理を行った結果)が推論結果となる。 The above steps S101 to S106 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) is the inference result.
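 Putting steps S101 to S106 together, the per-layer inference loop can be sketched as follows, reusing the ternarize_inference helper sketched above; the layer data structure and the use of ReLU are assumptions for illustration.

```python
import numpy as np

def forward_pass(x, layers):
    """Steps S101-S106 repeated for each layer (sketch).

    layers: list of dicts with keys 'scales' (learned ternarization parameters of the layer),
    'W' (weights), and 'b' (bias) -- an illustrative structure, not the patent's data format.
    """
    v = np.asarray(x, dtype=float)
    for layer in layers:
        v3 = ternarize_inference(v, layer['scales'])  # step S102: ternarize the activation
        z = layer['W'] @ v3 + layer['b']              # steps S103-S104: weighted sum + bias
        v = np.maximum(z, 0.0)                        # step S105: activation function (ReLU assumed)
    return v                                          # final-layer output used for the inference result
```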
 <実施例1における学習処理>
 学習時における推論装置10が実行する学習処理について、図5を参照しながら説明する。図5は、実施例1における学習処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みでないものとする。
<Learning processing in the first embodiment>
A learning process executed by the inference device 10 during learning will be described with reference to FIG. FIG. 5 is a flowchart illustrating an example of learning processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
 ここで、図5のステップS201~ステップS207はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS201~ステップS207について説明する。 Here, steps S201 to S207 in FIG. 5 are repeatedly executed for each layer of the neural network model. Steps S201 to S207 for a certain layer of the neural network model will be described below.
 ステップS201:推論部202は、図4のステップS101と同様に、ニューラルネットワークモデルに与えられた実数値ベクトル又は1つ前の層の出力ベクトルである実数値ベクトルを入力する。なお、ニューラルネットワークモデルに与えられた実数値ベクトルとは、タスクの学習用データのことである。 Step S201: The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
 ステップS202:3値化部201は、3値化パラメータを用いて、上記のステップS201で入力された実数値ベクトル(つまり、活性化)を3値化し、3値化ベクトルを作成する。このとき、3値化部201は、以下の手順2-1~手順2-3により3値化ベクトルを作成する。なお、上記のステップS201で入力された実数値ベクトルをX=(x,x,・・・,xdim)とする。また、当該層の3値化パラメータをA=(a',a',・・・,a')とする。 Step S202: Using the ternarization parameter, the ternarization unit 201 ternarizes the real-valued vector (that is, activation) input in step S201 to create a ternarized vector. At this time, the ternarization unit 201 creates a ternarization vector according to procedures 2-1 to 2-3 below. Let X=(x 1 , x 2 , . . . , x dim ) be the real-value vector input in step S201. Also, let A=(a' 1 , a' 2 , . . . , a' n ) be the ternarization parameter of the layer.
 Procedure 2-1) First, the ternarization unit 201 sets X_0 = X and obtains a scalar value a and a ternary vector B that minimize |X_0 - aB|^2. Here, B is a vector whose elements each take one of the values -1, 0, and 1, and |·| is the L2 norm. However, |·| may be a distance other than the L2 norm.
 このとき、aとBは相互作用しており、厳密解を求めるのことは困難である(例えば、参考文献2等を参照)。そこで、本実施例では、参考文献3に記載されているalgorithm2を参考にニュートン法を用いて近似解を求める。これは、a又はBのいずれか一方が固定されたとき、|X-aB|を最小化する他方の値が計算可能であることを利用している。ただし、ニュートン法以外の他の手法により近似解を求めてもよい。 At this time, a and B interact with each other, and it is difficult to obtain an exact solution (for example, see Reference 2, etc.). Therefore, in this embodiment, an approximate solution is obtained using Newton's method with reference to algorithm 2 described in Reference 3. This takes advantage of the fact that when either a or B is fixed, the other value that minimizes |X 0 −aB| 2 can be calculated. However, an approximate solution may be obtained by a technique other than Newton's method.
 そして、3値化部201は、このようして求めたスカラー値aをaとすると共に3値ベクトルBをBとして、1つ目の3値ベクトルをaとする。また、3値化部201は、X=X-aとする。このとき、Xは、1つ目の3値ベクトルaで近似できなかった誤差と見做せる。 Then, the ternarization unit 201 sets the scalar value a obtained in this manner to a1, sets the ternary vector B to B1 , and sets the first ternary vector to a1B1 . Also, the ternarization unit 201 sets X 1 =X 0 -a 1 B 1 . At this time, X1 can be regarded as an error that could not be approximated by the first ternary vector a1B1 .
 Procedure 2-2) Next, the ternarization unit 201 recursively computes the i-th ternary vector a_iB_i (i ≥ 2). That is, when the ternary vectors up to a_1B_1, ..., a_{i-1}B_{i-1} have been obtained, the ternarization unit 201 sets X_{i-1} = X_{i-2} - a_{i-1}B_{i-1} and then obtains a scalar value a and a ternary vector B that minimize |X_{i-1} - aB|^2. An approximate solution can be obtained by the same technique as in procedure 2-1.
 そして、3値化部201は、このようにして求めたスカラー値aをaとすると共に3値ベクトルBをBとする。これにより、n個の3値ベクトルa,・・・,aが得られる。 Then, the ternarization unit 201 sets the scalar value a obtained in this manner to a i and the ternary vector B to B i . As a result, n ternary vectors a 1 B 1 , . . . , an B n are obtained.
 Procedure 2-3) Finally, the ternarization unit 201 takes a_1B_1 + ... + a_nB_n as the ternarized vector. As a result, a ternarized vector a_1B_1 + ... + a_nB_n that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of n ternary vectors is obtained.
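 Procedures 2-1 to 2-3 can be sketched as below. The patent refers to a Newton-method variant of algorithm 2 in Reference 3; since that algorithm is not reproduced here, the sketch uses a simple alternating scheme that exploits the same fact (with one of a or B fixed, the other minimizer is available in closed form). Names, the initial scale, and the number of iterations are assumptions.

```python
import numpy as np

def fit_ternary_term(X, num_iters=5):
    """Approximately minimize |X - a*B|^2 over a scalar a > 0 and a ternary vector B."""
    X = np.asarray(X, dtype=float)
    a = float(np.mean(np.abs(X))) + 1e-12          # simple initial scale (assumption)
    B = np.zeros_like(X)
    for _ in range(num_iters):
        B = np.clip(np.round(X / a), -1, 1)        # optimal B for fixed a (elementwise)
        denom = float(np.sum(B * B))
        if denom == 0.0:
            break
        a = float(np.dot(X, B)) / denom            # optimal a for fixed B (least squares)
    return a, B

def ternarize_training(X, n):
    """Procedures 2-1 to 2-3: greedy residual decomposition X ≈ a_1*B_1 + ... + a_n*B_n."""
    residual = np.asarray(X, dtype=float)
    scales = []
    approx = np.zeros_like(residual)
    for _ in range(n):
        a, B = fit_ternary_term(residual)
        scales.append(a)
        approx += a * B
        residual = residual - a * B                # X_i = X_{i-1} - a_i*B_i
    return scales, approx
```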
 Step S203: The learning unit 203 uses the scalar values a_1, ..., a_n obtained when the ternarized vector a_1B_1 + ... + a_nB_n was created in step S202 to update the ternarization parameter A = (a'_1, a'_2, ..., a'_n) of the layer.
 具体的には、学習部203は、各i=1,・・・,nに対して、a'=(1-γ)a'+γaによりa'を更新する。ここで、γは0<γ<1を満たすパラメータである。すなわち、学習部203は、移動平均によりa'を更新する。ただし、これは一例であって、他の方法により各層の3値化パラメータを更新してもよい。 Specifically, the learning unit 203 updates a' i by a' i =(1−γ)a' i +γa i for each i=1, . . . , n. Here, γ is a parameter that satisfies 0<γ<1. That is, the learning unit 203 updates a' i by moving average. However, this is only an example, and the ternarization parameter of each layer may be updated by other methods.
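 A sketch of this moving-average update (the value of gamma is an assumption):

```python
def update_ternarization_params(A_prime, A_batch, gamma=0.1):
    """a'_i <- (1 - gamma) * a'_i + gamma * a_i for each i = 1, ..., n."""
    return [(1.0 - gamma) * ap + gamma * a for ap, a in zip(A_prime, A_batch)]
```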
 Step S204: As in step S103 of FIG. 4, the inference unit 202 uses the model parameters to calculate, for each neuron included in the layer, the weighted sum of the ternarized vector created in step S202.
 Step S205: As in step S104 of FIG. 4, the inference unit 202 uses the model parameters to add, for each neuron included in the layer, the bias term to the weighted sum calculated in step S204.
 ステップS206:推論部202は、図4のステップS105と同様に、上記のステップS205の計算結果を用いて活性化関数を計算する。 Step S206: The inference unit 202 calculates an activation function using the calculation result of step S205 above, as in step S105 of FIG.
 ステップS207:推論部202は、図4のステップS106と同様に、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。 Step S207: As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
 ステップS208:上記のステップS201~ステップS207が最終層まで実行された場合、学習部203は、モデルパラメータを更新する。すなわち、学習部203は、既知の誤差逆伝播法によりloss関数の微分を計算し、その微分値を用いてモデルパラメータを更新する。 Step S208: When the above steps S201 to S207 are executed up to the final layer, the learning unit 203 updates the model parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters.
 Here, let Q be the function that ternarizes (quantizes) the activation X in step S202 (hereinafter also referred to as the "quantization function"). That is, Q(X) = a_1B_1 + ... + a_nB_n. The derivative of the quantization function Q is required in order to compute the derivative of the loss function by error backpropagation, but by its nature the derivative of Q is always 0, which would make it impossible to backpropagate the error. Therefore, in this example, a pseudo derivative of the quantization function Q is given by the STE (straight-through estimator) technique described in Reference 4. STE is one of the basic techniques used in quantization-aware learning.
 Specifically, as the pseudo derivative of the quantization function Q, for example,

Figure JPOXMLDOC01-appb-M000004

is given.
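 The exact pseudo derivative is the one shown in the formula above; a common way to realize an STE in practice (an illustration, not the patent's exact definition) is to pass the gradient straight through the quantizer, for example with the PyTorch detach trick:

```python
import torch

def ternarize_with_ste(x, quantize):
    """Forward pass uses the quantized value Q(x); backward pass treats dQ/dx as 1.

    `quantize` is any (non-differentiable) ternarization function operating on tensors.
    """
    q = quantize(x)
    return x + (q - x).detach()   # equals q in the forward pass, gradient 1 w.r.t. x in the backward pass
```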
 なお、モデルパラメータの更新は、学習用データの集合であるバッチ毎に行われてもよい。 Note that model parameters may be updated for each batch, which is a set of learning data.
 以上のステップS201~ステップS208が実行されることで、3値化パラメータとモデルパラメータが学習される。 By executing the above steps S201 to S208, the ternarization parameters and model parameters are learned.
 <適用例及び評価>
 以下では、本実施例を言語モデルに適用する場合の適用例とその評価について説明する。本適用例では、非特許文献1に記載されているTernaryBERTに対して本実施例を適用する。
<Application example and evaluation>
An application example in which this example is applied to a language model, and its evaluation, are described below. In this application example, this example is applied to TernaryBERT described in Non-Patent Document 1.
 TernaryBERTはBERTモデルの重みを3値化、活性化を8bit化しており、本実施例を適用して活性化も3値化することで高速化が期待できる。また、後述するように、従来手法と比較して精度が向上することも確認できた。 With TernaryBERT, the weights of the BERT model are ternary and the activation is 8-bit. By applying this embodiment and ternary activation, speedup can be expected. In addition, as will be described later, it was also confirmed that the accuracy was improved compared to the conventional method.
 ・Configuration of Transformer language models represented by BERT
 A Transformer language model represented by BERT consists of an embedding layer and L Transformer encoder blocks (L = 12 in this application example). The parameters of BERT are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V for query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th block (l is a lowercase L); the output linear transformation matrix W_l^O applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
 ・Weights ternarized in TernaryBERT
 TernaryBERT ternarizes the weights of one embedding layer whose weight matrix is W_e (the embedding matrices for segment and position are excluded) and of the six linear layers whose weight matrices are {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}.
 ・Re-implementation of TernaryBERT in this application example
 The model code of TernaryBERT is not publicly available. Therefore, in this application example, a re-implementation was made based on the code of the RoBERTa model provided by huggingface (see, for example, Reference 5). However, while TernaryBERT applies a special ternarization to the embedding matrix W_e that prepares a quantization coefficient for each dimension, W_e is not quantized in this application example for simplicity. That is, in this application example, the weights of the six linear layers {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2} are ternarized according to Example 1.
 ・Pre-training
 Each model was pre-trained using the Wikitext103 dataset (see, for example, Reference 6). The models to be trained were the non-quantized RoBERTa model (see, for example, Reference 7), the re-implemented TernaryBERT model (8-bit activation), a model that ternarizes the activation with the TernaryBERT technique unchanged, and models in which this example is applied to the re-implemented TernaryBERT with n = 1 and n = 2, respectively. Hereinafter, these models are referred to as "RoBERTa", "TernaryBERT (8-bit activation)", "TernaryBERT (ternary activation)", "this example (n=1)", and "this example (n=2)". Note that n is the number of ternary vectors used when approximating the activation with a ternarized vector.
 学習条件として、事前学習では、バッチサイズを64、エポック数を3とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, in the pre-learning, the batch size was 64, the number of epochs was 3, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 TernaryBERT performs distillation learning with a real-valued model as the teacher model (see, for example, Non-Patent Document 1). Therefore, no pre-training was performed for TernaryBERT itself; instead, the huggingface RoBERTa model pre-trained on the Wikitext103 dataset was used as the teacher model, and distillation learning was performed using the same distillation loss as in Non-Patent Document 1.
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表1に示す。 Table 1 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000005

 Here, 2* indicates that the activation is expressed as a sum of 2-bit values. Also, ppl is word perplexity. Qualitatively, ppl is a common evaluation metric corresponding to the reciprocal of the language model's "confidence in word prediction", and lower is better. In pre-training, the task is to mask some words in a sentence and predict the masked words from the surrounding words; the higher the confidence of the predicted words, the lower the ppl, and the lower the confidence, the higher the ppl.
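 For reference, the standard definition of word perplexity (the patent itself does not spell out the formula, so this is an assumption) is the exponential of the average negative log-likelihood over the N predicted tokens:

```latex
\mathrm{ppl} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N} \log p\,(w_t \mid \mathrm{context}_t)\right)
```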
 Comparing the cases where both the weights and the activations are ternary, the ppl is 1904 for TernaryBERT (ternary activation) and 26.93 for this example (n=1), showing that the method of this example is superior to the activation bit-reduction technique of TernaryBERT.
 また、活性化を近似する3値ベクトルを2つに増やした本実施例(n=2)ではpplが9.42であり、nを増やすことで精度低下を大幅に抑えることができている。これは、活性化を8bit化したTernaryBERT(8bit活性化)には及ばないものの、活性化を8bit未満にしつつ、今まで困難であった言語モデルの精度維持に成功しているといえる。更に、本実施例(n=2)では、重みと活性化の両方が3値化されているため、TernaryBERTよりも高速化が期待できる。 In addition, in this example (n=2) in which the number of ternary vectors approximating the activation is increased to two, the ppl is 9.42, and the decrease in accuracy can be greatly suppressed by increasing n. Although this is not as good as TernaryBERT (8-bit activation) with 8-bit activation, it can be said that it has succeeded in maintaining the accuracy of the language model, which has been difficult until now, while keeping the activation at less than 8-bit. Furthermore, in this embodiment (n=2), both weights and activations are ternarized, so higher speed than TernaryBERT can be expected.
 ・ダウンストリームタスクでの評価
 事前学習で学習した各モデルをGLUEデータセット(例えば、参考文献8等を参照)のQQPタスク及びSST-2タスクでファインチューニングした。
• Evaluation on downstream tasks Each model trained in pretraining was fine-tuned on the QQP task and SST-2 task on the GLUE dataset (see, for example, reference 8).
 評価対象はRoBERTa、TernaryBERT(8bit活性化)、TernaryBERT(3値活性化)、本実施例(n=2)とした。 The evaluation targets were RoBERTa, TernaryBERT (8-bit activation), TernaryBERT (three-level activation), and this example (n=2).
 学習条件として、各モデルの初期値は事前学習で学習したものとし、バッチサイズを16、エポック数を3とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, the initial value of each model was learned by prior learning, the batch size was 16, the number of epochs was 3, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 TernaryBERTでは実数値モデルを教師モデルとした蒸留学習を行う(例えば、非特許文献1等を参照)。そこで、TernaryBERTではファインチューニングを行わず、hugginfaceのRoBERTaモデルをGLUEデータセットの各タスクでファインチューニングしたものを教師モデルとして、非特許文献1と同様の蒸留lossを用いて蒸留学習を行った。 In TernaryBERT, distillation learning is performed using a real-valued model as a teacher model (for example, see Non-Patent Document 1, etc.). Therefore, fine-tuning was not performed in TernaryBERT, and distillation learning was performed using distilled loss similar to Non-Patent Document 1, using the RoBERTa model of hugginface fine-tuned in each task of the GLUE dataset as a teacher model.
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表2に示す。 Table 2 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000006

 Here, F1 was used as the evaluation metric for the QQP task and accuracy for the SST-2 task.
 With TernaryBERT (ternary activation), a language model could not be acquired in pre-training and learning also failed on each task, whereas with this example (n=2) the accuracy drop on each task is greatly suppressed.
 In addition, this example (n=2) stays within roughly a 5-7% accuracy drop even compared with the real-valued RoBERTa, so it can be said that the accuracy of the language model, which has been difficult to maintain until now, is successfully maintained while keeping the activation below 8 bits. Furthermore, although this example (n=2) is inferior in accuracy to TernaryBERT (8-bit activation), both the weights and the activations are ternarized, so it can be expected to be faster than TernaryBERT (8-bit activation).
 [Example 2]
 Example 2 of the present embodiment will be described below. In this example, the activation is approximated by the sum of two ternary vectors. That is, for example, if the activation (real-valued vector) of a certain layer of the neural network model is X = (x_1, x_2, ..., x_dim) and the ternarization parameter of that layer is S = (s_1, s_2), this real-valued vector is approximated by the ternarized vector s_1B_1 + s_2B_2. Here, s_1 and s_2 are scalar values satisfying s_1 > 2s_2 > 0, and B_i (i = 1, 2) is a vector whose elements each take one of the values -1, 0, and 1.
 なお、本実施例では、実施例1でn=2とした場合と比較して、より高精度に活性化を近似することが可能となる。 It should be noted that, in this embodiment, activation can be approximated with higher accuracy than in the case of n=2 in the first embodiment.
 <実施例2における推論処理>
 推論時における推論装置10が実行する推論処理について、図6を参照しながら説明する。図6は、実施例2における推論処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みであるものとする。
<Inference processing in the second embodiment>
The inference processing executed by the inference device 10 during inference will be described with reference to FIG. FIG. 6 is a flowchart illustrating an example of inference processing according to the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
 ここで、図6のステップS301~ステップS306はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS301~ステップS306について説明する。 Here, steps S301 to S306 in FIG. 6 are repeatedly executed for each layer of the neural network model. Steps S301 to S306 for a certain layer of the neural network model will be described below.
 ステップS301:推論部202は、図4のステップS101と同様に、ニューラルネットワークモデルに与えられた実数値ベクトル又は1つ前の層の出力ベクトルである実数値ベクトルを入力する。なお、ニューラルネットワークモデルに与えられた実数値ベクトルとは、タスクの推論対象データのことである。 Step S301: The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is the inference target data of the task.
 Step S302: The ternarization unit 201 uses the learned ternarization parameters to ternarize the real-valued vector (that is, the activation) input in step S301 and creates a ternarized vector. At this time, the ternarization unit 201 creates the ternarized vector by procedures 3-1 to 3-2 below. In the following, the real-valued vector input in step S301 is X = (x_1, x_2, ..., x_dim), and the learned ternarization parameter of the layer is S = (s_1, s_2). Also, s_1 and s_2 are scalar values satisfying s_1 > 2s_2 > 0.
 Procedure 3-1) First, the ternarization unit 201 uses the following function for quantization:

Figure JPOXMLDOC01-appb-M000007

and sets

Figure JPOXMLDOC01-appb-M000008

Next, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000009

and takes s_1B_1 as the first ternary vector. Further, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000010

and takes s_2B_2 as the second ternary vector.
 Procedure 3-2) Then, the ternarization unit 201 takes s_1B_1 + s_2B_2 as the ternarized vector. As a result, a ternarized vector s_1B_1 + s_2B_2 that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of two ternary vectors is obtained.
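 A minimal sketch of procedures 3-1 and 3-2 is shown below. The exact quantization functions are given by the math formulas above and are not reproduced in the text, so nearest-point rounding onto {-s, 0, +s} is assumed here for illustration; all names are assumptions.

```python
import numpy as np

def ternary_pattern(x, s):
    """Assumed quantizer: nearest ternary pattern in {-1, 0, +1} at scale s."""
    return np.clip(np.round(np.asarray(x, dtype=float) / s), -1, 1)

def ternarize_two_scales(X, s1, s2):
    """X ≈ s1*B1 + s2*B2 with learned scales satisfying s1 > 2*s2 > 0."""
    X = np.asarray(X, dtype=float)
    B1 = ternary_pattern(X, s1)                # coarse ternary pattern
    B2 = ternary_pattern(X - s1 * B1, s2)      # fine adjustment of each coarse level by s2
    return s1 * B1 + s2 * B2
```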
 Step S303: As in step S103 of FIG. 4, the inference unit 202 uses the learned model parameters to calculate, for each neuron included in the layer, the weighted sum of the ternarized vector created in step S302.
 Step S304: As in step S104 of FIG. 4, the inference unit 202 uses the learned model parameters to add, for each neuron included in the layer, the bias term to the weighted sum calculated in step S303.
 ステップS305:推論部202は、図4のステップS105と同様に、当該層に含まれるニューロン毎に、上記のステップS304の計算結果を用いて活性化関数を計算する。 Step S305: The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S304 above, as in step S105 of FIG.
 ステップS306:推論部202は、図4のステップS106と同様に、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。 Step S306: As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
 以上のステップS301~ステップS306が各層毎に繰り返し実行され、最終層の出力ベクトル(又は、当該出力ベクトルに対して、タスクに応じた所定の処理を行った結果)が推論結果となる。 The above steps S301 to S306 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) becomes the inference result.
 Here, this example is compared with a conventional technique called LSQ quantization (see, for example, Reference 9). In LSQ quantization, the activation is expressed by a single ternary vector. That is, written in a form comparable to this example, LSQ quantization expresses the activation X as s_1B_1.
 For example, if the ternarized vector expressing the activation X is

Figure JPOXMLDOC01-appb-M000011

then the i-th element of the ternarized vector of this example and the i-th element of the ternarized vector of LSQ quantization are as shown in FIG. 7. In FIG. 7, the vertical axis is the i-th element of the ternarized vector and the horizontal axis is the i-th element of the activation X.
 As shown in FIG. 7, in this example each of the three values given by s_1 is finely adjusted by s_2, which suppresses the accuracy loss more than LSQ quantization does. As described later, the learning process is given a pseudo gradient that can learn a ternarization parameter S = (s_1, s_2) such that each of the ternary values given by s_1 is finely adjusted by s_2.
 <実施例2における学習処理>
 学習時における推論装置10が実行する学習処理について、図8を参照しながら説明する。図8は、実施例2における学習処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みでないものとする。
<Learning processing in the second embodiment>
The learning process executed by the inference device 10 during learning will be described with reference to FIG. FIG. 8 is a flowchart illustrating an example of learning processing in the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
 ここで、図8のステップS402~ステップS407はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS402~ステップS407について説明する。なお、ステップS401の初期化は1回のみ実行される。例えば、モデルパラメータと3値化パラメータの更新がバッチ毎に繰り返される場合、ステップS401の初期化は1バッチ目のみ実行され、2バッチ目以降では実行されない。 Here, steps S402 to S407 in FIG. 8 are repeatedly executed for each layer of the neural network model. Steps S402 to S407 relating to a certain layer of the neural network model will be described below. Note that the initialization in step S401 is executed only once. For example, when model parameters and ternarization parameters are repeatedly updated for each batch, the initialization in step S401 is performed only for the first batch, and not performed for the second and subsequent batches.
 ステップS401:学習部203は、ニューラルネットワークモデルに最初に与えられた学習用データを用いて、3値化パラメータを初期化する。具体的には、学習部203は、以下の手順4-1~手順4-3により3値化パラメータを初期化する。なお、ニューラルネットワークモデルに最初に与えられた学習用データが表す実数値ベクトルをXとする。 Step S401: The learning unit 203 initializes the ternarization parameter using the learning data first given to the neural network model. Specifically, the learning unit 203 initializes the ternarization parameter according to procedures 4-1 to 4-3 below. Let X be the real-valued vector represented by the learning data first given to the neural network model.
 手順4-1)まず、学習部203は、|X-sB|を最小化するようなスカラー値sと3値ベクトルBを求め、s=s,B=Bとする。ここで、Bは各要素が-1,0,1のいずれかを取るベクトルである。 Procedure 4-1) First, the learning unit 203 obtains a scalar value s and a ternary vector B that minimize |X−sB| 2 , and sets s 1 =s and B 1 =B. Here, B is a vector whose elements are either -1, 0, or 1.
 Procedure 4-2) Next, the learning unit 203 obtains a scalar value s and a ternary vector B that minimize |X - s_1B_1 - sB|^2, and sets s_2 = s and B_2 = B.
 手順4-3)そして、学習部203は、上記の手順4-1~手順4-2で求めた(s,s)を3値化パラメータSの初期値とする。これにより、X-(s+s)のL2ノルムを小さくような初期値を得ることができるため、近似誤差の小さい初期値で学習を開始させることができるようになる。 Procedure 4-3) Then, the learning unit 203 sets (s 1 , s 2 ) obtained in the procedures 4-1 to 4-2 as the initial values of the ternarization parameter S. As a result, it is possible to obtain an initial value that reduces the L2 norm of X−(s 1 B 1 +s 2 B 2 ), so that learning can be started with an initial value with a small approximation error.
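 A sketch of this initialization, reusing the fit_ternary_term helper sketched for Example 1 (that alternating scheme is a stand-in assumption for however |X - sB|^2 is actually minimized):

```python
def init_ternarization_params(X):
    """Procedures 4-1 to 4-3: initialize S = (s1, s2) by two successive residual fits."""
    s1, B1 = fit_ternary_term(X)             # procedure 4-1: fit s1*B1 to X
    s2, _ = fit_ternary_term(X - s1 * B1)    # procedure 4-2: fit s2*B2 to the residual
    return s1, s2                            # procedure 4-3: initial values of S
```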
 ステップS402:推論部202は、図4のステップS101と同様に、ニューラルネットワークモデルに与えられた実数値ベクトル又は1つ前の層の出力ベクトルである実数値ベクトルを入力する。なお、ニューラルネットワークモデルに与えられた実数値ベクトルとは、タスクの学習用データのことである。 Step S402: The inference unit 202 inputs the real-valued vector given to the neural network model or the real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
 Step S403: The ternarization unit 201 uses the ternarization parameter S = (s_1, s_2) to ternarize the real-valued vector (that is, the activation) input in step S402 and creates a ternarized vector. At this time, the ternarization unit 201 creates the ternarized vector by procedures 5-1 to 5-2 below. In the following, the real-valued vector input in step S402 is X = (x_1, x_2, ..., x_dim).
 Procedure 5-1) First, the ternarization unit 201 uses the following function for quantization:
Figure JPOXMLDOC01-appb-M000012

and sets

Figure JPOXMLDOC01-appb-M000013

Next, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000014

and takes s_1B_1 as the first ternary vector. Further, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000015

and takes s_2B_2 as the second ternary vector.
 Procedure 5-2) Then, the ternarization unit 201 takes s_1B_1 + s_2B_2 as the ternarized vector. As a result, a ternarized vector s_1B_1 + s_2B_2 that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of two ternary vectors is obtained.
 Step S404: As in step S103 of FIG. 4, the inference unit 202 uses the model parameters to calculate, for each neuron included in the layer, the weighted sum of the ternarized vector created in step S403.
 Step S405: As in step S104 of FIG. 4, the inference unit 202 uses the model parameters to add, for each neuron included in the layer, the bias term to the weighted sum calculated in step S404.
 ステップS406:推論部202は、図4のステップS105と同様に、上記のステップS405の計算結果を用いて活性化関数を計算する。 Step S406: The inference unit 202 calculates an activation function using the calculation result of step S405 above, as in step S105 of FIG.
 ステップS407:推論部202は、図4のステップS106と同様に、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。 Step S407: As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
 ステップS408:上記のステップS402~ステップS407が最終層まで実行された場合、学習部203は、モデルパラメータと3値化パラメータを更新する。すなわち、学習部203は、既知の誤差逆伝播法によりloss関数の微分を計算し、その微分値を用いてモデルパラメータと3値化パラメータを更新する。 Step S408: When the above steps S402 to S407 are executed up to the final layer, the learning unit 203 updates the model parameters and the ternarization parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters and the ternarization parameters.
 Here, let Q be the quantization function that ternarizes (quantizes) the activation X in step S403. That is, Q(X) = s_1B_1 + s_2B_2. As described above, the derivative of the quantization function Q is required in order to compute the derivative of the loss function by error backpropagation, but by its nature the derivative of Q is always 0, which would make it impossible to backpropagate the error. Therefore, in this example, a pseudo derivative of the quantization function Q is given.
 First, as the pseudo derivative of the quantization function Q required for learning (updating) the model parameters, for example,

Figure JPOXMLDOC01-appb-M000016

is given. As in Example 1, this is given by the STE technique described in Reference 4.
 Next, the pseudo derivative of the quantization function Q required for learning (updating) the ternarization parameter S = (s_1, s_2),

Figure JPOXMLDOC01-appb-M000017

is given. In this example, the pseudo gradient of LSQ quantization (see, for example, Reference 9), which expresses the activation with a single ternary vector, is taken as a reference and extended to the case where the activation is expressed by the sum of two ternary vectors.
 For ease of description, define

Figure JPOXMLDOC01-appb-M000018

Then, as the pseudo derivative of the quantization function Q required for learning the ternarization parameter S = (s_1, s_2),

Figure JPOXMLDOC01-appb-M000019

is given, where

Figure JPOXMLDOC01-appb-M000020

Figure JPOXMLDOC01-appb-M000021

Note that the value of the k-th dimension of the quantized activation is determined by the value of the k-th dimension of the activation. In addition, round(x) is a function that rounds x to the nearest integer, and Sign(x) is a sign function that returns 1 if 0 ≤ x and -1 if x < 0.
 Here, the pseudo gradient of LSQ quantization is compared with the pseudo gradient given in this example (the pseudo derivative of the quantization function Q required for learning the ternarization parameter S = (s_1, s_2)). Both are shown in FIG. 9. In FIG. 9, the vertical axis is the pseudo derivative of the k-th element of the ternarized vector with respect to the ternarization parameter, and the horizontal axis is the k-th element of the ternarized vector.
 As shown in FIG. 9, in this example each of the three values given by s_1 is finely adjusted by s_2. That is, near {s_1, 0, -s_1},

Figure JPOXMLDOC01-appb-M000022

is set to 0, and instead

Figure JPOXMLDOC01-appb-M000023

is used to finely adjust s_2.
 なお、モデルパラメータと3値化パラメータの更新は、学習用データの集合であるバッチ毎に行われてもよい。 Note that the model parameters and the ternarization parameters may be updated for each batch, which is a set of learning data.
 以上のステップS401~ステップS408が実行されることで、3値化パラメータとモデルパラメータが学習される。 By executing the above steps S401 to S408, the ternarization parameters and model parameters are learned.
 <適用例及び評価>
 以下では、本実施例を言語モデルに適用する場合の適用例とその評価について説明する。本適用例では、非特許文献2に記載されているKDLSQ-BERTに対して本実施例を適用する。
<Application example and evaluation>
An application example and its evaluation when this embodiment is applied to a language model will be described below. In this application example, this embodiment is applied to KDLSQ-BERT described in Non-Patent Document 2.
 KDLSQ-BERTはBERTモデルの重みを3値化、活性化を8bit化しており、本実施例を適用して活性化も3値化することで高速化が期待できる。また、KDLSQ-BERTでは活性化の低bit化にLSQ量子化が用いられており、後述するように、LSQ量子化と比較して精度が向上することも確認できた。 KDLSQ-BERT uses ternary weights and 8-bit activation for the BERT model, and by applying this embodiment to ternary activation, speedup can be expected. In addition, in KDLSQ-BERT, LSQ quantization is used for activation bit reduction, and as will be described later, it has been confirmed that accuracy is improved compared to LSQ quantization.
 ・Configuration of Transformer language models represented by BERT
 As described in Example 1, a Transformer language model represented by BERT consists of an embedding layer and L Transformer encoder blocks (L = 12 in this application example). The parameters of BERT are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V for query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th block (l is a lowercase L); the output linear transformation matrix W_l^O applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
 ・Weights ternarized in KDLSQ-BERT
 KDLSQ-BERT ternarizes the weights of one embedding layer whose weight matrix is W_e (the embedding matrices for segment and position are excluded) and of the six linear layers whose weight matrices are {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}.
 ・Re-implementation of KDLSQ-BERT in this application example
 The model code of KDLSQ-BERT is not publicly available. Therefore, in this application example, a re-implementation was made based on the code of the RoBERTa model provided by huggingface (see, for example, Reference 5). However, while KDLSQ-BERT applies a special ternarization to the embedding matrix W_e that prepares a quantization coefficient for each dimension, W_e is not quantized in this application example for simplicity. That is, in this application example, the weights of the six linear layers {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2} are ternarized according to Example 2.
 ・Pre-training
 Each model was pre-trained using the Wikitext103 dataset (see, for example, Reference 6). The models to be trained were the non-quantized RoBERTa model (see, for example, Reference 7), the re-implemented KDLSQ-BERT model (8-bit activation), a model that ternarizes the activation with the KDLSQ-BERT technique unchanged, and a model in which this example is applied to the re-implemented TernaryBERT. Hereinafter, these models are referred to as "RoBERTa", "KDLSQ-BERT (8-bit activation)", "KDLSQ-BERT (ternary activation)", and "this example".
 学習条件として、事前学習では、バッチサイズを64、エポック数を12とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, in the pre-learning, the batch size was 64, the number of epochs was 12, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 KDLSQ-BERT performs distillation learning with a real-valued model as the teacher model (see, for example, Non-Patent Document 2). Therefore, no pre-training was performed for KDLSQ-BERT itself; instead, the huggingface RoBERTa model pre-trained on the Wikitext103 dataset was used as the teacher model, and distillation learning was performed using the same distillation loss as in Non-Patent Document 2.
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表3に示す。 Table 3 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000024

 Comparing the cases where both the weights and the activations are ternary, the ppl is 17.24 for KDLSQ-BERT (ternary activation) and 5.27 for this example, showing that the method of this example is superior to ternarization by LSQ quantization.
 また、活性化を近似する3値ベクトルを2つに増やしたことにより精度低下を大幅に抑えることができていることがわかる。これは、活性化を8bit化したKDLSQ-BERT(8bit活性化)には及ばないものの、活性化を8bit未満にしつつ、今まで困難であった言語モデルの精度維持に成功しているといえる。更に、本実施例では、重みと活性化の両方が3値化されているため、KDLSQ-BERTよりも高速化が期待できる。 In addition, it can be seen that by increasing the number of ternary vectors that approximate activation to two, the decrease in accuracy can be greatly suppressed. Although this is not as good as KDLSQ-BERT (8-bit activation) with 8-bit activation, it can be said that it succeeds in maintaining the accuracy of the language model, which has been difficult until now, while making the activation less than 8-bit. Furthermore, in this embodiment, both the weights and activations are ternarized, so higher speed than KDLSQ-BERT can be expected.
 ・ダウンストリームタスクでの評価
 事前学習で学習した各モデルをGLUEデータセット(例えば、参考文献8等を参照)の様々なタスク(MNLI、QQP、QNLI、SST-2、CoLA、STS-B、MRPC、RTE)でファインチューニングした。
・ Evaluation in downstream tasks Each model learned in pre-training is applied to various tasks (MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC , RTE).
 評価対象はRoBERTa、KDLSQ-BERT(8bit活性化)、KDLSQ-BERT(3値活性化)、本実施例とした。 The evaluation targets were RoBERTa, KDLSQ-BERT (8-bit activation), KDLSQ-BERT (3-value activation), and this example.
 学習条件として、各モデルの初期値は事前学習で学習したものとし、バッチサイズを16、エポック数を3とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, the initial value of each model was learned by prior learning, the batch size was 16, the number of epochs was 3, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 KDLSQ-BERTでは実数値モデルを教師モデルとした蒸留学習を行う(例えば、非特許文献2等を参照)。そこで、KDLSQ-BERTではファインチューニングを行わず、hugginfaceのRoBERTaモデルをGLUEデータセットの各タスクでファインチューニングしたものを教師モデルとして、非特許文献2と同様の蒸留lossを用いて蒸留学習を行った。 KDLSQ-BERT performs distillation learning using a real-valued model as a teacher model (for example, see Non-Patent Document 2, etc.). Therefore, KDLSQ-BERT does not perform fine tuning, and the training model is the RoBERTa model of hugginface fine-tuned for each task of the GLUE dataset, and distillation learning is performed using the same distillation loss as in Non-Patent Document 2. .
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表4に示す。 Table 4 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000025

 Here, following KDLSQ-BERT, MC (Matthews Correlation) was used as the evaluation metric for the CoLA task, F1 for the MRPC and QQP tasks, SC (Spearman Correlation) for the STS-B task, and accuracy for the other tasks.
 KDLSQ-BERT(3値活性化)と比較して、本実施例では各タスクでの精度低下が大幅に抑えられている。 Compared to KDLSQ-BERT (ternary activation), this embodiment greatly suppresses the decrease in accuracy in each task.
 In addition, on tasks other than CoLA and RTE, the accuracy drop is kept to roughly 2-5% even compared with RoBERTa, so it can be said that the accuracy of the language model, which has been difficult to maintain until now, is successfully maintained while keeping the activation below 8 bits. Furthermore, although this example is inferior in accuracy to KDLSQ-BERT (8-bit activation), both the weights and the activations are ternarized, so it can be expected to be faster than KDLSQ-BERT (8-bit activation).
 <まとめ>
 以上のように、本実施形態に係る推論装置10は、ニューラルネットワークモデルの活性化を高精度に3値化し、そのニューラルネットワークモデルにより所定のタスクの推論を高速に実行することができる。また、ニューラルネットワークモデルとしてBERT等に代表されるTransformer言語モデルを用いて様々な自然言語タスクで実験を行い、従来手法と比べて精度低下を大幅に抑えられることを確認した。
<Summary>
As described above, the inference device 10 according to the present embodiment can highly accurately ternarize the activation of a neural network model and perform inference of a predetermined task at high speed using the neural network model. In addition, experiments were conducted with various natural language tasks using a Transformer language model represented by BERT as a neural network model, and it was confirmed that accuracy deterioration can be greatly suppressed compared to conventional methods.
 本発明は、具体的に開示された上記の実施形態に限定されるものではなく、請求の範囲の記載から逸脱することなく、種々の変形や変更、既知の技術との組み合わせ等が可能である。 The present invention is not limited to the specifically disclosed embodiments described above, and various modifications, alterations, combinations with known techniques, etc. are possible without departing from the scope of the claims. .
 以上の実施形態に関し、更に以下の付記を開示する。 Regarding the above embodiments, the following additional remarks are disclosed.
 (付記1)
 メモリと、
 前記メモリに接続された少なくとも1つのプロセッサと、
 を含み、
 前記プロセッサは、
 ニューラルネットワークモデルにより所定のタスクの推論を行い、
 複数の3値ベクトルを用いて、前記ニューラルネットワークモデルを構成する各層への入力を表す活性化を3値化し、
 前記ニューラルネットワークモデルのモデルパラメータと、前記活性化を3値で表現するための3値化パラメータとを学習する、
 学習装置。
(Appendix 1)
memory;
at least one processor connected to the memory;
including
The processor
inference of a given task by a neural network model,
Using a plurality of ternary vectors, ternarize the activation representing the input to each layer constituting the neural network model;
learning model parameters of the neural network model and ternarization parameters for expressing the activation in three values;
learning device.
 (Appendix 2)
 The learning device according to Appendix 1, wherein the processor
 recursively and repeatedly calculates, for each layer, a scalar value and a ternary vector each element of which is one of -1, 0, and +1 so as to minimize a distance between the activation representing the input to the layer and a sum of n products of the scalar value and the ternary vector, where n is a predetermined natural number, and
 learns the ternarization parameter by a moving average with the scalar value.
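 As a concrete illustration of the procedure of Appendix 2, the following Python (PyTorch) sketch greedily fits n pairs of a scalar value and a ternary vector to an activation by alternating updates on the residual, and then tracks the ternarization parameter as a moving average of the fitted scalar. The threshold heuristic (0.7 times the mean absolute value), the number of alternating iterations, and the moving-average momentum are assumptions of this sketch and are not values given in the embodiment.

import torch

def ternary_decompose(x, n=2, iters=5):
    # Greedily approximate x by sum_i alpha_i * b_i with b_i in {-1, 0, +1}^d.
    residual = x.clone()
    alphas, codes = [], []
    for _ in range(n):
        # Initial ternary code from a threshold on the residual magnitude.
        delta = 0.7 * residual.abs().mean()
        b = torch.sign(residual) * (residual.abs() > delta).float()
        for _ in range(iters):
            # Given b, the distance-minimizing scalar is a least-squares fit.
            alpha = (residual * b).sum() / b.abs().sum().clamp(min=1.0)
            # Given alpha, re-pick the ternary code closest to the residual.
            b = torch.sign(residual) * (residual.abs() > alpha / 2).float()
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b  # fit the next term to what is left
    return alphas, codes

def update_ternarization_param(param, alpha, momentum=0.9):
    # Moving-average update of the ternarization parameter with the fitted
    # scalar value (the momentum value is an assumed hyperparameter).
    return momentum * param + (1.0 - momentum) * alpha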
 (Appendix 3)
 The learning device according to Appendix 1, wherein the processor
 ternarizes, for each layer, the activation representing the input to the layer with a quantization function having the ternarization parameter, and
 learns the ternarization parameter by error backpropagation using a pseudo gradient of the quantization function with respect to the ternarization parameter.
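 A minimal sketch of how a quantization function with a learnable ternarization parameter and a pseudo gradient might look is given below. The exact pseudo gradient of the embodiment is not reproduced here; the backward pass follows a learned-step-size-style estimator (cf. Reference 9) as an assumption, and s is assumed to be a scalar tensor.

import torch

class TernaryQuant(torch.autograd.Function):
    # Quantization function q(x; s) = s * clip(round(x / s), -1, +1)
    # with a learnable ternarization parameter s.

    @staticmethod
    def forward(ctx, x, s):
        ctx.save_for_backward(x, s)
        return s * torch.clamp(torch.round(x / s), -1.0, 1.0)

    @staticmethod
    def backward(ctx, grad_out):
        x, s = ctx.saved_tensors
        q = torch.clamp(torch.round(x / s), -1.0, 1.0)
        inside = (x / s).abs() <= 1.0
        # Straight-through estimate for the activation gradient.
        grad_x = grad_out * inside.float()
        # Pseudo gradient with respect to s (an LSQ-style assumption):
        # (q - x/s) inside the clipping range, and q (= +-1) outside it.
        grad_s = (grad_out * torch.where(inside, q - x / s, q)).sum()
        return grad_x, grad_s

# Usage: s is an nn.Parameter updated by ordinary backpropagation.
# x_q = TernaryQuant.apply(x, s)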
 (Appendix 4)
 The learning device according to Appendix 3, wherein
 the ternarization parameter is composed of a first ternarization parameter and a second ternarization parameter, and
 the processor
 calculates a first scalar value, a first ternary vector each element of which is one of -1, 0, and +1, a second scalar value, and a second ternary vector each element of which is one of -1, 0, and +1 so as to minimize a sum of a product of the first scalar value and the first ternary vector and a product of the second scalar value and the second ternary vector, and sets the first scalar value and the second scalar value as initial values of the first ternarization parameter and the second ternarization parameter, respectively,
 ternarizes, for each layer, the activation representing the input to the layer as a sum of a product of the first ternarization parameter and the first ternary vector and a product of the second ternarization parameter and the second ternary vector, with the quantization function having the ternarization parameters, and
 learns the ternarization parameters using a pseudo gradient of the quantization function with respect to the first ternarization parameter and a pseudo gradient of the quantization function with respect to the second ternarization parameter.
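 Under the same assumptions as the sketches above, the two-parameter scheme of Appendix 4 can be pictured as a forward pass that applies one ternary quantizer and then a second one to the residual; initial values for the two ternarization parameters could be taken from a decomposition such as ternary_decompose shown earlier, and each parameter would receive its own pseudo gradient through a TernaryQuant-style estimator. Whether the ternary vectors are recomputed at every step or fixed is not stated in this form, so the following is an illustration only.

import torch

def two_term_ternarize(x, s1, s2):
    # Forward sketch of the two-parameter ternarization: x is approximated
    # by s1 * b1 + s2 * b2 with b1, b2 in {-1, 0, +1}^d.
    b1 = torch.clamp(torch.round(x / s1), -1.0, 1.0)
    b2 = torch.clamp(torch.round((x - s1 * b1) / s2), -1.0, 1.0)
    return s1 * b1 + s2 * b2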
 (Appendix 5)
 An inference device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor
 performs inference of a predetermined task using a neural network model, and
 ternarizes, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
 (Appendix 6)
 A non-transitory storage medium storing a program executable by a computer to execute a learning process, the learning process comprising:
 performing inference of a predetermined task using a neural network model;
 ternarizing, using a plurality of ternary vectors, an activation representing an input to each layer constituting the neural network model; and
 learning a model parameter of the neural network model and a ternarization parameter for expressing the activation with ternary values.
 (Appendix 7)
 A non-transitory storage medium storing a program executable by a computer to execute an inference process, the inference process comprising:
 performing inference of a predetermined task using a neural network model; and
 ternarizing, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
[References]
Reference 1: Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, Heng Tao Shen. TBN: Convolutional Neural Network with Ternary Inputs and Binary Weights (ECCV2018)
Reference 2: Fengfu Li, Bo Zhang, Bin Liu. Ternary Weight Networks (2016)
Reference 3: Lu Hou, James T. Kwok. LOSS-AWARE WEIGHT QUANTIZATION OF DEEP NETWORKS (2018)
Reference 4: Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Reference 5: huggingface, Internet <URL: https://github.com/huggingface/transformers>
Reference 6: Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. Pointer Sentinel Mixture Models (2016)
Reference 7: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-anov, V.: RoBERTa: A Robustly Optimized BERT Pre-training Approach, CoRR, Vol. abs/1907.11692, (2019)
Reference 8: Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy & Samuel R. Bowman. GLUE: A MULTI-TASK BENCHMARK AND ANALYSIS PLATFORM FOR NATURAL LANGUAGE UNDERSTANDING (ICLR2019)
Reference 9: Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha. LEARNED STEP SIZE QUANTIZATION (ICLR2020)
 10    inference device
 101   input device
 102   display device
 103   external I/F
 103a  recording medium
 104   communication I/F
 105   processor
 106   memory device
 107   bus
 201   ternarization unit
 202   inference unit
 203   learning unit

Claims (8)

  1.  A learning device comprising:
     an inference unit that performs inference of a predetermined task using a neural network model;
     a ternarization unit that ternarizes, using a plurality of ternary vectors, an activation representing an input to each layer constituting the neural network model; and
     a learning unit that learns a model parameter of the neural network model and a ternarization parameter for expressing the activation with ternary values.
  2.  The learning device according to claim 1, wherein
     the ternarization unit recursively and repeatedly calculates, for each layer, a scalar value and a ternary vector each element of which is one of -1, 0, and +1 so as to minimize a distance between the activation representing the input to the layer and a sum of n products of the scalar value and the ternary vector, where n is a predetermined natural number, and
     the learning unit learns the ternarization parameter by a moving average with the scalar value.
  3.  The learning device according to claim 1, wherein
     the ternarization unit ternarizes, for each layer, the activation representing the input to the layer with a quantization function having the ternarization parameter, and
     the learning unit learns the ternarization parameter by error backpropagation using a pseudo gradient of the quantization function with respect to the ternarization parameter.
  4.  The learning device according to claim 3, wherein
     the ternarization parameter is composed of a first ternarization parameter and a second ternarization parameter,
     the learning device further comprises an initialization unit that calculates a first scalar value, a first ternary vector each element of which is one of -1, 0, and +1, a second scalar value, and a second ternary vector each element of which is one of -1, 0, and +1 so as to minimize a sum of a product of the first scalar value and the first ternary vector and a product of the second scalar value and the second ternary vector, and that sets the first scalar value and the second scalar value as initial values of the first ternarization parameter and the second ternarization parameter, respectively,
     the ternarization unit ternarizes, for each layer, the activation representing the input to the layer as a sum of a product of the first ternarization parameter and the first ternary vector and a product of the second ternarization parameter and the second ternary vector, with the quantization function having the ternarization parameters, and
     the learning unit learns the ternarization parameters using a pseudo gradient of the quantization function with respect to the first ternarization parameter and a pseudo gradient of the quantization function with respect to the second ternarization parameter.
  5.  An inference device comprising:
     an inference unit that performs inference of a predetermined task using a neural network model; and
     a ternarization unit that ternarizes, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
  6.  A learning method executed by a computer, the learning method comprising:
     an inference procedure of performing inference of a predetermined task using a neural network model;
     a ternarization procedure of ternarizing, using a plurality of ternary vectors, an activation representing an input to each layer constituting the neural network model; and
     a learning procedure of learning a model parameter of the neural network model and a ternarization parameter for expressing the activation with ternary values.
  7.  An inference method executed by a computer, the inference method comprising:
     an inference procedure of performing inference of a predetermined task using a neural network model; and
     a ternarization procedure of ternarizing, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
  8.  A program that causes a computer to function as the learning device according to any one of claims 1 to 4, or as the inference device according to claim 5.
PCT/JP2021/019268 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program WO2022244216A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/019268 WO2022244216A1 (en) 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program
JP2023522144A JPWO2022244216A1 (en) 2021-05-20 2021-05-20

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/019268 WO2022244216A1 (en) 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program

Publications (1)

Publication Number Publication Date
WO2022244216A1 true WO2022244216A1 (en) 2022-11-24

Family

ID=84141178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/019268 WO2022244216A1 (en) 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program

Country Status (2)

Country Link
JP (1) JPWO2022244216A1 (en)
WO (1) WO2022244216A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018016608A1 (en) * 2016-07-21 2018-01-25 株式会社デンソーアイティーラボラトリ Neural network apparatus, vehicle control system, decomposition device, and program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018016608A1 (en) * 2016-07-21 2018-01-25 株式会社デンソーアイティーラボラトリ Neural network apparatus, vehicle control system, decomposition device, and program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AMBAI, MITSURU ET AL.: "SPADE: Scalar Product Accelerator by Integer Decomposition for Object Detection", PROCEEDINGS OF EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV) 2014, vol. 8693, 2014, pages 267 - 281, XP047529520, Retrieved from the Internet <URL:https://link.springer.com/chapter/10.1007/978-3-319-10602-1_18> [retrieved on 20210629], DOI: 10.1007/978-3-319-10602-1_18 *
KUROMERU: "Frequently used properties of inverse matrix.", BUTSURINO KAGISHIPPO = PHYSICS KEY PROJECT, JP, XP009541578, Retrieved from the Internet <URL:https://hooktail.sub.jp/mathInPhys/inverseMatrix/> [retrieved on 20210630] *
RYOU ET AL.: "Matrix About row and column vectors. Oshiete! Goo.", OSHIETE, 26 May 2014 (2014-05-26), pages 1 - 10, XP093011104, Retrieved from the Internet <URL:https://oshiete.goo.ne.jp/qa/8610804.html> [retrieved on 20210630] *

Also Published As

Publication number Publication date
JPWO2022244216A1 (en) 2022-11-24

Similar Documents

Publication Publication Date Title
US11928600B2 (en) Sequence-to-sequence prediction using a neural network model
Xu et al. Alternating multi-bit quantization for recurrent neural networks
US11593655B2 (en) Predicting deep learning scaling
CN107679618B (en) Static strategy fixed-point training method and device
CN109785826B (en) System and method for trace norm regularization and faster reasoning for embedded models
Hwang et al. Fixed-point feedforward deep neural network design using weights +1, 0, and −1
US11593611B2 (en) Neural network cooperation
CN111414749B (en) Social text dependency syntactic analysis system based on deep neural network
US20210224447A1 (en) Grouping of pauli strings using entangled measurements
CN110728350A (en) Quantification for machine learning models
WO2020204904A1 (en) Learning compressible features
US11823054B2 (en) Learned step size quantization
CN115699029A (en) Knowledge distillation using back-propagation knowledge in neural networks
JP2017016384A (en) Mixed coefficient parameter learning device, mixed occurrence probability calculation device, and programs thereof
US20240061889A1 (en) Systems and Methods for Weighted Quantization
Rokh et al. A comprehensive survey on model quantization for deep neural networks
Zhu et al. Structurally sparsified backward propagation for faster long short-term memory training
KR20210099795A (en) Autoencoder-based graph construction for semi-supervised learning
JP2023545820A (en) Generative neural network model for processing audio samples in the filter bank domain
CN114267366A (en) Speech noise reduction through discrete representation learning
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
CN116171445A (en) On-line training of neural networks
WO2022244216A1 (en) Learning device, inference device, learning method, inference method, and program
Mishra CNN and RNN Using PyTorch
WO2023009740A1 (en) Contrastive learning and masked modeling for end-to-end self-supervised pre-training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21940830

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023522144

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21940830

Country of ref document: EP

Kind code of ref document: A1