WO2022244216A1 - Learning device, inference device, learning method, inference method, and program - Google Patents

Learning device, inference device, learning method, inference method, and program

Info

Publication number
WO2022244216A1
Authority
WO
WIPO (PCT)
Prior art keywords
ternarization
ternary
learning
parameter
neural network
Prior art date
Application number
PCT/JP2021/019268
Other languages
French (fr)
Japanese (ja)
Inventor
宗一郎 加来
京介 西田
仙 吉田
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/019268 (WO2022244216A1)
Priority to JP2023522144A (JPWO2022244216A1)
Publication of WO2022244216A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention relates to a learning device, an inference device, a learning method, an inference method, and a program.
  • a neural network model includes a large number of linear transformations, and in particular, matrix operations during these linear transformations affect the computation time.
  • quantization means approximating and expressing a float value, which is normally expressed by 32 bits, by a smaller number of bits (for example, 2 bits, 8 bits, etc.). Quantization is also called bit-lowering.
  • Activation is a vector input to each layer of the neural network model.
  • Non-Patent Document 1 and Non-Patent Document 2 are known as conventional methods related to quantization of weights and activations of neural network models.
  • Non-Patent Documents 1 and 2 both ternarize the weights of the language model called BERT (Bidirectional Encoder Representations from Transformers) and quantize its activations to 8 bits.
  • ternarization means approximating a real value by a product (or a sum of a plurality of products) of a scalar value and one of three integer values, as will be described later.
  • An embodiment of the present invention has been made in view of the above points, and aims to ternarize the activation of a neural network model with high accuracy.
  • A learning device according to one embodiment includes an inference unit that infers a predetermined task using a neural network model; a ternarization unit that uses a plurality of ternary vectors to ternarize an activation representing the input to each layer constituting the neural network model; and a learning unit that learns model parameters of the neural network model and ternarization parameters for expressing the activation with ternary values.
  • the activation of the neural network model can be ternarized with high accuracy.
  • FIG. 2 is a diagram showing an example of the functional configuration of the inference device at the time of inference.
  • FIG. 3 is a diagram showing an example of the functional configuration of the inference device during learning.
  • FIG. 4 is a flowchart showing an example of inference processing in Example 1.
  • FIG. 5 is a flowchart showing an example of learning processing in Example 1.
  • FIG. 6 is a flowchart showing an example of inference processing in Example 2, and FIG. 7 is a diagram showing a comparison with the ternarization vector of a conventional method.
  • FIG. 8 is a flowchart showing an example of learning processing in Example 2, and FIG. 9 is a diagram showing a comparison with the pseudo gradient of a conventional method.
  • an inference device 10 that ternarizes the activation of a neural network model with high accuracy and executes inference of a predetermined task using the neural network model will be described.
  • When quantizing the activations of a neural network model, it is common to also quantize its weights, so in the following the weights are assumed to have already been quantized by a known quantization technique (for example, binarization, ternarization, etc.).
  • Quantization of the weights has the effect of reducing model size, and when both weights and activations are quantized, a speedup can be expected by using a dedicated implementation of the matrix operations performed during linear transformations.
  • In particular, with binarization and ternarization, significantly faster matrix operations using logical operations can be expected.
  • Binarization refers to approximating a real value by a sum of n products of a scalar value and one of two integer values (for example, {-1, 1}), and ternarization refers to approximating a real value by a sum of n products of a scalar value and one of three integer values (for example, {-1, 0, 1}).
  • the scalar value is a real number greater than 0, and n is a natural number.
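  • As a concrete illustration of this definition (not a procedure from the embodiments), the following Python sketch approximates a real-valued vector by a single scalar-times-ternary-vector product (n = 1); the thresholding rule used to pick the ternary elements is only an assumption made for the example.

```python
import numpy as np

# A real-valued activation vector (hypothetical values).
x = np.array([0.8, -0.1, 0.05, -0.9])

# One simple ternarization with n = 1: threshold at half the mean magnitude.
# This particular thresholding rule is an illustrative assumption, not the
# procedure defined by the embodiments.
threshold = 0.5 * np.abs(x).mean()
B = np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))  # elements in {-1, 0, 1}
a = np.abs(x[B != 0]).mean() if np.any(B != 0) else 0.0          # scalar value a > 0

approx = a * B         # the product a*B approximates x with a single ternary term
residual = x - approx  # error that further terms a_2*B_2, ... could absorb
print(B, a, approx, residual)
```

  • With n > 1, the same idea is applied again to the remaining residual, which is what the procedures described below do.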
  • FIG. 1 is a diagram showing an example of the hardware configuration of an inference device 10 according to this embodiment.
  • The inference device 10 is realized by the hardware configuration of a general computer or computer system, and includes an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a processor 105, and a memory device 106. These pieces of hardware are communicably connected via a bus 107.
  • the input device 101 is, for example, a keyboard, mouse, touch panel, or the like.
  • the display device 102 is, for example, a display. Note that the inference device 10 does not have to have at least one of the input device 101 and the display device 102 .
  • the external I/F 103 is an interface with an external device such as the recording medium 103a.
  • the inference device 10 can perform reading, writing, etc. of the recording medium 103 a via the external I/F 103 .
  • Examples of the recording medium 103a include CD (Compact Disc), DVD (Digital Versatile Disk), SD memory card (Secure Digital memory card), USB (Universal Serial Bus) memory card, and the like.
  • the communication I/F 104 is an interface for connecting the inference device 10 to a communication network.
  • the processor 105 is, for example, various arithmetic units such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit).
  • the memory device 106 is, for example, various storage devices such as HDD (Hard Disk Drive), SSD (Solid State Drive), RAM (Random Access Memory), ROM (Read Only Memory), and flash memory.
  • the inference device 10 has the hardware configuration shown in FIG. 1, so that inference processing and learning processing, which will be described later, can be realized.
  • the hardware configuration shown in FIG. 1 is merely an example, and the inference device 10 may have other hardware configurations.
  • The inference device 10 may have multiple processors 105 and multiple memory devices 106.
  • the inference device 10 has two phases: learning and inference.
  • During learning, the parameters of the neural network model (hereinafter also referred to as "model parameters") and the parameters for ternarizing the activation (hereinafter also referred to as "ternarization parameters") are learned.
  • During inference, the learned model parameters and the learned ternarization parameters are used to ternarize the activations, and the neural network model performs inference.
  • the inference device 10 during learning may be called a "learning device” or the like. Also, the inference device 10 during learning and the inference device 10 during inference may be realized by different devices or systems.
  • FIG. 2 is a diagram showing an example of the functional configuration of the inference device 10 during inference.
  • the inference device 10 at the time of inference has a ternarization unit 201 and an inference unit 202 .
  • Each of these units is realized by processing that one or more programs installed in the inference apparatus 10 at the time of inference cause the processor 105 to execute.
  • The ternarization unit 201 uses the learned ternarization parameters to ternarize the activation of each layer of the neural network model. That is, the ternarization unit 201 uses the learned ternarization parameters to create a ternarization vector by ternarizing each element of the real-valued vector input to each layer of the neural network model, and outputs this ternarization vector to the neural network model as the activation.
  • the learned ternary parameter exists for each layer of the neural network model, and is stored in the memory device 106 or the like, for example.
  • The inference unit 202 is realized by the neural network model, and uses the learned model parameters to infer a given task. That is, in each layer of the neural network model, the inference unit 202 receives the ternarization vector created by the ternarization unit 201 and, using the learned model parameters, outputs a real-valued vector as the output vector of the layer. At this time, the output of the final layer (output layer) of the neural network model (or the result of performing predetermined processing on it) becomes the inference result.
  • the learned model parameters exist for each layer of the neural network model, and are stored in the memory device 106 or the like, for example.
  • a neural network model is generally composed of multiple layers, and each layer has neurons (also called units, nodes, etc.).
  • a weight representing the strength of connection between neurons exists between each layer.
  • When a layer of the neural network model is a linear layer (also called a fully connected layer), the following operations 1 to 3 are performed in the neurons of that layer.
  • Operation 1: Compute the weighted sum of the input vector (activation) given to the layer. Operation 2: Add the bias term. Operation 3: Apply the activation function (for example, ReLU).
  • The computation time of the neural network model is dominated by the time required to compute the weighted sum in Operation 1, that is, by the inner product w_1 v_1 + ... + w_dim v_dim of the weight vector w = (w_1, ..., w_dim) and the input vector v = (v_1, ..., v_dim).
  • the weights of the neural network model are quantized by a known quantization technique (in particular, quantization expressed with low bits such as binarization and ternarization), and activation is ternarized, it is possible to reduce the computation time of the neural network model.
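  • The benefit of this decomposition shows up in Operation 1: if the activation is expressed as a_1B_1 + ... + a_nB_n, the weighted sum W·x can be computed as the sum of a_i (W·B_i), where each W·B_i involves only values in {-1, 0, 1}. The sketch below illustrates this with ordinary dense numpy arrays; the actual speedup would come from a dedicated low-bit kernel, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, out_dim, n = 8, 4, 2

W = np.sign(rng.standard_normal((out_dim, dim)))      # low-bit weights (illustrative)
bias = rng.standard_normal(out_dim)
a = np.array([0.6, 0.2])                              # scalar values a_1, a_2
B = rng.integers(-1, 2, size=(n, dim)).astype(float)  # ternary vectors B_1, B_2

# Operation 1: the weighted sum, computed per ternary component.
# Each W @ B[i] involves only values in {-1, 0, 1}, so in a dedicated
# implementation it needs no floating-point multiplications.
weighted_sum = sum(a[i] * (W @ B[i]) for i in range(n))

# Operation 2: add the bias term; Operation 3: apply an activation function (ReLU here).
output = np.maximum(weighted_sum + bias, 0.0)
print(output)
```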
  • FIG. 3 is a diagram showing an example of the functional configuration of the inference device 10 during learning.
  • the inference device 10 during learning has a ternarization unit 201 , an inference unit 202 , and a learning unit 203 .
  • Each of these units is realized by processing that one or more programs installed in the inference apparatus 10 at the time of learning cause the processor 105 to execute.
  • the ternarization unit 201 and the inference unit 202 are the same as in inference. However, it differs from inference in that untrained ternarization parameters and untrained model parameters are used. Note that the ternarization parameter exists for each layer of the neural network model and is stored in the memory device 106 or the like, for example. Similarly, model parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106 or the like.
  • the learning unit 203 learns the ternarization parameters used when the ternarization unit 201 ternarizes the activation, and learns the model parameters of the neural network model that implements the inference unit 202 .
  • Example 1 of the present embodiment will be described below.
  • In Example 1, the activation X = (x_1, ..., x_dim) of a layer is approximated by the ternarization vector a_1 B_1 + ... + a_n B_n, where the ternarization parameter of the layer is A = (a_1, ..., a_n). Here, a_i (i = 1, ..., n) is a scalar value satisfying a_i > 0, and B_i (i = 1, ..., n) is a vector whose elements each take one of -1, 0, and 1.
  • FIG. 4 is a flowchart illustrating an example of inference processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
  • steps S101 to S106 in FIG. 4 are repeatedly executed for each layer of the neural network model. Steps S101 to S106 for a certain layer of the neural network model will be described below.
  • Step S101 The inference unit 202 inputs the real-valued vector given to the neural network model or the real-valued vector that is the output vector of the previous layer.
  • That is, the inference unit 202 inputs the real-valued vector given to the neural network model if the layer is the first layer (input layer), and otherwise inputs the real-valued vector that is the output vector of the previous layer. Note that the real-valued vector given to the neural network model is the inference target data of the task.
  • Step S102 Using the learned ternarization parameter, the ternarization unit 201 ternarizes the real-valued vector (that is, activation) input in step S101 to create a ternarization vector. At this time, the ternarization unit 201 creates a ternarization vector according to procedures 1-1 to 1-3 below.
  • Procedure 1-1: Using the learned scalar value a_1, the ternarization unit 201 applies the quantization function Q element-wise, that is, Q_a(X) = (Q_a(x_1), Q_a(x_2), ..., Q_a(x_dim)), and sets B_1 = Q_{a_1}(X). Here, a is a scalar value, and Q_a maps each real value to one of -1, 0, and 1.
  • The first ternary vector is a_1 B_1, and X_1 = X - a_1 B_1 can be regarded as the error that could not be approximated by the first ternary vector a_1 B_1.
  • Procedure 1-2: The ternarization unit 201 recursively and repeatedly calculates the i-th ternary vector a_i B_i (i ≥ 2). That is, when the ternary vectors up to a_1 B_1, ..., a_{i-1} B_{i-1} have been obtained, the ternarization unit 201 applies the quantization function with the learned scalar value a_i to the residual X_{i-1} = X - (a_1 B_1 + ... + a_{i-1} B_{i-1}) to obtain B_i, and sets the i-th ternary vector to a_i B_i.
  • Procedure 1-3: When the n ternary vectors a_1 B_1, ..., a_n B_n have been obtained, the ternarization unit 201 sets a_1 B_1 + ... + a_n B_n as the ternarization vector. As a result, the ternarized vector a_1 B_1 + ... + a_n B_n is obtained.
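  • A minimal sketch of procedures 1-1 to 1-3 is shown below, assuming that the quantization function Q_a maps each element to -1, 0, or 1 by thresholding at a/2; the exact form of Q is defined by the embodiment, so this threshold is only an assumption.

```python
import numpy as np

def Q(x, a):
    """Element-wise ternary quantization with scalar a (thresholding at a/2 is an assumption)."""
    return np.where(x > a / 2, 1.0, np.where(x < -a / 2, -1.0, 0.0))

def ternarize_activation(x, A):
    """Approximate the activation x by a_1*B_1 + ... + a_n*B_n using the learned scalars in A."""
    approx = np.zeros_like(x, dtype=float)
    for a_i in A:                    # procedure 1-1 for i = 1, procedure 1-2 for i >= 2
        residual = x - approx        # error not yet approximated
        B_i = Q(residual, a_i)       # ternary vector for the current residual
        approx = approx + a_i * B_i
    return approx                    # procedure 1-3: the ternarization vector

x = np.array([0.9, -0.4, 0.05, -1.2])
A = [0.8, 0.3]                       # learned ternarization parameters (hypothetical values)
print(ternarize_activation(x, A))
```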
  • Step S103: Using the trained model parameters, the inference unit 202 calculates the weighted sum of the ternarization vector created in step S102 for each neuron included in the layer. That is, for example, if the ternarization vector a_1 B_1 + ... + a_n B_n is written as v = (v_1, v_2, ..., v_dim) and the weight vector of the neuron is w = (w_1, w_2, ..., w_dim), the inference unit 202 calculates w_1 v_1 + ... + w_dim v_dim. Note that the weights are included in the trained model parameters.
  • Step S104 The inference unit 202 adds a bias term to the weighted sum calculated in step S103 above using the learned model parameters for each neuron included in the layer. That is, for example, if the bias term of the neuron is b, the inference unit 202 calculates w 1 v 1 + . . . +w dim v dim +b. Note that the bias term is included in the learned model parameters.
  • Step S105 The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S104. That is, for example, if the activation function of the neuron is ⁇ ( ⁇ ), the inference unit 202 calculates ⁇ (w 1 v 1 + . . . +w dim v dim +b).
  • the above steps S101 to S106 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) is the inference result.
  • FIG. 5 is a flowchart illustrating an example of learning processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
  • steps S201 to S207 in FIG. 5 are repeatedly executed for each layer of the neural network model. Steps S201 to S207 for a certain layer of the neural network model will be described below.
  • Step S201 The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
  • In step S202, the ternarization unit 201 creates a ternarization vector from the real-valued vector (activation) X input in step S201. First, it obtains the scalar value a and the ternary vector B that minimize ||X - aB||, where B is a vector whose elements are either -1, 0, or 1 and ||·|| is the L2 norm, and sets a_1 = a and B_1 = B.
  • Here, an approximate solution is obtained using Newton's method with reference to Algorithm 2 described in Reference 3. This takes advantage of the fact that when either a or B is fixed, the other value that minimizes ||X - aB|| can easily be obtained.
  • Note that an approximate solution may also be obtained by a technique other than Newton's method.
  • Next, the ternarization unit 201 recursively and repeatedly calculates the i-th ternary vector a_i B_i (i ≥ 2). That is, when the ternary vectors up to a_1 B_1, ..., a_{i-1} B_{i-1} have been obtained, the ternarization unit 201 obtains the scalar value a and the ternary vector B that minimize ||X - (a_1 B_1 + ... + a_{i-1} B_{i-1}) - aB||.
  • The ternarization unit 201 sets the scalar value a obtained in this manner to a_i and the ternary vector B to B_i.
  • When the n ternary vectors a_1 B_1, ..., a_n B_n have been obtained, the ternarization unit 201 sets a_1 B_1 + ... + a_n B_n as the ternarization vector.
  • As a result, the ternarized vector a_1 B_1 + ... + a_n B_n is obtained.
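  • The alternating idea behind this minimization (fix one of a and B and solve for the other) can be sketched as follows; the coordinate-descent sub-steps shown here replace the Newton iteration of Algorithm 2 in Reference 3 and are stated as assumptions, not as the patented procedure.

```python
import numpy as np

def solve_aB(x, iters=10):
    """Alternately update a and B to (approximately) minimize ||x - a*B||_2.

    For fixed a the optimal B thresholds x at +/- a/2, and for fixed B the
    optimal a is the least-squares coefficient (x . B) / (B . B); Newton's
    method as in Algorithm 2 of Reference 3 could be used instead.
    """
    a = float(np.abs(x).mean()) + 1e-12   # simple initial guess (assumption)
    B = np.zeros_like(x)
    for _ in range(iters):
        B = np.where(x > a / 2, 1.0, np.where(x < -a / 2, -1.0, 0.0))
        if not np.any(B):
            break
        a = float(x @ B) / float(B @ B)
    return a, B

x = np.array([1.1, -0.3, 0.02, -0.8])
a1, B1 = solve_aB(x)              # first term a_1 * B_1
a2, B2 = solve_aB(x - a1 * B1)    # second term fitted to the residual
print(a1, B1, a2, B2)
```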
  • In step S203, the learning unit 203 updates the ternarization parameter a'_i of the layer by a moving average with the scalar value a_i obtained above, where the moving-average coefficient is a parameter whose value lies between 0 and 1.
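  • A minimal sketch of such a moving-average update is shown below; the coefficient name `beta` and its default value are hypothetical.

```python
def update_ternarization_param(a_prime, a_new, beta=0.9):
    """Moving-average update of a learned ternarization scalar a'_i.

    The coefficient name `beta` (0 < beta < 1) and the blending direction are
    assumptions; the embodiment only states that a'_i is updated by a moving average.
    """
    return beta * a_prime + (1.0 - beta) * a_new

a_prime = 0.75   # current ternarization parameter a'_i
a_new = 0.68     # scalar a_i obtained for the current mini-batch
print(update_ternarization_param(a_prime, a_new))
```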
  • Step S204 Similar to step S103 in FIG. 4, the inference unit 202 uses the model parameters for each neuron included in the layer to calculate the weighted sum of the ternarized vectors created in step S202 above. do.
  • Step S205 Similar to step S104 in FIG. 4, the inference unit 202 adds a bias term to the weighted sum calculated in step S204 using model parameters for each neuron included in the layer. do.
  • Step S206 The inference unit 202 calculates an activation function using the calculation result of step S205 above, as in step S105 of FIG.
  • Step S207 As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
  • Step S208 When the above steps S201 to S207 are executed up to the final layer, the learning unit 203 updates the model parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters.
  • Note that the differential value of the quantization function Q is necessary to calculate the differential of the loss function by the error backpropagation method, but since Q is not differentiable, the error cannot be backpropagated as is. Therefore, in the present embodiment, the differential value of the quantization function Q is given in a pseudo manner by the STE (straight-through estimator) technique described in Reference 4.
  • STE is one of the basic techniques used in quantization learning.
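  • The STE idea can be sketched in plain numpy as follows: the quantization function is applied in the forward pass, and in the backward pass its derivative is treated as if it were the identity, optionally clipped to the representable range (the clipping range used here is an assumption).

```python
import numpy as np

def quantize_forward(x, a):
    """Forward pass: ternary quantization (same assumed thresholding rule as above)."""
    return a * np.where(x > a / 2, 1.0, np.where(x < -a / 2, -1.0, 0.0))

def quantize_backward_ste(x, a, grad_output):
    """Backward pass with STE: pass the incoming gradient straight through,
    zeroing it outside the representable range [-a, a] (the clipping is an assumption)."""
    pass_through = (np.abs(x) <= a).astype(float)
    return grad_output * pass_through

x = np.array([0.9, -0.4, 0.05, -1.6])
a = 0.8
y = quantize_forward(x, a)
grad_y = np.ones_like(y)                      # pretend dL/dy = 1 for illustration
grad_x = quantize_backward_ste(x, a, grad_y)  # pseudo dL/dx given by the STE
print(y, grad_x)
```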
  • model parameters may be updated for each batch, which is a set of learning data.
  • TernaryBERT ternarizes the weights of the BERT model and uses 8-bit activation; by applying this embodiment to ternarize the activation as well, a further speedup can be expected.
  • As will be described later, accuracy was improved compared to the conventional method.
  • The BERT parameters are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V of the query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th layer block; the linear transformation matrix W_l^O of the output applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
  • The weights ternarized by TernaryBERT are one embedding layer whose weight matrix is W_e (the embedding matrices for segment and position are excluded) and six linear layers whose weight matrices are {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}.
  • The TernaryBERT model code is not open to the public. Therefore, in this application example, an additional implementation was made based on the code of the RoBERTa model provided by Huggingface (for example, see Reference 5). However, in TernaryBERT the embedding matrix W_e is subjected to a special ternarization in which quantization coefficients are prepared for each dimension; in this application example, W_e is not quantized for the sake of simplicity.
  • For the six linear layers having the weight matrices {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}, ternarization is performed according to Example 1.
  • Pre-training of each model was performed using the Wikitext-103 dataset (for example, see Reference 6).
  • The models to be trained are the non-quantized RoBERTa model (for example, see Reference 7), the additionally implemented TernaryBERT model (8-bit activation), the TernaryBERT model with ternary activation, and models in which this embodiment is applied to the additionally implemented TernaryBERT with n = 1 and n = 2, respectively.
  • n is the number of ternary vectors when approximating activation with a ternary vector.
  • the batch size was 64, the number of epochs was 3, and one commercially available general GPU was used.
  • The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training.
  • For optimization, Adam was used, and the dropout rate was 0.1.
  • In TernaryBERT, distillation learning is performed using a real-valued model as the teacher model (for example, see Non-Patent Document 1). Therefore, no pre-training was performed for TernaryBERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 1, using Huggingface's RoBERTa model pre-trained on the Wikitext-103 dataset as the teacher model.
  • Table 1 below shows the results of evaluation experiments for each model under the above conditions.
  • ppl is word perplexity.
  • The ppl is a general evaluation index that qualitatively corresponds to the reciprocal of the "certainty of word prediction" of a language model; lower is better. In pre-training, the task of hiding some words in a sentence and predicting the hidden words from the surrounding words is solved.
  • The initial values of each model were obtained by the pre-training described above; the batch size was 16, the number of epochs was 3, and one commercially available general-purpose GPU was used. The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training. For optimization, Adam was used, and the dropout rate was 0.1.
  • In TernaryBERT, distillation learning is performed using a real-valued model as the teacher model (for example, see Non-Patent Document 1). Therefore, fine-tuning was not performed for TernaryBERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 1, using Huggingface's RoBERTa model fine-tuned on each task of the GLUE dataset as the teacher model.
  • Table 2 below shows the results of evaluation experiments for each model under the above conditions.
  • F1 was adopted for the QQP task, and accuracy was adopted for the SST-2 task.
  • Example 2 of the present embodiment will be described below.
  • In Example 2, the activation is approximated by the sum of two ternary vectors s_1 B_1 + s_2 B_2, where s_1 and s_2 are scalar values that satisfy s_1 > 2s_2 > 0.
  • FIG. 6 is a flowchart illustrating an example of inference processing according to the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
  • steps S301 to S306 in FIG. 6 are repeatedly executed for each layer of the neural network model. Steps S301 to S306 for a certain layer of the neural network model will be described below.
  • Step S301 The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is the inference target data of the task.
  • Step S302: Using the learned ternarization parameters, the ternarization unit 201 ternarizes the real-valued vector (activation) input in step S301. The first ternary vector is s_1 B_1, and the second ternary vector is s_2 B_2.
  • The ternarization unit 201 sets s_1 B_1 + s_2 B_2 as the ternarization vector.
  • Step S303 Similar to step S103 in FIG. 4, the inference unit 202 uses the trained model parameters for each neuron included in the layer to obtain a weighted sum of the ternary vectors created in step S302 above. to calculate
  • Step S304 Similar to step S104 in FIG. 4, the inference unit 202 uses the learned model parameters for each neuron included in the layer to apply the bias term to the weighted sum calculated in step S303. Add
  • Step S305 The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S304 above, as in step S105 of FIG.
  • Step S306 As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
  • the above steps S301 to S306 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) becomes the inference result.
  • Here, a conventional method called LSQ quantization (for example, see Reference 9) is compared with this embodiment.
  • In LSQ quantization, one ternary vector represents the activation; that is, in contrast to this embodiment, in LSQ quantization the activation X is represented by s_1 B_1 alone, whereas in this embodiment the ternary vector representing the activation X is s_1 B_1 + s_2 B_2.
  • The i-th element of the ternary vector of this embodiment and the i-th element of the LSQ-quantized ternary vector are as shown in FIG. 7, where the vertical axis is the i-th element of the ternary vector and the horizontal axis is the i-th element of the activation X.
  • In this embodiment, each of the three values given by s_1 is fine-tuned by s_2, thereby reducing the accuracy loss compared to LSQ quantization.
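  • The difference can be sketched as follows, assuming (as in LSQ-style quantizers) that each ternary vector is obtained by clipping and rounding the (residual) activation divided by the corresponding scale; the embodiment's exact quantizer may differ, so this is only an illustrative comparison.

```python
import numpy as np

def ternary(x, s):
    """One LSQ-style ternary component: round(clip(x / s, -1, 1)), with values in {-1, 0, 1}."""
    return np.rint(np.clip(x / s, -1.0, 1.0))

def lsq_quantize(x, s1):
    """Conventional LSQ-style ternarization: the activation is represented by s1 * B1 only."""
    return s1 * ternary(x, s1)

def two_scale_quantize(x, s1, s2):
    """Example 2: s1 * B1 plus a fine correction s2 * B2 computed on the residual."""
    B1 = ternary(x, s1)
    B2 = ternary(x - s1 * B1, s2)
    return s1 * B1 + s2 * B2

x = np.linspace(-1.5, 1.5, 7)
s1, s2 = 1.0, 0.3                    # hypothetical scales satisfying s1 > 2*s2 > 0
print(lsq_quantize(x, s1))
print(two_scale_quantize(x, s1, s2))
```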
  • FIG. 8 is a flowchart illustrating an example of learning processing in the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
  • steps S402 to S407 in FIG. 8 are repeatedly executed for each layer of the neural network model. Steps S402 to S407 relating to a certain layer of the neural network model will be described below. Note that the initialization in step S401 is executed only once. For example, when model parameters and ternarization parameters are repeatedly updated for each batch, the initialization in step S401 is performed only for the first batch, and not performed for the second and subsequent batches.
  • Step S401 The learning unit 203 initializes the ternarization parameter using the learning data first given to the neural network model. Specifically, the learning unit 203 initializes the ternarization parameter according to procedures 4-1 to 4-3 below. Let X be the real-valued vector represented by the learning data first given to the neural network model.
  • In procedure 4-1, the learning unit 203 obtains the scalar value s and the ternary vector B that minimize ||X - sB||, and sets s_1 = s and B_1 = B.
  • B is a vector whose elements are either -1, 0, or 1.
  • In procedure 4-2, the learning unit 203 obtains the scalar value s and the ternary vector B that minimize ||X - s_1 B_1 - sB||, and sets s_2 = s.
  • In procedure 4-3, the learning unit 203 sets (s_1, s_2) obtained in procedures 4-1 to 4-2 as the initial values of the ternarization parameter S.
  • Step S402 The inference unit 202 inputs the real-valued vector given to the neural network model or the real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
  • Step S403: Using the ternarization parameters, the ternarization unit 201 ternarizes the real-valued vector (activation) input in step S402. The first ternary vector is s_1 B_1, and the second ternary vector is s_2 B_2.
  • The ternarization unit 201 sets s_1 B_1 + s_2 B_2 as the ternarization vector.
  • Step S404 Similar to step S103 in FIG. 4, the inference unit 202 uses the model parameters for each neuron included in the layer to calculate the weighted sum of the ternary vectors created in step S403 above. do.
  • Step S405 Similar to step S104 in FIG. 4, the inference unit 202 adds a bias term to the weighted sum calculated in step S404 using model parameters for each neuron included in the layer. do.
  • Step S406 The inference unit 202 calculates an activation function using the calculation result of step S405 above, as in step S105 of FIG.
  • Step S407 As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
  • Step S408 When the above steps S402 to S407 are executed up to the final layer, the learning unit 203 updates the model parameters and the ternarization parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters and the ternarization parameters.
  • Note that the differential value of the quantization function Q is necessary in order to calculate the differential of the loss function by the error backpropagation method, but since Q is not differentiable, the error cannot be backpropagated as is. Therefore, in this embodiment, the differential value of the quantization function Q is given in a pseudo manner.
  • Specifically, the pseudo gradient of LSQ quantization (for example, see Reference 9), which expresses the activation by one ternary vector, is extended to the case where the activation is expressed by the sum of two ternary vectors.
  • In this case, the value of the k-th dimension of the activation after quantization is determined from the value of the k-th dimension of the activation before quantization.
  • round(x) is a function that rounds off the decimal point of x to the nearest integer value
  • Sign(x) is a sign function that returns 1 if 0 ⁇ x and -1 if x ⁇ 0.
  • The LSQ quantization pseudo gradient and the pseudo gradient provided in this embodiment are shown in FIG. 9, where the vertical axis is the pseudo derivative of the k-th element of the ternarized vector with respect to the ternarization parameter and the horizontal axis is the k-th element of the ternarized vector.
  • In this embodiment, each of the three values given by s_1 is fine-tuned by s_2; that is, the value is adjusted by s_2 in the vicinity of {s_1, 0, -s_1}.
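  • As a hedged sketch of what such a pseudo gradient can look like, the code below follows the standard LSQ rule for the derivative of the quantized value with respect to a step size (pass -x/s + round(x/s) inside the clipping range, and the clipped level outside), applied first to s_1 and then to s_2 on the residual; the embodiment's exact pseudo gradient may differ from this reconstruction.

```python
import numpy as np

def lsq_grad_wrt_scale(x, s):
    """LSQ-style pseudo derivative d q(x) / d s for q(x) = s * round(clip(x / s, -1, 1))."""
    v = x / s
    inside = (-1.0 < v) & (v < 1.0)
    return np.where(inside, -v + np.rint(v), np.sign(v))

def two_scale_pseudo_grads(x, s1, s2):
    """Pseudo gradients of the two-scale quantizer with respect to s1 and s2 (reconstruction)."""
    B1 = np.rint(np.clip(x / s1, -1.0, 1.0))
    residual = x - s1 * B1                   # part handled by the second ternary vector
    g_s1 = lsq_grad_wrt_scale(x, s1)         # coarse levels {-s1, 0, s1}
    g_s2 = lsq_grad_wrt_scale(residual, s2)  # fine adjustment near each coarse level
    return g_s1, g_s2

x = np.linspace(-1.5, 1.5, 7)
print(two_scale_pseudo_grads(x, s1=1.0, s2=0.3))
```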
  • model parameters and the ternarization parameters may be updated for each batch, which is a set of learning data.
  • KDLSQ-BERT uses ternary weights and 8-bit activation for the BERT model; by applying this embodiment to ternarize the activation as well, a speedup can be expected.
  • In KDLSQ-BERT, LSQ quantization is used to reduce the activation bit width; as will be described later, it has been confirmed that this embodiment improves accuracy compared to LSQ quantization.
  • The BERT parameters are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V of the query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th layer block; the linear transformation matrix W_l^O of the output applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
  • The KDLSQ-BERT model code has not been released. Therefore, in this application example, an additional implementation was made based on the code of the RoBERTa model provided by Huggingface (for example, see Reference 5).
  • In KDLSQ-BERT, a special ternarization is applied to the embedding matrix W_e by preparing quantization coefficients for each dimension; in this application example, W_e is not quantized for simplicity.
  • For the six linear layers having the weight matrices {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}, ternarization is performed according to Example 2.
  • Pre-training of each model was performed using the Wikitext-103 dataset (for example, see Reference 6).
  • The models to be trained are the non-quantized RoBERTa model (for example, see Reference 7), the additionally implemented KDLSQ-BERT model (8-bit activation), the KDLSQ-BERT model with ternary activation, and a model in which this embodiment is applied to the additional implementation.
  • In the following, these models are referred to as "RoBERTa", "KDLSQ-BERT (8-bit activation)", "KDLSQ-BERT (ternary activation)", and "this example", respectively.
  • the batch size was 64, the number of epochs was 12, and one commercially available general GPU was used.
  • The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training.
  • For optimization, Adam was used, and the dropout rate was 0.1.
  • KDLSQ-BERT performs distillation learning using a real-valued model as the teacher model (for example, see Non-Patent Document 2). Therefore, pre-training was not performed for KDLSQ-BERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 2, using Huggingface's RoBERTa model pre-trained on the Wikitext-103 dataset as the teacher model.
  • Table 3 below shows the results of evaluation experiments for each model under the above conditions.
  • The ppl is 17.24 for KDLSQ-BERT (ternary activation), whereas the ppl is 5.27 in this example; it can be seen that this example is superior to ternarization of the activation by LSQ quantization.
  • the evaluation targets were RoBERTa, KDLSQ-BERT (8-bit activation), KDLSQ-BERT (3-value activation), and this example.
  • The initial values of each model were obtained by the pre-training described above; the batch size was 16, the number of epochs was 3, and one commercially available general-purpose GPU was used. The learning rate was set to 2 × 10^-5 and was linearly decayed so that it became 0 at the final step of training. For optimization, Adam was used, and the dropout rate was 0.1.
  • KDLSQ-BERT performs distillation learning using a real-valued model as the teacher model (for example, see Non-Patent Document 2). Therefore, fine-tuning was not performed for KDLSQ-BERT; instead, distillation learning was performed with the same distillation loss as in Non-Patent Document 2, using Huggingface's RoBERTa model fine-tuned on each task of the GLUE dataset as the teacher model.
  • Table 4 below shows the results of evaluation experiments for each model under the above conditions.
  • this embodiment greatly suppresses the decrease in accuracy in each task.
  • The decrease in accuracy is suppressed to about 2 to 5% compared to RoBERTa; it can be said that this example succeeds in keeping the activation below 8 bits while largely maintaining the accuracy of the language model, which has been difficult until now.
  • Although this embodiment is inferior to KDLSQ-BERT (8-bit activation) in terms of accuracy, a speedup over KDLSQ-BERT (8-bit activation) can be expected because both the weights and the activations are ternary.
  • the inference device 10 can highly accurately ternarize the activation of a neural network model and perform inference of a predetermined task at high speed using the neural network model.
  • experiments were conducted with various natural language tasks using a Transformer language model represented by BERT as a neural network model, and it was confirmed that accuracy deterioration can be greatly suppressed compared to conventional methods.
  • (Appendix 1) A learning device including a memory and at least one processor connected to the memory, wherein the processor performs inference of a given task by a neural network model, ternarizes, using a plurality of ternary vectors, the activation representing the input to each layer constituting the neural network model, and learns model parameters of the neural network model and ternarization parameters for expressing the activation in three values.
  • (Appendix 2) The learning device according to Appendix 1, wherein the processor, for each layer, recursively and repeatedly calculates the scalar value and the ternary vector that minimize the distance between the activation and a sum of n products of a scalar value and a ternary vector (where n is a predetermined natural number), and learns the ternarization parameter by a moving average with the scalar value.
  • (Appendix 3) The learning device according to Appendix 1, wherein the processor, for each layer, ternarizes the activation representing the input to the layer by a quantization function having the ternarization parameter, and learns the ternarization parameter by error backpropagation using a pseudo gradient of the quantization function with respect to the ternarization parameter.
  • The ternarization parameter may be composed of a first ternarization parameter and a second ternarization parameter, in which case the processor obtains the first scalar value, the second scalar value, the first ternary vector, and the second ternary vector that minimize the distance between the activation and the sum of the product of a first scalar value and a first ternary vector having -1, 0, or +1 in each element and the product of a second scalar value and a second ternary vector having -1, 0, or +1 in each element; uses the first scalar value and the second scalar value as the initial values of the first ternarization parameter and the second ternarization parameter, respectively; ternarizes, for each layer, the activation representing the input to the layer by the sum of the product of the first ternarization parameter and the first ternary vector and the product of the second ternarization parameter and the second ternary vector; and learns the ternarization parameters using the pseudo gradient of the quantization function with respect to the first ternarization parameter and the pseudo gradient with respect to the second ternarization parameter.
  • (Appendix 5) An inference device including a memory and at least one processor connected to the memory, wherein the processor performs inference of a given task by a neural network model, and ternarizes, using a plurality of ternary vectors and learned ternarization parameters for expressing real values in ternary values, the activations representing the inputs to each layer constituting the neural network model.
  • A non-transitory storage medium storing a program executable by a computer to perform a learning process, wherein the learning process includes performing inference of a given task by a neural network model, ternarizing, using a plurality of ternary vectors, the activation representing the input to each layer constituting the neural network model, and learning model parameters of the neural network model and ternarization parameters for expressing the activation in three values.
  • A non-transitory storage medium storing a program executable by a computer to perform an inference process, wherein the inference process includes performing inference of a given task by a neural network model, and ternarizing, using a plurality of ternary vectors and learned ternarization parameters for expressing real values with ternary values, the activations representing the inputs to each layer constituting the neural network model.
  • Reference 1 Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, Heng Tao Shen. TBN: Convolutional Neural Network with Ternary Inputs and Binary Weights (ECCV 2018)
  • Reference 2 Fengfu Li, Bo Zhang, Bin Liu. Ternary Weight Networks (2016)
  • Reference 3 Lu Hou, James T. Kwok. LOSS-AWARE WEIGHT QUANTIZATION OF DEEP NETWORKS (2018)
  • Reference 4 Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Reference 5 huggingface, Internet ⁇ URL: https://github.com/huggingface/transformers>
  • Reference 6 Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. Pointer Sentinel Mixture Models (2016)
  • Reference 7 Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-anov, V.: RoBERTa: A Robustly Optimized BERT Pre-training Approach, CoRR, Vol.
  • 10 inference device, 101 input device, 102 display device, 103 external I/F, 103a recording medium, 104 communication I/F, 105 processor, 106 memory device, 107 bus, 201 ternarization unit, 202 inference unit, 203 learning unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

This learning device, according to one embodiment, includes: an inference unit that infers a prescribed task by a neural network model; a ternarization unit that uses a plurality of ternary vectors to ternarize activation representing input to each layer constituting the neural network model; and a learning unit that learns a model parameter of the neural network model and a ternarization parameter for expressing the activation with three values.

Description

Learning device, inference device, learning method, inference method, and program
 The present invention relates to a learning device, an inference device, a learning method, an inference method, and a program.
 In recent years, improvements in the performance of neural network models have attracted attention, but at the same time the number of parameters and the computational cost of neural network models have been increasing. For this reason, research aiming to make neural network models lighter and faster while suppressing deterioration in accuracy is attracting attention in both academia and industry. A neural network model includes a large number of linear transformations, and in particular the matrix operations performed during these linear transformations affect the computation time.
 As one line of research on making neural network models lighter and faster, methods that quantize the weights and activations of a neural network model are known. Here, quantization means approximating a float value, which is normally expressed by 32 bits, with a smaller number of bits (for example, 2 bits or 8 bits). Quantization is also called bit reduction. An activation is a vector input to each layer of the neural network model.
 For example, the methods described in Non-Patent Document 1 and Non-Patent Document 2 are known as conventional methods for quantizing the weights and activations of neural network models. Non-Patent Documents 1 and 2 both ternarize the weights of the language model called BERT (Bidirectional Encoder Representations from Transformers) and quantize its activations to 8 bits. Note that ternarization means approximating a real value by a product (or a sum of a plurality of products) of a scalar value and one of three integer values, as will be described later.
 If both the weights and the activations of a neural network model are ternarized (reduced to 2 bits), the model can be made even lighter and faster. However, in quantization there is a trade-off between maintaining accuracy and reducing size and increasing speed. In particular, it is difficult to maintain accuracy when the activation is expressed with low bits such as binary or ternary values.
 An embodiment of the present invention has been made in view of the above points, and aims to ternarize the activation of a neural network model with high accuracy.
 In order to achieve the above object, a learning device according to one embodiment includes an inference unit that infers a predetermined task using a neural network model; a ternarization unit that uses a plurality of ternary vectors to ternarize an activation representing the input to each layer constituting the neural network model; and a learning unit that learns model parameters of the neural network model and ternarization parameters for expressing the activation with ternary values.
 The activation of a neural network model can be ternarized with high accuracy.
 FIG. 1 is a diagram showing an example of the hardware configuration of an inference device according to this embodiment. FIG. 2 is a diagram showing an example of the functional configuration of the inference device at the time of inference. FIG. 3 is a diagram showing an example of the functional configuration of the inference device during learning. FIG. 4 is a flowchart showing an example of inference processing in Example 1. FIG. 5 is a flowchart showing an example of learning processing in Example 1. FIG. 6 is a flowchart showing an example of inference processing in Example 2. FIG. 7 is a diagram showing a comparison with the ternarization vector of a conventional method. FIG. 8 is a flowchart showing an example of learning processing in Example 2. FIG. 9 is a diagram showing a comparison with the pseudo gradient of a conventional method.
 An embodiment of the present invention will be described below. In this embodiment, an inference device 10 that ternarizes the activation of a neural network model with high accuracy and executes inference of a predetermined task using that neural network model will be described. When quantizing the activations of a neural network model, it is common to also quantize its weights, so in the following the weights are assumed to have already been quantized by a known quantization technique (for example, binarization, ternarization, etc.). Quantization of the weights has the effect of reducing model size, and when both weights and activations are quantized, a speedup can be expected by using a dedicated implementation of the matrix operations performed during linear transformations. In particular, with binarization and ternarization, significantly faster matrix operations using logical operations can be expected.
 Note that binarization refers to approximating a real value by a sum of n products of a scalar value and one of two integer values (for example, {-1, 1}), and ternarization refers to approximating a real value by a sum of n products of a scalar value and one of three integer values (for example, {-1, 0, 1}). Here, the scalar value is a real number greater than 0, and n is a natural number.
 <Hardware configuration>
 First, the hardware configuration of the inference device 10 according to this embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of the hardware configuration of the inference device 10 according to this embodiment.
 As shown in FIG. 1, the inference device 10 according to this embodiment is realized by the hardware configuration of a general computer or computer system, and includes an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a processor 105, and a memory device 106. These pieces of hardware are communicably connected via a bus 107.
 The input device 101 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 102 is, for example, a display. Note that the inference device 10 does not have to have at least one of the input device 101 and the display device 102.
 The external I/F 103 is an interface with an external device such as the recording medium 103a. The inference device 10 can read from and write to the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
 The communication I/F 104 is an interface for connecting the inference device 10 to a communication network. The processor 105 is, for example, one of various arithmetic units such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The memory device 106 is, for example, one of various storage devices such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory.
 The inference device 10 according to this embodiment has the hardware configuration shown in FIG. 1, and can thereby realize the inference processing and learning processing described later. Note that the hardware configuration shown in FIG. 1 is merely an example, and the inference device 10 may have another hardware configuration. For example, the inference device 10 may have multiple processors 105 and multiple memory devices 106.
 <Functional configuration>
 The inference device 10 according to this embodiment has two phases: learning and inference. During learning, the parameters of the neural network model (hereinafter also referred to as "model parameters") and the parameters for ternarizing the activation (hereinafter also referred to as "ternarization parameters") are learned. During inference, the learned model parameters and the learned ternarization parameters are used to ternarize the activations, and the neural network model performs inference.
 Note that the inference device 10 during learning may be called a "learning device" or the like. Also, the inference device 10 during learning and the inference device 10 during inference may be realized by different devices or systems.
 ≪During Inference≫
 The functional configuration of the inference device 10 at the time of inference will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the functional configuration of the inference device 10 during inference.
 As shown in FIG. 2, the inference device 10 at the time of inference has a ternarization unit 201 and an inference unit 202. Each of these units is realized by processing that one or more programs installed in the inference device 10 cause the processor 105 to execute.
 The ternarization unit 201 uses the learned ternarization parameters to ternarize the activation of each layer of the neural network model. That is, the ternarization unit 201 uses the learned ternarization parameters to create a ternarization vector by ternarizing each element of the real-valued vector input to each layer of the neural network model, and outputs this ternarization vector to the neural network model as the activation. Note that learned ternarization parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106.
 The inference unit 202 is realized by the neural network model, and uses the learned model parameters to infer a given task. That is, in each layer of the neural network model, the inference unit 202 receives the ternarization vector created by the ternarization unit 201 and, using the learned model parameters, outputs a real-valued vector as the output vector of the layer. At this time, the output of the final layer (output layer) of the neural network model (or the result of performing predetermined processing on it) becomes the inference result. Note that learned model parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106.
 ここで、ニューラルネットワークモデルは、一般に、複数の層で構成されており、各層にはニューロン(又は、ユニット、ノード等とも呼ばれる。)が存在する。また、各層の間にはニューロン同士の結合の強さを表す重みが存在する。ニューラルネットワークモデルの層が線形層(又は、全結合層)と呼ばれるものである場合、その層に存在するニューロンでは、以下の操作1~操作3が行われる。 Here, a neural network model is generally composed of multiple layers, and each layer has neurons (also called units, nodes, etc.). In addition, a weight representing the strength of connection between neurons exists between each layer. When the layer of the neural network model is called a linear layer (or a fully connected layer), the following operations 1 to 3 are performed in neurons existing in that layer.
 操作1:当該層に入力された入力ベクトル(活性化)の重み付き和を計算
 操作2:バイアス項を加算
 操作3:活性化関数(例えば、ReLu等)を計算
 このとき、ニューラルネットワークモデルの計算時間は、上記の操作1の重み付き和の計算に要する時間が支配的である。上記の操作1の重み付き和は、重みを表すベクトル(重みベクトル)をw=(w,w,・・・,wdim)、入力ベクトルをv=(v,v,・・・,vdim)とすると、vとwの内積、つまりw+・・・+wdimdimとなる。このため、大量の積演算w(k=1,・・・,dim)に要する時間を削減できれば、ニューラルネットワークモデルの計算時間を削減することが可能となる。なお、dimはベクトルの次元数である。
Operation 1: Calculate the weighted sum of the input vectors (activations) input to the layer Operation 2: Add the bias term Operation 3: Calculate the activation function (for example, ReLu, etc.) At this time, calculate the neural network model Time is dominated by the time required to compute the weighted sum in operation 1 above. The weighted sum of operation 1 above is obtained by w = (w 1 , w 2 , . , v dim ), the inner product of v and w, that is, w 1 v 1 + . . . +w dim v dim . Therefore, if the time required for a large number of multiplication operations w k v k (k=1, . . . , dim) can be reduced, the computation time of the neural network model can be reduced. Note that dim is the number of dimensions of the vector.
 例えば、参考文献1では、float値(32bit)同士の積演算wに比べて、2値と3値の積演算w(例えば、w∈{-1,1},v∈{-1,0,1})の方が理論的に40倍速く計算できることが示されている。なお、この例に限らず、重みと活性化の両方を低bit化した場合は操作1の積演算が高速化されることが期待できる。 For example, in Reference 1, compared to the product operation w k v k between float values (32 bits), the product operation w k v k between binary and ternary values (for example, w k ε{−1, 1}, v It has been shown that k ∈ {−1,0,1}) can be calculated theoretically 40 times faster. It should be noted that not only this example, but when both the weight and the activation are reduced in bits, it can be expected that the product calculation of the operation 1 will be speeded up.
 As described above, in the present embodiment the weights of the neural network model are quantized by a known quantization technique (in particular, quantization into a low-bit representation such as binarization or ternarization) and the activations are ternarized, so the computation time of the neural network model can be reduced.
  ≪学習時≫
 学習時における推論装置10の機能構成について、図3を参照しながら説明する。図3は、学習時における推論装置10の機能構成の一例を示す図である。
≪When learning≫
A functional configuration of the inference device 10 during learning will be described with reference to FIG. FIG. 3 is a diagram showing an example of the functional configuration of the inference device 10 during learning.
 図3に示すように、学習時における推論装置10は、3値化部201と、推論部202と、学習部203とを有する。これら各部は、学習時における推論装置10にインストールされた1以上のプログラムが、プロセッサ105に実行させる処理により実現される。 As shown in FIG. 3 , the inference device 10 during learning has a ternarization unit 201 , an inference unit 202 , and a learning unit 203 . Each of these units is realized by processing that one or more programs installed in the inference apparatus 10 at the time of learning cause the processor 105 to execute.
 3値化部201及び推論部202は、推論時と同様である。ただし、学習済みでない3値化パラメータ及び学習済みでないモデルパラメータをそれぞれ用いる点が推論時と異なる。なお、3値化パラメータはニューラルネットワークモデルの層毎に存在し、例えば、メモリ装置106等に格納されている。同様に、モデルパラメータはニューラルネットワークモデルの層毎に存在し、例えば、メモリ装置106等に格納されている。 The ternarization unit 201 and the inference unit 202 are the same as in inference. However, it differs from inference in that untrained ternarization parameters and untrained model parameters are used. Note that the ternarization parameter exists for each layer of the neural network model and is stored in the memory device 106 or the like, for example. Similarly, model parameters exist for each layer of the neural network model and are stored, for example, in the memory device 106 or the like.
 学習部203は、3値化部201が活性化を3値化する際に用いる3値化パラメータの学習と、推論部202を実現するニューラルネットワークモデルのモデルパラメータの学習とを行う。 The learning unit 203 learns the ternarization parameters used when the ternarization unit 201 ternarizes the activation, and learns the model parameters of the neural network model that implements the inference unit 202 .
 [Example 1]
 Example 1 of the present embodiment will be described below. In this example, the activation is approximated by n ternary vectors. That is, for example, if the activation (real-valued vector) of a certain layer of the neural network model is X = (x_1, x_2, ..., x_dim) and the ternarization parameter of that layer is A = (a_1, a_2, ..., a_n), this real-valued vector is approximated by the ternarized vector a_1B_1 + a_2B_2 + ... + a_nB_n. Here, a_i (i = 1, ..., n) is a scalar value satisfying a_i > 0, and B_i (i = 1, ..., n) is a vector whose elements each take one of the values -1, 0, and 1.
 なお、nの値が大きい方が活性化をより高精度に近似することが可能であるが、より多くのメモリ量と計算量が必要となる。 It should be noted that the larger the value of n, the more accurately the activation can be approximated, but the larger the amount of memory and the amount of calculation required.
 <実施例1における推論処理>
 推論時における推論装置10が実行する推論処理について、図4を参照しながら説明する。図4は、実施例1における推論処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みであるものとする。
<Inference processing in the first embodiment>
The inference processing executed by the inference device 10 during inference will be described with reference to FIG. FIG. 4 is a flowchart illustrating an example of inference processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
 ここで、図4のステップS101~ステップS106はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS101~ステップS106について説明する。 Here, steps S101 to S106 in FIG. 4 are repeatedly executed for each layer of the neural network model. Steps S101 to S106 for a certain layer of the neural network model will be described below.
 Step S101: The inference unit 202 receives the real-valued vector given to the neural network model or the real-valued vector output by the previous layer. Specifically, if the layer is the first layer (input layer), the inference unit 202 receives the real-valued vector given to the neural network model; otherwise, it receives the real-valued vector output by the previous layer. Note that the real-valued vector given to the neural network model is the inference target data of the task.
 Step S102: The ternarization unit 201 uses the learned ternarization parameters to ternarize the real-valued vector (that is, the activation) input in step S101 and creates a ternarized vector. At this time, the ternarization unit 201 creates the ternarized vector by procedures 1-1 to 1-3 below. In the following, the real-valued vector input in step S101 is X = (x_1, x_2, ..., x_dim), and the learned ternarization parameter of the layer is A = (a_1, a_2, ..., a_n). It is also assumed that a_i > 0 for each i = 1, ..., n.
 Procedure 1-1) First, the ternarization unit 201 sets X_0 = X and uses the following function for quantization:

Figure JPOXMLDOC01-appb-M000001

so that Q_a(X) = (Q_a(x_1), Q_a(x_2), ..., Q_a(x_dim)), where a is a scalar value.
 Then, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000002

and takes a_1B_1 as the first ternary vector. The ternarization unit 201 also sets X_1 = X_0 - a_1B_1. Here, X_1 can be regarded as the error that could not be approximated by the first ternary vector a_1B_1.
 Procedure 1-2) Next, the ternarization unit 201 recursively computes the i-th ternary vector a_iB_i (i ≥ 2). That is, when the ternary vectors up to a_1B_1, ..., a_{i-1}B_{i-1} have been obtained, the ternarization unit 201 sets X_{i-1} = X_{i-2} - a_{i-1}B_{i-1} and then sets

Figure JPOXMLDOC01-appb-M000003

The i-th ternary vector is then a_iB_i. In this way, n ternary vectors a_1B_1, ..., a_nB_n are obtained.
 Procedure 1-3) Finally, the ternarization unit 201 takes a_1B_1 + ... + a_nB_n as the ternarized vector. As a result, a ternarized vector a_1B_1 + ... + a_nB_n that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of n ternary vectors is obtained.
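 A minimal sketch of procedures 1-1 to 1-3 at inference time is shown below. The exact elementwise quantizer Q_a is given by the math formulas above and is not reproduced in the text, so the sketch assumes the common choice of rounding each element to the nearest point of {-a, 0, +a}; all names are illustrative.

```python
import numpy as np

def quantize_ternary(x, a):
    """Assumed form of Q_a: map each element of x to the nearest point in {-a, 0, +a}."""
    b = np.clip(np.round(x / a), -1, 1)   # ternary pattern B with entries in {-1, 0, +1}
    return a * b                          # a * B

def ternarize_inference(X, scales):
    """Approximate the activation X by a_1*B_1 + ... + a_n*B_n using the learned scales (a_1, ..., a_n)."""
    residual = np.asarray(X, dtype=float)
    approx = np.zeros_like(residual)
    for a in scales:                      # procedures 1-1 and 1-2: one ternary term per learned scale
        term = quantize_ternary(residual, a)
        approx += term
        residual -= term                  # error not captured by the i-th ternary vector
    return approx                         # procedure 1-3: the ternarized vector
```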
 Step S103: For each neuron included in the layer, the inference unit 202 uses the learned model parameters to calculate the weighted sum of the ternarized vector created in step S102. That is, for example, if the ternarized vector a_1B_1 + ... + a_nB_n is written as (v_1, v_2, ..., v_dim) and the weight vector between a neuron in the layer and the neurons in the previous layer is (w_1, w_2, ..., w_dim), the inference unit 202 calculates w_1v_1 + ... + w_dimv_dim. Note that the weights are included in the learned model parameters.
 ステップS104:推論部202は、当該層に含まれるニューロン毎に、学習済みモデルパラメータを用いて、上記のステップS103で計算された重み付き和に対してバイアス項を加算する。すなわち、例えば、当該ニューロンのバイアス項をbとすれば、推論部202は、w+・・・+wdimdim+bを計算する。なお、バイアス項は学習済みモデルパラメータに含まれる。 Step S104: The inference unit 202 adds a bias term to the weighted sum calculated in step S103 above using the learned model parameters for each neuron included in the layer. That is, for example, if the bias term of the neuron is b, the inference unit 202 calculates w 1 v 1 + . . . +w dim v dim +b. Note that the bias term is included in the learned model parameters.
 ステップS105:推論部202は、当該層に含まれるニューロン毎に、上記のステップS104の計算結果を用いて活性化関数を計算する。すなわち、例えば、当該ニューロンの活性化関数をσ(・)とすれば、推論部202は、σ(w+・・・+wdimdim+b)を計算する。 Step S105: The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S104. That is, for example, if the activation function of the neuron is σ(·), the inference unit 202 calculates σ(w 1 v 1 + . . . +w dim v dim +b).
 ステップS106:推論部202は、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。すなわち、例えば、j番目のニューロンの活性化関数値をx'とすれば、推論部202は、実数値ベクトルX'=(x',x',・・・,x'dim')を次の層に出力する。なお、dim'は次の層に出力される実数値ベクトルの次元数であり、当該層に含まれるニューロン数である。 Step S106: The inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer. That is, for example, if the activation function value of the j -th neuron is x'j, the inference unit 202 generates a real-valued vector X'=(x' 1 , x' 2 , . . . , x'dim' ) to the next layer. Note that dim' is the number of dimensions of the real-valued vector output to the next layer, and is the number of neurons included in the layer.
 以上のステップS101~ステップS106が各層毎に繰り返し実行され、最終層の出力ベクトル(又は、当該出力ベクトルに対して、タスクに応じた所定の処理を行った結果)が推論結果となる。 The above steps S101 to S106 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) is the inference result.
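 Putting steps S101 to S106 together, the per-layer inference loop can be sketched as follows, reusing the ternarize_inference helper sketched above; the layer data structure and the use of ReLU are assumptions for illustration.

```python
import numpy as np

def forward_pass(x, layers):
    """Steps S101-S106 repeated for each layer (sketch).

    layers: list of dicts with keys 'scales' (learned ternarization parameters of the layer),
    'W' (weights), and 'b' (bias) -- an illustrative structure, not the patent's data format.
    """
    v = np.asarray(x, dtype=float)
    for layer in layers:
        v3 = ternarize_inference(v, layer['scales'])  # step S102: ternarize the activation
        z = layer['W'] @ v3 + layer['b']              # steps S103-S104: weighted sum + bias
        v = np.maximum(z, 0.0)                        # step S105: activation function (ReLU assumed)
    return v                                          # final-layer output used for the inference result
```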
 <実施例1における学習処理>
 学習時における推論装置10が実行する学習処理について、図5を参照しながら説明する。図5は、実施例1における学習処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みでないものとする。
<Learning processing in the first embodiment>
A learning process executed by the inference device 10 during learning will be described with reference to FIG. FIG. 5 is a flowchart illustrating an example of learning processing according to the first embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
 ここで、図5のステップS201~ステップS207はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS201~ステップS207について説明する。 Here, steps S201 to S207 in FIG. 5 are repeatedly executed for each layer of the neural network model. Steps S201 to S207 for a certain layer of the neural network model will be described below.
 ステップS201:推論部202は、図4のステップS101と同様に、ニューラルネットワークモデルに与えられた実数値ベクトル又は1つ前の層の出力ベクトルである実数値ベクトルを入力する。なお、ニューラルネットワークモデルに与えられた実数値ベクトルとは、タスクの学習用データのことである。 Step S201: The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
 ステップS202:3値化部201は、3値化パラメータを用いて、上記のステップS201で入力された実数値ベクトル(つまり、活性化)を3値化し、3値化ベクトルを作成する。このとき、3値化部201は、以下の手順2-1~手順2-3により3値化ベクトルを作成する。なお、上記のステップS201で入力された実数値ベクトルをX=(x,x,・・・,xdim)とする。また、当該層の3値化パラメータをA=(a',a',・・・,a')とする。 Step S202: Using the ternarization parameter, the ternarization unit 201 ternarizes the real-valued vector (that is, activation) input in step S201 to create a ternarized vector. At this time, the ternarization unit 201 creates a ternarization vector according to procedures 2-1 to 2-3 below. Let X=(x 1 , x 2 , . . . , x dim ) be the real-value vector input in step S201. Also, let A=(a' 1 , a' 2 , . . . , a' n ) be the ternarization parameter of the layer.
 Procedure 2-1) First, the ternarization unit 201 sets X_0 = X and obtains a scalar value a and a ternary vector B that minimize |X_0 - aB|^2. Here, B is a vector whose elements each take one of the values -1, 0, and 1, and |·| is the L2 norm. However, |·| may be a distance other than the L2 norm.
 このとき、aとBは相互作用しており、厳密解を求めるのことは困難である(例えば、参考文献2等を参照)。そこで、本実施例では、参考文献3に記載されているalgorithm2を参考にニュートン法を用いて近似解を求める。これは、a又はBのいずれか一方が固定されたとき、|X-aB|を最小化する他方の値が計算可能であることを利用している。ただし、ニュートン法以外の他の手法により近似解を求めてもよい。 At this time, a and B interact with each other, and it is difficult to obtain an exact solution (for example, see Reference 2, etc.). Therefore, in this embodiment, an approximate solution is obtained using Newton's method with reference to algorithm 2 described in Reference 3. This takes advantage of the fact that when either a or B is fixed, the other value that minimizes |X 0 −aB| 2 can be calculated. However, an approximate solution may be obtained by a technique other than Newton's method.
 そして、3値化部201は、このようして求めたスカラー値aをaとすると共に3値ベクトルBをBとして、1つ目の3値ベクトルをaとする。また、3値化部201は、X=X-aとする。このとき、Xは、1つ目の3値ベクトルaで近似できなかった誤差と見做せる。 Then, the ternarization unit 201 sets the scalar value a obtained in this manner to a1, sets the ternary vector B to B1 , and sets the first ternary vector to a1B1 . Also, the ternarization unit 201 sets X 1 =X 0 -a 1 B 1 . At this time, X1 can be regarded as an error that could not be approximated by the first ternary vector a1B1 .
 Procedure 2-2) Next, the ternarization unit 201 recursively computes the i-th ternary vector a_iB_i (i ≥ 2). That is, when the ternary vectors up to a_1B_1, ..., a_{i-1}B_{i-1} have been obtained, the ternarization unit 201 sets X_{i-1} = X_{i-2} - a_{i-1}B_{i-1} and then obtains a scalar value a and a ternary vector B that minimize |X_{i-1} - aB|^2. An approximate solution can be obtained by the same technique as in procedure 2-1.
 そして、3値化部201は、このようにして求めたスカラー値aをaとすると共に3値ベクトルBをBとする。これにより、n個の3値ベクトルa,・・・,aが得られる。 Then, the ternarization unit 201 sets the scalar value a obtained in this manner to a i and the ternary vector B to B i . As a result, n ternary vectors a 1 B 1 , . . . , an B n are obtained.
 Procedure 2-3) Finally, the ternarization unit 201 takes a_1B_1 + ... + a_nB_n as the ternarized vector. As a result, a ternarized vector a_1B_1 + ... + a_nB_n that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of n ternary vectors is obtained.
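 Procedures 2-1 to 2-3 can be sketched as below. The patent refers to a Newton-method variant of algorithm 2 in Reference 3; since that algorithm is not reproduced here, the sketch uses a simple alternating scheme that exploits the same fact (with one of a or B fixed, the other minimizer is available in closed form). Names, the initial scale, and the number of iterations are assumptions.

```python
import numpy as np

def fit_ternary_term(X, num_iters=5):
    """Approximately minimize |X - a*B|^2 over a scalar a > 0 and a ternary vector B."""
    X = np.asarray(X, dtype=float)
    a = float(np.mean(np.abs(X))) + 1e-12          # simple initial scale (assumption)
    B = np.zeros_like(X)
    for _ in range(num_iters):
        B = np.clip(np.round(X / a), -1, 1)        # optimal B for fixed a (elementwise)
        denom = float(np.sum(B * B))
        if denom == 0.0:
            break
        a = float(np.dot(X, B)) / denom            # optimal a for fixed B (least squares)
    return a, B

def ternarize_training(X, n):
    """Procedures 2-1 to 2-3: greedy residual decomposition X ≈ a_1*B_1 + ... + a_n*B_n."""
    residual = np.asarray(X, dtype=float)
    scales = []
    approx = np.zeros_like(residual)
    for _ in range(n):
        a, B = fit_ternary_term(residual)
        scales.append(a)
        approx += a * B
        residual = residual - a * B                # X_i = X_{i-1} - a_i*B_i
    return scales, approx
```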
 Step S203: The learning unit 203 uses the scalar values a_1, ..., a_n obtained when the ternarized vector a_1B_1 + ... + a_nB_n was created in step S202 to update the ternarization parameter A = (a'_1, a'_2, ..., a'_n) of the layer.
 具体的には、学習部203は、各i=1,・・・,nに対して、a'=(1-γ)a'+γaによりa'を更新する。ここで、γは0<γ<1を満たすパラメータである。すなわち、学習部203は、移動平均によりa'を更新する。ただし、これは一例であって、他の方法により各層の3値化パラメータを更新してもよい。 Specifically, the learning unit 203 updates a' i by a' i =(1−γ)a' i +γa i for each i=1, . . . , n. Here, γ is a parameter that satisfies 0<γ<1. That is, the learning unit 203 updates a' i by moving average. However, this is only an example, and the ternarization parameter of each layer may be updated by other methods.
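 A sketch of this moving-average update (the value of gamma is an assumption):

```python
def update_ternarization_params(A_prime, A_batch, gamma=0.1):
    """a'_i <- (1 - gamma) * a'_i + gamma * a_i for each i = 1, ..., n."""
    return [(1.0 - gamma) * ap + gamma * a for ap, a in zip(A_prime, A_batch)]
```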
 Step S204: As in step S103 of FIG. 4, the inference unit 202 uses the model parameters to calculate, for each neuron included in the layer, the weighted sum of the ternarized vector created in step S202.
 Step S205: As in step S104 of FIG. 4, the inference unit 202 uses the model parameters to add, for each neuron included in the layer, the bias term to the weighted sum calculated in step S204.
 ステップS206:推論部202は、図4のステップS105と同様に、上記のステップS205の計算結果を用いて活性化関数を計算する。 Step S206: The inference unit 202 calculates an activation function using the calculation result of step S205 above, as in step S105 of FIG.
 ステップS207:推論部202は、図4のステップS106と同様に、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。 Step S207: As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
 ステップS208:上記のステップS201~ステップS207が最終層まで実行された場合、学習部203は、モデルパラメータを更新する。すなわち、学習部203は、既知の誤差逆伝播法によりloss関数の微分を計算し、その微分値を用いてモデルパラメータを更新する。 Step S208: When the above steps S201 to S207 are executed up to the final layer, the learning unit 203 updates the model parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters.
 Here, let Q be the function that ternarizes (quantizes) the activation X in step S202 (hereinafter also referred to as the "quantization function"). That is, Q(X) = a_1B_1 + ... + a_nB_n. The derivative of the quantization function Q is required in order to compute the derivative of the loss function by error backpropagation, but by its nature the derivative of Q is always 0, which would make it impossible to backpropagate the error. Therefore, in this example, a pseudo derivative of the quantization function Q is given by the STE (straight-through estimator) technique described in Reference 4. STE is one of the basic techniques used in quantization-aware learning.
 Specifically, as the pseudo derivative of the quantization function Q, for example,

Figure JPOXMLDOC01-appb-M000004

is given.
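 The exact pseudo derivative is the one shown in the formula above; a common way to realize an STE in practice (an illustration, not the patent's exact definition) is to pass the gradient straight through the quantizer, for example with the PyTorch detach trick:

```python
import torch

def ternarize_with_ste(x, quantize):
    """Forward pass uses the quantized value Q(x); backward pass treats dQ/dx as 1.

    `quantize` is any (non-differentiable) ternarization function operating on tensors.
    """
    q = quantize(x)
    return x + (q - x).detach()   # equals q in the forward pass, gradient 1 w.r.t. x in the backward pass
```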
 なお、モデルパラメータの更新は、学習用データの集合であるバッチ毎に行われてもよい。 Note that model parameters may be updated for each batch, which is a set of learning data.
 以上のステップS201~ステップS208が実行されることで、3値化パラメータとモデルパラメータが学習される。 By executing the above steps S201 to S208, the ternarization parameters and model parameters are learned.
 <適用例及び評価>
 以下では、本実施例を言語モデルに適用する場合の適用例とその評価について説明する。本適用例では、非特許文献1に記載されているTernaryBERTに対して本実施例を適用する。
<Application example and evaluation>
An application example in which this example is applied to a language model, and its evaluation, are described below. In this application example, this example is applied to TernaryBERT described in Non-Patent Document 1.
 TernaryBERTはBERTモデルの重みを3値化、活性化を8bit化しており、本実施例を適用して活性化も3値化することで高速化が期待できる。また、後述するように、従来手法と比較して精度が向上することも確認できた。 With TernaryBERT, the weights of the BERT model are ternary and the activation is 8-bit. By applying this embodiment and ternary activation, speedup can be expected. In addition, as will be described later, it was also confirmed that the accuracy was improved compared to the conventional method.
 ・Configuration of Transformer language models represented by BERT
 A Transformer language model represented by BERT consists of an embedding layer and L Transformer encoder blocks (L = 12 in this application example). The parameters of BERT are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V for query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th block (l is a lowercase L); the output linear transformation matrix W_l^O applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
 ・Weights ternarized in TernaryBERT
 TernaryBERT ternarizes the weights of one embedding layer whose weight matrix is W_e (the embedding matrices for segment and position are excluded) and of the six linear layers whose weight matrices are {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}.
 ・Re-implementation of TernaryBERT in this application example
 The model code of TernaryBERT is not publicly available. Therefore, in this application example, a re-implementation was made based on the code of the RoBERTa model provided by huggingface (see, for example, Reference 5). However, while TernaryBERT applies a special ternarization to the embedding matrix W_e that prepares a quantization coefficient for each dimension, W_e is not quantized in this application example for simplicity. That is, in this application example, the weights of the six linear layers {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2} are ternarized according to Example 1.
 ・Pre-training
 Each model was pre-trained using the Wikitext103 dataset (see, for example, Reference 6). The models to be trained were the non-quantized RoBERTa model (see, for example, Reference 7), the re-implemented TernaryBERT model (8-bit activation), a model that ternarizes the activation with the TernaryBERT technique unchanged, and models in which this example is applied to the re-implemented TernaryBERT with n = 1 and n = 2, respectively. Hereinafter, these models are referred to as "RoBERTa", "TernaryBERT (8-bit activation)", "TernaryBERT (ternary activation)", "this example (n=1)", and "this example (n=2)". Note that n is the number of ternary vectors used when approximating the activation with a ternarized vector.
 学習条件として、事前学習では、バッチサイズを64、エポック数を3とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, in the pre-learning, the batch size was 64, the number of epochs was 3, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 TernaryBERT performs distillation learning with a real-valued model as the teacher model (see, for example, Non-Patent Document 1). Therefore, no pre-training was performed for TernaryBERT itself; instead, the huggingface RoBERTa model pre-trained on the Wikitext103 dataset was used as the teacher model, and distillation learning was performed using the same distillation loss as in Non-Patent Document 1.
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表1に示す。 Table 1 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000005

 Here, 2* indicates that the activation is expressed as a sum of 2-bit values. Also, ppl is word perplexity. Qualitatively, ppl is a common evaluation metric corresponding to the reciprocal of the language model's "confidence in word prediction", and lower is better. In pre-training, the task is to mask some words in a sentence and predict the masked words from the surrounding words; the higher the confidence of the predicted words, the lower the ppl, and the lower the confidence, the higher the ppl.
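 For reference, the standard definition of word perplexity (the patent itself does not spell out the formula, so this is an assumption) is the exponential of the average negative log-likelihood over the N predicted tokens:

```latex
\mathrm{ppl} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N} \log p\,(w_t \mid \mathrm{context}_t)\right)
```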
 Comparing the cases where both the weights and the activations are ternary, the ppl is 1904 for TernaryBERT (ternary activation) and 26.93 for this example (n=1), showing that the method of this example is superior to the activation bit-reduction technique of TernaryBERT.
 また、活性化を近似する3値ベクトルを2つに増やした本実施例(n=2)ではpplが9.42であり、nを増やすことで精度低下を大幅に抑えることができている。これは、活性化を8bit化したTernaryBERT(8bit活性化)には及ばないものの、活性化を8bit未満にしつつ、今まで困難であった言語モデルの精度維持に成功しているといえる。更に、本実施例(n=2)では、重みと活性化の両方が3値化されているため、TernaryBERTよりも高速化が期待できる。 In addition, in this example (n=2) in which the number of ternary vectors approximating the activation is increased to two, the ppl is 9.42, and the decrease in accuracy can be greatly suppressed by increasing n. Although this is not as good as TernaryBERT (8-bit activation) with 8-bit activation, it can be said that it has succeeded in maintaining the accuracy of the language model, which has been difficult until now, while keeping the activation at less than 8-bit. Furthermore, in this embodiment (n=2), both weights and activations are ternarized, so higher speed than TernaryBERT can be expected.
 ・ダウンストリームタスクでの評価
 事前学習で学習した各モデルをGLUEデータセット(例えば、参考文献8等を参照)のQQPタスク及びSST-2タスクでファインチューニングした。
• Evaluation on downstream tasks Each model trained in pretraining was fine-tuned on the QQP task and SST-2 task on the GLUE dataset (see, for example, reference 8).
 評価対象はRoBERTa、TernaryBERT(8bit活性化)、TernaryBERT(3値活性化)、本実施例(n=2)とした。 The evaluation targets were RoBERTa, TernaryBERT (8-bit activation), TernaryBERT (three-level activation), and this example (n=2).
 学習条件として、各モデルの初期値は事前学習で学習したものとし、バッチサイズを16、エポック数を3とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, the initial value of each model was learned by prior learning, the batch size was 16, the number of epochs was 3, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 TernaryBERTでは実数値モデルを教師モデルとした蒸留学習を行う(例えば、非特許文献1等を参照)。そこで、TernaryBERTではファインチューニングを行わず、hugginfaceのRoBERTaモデルをGLUEデータセットの各タスクでファインチューニングしたものを教師モデルとして、非特許文献1と同様の蒸留lossを用いて蒸留学習を行った。 In TernaryBERT, distillation learning is performed using a real-valued model as a teacher model (for example, see Non-Patent Document 1, etc.). Therefore, fine-tuning was not performed in TernaryBERT, and distillation learning was performed using distilled loss similar to Non-Patent Document 1, using the RoBERTa model of hugginface fine-tuned in each task of the GLUE dataset as a teacher model.
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表2に示す。 Table 2 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000006

 Here, F1 was used as the evaluation metric for the QQP task and accuracy for the SST-2 task.
 With TernaryBERT (ternary activation), a language model could not be acquired in pre-training and learning also failed on each task, whereas with this example (n=2) the accuracy drop on each task is greatly suppressed.
 In addition, this example (n=2) stays within roughly a 5-7% accuracy drop even compared with the real-valued RoBERTa, so it can be said that the accuracy of the language model, which has been difficult to maintain until now, is successfully maintained while keeping the activation below 8 bits. Furthermore, although this example (n=2) is inferior in accuracy to TernaryBERT (8-bit activation), both the weights and the activations are ternarized, so it can be expected to be faster than TernaryBERT (8-bit activation).
 [Example 2]
 Example 2 of the present embodiment will be described below. In this example, the activation is approximated by the sum of two ternary vectors. That is, for example, if the activation (real-valued vector) of a certain layer of the neural network model is X = (x_1, x_2, ..., x_dim) and the ternarization parameter of that layer is S = (s_1, s_2), this real-valued vector is approximated by the ternarized vector s_1B_1 + s_2B_2. Here, s_1 and s_2 are scalar values satisfying s_1 > 2s_2 > 0, and B_i (i = 1, 2) is a vector whose elements each take one of the values -1, 0, and 1.
 なお、本実施例では、実施例1でn=2とした場合と比較して、より高精度に活性化を近似することが可能となる。 It should be noted that, in this embodiment, activation can be approximated with higher accuracy than in the case of n=2 in the first embodiment.
 <実施例2における推論処理>
 推論時における推論装置10が実行する推論処理について、図6を参照しながら説明する。図6は、実施例2における推論処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みであるものとする。
<Inference processing in the second embodiment>
The inference processing executed by the inference device 10 during inference will be described with reference to FIG. FIG. 6 is a flowchart illustrating an example of inference processing according to the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have already been learned.
 ここで、図6のステップS301~ステップS306はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS301~ステップS306について説明する。 Here, steps S301 to S306 in FIG. 6 are repeatedly executed for each layer of the neural network model. Steps S301 to S306 for a certain layer of the neural network model will be described below.
 ステップS301:推論部202は、図4のステップS101と同様に、ニューラルネットワークモデルに与えられた実数値ベクトル又は1つ前の層の出力ベクトルである実数値ベクトルを入力する。なお、ニューラルネットワークモデルに与えられた実数値ベクトルとは、タスクの推論対象データのことである。 Step S301: The inference unit 202 inputs a real-valued vector given to the neural network model or a real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is the inference target data of the task.
 Step S302: The ternarization unit 201 uses the learned ternarization parameters to ternarize the real-valued vector (that is, the activation) input in step S301 and creates a ternarized vector. At this time, the ternarization unit 201 creates the ternarized vector by procedures 3-1 to 3-2 below. In the following, the real-valued vector input in step S301 is X = (x_1, x_2, ..., x_dim), and the learned ternarization parameter of the layer is S = (s_1, s_2). Also, s_1 and s_2 are scalar values satisfying s_1 > 2s_2 > 0.
 Procedure 3-1) First, the ternarization unit 201 uses the following function for quantization:

Figure JPOXMLDOC01-appb-M000007

and sets

Figure JPOXMLDOC01-appb-M000008

Next, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000009

and takes s_1B_1 as the first ternary vector. Further, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000010

and takes s_2B_2 as the second ternary vector.
 Procedure 3-2) Then, the ternarization unit 201 takes s_1B_1 + s_2B_2 as the ternarized vector. As a result, a ternarized vector s_1B_1 + s_2B_2 that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of two ternary vectors is obtained.
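 A minimal sketch of procedures 3-1 and 3-2 is shown below. The exact quantization functions are given by the math formulas above and are not reproduced in the text, so nearest-point rounding onto {-s, 0, +s} is assumed here for illustration; all names are assumptions.

```python
import numpy as np

def ternary_pattern(x, s):
    """Assumed quantizer: nearest ternary pattern in {-1, 0, +1} at scale s."""
    return np.clip(np.round(np.asarray(x, dtype=float) / s), -1, 1)

def ternarize_two_scales(X, s1, s2):
    """X ≈ s1*B1 + s2*B2 with learned scales satisfying s1 > 2*s2 > 0."""
    X = np.asarray(X, dtype=float)
    B1 = ternary_pattern(X, s1)                # coarse ternary pattern
    B2 = ternary_pattern(X - s1 * B1, s2)      # fine adjustment of each coarse level by s2
    return s1 * B1 + s2 * B2
```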
 Step S303: As in step S103 of FIG. 4, the inference unit 202 uses the learned model parameters to calculate, for each neuron included in the layer, the weighted sum of the ternarized vector created in step S302.
 Step S304: As in step S104 of FIG. 4, the inference unit 202 uses the learned model parameters to add, for each neuron included in the layer, the bias term to the weighted sum calculated in step S303.
 ステップS305:推論部202は、図4のステップS105と同様に、当該層に含まれるニューロン毎に、上記のステップS304の計算結果を用いて活性化関数を計算する。 Step S305: The inference unit 202 calculates an activation function for each neuron included in the layer using the calculation result of step S304 above, as in step S105 of FIG.
 ステップS306:推論部202は、図4のステップS106と同様に、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。 Step S306: As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
 以上のステップS301~ステップS306が各層毎に繰り返し実行され、最終層の出力ベクトル(又は、当該出力ベクトルに対して、タスクに応じた所定の処理を行った結果)が推論結果となる。 The above steps S301 to S306 are repeatedly executed for each layer, and the output vector of the final layer (or the result of performing predetermined processing on the output vector according to the task) becomes the inference result.
 Here, this example is compared with a conventional technique called LSQ quantization (see, for example, Reference 9). In LSQ quantization, the activation is expressed by a single ternary vector. That is, written in a form comparable to this example, LSQ quantization expresses the activation X as s_1B_1.
 For example, if the ternarized vector expressing the activation X is

Figure JPOXMLDOC01-appb-M000011

then the i-th element of the ternarized vector of this example and the i-th element of the ternarized vector of LSQ quantization are as shown in FIG. 7. In FIG. 7, the vertical axis is the i-th element of the ternarized vector and the horizontal axis is the i-th element of the activation X.
 As shown in FIG. 7, in this example each of the three values given by s_1 is finely adjusted by s_2, which suppresses the accuracy loss more than LSQ quantization does. As described later, the learning process is given a pseudo gradient that can learn a ternarization parameter S = (s_1, s_2) such that each of the ternary values given by s_1 is finely adjusted by s_2.
 <実施例2における学習処理>
 学習時における推論装置10が実行する学習処理について、図8を参照しながら説明する。図8は、実施例2における学習処理の一例を示すフローチャートである。なお、以下では、3値化パラメータとモデルパラメータは学習済みでないものとする。
<Learning processing in the second embodiment>
The learning process executed by the inference device 10 during learning will be described with reference to FIG. FIG. 8 is a flowchart illustrating an example of learning processing in the second embodiment. In the following, it is assumed that the ternarization parameters and model parameters have not been learned.
 ここで、図8のステップS402~ステップS407はニューラルネットワークモデルの各層毎に繰り返し実行される。以下では、ニューラルネットワークモデルの或る層に関するステップS402~ステップS407について説明する。なお、ステップS401の初期化は1回のみ実行される。例えば、モデルパラメータと3値化パラメータの更新がバッチ毎に繰り返される場合、ステップS401の初期化は1バッチ目のみ実行され、2バッチ目以降では実行されない。 Here, steps S402 to S407 in FIG. 8 are repeatedly executed for each layer of the neural network model. Steps S402 to S407 relating to a certain layer of the neural network model will be described below. Note that the initialization in step S401 is executed only once. For example, when model parameters and ternarization parameters are repeatedly updated for each batch, the initialization in step S401 is performed only for the first batch, and not performed for the second and subsequent batches.
 ステップS401:学習部203は、ニューラルネットワークモデルに最初に与えられた学習用データを用いて、3値化パラメータを初期化する。具体的には、学習部203は、以下の手順4-1~手順4-3により3値化パラメータを初期化する。なお、ニューラルネットワークモデルに最初に与えられた学習用データが表す実数値ベクトルをXとする。 Step S401: The learning unit 203 initializes the ternarization parameter using the learning data first given to the neural network model. Specifically, the learning unit 203 initializes the ternarization parameter according to procedures 4-1 to 4-3 below. Let X be the real-valued vector represented by the learning data first given to the neural network model.
 手順4-1)まず、学習部203は、|X-sB|を最小化するようなスカラー値sと3値ベクトルBを求め、s=s,B=Bとする。ここで、Bは各要素が-1,0,1のいずれかを取るベクトルである。 Procedure 4-1) First, the learning unit 203 obtains a scalar value s and a ternary vector B that minimize |X−sB| 2 , and sets s 1 =s and B 1 =B. Here, B is a vector whose elements are either -1, 0, or 1.
 Procedure 4-2) Next, the learning unit 203 obtains a scalar value s and a ternary vector B that minimize |X - s_1B_1 - sB|^2, and sets s_2 = s and B_2 = B.
 手順4-3)そして、学習部203は、上記の手順4-1~手順4-2で求めた(s,s)を3値化パラメータSの初期値とする。これにより、X-(s+s)のL2ノルムを小さくような初期値を得ることができるため、近似誤差の小さい初期値で学習を開始させることができるようになる。 Procedure 4-3) Then, the learning unit 203 sets (s 1 , s 2 ) obtained in the procedures 4-1 to 4-2 as the initial values of the ternarization parameter S. As a result, it is possible to obtain an initial value that reduces the L2 norm of X−(s 1 B 1 +s 2 B 2 ), so that learning can be started with an initial value with a small approximation error.
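 A sketch of this initialization, reusing the fit_ternary_term helper sketched for Example 1 (that alternating scheme is a stand-in assumption for however |X - sB|^2 is actually minimized):

```python
def init_ternarization_params(X):
    """Procedures 4-1 to 4-3: initialize S = (s1, s2) by two successive residual fits."""
    s1, B1 = fit_ternary_term(X)             # procedure 4-1: fit s1*B1 to X
    s2, _ = fit_ternary_term(X - s1 * B1)    # procedure 4-2: fit s2*B2 to the residual
    return s1, s2                            # procedure 4-3: initial values of S
```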
 ステップS402:推論部202は、図4のステップS101と同様に、ニューラルネットワークモデルに与えられた実数値ベクトル又は1つ前の層の出力ベクトルである実数値ベクトルを入力する。なお、ニューラルネットワークモデルに与えられた実数値ベクトルとは、タスクの学習用データのことである。 Step S402: The inference unit 202 inputs the real-valued vector given to the neural network model or the real-valued vector that is the output vector of the previous layer, as in step S101 of FIG. Note that the real-valued vector given to the neural network model is task learning data.
 Step S403: The ternarization unit 201 uses the ternarization parameter S = (s_1, s_2) to ternarize the real-valued vector (that is, the activation) input in step S402 and creates a ternarized vector. At this time, the ternarization unit 201 creates the ternarized vector by procedures 5-1 to 5-2 below. In the following, the real-valued vector input in step S402 is X = (x_1, x_2, ..., x_dim).
 Procedure 5-1) First, the ternarization unit 201 uses the following function for quantization:
Figure JPOXMLDOC01-appb-M000012

and sets

Figure JPOXMLDOC01-appb-M000013

Next, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000014

and takes s_1B_1 as the first ternary vector. Further, the ternarization unit 201 sets

Figure JPOXMLDOC01-appb-M000015

and takes s_2B_2 as the second ternary vector.
 Procedure 5-2) Then, the ternarization unit 201 takes s_1B_1 + s_2B_2 as the ternarized vector. As a result, a ternarized vector s_1B_1 + s_2B_2 that approximates the activation X = (x_1, x_2, ..., x_dim) of the layer by the sum of two ternary vectors is obtained.
 Step S404: As in step S103 of FIG. 4, the inference unit 202 uses the model parameters to calculate, for each neuron included in the layer, the weighted sum of the ternarized vector created in step S403.
 Step S405: As in step S104 of FIG. 4, the inference unit 202 uses the model parameters to add, for each neuron included in the layer, the bias term to the weighted sum calculated in step S404.
 ステップS406:推論部202は、図4のステップS105と同様に、上記のステップS405の計算結果を用いて活性化関数を計算する。 Step S406: The inference unit 202 calculates an activation function using the calculation result of step S405 above, as in step S105 of FIG.
 ステップS407:推論部202は、図4のステップS106と同様に、当該層に含まれる各ニューロンの活性化関数値を要素とする出力ベクトル(実数値ベクトル)を次の層に出力する。 Step S407: As in step S106 of FIG. 4, the inference unit 202 outputs to the next layer an output vector (real-valued vector) whose elements are the activation function values of the neurons included in the layer.
 ステップS408:上記のステップS402~ステップS407が最終層まで実行された場合、学習部203は、モデルパラメータと3値化パラメータを更新する。すなわち、学習部203は、既知の誤差逆伝播法によりloss関数の微分を計算し、その微分値を用いてモデルパラメータと3値化パラメータを更新する。 Step S408: When the above steps S402 to S407 are executed up to the final layer, the learning unit 203 updates the model parameters and the ternarization parameters. That is, the learning unit 203 calculates the differential of the loss function by a known error backpropagation method, and uses the differential value to update the model parameters and the ternarization parameters.
 Here, let Q be the quantization function that ternarizes (quantizes) the activation X in step S403. That is, Q(X) = s_1B_1 + s_2B_2. As described above, the derivative of the quantization function Q is required in order to compute the derivative of the loss function by error backpropagation, but by its nature the derivative of Q is always 0, which would make it impossible to backpropagate the error. Therefore, in this example, a pseudo derivative of the quantization function Q is given.
 First, as the pseudo derivative of the quantization function Q required for learning (updating) the model parameters, for example,

Figure JPOXMLDOC01-appb-M000016

is given. As in Example 1, this is given by the STE technique described in Reference 4.
 Next, the pseudo derivative of the quantization function Q required for learning (updating) the ternarization parameter S = (s_1, s_2),

Figure JPOXMLDOC01-appb-M000017

is given. In this example, the pseudo gradient of LSQ quantization (see, for example, Reference 9), which expresses the activation with a single ternary vector, is taken as a reference and extended to the case where the activation is expressed by the sum of two ternary vectors.
 For ease of description, define

Figure JPOXMLDOC01-appb-M000018

Then, as the pseudo derivative of the quantization function Q required for learning the ternarization parameter S = (s_1, s_2),

Figure JPOXMLDOC01-appb-M000019

is given, where

Figure JPOXMLDOC01-appb-M000020

Figure JPOXMLDOC01-appb-M000021

Note that the value of the k-th dimension of the quantized activation is determined by the value of the k-th dimension of the activation. In addition, round(x) is a function that rounds x to the nearest integer, and Sign(x) is a sign function that returns 1 if 0 ≤ x and -1 if x < 0.
 Here, the pseudo gradient of LSQ quantization is compared with the pseudo gradient given in this example (the pseudo derivative of the quantization function Q required for learning the ternarization parameter S = (s_1, s_2)). Both are shown in FIG. 9. In FIG. 9, the vertical axis is the pseudo derivative of the k-th element of the ternarized vector with respect to the ternarization parameter, and the horizontal axis is the k-th element of the ternarized vector.
 As shown in FIG. 9, in this example each of the three values given by s_1 is finely adjusted by s_2. That is, near {s_1, 0, -s_1},

Figure JPOXMLDOC01-appb-M000022

is set to 0, and instead

Figure JPOXMLDOC01-appb-M000023

is used to finely adjust s_2.
 なお、モデルパラメータと3値化パラメータの更新は、学習用データの集合であるバッチ毎に行われてもよい。 Note that the model parameters and the ternarization parameters may be updated for each batch, which is a set of learning data.
 以上のステップS401~ステップS408が実行されることで、3値化パラメータとモデルパラメータが学習される。 By executing the above steps S401 to S408, the ternarization parameters and model parameters are learned.
 <適用例及び評価>
 以下では、本実施例を言語モデルに適用する場合の適用例とその評価について説明する。本適用例では、非特許文献2に記載されているKDLSQ-BERTに対して本実施例を適用する。
<Application example and evaluation>
An application example and its evaluation when this embodiment is applied to a language model will be described below. In this application example, this embodiment is applied to KDLSQ-BERT described in Non-Patent Document 2.
 KDLSQ-BERTはBERTモデルの重みを3値化、活性化を8bit化しており、本実施例を適用して活性化も3値化することで高速化が期待できる。また、KDLSQ-BERTでは活性化の低bit化にLSQ量子化が用いられており、後述するように、LSQ量子化と比較して精度が向上することも確認できた。 KDLSQ-BERT uses ternary weights and 8-bit activation for the BERT model, and by applying this embodiment to ternary activation, speedup can be expected. In addition, in KDLSQ-BERT, LSQ quantization is used for activation bit reduction, and as will be described later, it has been confirmed that accuracy is improved compared to LSQ quantization.
 ・Configuration of Transformer language models represented by BERT
 As described in Example 1, a Transformer language model represented by BERT consists of an embedding layer and L Transformer encoder blocks (L = 12 in this application example). The parameters of BERT are the embedding matrices W_e, W_s, and W_p for token, segment, and position; the linear transformation matrices W_lh^Q, W_lh^K, and W_lh^V for query, key, and value in the h-th head of the Multi-Head Attention (MHA) included in the l-th block (l is a lowercase L); the output linear transformation matrix W_l^O applied immediately after the MHA; and the linear transformation matrices W_l^1 and W_l^2 of the two-layer Feed-Forward Network (FFN).
 ・Weights ternarized in KDLSQ-BERT
 KDLSQ-BERT ternarizes the weights of one embedding layer whose weight matrix is W_e (the embedding matrices for segment and position are excluded) and of the six linear layers whose weight matrices are {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2}.
 ・Re-implementation of KDLSQ-BERT in this application example
 The model code of KDLSQ-BERT is not publicly available. Therefore, in this application example, a re-implementation was made based on the code of the RoBERTa model provided by huggingface (see, for example, Reference 5). However, while KDLSQ-BERT applies a special ternarization to the embedding matrix W_e that prepares a quantization coefficient for each dimension, W_e is not quantized in this application example for simplicity. That is, in this application example, the weights of the six linear layers {W_lh^Q}, {W_lh^K}, {W_lh^V}, {W_l^O}, {W_l^1}, and {W_l^2} are ternarized according to Example 2.
 ・Pre-training
 Each model was pre-trained using the Wikitext103 dataset (see, for example, Reference 6). The models to be trained were the non-quantized RoBERTa model (see, for example, Reference 7), the re-implemented KDLSQ-BERT model (8-bit activation), a model that ternarizes the activation with the KDLSQ-BERT technique unchanged, and a model in which this example is applied to the re-implemented TernaryBERT. Hereinafter, these models are referred to as "RoBERTa", "KDLSQ-BERT (8-bit activation)", "KDLSQ-BERT (ternary activation)", and "this example".
 学習条件として、事前学習では、バッチサイズを64、エポック数を12とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, in the pre-learning, the batch size was 64, the number of epochs was 12, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 KDLSQ-BERT performs distillation learning with a real-valued model as the teacher model (see, for example, Non-Patent Document 2). Therefore, no pre-training was performed for KDLSQ-BERT itself; instead, the huggingface RoBERTa model pre-trained on the Wikitext103 dataset was used as the teacher model, and distillation learning was performed using the same distillation loss as in Non-Patent Document 2.
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表3に示す。 Table 3 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000024

 Comparing the cases where both the weights and the activations are ternary, the ppl is 17.24 for KDLSQ-BERT (ternary activation) and 5.27 for this example, showing that the method of this example is superior to ternarization by LSQ quantization.
 また、活性化を近似する3値ベクトルを2つに増やしたことにより精度低下を大幅に抑えることができていることがわかる。これは、活性化を8bit化したKDLSQ-BERT(8bit活性化)には及ばないものの、活性化を8bit未満にしつつ、今まで困難であった言語モデルの精度維持に成功しているといえる。更に、本実施例では、重みと活性化の両方が3値化されているため、KDLSQ-BERTよりも高速化が期待できる。 In addition, it can be seen that by increasing the number of ternary vectors that approximate activation to two, the decrease in accuracy can be greatly suppressed. Although this is not as good as KDLSQ-BERT (8-bit activation) with 8-bit activation, it can be said that it succeeds in maintaining the accuracy of the language model, which has been difficult until now, while making the activation less than 8-bit. Furthermore, in this embodiment, both the weights and activations are ternarized, so higher speed than KDLSQ-BERT can be expected.
 ・ダウンストリームタスクでの評価
 事前学習で学習した各モデルをGLUEデータセット(例えば、参考文献8等を参照)の様々なタスク(MNLI、QQP、QNLI、SST-2、CoLA、STS-B、MRPC、RTE)でファインチューニングした。
・ Evaluation in downstream tasks Each model learned in pre-training is applied to various tasks (MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC , RTE).
 評価対象はRoBERTa、KDLSQ-BERT(8bit活性化)、KDLSQ-BERT(3値活性化)、本実施例とした。 The evaluation targets were RoBERTa, KDLSQ-BERT (8-bit activation), KDLSQ-BERT (3-value activation), and this example.
 学習条件として、各モデルの初期値は事前学習で学習したものとし、バッチサイズを16、エポック数を3とし、市販の一般的なGPUを1枚用いた。また、学習率は2×10-5とし、学習の最終ステップでは学習率が0となるように線形に減衰させた。最適化にはadamを用いて、Dropout率は0.1とした。 As learning conditions, the initial value of each model was learned by prior learning, the batch size was 16, the number of epochs was 3, and one commercially available general GPU was used. Also, the learning rate was set to 2×10 −5 and was linearly attenuated so that the learning rate became 0 at the final step of learning. For optimization, adam was used with a Dropout rate of 0.1.
 KDLSQ-BERTでは実数値モデルを教師モデルとした蒸留学習を行う(例えば、非特許文献2等を参照)。そこで、KDLSQ-BERTではファインチューニングを行わず、hugginfaceのRoBERTaモデルをGLUEデータセットの各タスクでファインチューニングしたものを教師モデルとして、非特許文献2と同様の蒸留lossを用いて蒸留学習を行った。 KDLSQ-BERT performs distillation learning using a real-valued model as a teacher model (for example, see Non-Patent Document 2, etc.). Therefore, KDLSQ-BERT does not perform fine tuning, and the training model is the RoBERTa model of hugginface fine-tuned for each task of the GLUE dataset, and distillation learning is performed using the same distillation loss as in Non-Patent Document 2. .
 以上の条件の下で、各モデルの評価実験を行った結果を以下の表4に示す。 Table 4 below shows the results of evaluation experiments for each model under the above conditions.
Figure JPOXMLDOC01-appb-T000025

 Here, following KDLSQ-BERT, MC (Matthews Correlation) was used as the evaluation metric for the CoLA task, F1 for the MRPC and QQP tasks, SC (Spearman Correlation) for the STS-B task, and accuracy for the other tasks.
 KDLSQ-BERT(3値活性化)と比較して、本実施例では各タスクでの精度低下が大幅に抑えられている。 Compared to KDLSQ-BERT (ternary activation), this embodiment greatly suppresses the decrease in accuracy in each task.
 In addition, on tasks other than CoLA and RTE, the accuracy drop is kept to roughly 2-5% even compared with RoBERTa, so it can be said that the accuracy of the language model, which has been difficult to maintain until now, is successfully maintained while keeping the activation below 8 bits. Furthermore, although this example is inferior in accuracy to KDLSQ-BERT (8-bit activation), both the weights and the activations are ternarized, so it can be expected to be faster than KDLSQ-BERT (8-bit activation).
 <まとめ>
 以上のように、本実施形態に係る推論装置10は、ニューラルネットワークモデルの活性化を高精度に3値化し、そのニューラルネットワークモデルにより所定のタスクの推論を高速に実行することができる。また、ニューラルネットワークモデルとしてBERT等に代表されるTransformer言語モデルを用いて様々な自然言語タスクで実験を行い、従来手法と比べて精度低下を大幅に抑えられることを確認した。
<Summary>
As described above, the inference device 10 according to the present embodiment can highly accurately ternarize the activation of a neural network model and perform inference of a predetermined task at high speed using the neural network model. In addition, experiments were conducted with various natural language tasks using a Transformer language model represented by BERT as a neural network model, and it was confirmed that accuracy deterioration can be greatly suppressed compared to conventional methods.
 本発明は、具体的に開示された上記の実施形態に限定されるものではなく、請求の範囲の記載から逸脱することなく、種々の変形や変更、既知の技術との組み合わせ等が可能である。 The present invention is not limited to the specifically disclosed embodiments described above, and various modifications, alterations, combinations with known techniques, etc. are possible without departing from the scope of the claims. .
 以上の実施形態に関し、更に以下の付記を開示する。 Regarding the above embodiments, the following additional remarks are disclosed.
 (付記1)
 メモリと、
 前記メモリに接続された少なくとも1つのプロセッサと、
 を含み、
 前記プロセッサは、
 ニューラルネットワークモデルにより所定のタスクの推論を行い、
 複数の3値ベクトルを用いて、前記ニューラルネットワークモデルを構成する各層への入力を表す活性化を3値化し、
 前記ニューラルネットワークモデルのモデルパラメータと、前記活性化を3値で表現するための3値化パラメータとを学習する、
 学習装置。
(Appendix 1)
memory;
at least one processor connected to the memory;
including
The processor
inference of a given task by a neural network model,
Using a plurality of ternary vectors, ternarize the activation representing the input to each layer constituting the neural network model;
learning model parameters of the neural network model and ternarization parameters for expressing the activation in three values;
learning device.
 (Appendix 2)
 The learning device according to Appendix 1, wherein the processor
 recursively and repeatedly calculates, for each layer, a scalar value and a ternary vector each element of which is one of -1, 0, and +1 so as to minimize a distance between the activation representing the input to the layer and a sum of n products of the scalar value and the ternary vector, where n is a predetermined natural number, and
 learns the ternarization parameter by a moving average with the scalar value.
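 As a concrete illustration of the procedure of Appendix 2, the following Python (PyTorch) sketch greedily fits n pairs of a scalar value and a ternary vector to an activation by alternating updates on the residual, and then tracks the ternarization parameter as a moving average of the fitted scalar. The threshold heuristic (0.7 times the mean absolute value), the number of alternating iterations, and the moving-average momentum are assumptions of this sketch and are not values given in the embodiment.

import torch

def ternary_decompose(x, n=2, iters=5):
    # Greedily approximate x by sum_i alpha_i * b_i with b_i in {-1, 0, +1}^d.
    residual = x.clone()
    alphas, codes = [], []
    for _ in range(n):
        # Initial ternary code from a threshold on the residual magnitude.
        delta = 0.7 * residual.abs().mean()
        b = torch.sign(residual) * (residual.abs() > delta).float()
        for _ in range(iters):
            # Given b, the distance-minimizing scalar is a least-squares fit.
            alpha = (residual * b).sum() / b.abs().sum().clamp(min=1.0)
            # Given alpha, re-pick the ternary code closest to the residual.
            b = torch.sign(residual) * (residual.abs() > alpha / 2).float()
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b  # fit the next term to what is left
    return alphas, codes

def update_ternarization_param(param, alpha, momentum=0.9):
    # Moving-average update of the ternarization parameter with the fitted
    # scalar value (the momentum value is an assumed hyperparameter).
    return momentum * param + (1.0 - momentum) * alpha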
 (Appendix 3)
 The learning device according to Appendix 1, wherein the processor
 ternarizes, for each layer, the activation representing the input to the layer with a quantization function having the ternarization parameter, and
 learns the ternarization parameter by error backpropagation using a pseudo gradient of the quantization function with respect to the ternarization parameter.
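 A minimal sketch of how a quantization function with a learnable ternarization parameter and a pseudo gradient might look is given below. The exact pseudo gradient of the embodiment is not reproduced here; the backward pass follows a learned-step-size-style estimator (cf. Reference 9) as an assumption, and s is assumed to be a scalar tensor.

import torch

class TernaryQuant(torch.autograd.Function):
    # Quantization function q(x; s) = s * clip(round(x / s), -1, +1)
    # with a learnable ternarization parameter s.

    @staticmethod
    def forward(ctx, x, s):
        ctx.save_for_backward(x, s)
        return s * torch.clamp(torch.round(x / s), -1.0, 1.0)

    @staticmethod
    def backward(ctx, grad_out):
        x, s = ctx.saved_tensors
        q = torch.clamp(torch.round(x / s), -1.0, 1.0)
        inside = (x / s).abs() <= 1.0
        # Straight-through estimate for the activation gradient.
        grad_x = grad_out * inside.float()
        # Pseudo gradient with respect to s (an LSQ-style assumption):
        # (q - x/s) inside the clipping range, and q (= +-1) outside it.
        grad_s = (grad_out * torch.where(inside, q - x / s, q)).sum()
        return grad_x, grad_s

# Usage: s is an nn.Parameter updated by ordinary backpropagation.
# x_q = TernaryQuant.apply(x, s)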
 (Appendix 4)
 The learning device according to Appendix 3, wherein
 the ternarization parameter is composed of a first ternarization parameter and a second ternarization parameter, and
 the processor
 calculates a first scalar value, a first ternary vector each element of which is one of -1, 0, and +1, a second scalar value, and a second ternary vector each element of which is one of -1, 0, and +1 so as to minimize a sum of a product of the first scalar value and the first ternary vector and a product of the second scalar value and the second ternary vector, and sets the first scalar value and the second scalar value as initial values of the first ternarization parameter and the second ternarization parameter, respectively,
 ternarizes, for each layer, the activation representing the input to the layer as a sum of a product of the first ternarization parameter and the first ternary vector and a product of the second ternarization parameter and the second ternary vector, with the quantization function having the ternarization parameters, and
 learns the ternarization parameters using a pseudo gradient of the quantization function with respect to the first ternarization parameter and a pseudo gradient of the quantization function with respect to the second ternarization parameter.
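 Under the same assumptions as the sketches above, the two-parameter scheme of Appendix 4 can be pictured as a forward pass that applies one ternary quantizer and then a second one to the residual; initial values for the two ternarization parameters could be taken from a decomposition such as ternary_decompose shown earlier, and each parameter would receive its own pseudo gradient through a TernaryQuant-style estimator. Whether the ternary vectors are recomputed at every step or fixed is not stated in this form, so the following is an illustration only.

import torch

def two_term_ternarize(x, s1, s2):
    # Forward sketch of the two-parameter ternarization: x is approximated
    # by s1 * b1 + s2 * b2 with b1, b2 in {-1, 0, +1}^d.
    b1 = torch.clamp(torch.round(x / s1), -1.0, 1.0)
    b2 = torch.clamp(torch.round((x - s1 * b1) / s2), -1.0, 1.0)
    return s1 * b1 + s2 * b2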
 (Appendix 5)
 An inference device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor
 performs inference of a predetermined task using a neural network model, and
 ternarizes, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
 (Appendix 6)
 A non-transitory storage medium storing a program executable by a computer to execute a learning process, the learning process comprising:
 performing inference of a predetermined task using a neural network model;
 ternarizing, using a plurality of ternary vectors, an activation representing an input to each layer constituting the neural network model; and
 learning a model parameter of the neural network model and a ternarization parameter for expressing the activation with ternary values.
 (Appendix 7)
 A non-transitory storage medium storing a program executable by a computer to execute an inference process, the inference process comprising:
 performing inference of a predetermined task using a neural network model; and
 ternarizing, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
[References]
Reference 1: Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, Heng Tao Shen. TBN: Convolutional Neural Network with Ternary Inputs and Binary Weights (ECCV2018)
Reference 2: Fengfu Li, Bo Zhang, Bin Liu. Ternary Weight Networks (2016)
Reference 3: Lu Hou, James T. Kwok. LOSS-AWARE WEIGHT QUANTIZATION OF DEEP NETWORKS (2018)
Reference 4: Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Reference 5: huggingface, Internet <URL: https://github.com/huggingface/transformers>
Reference 6: Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. Pointer Sentinel Mixture Models (2016)
Reference 7: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-anov, V.: RoBERTa: A Robustly Optimized BERT Pre-training Approach, CoRR, Vol. abs/1907.11692, (2019)
Reference 8: Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy & Samuel R. Bowman. GLUE: A MULTI-TASK BENCHMARK AND ANALYSIS PLATFORM FOR NATURAL LANGUAGE UNDERSTANDING (ICLR2019)
Reference 9: Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha. LEARNED STEP SIZE QUANTIZATION (ICLR2020)
 10    inference device
 101   input device
 102   display device
 103   external I/F
 103a  recording medium
 104   communication I/F
 105   processor
 106   memory device
 107   bus
 201   ternarization unit
 202   inference unit
 203   learning unit

Claims (8)

  1.  A learning device comprising:
     an inference unit that performs inference of a predetermined task using a neural network model;
     a ternarization unit that ternarizes, using a plurality of ternary vectors, an activation representing an input to each layer constituting the neural network model; and
     a learning unit that learns a model parameter of the neural network model and a ternarization parameter for expressing the activation with ternary values.
  2.  The learning device according to claim 1, wherein
     the ternarization unit recursively and repeatedly calculates, for each layer, a scalar value and a ternary vector each element of which is one of -1, 0, and +1 so as to minimize a distance between the activation representing the input to the layer and a sum of n products of the scalar value and the ternary vector, where n is a predetermined natural number, and
     the learning unit learns the ternarization parameter by a moving average with the scalar value.
  3.  The learning device according to claim 1, wherein
     the ternarization unit ternarizes, for each layer, the activation representing the input to the layer with a quantization function having the ternarization parameter, and
     the learning unit learns the ternarization parameter by error backpropagation using a pseudo gradient of the quantization function with respect to the ternarization parameter.
  4.  The learning device according to claim 3, wherein
     the ternarization parameter is composed of a first ternarization parameter and a second ternarization parameter,
     the learning device further comprises an initialization unit that calculates a first scalar value, a first ternary vector each element of which is one of -1, 0, and +1, a second scalar value, and a second ternary vector each element of which is one of -1, 0, and +1 so as to minimize a sum of a product of the first scalar value and the first ternary vector and a product of the second scalar value and the second ternary vector, and that sets the first scalar value and the second scalar value as initial values of the first ternarization parameter and the second ternarization parameter, respectively,
     the ternarization unit ternarizes, for each layer, the activation representing the input to the layer as a sum of a product of the first ternarization parameter and the first ternary vector and a product of the second ternarization parameter and the second ternary vector, with the quantization function having the ternarization parameters, and
     the learning unit learns the ternarization parameters using a pseudo gradient of the quantization function with respect to the first ternarization parameter and a pseudo gradient of the quantization function with respect to the second ternarization parameter.
  5.  An inference device comprising:
     an inference unit that performs inference of a predetermined task using a neural network model; and
     a ternarization unit that ternarizes, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
  6.  A learning method executed by a computer, the learning method comprising:
     an inference procedure of performing inference of a predetermined task using a neural network model;
     a ternarization procedure of ternarizing, using a plurality of ternary vectors, an activation representing an input to each layer constituting the neural network model; and
     a learning procedure of learning a model parameter of the neural network model and a ternarization parameter for expressing the activation with ternary values.
  7.  An inference method executed by a computer, the inference method comprising:
     an inference procedure of performing inference of a predetermined task using a neural network model; and
     a ternarization procedure of ternarizing, using a plurality of ternary vectors and a learned ternarization parameter for expressing a real value with ternary values, an activation representing an input to each layer constituting the neural network model.
  8.  A program that causes a computer to function as the learning device according to any one of claims 1 to 4, or as the inference device according to claim 5.
PCT/JP2021/019268 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program WO2022244216A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/019268 WO2022244216A1 (en) 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program
JP2023522144A JPWO2022244216A1 (en) 2021-05-20 2021-05-20

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/019268 WO2022244216A1 (en) 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program

Publications (1)

Publication Number Publication Date
WO2022244216A1 true WO2022244216A1 (en) 2022-11-24

Family

ID=84141178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/019268 WO2022244216A1 (en) 2021-05-20 2021-05-20 Learning device, inference device, learning method, inference method, and program

Country Status (2)

Country Link
JP (1) JPWO2022244216A1 (en)
WO (1) WO2022244216A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018016608A1 (en) * 2016-07-21 2018-01-25 株式会社デンソーアイティーラボラトリ Neural network apparatus, vehicle control system, decomposition device, and program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018016608A1 (en) * 2016-07-21 2018-01-25 株式会社デンソーアイティーラボラトリ Neural network apparatus, vehicle control system, decomposition device, and program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AMBAI, MITSURU ET AL.: "SPADE: Scalar Product Accelerator by Integer Decomposition for Object Detection", PROCEEDINGS OF EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV) 2014, vol. 8693, 2014, pages 267 - 281, XP047529520, Retrieved from the Internet <URL:https://link.springer.com/chapter/10.1007/978-3-319-10602-1_18> [retrieved on 20210629], DOI: 10.1007/978-3-319-10602-1_18 *
KUROMERU: "Frequently used properties of inverse matrix.", BUTSURINO KAGISHIPPO = PHYSICS KEY PROJECT, JP, XP009541578, Retrieved from the Internet <URL:https://hooktail.sub.jp/mathInPhys/inverseMatrix/> [retrieved on 20210630] *
RYOU ET AL.: "Matrix About row and column vectors. Oshiete! Goo.", OSHIETE, 26 May 2014 (2014-05-26), pages 1 - 10, XP093011104, Retrieved from the Internet <URL:https://oshiete.goo.ne.jp/qa/8610804.html> [retrieved on 20210630] *

Also Published As

Publication number Publication date
JPWO2022244216A1 (en) 2022-11-24

Similar Documents

Publication Publication Date Title
US11928600B2 (en) Sequence-to-sequence prediction using a neural network model
Xu et al. Alternating multi-bit quantization for recurrent neural networks
US11593655B2 (en) Predicting deep learning scaling
CN107679618B (en) Static strategy fixed-point training method and device
CN109785826B (en) System and method for trace norm regularization and faster reasoning for embedded models
Hwang et al. Fixed-point feedforward deep neural network design using weights +1, 0, and −1
US11593611B2 (en) Neural network cooperation
CN111414749B (en) Social text dependency syntactic analysis system based on deep neural network
US20210224447A1 (en) Grouping of pauli strings using entangled measurements
CN110728350A (en) Quantification for machine learning models
WO2020204904A1 (en) Learning compressible features
US11823054B2 (en) Learned step size quantization
CN115699029A (en) Knowledge distillation using back-propagation knowledge in neural networks
JP2017016384A (en) Mixed coefficient parameter learning device, mixed occurrence probability calculation device, and programs thereof
US20240061889A1 (en) Systems and Methods for Weighted Quantization
Rokh et al. A comprehensive survey on model quantization for deep neural networks
Zhu et al. Structurally sparsified backward propagation for faster long short-term memory training
KR20210099795A (en) Autoencoder-based graph construction for semi-supervised learning
JP2023545820A (en) Generative neural network model for processing audio samples in the filter bank domain
CN114267366A (en) Speech noise reduction through discrete representation learning
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
CN116171445A (en) On-line training of neural networks
WO2022244216A1 (en) Learning device, inference device, learning method, inference method, and program
Mishra CNN and RNN Using PyTorch
WO2023009740A1 (en) Contrastive learning and masked modeling for end-to-end self-supervised pre-training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21940830

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023522144

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21940830

Country of ref document: EP

Kind code of ref document: A1