KR102645267B1

KR102645267B1 - Quantization method for transformer encoder layer based on the sensitivity of the parameter and apparatus thereof

Info

Publication number: KR102645267B1
Application number: KR1020200183411A
Authority: KR
Inventors: 강유; 박태임; 조익현
Original assignee: 서울대학교산학협력단
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2024-03-07
Also published as: KR20220092043A

Abstract

A method of performing quantization in a neural network including a plurality of transformer encoder layers comprises: dividing the plurality of transformer encoder layers, based on the sensitivity of parameters, into groups each including at least one of the transformer encoder layers; and applying a quantization method determined for each of the divided groups.

Description

Method and apparatus for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters {QUANTIZATION METHOD FOR TRANSFORMER ENCODER LAYER BASED ON THE SENSITIVITY OF THE PARAMETER AND APPARATUS THEREOF}

The embodiments disclosed in this specification relate to a method and apparatus for quantizing the parameters of a transformer-based network and, more specifically, to a method and apparatus for performing quantization based on the sensitivity of the parameters.

With the advent of BERT (Bidirectional Encoder Representations from Transformers) in the field of artificial intelligence, very large models began to appear in natural language processing. BERT is a Transformer-based model that made pre-training and fine-tuning of large models possible in natural language processing, as in computer vision, and it has shown excellent performance on a variety of problems.

BERT is widely used in natural language processing, but the model uses many parameters, is very large, and requires a long inference time. Techniques for compressing BERT have therefore become necessary.

There are several ways to compress BERT, and quantization is one of them. Quantization represents a model using low-precision numbers, which can save a large amount of storage space and speed up inference.

Conventional BERT quantization methods do not consider the sensitivity of parameters and therefore cannot be called optimal. When BERT is quantized with a conventional method, two problems arise because parameter sensitivity is ignored. First, the sensitive parameters of BERT are compressed too aggressively, which degrades accuracy. Second, the insensitive parameters of BERT are not compressed as far as they could be, so BERT as a whole is not optimally compressed.

Therefore, a technique for solving the above problems is needed.

Meanwhile, the background art described above is technical information that the inventors possessed for deriving the present invention or acquired in the course of deriving it, and it is not necessarily known art disclosed to the general public before the filing of the present application.

An object of the embodiments disclosed herein is to provide a method and apparatus for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters.

As a technical means for achieving the above technical object, according to one embodiment, a method of performing quantization in a neural network including a plurality of transformer encoder layers may include: dividing the plurality of transformer encoder layers, based on the sensitivity of parameters, into groups each including at least one of the transformer encoder layers; and applying a quantization method determined for each of the divided groups.

According to another embodiment, an apparatus for performing quantization in a neural network including a plurality of transformer neural network layers includes a storage unit in which a program for performing quantization is stored and a control unit including at least one processor, wherein the control unit may divide the plurality of transformer encoder layers, based on the sensitivity of parameters, into groups each including at least one of the transformer encoder layers, and may apply a quantization method determined for each of the divided groups.

According to another embodiment, there is provided a computer-readable recording medium on which is recorded a program for causing a computer to execute a method of performing quantization in a neural network including a plurality of transformer encoder layers, wherein the quantization method, performed by an apparatus for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters, may include: dividing the plurality of transformer encoder layers, based on the sensitivity of parameters, into groups each including at least one of the transformer encoder layers; and applying a quantization method determined for each of the divided groups.

According to another embodiment, there is provided a computer program that is executed by an apparatus for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters and is stored in a recording medium to perform a quantization method, wherein the quantization method performed by the apparatus may include: dividing the plurality of transformer encoder layers, based on the sensitivity of parameters, into groups each including at least one of the transformer encoder layers; and applying a quantization method determined for each of the divided groups.

According to any one of the means described above, quantization is performed based on the sensitivity of the parameters, so storing the final model can consume far less space than the baseline model.

In addition, by quantizing the sensitive parts to 8 bits and the less sensitive parts to 1 bit, a quantization method can be provided that yields a small model without loss of accuracy.

In addition, three training methods are proposed to improve the performance of the 1-bit part, which further improves performance.

In addition, for fast inference, the XNOR-COUNT operation is applied to the 1-bit part and FP16 GEMM is applied to the 8-bit index part, which enables high-speed computation.
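To illustrate why an XNOR-and-count operation can replace multiplication for 1-bit values, the following minimal Python sketch computes the dot product of two vectors whose entries are +1 or -1. The function names and the bit-packing convention (a set bit means +1) are assumptions made for this illustration; the text does not describe how the operation is implemented.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer bitmask (bit i set means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_count_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors given as bitmasks."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two signs agree
    agree = bin(xnor).count("1")       # population count
    return 2 * agree - n               # agreements minus disagreements


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
result = xnor_count_dot(pack_signs(a), pack_signs(b), len(a))
print(reference, result)  # both are 2
```

Each agreeing position contributes +1 and each disagreeing position contributes -1, so the dot product equals twice the number of agreements minus the vector length.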

The effects obtainable from the disclosed embodiments are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by a person of ordinary skill in the art to which the disclosed embodiments belong.

FIG. 1 is an exemplary diagram illustrating a process of quantizing a plurality of transformer encoder layers based on the sensitivity of parameters.
FIG. 2 is a block diagram illustrating the configuration of an apparatus for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters according to an embodiment.
FIG. 3 is a flowchart illustrating a method of quantizing a plurality of transformer encoder layers based on the sensitivity of parameters according to an embodiment.
FIG. 4 is an exemplary diagram illustrating 8-bit index quantization.
FIG. 5 is an exemplary diagram illustrating the process by which LWFT is applied.

Various embodiments are described in detail below with reference to the accompanying drawings. The embodiments described below may be modified and implemented in many different forms. To describe the features of the embodiments more clearly, detailed descriptions of matters widely known to a person of ordinary skill in the art to which the embodiments belong are omitted. In the drawings, parts unrelated to the description of the embodiments are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a component is said to be "connected" to another component, this includes not only the case where it is directly connected but also the case where it is connected with another component in between. In addition, when a component is said to "include" another component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Before the embodiments are described, the meanings of the terms used below are defined.

Hereinafter, a 'neural network' may consist of an input layer, at least one hidden layer, and an output layer, and each layer may consist of at least one 'node'. The nodes of each layer may form connections with the nodes of the next layer.

A 'parameter' is a value that determines how strongly the data input to a node of each layer of the neural network is reflected when that data is passed to the next layer; it may be, for example, a weight, a kernel parameter, or an activation.

'Data' is the value input at each layer of the neural network.

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based neural network. The Transformer neural network follows the encoder-decoder structure of seq2seq and is implemented with attention alone, without using an RNN. BERT includes a plurality of transformer encoder layers, and each transformer encoder layer is pre-trained with MLM (Masked Language Model), which masks some words in a sentence and then predicts them, and NSP (Next Sentence Prediction), which predicts the next sentence, and is then fine-tuned for a specific task. Such tasks may include language inference and question answering.

According to one embodiment, BERT-base may include a WordPiece embedding layer and 12 transformer encoder layers, and each transformer encoder layer may include a self-attention layer and a feed-forward network (FFN). The WordPiece embedding layer performs embedding on the input.

In general, each self-attention layer and feed-forward network determines the final output through operations between the embedding of the natural language received as input and matrices including the key, value, and query matrices.

In general, the output of the feed-forward network used in BERT is given by Equation 1.

[Equation 1]

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

Here, $x$ denotes the input, $W_1$ and $W_2$ denote the weights, and $b_1$ and $b_2$ denote the biases. $\max(c, d)$ denotes the larger of $c$ and $d$.
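As a small worked instance of Equation 1, the following plain-Python sketch evaluates the feed-forward network on a toy input. The helper names and the tiny 2 x 3 and 3 x 1 weight matrices are made up for illustration; BERT's actual matrices are far larger.

```python
def row_times_matrix(x, W):
    """Multiply a row vector x (length n) by a matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, with max applied element-wise."""
    hidden = [max(0.0, h + b) for h, b in zip(row_times_matrix(x, W1), b1)]
    return [o + b for o, b in zip(row_times_matrix(hidden, W2), b2)]


x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, -1.0]]
b1 = [0.0, 1.0, 0.0]
W2 = [[1.0], [2.0], [3.0]]
b2 = [0.5]

out = ffn(x, W1, b1, W2, b2)
print(out)  # [8.5]
```

Here x W1 + b1 is [2, 3, -3]; the max with 0 zeroes the third entry, and the second layer gives 2*1 + 3*2 + 0*3 + 0.5 = 8.5.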

Quantization is one way to compress BERT. Quantization represents a model using low-precision numbers, which can save a large amount of storage space and speed up inference. In general, the parameters of BERT, including all weights and activations, use the FP32 (32-bit floating point) format. That is, all parameters of an ordinary BERT are expressed as 32-bit floating-point numbers, whereas the parameters of a quantized model may be expressed as low-precision numbers of 8, 4, 2, or 1 bits.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram illustrating a process of quantizing a plurality of transformer encoder layers based on the sensitivity of parameters.

Referring to FIG. 1, a process of quantizing a first neural network 10 including a plurality of transformer encoder layers into a second neural network 20 is shown. According to one embodiment, the first neural network 10 may be a pre-trained BERT. Some of the transformer encoder layers are quantized into MP encoder layers, and the remaining transformer encoder layers are quantized into 8-bit encoder layers. For example, of N transformer encoder layers, M transformer encoder layers are quantized into MP encoder layers and the other N - M transformer encoder layers are quantized into 8-bit encoder layers.

An 8-bit encoder layer may include a feed-forward network and a self-attention layer that are each quantized by applying 8-bit index quantization, and an MP encoder layer may include a feed-forward network quantized by applying 1-bit quantization and a self-attention layer quantized by applying 8-bit index quantization. Preferably, the embedding layer may also be quantized by applying 8-bit index quantization.

A specific method of quantizing a plurality of transformer encoder layers based on the sensitivity of parameters is described in detail below with reference to the other drawings.

FIG. 2 is a block diagram illustrating the configuration of an apparatus 200 for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters according to an embodiment.

Referring to FIG. 2, the apparatus 200 for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters according to an embodiment may include an input/output unit 210, a storage unit 220, and a control unit 230.

The input/output unit 210 according to an embodiment may include an input device for receiving input from a user and an output device for displaying information such as the result of a task or the state of the apparatus 200. For example, the input/output unit 210 may include an input unit that receives a data processing command and an output unit that outputs the result processed according to the received command. According to one embodiment, the input/output unit 210 may include user input means such as a keyboard, mouse, or touch panel, and output means such as a monitor or speaker.

Meanwhile, the storage unit 220 may store data for quantizing a plurality of transformer encoder layers based on the sensitivity of parameters. For example, it may store the data required to quantize a neural network. It may also store the various data and programs required to quantize a plurality of transformer encoder layers based on the sensitivity of parameters.

The control unit 230 controls the overall operation of the apparatus 200 and may include a processor such as a CPU. In particular, the control unit 230 may execute a program stored in the storage unit 220 or read data from it to quantize a plurality of transformer encoder layers based on the sensitivity of parameters. A specific method by which the control unit 230 does so is described in detail below with reference to the other drawings.

FIG. 3 is a flowchart illustrating a method of quantizing a plurality of transformer encoder layers based on the sensitivity of parameters according to an embodiment.

Referring to FIG. 3, the control unit 230 divides the plurality of transformer encoder layers, based on the sensitivity of parameters, into groups each including at least one transformer encoder layer (S310). After step S310, the control unit 230 applies the quantization method determined for each of the divided groups (S320).

When BERT is quantized with a conventional quantization method, two problems arise because parameter sensitivity is not considered. First, the sensitive parameters of BERT are compressed too aggressively, which degrades accuracy. Second, the insensitive parameters of BERT are not compressed as far as they could be, so BERT as a whole is not optimally compressed. To solve this, parameter sensitivity is taken into account: one part of BERT is quantized by applying 8-bit index quantization, and the remaining part is quantized by applying 1-bit quantization. Meanwhile, the LayerNorm layers and biases, which account for a very small portion of the overall model size, may be left unquantized.

Because the self-attention layer computes the relationships between input word embeddings and thus plays a decisive role in improving the accuracy of the model, the sensitivity of the parameters of the self-attention layer is higher than that of the parameters of the feed-forward network. In addition, because a transformer encoder layer close to the input extracts important low-dimensional features from the input and thus plays a decisive role in improving the accuracy of the model, the closer a transformer encoder layer is to the initial input, the higher the sensitivity of its parameters.

Because the parameters of transformer encoder layers closer to the initial input are more sensitive, and the parameters of the self-attention layer are more sensitive than those of the feed-forward network, the MP encoder layers may be placed close to the final output layer and the 8-bit encoder layers may be placed close to the initial input layer.

Accordingly, the control unit 230 divides the plurality of transformer encoder layers into a first group close to the initial input and a second group close to the final output.

The control unit 230 may assign a preset number of layers, taken in order of closeness to the initial input, to the first group and the remaining layers to the second group. For example, when there are five layers in total, the control unit 230 may assign the two layers closest to the initial input to the first group and the remaining three layers to the second group.
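The five-layer example above can be sketched in a few lines of Python. The function name `split_layers` and the convention that index 0 is the layer closest to the input are assumptions for illustration; the check that both groups are non-empty reflects the requirement that each group include at least one layer.

```python
def split_layers(num_layers, first_group_size):
    """Split layer indices (0 = closest to the input) into two groups."""
    if not 1 <= first_group_size < num_layers:
        raise ValueError("each group must contain at least one layer")
    layers = list(range(num_layers))
    return layers[:first_group_size], layers[first_group_size:]


first_group, second_group = split_layers(num_layers=5, first_group_size=2)
print(first_group)   # [0, 1]
print(second_group)  # [2, 3, 4]
```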

The control unit 230 quantizes the first group, which is close to the initial input, into 8-bit encoder layers, each including a feed-forward network and a self-attention layer that are both quantized by applying 8-bit index quantization. It quantizes the second group, which consists of the remaining layers, into MP encoder layers, each including a feed-forward network quantized by applying 1-bit quantization and a self-attention layer quantized by applying 8-bit index quantization.
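The per-group assignment described above can be written out as a simple plan table. This is a sketch only: the dictionary keys, the scheme labels, and the choice of 3 layers for the first group are hypothetical, and 12 layers is used because the text gives that count for BERT-base.

```python
def build_plan(num_layers, num_8bit_layers):
    """Return one entry per encoder layer describing its quantization scheme."""
    plan = []
    for idx in range(num_layers):
        in_first_group = idx < num_8bit_layers  # layers closest to the input
        plan.append({
            "layer": idx,
            "kind": "8-bit encoder" if in_first_group else "MP encoder",
            # Self-attention uses 8-bit index quantization in both groups.
            "self_attention": "8bit_index",
            # Only the feed-forward network of MP layers drops to 1 bit.
            "ffn": "8bit_index" if in_first_group else "1bit",
        })
    return plan


plan = build_plan(num_layers=12, num_8bit_layers=3)
for entry in plan:
    print(entry)
```

Keeping self-attention at 8 bits in every layer follows the stated observation that self-attention parameters are more sensitive than feed-forward parameters.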

FIG. 4 is an exemplary diagram illustrating 8-bit index quantization.

Referring to FIG. 4, 8-bit index quantization is performed in three steps: quantization, de-quantization, and a training step.

In forward propagation, the control unit 230 sequentially applies quantization and de-quantization to the weight matrix using an 8-bit index. This minimizes both the size of the model and the loss of accuracy. Here, de-quantization means expressing a number in the bit format it had before quantization.

In the present invention, an 8-bit index means the INT8 index format, in which a value is represented as one of the 256 integers in the range [-128, 127].

For example, in each layer the control unit 230 first quantizes the weight matrix in FP32 format into a matrix in INT8 index format, and then de-quantizes the matrix in INT8 index format back into a weight matrix in FP32 format.

The 8-bit index quantization and de-quantization are performed according to Equation 2 and Equation 3.

[Equation 2]

[Equation 3]

Here, the symbols in Equations 2 and 3 denote, in order: the weight matrix of the l-th layer quantized with the 8-bit index; the weight matrix of the l-th layer expressed in FP32 format; the minimum weight value in that FP32 weight matrix; the quantization weight scale of the l-th layer; the quantization scale; the minimum value representable by the quantization; and the de-quantized weight matrix of the l-th layer. Floor(x) is the function that returns the largest integer less than or equal to x. In particular, two of these quantities can be derived from Equations 4, 5, and 6.

[Equation 4]

[Equation 5]

[Equation 6]

Here, b denotes the number of quantization bits, and the other two symbols denote the maximum value representable by the quantization and the maximum weight value in the FP32 weight matrix. For example, in the 8-bit index quantization method, b is 8, the maximum representable value is 127, and the minimum representable value is -128.
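The formulas themselves do not appear in this text, so the following is only one plausible form that is consistent with the quantities defined above. All symbol names ($W^l$, $W_q^l$, $\hat{W}^l$, $w^l_{\min}$, $w^l_{\max}$, $s^l_w$, $s$, $q_{\min}$, $q_{\max}$) and the exact arrangement of terms are assumptions made for illustration, not the patent's own notation.

```latex
\begin{align*}
% quantization (assumed form)
W_q^l &= \mathrm{Floor}\!\left(\frac{W^l - w^l_{\min}}{s^l_w}\, s\right) + q_{\min} \\
% de-quantization (assumed form)
\hat{W}^l &= \frac{W_q^l - q_{\min}}{s}\, s^l_w + w^l_{\min} \\
% scales and representable range (assumed form)
s^l_w &= w^l_{\max} - w^l_{\min} \\
s &= q_{\max} - q_{\min} \\
q_{\max} &= 2^{\,b-1} - 1, \qquad q_{\min} = -2^{\,b-1}
\end{align*}
```

Under this form, $b = 8$ gives $q_{\max} = 127$, $q_{\min} = -128$, and $s = 255$, which matches the values stated above; a weight equal to $w^l_{\min}$ maps to $-128$ and a weight equal to $w^l_{\max}$ maps to $127$.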

For example, in the first transformer encoder layer of BERT, the weight matrix of the feed-forward network has dimension 768 × 3072, which means that it has 768 × 3072 parameters.

Referring to FIG. 4, the control unit 230 quantizes a 4 × 4 weight matrix into INT8 format according to Equation 2. That is, the control unit 230 quantizes every weight of the matrix expressed in FP32 format into the INT8 index format, in which each value is one of the 256 integers in the range [-128, 127]. The control unit 230 then de-quantizes the weight matrix quantized in INT8 format back into FP32 format according to Equation 3. After de-quantization, the control unit 230 performs the same forward propagation that an ordinary BERT performs.
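The quantize and de-quantize round trip on a 4 × 4 matrix can be sketched in plain Python. Because the formulas of Equations 2 and 3 are not available in this text, the min-max affine scaling used here is an assumption, as are the function names and the sample weights; only the Floor operation and the [-128, 127] index range come from the description.

```python
import math


def int_range(bits):
    """Smallest and largest signed integer representable with `bits` bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def quantize(W, bits=8):
    """Map FP weights to integer indices using Floor (assumed min-max scaling)."""
    q_min, q_max = int_range(bits)
    flat = [w for row in W for w in row]
    w_min, w_max = min(flat), max(flat)
    s_w = (w_max - w_min) or 1.0   # guard against a constant matrix
    s = q_max - q_min
    Wq = [[math.floor((w - w_min) / s_w * s) + q_min for w in row] for row in W]
    return Wq, w_min, w_max


def dequantize(Wq, w_min, w_max, bits=8):
    """Map integer indices back to floating-point values."""
    q_min, q_max = int_range(bits)
    s_w = (w_max - w_min) or 1.0
    s = q_max - q_min
    return [[(q - q_min) / s * s_w + w_min for q in row] for row in Wq]


W = [[0.12, -0.40, 0.33, 0.05],
     [-0.25, 0.48, -0.10, 0.27],
     [0.40, -0.50, 0.19, -0.02],
     [0.01, 0.36, -0.31, 0.50]]

Wq, w_min, w_max = quantize(W)
W_hat = dequantize(Wq, w_min, w_max)
print(Wq)
print(W_hat)
```

With this scaling, the smallest weight maps to -128, the largest to 127, and each de-quantized weight differs from the original by at most one quantization step, (w_max - w_min) / 255.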

In backward propagation, the control unit 230 may train the weights using standard gradient descent. Because the derivative of the Floor function used in quantization cannot be used to train the weights, the control unit 230 uses an 8-bit clip function in place of the Floor function. The 8-bit clip function is an approximation of the Floor function. That is, in backward propagation the control unit 230 may train the weights using the 8-bit clip function.

Standard gradient descent uses Equation 7, and the 8-bit clip function is derived from Equation 8.

[Equation 7]

[Equation 8]
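Equations 7 and 8 do not appear in this text, so the following Python sketch only illustrates the idea described above: the forward pass uses Floor, while the backward pass substitutes a clip whose derivative is 1 inside the 8-bit range and 0 outside it. The clip bounds, the function names, and the update rule w ← w − lr · g are assumptions for illustration.

```python
Q_MIN, Q_MAX = -128.0, 127.0  # INT8 index range stated in the text


def clip8(x):
    """Clamp x to the 8-bit index range (stand-in for Floor in the backward pass)."""
    return max(Q_MIN, min(Q_MAX, x))


def clip8_grad(x):
    """Derivative of clip8: 1 inside the range, 0 outside."""
    return 1.0 if Q_MIN <= x <= Q_MAX else 0.0


def sgd_step(weights, upstream_grads, pre_floor_values, lr):
    """One gradient-descent step, w <- w - lr * g * clip8'(u)."""
    return [w - lr * g * clip8_grad(u)
            for w, g, u in zip(weights, upstream_grads, pre_floor_values)]


updated = sgd_step(
    weights=[0.5, -0.2],
    upstream_grads=[1.0, 1.0],
    pre_floor_values=[10.3, 300.0],  # second value lies outside [-128, 127]
    lr=0.1,
)
print(updated)
```

The first weight receives the full gradient because its pre-Floor value lies inside the range, while the second weight is left unchanged because the clip derivative there is 0.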

After training is completed, the control unit 230 may store the maximum and minimum values of the quantized weight matrix and the maximum and minimum values of the weight matrix in INT8 format. Because each stored weight then occupies 8 bits instead of 32, the size of the model can be reduced to 1/4.
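The 1/4 figure follows directly from the bit widths. As a quick arithmetic check, the snippet below uses the 768 × 3072 feed-forward weight matrix mentioned earlier; ignoring the few extra minimum and maximum values stored per matrix is a simplifying assumption of this sketch.

```python
rows, cols = 768, 3072
num_params = rows * cols          # 2,359,296 weights

fp32_bytes = num_params * 4       # 32 bits = 4 bytes per weight
int8_bytes = num_params * 1       # 8 bits = 1 byte per weight
ratio = int8_bytes / fp32_bytes

print(num_params, fp32_bytes, int8_bytes, ratio)
```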

이하에서는 본 발명의 일 실시예인 1비트 양자화 방법에 대해 설명한다.Below, a 1-bit quantization method, which is an embodiment of the present invention, will be described.

1비트 양자화는 가중치와 액티베이션(Activation)을 2개의 숫자로 표현하는 양자화를 의미한다. 정보가 손상되지 않은 상태(Intact)에서 인덱스만 압축되는 8 비트 인덱스 양자화와 다르게, 1비트 양자화는 정보의 손실이 클 수 있다. 1-bit quantization refers to quantization that expresses weight and activation as two numbers. Unlike 8-bit index quantization, in which only the index is compressed while the information is intact (intact), 1-bit quantization may result in significant information loss.

In forward propagation, the control unit 230 may quantize all weights and activations to -1 or +1 using the sign function.

The sign function used in 1-bit quantization is given by Equation 9.

[Equation 9]

Here, the arguments of the sign function are the activation and the weight in FP32 format, respectively, and the results are the quantized activation and the quantized weight, respectively.

In backward propagation, the derivative of the sign function is 0 almost everywhere, so standard gradient descent cannot be used directly in the training stage. Therefore, in backward propagation, the control unit 230 trains the weights using a 1-bit clip function, as given by Equations 10 and 11.

[Equation 10]

[Equation 11]

Here, L denotes the loss function of the model, and the remaining symbols denote the learning rate and the weight matrix quantized to 1 bit. The indicator term produced by the clip function takes the value 1 when its argument lies in the range [-1, 1] and the value 0 otherwise.
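As a non-limiting illustration, the sign-based forward pass and the clip-based backward pass may be sketched as follows. The sketch assumes that sign(0) is taken as +1 and that the clip indicator is evaluated on the FP32 weight; neither detail is confirmed by the text above, and the function names are hypothetical.

```python
def sign(x):
    """Forward pass: binarize a value to -1 or +1 (sign(0) = +1 is assumed)."""
    return 1.0 if x >= 0 else -1.0


def clip_grad(x, upstream):
    """Backward pass: let the gradient through only where x lies in [-1, 1]."""
    return upstream if -1.0 <= x <= 1.0 else 0.0


def sgd_step(weights, grads_wrt_binary, lr):
    """Update FP32 weights from gradients taken w.r.t. the binarized weights."""
    return [w - lr * clip_grad(w, g) for w, g in zip(weights, grads_wrt_binary)]
```

For example, `sgd_step([0.5, 1.5], [1.0, 1.0], 0.25)` updates only the first weight, because the second lies outside [-1, 1] and its gradient is masked to 0.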

The gradient term in the above equations is obtained from Equation 12 by the chain rule.

[Equation 12]

Here, the symbols of Equation 12 denote, in order, the activation matrix in FP32 format, the simplified mid term of the activation matrix, and the derivative of the LayerNorm layer and the GeLU activation function.

In addition, to reduce the accuracy loss caused by 1-bit quantization, the control unit 230 may perform Absolute Binary Weight Regularization (ABWR) in the training stage, which trains the absolute value of each parameter to be close to 1.

95% of the weights of a pre-trained BERT lie in the range [-0.07, 0.07], which is far from ±1. This is one of the main reasons why accuracy drops sharply when 1-bit quantization is applied.

In this regard, the control unit 230 may introduce a regularizer so that the absolute values of the FP32-format weights approach +1 in the training stage. The intuition behind ABWR is to train the weights so that their absolute values are close to +1 from the outset, thereby minimizing the accuracy drop when 1-bit quantization is applied.

Absolute binary weight regularization is given by Equation 13.

[Equation 13]

Here, the symbol defined in Equation 13 denotes the ABWR term. In the training stage, adding the ABWR term to the loss function performs absolute binary weight regularization and minimizes the accuracy drop of 1-bit quantization.

In this regard, the control unit 230 may add the ABWR term to the loss function so that the absolute values of the FP32-format weights approach +1 in the training stage.
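As a non-limiting illustration, adding a regularization term that pulls each weight magnitude toward 1 may be sketched as follows. The exact ABWR term is the one defined by Equation 13; the mean-squared form `(1 - |w|)^2`, the `strength` coefficient, and the function names below are hypothetical stand-ins chosen only to show how such a term enters the loss.

```python
def abwr_penalty(weights, strength=1.0):
    """Hypothetical ABWR-style penalty: mean squared distance of |w| from 1."""
    return strength * sum((1.0 - abs(w)) ** 2 for w in weights) / len(weights)


def regularized_loss(task_loss, weights, strength=1.0):
    """Task loss plus the penalty, as minimized during training."""
    return task_loss + abwr_penalty(weights, strength)
```

Under this assumed form, weights near ±0.07 incur a penalty close to 1 per weight, whereas weights near ±1 incur almost none, so minimizing the combined loss moves the weights toward values that binarize with less error.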

In addition, to prevent the knowledge previously learned in FP32 format from being forgotten, the control unit 230 may perform Prioritized Training (PT) before performing 1-bit quantization, so that the FP32-format weights learn the binary features of the input. In Prioritized Training, the neural network expressed in FP32 format first learns the features of the binarized input, the weights are then quantized by 1-bit quantization, and the 1-bit quantized weights then learn the binarized features.

Conventional 1-bit quantization quantizes the activations and the weights immediately and then trains the weights. PT, by contrast, first trains the FP32-format weights and then applies 1-bit quantization, while the activations remain 1-bit quantized from the beginning. That is, PT first trains the FP32-format neural network on the binary features of the input so that it learns them better, and afterwards applies 1-bit quantization to the neural network.

Unlike conventional 1-bit quantization, PT adds an intermediate step of training the FP32-format weights, so that the weights learn the binary features of the input better.
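As a non-limiting illustration, the two phases of PT may be sketched with a single linear layer whose activations are always binarized and whose weights are binarized only in the second phase. The layer shape, the names `linear` and `PT_PHASES`, and the choice sign(0) = +1 are assumptions made for this sketch.

```python
def sign(x):
    """Binarize a value to -1 or +1 (sign(0) = +1 is assumed)."""
    return 1.0 if x >= 0 else -1.0


def linear(inputs, weights, binarize_weights):
    """Linear layer with 1-bit activations; weights are FP32 or 1-bit."""
    activations = [sign(x) for x in inputs]  # binarized in both phases
    if binarize_weights:
        weights = [[sign(w) for w in row] for row in weights]
    return [sum(a * w for a, w in zip(activations, row)) for row in weights]


# Phase order assumed from the description: FP32 weights first, 1-bit weights second.
PT_PHASES = (
    ("learn binary input features with FP32 weights", False),
    ("continue training with 1-bit weights", True),
)
```

A training loop would call `linear(..., binarize_weights=flag)` with the flag of the current phase, so the FP32 weights see binarized activations before they are themselves binarized.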

In addition, in the training stage of 1-bit quantization, the control unit 230 may apply Layer-Wise Fine-Tuning (LWFT). This addresses the accuracy degradation caused by the knowledge loss that occurs when too many layers are 1-bit quantized at once.

LWFT proceeds as follows. The control unit 230 first quantizes one of the plurality of transformer encoder layers into an MP encoder layer, and then quantizes the following layers into MP encoder layers one after another. During this process, no parameter of the plurality of transformer encoder layers is frozen. The intuition behind LWFT is to put the MP encoder layers in place one layer at a time rather than training the final quantized neural network all at once. This effectively avoids the inability to train that arises when the final quantized neural network is trained all at once.
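As a non-limiting illustration, the LWFT procedure described above may be sketched as a schedule that converts one more layer at each stage and fine-tunes with every layer still trainable. The `fine_tune` callback, its keyword arguments, and the order in which layers are passed in are hypothetical; the text does not fix them.

```python
def lwft_schedule(mp_layers):
    """Yield, stage by stage, the layers converted to MP encoder layers so far."""
    converted = []
    for layer in mp_layers:
        converted.append(layer)
        yield list(converted)


def run_lwft(all_layers, mp_layers, fine_tune):
    """Convert one layer per stage; no layer is frozen during fine-tuning."""
    for converted in lwft_schedule(mp_layers):
        fine_tune(quantized=converted, trainable=list(all_layers))
```

For instance, with twelve layers and `mp_layers=[9, 10, 11]`, the callback runs three times, with one, two, and then three MP encoder layers in place.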

FIG. 5 is an exemplary diagram for explaining the process in which LWFT is applied.

Referring to FIG. 5, 8-bit index quantization has been applied to the 8-bit encoder layers and to the self-attention layers of the MP encoder layers. FIG. 5 also shows a neural network in which one MP encoder layer has already been quantized proceeding to quantize a second MP encoder layer. The control unit 230 may complete the quantization of the MP encoder layers layer by layer. That is, the control unit 230 may complete 1-bit quantization sequentially, one layer at a time, for the MP encoder layers that require 1-bit quantization.

In this regard, the control unit 230 may apply 1-bit quantization sequentially, layer by layer, to the feed-forward networks of the second group, which is close to the final output.
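As a non-limiting illustration, the assignment of quantization methods to the two groups may be sketched as a per-layer plan. The function name, the dictionary keys, and the example sizes (12 layers, 9 in the first group) are hypothetical; the text only states that a preset number of layers counted from the input forms the first group.

```python
def plan_quantization(num_layers, first_group_size):
    """Return the quantization assigned to each encoder layer, input side first."""
    plan = []
    for index in range(num_layers):
        in_first_group = index < first_group_size
        plan.append({
            "layer": index,
            "type": "8-bit encoder" if in_first_group else "MP encoder",
            # Self-attention uses 8-bit index quantization in both groups.
            "self_attention": "8-bit index",
            # Only the second group binarizes its feed-forward network.
            "feed_forward": "8-bit index" if in_first_group else "1-bit",
        })
    return plan
```

Calling `plan_quantization(12, 9)` marks layers 0 to 8 as 8-bit encoder layers and layers 9 to 11 as MP encoder layers.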

In addition, to increase inference speed, the control unit 230 may apply XNOR-Count GEMM and FP16 GEMM to the 1-bit and 8-bit portions of the neural network, respectively. Here, GEMM (GEneral Matrix Multiplication) refers to matrix multiplication.

For the 1-bit quantized portion, XNOR-Count GEMM may be used in place of ordinary GEMM. XNOR-Count GEMM is a method for fast matrix multiplication on matrices quantized to 1 bit. XNOR-Count GEMM can be used only in binarized neural networks and requires far less computation than conventional GEMM.

XNOR-Count GEMM is given by Equation 14.

[Equation 14]

Here, the symbols of Equation 14 denote, in order, the element in row i and column j of the activation matrix of the (l+1)-th layer, row i of the activation matrix of the l-th layer, and column j of the l-th weight matrix.

XNOR-Count GEMM first performs an XNOR operation between a row of the activation matrix and a column of the weight matrix and counts the number of 1s in the output vector, and then obtains the value of the element by subtracting the dimension of the input vector from twice the number of 1s in the output vector.
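As a non-limiting illustration, the computation of a single output element by XNOR and counting may be sketched as follows. The bit convention (bit 1 for +1, bit 0 for -1) and the function names are assumptions made for this sketch.

```python
def pack_bits(values):
    """Pack a vector of -1/+1 values into an integer (bit 1 for +1, bit 0 for -1)."""
    word = 0
    for position, value in enumerate(values):
        if value > 0:
            word |= 1 << position
    return word


def xnor_count_dot(a_bits, b_bits, n):
    """Dot product of two packed -1/+1 vectors of dimension n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask  # bit is 1 where the two signs agree
    ones = bin(xnor).count("1")
    return 2 * ones - n  # agreements minus disagreements
```

The result equals the ordinary dot product because each agreeing pair contributes +1 and each disagreeing pair contributes -1, so the sum is `ones - (n - ones)`.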

The control unit 230 may use XNOR-Count GEMM in place of the conventional FP32-format matrix multiplication.

In this regard, widely used NVIDIA GPUs and deep learning frameworks make it difficult to store each weight using a single bit. To solve this problem, the control unit 230 stores 32 binary weights in one FP32-format number and operates on the 32 binary weights at once. That is, in each matrix multiplication, the control unit 230 encodes the input matrix and the weight matrix into small 1-bit matrices. The control unit 230 may then apply XNOR-Count GEMM to the encoded small matrices.

For example, when the dimensions of the input matrix and the weight matrix are (100, 3072) and (3072, 768), the control unit 230 encodes the input matrix and the weight matrix into (100, 96) and (96, 768), since 3072 / 32 = 96, and applies XNOR-Count GEMM to every 32 weights. Because the weight matrix is already stored in encoded form after training, the weight matrix does not need to be encoded in the inference stage.
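As a non-limiting illustration, the encoding of 32 binary values per word and the resulting matrix multiplication may be sketched as follows. The weight matrix is passed column by column (that is, transposed) so that both operands are packed along the shared dimension; this layout, the handling of padding bits, and the function names are assumptions made for this sketch.

```python
WORD_BITS = 32  # 32 binary values are stored per word, as described above


def encode_rows(matrix):
    """Pack each row of -1/+1 values into 32-bit words (bit 1 for +1)."""
    packed = []
    for row in matrix:
        words = []
        for start in range(0, len(row), WORD_BITS):
            word = 0
            for offset, value in enumerate(row[start:start + WORD_BITS]):
                if value > 0:
                    word |= 1 << offset
            words.append(word)
        packed.append(words)
    return packed


def xnor_count_gemm(packed_a, packed_bt, n):
    """Multiply packed activations by packed weight columns; n is the shared dimension."""
    mask = (1 << WORD_BITS) - 1
    # Unused bits of the last word are 0 in both operands and would count as agreements.
    padding = len(packed_a[0]) * WORD_BITS - n
    result = []
    for a_row in packed_a:
        out_row = []
        for b_col in packed_bt:
            ones = sum(bin(~(a ^ b) & mask).count("1") for a, b in zip(a_row, b_col))
            out_row.append(2 * (ones - padding) - n)
        result.append(out_row)
    return result
```

Encoding a row of 3072 values in this way yields 96 words, which matches the (100, 96) shape given in the example.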

For the 8-bit index quantized portion, the control unit 230 may use FP16 GEMM in place of FP32 GEMM. Matrix multiplication in FP16 format is chosen because, compared with the FP32 format, it keeps the accuracy of the neural network high while providing faster inference. Here, FP16 GEMM means performing ordinary GEMM using the FP16 format.

In the inference stage, the control unit 230 de-quantizes the INT8 index-format matrix into an FP32-format weight matrix according to Equation 3, and then converts the FP32-format weight matrix into an FP16-format weight matrix. Thereafter, the control unit 230 may perform all operations in FP16 format.
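As a non-limiting illustration, the conversion performed when the weights are loaded for inference may be sketched as follows. The affine de-quantization stands in for Equation 3 and is an assumption, as are the function names; the half-precision rounding uses the `e` format of Python's standard `struct` module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def load_weights_fp16(indices, w_min, scale):
    """De-quantize INT8 indices (assumed affine form), then convert to FP16."""
    return [[to_fp16((q + 128) * scale + w_min) for q in row] for row in indices]
```

Because the INT8 indices are stored once after training, this conversion is needed only when the weights are loaded, not at every forward propagation.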

Unlike in the training process, the weight matrix has already been stored in INT8 format, so the weight matrix does not need to be quantized and de-quantized at every forward propagation. In addition, the control unit 230 may also use the FP16 format for the bias and LayerNorm layers in order to maintain the same accuracy and a high inference speed.

In addition, the terms "... unit" and "... module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The term "~unit" used in the above embodiments refers to software or to a hardware component such as an FPGA (field programmable gate array) or an ASIC, and a "~unit" performs certain roles. However, a "~unit" is not limited to software or hardware. A "~unit" may be configured to reside in an addressable storage medium and may be configured to execute on one or more processors. Thus, as an example, a "~unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The functions provided within the components and "~units" may be combined into a smaller number of components and "~units" or may be separated from additional components and "~units".

In addition, the components and "~units" may be implemented to run on one or more CPUs within a device or a secure multimedia card.

The quantization method according to the embodiments described with reference to FIGS. 3 to 5 may also be implemented in the form of a computer-readable medium that stores instructions and data executable by a computer. The instructions and data may be stored in the form of program code and, when executed by a processor, may generate a predetermined program module and perform a predetermined operation. The computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. The computer-readable medium may also be a computer recording medium, which may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. For example, the computer recording medium may be a magnetic storage medium such as an HDD or an SSD, an optical recording medium such as a CD, a DVD, or a Blu-ray disc, or a memory included in a server accessible through a network.

The quantization method according to the embodiments described with reference to FIGS. 3 to 5 may also be implemented as a computer program (or computer program product) including instructions executable by a computer. The computer program includes programmable machine instructions processed by a processor and may be implemented in a high-level programming language, an object-oriented programming language, an assembly language, a machine language, or the like. The computer program may also be recorded on a tangible computer-readable recording medium (for example, a memory, a hard disk, a magnetic/optical medium, or an SSD (Solid-State Drive)).

Accordingly, the quantization method according to the embodiments described with reference to FIGS. 3 to 5 may be implemented by a computing device executing the computer program described above. The computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are connected to one another by various buses and may be mounted on a common motherboard or in another suitable manner.

Here, the processor may process instructions within the computing device. Examples of such instructions include instructions stored in the memory or the storage device for displaying graphical information to provide a GUI (Graphic User Interface) on an external input/output device, such as a display connected to the high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and types of memory. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory stores information within the computing device. In one example, the memory may consist of a volatile memory unit or a set of such units. In another example, the memory may consist of a non-volatile memory unit or a set of such units. The memory may also be another type of computer-readable medium, such as a magnetic or optical disk.

The storage device may provide a large amount of storage space to the computing device. The storage device may be a computer-readable medium or a configuration that includes such a medium. For example, it may include devices within a SAN (Storage Area Network) or other configurations, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory, or another similar semiconductor memory device or device array.

The above-described embodiments are for illustration, and a person of ordinary skill in the art to which the embodiments belong will understand that they can easily be modified into other specific forms without changing their technical idea or essential features. The above-described embodiments should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope to be protected through this specification is indicated by the claims set out below rather than by the detailed description above, and should be construed to include all changes or modifications derived from the meaning and scope of the claims and their equivalents.

200: Device for quantizing a plurality of transformer encoder layers based on sensitivity of parameters 210: Input/output unit
220: Storage unit 230: Control unit

Claims

A method of performing quantization in a neural network including a plurality of transformer encoder layers, performed by a device that includes a control unit and quantizes the plurality of transformer encoder layers, the method comprising:
dividing, by the control unit, the plurality of transformer encoder layers into groups each including at least one transformer encoder layer, based on sensitivity of a parameter; and
applying, by the control unit, a quantization method determined for each divided group,
wherein the parameter represents a value that determines the intensity with which data input to each layer of a model having a transformer-based neural network is reflected,
wherein the sensitivity of the parameter corresponds to the change in accuracy of the model caused by compression of the parameter,
wherein the dividing comprises dividing the plurality of transformer encoder layers, in order from the first input layer, into a first group consisting of a preset number of layers and a second group consisting of the remaining layers, and
wherein, in the applying, the first group, set from the first input layer, is quantized into 8-bit encoder layers each including a feed-forward network and a self-attention layer that are each quantized by 8-bit index quantization, and the second group, set from the final output layer, is quantized into MP encoder layers each including a feed-forward network quantized by 1-bit quantization and a self-attention layer quantized by 8-bit index quantization.

(Deleted)

The method of claim 1,
wherein the 8-bit index quantization comprises:
sequentially applying quantization and de-quantization using an 8-bit index in forward propagation; and
training weights using an 8-bit clip function in backward propagation.

The method of claim 1,
wherein the 1-bit quantization comprises:
quantizing using a sign function in forward propagation; and
training weights using a 1-bit clip function in backward propagation.

The method of claim 5,
wherein the 1-bit quantization
further comprises performing absolute binary weight regularization, which trains the absolute value of each parameter to be close to 1.

The method of claim 5,
wherein the 1-bit quantization
further comprises, before the 1-bit quantization is performed, performing Prioritized Training so that weights in FP32 format learn the binary features of the input.

The method of claim 5,
wherein the 1-bit quantization
further comprises applying Layer-Wise Fine-Tuning in the training stage.

A device for performing quantization in a neural network including a plurality of transformer encoder layers, the device comprising:
a storage unit in which a program for performing quantization is stored; and
a control unit including at least one processor,
wherein the control unit is configured to:
divide the plurality of transformer encoder layers into groups each including at least one transformer encoder layer, based on sensitivity of a parameter, and
apply a quantization method determined for each divided group,
wherein the parameter represents a value that determines the intensity with which data input to each layer of a model having a transformer-based neural network is reflected,
wherein the sensitivity of the parameter corresponds to the change in accuracy of the model caused by compression of the parameter,
wherein the control unit divides the plurality of transformer encoder layers, in order from the first input layer, into a first group consisting of a preset number of layers and a second group consisting of the remaining layers, and
wherein the control unit quantizes the first group, set from the first input layer, into 8-bit encoder layers each including a feed-forward network and a self-attention layer that are each quantized by 8-bit index quantization, and quantizes the second group, set from the final output layer, into MP encoder layers each including a feed-forward network quantized by 1-bit quantization and a self-attention layer quantized by 8-bit index quantization.

(Deleted)

The device of claim 9,
wherein the control unit,
in performing the 8-bit index quantization,
sequentially applies quantization and de-quantization using an 8-bit index in forward propagation, and
trains weights using an 8-bit clip function in backward propagation.

The device of claim 9,
wherein the control unit,
in performing the 1-bit quantization,
quantizes using a sign function in forward propagation, and
trains weights using a 1-bit clip function in backward propagation.

The device of claim 13,
wherein the control unit,
in performing the 1-bit quantization,
performs absolute binary weight regularization, which trains the absolute value of each parameter to be close to 1.

The device of claim 13,
wherein the control unit,
in performing the 1-bit quantization,
performs, before the 1-bit quantization, Prioritized Training so that weights in FP32 format learn the binary features of the input.

The device of claim 13,
wherein the control unit,
in performing the 1-bit quantization,
applies Layer-Wise Fine-Tuning in the training stage.

A computer-readable recording medium on which a program for causing a computer to execute the method according to claim 1 is recorded.

A computer program stored in a medium to perform the method according to claim 1, the method being performed by a device for quantizing a plurality of transformer encoder layers based on the sensitivity of a parameter,
wherein the parameter represents a value that determines the intensity with which data input to each layer of a model having a transformer-based neural network is reflected, and
wherein the sensitivity of the parameter corresponds to the change in accuracy of the model caused by compression of the parameter.