KR20190129698A

KR20190129698A - Electronic apparatus for compressing recurrent neural network and method thereof

Info

Publication number: KR20190129698A
Application number: KR1020190031618A
Authority: KR
Inventors: 나데즈다 알렉산드로브나 칠코바; 예카테리나 막시모브나 로바초바; 드미트리 페트로비치 베트로브
Original assignee: 삼성전자주식회사
Priority date: 2018-05-10
Filing date: 2019-03-20
Publication date: 2019-11-20

Abstract

Disclosed are an electronic device and a method for compressing a recurrent neural network. According to the present disclosure, the electronic device and method apply a sparsification technique to the recurrent neural network: first to third multiplicative variables are obtained, the recurrent neural network is trained with them, and the network is compressed by sparsifying it.

Description

Electronic apparatus for compressing recurrent neural network and method thereof

The present disclosure relates to an electronic device and a method for compressing a recurrent neural network, and more particularly to an electronic device and a method for efficiently using a recurrent neural network artificial intelligence model on an electronic device such as a user terminal.

An artificial intelligence (AI) system is a computer system that implements human-level intelligence. Unlike conventional rule-based smart systems, it is a system in which the machine learns, makes judgments, and becomes smarter on its own. The more an AI system is used, the higher its recognition rate becomes and the more accurately it understands user preferences, so conventional rule-based smart systems are gradually being replaced by deep-learning-based AI systems.

Artificial intelligence technology consists of machine learning (deep learning) and element technologies that make use of machine learning.

Machine learning is an algorithmic technology that classifies and learns the features of input data by itself. Element technologies use machine learning algorithms such as deep learning and span technical fields such as linguistic understanding, visual understanding, inference/prediction, knowledge representation, and motion control.

The fields to which artificial intelligence technology is applied include the following. Linguistic understanding is a technology for recognizing and applying/processing human language and characters, and includes natural language processing, machine translation, dialogue systems, question answering, and speech recognition/synthesis. Visual understanding is a technology for recognizing and processing objects in the way human vision does, and includes object recognition, object tracking, image retrieval, person recognition, scene understanding, spatial understanding, and image enhancement.

Recently, language modeling tasks (modeling tasks for performing natural language processing, speech recognition, question answering, and the like) have been carried out using artificial intelligence models based on recurrent neural networks.

Conventional recurrent neural network models use a large number of parameters and therefore require long training times and large storage space. For this reason, such models have often been trained on external servers with large storage and high computing capacity, and a method is needed for using recurrent neural network artificial intelligence models efficiently on portable devices with limited memory, such as smartphones.

The present disclosure has been devised to solve the above problem, and an object of the present disclosure is to provide an electronic device and a method for compressing a recurrent neural network using a Bayesian sparsification technique.

A method for compressing a recurrent neural network to achieve the above object includes: obtaining a first multiplicative variable for input elements of the recurrent neural network; obtaining a second multiplicative variable for input neurons and hidden neurons of the recurrent neural network; obtaining a mean and a variance for the weights of the recurrent neural network, the first multiplicative variable, and the second multiplicative variable; and performing sparsification on the recurrent neural network based on the mean and the variance. The performing of the sparsification may further include: calculating, based on the mean and the variance, a relevance value for performing the sparsification; and setting to 0 any weight, first multiplicative variable, or second multiplicative variable whose relevance value is smaller than a preset value.

In this case, the relevance value may be the ratio of the square of the mean to the variance.

In this case, the preset value may be 0.05.
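
The thresholding rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the claimed implementation: the function name `sparsify` and the example arrays are hypothetical, and the variances are assumed to be strictly positive.

```python
import numpy as np

def sparsify(mean, variance, threshold=0.05):
    """Zero out entries whose relevance mean**2 / variance is below threshold.

    `variance` is assumed to be strictly positive.
    """
    mean = np.asarray(mean, dtype=float)
    variance = np.asarray(variance, dtype=float)
    relevance = mean ** 2 / variance      # ratio of squared mean to variance
    keep = relevance >= threshold         # entries that survive sparsification
    return np.where(keep, mean, 0.0), keep

# Hypothetical posterior means and variances of a 2x2 weight matrix.
theta = np.array([[0.80, 0.01], [-0.02, -1.20]])
var = np.array([[0.10, 0.50], [0.40, 0.20]])
pruned, keep = sparsify(theta, var)
```

The same function applies unchanged to the multiplicative variables, since they are also described by a mean and a variance.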

When the recurrent neural network includes a gated structure, the method may further include obtaining a third multiplicative variable for the preactivations of the gates, in order to make the gates and information flow elements of the recurrent layer of the recurrent neural network constant. In that case, the obtaining of the mean and the variance may further include obtaining a mean and a variance for the weights, the first multiplicative variable, the second multiplicative variable, and the third multiplicative variable.

In this case, the gated structure may be implemented as a long short-term memory (LSTM) layer of the recurrent neural network.

In this case, the obtaining of the mean and the variance may further include: initializing the mean and the variance for the weights, the first group variable, and the second group variable; and optimizing an objective associated with the mean and the variance of the weights, the first group variable, and the second group variable, thereby obtaining the mean and the variance for the weights, the first group variable, and the second group variable.

In this case, the obtaining may further include: selecting a mini-batch of objects; generating the weights, the first group variable, and the second group variable from an approximated posterior distribution; performing a forward pass of the recurrent neural network on the mini-batch based on the generated weights, first group variable, and second group variable; calculating the objective and calculating a gradient of the objective; and obtaining the mean and the variance for the weights, the first group variable, and the second group variable based on the calculated gradient.

In this case, the weights may be generated once per mini-batch, and the first group variable and the second group variable may be generated individually for each object.
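
The difference between the two sampling schemes is easiest to see in the array shapes. The sketch below assumes normal approximate posteriors and a single linear layer; all parameter values and names (`w_mean`, `z_mean`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 3, 2

# Hypothetical posterior parameters (means and standard deviations).
w_mean = rng.normal(size=(n_in, n_out))
w_std = np.full((n_in, n_out), 0.1)
z_mean = np.ones(n_in)
z_std = np.full(n_in, 0.1)

# Weights: one sample shared by every object in the mini-batch.
W = w_mean + w_std * rng.standard_normal((n_in, n_out))
# Group variables: a separate sample for each object in the mini-batch.
Z = z_mean + z_std * rng.standard_normal((batch, n_in))

x = rng.normal(size=(batch, n_in))   # mini-batch of input objects
pre = (x * Z) @ W                    # forward pass of one linear layer
```

Sampling the weight matrix once keeps the forward pass a single matrix product, while the much smaller group variables can be resampled per object at little cost.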

In this case, the input element may be a vocabulary or a word.

Meanwhile, an electronic device for compressing a recurrent neural network according to an embodiment of the present disclosure for achieving the above object includes a memory that stores at least one instruction and a processor that controls the at least one instruction. The processor obtains a first multiplicative variable for input elements of the recurrent neural network, obtains a second multiplicative variable for input neurons and hidden neurons of the recurrent neural network, obtains a mean and a variance for the weights of the recurrent neural network, the first multiplicative variable, and the second multiplicative variable, and performs sparsification on the recurrent neural network based on the mean and the variance.

The processor may calculate, based on the mean and the variance, a relevance value for performing sparsification, and may perform the sparsification by setting to 0 any weight, first multiplicative variable, or second multiplicative variable whose relevance value is smaller than a preset value.

The relevance value may be the ratio of the square of the mean to the variance, and the preset value may be 0.05.

When the recurrent neural network includes a gated structure, the processor may obtain a third multiplicative variable for the preactivations of the gates, in order to make the gates and information flow elements of the recurrent layer of the recurrent neural network constant; obtain a mean and a variance for the weights of the recurrent neural network, the first multiplicative variable, the second multiplicative variable, and the third multiplicative variable; and perform sparsification on the recurrent neural network based on the mean and the variance.

The gated structure may be implemented as a long short-term memory (LSTM) layer of the recurrent neural network.

The processor may initialize the mean and the variance for the weights, the first group variable, and the second group variable, and may optimize an objective associated with the mean and the variance of the weights, the first group variable, and the second group variable, thereby obtaining the mean and the variance for the weights, the first group variable, and the second group variable.

The processor may also select a mini-batch of objects; generate the weights, the first group variable, and the second group variable from an approximated posterior distribution; perform a forward pass of the recurrent neural network on the mini-batch based on the generated weights, first group variable, and second group variable; calculate the objective and a gradient of the objective; and obtain the mean and the variance for the weights, the first group variable, and the second group variable based on the calculated gradient.

The weights may be generated once per mini-batch, and the first group variable and the second group variable may be generated individually for each object.

The input element may be a vocabulary or a word.

According to the various embodiments of the present disclosure described above, compressing a recurrent neural network artificial intelligence model with a sparsification technique can accelerate language modeling tasks, and such tasks can be performed with the model even on portable devices with limited memory.

FIG. 1 is a block diagram schematically illustrating the configuration of an electronic device according to an embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating a method of compressing a recurrent neural network artificial intelligence model according to an embodiment of the present disclosure.
FIG. 3 is a flowchart illustrating a method of training a recurrent neural network artificial intelligence model according to an embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating a method of performing sparsification on a recurrent neural network artificial intelligence model according to an embodiment of the present disclosure.
FIG. 5 is a flowchart illustrating a method of compressing a recurrent neural network artificial intelligence model according to another embodiment of the present disclosure.

Hereinafter, various embodiments of this document are described with reference to the accompanying drawings. This is not intended to limit the technology described in this document to specific embodiments; it should be understood to cover various modifications, equivalents, and/or alternatives of the embodiments. In the description of the drawings, similar reference numerals may be used for similar components.

Expressions such as "first" and "second" used in this document may modify various components regardless of order and/or importance. They are used only to distinguish one component from another and do not limit the components. For example, a first user device and a second user device may denote different user devices regardless of order or importance. Likewise, without departing from the scope of rights described in this document, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

When a component (e.g., a first component) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another component (e.g., a second component), it should be understood that the component may be connected to the other component directly or through yet another component (e.g., a third component). In contrast, when a component (e.g., a first component) is referred to as being "directly coupled" or "directly connected" to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between them.

The terms used in this document are used only to describe particular embodiments and may not be intended to limit the scope of other embodiments. Singular expressions may include plural expressions unless the context clearly indicates otherwise. The terms used herein, including technical and scientific terms, may have the same meanings as commonly understood by a person of ordinary skill in the art described in this document. Terms defined in general dictionaries may be interpreted as having the same or similar meanings as in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this document. In some cases, even terms defined in this document cannot be interpreted to exclude the embodiments of this document.

FIG. 1 is a block diagram schematically illustrating the configuration of an electronic device according to an embodiment of the present disclosure.

The electronic device of FIG. 1 is an embodiment of a device that compresses a recurrent neural network artificial intelligence model by sparsifying it. Before the electronic device of FIG. 1 is described in detail, various terms related to recurrent neural network artificial intelligence models are explained.

In the present disclosure, a layer means one layer in a multilayer neural network structure; that is, one layer of an artificial intelligence model is defined as a layer. In general, a layer can be defined for an input vector x and an output vector y in the form y = f(Wx + b), where W and b are called the layer parameters.
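
As a concrete instance of y = f(Wx + b), the following minimal sketch evaluates one layer with NumPy. The function name `layer`, the ReLU choice for f, and the numbers are illustrative assumptions only.

```python
import numpy as np

def layer(x, W, b, f=np.tanh):
    """Evaluate one layer y = f(W x + b) for a single input vector x."""
    return f(W @ x + b)

W = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])   # layer parameter W (3 outputs, 2 inputs)
b = np.zeros(3)                                      # layer parameter b
x = np.array([0.5, -0.5])                            # input vector

# Use ReLU as the nonlinearity f for this example.
y = layer(x, W, b, f=lambda a: np.maximum(a, 0.0))
```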

In the present disclosure, a layer function means a generalization of the layer definition y = f(Wx + b) expressed as an arbitrary function g. For example, a layer function may be written as y = g(x) or y_j = g(x_1, x_2, ..., x_N). The layer function g is typically expressed as a linear function of the inputs x_1, ..., x_N, but it may of course take other forms.

[Bayesian neural network]

In a Bayesian neural network, the artificial intelligence model can be trained by treating the weights of the neural network as random variables rather than as fixed values. The model is trained using the probability distribution of the weights; in other words, each weight is regarded as a probability distribution.

In a Bayesian neural network, the artificial intelligence model can be trained using a prior distribution and an approximated posterior distribution. Specifically, the prior distribution represents the expected distribution of the weights before the model is trained, and the approximated posterior distribution represents the expected distribution of the weights after the model is trained. The closer the approximated posterior distribution is to the true posterior distribution, the better the model performs.

p(A) denotes the prior distribution of A, that is, the distribution of A as currently known. Accordingly, p(W) denotes the prior distribution of the weight matrix W: the distribution of W as currently known.

p(B|A) denotes the posterior distribution given the evidence that event A occurred. Specifically, when event A has occurred, it is the conditional probability distribution describing the belief that event A arose from event B. Accordingly, p(W | x, y) denotes the posterior distribution of the weight matrix W given the observed inputs x and outputs y.

In a Bayesian neural network, the artificial intelligence model is trained each time input data arrives, and the prior and posterior distributions obtained in this way improve the accuracy of the approximated posterior distribution.

In a Bayesian neural network with weights W that represent the dependence of the output (target) elements y on the input elements x, the weights W can be treated as random variables. In a Bayesian neural network, the posterior distribution p(W | x, y) can be inferred from the prior distribution p(W).

In a recurrent neural network it is difficult to compute the true posterior distribution, but it can be approximated by an approximated posterior distribution q(W | λ) taken from some parametric family. The quality of this approximation can be measured by the KL divergence KL(q(W | λ) || p(W | x, y)).

The optimal parameter λ can be found by maximizing the variational lower bound with respect to λ.

Here, the variational lower bound is a concept introduced in order to minimize the KL divergence KL(q(W | λ) || p(W | x, y)), which measures how far the approximated posterior distribution is from the true posterior distribution.

[Equation 1]
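
A standard form of the variational lower bound that matches the two terms described for Equation 1 (a log-likelihood term and a KL-divergence term) is sketched here. The symbols are assumptions introduced for illustration: q(W | λ) is the approximated posterior with parameters λ, p(W) is the prior, and (x, y) are the training inputs and targets.

```latex
\begin{align}
\mathcal{L}(\lambda)
  &= \mathbb{E}_{q(W \mid \lambda)}\bigl[\log p(y \mid x, W)\bigr]
   - \mathrm{KL}\bigl(q(W \mid \lambda)\,\|\,p(W)\bigr)
   \;\to\; \max_{\lambda} \\
\log p(y \mid x)
  &= \mathcal{L}(\lambda)
   + \mathrm{KL}\bigl(q(W \mid \lambda)\,\|\,p(W \mid x, y)\bigr)
\end{align}
```

Because log p(y | x) does not depend on λ, maximizing the lower bound is the same as minimizing the KL divergence between the approximated posterior and the true posterior.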

In Equation 1, L(λ) denotes the variational lower bound. The log-likelihood term, the expectation of log p(y | x, W) under q(W | λ), is generally approximated by the Monte Carlo method. To remove bias from the Monte Carlo estimate, the weights can be parameterized by a deterministic function W = f(λ, ε), where ε is drawn from some non-parametric (parameter-free) distribution.

The KL-divergence term of Equation 1, KL(q(W | λ) || p(W)), acts as a regularizer and can generally be computed exactly or approximated analytically.

An advantage of sparsification techniques for Bayesian neural networks is that they have fewer hyperparameters than pruning-based techniques. In addition, they can provide a higher sparsity level than pruning-based techniques.

[Sparse Variational Dropout (Sparse VD)]

Dropout is a standard technique for regularizing neural networks. Dropout multiplies the inputs of each layer by a randomly generated noise vector. The elements of the noise vector can be drawn from a Bernoulli distribution or a normal distribution whose parameter is tuned by cross-validation.
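
The two noise choices can be compared with a short NumPy sketch. The function names, the inverted-dropout rescaling by 1 / (1 - p_drop), and the parameter values are assumptions made for this example and are not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_dropout(x, p_drop, rng):
    """Multiply x by Bernoulli noise, rescaled so the expected value is unchanged."""
    keep = rng.random(x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)

def gaussian_dropout(x, alpha, rng):
    """Multiply x by normal noise with mean 1 and variance alpha."""
    noise = 1.0 + np.sqrt(alpha) * rng.standard_normal(x.shape)
    return x * noise

x = np.ones(10000)
xb = bernoulli_dropout(x, p_drop=0.5, rng=rng)
xg = gaussian_dropout(x, alpha=0.25, rng=rng)
```

In both cases the noisy activations keep the mean of the clean activations; only the shape of the noise differs.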

When noise is obtained according to an embodiment of the present disclosure, the prior and posterior distributions of the weights may be the prior and posterior distributions of the layer function obtained by multiplying the layer function of each weight by the noise.

For one fully-connected layer of a feed-forward neural network with input size n, output size m, and weight matrix W, the prior distribution of the weights is a fully factorized log-uniform distribution, p(|w_ij|) ∝ 1/|w_ij|, and the posterior distribution is a fully factorized normal distribution, as expressed in Equation 2.

[Equation 2]
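
A fully factorized normal posterior of the kind described for Equation 2 can be written as follows. The symbols θ_ij (mean) and σ_ij² (variance) are assumed notation for the posterior parameters of the individual weight w_ij.

```latex
\begin{equation}
q(W \mid \Theta, \sigma)
  = \prod_{i=1}^{n} \prod_{j=1}^{m}
    \mathcal{N}\bigl(w_{ij} \mid \theta_{ij}, \sigma_{ij}^{2}\bigr)
\end{equation}
```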

The posterior distribution can be obtained by multiplying the weights by normal noise or by adding normal noise to them, as in Equation 3 or Equation 4, respectively.

[Equation 3]

[Equation 4]
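
The multiplicative and additive forms referred to as Equation 3 and Equation 4 are commonly written as shown below. This is a sketch in assumed notation (θ_ij for the mean, σ_ij² for the variance, α_ij for their ratio), not a verbatim copy of the original equations.

```latex
\begin{align}
w_{ij} &= \theta_{ij}\,\xi_{ij},
  & \xi_{ij} &\sim \mathcal{N}(1, \alpha_{ij}) \\
w_{ij} &= \theta_{ij} + \sigma_{ij}\,\epsilon_{ij},
  & \epsilon_{ij} &\sim \mathcal{N}(0, 1),
  \qquad \alpha_{ij} = \frac{\sigma_{ij}^{2}}{\theta_{ij}^{2}}
\end{align}
```

Both forms yield the same distribution, a normal distribution with mean θ_ij and variance σ_ij² = α_ij θ_ij²; they differ only in which quantities are treated as the trainable parameters.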

Equation 4 is called the additive reparameterization. With it, the normal noise can reduce the variance of the gradients of the variational lower bound L with respect to θ_ij.

Here, θ_ij is the mean of w_ij, and σ_ij² is the variance of w_ij.

A gradient is the value obtained by computing the partial derivative of the loss function with respect to the input of each neuron. The loss function is used to measure the difference between the predicted value and the actual value of the output; further details are omitted.

Because a sum of normally distributed variables is itself normally distributed with computable parameters, the noise can be applied to the preactivations (the input vector multiplied by the weight matrix W) instead of to the weights. This is called the local reparameterization trick. It efficiently reduces the variance of the gradients and makes training of the neural network more efficient.
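
A minimal NumPy sketch of the local reparameterization trick follows. It assumes independent normal weights with means `theta` and variances `sigma2`, and a row-vector convention (preactivation = x @ W); the function name and the numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrt_preactivation(x, theta, sigma2, rng):
    """Sample the preactivation directly instead of sampling the weights.

    With w_ij ~ N(theta_ij, sigma2_ij) independent, x @ W is normal with
    mean x @ theta and variance (x ** 2) @ sigma2.
    """
    mean = x @ theta
    var = (x ** 2) @ sigma2
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)

x = np.array([[1.0, -2.0, 0.5]])             # one input object, 3 features
theta = np.array([[0.3], [0.1], [-0.4]])     # weight means, shape (3, 1)
sigma2 = np.array([[0.04], [0.01], [0.09]])  # weight variances, shape (3, 1)

samples = np.concatenate(
    [lrt_preactivation(x, theta, sigma2, rng) for _ in range(20000)]
)
```

For this input the preactivation has mean -0.1 and variance 0.1025, and the empirical statistics of `samples` should be close to those values. Only one noise value per output unit is drawn, rather than one per weight.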

In the Sparse VD technique, the variational lower bound is optimized with respect to {Θ, log σ}. The KL-divergence term factorizes over the individual weights, and each factor depends only on α_ij = σ_ij² / θ_ij², as in Equation 5.

[Equation 5]
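
The factorization described for Equation 5 can be sketched as follows. The function name k and the symbol α_ij = σ_ij² / θ_ij² are assumed notation used only to show the structure of the term.

```latex
\begin{equation}
\mathrm{KL}\bigl(q(W \mid \Theta, \sigma)\,\|\,p(W)\bigr)
  = \sum_{i,j} \mathrm{KL}\bigl(q(w_{ij} \mid \theta_{ij}, \sigma_{ij})\,\|\,p(w_{ij})\bigr)
  = \sum_{i,j} k(\alpha_{ij}),
\qquad
\alpha_{ij} = \frac{\sigma_{ij}^{2}}{\theta_{ij}^{2}}
\end{equation}
```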

The KL-divergence term can be approximated as on the right-hand side of Equation 6.

[Equation 6]
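
One widely used approximation of this per-weight KL term is the one published for Sparse VD by Molchanov et al. (2017). The sketch below implements that published formula; treating it as the content of Equation 6 is an assumption, and the function names are hypothetical. The additive constant is chosen so that the term tends to 0 as α grows.

```python
import numpy as np

# Constants of the published Sparse VD approximation (Molchanov et al., 2017).
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def kl_approx(log_alpha):
    """Approximate KL(q(w) || log-uniform prior) as a function of log(alpha)."""
    log_alpha = np.asarray(log_alpha, dtype=float)
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    return K1 - K1 * sigmoid + 0.5 * np.log1p(np.exp(-log_alpha))

def log_alpha(theta, sigma2, eps=1e-8):
    """log(alpha) with alpha = sigma^2 / theta^2; eps guards against log(0)."""
    return np.log(sigma2 + eps) - np.log(theta ** 2 + eps)

values = kl_approx(np.array([-5.0, 0.0, 5.0, 20.0]))
```

The penalty shrinks as α increases, which is why training under this regularizer pushes uninformative weights toward large α.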

In the KL-divergence term, α_ij → ∞ corresponds to a posterior distribution over the weight in the form of a high-variance normal distribution. To keep the posterior accurate, θ_ij = 0 can be set, that is, σ_ij² = α_ij θ_ij² = 0. As a result, the posterior distribution of the weight approaches a zero-centered δ-function, and such a weight can be ignored because it does not affect the output of the neural network.

[Sparse VD technique for group sparsification]

The Sparse VD technique of Equation 4 can also be applied to achieve group sparsity in a neural network artificial intelligence model. In group sparsification the weights are divided into groups, and sparsification is performed by removing whole groups of weights rather than individual weights.

For example, consider groups of weights in a fully-connected layer, each group corresponding to one input neuron. To perform group sparsification, a second multiplicative variable z_i is introduced for each group, and the neural network is trained with weights of the form ŵ_ij = w_ij · z_i.

This is equivalent to applying the second multiplicative variables to the input of the fully-connected layer. With the SparseVD technique, setting z_i = 0 removes the neuron associated with z_i, which sparsifies the artificial intelligence model of the neural network. The prior and posterior pair for z_i can be expressed in the same way as in the conventional SparseVD technique. For each individual weight w_ij, a standard normal prior and a normal approximate posterior with learnable mean and variance can be used.

In the SparseVD technique, the prior over each individual weight encourages its mean θ_ij toward 0, which in turn allows the mean of the group variable z_i to be driven to 0.
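The equivalence between scaling a group of weights and scaling the corresponding input can be checked with a small example. This is a sketch under assumptions: rows of `W` are taken to be the groups (one row per input neuron), and the values of `z` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 5, 3
W = rng.standard_normal((n_in, n_out))     # one row (group) per input neuron
z = np.array([1.0, 0.0, 0.7, 0.0, 1.2])    # hypothetical group variables, one per input neuron
x = rng.standard_normal((2, n_in))

W_hat = W * z[:, None]                     # scale every weight of group i by z_i
out_scaled_weights = x @ W_hat
out_scaled_inputs = (x * z) @ W            # same result with z applied to the layer input

kept = np.flatnonzero(z != 0.0)            # input neurons that survive
out_pruned = x[:, kept] @ W_hat[kept]      # smaller layer, identical output
```

Groups whose variable is exactly zero contribute nothing, so the matching rows and input neurons can be dropped without changing the layer output.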

[Bayesian sparsification of recurrent neural networks]

A recurrent neural network takes a sequence x = [x_0, …, x_T] as input and maps it to a sequence of hidden states.

[Equation 7]

Equation 7 gives the layer function for the hidden layer of the recurrent neural network.

A typical neural network may be a feedforward neural network, in which information flows in only one direction, from the input layer to the output layer. A recurrent neural network is similar to a feedforward neural network, except that part of it receives its own output as input again. That is, the recurrent neural network receives an input, produces an output, and then receives that output as an input. As shown in Equation 7, the recurrent neurons h_t of each recurrent neural network may have two weight matrices, W^x and W^h. W^x is the weight matrix applied to the input sequence element x_t, and W^h is the weight matrix applied to h_{t-1}, the output of the recurrent neurons at the previous time step t-1.

x_t denotes the element of the input sequence at time step t; specifically, it is a matrix holding the input values of all samples at time step t. h_t is determined by the input element x_t and by h_{t-1}, the output of the recurrent neurons at the previous time step t-1. b is the bias vector, which holds the bias of each neuron.

Because h_t is a function of x_t and h_{t-1}, it is a function of all inputs since time step t = 0. At the first time step, t = 0, there is no previous output, so the previous output is generally initialized to 0.

[Equation 8]

According to an embodiment of the present disclosure, the output value y of the recurrent neural network may depend only on the last hidden state, as in Equation 8.

The functions applied in Equation 7 and in Equation 8 are nonlinear functions.
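The recurrence and the output layer described above can be sketched as follows. The choice of tanh and softmax as the nonlinear functions, the output parameters `W_y` and `b_y`, and the function name `rnn_forward` are assumptions made for illustration; the text only states that the functions are nonlinear.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, W_y, b_y):
    """xs: (T, batch, n_in). Returns class probabilities from the last hidden state."""
    h = np.zeros((xs.shape[1], W_h.shape[0]))      # no previous output at t = 0
    for x_t in xs:
        h = np.tanh(x_t @ W_x + h @ W_h + b)       # hidden state from x_t and the previous h
    logits = h @ W_y + b_y                         # output depends only on the last hidden state
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
xs = rng.standard_normal((6, 2, 4))                # 6 time steps, 2 samples, 4 features
y = rnn_forward(xs,
                rng.standard_normal((4, 3)), rng.standard_normal((3, 3)), np.zeros(3),
                rng.standard_normal((3, 2)), np.zeros(2))
```

The same `W_x` and `W_h` are reused at every time step, which is the property that later restricts how weight noise can be sampled.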

To sparsify the weights, the SparseVD technique described above can be applied to the recurrent neural network.

A prior with a fully factorized log-uniform distribution can be used in the method of compressing the recurrent neural network with the SparseVD technique, and the posterior can be approximated, as in Equation 9, by a fully factorized normal distribution over the input-to-hidden and hidden-to-hidden weights.

[Equation 9]

The parameters in Equation 9 have the same meaning as in the additive reparameterization of Equation 4.

[Equation 10]

Training the recurrent neural network artificial intelligence model with the SparseVD technique maximizes an approximation of the variational lower bound.

The training process begins with stochastic optimization performed over mini-batches.

The mini-batch technique is a training method in which the gradients of several training examples are computed together. That is, at each step the gradient is computed neither on the entire training set nor on a single sample, but on a small random set of samples called a mini-batch.

The integral in Equation 10 can be estimated with a single sample per mini-batch. The reparameterization trick, which gives an unbiased estimate of the integral, and additive reparameterization, which reduces the variance of the gradients, are used to sample the input-to-hidden weights W^x and the hidden-to-hidden weights W^h.

The local reparameterization trick cannot be applied to either the input-to-hidden weights W^x or the hidden-to-hidden weights W^h.

[Equation 11]

[Equation 12]

Because three-dimensional noise requires a large amount of memory, a single noise matrix can be generated for all objects of a mini-batch, as in Equations 11 and 12.

To train the recurrent neural network on a mini-batch, the input-to-hidden weights W^x and the hidden-to-hidden weights W^h are first sampled. The approximate variational lower bound of Equation 10 is then optimized with respect to {Θ, log σ}. Through the KL-divergence term the recurrent neural network is sparsified, and the posteriors of many weights are obtained in the form of a zero-centered δ-function.
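The sampling step of this procedure can be sketched as below: one two-dimensional noise matrix is drawn per weight matrix for the whole mini-batch, and the sampled weights are then reused at every time step. The dictionary keys, the function names, and the tanh nonlinearity are assumptions for illustration; gradient computation is left to an automatic differentiation framework and is not shown.

```python
import numpy as np

def sample_weights(theta, log_sigma, rng):
    """Additive reparameterization: w = theta + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal(theta.shape)    # one 2D noise matrix for the whole mini-batch
    return theta + np.exp(log_sigma) * eps

def forward_minibatch(xs, params, rng):
    """xs: (T, batch, n_in). The same sampled weights are used at every time step."""
    W_x = sample_weights(params["theta_x"], params["log_sigma_x"], rng)
    W_h = sample_weights(params["theta_h"], params["log_sigma_h"], rng)
    h = np.zeros((xs.shape[1], W_h.shape[0]))
    for x_t in xs:
        h = np.tanh(x_t @ W_x + h @ W_h + params["b"])
    return h
```

Sampling noise per object instead would require a noise tensor of shape (batch, n_in, n_out) for each matrix, which is the memory cost the text refers to.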

For a long short-term memory (LSTM) layer, the same prior and posterior pair is used for the input-to-hidden weights W^x and the hidden-to-hidden weights W^h, and the sparsification technique described above is applied in the same way.

For an LSTM layer, the noise for the input-to-hidden and hidden-to-hidden matrices is generated for each of the gates i, o, f and for the input modulation g.

[Method of performing Bayesian group sparsification for long short-term memory (LSTM) layers]

Equation 4 includes noise for group weights and noise for individual weights. A commonly used recurrent neural network may include an LSTM layer with a complex gated structure, which can be taken into account to improve the level of compression and acceleration. An LSTM layer contains an internal memory c_t, and its three gates (i, o, f) control updating, erasing, and releasing information from the internal memory c_t.

[Equation 13]

[Equation 14]

[Equation 15]

Equation 15 is the function for the internal memory c_t, and Equations 13 and 14 give the functions for i, f, g, and o. Here i denotes the input gate, f the forget gate, and o the output gate; a detailed description is omitted.

According to an embodiment of the present disclosure, a second multiplicative variable z^x for the input neurons and a second multiplicative variable z^h for the hidden neurons can be introduced into the LSTM layer. In addition, third multiplicative variables z^i, z^f, z^o, and z^g can be introduced and applied to the gates i, f, o and to the information flow g. A recurrent neural network artificial intelligence model that includes an LSTM layer with the second and third multiplicative variables can be expressed as follows.

[Equation 16]

[Equation 17]

[Equation 18]

[Equation 19]

[Equation 20]

Equations 16 to 20 show that the second and third multiplicative variables are applied not only to the columns of the weight matrices but also to their rows.

For example, applying the second multiplicative variable z^x and the third multiplicative variable z^f to the matrix W^x_f gives ŵ^x_{f,ij} = w^x_{f,ij} · z^x_i · z^f_j. The remaining seven weight matrices of the recurrent neural network artificial intelligence model that includes the LSTM layer can be expressed in the same way.

As in Equation 4, when the second multiplicative variables z^x and z^h approach 0, the neurons corresponding to them can be removed from the recurrent neural network model.

When the third multiplicative variables z^i, z^f, z^o, and z^g approach 0, the corresponding gate or information flow element becomes constant.

A gate or information flow element that has become constant no longer needs to be computed, so the forward pass of the LSTM layer can be accelerated.
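The effect of the multiplicative variables on an LSTM step can be sketched as follows. The exact form of Equations 16 to 20 is not reproduced in this text, so the placement of the variables below (input and hidden variables on the layer inputs, gate variables on the preactivations before the bias) is an assumption, as are the function and argument names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b, z_x, z_h, z_gate):
    """One LSTM step with multiplicative group variables.

    W_x[k]: (n_in, n_hid), W_h[k]: (n_hid, n_hid), b[k]: (n_hid,) for k in "ifgo".
    z_x: (n_in,) and z_h: (n_hid,) scale the input and hidden neurons.
    z_gate[k]: (n_hid,) scales the preactivation of gate or information flow k.
    """
    xz, hz = x_t * z_x, h_prev * z_h
    pre = {k: (xz @ W_x[k] + hz @ W_h[k]) * z_gate[k] + b[k] for k in "ifgo"}
    gates = {
        "i": sigmoid(pre["i"]),
        "f": sigmoid(pre["f"]),
        "o": sigmoid(pre["o"]),
        "g": np.tanh(pre["g"]),
    }
    c_t = gates["f"] * c_prev + gates["i"] * gates["g"]
    h_t = gates["o"] * np.tanh(c_t)
    return h_t, c_t, gates
```

With an entry of `z_gate["i"]` equal to zero, the matching input-gate element equals `sigmoid(b["i"])` for every input, so it can be precomputed once instead of being evaluated at each time step.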

[Equation 21]

[Equation 22]

[Equation 23]

Equation 21 gives the prior and posterior for the weight matrices, Equation 22 the prior and posterior for the second multiplicative variables, and Equation 23 the prior and posterior for the third multiplicative variables.

According to an embodiment of the present disclosure, using a log-uniform prior instead of a standard normal prior for the individual weights can improve the sparsification of the group variables.

[Bayesian compression for natural language processing]

In natural language processing tasks, most of the weights of a recurrent neural network may be concentrated in the first layer, which is connected to the vocabulary. However, not every word is needed for a natural language processing task that uses a recurrent neural network.

Therefore, according to an embodiment of the present disclosure, a first multiplicative variable can be introduced to perform vocabulary sparsification. When the first multiplicative variable of a word is set to 0 during training of the recurrent neural network, that word is filtered out (removed) as unnecessary for the natural language processing task.

In a natural language processing task, x is the input sequence, y is the true output, and ŷ is the output predicted by the recurrent neural network. y and ŷ can be expressed as sequences of vectors. X and Y denote the training set of input and output pairs. All weights of the recurrent neural network except the biases are denoted by w. The biases are not sparsified and are denoted by B.

The recurrent neural network model for a natural language processing task according to an embodiment of the present disclosure may be structured as follows.

When:

Embedding layer:

Recurrent layer:

Fully-connected layer:

The model above can be applied directly to any recurrent architecture.

According to Equation 4 and Equation 18, a fully factorized log-uniform prior over the weights and a fully factorized normal approximate posterior are placed on the recurrent neural network model.

[Equation 24]

The first term in Equation 24 is the loss function, and it can be approximated using a single sample from the approximate posterior.

The second term in Equation 24 is a regularizer. It makes the posterior close to the prior, is used to sparsify the recurrent neural network artificial intelligence model, and can be approximated as in Equation 26.

[Equation 25]

[Equation 26]

To obtain an unbiased estimate of the integral, the sample from the posterior can be generated with the reparameterization trick, as in Equation 27.

[Equation 27]

Unlike an ordinary feed-forward neural network, a recurrent neural network uses the same weights at different time steps. Therefore, to compute the likelihood, the same weight sample must be used at every time step t.

In conventional feed-forward neural networks, the local reparameterization trick (LRT) is used, which samples the preactivations instead of sampling individual weights.

However, in a recurrent neural network a weight matrix is used at more than one time step, so with tied weight sampling, in which one weight sample is shared by all time steps, the LRT cannot be applied to the weight matrices of the recurrent neural network.

Because h_t depends on W^h from the previous time step, the linear combination W^h h_t for the hidden-to-hidden matrix W^h is not normally distributed. Hence the rule about a sum of independent normal distributions with constant coefficients cannot be applied. As a result, training a recurrent neural network with the LRT is less efficient.

The linear combination W^x x_t for the input-to-hidden matrix W^x can be normally distributed. However, sampling the same W^x for all time steps is not equivalent to sampling the same noise on the preactivations for all time steps. Because x_t differs from one time step to the next, the same sample of W^x corresponds to different noise samples on the preactivations at different time steps. Therefore the LRT cannot be used to train the recurrent neural network.
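The last point can be made concrete with a short numerical check. The sketch assumes additive reparameterization of a single input-to-hidden matrix and uses made-up shapes; `noise` is the preactivation noise implied at each time step by one tied weight sample.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 2))        # posterior means of W^x
sigma = np.full((3, 2), 0.5)               # posterior standard deviations of W^x
eps = rng.standard_normal((3, 2))
W_x = theta + sigma * eps                  # one tied sample shared by all time steps

xs = rng.standard_normal((4, 3))           # x_t for 4 time steps of a single object
noise = xs @ W_x - xs @ theta              # preactivation noise implied at each time step
```

Each row of `noise` equals `x_t @ (sigma * eps)`, so the rows differ whenever the inputs differ, although a single weight sample was drawn.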

Because the training procedure above is efficient only with a 2D noise tensor, according to an embodiment of the present disclosure the noise on the weights is sampled once per mini-batch rather than once per individual object.

Accordingly, natural language processing training with a recurrent neural network proceeds as follows: the weights are generated in accordance with Equation 26, and the forward pass for a mini-batch is performed by passing the mini-batch through the recurrent neural network with the generated weights. The gradient of Equation 24 is then computed with respect to θ, log σ, and B.

In the training process of the recurrent neural network artificial intelligence model above, the mean values θ of the weights are used, and the regularizer of Equation 25 drives many components of θ to 0, which sparsifies the weights.

According to an embodiment of the present disclosure, the weights are sparsified by removing every weight whose ratio of the squared mean to the variance is smaller than a preset value.
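The pruning rule above can be sketched as a threshold on the ratio of squared mean to variance. The function name `prune_by_snr` and the example values are hypothetical; the threshold 0.05 is the value reported for the experiments in this document.

```python
import numpy as np

def prune_by_snr(theta, sigma, threshold):
    """Zero out weights whose ratio theta^2 / sigma^2 is below the threshold."""
    snr = theta ** 2 / sigma ** 2
    mask = snr >= threshold
    return theta * mask, mask

theta = np.array([[0.50, 0.01], [-0.02, -1.00]])   # posterior means
sigma = np.array([[0.10, 0.10], [0.20, 0.30]])     # posterior standard deviations
pruned, mask = prune_by_snr(theta, sigma, threshold=0.05)
```

Only the posterior means of the surviving weights need to be stored after pruning.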

One advantage of the Bayesian sparsification technique is that it generalizes easily to the sparsification of groups of weights. To this end, according to an embodiment of the present disclosure, a multiplicative variable is introduced for each group; when a multiplicative variable is eliminated during training of the recurrent neural network, the group corresponding to it is removed.

Specifically, according to an embodiment of the present disclosure, first multiplicative variables (probabilistic multiplicative weights) z are introduced for the words in the vocabulary, with one element of z for each of the V words, where V is the size of the vocabulary. The recurrent neural network artificial intelligence model is trained with z as follows.

1. For each input sequence in the mini-batch, a vector z is sampled from the current approximation of the posterior.

2. Each input sequence is multiplied by its sampled first multiplicative variable z.

3. The forward pass is performed as in ordinary training of a recurrent neural network.

For the other weights as well, the forward pass can be performed with z sampled in the same way as above. In this training method a log-uniform prior is used, and the posterior is approximated by a fully factorized normal distribution with learnable mean and variance.

Because z is a one-dimensional vector, it can be sampled separately for each object of the mini-batch, which reduces the variance of the gradients.

After the recurrent neural network has been trained with z, the elements of z with a low signal-to-noise ratio are removed, and the words corresponding to them are no longer used.
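The vocabulary sparsification steps can be sketched as follows. Multiplying a one-hot encoded token by z and then applying the embedding matrix is the same as scaling that token's embedding row by the z entry of the word, which is what the code does. The function names, the parameter names `theta_z` and `log_sigma_z`, and the array shapes are assumptions for illustration.

```python
import numpy as np

def embed_with_vocab_noise(tokens, emb, theta_z, log_sigma_z, rng):
    """tokens: (batch, T) word ids; emb: (V, d) embedding matrix.

    One vector z of length V is sampled per input sequence, and every token
    is scaled by the z entry of its word before the rest of the forward pass.
    """
    batch = tokens.shape[0]
    eps = rng.standard_normal((batch, theta_z.shape[0]))
    z = theta_z + np.exp(log_sigma_z) * eps            # (batch, V), one z per sequence
    z_tok = np.take_along_axis(z, tokens, axis=1)      # (batch, T) z value of each token
    return emb[tokens] * z_tok[..., None]              # (batch, T, d)

def kept_words(theta_z, log_sigma_z, threshold):
    """Indices of words whose z has a signal-to-noise ratio at or above the threshold."""
    snr = theta_z ** 2 / np.exp(2.0 * log_sigma_z)
    return np.flatnonzero(snr >= threshold)
```

After training, only the embedding rows of the words returned by `kept_words` need to be stored.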

The operations of the sparsification technique for the recurrent neural network artificial intelligence model described above may be performed by a processor of an electronic device according to an embodiment of the present disclosure; however, this is only one embodiment, and the operations may also be performed by other devices.

Experiments according to an embodiment of the present disclosure were carried out with an LSTM architecture on two types of task: text classification and language modeling. Three models were compared: a model without regularization, the SparseVD model, and the SparseVD model with multiplicative variables (SparseVD-Voc).

To measure the sparsity level of each model, the compression rate of the individual weights is computed as |w| / |w ≠ 0|. Sparsifying the weights makes it possible not only to compress the recurrent neural network artificial intelligence model but also to accelerate it. In the experiments, the number of neurons remaining in every layer (the input layer, the embedding layer, and the recurrent layer) is counted for each model. To count the neurons remaining in the input layer of the SparseVD-Voc model, the introduced variables z are used. In all other layers of the SparseVD and SparseVD-Voc models, a neuron is removed when the weights connected to it are removed.
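The two measurements described above can be sketched as below. The function names are hypothetical, and the convention that one row of a weight matrix holds the weights connected to one neuron is an assumption.

```python
import numpy as np

def compression_rate(weights):
    """|w| / |w != 0| computed over a list of weight arrays."""
    total = sum(w.size for w in weights)
    nonzero = sum(int(np.count_nonzero(w)) for w in weights)
    return total / nonzero

def remaining_neurons(W):
    """Number of neurons (rows of W) that still have at least one nonzero weight."""
    return int(np.any(W != 0, axis=1).sum())

W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
rate = compression_rate([W])        # 6 weights, 2 of them nonzero
alive = remaining_neurons(W)        # the first row has no nonzero weights
```

`compression_rate` raises `ZeroDivisionError` if every weight has been removed, a case a real evaluation script would need to handle.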

In the experiments, weights whose signal-to-noise ratio is lower than 0.05 are removed.

For the text classification experiments, the IMDb dataset is used for binary classification and the AGNews dataset for four-class classification.

15% and 5% of the training data, respectively, were set aside, and for both datasets a vocabulary of the 20,000 most frequent words was used.

The networks used one embedding layer of 300 units and one LSTM layer of 128/512 hidden units, with a fully-connected layer applied to the last output of the LSTM layer. The embedding layer was initialized with word2vec and GloVe, and the SparseVD and SparseVD-Voc models were trained for 800/150 epochs on the IMDb and AGNews datasets.

[Table 1]

Table 1 shows the training results of each model. The SparseVD model achieves a very high compression rate without loss of quality, and the SparseVD-Voc model raises the compression rate further while maintaining accuracy. This high compression rate is achieved by sparsifying the vocabulary; that is, to classify a text, only the important words need to be considered.

The language modeling task was carried out at the character level and at the word level. The experiments used a dataset with a vocabulary of 50 characters or 10,000 words. For the character-level/word-level task, a recurrent neural network with one LSTM layer of 10000/256 hidden units and a fully-connected layer with softmax activation was used to predict a character or a word. The SparseVD and SparseVD-Voc models were trained for 250/150 epochs on the word-level/character-level tasks.

[Table 2]

Table 2 shows the training results of each model. In these experiments the LRT was used on the last fully-connected layer. Using the LRT in the last layer does not harm the final result and speeds up training. Because the vocabulary used in the character-level experiments contains only 50 characters, the input vocabulary was not sparsified there. In the word-level experiments, more than half of the words were removed.

In the text classification experiments, the hidden-to-hidden weight matrices W^h are initialized orthogonally and all other matrices are initialized uniformly. The recurrent neural network is trained with mini-batches of size 128 and a learning rate of 0.0005.

In the language modeling experiments, all weight matrices are initialized orthogonally and all biases are initialized to zero. The initial values of the hidden elements and of the LSTM elements are not trainable and are equal to zero.

For character-level training in the language modeling experiments, non-overlapping sequences of 100 characters were used, and the recurrent neural network was trained with mini-batches of size 64, a learning rate of 0.002, and gradient clipping with threshold 1.

For word-level training in the language modeling experiments, the final hidden states of the current mini-batch were used as the initial hidden states of the subsequent mini-batch. Each mini-batch had size 32, and training used a learning rate of 0.002 and gradient clipping with threshold 10.

In the experiment on IMDb with the SparseVD model with multiplicative variables, the following words were used:

start, oov, and, to, is, br, in, it, this, was, film, t, you, not, have, It, just, good, very, would, story, if, only, see, even, no, were, my, much, well, bad, will, great, first, most, make, also, could, too, any, then, seen, plot, acting, life, over, off, did, love, best, better, i, If, still, man, something, m, re, thing, years, old, makes, director, nothing, seems, pretty, enough, own, original, world, series, young, us, right, always, isn, least, interesting, bit, both, script, minutes, making, 2, performance, might, far, anything, guy, She, am, away, woman, fun, played, worst, trying, looks, especially, book, DVD, reason, money, actor, shows, job, 1, someone, true, wife, beautiful, left, idea, half, excellent, 3, nice, fan, let, rest, poor, low, try, classic, production, boring, wrong, enjoy, mean, No, instead, awful, stupid, remember, wonderful, often, become, terrible, others, dialogue, perfect, liked, supposed, entertaining, waste, His, problem, Then, worse, definitely, 4, seemed, lives, example, care, loved, Why, tries, guess, genre, history, enjoyed, heart, amazing, starts, town, favorite, car, today, decent, brilliant, horrible, slow, kill, attempt, lack, interest, strong, chance, wouldn, sometimes, except, looked, crap, highly, wonder, annoying, Oh, simple, reality, gore, ridiculous, hilarious, talking, female, episodes, body, saying, running, save, disappointed, 7, 8, OK, word, thriller, Jack, silly, cheap, Oscar, predictable, enjoyable, moving, Unfortunately, surprised, release, effort, 9, none, dull, bunch, comments, realistic, fantastic, weak, atmosphere, apparently, premise, greatest, believable, lame, poorly, NOT, superb, badly, mess, perfectly, unique, joke, fails, masterpiece, sorry, nudity, flat, Good, dumb, Great, D, wasted, unless, bored, Tony, language, incredible, pointless, avoid, trash, failed, fake, Very, Stewart, awesome, garbage, pathetic, genius, glad, neither, laughable, beautifully, excuse, disappointing, disappointment,
outstanding, stunning, noir, lacks, gem, F, redeeming, thin, absurd, Jesus, blame, rubbish, unfunny, Avoid, irritating, dreadful, skip, racist, Highly, MST3K.start, oov, and, to, is, br, in, it, this, was, film, t, you, not, have, It, just, good, very, would, story, if, only, see, even, no, were, my, much, well, bad, will, great, first, most, make, also, could, too, any, then, seen, plot, acting, life, over, off, did, love, best, better, i, If, still, man, some- thing, m, re, thing, years, old, makes, director, nothing, seems, pretty, enough, own, original, world, series, young, us, right, always, isn, least, interesting, bit, both, script, minutes, making, 2, performance, might, far, anything, guy, She, am, away, woman, fun, played, worst, trying, looks, especially, book, DVD, reason, money, actor, shows, job, 1, someone, true, wife, beautiful, left, idea, half, excellent, 3, nice, fan, let, rest, poor, low, try, classic, production, boring, wrong, enjoy, mean, No, instead, awful, stupid, remember, wonderful, often, become, terrible, others, dialogue, perfect, liked, supposed, entertaining, waste, His, problem, Then, worse, definitely, 4, seemed, lives, example, care, loved, Why, tries, guess, genre, history, enjoyed, heart, amazing, starts, town, favorite, car, today, decent, brilliant, horrible, slow, kill, attempt, lack, interest, strong, chance, wouldn, sometimes, except, looked, crap, highly, wonder, annoying, Oh, simple, reality, gore, ridiculous, hilarious, talking, female, episodes, body, saying, running, save, disappointed, 7, 8, OK, word, thriller, Jack, silly, cheap, Oscar, predictable, enjoyable, moving, Un- fortunately, surprised, release, effort, 9, none, dull, bunch, comments, realistic, fantastic, weak, atmosphere, apparently, premise, greatest, believable, lame, poorly, NOT, superb, badly, mess, perfectly, unique, joke, fails, masterpiece, sorry, nudity, flat, Good, dumb, Great, D, wasted, unless, bored, Tony, language, incredible, pointless, 
avoid, trash, failed, fake, Very, Stewart, awesome, garbage, pathetic, genius, glad, neither, laughable, beautifully, excuse, disappointing, disappointment, outsta nding, stunning, noir, lacks, gem, F, redeeming, thin, absurd, Jesus, blame, rubbish, unfunny, Avoid, irritating, dreadful, skip, racist, Highly, MST3K.

As shown in FIG. 1, the electronic device 100 may include a memory 110 and a processor 120. The electronic device 100 according to various embodiments of the present disclosure may include, for example, at least one of a smartphone, a tablet PC, a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a medical device, a camera, or a wearable device. The wearable device may include at least one of an accessory type (e.g., a watch, ring, bracelet, necklace, glasses, contact lenses, or a head-mounted device (HMD)), a fabric- or clothing-integrated type (e.g., electronic clothing), a body-attached type (e.g., a skin pad or tattoo), or a bio-implantable circuit.

The memory 110 may store, for example, instructions or data related to at least one other component of the electronic device 100. In particular, the memory 110 may be implemented as a non-volatile memory, a volatile memory, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 110 is accessed by the processor 120, which may read, write, modify, delete, and update the data in it. In the present disclosure, the term "memory" may include the memory 110, a ROM (not shown) or a RAM (not shown) in the processor 120, or a memory card (not shown) mounted in the electronic device 100 (e.g., a micro SD card or a memory stick). The memory 110 may also store programs and data for composing the various screens to be displayed in the display area of a display.

In particular, the memory 110 may store a program for running an artificial intelligence agent. Here, the artificial intelligence agent is a personalized program that provides various services for the electronic device 100.

The processor 120 may include one or more of a central processing unit, an application processor, or a communication processor (CP).

In addition, the processor 120 may be implemented as at least one of an application-specific integrated circuit (ASIC), an embedded processor, a microprocessor, hardware control logic, a hardware finite state machine (FSM), or a digital signal processor (DSP). Although not shown, the processor 120 may further include an interface, such as a bus, for communicating with each component.

The processor 120 may, for example, run an operating system or an application program to control a plurality of hardware or software components connected to the processor 120, and may perform various kinds of data processing and computation. The processor 120 may be implemented as, for example, a system on chip (SoC). According to an embodiment, the processor 120 may further include a graphics processing unit (GPU) and/or an image signal processor. The processor 120 may load instructions or data received from at least one of the other components (e.g., a non-volatile memory) into a volatile memory, process them, and store the resulting data in the non-volatile memory.

Meanwhile, the processor 120 may include a dedicated processor for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU). Here, the dedicated processor for AI is specialized for probability computation and has higher parallel-processing performance than an existing general-purpose processor, so it can quickly process computations in AI fields such as machine learning.

In particular, according to an embodiment of the present disclosure, the processor 120 may obtain a first multiplication variable for the input elements of the recurrent neural network. As described above, an input element may be a vocabulary or a word. The processor 120 may also obtain second multiplication variables for the input neurons and the hidden neurons of the recurrent neural network. The second multiplication variable for the input neurons and the second multiplication variable for the hidden neurons are each expressed as described above.

After obtaining the first and second multiplication variables, the processor 120 may train the recurrent neural network using the weights of the recurrent neural network and the obtained first and second multiplication variables.

The processor 120 may train the recurrent neural network by initializing a mean and a variance for the weights and for the obtained first and second multiplication variables, and then optimizing an objective associated with those means and variances.

The objective corresponds to the expression given in Equation 1. The objective may be optimized using stochastic optimization.

The processor 120 selects a mini-batch of objects, samples the weights and the first and second multiplication variables from the approximate posterior distribution, and performs a forward pass through the recurrent neural network. Here, the weights may be sampled once per mini-batch, whereas the first and second multiplication variables (group variables) may be sampled separately for each object. The processor 120 then computes the objective and the gradient of the objective. Finally, the processor 120 optimizes the objective by updating the means and variances of the weights and of the first and second multiplication variables based on the computed gradient.
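The sampling granularity described above can be sketched on a toy model. This is only an illustration under stated assumptions: the one-weight "network" `y = w * (z * x)`, the Gaussian form of the approximate posterior, and all names (`sample_gaussian`, `forward_minibatch`) are hypothetical and not taken from the disclosure. Computing the objective and its gradient is left out.

```python
import math
import random


def sample_gaussian(mean, variance, rng):
    """Draw one value from N(mean, variance) as mean + sqrt(variance) * eps."""
    return mean + math.sqrt(variance) * rng.gauss(0.0, 1.0)


def forward_minibatch(batch, w_mean, w_var, z_mean, z_var, rng):
    """Forward pass of a toy model y = w * (z * x) (an assumed stand-in for the RNN).

    The weight w is sampled once for the whole mini-batch, while the
    multiplication (group) variable z is sampled separately for each object.
    """
    w = sample_gaussian(w_mean, w_var, rng)      # one draw per mini-batch
    zs, outputs = [], []
    for x in batch:
        z = sample_gaussian(z_mean, z_var, rng)  # one draw per object
        zs.append(z)
        outputs.append(w * (z * x))
    return w, zs, outputs


rng = random.Random(0)
w, zs, outputs = forward_minibatch([1.0, 2.0, 3.0], 0.5, 0.01, 1.0, 0.04, rng)
```

Every object in the mini-batch shares the same `w` but receives its own `z`, which is the distinction the paragraph draws between weights and group variables.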

When training of the recurrent neural network is complete, the processor 120 may perform sparsification on the weights, the first multiplication variable, and the second multiplication variable based on the obtained means and variances.

Sparsification compresses the recurrent neural network by setting certain weights, first multiplication variables, or second multiplication variables to zero. The processor 120 may compute, from the obtained mean and variance, a related value used to perform the sparsification. The related value is the ratio of the square of the mean to the variance, as described above.

The processor 120 may sparsify the recurrent neural network artificial intelligence model by setting to zero each weight, first multiplication variable, or second multiplication variable whose related value is smaller than a preset value.

The preset value may be 0.05, but is not limited thereto.
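The thresholding rule can be sketched as follows. The function names and the sample numbers are hypothetical, and keeping the learned mean as the point estimate for surviving entries is an assumption made for this illustration; variances are assumed to be strictly positive.

```python
def related_value(mean, variance):
    """Ratio of the squared mean to the variance (variance assumed > 0)."""
    return mean * mean / variance


def sparsify(means, variances, threshold=0.05):
    """Zero every entry whose related value is below the threshold.

    Surviving entries keep their learned mean (an assumption of this sketch).
    """
    return [
        0.0 if related_value(m, v) < threshold else m
        for m, v in zip(means, variances)
    ]


means = [0.8, 0.01, -0.5, 0.02]
variances = [0.1, 0.5, 0.2, 1.0]
pruned = sparsify(means, variances)  # second and fourth entries fall below 0.05
```

Entries with a small mean relative to their variance are zeroed, while entries with a large squared-mean-to-variance ratio are kept.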

According to an embodiment of the present disclosure, when the recurrent neural network includes a gated structure, the processor 120 obtains (introduces) a third multiplication variable for the preactivation of each gate, so that gates of the recurrent layer of the recurrent neural network can be made constant. The third multiplication variables are expressed as described above.

When the recurrent neural network includes a gated structure, the processor 120 additionally includes the third multiplication variable when performing optimization and sparsification. That is, after obtaining the first to third multiplication variables, the processor 120 may train the recurrent neural network using the weights of the recurrent neural network and the first, second, and third multiplication variables.
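A minimal sketch of how a multiplication variable on a gate preactivation can make that gate constant. The exact placement of the variables is an assumption here: the input and hidden multiplication variables scale `x` and `h` element-wise, the gate variable `z_gate` scales the preactivation, and the bias is added afterwards. The sigmoid nonlinearity and all names are likewise hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate_activation(x, h, w_x, w_h, bias, z_x, z_h, z_gate):
    """Scalar gate: sigmoid(z_gate * (w_x . (z_x * x) + w_h . (z_h * h)) + bias)."""
    pre = sum(w * zx * xi for w, zx, xi in zip(w_x, z_x, x))
    pre += sum(w * zh * hi for w, zh, hi in zip(w_h, z_h, h))
    return sigmoid(z_gate * pre + bias)


w_x, w_h, bias = [0.4, 0.1], [0.7], 0.3
z_x, z_h = [1.0, 1.0], [1.0]

# With z_gate set to zero, the gate no longer depends on x or h.
a = gate_activation([1.0, -2.0], [0.5], w_x, w_h, bias, z_x, z_h, 0.0)
b = gate_activation([3.0, 4.0], [-1.0], w_x, w_h, bias, z_x, z_h, 0.0)
```

Under these assumptions both calls return `sigmoid(bias)`, so a gate whose multiplication variable has been zeroed needs only its bias at inference time.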

The processor 120 may train the recurrent neural network by initializing a mean and a variance for the weights and for the first to third multiplication variables, and then optimizing an objective associated with those means and variances.

The processor 120 selects a mini-batch of objects, samples the weights and the first to third multiplication variables from the approximate posterior distribution, and performs a forward pass through the recurrent neural network using the sampled weights and the first to third multiplication variables (group variables) in order to compute the objective. The processor 120 then computes the gradient of the objective and optimizes the objective by updating the means and variances of the weights and of the first to third multiplication variables based on that gradient.

When training of the recurrent neural network is complete, the processor 120 may perform sparsification on the weights and the first to third multiplication variables based on the obtained means and variances.

Sparsification compresses the recurrent neural network by setting certain weights, first multiplication variables, second multiplication variables, or third multiplication variables to zero. The processor 120 may compute, from the obtained mean and variance, a related value used to perform the sparsification. For the weights and for the first to third multiplication variables, the related value is the ratio of the square of the obtained mean to the variance, as described above.

The processor 120 may sparsify the recurrent neural network artificial intelligence model by setting to zero each weight, first multiplication variable, second multiplication variable, or third multiplication variable whose related value is smaller than a preset value.

The gated structure of the recurrent neural network may be implemented as a long short-term memory (LSTM) layer. Because the details were described above, they are omitted here.

FIG. 2 is a flowchart illustrating a method of compressing a recurrent neural network artificial intelligence model according to an embodiment of the present disclosure.

First, the electronic device 100 obtains a first multiplication variable for the input elements of the recurrent neural network (S210). As described above, an input element may be a vocabulary or a word. The electronic device 100 then obtains second multiplication variables for the input neurons and the hidden neurons of the recurrent neural network (S220). The second multiplication variable for the input neurons and the second multiplication variable for the hidden neurons are each expressed as described above.

When the recurrent neural network includes a gated structure (S230-Y), the electronic device 100 obtains a third multiplication variable for the preactivation of each gate (S240). The third multiplication variables are expressed as described above.

The electronic device 100 trains the recurrent neural network based on the obtained multiplication variables and the weights of the recurrent neural network (S250). It then performs sparsification on the recurrent neural network based on the learned weights and multiplication variables (S260) and ends the process.

When the recurrent neural network does not include a gated structure (S230-N), the electronic device 100 trains the recurrent neural network based on the weights of the recurrent neural network, the first multiplication variable, and the second multiplication variable (S250), performs sparsification on the recurrent neural network (S260), and ends the process.

FIG. 3 is a flowchart illustrating a method of training a recurrent neural network artificial intelligence model according to an embodiment of the present disclosure.

First, the electronic device 100 initializes a mean and a variance for the weights and the group variables (S310). The group variables include the first and second group variables, and may further include a third group variable when the recurrent neural network includes a gated structure.

The electronic device 100 then selects a mini-batch of objects (S320) and samples the weights and the group variables from the approximate posterior distribution (S330).

Using the mini-batch, the electronic device 100 performs a forward pass through the recurrent neural network based on the sampled weights and group variables (S340).

The electronic device 100 then computes the objective and the gradient of the objective (S350).

Finally, the electronic device 100 obtains the means and variances of the weights and the group variables based on the computed gradient (S360), and may then end training of the recurrent neural network artificial intelligence model.

FIG. 4 is a flowchart illustrating a method of performing sparsification on a recurrent neural network artificial intelligence model according to an embodiment of the present disclosure.

The electronic device 100 computes a related value based on the obtained mean and variance (S410). The related value is the ratio of the square of the obtained mean to the variance.

When the related value is smaller than the preset value (S420-Y), the electronic device 100 sparsifies the recurrent neural network artificial intelligence model by setting the corresponding weight or multiplication variable to zero (S430). For a weight or multiplication variable whose related value is larger than the preset value (S420-N), the electronic device 100 ends the process without performing sparsification.

The preset value may be 0.05, but is not limited thereto.

FIG. 5 is a flowchart illustrating a method of compressing a recurrent neural network artificial intelligence model according to another embodiment of the present disclosure.

The electronic device 100 may perform sparsification on the weights of the recurrent neural network artificial intelligence model (S510). Specifically, the electronic device 100 trains the recurrent neural network based on the weights to obtain a mean and a variance for each weight, computes the ratio of the square of the mean to the variance from the obtained mean and variance, and sets to zero each weight whose computed ratio is smaller than a preset value.

The electronic device 100 may then perform sparsification on the input elements of the recurrent neural network artificial intelligence model (S520). Specifically, the electronic device 100 obtains a first multiplication variable for the input elements, trains the recurrent neural network based on the first multiplication variable to obtain its mean and variance, computes the ratio of the square of the mean to the variance from the obtained mean and variance, and sets to zero each first multiplication variable whose computed ratio is smaller than a preset value.

The electronic device 100 may then perform sparsification on the neurons of the recurrent neural network artificial intelligence model (S530). Specifically, the electronic device 100 obtains second multiplication variables for the input neurons and the hidden neurons, trains the recurrent neural network based on the second multiplication variables to obtain their means and variances, computes the ratio of the square of the mean to the variance from the obtained mean and variance, and sets to zero each second multiplication variable whose computed ratio is smaller than a preset value.

When the recurrent neural network artificial intelligence model further includes a gated structure, the electronic device 100 may perform sparsification on the gates of the recurrent neural network artificial intelligence model (S540). Specifically, the electronic device 100 obtains a third multiplication variable for the preactivation of each gate, trains the recurrent neural network based on the third multiplication variable to obtain its mean and variance, computes the ratio of the square of the mean to the variance from the obtained mean and variance, and sets to zero each third multiplication variable whose computed ratio is smaller than a preset value.
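To illustrate why zeroing a multiplication variable at the neuron level (S530) compresses the model, the sketch below drops the weight-matrix columns that belong to pruned input neurons and checks that the layer output is unchanged. It assumes, purely for illustration, that the multiplication variable scales the layer input element-wise as `W (z * x)`; the helper names and the sample numbers are hypothetical.

```python
def drop_pruned_inputs(weight_rows, z):
    """Remove the columns of a weight matrix whose multiplication variable is zero.

    weight_rows: list of rows, each holding one weight per input neuron.
    z: one multiplication variable per input neuron (0.0 means pruned).
    Returns the compacted matrix, the kept z values, and the kept column indices.
    """
    keep = [j for j, zj in enumerate(z) if zj != 0.0]
    compact = [[row[j] for j in keep] for row in weight_rows]
    return compact, [z[j] for j in keep], keep


def matvec_scaled(weight_rows, z, x):
    """Compute W (z * x) for list-based W, z, and x."""
    return [sum(w * zj * xj for w, zj, xj in zip(row, z, x)) for row in weight_rows]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
z = [0.5, 0.0, 2.0]          # the second input neuron has been pruned
x = [1.0, 10.0, -1.0]

W_small, z_small, keep = drop_pruned_inputs(W, z)
x_small = [x[j] for j in keep]
full = matvec_scaled(W, z, x)
small = matvec_scaled(W_small, z_small, x_small)
```

Because the pruned neuron contributes nothing to the product, the compacted matrix gives the same output with one column fewer, so a single zeroed multiplication variable removes a whole group of weights.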

Meanwhile, according to an embodiment of the present disclosure, the various embodiments described above may be implemented as software including instructions stored in machine-readable (e.g., computer-readable) storage media. The machine is a device that can call the stored instructions from the storage medium and operate according to the called instructions, and may include an electronic device according to the disclosed embodiments (e.g., the electronic device A). When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly, or by using other components under the control of the processor. The instructions may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish whether data is stored in the storage medium semi-permanently or temporarily.

In addition, according to an embodiment of the present disclosure, the methods according to the various embodiments described above may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least part of the computer program product may be at least temporarily stored in, or temporarily created on, a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

In addition, each component (e.g., a module or a program) according to the various embodiments described above may consist of a single entity or a plurality of entities, and some of the sub-components described above may be omitted, or other sub-components may be further included in the various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs, identically or similarly, the functions performed by each component before integration. Operations performed by a module, a program, or another component according to various embodiments may be executed sequentially, in parallel, repeatedly, or heuristically; at least some operations may be executed in a different order or omitted; or other operations may be added.

Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may of course be made by a person of ordinary skill in the art to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present disclosure.

100: electronic device
110: memory
120: processor

Claims

A method of compressing a recurrent neural network, the method comprising:
obtaining a first multiplication variable for an input element of the recurrent neural network;
obtaining a second multiplication variable for input neurons and hidden neurons of the recurrent neural network;
obtaining a mean and a variance for weights of the recurrent neural network, the first multiplication variable, and the second multiplication variable; and
performing sparsification on the recurrent neural network based on the mean and the variance.

The method of claim 1,
wherein performing the sparsification comprises:
calculating, based on the mean and the variance, a related value for performing the sparsification; and
setting to zero a weight, a first multiplication variable, or a second multiplication variable whose related value is smaller than a preset value.

The method of claim 2,
wherein the related value is the ratio of the square of the mean to the variance.

The method of claim 2,
wherein the preset value is 0.05.

The method of claim 1, further comprising,
when the recurrent neural network includes a gated structure,
obtaining a third multiplication variable for the preactivation of a gate, to make gates and information flow elements of a recurrent layer of the recurrent neural network constant,
wherein obtaining the mean and the variance comprises
obtaining a mean and a variance for the weights of the recurrent neural network, the first multiplication variable, the second multiplication variable, and the third multiplication variable.

The method of claim 5,
wherein the gated structure is implemented as a long short-term memory (LSTM) layer of the recurrent neural network.

The method of claim 1,
wherein obtaining the mean and the variance comprises:
initializing a mean and a variance for the weights, the first group variable, and the second group variable; and
optimizing an objective associated with the means and variances of the weights, the first group variable, and the second group variable, thereby obtaining the mean and the variance for the weights, the first group variable, and the second group variable.

The method of claim 7,
Wherein the obtaining comprises:
Selecting a mini-batch of objects;
Generating the weight, the first group variable, and the second group variable from an approximated posterior distribution;
Forward-passing the recurrent neural network using the mini-batch based on the generated weight, the first group variable, and the second group variable;
Calculating the objective and calculating a gradient for the objective; and
Obtaining the mean and the variance of the weight, the first group variable, and the second group variable based on the calculated gradient.

The method of claim 8,
Wherein the weight is generated per mini-batch, and the first group variable and the second group variable are generated separately for each object.
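
The generating and forward-passing steps recited above might be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the approximate posterior is taken to be Gaussian, the group variables are read as being drawn once per object in the mini-batch, and all names and parameter values (`sample_gaussian`, `w_mean`, `z_mean`, the log-variance of -6.0) are hypothetical. The objective and gradient computation are omitted, since they would normally be delegated to an automatic differentiation library.

```python
import numpy as np

def sample_gaussian(mean, log_var, rng, leading_shape=()):
    """Draw mean + std * eps with eps ~ N(0, 1), one draw per leading index."""
    eps = rng.standard_normal(leading_shape + mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
batch_size, n_in, n_hidden = 4, 3, 2

# Hypothetical learned parameters: means and log-variances.
w_mean = rng.normal(size=(n_hidden, n_in))
w_log_var = np.full((n_hidden, n_in), -6.0)
z_mean = np.ones(n_in)
z_log_var = np.full(n_in, -6.0)

x = rng.normal(size=(batch_size, n_in))                       # mini-batch of objects
w = sample_gaussian(w_mean, w_log_var, rng)                   # one draw shared by the batch
z = sample_gaussian(z_mean, z_log_var, rng, (batch_size,))    # one draw per object
pre = (x * z) @ w.T                                           # forward pass of one linear map
```

Sharing one weight draw across the mini-batch keeps the matrix product a single dense operation, while per-object draws of the much smaller group variables add only an elementwise multiplication.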

The method of claim 1,
Wherein the input element is a vocabulary or a word.

An electronic device for compressing a recurrent neural network, the electronic device comprising:
A memory comprising at least one instruction; and
A processor configured to control the at least one instruction,
Wherein the processor is configured to:
Obtain a first multiplicative variable for an input element of the recurrent neural network,
Obtain a second multiplicative variable for input neurons and hidden neurons of the recurrent neural network,
Obtain a mean and a variance of a weight of the recurrent neural network, the first multiplicative variable, and the second multiplicative variable, and
Perform sparsification on the recurrent neural network based on the mean and the variance.

The electronic device of claim 11,
Wherein the processor is configured to:
Calculate, based on the mean and the variance, an associated value for performing the sparsification, and
Perform the sparsification by setting to zero a weight, a first multiplicative variable, or a second multiplicative variable whose associated value is smaller than a predetermined value.

The electronic device of claim 12,
Wherein the associated value is the ratio of the square of the mean to the variance.

The electronic device of claim 12,
Wherein the predetermined value is 0.05.

The electronic device of claim 11,
Wherein, when the recurrent neural network includes a gated structure, the processor is configured to:
Obtain a third multiplicative variable relating to preactivation of a gate to make the gate and information flow elements of a recurrent layer of the recurrent neural network constant,
Obtain a mean and a variance of the weight, the first multiplicative variable, the second multiplicative variable, and the third multiplicative variable, and
Perform the sparsification on the recurrent neural network based on the mean and the variance.

The electronic device of claim 15,
Wherein the gated structure is implemented as a long short-term memory (LSTM) layer of the recurrent neural network.

The electronic device of claim 11,
Wherein the processor is configured to:
Initialize a mean and a variance of the weight, a first group variable, and a second group variable, and
Optimize an objective associated with the mean and the variance of the weight, the first group variable, and the second group variable, thereby obtaining the mean and the variance for the weight, the first group variable, and the second group variable.

The electronic device of claim 17,
Wherein the processor is configured to:
Select a mini-batch of objects,
Generate the weight, the first group variable, and the second group variable from an approximated posterior distribution,
Forward-pass the recurrent neural network using the mini-batch based on the generated weight, the first group variable, and the second group variable,
Calculate the objective and calculate a gradient for the objective, and
Obtain the mean and the variance of the weight, the first group variable, and the second group variable based on the calculated gradient.

The electronic device of claim 18,
Wherein the weight is generated per mini-batch, and the first group variable and the second group variable are generated separately for each object.

The electronic device of claim 11,
Wherein the input element is a vocabulary or a word.