KR102205430B1

KR102205430B1 - Learning method using artificial neural network

Info

Publication number: KR102205430B1
Application number: KR1020190095084A
Authority: KR
Inventors: 강병호
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2021-01-20

Abstract

According to one embodiment of the present invention, a learning method using an artificial neural network circuit may comprise the steps of: receiving a document including a plurality of natural languages and converting the plurality of natural languages included in the received document into tokens; generating a plurality of partial tokens by dividing the tokens; generating embedding vectors by converting the plurality of partial tokens into vectors; generating an inferred embedding vector by performing RNN encoding and RNN decoding on the embedding vectors; generating an inverted embedding vector by inverting the order of the embedding vectors; calculating a difference between the inferred embedding vector and the inverted embedding vector; generating a backpropagation value for reducing the difference; and performing the RNN encoding and the RNN decoding again using the backpropagation value. Usability is thereby improved.

Description

Learning method using artificial neural network {LEARNING METHOD USING ARTIFICIAL NEURAL NETWORK}

The present invention relates to a learning method using an artificial neural network and to a method of recommending similar documents using the learning method.

With recent advances in technology, interest in natural language processing (NLP) using machine learning has been growing. Natural language processing is a technology that enables computers to recognize and process the language people use, and it is applied to automatic translation, search engines, and the like.

Representative natural language processing algorithms include BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-Training model). BERT and GPT are natural language processing algorithms that reflect context at the sentence level, and they perform best when handling sentences composed of 512 tokens.

Conventional algorithms such as BERT use a transformer algorithm to grasp context. Conventionally, the transformer algorithm has been tuned in order to grasp document-level context. However, when the number of heads is increased so that the transformer algorithm can process document-level context, memory complexity increases, and when the vocabulary size is increased, versatility decreases.

The problem to be solved by the present invention is to provide a method of generating a vector that reflects document-level context in order to perform learning using an artificial neural network.

However, the problems to be solved by the present invention are not limited to the one mentioned above, and other problems not mentioned here will be clearly understood from the following description by those of ordinary skill in the art to which the present invention belongs.

A learning method using an artificial neural network circuit according to an embodiment of the present invention may include: receiving a document including a plurality of natural languages and converting the plurality of natural languages included in the received document into a token; dividing the token to generate a plurality of partial tokens; converting the plurality of partial tokens into vectors to generate an embedding vector; performing RNN (recurrent neural network) encoding and RNN decoding on the embedding vector to generate an inferred embedding vector; reversing the order of the embedding vector to generate an inverted embedding vector; calculating a difference between the inferred embedding vector and the inverted embedding vector; generating a backpropagation value for reducing the difference; and re-performing the RNN encoding and the RNN decoding using the backpropagation value.

The method may further include adjusting weights of the RNN encoding and the RNN decoding according to the backpropagation value.

The generating of the embedding vector may include: generating a plurality of embedding vector lists by performing a transformer algorithm on each of the plurality of partial tokens; and generating the embedding vector by combining the plurality of embedding vector lists.

Depending on the length of the partial tokens and the stride with which the partial tokens are generated, some of the plurality of partial tokens may overlap, and the embedding vector may be generated by performing an average value operation on the vectors in the embedding vector lists that correspond to the overlapping portion of the plurality of partial tokens.

In the calculating of the difference between the inferred embedding vector and the inverted embedding vector, the difference may be calculated using any one of the Smooth-L1 algorithm, the L1 algorithm, the L2 algorithm, and the Huber Loss algorithm.

According to an embodiment of the present invention, context can be grasped even at the document level (about 50,000 tokens).

In addition, since the present invention uses a model trained with the BERT algorithm as it is, versatility is high, and since the demanding BERT fine-tuning procedure is carried out in the RNN training process instead, usability is improved.

FIG. 1 is a block diagram of a DRAE that performs a vector generation algorithm reflecting document-level context in order to perform learning using an artificial neural network.
FIG. 2 is a detailed block diagram of the embedding vector generator shown in FIG. 1.
FIG. 3 shows a method of generating a plurality of partial tokens in the token divider shown in FIG. 2.
FIG. 4 shows a method of generating an embedding vector in the embedding vector association unit shown in FIG. 2.
FIGS. 5A and 5B are flowcharts illustrating a method of generating a vector reflecting document-level context in order to perform learning using an artificial neural network according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of recommending a similar document according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method of recommending a similar document through a natural language query according to an embodiment of the present invention.
FIG. 8 shows an apparatus for performing a method according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in a variety of different forms. The present embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention belongs are fully informed of the scope of the invention, and the present invention is defined only by the scope of the claims.

In describing embodiments of the present invention, a detailed description of a known function or configuration will be omitted if it is determined that it may unnecessarily obscure the subject matter of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of users or operators. Their definitions should therefore be made based on the contents of this specification as a whole.

FIG. 1 is a block diagram of a DRAE that performs a vector generation algorithm reflecting document-level context in order to perform learning using an artificial neural network.

Referring to FIG. 1, a DRAE (Document Recurrent AutoEncoder) 10 may include a first embedding vector generator (Embedding Vector Generator) 100, a second embedding vector generator 110, an RNN (Recurrent Neural Network) encoder-decoder 120, an inverting unit (Reverser) 150, and a cost function calculation unit (Cost Function) 160.

The first embedding vector generator 100 may receive a document including a plurality of natural languages, convert the plurality of natural languages included in the received document into a token, and generate an embedding vector (EV) for the converted token. The first embedding vector generator 100 may transmit the generated embedding vector (EV) to the RNN encoder-decoder 120.

The second embedding vector generator 110 may receive the document including the plurality of natural languages, convert the plurality of natural languages included in the received document into a token, and generate an embedding vector (EV) for the converted token. The second embedding vector generator 110 may transmit the generated embedding vector (EV) to the inverting unit 150.

In this specification, the DRAE 10 is illustrated as including two embedding vector generators 100 and 110, but the DRAE 10 is not limited thereto. For example, according to an embodiment, the DRAE 10 may include a single embedding vector generator. In that case, the embedding vector generator may convert the plurality of natural languages included in the received document into a token, generate an embedding vector for the converted token, and transmit the generated embedding vector to both the RNN encoder-decoder 120 and the inverting unit 150. The embedding vector generators 100 and 110 are described in more detail with reference to FIG. 2.

The RNN encoder-decoder 120 may include an RNN encoder 130 and an RNN decoder 140. The RNN encoder-decoder 120 may receive the embedding vector (EV) from the first embedding vector generator 100. In addition, the RNN encoder-decoder 120 may receive an encoder initialization vector (Encoder Initial Vector, EIV) as a vector value for initializing the RNN encoder 130, a decoder initialization vector (Decoder Initial Vector, DIV) as a vector value for initializing the RNN decoder 140, and a backpropagation (BP) value from the cost function calculation unit 160.

The encoder initialization vector (EIV) may mean a value for initializing the RNN encoder 130. The encoder initialization vector (EIV) may be set to a zero vector, but is not limited thereto. That is, according to an embodiment, the encoder initialization vector (EIV) may be set to a vector representing a value other than 0. The decoder initialization vector (DIV) may mean a value for initializing the RNN decoder 140. The decoder initialization vector (DIV) may be set to a zero vector, but is not limited thereto. That is, according to an embodiment, the decoder initialization vector (DIV) may be set to a vector representing a value other than 0.

The RNN encoder-decoder 120 may encode the received embedding vector to generate a context vector (CV), and decode the generated context vector (CV) to generate an inferred embedding vector (IEV) (or inferred embedding vector logits). The RNN encoder-decoder 120 may transmit the inferred embedding vector (IEV) to the cost function calculation unit 160.

The RNN encoder 130 may encode the embedding vector (EV) using the encoder initialization vector (EIV) and generate the context vector (CV) as the encoding result. In addition, the RNN encoder 130 may perform a gradient descent algorithm together with the RNN decoder 140 using the backpropagation value (BP) received from the cost function calculation unit 160. As the RNN encoder 130 performs the gradient descent algorithm one or more times, the error, that is, the difference between the inferred embedding vector (IEV) and the inverted embedding vector (REV), may become 0 (or a value that can be regarded as 0). The RNN encoder 130 may output the context vector of the document at the point where this error between the inferred embedding vector (IEV) and the inverted embedding vector (REV) is 0, and a memory (reference numeral 3 in FIG. 8) may store the context vector of the document.
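The iterate-until-the-error-vanishes loop described above can be illustrated with a deliberately simplified stand-in. The sketch below is not the RNN encoder-decoder of the specification: it assumes a single linear layer `w` (a hypothetical substitute for the encoder and decoder weights), a mean-squared error, and hypothetical parameters `lr`, `tol`, and `max_steps`. It only shows how a backpropagated error can drive repeated gradient descent updates until the inferred output matches the reversed target.

```python
def train_linear_reverser(x, lr=0.05, tol=1e-6, max_steps=1000):
    """Fit a linear map w so that w @ x approximates reversed(x).

    Hypothetical stand-in for the RNN encoder-decoder: one weight matrix,
    trained by plain gradient descent on a squared error.
    """
    n = len(x)
    target = list(reversed(x))              # plays the role of REV
    w = [[0.0] * n for _ in range(n)]       # weights start at zero
    error = float("inf")
    for _ in range(max_steps):
        # Forward pass: y plays the role of the inferred vector (IEV).
        y = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
        diff = [y[i] - target[i] for i in range(n)]
        error = sum(d * d for d in diff) / n
        if error < tol:                     # "a value that can be regarded as 0"
            break
        # Gradient of 0.5 * sum(diff^2) with respect to w[i][j] is diff[i] * x[j].
        for i in range(n):
            for j in range(n):
                w[i][j] -= lr * diff[i] * x[j]
    return w, error


weights, final_error = train_linear_reverser([1.0, 2.0, 3.0])
print(final_error < 1e-6)
```

The learning rate is chosen small enough for this toy input that each update shrinks the error; a real recurrent model would need its own optimizer settings.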

The RNN decoder 140 may decode the context vector (CV) using the backpropagation value (BP) and the decoder initialization vector (DIV), and generate the inferred embedding vector (IEV) as the decoding result. Here, the inferred embedding vector (IEV) may mean a vector predicted to be the originally input embedding vector (EV).

In addition, the RNN decoder 140 may generate the inferred embedding vector (IEV) using a transformer algorithm. An embedding vector generated using a conventional transformer algorithm has the problem that no inverse function exists for it. In the present invention, the RNN decoder 140 obtains the embedding vector of the token (TK) using the transformer algorithm, which makes it possible to solve the token estimation problem.

The RNN decoder 140 may transmit the inferred embedding vector (IEV) to the cost function calculation unit 160. The RNN decoder 140 may receive the backpropagation value (BP) from the cost function calculation unit 160 and perform the gradient descent algorithm using the received backpropagation value (BP).

The inverting unit 150 may receive the embedding vector (EV) from the second embedding vector generator 110 and generate an inverted embedding vector (REV) by reversing the order of the numbers included in the received embedding vector (EV). The inverting unit 150 may transmit the inverted embedding vector (REV) to the cost function calculation unit 160.

A conventional RNN encoder-decoder learns poorly on long sentences. This is because the distance from the first point to the last point of the sentence is long, which makes it difficult to retain the context within the sentence. Order-reversal learning, in which the estimate for the first token of the document is formed from the last token of the document, is therefore needed. However, if the document is reversed in units of tokens, a correctly inverted embedding vector may not be generated, because reversing the document token by token does not preserve the context of the sentences, so that an entirely different embedding vector may result.

To solve this problem, the present invention reverses the order of the embedding vector (EV) using the inverting unit 150. As a result, the DRAE 10 of the present invention can generate an inverted embedding vector in which the context of the sentences is preserved, and can solve the problem of the conventional RNN encoder-decoder that learning performance on long sentences is poor.
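The difference between reversing tokens and reversing embeddings can be made concrete with a short sketch. It assumes that the embedding vector (EV) is held as a list of per-position vectors and that the inverting unit reverses the order of those positions while leaving each vector untouched; the function name `reverse_embedding` and that data layout are illustrative assumptions, not details given in the specification.

```python
def reverse_embedding(ev):
    """Return the per-position vectors of `ev` in reverse order.

    Each vector was computed on the original (unreversed) text, so its
    contextual content is unchanged; only the sequence order flips.
    """
    return [list(vec) for vec in reversed(ev)]


# Three positions, each with a 2-dimensional embedding (made-up values).
ev = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
rev = reverse_embedding(ev)
print(rev)  # [[0.5, 0.6], [0.3, 0.4], [0.1, 0.2]]
```

Because the reversal happens after embedding, the target sequence still carries the context of the forward-reading document, which is the property the passage above relies on.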

The cost function calculation unit 160 may receive the inferred embedding vector (IEV) from the RNN decoder 140 and the inverted embedding vector (REV) from the inverting unit 150. The cost function calculation unit 160 may calculate an error, which is the difference between the inferred embedding vector (IEV) and the inverted embedding vector (REV), and transmit the backpropagation value (BP) to the RNN decoder 140 as feedback for reducing the error.

The cost function calculation unit 160 uses backpropagation as the way to reduce the difference between the inferred embedding vector (IEV) and the inverted embedding vector (REV). Here, because what must be estimated is the inferred embedding vector rather than a token number, Softmax and log likelihood cannot be used in the cost function calculation unit 160, which can make learning impossible. To solve this problem, the cost function calculation unit 160 of the present invention determines the backpropagation value (BP) using the Smooth-L1 algorithm, the L1 algorithm, the L2 algorithm, and/or the Huber Loss algorithm, thereby resolving the conventional problem that learning is impossible.
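For reference, the four regression losses named above can be written out as follows. This is a minimal sketch using their commonly used definitions, averaged over the elements of two flat lists; the function names and the `delta` and `beta` thresholds (both defaulting to 1.0) are assumptions for illustration, since the specification does not state which variants or hyperparameters are used.

```python
def l1_loss(pred, target):
    """Mean absolute difference."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared difference."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Quadratic for |d| <= delta, linear (scaled by delta) beyond it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d <= delta else delta * (d - 0.5 * delta)
    return total / len(pred)


def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for |d| < beta, linear beyond it (slope 1)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


iev = [0.0, 2.0]   # made-up inferred values
rev = [0.5, 0.0]   # made-up reversed target values
print(l1_loss(iev, rev), l2_loss(iev, rev))
```

All four operate directly on real-valued vectors, which is why they remain usable when the decoder output is an embedding rather than a distribution over token numbers.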

FIG. 2 is a detailed block diagram of the embedding vector generator shown in FIG. 1.

Referring to FIG. 2, the embedding vector generator (100 or 110, hereinafter referred to as 100 for convenience) may include a token generator (Tokenizer) 200, a token divider (Token Divider) 210, a transformer embedding unit (Transformer Embedding) 220, and an embedding vector association unit (Embedding Vector Concatenator) 230.

The token generator 200 may receive a document including a plurality of natural languages and convert the plurality of natural languages included in the received document into a token (TK). The token (TK) may consist of a set of numbers, and the numbers in the token (TK) may be generated in units of at least one of words, syllables, phonemes, or combinations thereof in the document. The token generator 200 may transmit the generated token (TK) to the token divider 210.

The token divider 210 may generate a plurality of partial tokens PTK1 to PTK(k), each including n consecutive numbers from among the numbers included in the token (TK). Here, n is a natural number less than or equal to the number of numbers included in the token, and may be an input size suitable for running the transformer algorithm. In addition, k may be a natural number of 2 or more. The token divider 210 may generate the plurality of partial tokens PTK1 to PTK(k) with a stride of size m (m being a natural number). Depending on the length of the token (TK), the length of the last partial token PTK(k) may be less than n. The token divider 210 may transmit the generated partial tokens PTK1 to PTK(k) to the transformer embedding unit 220 in order.

According to an embodiment, when m is smaller than n, some of the partial tokens may overlap. For example, when the token (TK) is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, n is 4, and m is 2, the first partial token (PTK1) is {1, 2, 3, 4} and the second partial token (PTK2) is {3, 4, 5, 6}, so that 3 and 4 overlap between the first partial token (PTK1) and the second partial token (PTK2).
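The windowing in this example can be sketched in a few lines. The function name `divide_token` is hypothetical, and the stopping rule (stop once a window reaches the end of the token, so that only the last partial token can be shorter than n) is an assumption, since the specification does not state exactly how the final window is chosen.

```python
def divide_token(token, n, m):
    """Split `token` into windows of up to n numbers, taken every m positions."""
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be positive")
    partial_tokens = []
    start = 0
    while start < len(token):
        partial_tokens.append(token[start:start + n])
        # Assumed stopping rule: the window that reaches the end is the last one.
        if start + n >= len(token):
            break
        start += m
    return partial_tokens


tk = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
parts = divide_token(tk, n=4, m=2)
print(parts[0], parts[1])  # [1, 2, 3, 4] [3, 4, 5, 6]
```

With `m` equal to `n` the windows tile the token without overlap, and with `m` smaller than `n` consecutive windows share `n - m` numbers.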

The method of generating the plurality of partial tokens is described in detail with reference to FIG. 3.

The transformer embedding unit 220 may generate a plurality of embedding vector lists EVL1 to EVL(k) by performing a transformer algorithm on each of the plurality of partial tokens PTK1 to PTK(k). The transformer embedding unit 220 may generate a first embedding vector list (EVL1) by performing the transformer algorithm on the first partial token (PTK1), a second embedding vector list (EVL2) by performing the transformer algorithm on the second partial token (PTK2), and a third embedding vector list (EVL3) by performing the transformer algorithm on the third partial token (PTK3). The transformer embedding unit 220 may transmit the generated embedding vector lists EVL1 to EVL(k) to the embedding vector association unit 230.

The embedding vector association unit 230 may generate the embedding vector (EV) by combining the plurality of embedding vector lists EVL1 to EVL(k). The embedding vector association unit 230 may transmit the generated embedding vector (EV) to the RNN encoder-decoder 120.

The embedding vector association unit 230 may sort the plurality of embedding vector lists EVL1 to EVL(k) in index order and generate the embedding vector (EV) by combining the vectors that have the same index across the embedding vector lists EVL1 to EVL(k). The embedding vector may be generated by performing an average value operation on the vectors that have the same index among the embedding vector lists EVL1 to EVL(k). For example, according to an embodiment, when m is smaller than n, parts of the embedding vector lists EVL1 to EVL(k) may overlap, and the embedding vector (EV) may be generated by averaging the overlapping vectors. The method of generating the embedding vector (EV) is described in more detail with reference to FIG. 4.

Conventional natural language processing algorithms have the problem that they cannot extract an embedding vector reflecting document-level context. In this specification, the token divider 210 generates the partial tokens while preserving the context within the sentences, and the embedding vector association unit 230 combines the embedding vector lists so that the context of the generated partial tokens is maintained, which makes it possible to extract an embedding vector reflecting the context of the entire document.

FIG. 3 shows a method of generating partial tokens in the token divider shown in FIG. 2.

Referring to FIG. 3, the token divider may generate k partial tokens PTK1 to PTK(k) (k being a natural number) from a token consisting of L numbers (L being a natural number). Each of the partial tokens PTK1 to PTK(k) may consist of n numbers (n being a natural number less than L). The token divider may generate the partial tokens with a stride of m (m being a natural number less than L).

The token divider may generate the first partial token (PTK1), which includes the first through n-th numbers of the token. The token divider may generate the second partial token (PTK2) at a stride of m from the first partial token (PTK1). That is, the token divider may generate the second partial token (PTK2), which includes the (1+m)-th through (n+m)-th numbers of the token. Likewise, the token divider may generate the third partial token (PTK3) at a stride of m from the second partial token (PTK2), that is, a third partial token that includes the (1+2m)-th through (n+2m)-th numbers. The token divider 210 may transmit the generated partial tokens PTK1 to PTK(k) to the transformer embedding unit 220.
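The three cases above follow one pattern, which can be written compactly. In the expression below, the symbol t_i (the i-th number of the token TK) and the running index j are notation introduced here for illustration; the cap at L reflects the earlier remark that the last partial token may be shorter than n.

```latex
\mathrm{PTK}_j = \bigl(t_{1+(j-1)m},\; t_{2+(j-1)m},\; \dots,\; t_{\min(n+(j-1)m,\;L)}\bigr),
\qquad j = 1, \dots, k
```

Setting j = 1, 2, 3 gives the ranges 1 to n, (1+m) to (n+m), and (1+2m) to (n+2m) stated above.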

Although FIG. 2 illustrates the k-th partial token PTK(k) as having the same length as the (k-1)-th partial token PTK(k-1), the length of the k-th partial token PTK(k) may differ from the length of the (k-1)-th partial token PTK(k-1) depending on the length L of the token (TK).

FIG. 4 shows a method of generating an embedding vector in the embedding vector association unit shown in FIG. 2.

Referring to (a) of FIG. 4, when m and n are equal, the embedding vector lists may not overlap. For example, when m and n are 5 and L is 10, the first embedding vector list EVL1 corresponds to indices 1 to 5 and the second embedding vector list EVL2 corresponds to indices 6 to 10, so the two lists do not overlap. Accordingly, the embedding vector association unit may generate the embedding vector EV by concatenating the first embedding vector list EVL1 and the second embedding vector list EVL2 in order.

Referring to (b) of FIG. 4, when m is less than n, some of the embedding vector lists may overlap. For example, when m is 1, n is 5, and L is 10, the first embedding vector list EVL1 corresponds to indices 1 to 5 and the second embedding vector list EVL2 corresponds to indices 2 to 6, so the first embedding vector list EVL1 and the second embedding vector list EVL2 may overlap at index 2. Accordingly, the average of the value of the first embedding vector list EVL1 at index 2 and the value of the second embedding vector list EVL2 at index 2 may be determined as the embedding vector for index 2.
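The combination of embedding vector lists, with averaging wherever lists overlap, might be sketched as below. The function name `combine_embedding_lists`, the 0-based positions, and the representation of each vector as a plain list of floats are assumptions made for illustration only.

```python
def combine_embedding_lists(lists, m, length):
    """Merge embedding vector lists into one embedding vector sequence.

    lists[i] is assumed to hold the vectors for positions
    i*m, i*m + 1, ... (0-based). Positions covered by several lists
    receive the element-wise average of all vectors at that position.
    """
    dim = len(lists[0][0])
    sums = [[0.0] * dim for _ in range(length)]
    counts = [0] * length
    for i, evl in enumerate(lists):
        for offset, vec in enumerate(evl):
            pos = i * m + offset
            counts[pos] += 1
            for d in range(dim):
                sums[pos][d] += vec[d]
    if 0 in counts:
        raise ValueError("some position is not covered by any list")
    return [[s / c for s in row] for row, c in zip(sums, counts)]


# Stride 1, window 2: position 1 is covered by both lists and is averaged.
evl1 = [[1.0], [2.0]]
evl2 = [[4.0], [6.0]]
ev = combine_embedding_lists([evl1, evl2], m=1, length=3)
# ev == [[1.0], [3.0], [6.0]]
```

When m equals n no position is shared, every count is 1, and the result reduces to a plain concatenation of the lists.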

FIGS. 5A and 5B are flowcharts illustrating a method of generating a vector that reflects document-level context in order to perform learning using an artificial neural network according to an embodiment of the present invention.

Referring to FIGS. 5A and 5B, the token generator 200 may receive a document including a plurality of natural languages and convert the plurality of natural languages included in the received document into a token TK (S500). The token divider 210 may generate a plurality of partial tokens PTK1 to PTK(k), each including n consecutive numbers from among the numbers included in the token TK (S510). The transformer embedding unit 220 may generate a plurality of embedding vector lists EVL1 to EVL(k) by performing a transformer algorithm on each of the plurality of partial tokens PTK1 to PTK(k) (S520). The embedding vector association unit 230 may generate an embedding vector EV by combining the plurality of embedding vector lists EVL1 to EVL(k) (S530). Specifically, the embedding vector association unit 230 may sort the plurality of embedding vector lists EVL1 to EVL(k) in index order and generate the embedding vector EV by combining the vectors that share the same index across the embedding vector lists EVL1 to EVL(k).

The RNN encoder 130 may encode the embedding vector EV using an encoder initialization vector EIV and generate a context vector CV as the encoding result (S540). The RNN decoder 140 may decode the context vector CV using a backpropagation value BP and a decoder initialization vector DIV and generate an inferred embedding vector IEV as the decoding result (S550).

The inverting unit 150 may receive the embedding vector EV from the embedding vector association unit 230 and generate an inverted embedding vector REV by reversing the order of the numbers included in the received embedding vector EV.

The cost function calculation unit 160 may calculate an error, which is the difference between the inferred embedding vector IEV and the inverted embedding vector REV, and determine whether the error is zero (S560).
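The reversal and the error computation can be illustrated with a small sketch. It assumes the embedding vector is held as a sequence of per-index vectors and that the error is a mean absolute (L1) difference; both the data layout and the function names `reverse_embedding` and `l1_error` are hypothetical choices for this example, and other difference measures could be substituted.

```python
def reverse_embedding(ev):
    """Return the embedding vector sequence in reversed order (the REV target)."""
    return ev[::-1]


def l1_error(iev, rev):
    """Mean absolute difference between the inferred and inverted sequences."""
    total = 0.0
    count = 0
    for inferred_vec, target_vec in zip(iev, rev):
        for x, y in zip(inferred_vec, target_vec):
            total += abs(x - y)
            count += 1
    return total / count


ev = [[1.0, 2.0], [3.0, 4.0]]
rev = reverse_embedding(ev)        # [[3.0, 4.0], [1.0, 2.0]]
perfect = l1_error(rev, rev)       # 0.0: the decoder output matches REV exactly
imperfect = l1_error(ev, rev)      # 2.0: every element is off by 2
```

An error of zero corresponds to the YES branch of S560; any positive error corresponds to the NO branch.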

When the error is not zero (NO in S560), the cost function calculation unit 160 may transmit a backpropagation value BP for driving the error to zero to the RNN decoder 140, and the RNN encoder 130 and the RNN decoder 140 may perform a backpropagation algorithm using the backpropagation value BP (S570). Using the weights updated by the backpropagation algorithm, the RNN encoder 130 and the RNN decoder 140 may repeat steps S540 to S570 until the difference between the inferred embedding vector and the inverted embedding vector becomes zero. By repeating steps S540 to S570, the DRAE 10 (or the RNN encoder-decoder 120) may be trained.

When the error is zero (YES in S560), the RNN encoder 130 may output the context vector CV obtained when the error is zero (S580), and the memory (3 in FIG. 8) may store the context vector CV output from the RNN encoder 130. The context vector CV may be the result of training the DRAE 10 (or the RNN encoder-decoder 120).

FIG. 6 is a flowchart illustrating a method of recommending similar documents according to an embodiment of the present invention.

The context vector CV of a reference document, for which similar documents are to be retrieved, may be determined using the DRAE 10 (S600). The context vector CV of the reference document may be determined by the method described with reference to FIGS. 1 to 5B. Thereafter, the context vector of each document included in the document group to be compared with the reference document for similarity may be determined (S610), and the distance between the context vector of the reference document and the context vector of each document included in the document group may be measured (S620).

Based on the measurement, the shorter the distance between the context vector of the reference document and the context vector of a document in the document group, the higher the similarity is determined to be, and several of the closest documents may be output in order of distance (S630).
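The distance-based ranking can be sketched as follows. The text does not name a distance metric, so Euclidean distance is used here purely as an assumption; the function name `rank_similar`, the `top_k` parameter, and the document identifiers are likewise hypothetical.

```python
import math


def rank_similar(reference_cv, corpus_cvs, top_k=3):
    """Return the top_k documents closest to the reference context vector.

    corpus_cvs maps a document id to its context vector. A smaller
    distance is treated as higher similarity (Euclidean metric assumed).
    """
    scored = [
        (math.dist(reference_cv, cv), doc_id)
        for doc_id, cv in corpus_cvs.items()
    ]
    scored.sort()  # ascending distance, ties broken by document id
    return [(doc_id, distance) for distance, doc_id in scored[:top_k]]


corpus = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
closest = rank_similar([0.0, 0.0], corpus, top_k=2)
# closest == [("a", 0.0), ("c", 1.0)]
```

The same routine would apply when the reference vector comes from a natural-language query rather than from a document.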

FIG. 7 is a flowchart illustrating a method of recommending similar documents through a natural-language query according to an embodiment of the present invention.

The context vector CV of a natural-language query to be searched may be determined using the DRAE 10 (S700). Here, the query may be a simple list of words, or it may be document-level text including one or more sentences or one or more paragraphs.

Thereafter, the context vector of each of the plurality of documents included in the document group to be compared with the natural-language query for similarity may be determined (S710), and the distance between the context vector of the reference document (here, the query) and the context vector of each document included in the document group may be measured (S720).

Based on the measurement, the shorter the distance between the context vector of the reference document (here, the query) and the context vector of a document in the document group, the higher the similarity is determined to be, and several of the closest documents may be output in order of distance (S730).

FIG. 8 shows an apparatus for performing a method according to an embodiment of the present invention.

The apparatus 1 may include a processor 2, a memory 3, and an input/output device 4.

The processor 2 may control the overall operation of the apparatus 1. The processor 2 may control the operations of the memory 3 and the input/output device 4.

The memory 3 may store information on the DRAE 10. Under the control of the processor 2, the memory 3 may store information on the context vector CV output from the RNN encoder 130. To recommend similar documents, the processor 2 may load the information on the DRAE 10 and/or the information on the context vector CV stored in the memory 3.

Combinations of the blocks in the block diagrams and the steps in the flowcharts attached to the present invention may be performed by computer program instructions. Because these computer program instructions can be loaded onto an encoding processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed through the encoding processor of the computer or other programmable data processing equipment create means for performing the functions described in each block of the block diagrams or each step of the flowcharts. These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so the instructions stored in the computer-usable or computer-readable memory may produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagrams or each step of the flowcharts. The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that operate the computer or other programmable data processing equipment may provide steps for executing the functions described in each block of the block diagrams and each step of the flowcharts.

In addition, each block or each step may represent a module, a segment, or a portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or the blocks or steps may sometimes be performed in reverse order depending on the corresponding function.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

10: DRAE
100, 110: embedding vector generator
120: RNN encoder-decoder
150: inverting unit
160: cost function calculation unit

Claims

A method of training a document recurrent autoencoder including a recurrent neural network (RNN) encoder and an RNN decoder, the method being performed by a document recurrent autoencoder training device and comprising:
receiving a document including a plurality of natural languages and converting the plurality of natural languages included in the received document into a token;
dividing the token to generate a plurality of partial tokens;
converting the plurality of partial tokens into vectors to generate an embedding vector;
generating an inferred embedding vector by performing RNN encoding and RNN decoding on the embedding vector;
generating an inverted embedding vector by reversing the order of the embedding vector;
calculating a difference between the inferred embedding vector and the inverted embedding vector;
generating a backpropagation value for reducing the difference; and
re-performing the RNN encoding and the RNN decoding using the backpropagation value.

The method of claim 1, further comprising:
adjusting the weights of the RNN encoding and the RNN decoding according to the backpropagation value.

The method of claim 1, wherein generating the embedding vector comprises:
generating a plurality of embedding vector lists by performing a transformer algorithm on each of the plurality of partial tokens; and
combining the plurality of embedding vector lists to generate the embedding vector.

The method of claim 3, wherein
some of the plurality of partial tokens overlap depending on the length of the partial tokens and the stride with which the partial tokens are generated, and
the embedding vector is generated by averaging the vectors in the embedding vector lists that correspond to the overlapping portion of the plurality of partial tokens.

The method of claim 1, wherein
calculating the difference between the inferred embedding vector and the inverted embedding vector comprises calculating the difference using any one of a Smooth-L1 algorithm, an L1 algorithm, an L2 algorithm, and a Huber loss algorithm.

A computer-readable recording medium storing a computer program,
wherein the computer program
comprises instructions for causing a processor to perform the method according to any one of claims 1 to 5.