KR102571902B1

KR102571902B1 - Method for translate sign language gloss using transformer, and computer program recorded on record-medium for executing method thereof

Info

Publication number: KR102571902B1
Application number: KR1020230030985A
Authority: KR
Inventors: 한승대; 오승진; 김주희
Original assignee: 주식회사 인피닉
Priority date: 2023-03-09
Filing date: 2023-03-09
Publication date: 2023-08-30

Abstract

The present invention proposes a sign language gloss translation method using a transformer, for translating natural language into sign language gloss with high accuracy. The method may include: receiving, by a translation server, natural language text; generating, by the translation server, a vector corresponding to the natural language text by encoding the natural language text; and generating, by the translation server, a sign language gloss matching the natural language text by decoding the vector through a transformer model trained in advance by machine learning on a data set of natural language and sign language matched with the natural language.

Description

Method for translate sign language gloss using transformer, and computer program recorded on record-medium for executing method thereof

The present invention relates to language translation. More specifically, it relates to a sign language gloss translation method using a transformer, for translating natural language into sign language gloss with high accuracy, and to a computer program recorded on a recording medium for executing the method.

Sign language is a type of language that conveys meaning through hand gestures rather than sound. Whereas spoken language is an auditory-vocal system that is understood by hearing and expressed by voice, sign language is a visual-motor system that is understood by sight and expressed by hand movements. Sign language is used mostly for communication by people with hearing impairments.

Sign language consists of manual signals and non-manual signals. Manual signals include hand location (the position of the hand), handshape (the shape of the hand), and hand movement (the motion of the hand). Non-manual signals include facial expressions and movements of the head and body, and can express emotions such as surprise, fear, joy, hatred, happiness, sadness, disgust, and ridicule.

Meanwhile, many people have recently shown interest in improving social welfare through information and communication technology. Specifically, developing and building systems that support the daily life and social participation of people who have difficulty with them, in response to their particular needs, has emerged as an important issue.

In particular, various studies are under way on systems that automatically translate natural language into sign language, so that hearing-impaired people can receive information and communication services using sign language, their primary means of communication.

However, sign language differs from natural language in grammar, vocabulary, word order, and modes of expression. Accordingly, there is a need for a system that can convert natural language into sign language with high accuracy while taking into account the grammar, vocabulary, word order, and modes of expression of sign language.

Korean Patent Registration No. 10-1915088, ‘Sign language translation device’ (registered October 30, 2018)

One object of the present invention is to provide a sign language gloss translation method using a transformer, for translating natural language into sign language gloss with high accuracy.

Another object of the present invention is to provide a computer program recorded on a recording medium for executing a sign language gloss translation method using a transformer, for translating natural language into sign language gloss with high accuracy.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the technical problem described above, the present invention proposes a sign language gloss translation method using a transformer, for translating natural language into sign language gloss with high accuracy. The method may include: receiving, by a translation server, natural language text; generating, by the translation server, a vector corresponding to the natural language text by encoding the natural language text; and generating, by the translation server, a sign language gloss matching the natural language text by decoding the vector through a transformer model trained in advance by machine learning on a data set of natural language and sign language matched with the natural language.

Specifically, the step of generating the sign language gloss is characterized in that the sign language gloss is generated by copying and selectively using tokens generated in the encoding process.

In addition, the step of generating the vector is characterized by including: generating first tokens by tokenizing each word of the natural language text, and generating second tokens by tokenizing each word of the natural language text together with the linguistic features of each word; and generating, through an artificial intelligence trained in advance by machine learning on natural language sentences, a first context vector containing context information corresponding to the first tokens, and generating a second context vector by embedding the second tokens.

The step of generating the vector is characterized in that a mixed feature vector is generated by combining the first context vector and the second context vector, and the generated mixed feature vector is input to the artificial intelligence for generating the sign language gloss.
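The combination of the two context vectors can be illustrated with a minimal sketch. The text does not say how the vectors are combined, so per-token concatenation is assumed here purely for illustration; the function name `mix_features` and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]


def mix_features(first_context: List[Vector],
                 second_context: List[Vector]) -> List[Vector]:
    """Build a mixed feature vector per token.

    Assumption: "combining" is modelled as concatenation of the
    contextual vector and the linguistic-feature vector of each token.
    """
    if len(first_context) != len(second_context):
        raise ValueError("both streams must cover the same number of tokens")
    # List concatenation joins the two vectors of each token end to end.
    return [a + b for a, b in zip(first_context, second_context)]


# Toy vectors for a two-token sentence (values are made up).
first = [[0.1, 0.2], [0.3, 0.4]]   # from the contextual encoder
second = [[1.0], [0.0]]            # from the feature embedding
mixed = mix_features(first, second)
```

Concatenation keeps both information sources intact and leaves it to the downstream model to weigh them; summation or a learned projection would be equally plausible readings of the text.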

Before the step of generating the first tokens and the step of generating the second tokens, a word having two or more meanings is detected in the natural language text, and the detected word is split with spaces into units of meaning.
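A minimal sketch of this spacing step follows. It assumes that a "word having two or more meanings" is a compound that can be looked up in a lexicon; the lexicon `COMPOUND_LEXICON`, its entries, and the function name are hypothetical and not taken from the source.

```python
from typing import Dict, List

# Hypothetical lexicon mapping a compound word to its units of meaning.
COMPOUND_LEXICON: Dict[str, List[str]] = {
    "signlanguage": ["sign", "language"],
    "handshape": ["hand", "shape"],
}


def space_semantic_units(text: str,
                         lexicon: Dict[str, List[str]] = COMPOUND_LEXICON) -> str:
    """Split known compound words into space-separated units of meaning."""
    out: List[str] = []
    for word in text.split():
        # Words not found in the lexicon are kept unchanged.
        out.extend(lexicon.get(word, [word]))
    return " ".join(out)


spaced = space_semantic_units("signlanguage uses handshape cues")
```

Running this step before tokenization means both token streams see the same word boundaries.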

The step of generating the second tokens is characterized in that the second tokens are generated by embedding the natural language text on the basis of the results of part-of-speech (POS) analysis and named entity recognition (NER).
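As a rough illustration, the second tokens can be thought of as each word paired with its POS tag and NER tag. The sketch below assumes the tags are already available from some tagger and joins them with `|`; the joining format, the tag names, and the function name `build_second_tokens` are assumptions for illustration only.

```python
from typing import List


def build_second_tokens(words: List[str],
                        pos_tags: List[str],
                        ner_tags: List[str]) -> List[str]:
    """Attach POS and NER tags to each word to form feature-bearing tokens."""
    if not (len(words) == len(pos_tags) == len(ner_tags)):
        raise ValueError("words, pos_tags and ner_tags must have equal length")
    return [f"{w}|{p}|{n}" for w, p, n in zip(words, pos_tags, ner_tags)]


# Tags below are hand-written examples, not the output of a real tagger.
second_tokens = build_second_tokens(
    ["Seoul", "is", "big"],
    ["NNP", "VBZ", "JJ"],
    ["LOC", "O", "O"],
)
```

Each resulting token could then be mapped to an embedding vector, which is the role the text assigns to the second context vector.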

The step of generating the sign language gloss is characterized by including: receiving, by the translation server, a mixed feature vector corresponding to the natural language text; and extracting, by the translation server, output tokens for the sign language that match the mixed feature vector, through an artificial intelligence trained in advance by machine learning on a data set of natural language and sign language matched with the natural language.

The step of extracting the output tokens is characterized in that a score for using an input token generated in the encoding process is calculated through the following equation.

[Equation]

(Here, the score term denotes the score for copying, at time step t of the decoding process, the input token at time step j of the encoding process; X denotes the input tokens; S_t denotes the state vector at time step t output by the decoder cell; and h_t denotes the result vector at time step t output by the encoder.)

The step of extracting the output tokens is characterized in that the score of a sign language token is calculated through the following equation.

[Equation]

(Here, the score term denotes the score for outputting, at time step t in the decoder, the i-th token of the sign language data set; V denotes the sign language tokens included in the sign language data set; and S_t denotes the state vector at time step t output by the decoder cell.)

Here, when the token generated at time step t is denoted G_t, four cases arise, as described in the following equation.

The step of extracting the output tokens is characterized in that the final token output probability is calculated using the following equation, and the token with the highest probability is output.

[Equation]
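The equations themselves are not reproduced in this text. As a hedged illustration only, the sketch below assumes a CopyNet-style formulation, which the symbol definitions above resemble: generation scores over the sign language vocabulary and copy scores over the input tokens are normalized together in one softmax, and a token that appears in both receives the sum of both probability masses. The function name `final_token_probs`, the argument names, and all score values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List


def final_token_probs(gen_scores: Dict[str, float],
                      source_tokens: List[str],
                      copy_scores: List[float]) -> Dict[str, float]:
    """Joint softmax over generate-mode and copy-mode scores.

    gen_scores:    score of each sign language vocabulary token
    source_tokens: input tokens, one per encoder position
    copy_scores:   copy score of each encoder position
    """
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max(list(gen_scores.values()) + list(copy_scores))
    mass: Dict[str, float] = defaultdict(float)
    for token, score in gen_scores.items():
        mass[token] += math.exp(score - m)
    # A source token may occur at several positions; their masses add up.
    for token, score in zip(source_tokens, copy_scores):
        mass[token] += math.exp(score - m)
    z = sum(mass.values())
    return {token: value / z for token, value in mass.items()}


gen_scores = {"HOUSE": 1.0, "GO": 0.5, "<unk>": 0.0}
source_tokens = ["Minsu", "goes", "home"]
copy_scores = [2.0, -1.0, 0.0]
probs = final_token_probs(gen_scores, source_tokens, copy_scores)
best = max(probs, key=probs.get)  # token with the highest probability
```

In this toy input the proper noun "Minsu" is absent from the vocabulary but has the highest copy score, so it is selected by copying, which is the situation a copy mechanism is meant to handle.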

The step of extracting the output tokens is characterized in that the probability of an output token for generating the sign language gloss is calculated on the basis of the calculated scores, and a selective read value, which supplies copy-related information when predicting the next output token, is calculated on the basis of the following equation.

[Equation]

(Here, the term is calculated through the following equation.)

[Equation]

If the condition holds,

Otherwise, the value is 0.

(Here, K is calculated through the following equation.)

[Equation]
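Because the equations are not reproduced in this text, the following is only a sketch under the assumption that the selective read follows the CopyNet formulation: encoder positions whose input token equals the previously output token receive a weight equal to their copy probability divided by a normalizer K (the sum of those copy probabilities), all other positions receive weight 0, and the selective read is the weighted sum of the encoder result vectors. All names and numbers are hypothetical.

```python
from typing import List


def selective_read(prev_token: str,
                   source_tokens: List[str],
                   copy_probs: List[float],
                   encoder_states: List[List[float]]) -> List[float]:
    """Weighted sum of encoder states at positions matching prev_token."""
    dim = len(encoder_states[0])
    # K: total copy probability over positions that match the previous output.
    k = sum(p for tok, p in zip(source_tokens, copy_probs) if tok == prev_token)
    read = [0.0] * dim
    if k == 0.0:
        # No matching position: every weight is 0, so the read is a zero vector.
        return read
    for tok, p, h in zip(source_tokens, copy_probs, encoder_states):
        if tok == prev_token:
            weight = p / k
            for d in range(dim):
                read[d] += weight * h[d]
    return read


source = ["a", "b", "a"]
copy_probs = [0.2, 0.5, 0.6]
states = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0]]
read_a = selective_read("a", source, copy_probs, states)
read_z = selective_read("z", source, copy_probs, states)
```

For the previous token "a", the two matching positions get weights 0.25 and 0.75, so the read is approximately [2.5, 1.5]; for a token that never occurs in the input, the read is the zero vector.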

The step of extracting the sign language tokens is characterized in that the sentence type of the natural language text is estimated on the basis of the mixed feature vector, a non-manual symbol corresponding to the estimated sentence type is extracted, and the extracted non-manual symbol is embedded in the sign language token.

The step of extracting the output tokens is characterized in that a speed index for performing the sign language gloss in sign language is derived according to the estimated sentence type, the derived speed index is embedded in the sign language token, and a character representing the speed index is included in the sign language gloss.

The step of extracting the output tokens is characterized in that the sentence type of the natural language text is identified on the basis of the linguistic features of the natural language text contained in the mixed feature vector, and the speed index is determined on the basis of the identified sentence type.
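The sentence-type handling can be sketched as a lookup from sentence type to a non-manual symbol and a speed index, followed by annotation of the gloss. Everything concrete below is an assumption for illustration: the source estimates the sentence type from the mixed feature vector, whereas this sketch substitutes a simple punctuation heuristic, and the table values, the bracket notation, and the `<speed:N>` marker are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Prosody:
    non_manual: str   # non-manual symbol attached to each gloss token
    speed_index: int  # relative signing speed


# Hypothetical mapping; the source does not list concrete values.
SENTENCE_TYPE_TABLE: Dict[str, Prosody] = {
    "declarative": Prosody("neutral", 1),
    "interrogative": Prosody("raised_eyebrows", 2),
    "exclamatory": Prosody("wide_eyes", 3),
}


def classify_sentence(text: str) -> str:
    """Stand-in classifier based on final punctuation only."""
    stripped = text.rstrip()
    if stripped.endswith("?"):
        return "interrogative"
    if stripped.endswith("!"):
        return "exclamatory"
    return "declarative"


def annotate_gloss(gloss_tokens: List[str], sentence_type: str) -> List[str]:
    """Attach the non-manual symbol to each token and append a speed marker."""
    prosody = SENTENCE_TYPE_TABLE[sentence_type]
    annotated = [f"{g}[{prosody.non_manual}]" for g in gloss_tokens]
    return annotated + [f"<speed:{prosody.speed_index}>"]


annotated = annotate_gloss(["YOU", "EAT"], classify_sentence("Did you eat?"))
```

Keeping the speed index as an explicit marker in the gloss string mirrors the statement that a character representing the speed index is included in the sign language gloss.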

To solve the technical problem described above, the present invention also proposes a computer program recorded on a recording medium for executing the sign language gloss translation method. The computer program may be combined with a computing device that includes a memory, a transceiver, and a processor that processes instructions resident in the memory. The computer program may be a computer program recorded on a recording medium for executing a sign language translation method that includes: receiving, by the processor, natural language text; generating, by the processor, a vector corresponding to the natural language text by encoding the natural language text; and generating, by the processor, a sign language gloss matching the natural language text by decoding the vector through a transformer model trained in advance by machine learning on a data set of natural language and sign language matched with the natural language, wherein the step of generating the sign language gloss generates the sign language gloss by copying and selectively using tokens generated in the encoding process.

Details of other embodiments are included in the detailed description and the drawings.

According to embodiments of the present invention, natural language can be converted into sign language gloss with high accuracy through an artificial intelligence (AI) trained in advance by machine learning on a data set of natural language and sign language matched with the natural language.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description of the claims.

FIG. 1 is a configuration diagram of a sign language translation system according to an embodiment of the present invention.
FIG. 2 is a logical configuration diagram of a translation server according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for explaining the function of a data preprocessing unit according to an embodiment of the present invention.
FIGS. 4 and 5 are exemplary diagrams for explaining a first artificial intelligence according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram for explaining the function of a sign language gloss generation unit according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram for explaining a second artificial intelligence according to an embodiment of the present invention.
FIG. 8 is a hardware configuration diagram of a translation server according to an embodiment of the present invention.
FIG. 9 is a flowchart for explaining a translation method according to an embodiment of the present invention.
FIG. 10 is a flowchart for explaining a sign language video generation step according to an embodiment of the present invention.
FIG. 11 is an exemplary diagram for explaining a sign language video generation method according to an embodiment of the present invention.
FIG. 12 is an exemplary diagram for explaining a translation method according to an embodiment of the present invention.

It should be noted that the technical terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Unless specifically defined otherwise in this specification, the technical terms used here should be interpreted in the sense generally understood by a person of ordinary skill in the art to which the present invention belongs, and should be interpreted neither too broadly nor too narrowly. If a technical term used in this specification is an incorrect term that fails to express the idea of the present invention accurately, it should be replaced with a technical term that a person skilled in the art can understand correctly. General terms used in the present invention should be interpreted according to their dictionary definitions or according to the surrounding context, and should not be interpreted too narrowly.

Singular expressions used in this specification include plural expressions unless the context clearly indicates otherwise. In this application, terms such as 'comprises' or 'has' should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be absent, and additional components or steps may be included.

Terms containing ordinal numbers, such as first and second, may be used in this specification to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is said to be 'connected' or 'coupled' to another component, it may be directly connected or coupled to that other component, or other components may be present in between. In contrast, when a component is said to be 'directly connected' or 'directly coupled' to another component, it should be understood that no other component is present in between.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the drawing, and redundant descriptions of them are omitted. In describing the present invention, detailed descriptions of related known technology are omitted where they could obscure the gist of the invention. The accompanying drawings are intended only to make the idea of the present invention easy to understand and should not be construed as limiting it. The idea of the present invention should be construed as extending beyond the accompanying drawings to all modifications, equivalents, and substitutes.

To overcome these limitations, the present invention proposes various means for translating natural language into sign language with high accuracy.

FIG. 1 is a configuration diagram of a sign language translation system according to an embodiment of the present invention.

Referring to FIG. 1, a sign language translation system (1) according to an embodiment of the present invention may include at least one terminal (100a, 100b, 100c, …, 100n; 100) and a translation server (200).

The components of the sign language translation system (1) according to an embodiment of the present invention merely represent functionally distinct elements. In an actual physical environment, two or more components may be implemented together as one, or a single component may be implemented as separate parts.

Describing each component, the terminal (100) is a device that receives natural language text from a user, or outputs sign language text or sign language video translated by the translation server (200) and provides it to the user.

Here, natural language may be any language that humans use to communicate in daily life. In particular, natural language may be the language of any of various countries, such as Korean, English, German, Spanish, French, or Italian. More specifically, natural language may include the colloquial style, the literary style, and other styles of each such language.

The terminal (100) may include an input device for receiving natural language from the user and an output device for outputting the sign language gloss or sign language video translated by the translation server (200).

The terminal (100) may be any device that can transmit and receive data to and from other devices, including the translation server (200), and perform operations based on the transmitted and received data. For example, the terminal (100) may be either a user equipment (UE) as defined by the 3rd Generation Partnership Project (3GPP) or a mobile station (MS) as defined by the Institute of Electrical and Electronics Engineers (IEEE).

However, the terminal (100) is not limited to these, and may be a stationary computing device such as a desktop, workstation, or server, or a mobile computing device such as a laptop, tablet, phablet, portable multimedia player (PMP), personal digital assistant (PDA), or e-book reader.

As the next component, the translation server (200) may be a device that receives natural language text from the terminal (100), translates the received natural language text into at least one of a sign language gloss and a sign language video, and provides at least one of the translated sign language gloss and sign language video to the terminal (100).

The translation server (200) may receive natural language text from the terminal (100) and encode the received natural language text to generate a vector corresponding to the natural language text.

Specifically, the translation server (200) may tokenize the natural language text input from the terminal (100), preprocess the text by cleaning and normalizing it as appropriate before and after tokenization, and compress the preprocessed tokens into a single vector.
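The cleaning, normalization, and tokenization steps can be sketched as follows. The specific rules (Unicode NFKC normalization, lowercasing, dropping symbols other than sentence-final punctuation, whitespace collapsing) are assumptions chosen for illustration; the source does not specify them, and the compression of tokens into a single vector is left out.

```python
import re
import unicodedata
from typing import List


def preprocess(text: str) -> List[str]:
    """Normalize, clean, and tokenize natural language text."""
    # Normalization: unify Unicode compatibility forms and letter case.
    text = unicodedata.normalize("NFKC", text).lower()
    # Cleaning: replace everything except word characters, whitespace,
    # and sentence-final punctuation with a space.
    text = re.sub(r"[^\w\s?!.]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[?!.]", text)


tokens = preprocess("  Hello,   WORLD!! ")
```

Sentence-final punctuation is deliberately kept as tokens here, since it can carry information about sentence type that later steps may use.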

The translation server (200) may also decode the vector through an artificial intelligence (AI) trained in advance by machine learning on a data set of natural language and sign language matched with the natural language, generate a sign language gloss matching the natural language text, and provide it to the terminal (100).

The translation server (200) may then generate a sign language video from the translated sign language gloss and provide it to the terminal (100).

The translation server (200) may be any device that can transmit and receive data to and from the terminal (100) and perform operations based on the transmitted and received data. For example, the translation server (200) may be a stationary computing device such as a desktop, workstation, or server, but is not limited to these.

The specific configuration and operation of the translation server (200) having these characteristics are described later with reference to FIGS. 2 to 7.

The terminal (100) and the translation server (200) that make up the sign language translation system (1) described above may transmit and receive data over a network that combines one or more of a secure line directly connecting the devices, a public wired communication network, and a mobile communication network.

For example, the public wired communication network may include one or more of Ethernet, x Digital Subscriber Line (xDSL), Hybrid Fiber Coax (HFC), and Fiber To The Home (FTTH), but is not limited to these.

The mobile communication network may include one or more of Code Division Multiple Access (CDMA), Wideband CDMA (WCDMA), High Speed Packet Access (HSPA), Long Term Evolution (LTE), and 5th generation mobile telecommunication, but is not limited to these either.

The configuration of the translation server (200) having the characteristics described above is now described in more detail.

FIG. 2 is a logical configuration diagram of a translation server according to an embodiment of the present invention, FIG. 3 is an exemplary diagram for explaining the function of a data preprocessing unit according to an embodiment of the present invention, FIGS. 4 and 5 are exemplary diagrams for explaining a first artificial intelligence according to an embodiment of the present invention, FIG. 6 is an exemplary diagram for explaining the function of a sign language gloss generation unit according to an embodiment of the present invention, and FIG. 7 is an exemplary diagram for explaining a second artificial intelligence according to an embodiment of the present invention.

First, referring to FIG. 2, the translation server 200 according to an embodiment of the present invention may include a communication unit 205, an input/output unit 210, a storage unit 215, a data preprocessing unit 220, a sign language gloss generation unit 225, and a sign language video generation unit 230.

These components of the translation server 200 merely represent functionally distinct elements. In an actual physical environment, two or more components may be implemented as one integrated component, or a single component may be implemented as several separate components.

To describe each component, the communication unit 205 may transmit and receive data to and from the terminal 100.

Specifically, the communication unit 205 may receive natural language text from the terminal 100, and may transmit to the terminal 100 at least one of a sign language gloss and a sign language video into which the received natural language text has been translated.

Next, the input/output unit 210 may receive commands from an administrator or output operation results through a user interface (UI). Here, the administrator may be referred to as a service provider that provides the translation service, but is not limited thereto.

Specifically, the input/output unit 210 may receive from the administrator a data set for training the artificial intelligence. For example, to train the first artificial intelligence, the input/output unit 210 may receive a data set of natural language sentences in various forms. To train the second artificial intelligence, the input/output unit 210 may receive a data set of natural language and the sign language matched with that natural language.

In addition, the input/output unit 210 may output at least one of a result value generated by the data preprocessing unit 220, a sign language gloss generated by the sign language gloss generation unit 225, and a sign language video generated by the sign language video generation unit 230.

Next, the storage unit 215 may store data necessary for the operation of the translation server 200.

Specifically, the storage unit 215 may store a database that is periodically updated by the data preprocessing unit 220, the sign language gloss generation unit 225, and the sign language video generation unit 230. The storage unit 215 may also store data sets for training the artificial intelligence (AI), as well as the artificial intelligence models used by the data preprocessing unit 220, the sign language gloss generation unit 225, and the sign language video generation unit 230.

Next, the data preprocessing unit 220 may encode the natural language text received from the terminal 100 to generate a vector corresponding to the natural language text. That is, the data preprocessing unit 220 may preprocess the received natural language text so that the sign language gloss generation unit 225 can generate a sign language gloss from it.

Specifically, the data preprocessing unit 220 may generate first tokens by tokenizing each word of the natural language text. For example, as shown in FIG. 3, when the data preprocessing unit 220 receives the seven-word Korean sentence "죽음기와 발명품이 잔뜩 우리 아이들과의 과학 놀이터", it may tokenize each word in the sentence to generate first tokens such as "a1, a2, …, a7". To improve the performance of the artificial intelligence, before generating the first tokens the data preprocessing unit 220 may detect words in the natural language text that carry two or more meanings and insert spaces into the detected words so that they are split by unit of meaning.
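As a minimal sketch of the word-level tokenization described above, the following Python snippet splits the example sentence on whitespace and labels the resulting tokens a1 to a7. Whitespace splitting is an assumption made for illustration, since the description does not name a tokenizer, and `tokenize_words` is a hypothetical helper.

```python
def tokenize_words(text: str) -> list[str]:
    """Split text into word tokens on whitespace.

    Illustrative stand-in only: the description does not specify the tokenizer.
    """
    return text.split()


sentence = "죽음기와 발명품이 잔뜩 우리 아이들과의 과학 놀이터"
first_tokens = tokenize_words(sentence)

# Label the tokens a1..a7 in the style used in the description.
labeled = {f"a{i}": tok for i, tok in enumerate(first_tokens, start=1)}
```

A production system for Korean would more likely rely on a morphological analyzer or a subword tokenizer; the sketch only shows the shape of the first tokens.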

In addition, the data preprocessing unit 220 may input the generated first tokens into the first artificial intelligence to perform contextual embedding, that is, embedding that reflects context information. In other words, the data preprocessing unit 220 may use an artificial intelligence pre-trained on natural language sentences to generate a first context vector that contains the context information corresponding to the first tokens. For example, the data preprocessing unit 220 may input first tokens such as "a1, a2, …, a7", generated by tokenizing the natural language text, into the first artificial intelligence to generate a first context vector containing "h1, h2, …, h7". Here, the first context vector may be the last hidden layer computed by the first artificial intelligence.

For example, the data preprocessing unit 220 may generate the first context vector using an artificial intelligence (AI) based on the BERT (Bidirectional Encoder Representations from Transformers) model.

In more detail, referring to FIGS. 4 and 5, the BERT model is a transformer-based model that uses only the encoder. Unlike a general transformer, the BERT model takes an input composed of token embeddings, position embeddings of the tokens, and segment embeddings.
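The three-part input described above can be sketched as follows. In BERT the token, position, and segment embeddings are summed element-wise for each position; the table sizes, the random initialization, and the helper names `make_table` and `bert_input` are assumptions chosen only to keep the example small.

```python
import random


def make_table(rows: int, dim: int, rng: random.Random) -> list[list[float]]:
    """Build a small random embedding table (a stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def bert_input(token_ids, segment_ids, tok_tab, pos_tab, seg_tab):
    """Sum token, position, and segment embeddings for every position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        out.append([a + b + c
                    for a, b, c in zip(tok_tab[tok], pos_tab[pos], seg_tab[seg])])
    return out


rng = random.Random(0)
dim = 4
tok_tab = make_table(10, dim, rng)   # hypothetical vocabulary of 10 tokens
pos_tab = make_table(8, dim, rng)    # hypothetical maximum length of 8
seg_tab = make_table(2, dim, rng)    # two segments, as in BERT

token_ids = [3, 7, 1]
segment_ids = [0, 0, 0]
x = bert_input(token_ids, segment_ids, tok_tab, pos_tab, seg_tab)
```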

The BERT model may be composed of a plurality of encoder blocks. The base BERT model consists of 12 encoder blocks and the large BERT model of 24 encoder blocks, but the configuration is not limited thereto. Each encoder block takes the output of the previous block as its input, and the encoders in the BERT model may be arranged so that processing is repeated recursively as many times as there are encoder blocks. The output of each encoder block may be passed through a residual connection each time.

The multi-head attention that makes up each encoder block may, as in Equation 1 below, compute attention h times using different weight matrices and then output the concatenation of the results.

[Equation 1]

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_h]\, W^O$$

Here, $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$. Q is the hidden state of the decoder, K is the hidden state of the encoder, and V is the normalized weight obtained by applying attention to K. The scaled dot-product attention over Q, K, and V may be computed by Equation 2 below.

[Equation 2]

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
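Equation 2 can be illustrated with a small, dependency-free sketch. The function below computes scaled dot-product attention for matrices given as nested lists; the function names and the toy inputs are assumptions for illustration, not part of the described system.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major nested lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Two identical keys receive equal weight, so the output is the mean of V.
result = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Multi-head attention as in Equation 1 would run this function h times on differently projected copies of Q, K, and V, concatenate the h outputs per position, and multiply by an output matrix.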

The feed-forward network (FFN) that receives the attention result consists of two linear transformations and may be implemented based on Equation 3 below with GELU (Gaussian Error Linear Units) applied as the activation, that is, with GELU taking the place of the max(0, ·) rectifier that appears in the equation as written.

[Equation 3]

$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$
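Equation 3 is written with the max(0, ·) rectifier, while the accompanying text mentions GELU. The sketch below therefore takes the activation as a parameter so that both readings can be compared on the same toy weights. All names and numbers are illustrative assumptions.

```python
import math


def relu(x: float) -> float:
    """The max(0, x) rectifier that appears in Equation 3 as written."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Compute x W + b, where W has shape (len(x), len(b))."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def ffn(x, W1, b1, W2, b2, activation=relu):
    """Two linear transformations with a non-linearity in between."""
    hidden = [activation(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)


W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0], [1.0]]
b2 = [0.5]

out_relu = ffn([1.0, -2.0], W1, b1, W2, b2, activation=relu)
out_gelu = ffn([1.0, -2.0], W1, b1, W2, b2, activation=gelu)
```

With these weights the rectifier zeroes the negative hidden unit, whereas GELU lets a small negative value through, so the two outputs differ slightly.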

In addition, the data preprocessing unit 220 may generate second tokens by tokenizing each word of the natural language text received from the terminal 100 together with the linguistic features of each word. Here, the second tokens may be generated by embedding the natural language text based on the results of part-of-speech (POS) analysis and named entity recognition (NER).

For example, the data preprocessing unit 220 may divide the natural language text into morphemes based on a pointwise prediction model, a probabilistic model, or a neural network based model, and then tag each morpheme with its part of speech.

The data preprocessing unit 220 may also recognize the named entities in the natural language text and classify the type of each recognized named entity. That is, the data preprocessing unit 220 may recognize which type each word in the natural language text belongs to.

The data preprocessing unit 220 may embed the second tokens to generate a second context vector. That is, the data preprocessing unit 220 may generate the second context vector by converting the second tokens into real-valued vectors of a fixed dimension.

Thereafter, the data preprocessing unit 220 may generate a mixed feature vector by concatenating (concat) the first context vector and the second context vector generated as described above, and may pass the generated mixed feature vector to the sign language gloss generation unit 225.

For example, the data preprocessing unit 220 may mix a first context vector containing "h1, h2, …, h7" with a second context vector containing "z1, z2, …, z8" to generate a mixed feature vector containing "x1, x2, …, x7".
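One way to read the mixing step is a per-position concatenation along the feature dimension. The sketch below assumes that the two context sequences have already been aligned to the same length; how the lengths are reconciled is not stated in the description, and `mix_features` is a hypothetical helper.

```python
def mix_features(first_ctx, second_ctx):
    """Concatenate the i-th vectors of both sequences into one mixed vector x_i.

    Assumes both sequences are aligned position by position.
    """
    if len(first_ctx) != len(second_ctx):
        raise ValueError("context sequences must be aligned to the same length")
    return [h + z for h, z in zip(first_ctx, second_ctx)]


first_ctx = [[0.1, 0.2], [0.3, 0.4]]   # stand-ins for h1, h2
second_ctx = [[9.0], [8.0]]            # stand-ins for z1, z2
mixed = mix_features(first_ctx, second_ctx)
```

Each mixed vector has the combined width of its two inputs, here 2 + 1 = 3.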

Here, the data preprocessing unit 220 may pass the generated mixed feature vector to the sign language gloss generation unit 225 so that it is used as the input of the second artificial intelligence and, at the same time, is used to replace some of the result values produced by the second artificial intelligence.

Next, the sign language gloss generation unit 225 may use an artificial intelligence (AI) pre-trained by machine learning on a data set of natural language and the sign language matched with that natural language to decode the mixed feature vector received from the data preprocessing unit 220 and generate a sign language gloss matching the natural language text.

Specifically, as shown in FIG. 7, the sign language gloss generation unit 225 may correspond to the decoder of a transformer model, which decodes the mixed feature vector received from the data preprocessing unit 220. This transformer model may be composed of a plurality of decoder blocks.

The first sub-layer of each decoder block, masked multi-head self-attention, performs the same operation as the multi-head attention sub-layer of the encoder described above, except that masking is applied to the attention score matrix. That is, the masked multi-head self-attention sub-layer may apply masking so that attention scores can be consulted only for words that precede the word currently being processed.
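The masking described above can be sketched as follows. The sketch uses the usual causal convention, in which each position may attend to itself and to earlier positions; that convention, along with the helper names, is an assumption rather than something the description spells out.

```python
import math

NEG_INF = float("-inf")


def causal_mask(scores):
    """Replace scores for later positions (j > i) with -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]


def softmax(xs):
    """Stable softmax; entries equal to -inf receive probability 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
weights = [softmax(row) for row in causal_mask(scores)]
```

After masking, the first row places all of its weight on the first position, and only the last row distributes weight over all three positions.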

The decoder then receives the mixed feature vector, which is the output of the encoder, through its second sub-layer, multi-head attention. It passes the received mixed feature vector through the multi-head attention and through the third sub-layer, the feed-forward network (FFN), and then through a linear layer and a softmax layer, and may output the sign language token with the highest relevance from the learned sign language word database.

Here, the linear layer is a fully-connected network that may project the vector last output by the decoder onto a much larger vector called the logits vector. Each cell of the logits vector may hold the score of one word.

The softmax layer converts these scores into probabilities, and the word corresponding to the cell with the highest probability may be output as the final sign language gloss.
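The linear-plus-softmax output stage can be sketched on a toy scale. The three-word vocabulary, the weights, and the function names below are hypothetical and serve only to show how a decoder vector becomes logits, then probabilities, then a chosen gloss word.

```python
import math


def softmax(logits):
    """Convert scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_logits(vec, weight, bias):
    """Fully-connected layer: one weight row (and one score) per vocabulary word."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def pick_gloss(vec, weight, bias, vocab):
    """Return the most probable vocabulary word and its probability."""
    probs = softmax(project_to_logits(vec, weight, bias))
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]


vocab = ["과학", "놀다", "곳"]                      # hypothetical output vocabulary
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # hypothetical learned weights
bias = [0.0, 0.0, 0.0]

word, prob = pick_gloss([0.2, 0.9], weight, bias, vocab)
```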

Here, the sign language gloss generation unit 225 extracts the sign language token matching the mixed feature vector, but may replace that sign language token with one of the tokens contained in the mixed feature vector.

That is, when generating a sign language gloss, the sign language gloss generation unit 225 may find the vocabulary needed for the output in the output of the data preprocessing unit 220 and copy it, in order to address both the out-of-vocabulary problem, in which a needed word is missing from the output vocabulary, and the problem that proper nouns receive low output probabilities. To do so, the sign language gloss generation unit 225 may equip the decoder with a separate copy attention. When predicting the output word at each time step of decoding, it may then compute, alongside the probabilities of the words in the output vocabulary, the probability of outputting unchanged the word with the highest copy attention score in the mixed feature vector sequence.

In a general translation model, the text sequence given as input and the text sequence produced as output are in different languages.

Sign language translation differs in that the input and the output both come from the same Korean-based data set.

As a result, a proper noun used in the input appears unchanged in the output, and there are also many cases in which an input token that is not a proper noun appears unchanged in the output.

Taking these characteristics into account, the sign language gloss generation unit 225 may generate the sign language gloss by copying and selectively using the tokens generated in the encoding process.

To this end, the translation server may use Equation 4 below to calculate a score for using an input token generated in the encoding process.

[Equation 4]

That is, to calculate the score for using an input token, the result vector (hidden vector) of the encoder for the input token at position j is embedded through $W_c$ and the non-linear function $\sigma$, and the score is obtained as the inner product of the result with $s_t$.
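The body of Equation 4 is not reproduced in this text. Based only on the verbal description above, one plausible form, written in CopyNet-style notation, is given below. The symbols $h_j$ (the encoder hidden vector for input token $x_j$) and $\psi_c$ (the copy score) are assumed names, and $s_t$ is taken to be the decoder state at time $t$.

```latex
\psi_c(y_t = x_j) = \sigma\!\left(h_j^{\top} W_c\right) s_t
```

Under this reading, $W_c$ maps the encoder hidden vector into the space of the decoder state, $\sigma$ is applied element-wise, and the inner product with $s_t$ yields a scalar score for copying $x_j$ at step $t$.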

In addition, the sign language gloss generation unit 225 may calculate the score of a sign language token through Equation 5 below.

[Equation 5]

Based on the scores calculated through Equations 4 and 5, the sign language gloss generation unit 225 may calculate the probability of each output token used to generate the sign language gloss.

Here, when the token generated at time t is denoted $G_t$, the calculation divides into the four cases given in Equations 6 to 9 below. The sign language gloss generation unit 225 calculates the final token output probabilities using these equations and may output the token with the highest probability.

[Equation 6]

[Equation 7]

[Equation 8]

[Equation 9]
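The bodies of Equations 6 to 9 are not reproduced in this text. As a hedged illustration of how generation scores and copy scores can be combined, the sketch below follows the CopyNet-style approach of normalizing both sets of scores with one shared softmax and adding the copy mass to the generation mass when a token is available from both sources. The resulting four situations (token only in the output vocabulary, only in the input, in both, in neither) are an assumption about what the four cases cover, and every name and value is hypothetical.

```python
import math


def copy_generate_probs(gen_scores, copy_scores, source_tokens):
    """Combine generation and copy scores into one distribution.

    gen_scores: dict mapping each output-vocabulary token to its score.
    copy_scores: list of scores aligned with source_tokens.
    """
    all_scores = list(gen_scores.values()) + list(copy_scores)
    m = max(all_scores)
    z = sum(math.exp(s - m) for s in all_scores)   # shared normalizer

    probs = {}
    for tok, s in gen_scores.items():              # generate mode
        probs[tok] = probs.get(tok, 0.0) + math.exp(s - m) / z
    for tok, s in zip(source_tokens, copy_scores): # copy mode
        probs[tok] = probs.get(tok, 0.0) + math.exp(s - m) / z
    return probs                                   # tokens in neither set are absent


gen_scores = {"과학": 1.0, "놀다": 0.0}       # hypothetical output vocabulary scores
source_tokens = ["과학", "홍길동"]            # "홍길동" is a hypothetical proper noun
copy_scores = [1.0, 2.0]

probs = copy_generate_probs(gen_scores, copy_scores, source_tokens)
best = max(probs, key=probs.get)
```

In this toy run the proper noun, which is missing from the output vocabulary, can still win through its copy score.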

In addition, the sign language gloss generation unit 225 may estimate the sentence type of the natural language text based on the mixed feature vector and extract the non-manual signals that correspond to the estimated sentence type. The sign language gloss generation unit 225 may then embed the extracted non-manual signals into the sign language tokens. Here, the sentence type may include at least one of declarative, interrogative, imperative, propositive (request), and exclamatory sentences. The sign language gloss generation unit 225 may identify the sentence type of the natural language text based on the linguistic features of the text contained in the mixed feature vector. However, the approach is not limited thereto, and the sign language gloss generation unit 225 may instead obtain the part-of-speech analysis and named entity recognition results from the data preprocessing unit 220 and use them to identify the sentence type.

That is, the sign language gloss generation unit 225 may derive, according to the estimated sentence type, a speed index for performing the sign language gloss as sign language, and embed the derived speed index into the sign language tokens. It may then include characters representing the speed index in the sign language gloss. For example, the sign language gloss generation unit 225 may insert a speed index between the words of the generated sign language gloss before outputting it. Here, the speed index may be a character representing a specific speed range: for example, 'a' for a fast speed, 'b' for a normal speed, and 'c' for a slow speed. The sign language gloss generation unit 225 may thus place the speed index for the signing motion at the front of the output sign language gloss, as in "(a)죽임기 또 발명 물건 크다 많다 아이 과학 놀다 곳", or mark a speed index character between the words, as in "죽임기(a)또(a)발명(b)발명(c)물건(a)크다(a)많다(b)아이(a) 과학(a)놀다(b)곳(c)", to support expressing the sign language gloss as signing motion.
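The two annotation styles described above can be sketched as simple string formatting. The mapping of 'a', 'b', and 'c' to fast, normal, and slow follows the description; the function names and the three-word sample gloss are assumptions for illustration.

```python
SPEED_INDEX = {"fast": "a", "normal": "b", "slow": "c"}


def prefix_speed(gloss_words, speed):
    """Place a single speed index at the front of the whole gloss."""
    return f"({SPEED_INDEX[speed]})" + " ".join(gloss_words)


def per_word_speed(gloss_words, speeds):
    """Attach a speed index after each gloss word."""
    if len(gloss_words) != len(speeds):
        raise ValueError("one speed label is required per gloss word")
    return "".join(f"{word}({SPEED_INDEX[s]})"
                   for word, s in zip(gloss_words, speeds))


gloss = ["과학", "놀다", "곳"]
whole = prefix_speed(gloss, "fast")
each = per_word_speed(gloss, ["fast", "normal", "slow"])
```

Here `whole` is "(a)과학 놀다 곳" and `each` is "과학(a)놀다(b)곳(c)", mirroring the two example formats in the description.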

Next, the sign language video generation unit 230 may generate a sign language video that matches the converted sign language gloss.

Specifically, the sign language video generation unit 230 may retrieve the pre-stored word-level sign language video that matches each word in the converted sign language gloss. It may then extract 2D keypoints from each frame of the retrieved word-level sign language videos. That is, the sign language video generation unit 230 may extract 2D keypoints from the word-level sign language videos using an artificial intelligence pre-trained on a sign language video data set annotated with 2D keypoints.

For example, the sign language video generation unit 230 may extract the 2D keypoints using the OpenPose model. The OpenPose model can recognize up to 130 keypoints of the body, hands, face, and feet in real time from a single image. It extracts 2D keypoints from an input image or video and can save either an image in which the keypoints are overlaid on the background image or an image containing only the keypoints, in formats such as JSON, XML, or video data.

In addition, the sign language video generation unit 230 may convert the extracted 2D keypoints into 3D joints. To do so, it may use an artificial intelligence trained to minimize the loss between an image in which the 3D joints are projected onto the 2D image and the 2D keypoints extracted by the artificial intelligence. Here, the sign language video generation unit 230 may select, from among the 2D keypoints, those corresponding to the metacarpophalangeal joints and convert them into 3D joints. That is, the sign language video generation unit 230 may use the metacarpophalangeal joints rather than all 21 joints of the hand.
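The training objective described above compares projected 3D joints with detected 2D keypoints. A minimal sketch of such a reprojection loss follows, assuming a simple pinhole camera and a mean squared error; neither the camera model nor the error measure is specified in the description, and all names and values are hypothetical.

```python
def project(joint, focal=1.0, cx=0.0, cy=0.0):
    """Project a 3D joint (X, Y, Z) onto the image plane with a pinhole model."""
    X, Y, Z = joint
    return (focal * X / Z + cx, focal * Y / Z + cy)


def reprojection_loss(joints_3d, keypoints_2d, focal=1.0):
    """Mean squared distance between projected joints and 2D keypoints."""
    total = 0.0
    for joint, (u, v) in zip(joints_3d, keypoints_2d):
        pu, pv = project(joint, focal)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(joints_3d)


joints_3d = [(1.0, 2.0, 2.0), (0.0, 0.0, 1.0)]
keypoints_2d = [(0.5, 1.0), (0.0, 0.0)]

perfect = reprojection_loss(joints_3d, keypoints_2d)
shifted = reprojection_loss(joints_3d, [(0.5, 1.0), (1.0, 0.0)])
```

A lifting model would be trained to drive this quantity toward zero, so that the 3D joints it predicts stay consistent with the 2D evidence.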

The sign language video generation unit 230 may generate motion information for the 3D joints based on the converted 3D joints. That is, using an artificial intelligence pre-trained on the correlation between the metacarpophalangeal joints and the rotation angles of the wrist and elbow, it may estimate the rotation angle of the wrist and the rotation angle of the elbow that correspond to the 3D joints. The artificial intelligence for estimating these rotation angles may be trained by relating the rotation angle features of the wrist to features of the body, including the elbow.

Thereafter, the sign language video generation unit 230 may generate a video for each word of the sign language gloss based on the generated 3D joints and motion information. That is, it may generate a 3D mesh based on the 3D joints and motion information and project the generated 3D mesh onto a 2D image to convert it into video. For example, the sign language video generation unit 230 may generate a video of a virtual human performing sign language.

The sign language video generation unit 230 may then combine the per-word videos to generate a sentence-level sign language video. To prevent motion judder between consecutive per-word sign language videos, it may use motion interpolation to generate at least one image between consecutive per-word videos.

When generating at least one image between consecutive per-word videos, the sign language video generation unit 230 may insert a pre-stored preliminary motion image between the last frame of the preceding first word video and the first frame of the following second word video. It may then generate at least one image between the last frame of the first word video and the first frame of the second word video, using the preliminary motion image as the reference. That is, rather than simply inserting an image predicted from the relationship between the per-word videos, the sign language video generation unit 230 inserts a preliminary motion image between the per-word videos and then applies motion interpolation between each per-word video and the preliminary motion, which can produce a more natural sign language video.
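The bridging step can be sketched as interpolation through the preliminary motion pose. The sketch assumes that each pose is a flat list of joint coordinates and that linear interpolation is used; the description only says motion interpolation without naming a method, and `lerp_pose` and `bridge_frames` are hypothetical helpers.

```python
def lerp_pose(p, q, t):
    """Linearly interpolate between poses p and q at fraction t in [0, 1]."""
    return [a + (b - a) * t for a, b in zip(p, q)]


def bridge_frames(last_pose, prep_pose, first_pose, steps):
    """Frames to insert between two word videos, passing through prep_pose.

    `steps` in-between frames are generated on each side of the preliminary pose.
    """
    frames = []
    for k in range(1, steps + 1):
        frames.append(lerp_pose(last_pose, prep_pose, k / (steps + 1)))
    frames.append(list(prep_pose))
    for k in range(1, steps + 1):
        frames.append(lerp_pose(prep_pose, first_pose, k / (steps + 1)))
    return frames


last_pose = [0.0, 0.0]     # last frame of the first word video
prep_pose = [1.0, 1.0]     # pre-stored preliminary motion pose
first_pose = [2.0, 0.0]    # first frame of the second word video

frames = bridge_frames(last_pose, prep_pose, first_pose, steps=1)
```

Interpolating toward the preliminary pose and then away from it keeps the transition anchored to a known resting posture instead of cutting straight from one word to the next.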

In addition, the sign language video generation unit 230 may identify the sentence type of the natural language text based on its linguistic features and determine how long the basic posture is held according to the identified sentence type. It may likewise determine the playback speed of the generated sign language video according to the identified sentence type.

Here, the sign language video generation unit 230 may obtain the part-of-speech analysis and named entity recognition results from the data preprocessing unit 220 and use them to identify the sentence type, or it may obtain the sentence type already analyzed by the sign language gloss generation unit 225.

For example, when the sentence type is identified as a propositive (request) sentence, the sign language video generation unit 230 may lengthen the time the basic posture is held or slow the playback speed of the sign language video so that the expression comes across as polite. In this way, the sign language video generation unit 230 does not simply output a motion video but may generate the motion video with non-manual signals taken into account.

In addition, based on the linguistic features of the natural language text, the sign language video generation unit 230 may identify the respective status of the speaker who wrote the natural language text and of the viewer who watches the generated sign language video, and may determine at least one of the preliminary motion hold time and the sign language video playback speed based on the identified status.

이하, 상술한 바와 같은 번역서버(200)의 논리적 구성 요소를 구현하기 위한 하드웨어에 대하여 보다 구체적으로 설명한다.Hereinafter, hardware for implementing the logical components of the translation server 200 as described above will be described in more detail.

FIG. 8 is a hardware configuration diagram of a translation server according to an embodiment of the present invention.

As shown in FIG. 8, the translation server 200 according to an embodiment of the present invention may include a processor 250, a memory 255, a transceiver 260, an input/output device 265, a data bus 270, and a storage 275.

Specifically, the processor 250 may implement the operations and functions of the translation server 200 based on instructions of the software 280a, resident in the memory 255, that implements the sign language gloss or sign language image translation method.

The software 280b that implements the translation method and is stored in the storage 275 may be loaded into the memory 255.

The transceiver 260 may transmit data to and receive data from a plurality of terminals 100.

The input/output device 265 may, according to commands of the processor 250, receive signals required for the operation of the translation server 200 or output computation results to the outside.

The data bus 270 is connected to each of the processor 250, the memory 255, the transceiver 260, the input/output device 265, and the storage 275, and may serve as a path for transferring signals between the components.

The storage 275 may store application programming interfaces (APIs), library files, resource files, and the like required to execute the software 280a that implements the translation method according to various embodiments of the present invention. The storage 275 may store the software 280b that implements the translation method according to various embodiments of the present invention. The storage 275 may also store the artificial intelligence and the data sets for training the artificial intelligence.

According to an embodiment of the present invention, the software 280a, 280b for implementing the sign language gloss translation method, resident in the memory 255 or stored in the storage 275, may be a computer program recorded on a recording medium in order to execute: a step in which the processor 250 receives natural language text; a step in which the processor 250 encodes the natural language text to generate a vector corresponding to the natural language text; and a step in which the processor 250 decodes the vector, through artificial intelligence (AI) pre-trained by machine learning on a data set of natural language and sign language matching the natural language, to generate a sign language gloss matching the natural language text.

According to another embodiment of the present invention, the software 280a, 280b for implementing the sign language image translation method, resident in the memory 255 or stored in the storage 275, may be a computer program recorded on a recording medium in order to execute: a step in which the processor 250 receives natural language text; a step in which the processor 250 converts the natural language text into a sign language gloss through artificial intelligence (AI) pre-trained by machine learning on a data set of natural language and sign language matching the natural language; and a step in which the processor 250 generates a sign language image matching the converted sign language gloss.

More specifically, the processor 250 may include, but is not limited to, one or more of a central processing unit (CPU), an application-specific integrated circuit (ASIC), a chipset, and a logic circuit.

The memory 255 may include, but is not limited to, one or more of a read-only memory (ROM), a random access memory (RAM), a flash memory, and a memory card.

The input/output device 265 may include, but is not limited to, one or more of input devices such as a button, a switch, a keyboard, a mouse, a joystick, and a touch screen, and output devices such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic LED (OLED) display, an active matrix OLED (AMOLED) display, a printer, and a plotter.

When the embodiments included in this specification are implemented in software, the method described above may be implemented as modules (processes, functions, and the like) that each perform one of the functions described above. Each module may reside in the memory 255 and be executed by the processor 250. The memory 255 may be located inside or outside the processor 250 and may be connected to the processor 250 by various well-known means.

Each component shown in FIG. 8 may be implemented by various means (for example, hardware, firmware, software, or a combination thereof). When implemented in hardware, an embodiment of the present invention may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.

When implemented in firmware or software, an embodiment of the present invention may be implemented in the form of modules, procedures, functions, and the like that perform the functions or operations described above, and may be recorded on a recording medium readable by various computer means. Here, the recording medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the computer software field. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as compact disk read-only memory (CD-ROM) and digital video disks (DVD); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. Such a hardware device may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Hereinafter, the operation of the translation server 200 described above will be described in more detail.

FIG. 9 is a flowchart illustrating a translation method according to an embodiment of the present invention, and FIG. 10 is a flowchart illustrating a sign language image generation step according to an embodiment of the present invention.

Referring to FIG. 9, in step S100 the translation server may receive natural language text from the terminal.

Next, in step S200, the translation server may preprocess the natural language text received from the terminal.

Specifically, the translation server may encode the natural language text received from the terminal to generate a vector corresponding to the natural language text. That is, the translation server may preprocess the received natural language text so that the sign language gloss can be generated in step S300.

To this end, the translation server may generate a first token by tokenizing each word of the natural language text. To improve the performance of the artificial intelligence, before generating the first token the translation server may detect words in the natural language text that carry two or more meanings and insert spaces so that each detected word is separated into semantic units.
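The spacing step described above could look roughly like the following sketch. It assumes a lookup table of multi-meaning words and whitespace tokenization; the table `COMPOUND_SPLITS`, its English stand-in entry, and both function names are hypothetical, and the specification does not say how such words are detected.

```python
# Hypothetical lexicon mapping a word that carries several meanings
# to its semantic units (an English stand-in is used for readability).
COMPOUND_SPLITS = {
    "signlanguage": ["sign", "language"],
}


def space_compounds(text: str) -> str:
    """Insert spaces so that each listed word is split into semantic units."""
    pieces = []
    for word in text.split():
        pieces.extend(COMPOUND_SPLITS.get(word, [word]))
    return " ".join(pieces)


def first_tokens(text: str) -> list:
    """Tokenize the re-spaced text on whitespace to obtain the first tokens."""
    return space_compounds(text).split()
```

Splitting before tokenization means each semantic unit receives its own token, which is the effect the paragraph above attributes to this step.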

In addition, the translation server may input the generated first token into a first artificial intelligence to perform contextual embedding, that is, embedding that reflects context information. In other words, the translation server may generate a first context vector containing context information corresponding to the first token through artificial intelligence pre-trained by machine learning on natural language sentences. Here, the first context vector may be the last hidden layer computed by the first artificial intelligence.

In addition, the translation server may generate a second token by tokenizing each word of the natural language text received from the terminal together with the linguistic features of each word. Here, the second token may be generated by embedding the natural language text based on the results of part-of-speech (POS) analysis and named entity recognition (NER).

In addition, the translation server may generate a second context vector by embedding the second token. That is, the translation server may generate the second context vector by converting the second token into a real-valued vector of fixed dimension.

Thereafter, the translation server may generate a mixed feature vector by concatenating the first context vector and the second context vector generated as described above.
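The concatenation of the two context vectors can be illustrated with a minimal sketch. A deterministic hash-based function stands in for both the pre-trained contextual encoder and the learned feature embedding, which is an assumption made purely so the example is self-contained; the function names and dimensions are likewise hypothetical.

```python
import hashlib


def toy_embedding(key: str, dim: int) -> list:
    """Deterministic stand-in for a learned embedding (not a real model)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()  # 32 bytes
    return [digest[i] / 255.0 for i in range(dim)]


def mixed_feature_vectors(words, pos_tags, ner_tags, ctx_dim=8, feat_dim=4):
    """Build one mixed feature vector per word by concatenation."""
    mixed = []
    for word, pos, ner in zip(words, pos_tags, ner_tags):
        # Stand-in for the first context vector (contextual encoder output).
        first_context = toy_embedding(word, ctx_dim)
        # Stand-in for the second context vector: a fixed-dimension embedding
        # of the word together with its POS and NER features.
        second_context = toy_embedding(f"{word}|{pos}|{ner}", feat_dim)
        mixed.append(first_context + second_context)  # list concatenation
    return mixed
```

Each resulting vector has `ctx_dim + feat_dim` components, so the decoder sees the contextual information and the linguistic features side by side.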

Next, in step S300, the translation server may receive the mixed feature vector generated in step S200 and generate a sign language gloss.

Here, the translation server may decode the mixed feature vector, through artificial intelligence (AI) pre-trained by machine learning on a data set of natural language and sign language matching the natural language, to generate a sign language gloss matching the natural language text.

Here, the translation server may extract sign language tokens matching the mixed feature vector, and may replace any extracted sign language token whose matching probability is lower than a preset value with one of the tokens included in the mixed feature vector. To do so, the translation server may calculate, for each token included in the mixed feature vector, the probability that it is suitable in place of the low-probability sign language token, and may substitute a token whose probability is equal to or greater than a preset value.

That is, when generating the sign language gloss, the translation server may find the vocabulary needed for the output in the output of step S200 and copy it, in order to address the out-of-vocabulary problem, in which a required word is absent from the output vocabulary, and the problem that proper nouns receive low output probabilities. Here, the translation server may provide the decoder with a separate copy attention so that, when predicting the output word at each time step of decoding, it calculates not only the probabilities of the words in the output vocabulary but also the probability of outputting, as is, the word with the highest copy attention score in the mixed feature vector sequence.
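The replacement of low-probability sign language tokens by copied source tokens can be sketched as follows. This is a simplified illustration under stated assumptions: the probabilities are taken as given inputs rather than computed by a model, a single threshold is used for both comparisons, and the function name and parameters are hypothetical.

```python
def replace_low_confidence(gloss_tokens, gloss_probs, source_tokens,
                           copy_probs, threshold=0.5):
    """Replace uncertain gloss tokens with well-fitting source tokens.

    copy_probs[i][j] is the assumed probability that source token j
    is suitable at output position i.
    """
    result = []
    for i, (token, prob) in enumerate(zip(gloss_tokens, gloss_probs)):
        if prob >= threshold:
            result.append(token)  # confident enough: keep the gloss token
            continue
        # Pick the source token that fits this position best.
        best_j = max(range(len(source_tokens)), key=lambda j: copy_probs[i][j])
        if copy_probs[i][best_j] >= threshold:
            result.append(source_tokens[best_j])  # copy from the input
        else:
            result.append(token)  # no suitable candidate: keep the original
    return result
```

For a proper noun missing from the output vocabulary, the generated token typically has a low probability while the matching source token has a high copy probability, so the source token is carried over unchanged.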

In addition, the translation server may estimate the sentence type of the natural language text based on the mixed feature vector, extract the non-manual symbol corresponding to the estimated sentence type, and embed the extracted non-manual symbol in the sign language token. Here, the sentence type may include at least one of a declarative, an interrogative, an imperative, a propositive, and an exclamatory sentence. The translation server may identify the sentence type of the natural language text based on the linguistic features of the natural language text included in the mixed feature vector.

That is, the translation server may derive, according to the estimated sentence type, a speed index for signing the sign language gloss, and may embed the derived speed index in the sign language token. The translation server may then include a character representing the speed index in the sign language gloss.

Then, in step S400, the translation server may generate a sign language image matching the converted sign language gloss.

Specifically, as shown in FIG. 10, in step S410 the translation server may extract pre-stored word sign language images matching each word included in the converted sign language gloss.

Next, in step S420, the translation server may extract 2D keypoints from each frame included in the extracted word sign language images. That is, the translation server may extract 2D keypoints from a word sign language image through artificial intelligence pre-trained by machine learning on a sign language image data set containing 2D keypoints. Here, the artificial intelligence for extracting 2D keypoints may be trained on a data set that includes the extracted 2D keypoints, which are its output, and an image in which the converted 3D joints, described below, are projected onto the 2D image.

Next, in step S430, the translation server may convert the extracted 2D keypoints into 3D joints. Here, the translation server may convert the extracted 2D keypoints into 3D joints through artificial intelligence trained to minimize a loss computed from the image in which the 3D joints are projected onto the 2D image and the 2D keypoints extracted by the artificial intelligence. The translation server may extract, from among the 2D keypoints, those corresponding to the metacarpophalangeal joints and convert them into 3D joints. That is, rather than using all 21 joints of the hand, the translation server may use the metacarpophalangeal joints.
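Selecting the metacarpophalangeal (MCP) joints from a 21-point hand can be sketched as below. The specification does not state how the 21 joints are indexed; the indices here assume the ordering used, for example, by MediaPipe Hands (wrist at 0, then four points per finger starting with the thumb), and the function name is hypothetical.

```python
# Assumed 21-point layout: 0 = wrist, 1-4 = thumb, 5-8 = index finger,
# 9-12 = middle finger, 13-16 = ring finger, 17-20 = little finger.
# Under that layout the MCP joints are at these indices.
MCP_INDICES = (2, 5, 9, 13, 17)


def select_mcp_keypoints(hand_keypoints):
    """Keep only the MCP keypoints of a 21-point hand."""
    if len(hand_keypoints) != 21:
        raise ValueError("expected 21 hand keypoints")
    return [hand_keypoints[i] for i in MCP_INDICES]
```

Reducing each hand from 21 points to 5 shrinks the input of the 2D-to-3D conversion, which is one plausible motivation for the choice described above.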

Next, in step S440, the translation server may generate motion information for the 3D joints based on the converted 3D joints. That is, the translation server may estimate the wrist rotation angle and the elbow rotation angle for the 3D joints through artificial intelligence pre-trained on the correlation between the metacarpophalangeal joints and the rotation angles of the wrist and the elbow. Here, the artificial intelligence for estimating the wrist and elbow rotation angles may be trained by relating the wrist rotation angle features to features of the body, including the elbow.

Next, in step S450, the translation server may generate a sign language image for each word of the sign language gloss based on the generated 3D joints and motion information. That is, the translation server may generate a 3D mesh based on the 3D joints and motion information, and may project the generated 3D mesh onto a 2D image to convert it into video. For example, the translation server may generate a video of a virtual human performing sign language.
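Projecting 3D mesh vertices onto a 2D image can be illustrated with a pinhole camera model. The specification does not name a projection model, so the pinhole model, the function name, and the default camera parameters below are all assumptions made for illustration.

```python
def project_points(points_3d, focal_length=1000.0, cx=320.0, cy=240.0):
    """Project 3D camera-space points to 2D pixel coordinates.

    Uses the pinhole model: u = f * x / z + cx, v = f * y / z + cy.
    """
    projected = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("point must be in front of the camera (z > 0)")
        projected.append((focal_length * x / z + cx,
                          focal_length * y / z + cy))
    return projected
```

Applying such a projection to every vertex of the mesh, frame by frame, yields the 2D positions from which the rendered video can be drawn.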

Then, in step S450, the translation server may generate a sentence-level sign language image by combining the sign language images for each word. Here, to prevent motion judder between the consecutive sign language images for each word, the translation server may generate at least one image between the consecutive word images through motion interpolation.

Here, in generating at least one image between the consecutive word images, the translation server may insert a pre-stored preliminary motion image between the last frame of the preceding first word image and the first frame of the following second word image. The translation server may then generate at least one image between the last frame of the first word image and the first frame of the second word image, using the preliminary motion image as a reference.

In addition, the translation server may identify the sentence type of the natural language text based on the linguistic features of the natural language text, and may determine the holding time of the basic posture according to the identified sentence type. Likewise, the translation server may determine the playback speed of the generated sign language image according to the identified sentence type.

In addition, the translation server may identify, based on the linguistic features of the natural language text, the respective social status of the speaker who wrote the natural language text and of the listener who views the generated sign language image, and may determine at least one of the preliminary motion holding time and the sign language image playback speed based on the identified status.

FIG. 11 is an exemplary diagram illustrating a sign language image generation method according to an embodiment of the present invention.

Referring to FIG. 11, to prevent motion judder between the consecutive sign language images for each word, the translation server may generate at least one image between the consecutive word images through motion interpolation.

As shown in FIG. 11, assuming that the last frame of the preceding first word ("technology") image is 'a' and the first frame of the following second word ("tradition") image is 'b', the translation server may insert a pre-stored preliminary motion image 'c' between the last frame 'a' of the first word image and the first frame 'b' of the second word image.

The translation server may then generate at least one image through motion interpolation between the last frame 'a' of the first word image and the first frame 'b' of the second word image, using the preliminary motion image 'c' as a reference.

In this way, rather than simply inserting an image predicted from the relationship between the word images, the translation server inserts the preliminary motion image between the word images and then performs motion interpolation between each word image and the preliminary motion, which makes it possible to generate a more natural sign language image.
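The transition through a preliminary motion can be sketched on joint coordinates. The specification does not say which interpolation is used, so linear interpolation is assumed here; poses are represented as lists of coordinate tuples, and the function names and the `steps` parameter are hypothetical.

```python
def lerp_pose(pose_p, pose_q, t):
    """Linearly interpolate between two poses (lists of coordinate tuples)."""
    return [tuple(a + (b - a) * t for a, b in zip(joint_p, joint_q))
            for joint_p, joint_q in zip(pose_p, pose_q)]


def transition_frames(last_a, first_b, prep_c, steps=2):
    """Return the in-between poses for a -> c -> b.

    The result excludes the end poses a and b and includes the
    preliminary pose c in the middle.
    """
    frames = []
    for k in range(1, steps + 1):
        frames.append(lerp_pose(last_a, prep_c, k / (steps + 1)))
    frames.append(list(prep_c))
    for k in range(1, steps + 1):
        frames.append(lerp_pose(prep_c, first_b, k / (steps + 1)))
    return frames
```

Interpolating toward and away from the preliminary pose 'c', instead of directly from 'a' to 'b', makes every transition pass through the same resting posture, which is the behavior FIG. 11 is described as illustrating.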

FIG. 12 is an exemplary diagram illustrating a translation method according to an embodiment of the present invention.

Referring to FIG. 12, in a typical translation model the text sequence given as input and the text sequence produced as output are in different languages.

In contrast, the translation method according to an embodiment of the present invention differs in that both the input and the output are drawn from the same Korean-based data set.

Accordingly, taking this characteristic into account, the translation server according to an embodiment of the present invention may generate the sign language gloss by copying tokens generated in the encoding process and using them selectively.

[Equation 4]

In addition, the translation server may calculate the score of a sign language token through Equation 5 below.

[Equation 5]

The translation server may calculate the probability of an output token for generating the sign language gloss based on the scores calculated through Equation 4 and Equation 5.

Here, when the token generated at time t is denoted G_t, the calculation is divided into the four cases given in Equations 6 to 9 below.

[Equation 6]

[Equation 7]

[Equation 8]

[Equation 9]

In calculating the probability of an output token for generating the sign language gloss based on the scores calculated through Equation 4 and Equation 5, the translation server may also calculate, based on Equation 10 below, a selective read value that supplies copy-related information when the next output token is predicted. The selective read value at time t-1 is given by Equation 10.

[Equation 10]

(Here, the weight applied to each hidden vector is given by Equation 11 below.)

[Equation 11]

(The weight takes the value given in Equation 11 when its condition holds; otherwise it is 0.)

(Here, K is given by Equation 12 below.)

[Equation 12]

That is, K is the sum of all copy probabilities of the input tokens that are identical to the resulting token y_{t-1}. This is because the input may contain a single token to be copied, but it may also contain several.

Interpreting Equation 10: when the output token of the decoder is identical to an input token of the encoder, the selective read is the weighted sum of the encoder output vectors (hidden vectors) of the matching tokens, that is, a vector carrying information about which token was copied.
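The selective read described above can be sketched as follows. The exact form of Equations 10 to 12 is not reproduced in this text, so the sketch assumes that each matching input token is weighted by its copy probability divided by K, which agrees with the description of K as the sum of the copy probabilities of the matching tokens; the function and parameter names are hypothetical.

```python
def selective_read(prev_output, source_tokens, copy_probs, hidden_vectors):
    """Weighted sum of encoder hidden vectors of tokens equal to prev_output.

    copy_probs[j] is the copy probability of source token j, and
    hidden_vectors[j] is its encoder hidden vector.
    """
    dim = len(hidden_vectors[0])
    matches = [j for j, tok in enumerate(source_tokens) if tok == prev_output]
    # K: sum of the copy probabilities of all matching source tokens.
    k = sum(copy_probs[j] for j in matches)
    if not matches or k == 0:
        return [0.0] * dim  # nothing was copied: the weights are all 0
    read = [0.0] * dim
    for j in matches:
        weight = copy_probs[j] / k  # assumed normalization by K
        for d in range(dim):
            read[d] += weight * hidden_vectors[j][d]
    return read
```

When exactly one input token matches the previous output, the weight is 1 and the selective read is simply that token's hidden vector; when several match, their hidden vectors are blended in proportion to their copy probabilities.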

To summarize, the translation server calculates copy scores by applying copy attention to the result vector output by the decoder.

In FIG. 12, the size of the base sign language data set is 15, and three of the input tokens do not belong to the base sign language data set, so three additional tokens are added to account for the input and a total of 18 scores are calculated.

Meanwhile, as the basic output of the decoder, scores are calculated for the 15 entries of the base sign language data set.

Next, the two sets of scores are concatenated and passed through a softmax function to convert them into probabilities, and the probabilities for each token are then summed to derive the final probability.

Because the decoder output has no scores for the three additional tokens, the scores obtained in the copy score part are used as they are for those tokens.

As a result, the sixth token in the sign language data set has the highest estimated probability, 0.2, and is predicted as the next token.
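The combination of decoder scores and copy scores described above can be sketched as follows. The sketch assumes that copy scores are indexed over the extended vocabulary (the base vocabulary followed by the additional input tokens) and that entries with no matching input token carry a score of negative infinity; the function name and these conventions are assumptions, not taken from the specification.

```python
import math


def final_probabilities(generate_scores, copy_scores):
    """Combine decoder scores and copy scores into one distribution.

    generate_scores: one score per base-vocabulary token (15 in FIG. 12).
    copy_scores: one score per extended-vocabulary token (15 + 3 in FIG. 12).
    """
    vocab_size = len(generate_scores)
    joint = list(generate_scores) + list(copy_scores)  # concatenate scores
    peak = max(joint)
    exps = [math.exp(s - peak) for s in joint]         # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    gen_p, copy_p = probs[:vocab_size], probs[vocab_size:]
    final = list(copy_p)               # additional tokens: copy part only
    for i in range(vocab_size):
        final[i] += gen_p[i]           # base tokens: sum of both parts
    return final
```

The index of the largest entry in the returned list is the predicted next token; an index at or beyond the base vocabulary size refers to one of the additional tokens taken from the input.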

For example, as shown in FIG. 12, the token generated as the first result is the sixth word of the sign language data set, and that same word is also the fourth word of the input.

In this case, because the resulting token is the fourth input token, the selective read is h4, the hidden vector of the fourth token.

Next, the vector fed to the decoder as its next input is the embedding of the sixth word (equivalently, of a4) concatenated with h4.

As described above, preferred embodiments of the present invention have been disclosed in this specification and the drawings, but it will be apparent to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical idea of the present invention can be practiced in addition to the embodiments disclosed herein. Although specific terms are used in this specification and the drawings, they are used only in a general sense to explain the technical content of the present invention clearly and to aid understanding of the invention, and are not intended to limit the scope of the present invention. Accordingly, the foregoing detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

100: terminal 200: translation server
205: communication unit 210: input/output unit
215: storage unit 220: data preprocessing unit
225: sign language gloss generation unit 230: sign language image generation unit

Claims

1. A sign language gloss translation method using a transformer, the method comprising:
receiving, by a translation server, natural language text;
generating, by the translation server, a vector corresponding to the natural language text by encoding the natural language text; and
generating, by the translation server, a sign language gloss matching the natural language text by decoding the vector through a transformer model trained in advance by machine learning on a data set of natural language and sign language matched to the natural language,
wherein the generating of the sign language gloss comprises
copying a token generated in the encoding process and selectively using the copied token to generate the sign language gloss,
wherein the generating of the vector comprises:
generating a first token by tokenizing each word of the natural language text, and generating a second token by tokenizing each word of the natural language text together with a linguistic feature of each word; and
generating a first context vector including context information corresponding to the first token through artificial intelligence trained in advance by machine learning on natural language sentences, and generating a second context vector by embedding the second token,
wherein the generating of the vector further comprises
generating a mixed feature vector by synthesizing the first context vector and the second context vector, and inputting the generated mixed feature vector to artificial intelligence for generating the sign language gloss,
wherein, before the first token and the second token are generated,
a word having two or more meanings is detected in the natural language text, and the detected word is separated by spaces into semantic units,
wherein the generating of the second token comprises
generating the second token by embedding the natural language text based on results of part-of-speech (POS) analysis and named entity recognition (NER),
wherein the generating of the sign language gloss comprises:
receiving, by the translation server, the mixed feature vector corresponding to the natural language text; and
extracting, by the translation server, an output token related to a sign language matched with the mixed feature vector through artificial intelligence trained in advance by machine learning on a data set of natural language and sign language matched to the natural language,
and wherein the extracting of the output token comprises
calculating, through the following expression, a score for using the input token generated in the encoding process.
[mathematical expression]

(Here, the expression gives the score, at time t of the decoding process, for copying the input token at time j of the encoding process; X denotes the input token; S_t is the state vector at time t from the decoder cell; and h_t is the result vector at time t from the encoder.)

2. The method of claim 1, wherein the extracting of the output token comprises
calculating a score of the sign language token through the following expression.
[mathematical expression]

(Here, the expression gives the score for outputting the i-th token of the sign language data set at time t in the decoder; V denotes the sign language tokens included in the sign language data set; and S_t is the state vector at time t from the decoder cell.)

3. The method of claim 2, wherein the extracting of the output token comprises
calculating, based on the calculated scores, a probability of the output token for generating the sign language gloss, and calculating, based on the following expression, a selective read value that provides copy-related information when the next output token is estimated.
[mathematical expression]

(here, is given by the following equation.)
[mathematical expression]
If
If not, = 0
(Here, K is given by the following equation.)
[mathematical expression]

4. The method of claim 1, wherein the extracting of the sign language token comprises
estimating a sentence type of the natural language text based on the mixed feature vector, extracting a non-manual signal according to the estimated sentence type, and embedding the extracted non-manual signal in the sign language token.

5. The method of claim 4, wherein the extracting of the output token comprises
deriving a speed index for performing the sign language gloss according to the estimated sentence type, embedding the derived speed index in the sign language token, and including a character representing the speed index in the sign language gloss.

6. The method of claim 5, wherein the extracting of the output token comprises
identifying the sentence type of the natural language text based on the linguistic features of the natural language text included in the mixed feature vector, and determining the speed index based on the identified sentence type.

7. A computer program recorded on a recording medium, the computer program being combined with a computing device comprising:
a memory;
a transceiver; and
a processor that processes instructions resident in the memory,
the computer program causing the computing device to execute:
receiving, by the processor, natural language text;
generating, by the processor, a vector corresponding to the natural language text by encoding the natural language text; and
generating, by the processor, a sign language gloss matching the natural language text by decoding the vector through a transformer model trained in advance by machine learning on a data set of natural language and sign language matched to the natural language,
wherein the generating of the sign language gloss comprises
copying a token generated in the encoding process and selectively using the copied token to generate the sign language gloss,
wherein the generating of the vector comprises:
generating a first token by tokenizing each word of the natural language text, and generating a second token by tokenizing each word of the natural language text together with a linguistic feature of each word; and
generating a first context vector including context information corresponding to the first token through artificial intelligence trained in advance by machine learning on natural language sentences, and generating a second context vector by embedding the second token,
wherein the generating of the vector further comprises
generating a mixed feature vector by synthesizing the first context vector and the second context vector, and inputting the generated mixed feature vector to artificial intelligence for generating the sign language gloss,
wherein, before the first token and the second token are generated,
a word having two or more meanings is detected in the natural language text, and the detected word is separated by spaces into semantic units,
wherein the generating of the second token comprises
generating the second token by embedding the natural language text based on results of part-of-speech (POS) analysis and named entity recognition (NER),
wherein the generating of the sign language gloss comprises:
receiving, by the processor, the mixed feature vector corresponding to the natural language text; and
extracting, by the processor, an output token related to a sign language matched with the mixed feature vector through artificial intelligence trained in advance by machine learning on a data set of natural language and sign language matched to the natural language,
and wherein the extracting of the output token comprises
calculating, through the following expression, a score for using the input token generated in the encoding process.
[mathematical expression]

(Here, the expression gives the score, at time t of the decoding process, for copying the input token at time j of the encoding process; X denotes the input token; S_t is the state vector at time t from the decoder cell; and h_t is the result vector at time t from the encoder.)

8. The computer program of claim 7, wherein the extracting of the output token comprises
calculating a score of the sign language token through the following expression.
[mathematical expression]

(Here, the expression gives the score for outputting the i-th token of the sign language data set at time t in the decoder; V denotes the sign language tokens included in the sign language data set; and S_t is the state vector at time t from the decoder cell.)

9. The computer program of claim 8, wherein the extracting of the output token comprises
calculating, based on the calculated scores, a probability of the output token for generating the sign language gloss, and calculating, based on the following expression, a selective read value that provides copy-related information when the next output token is estimated.
[mathematical expression]

(here, is given by the following equation.)
[mathematical expression]
If
If not, = 0
(Here, K is given by the following equation.)
[mathematical expression]

10. The computer program of claim 7, wherein the extracting of the sign language token comprises
estimating a sentence type of the natural language text based on the mixed feature vector, extracting a non-manual signal according to the estimated sentence type, and embedding the extracted non-manual signal in the sign language token.