KR102253015B1

KR102253015B1 - Apparatus and method of an automatic simultaneous interpretation using presentation scripts analysis

Info

Publication number: KR102253015B1
Application number: KR1020170148777A
Authority: KR
Inventors: 이기영; 김창현; 김영길
Original assignee: 한국전자통신연구원
Priority date: 2017-11-09
Filing date: 2017-11-09
Publication date: 2021-05-17
Also published as: KR20190052924A

Abstract

The present invention adjusts the translation knowledge of a real-time automatic simultaneous interpretation system to the domain of a given lecture by analyzing the lecturer's presentation materials as a preparatory step before the lecture is interpreted. The presentation materials are analyzed automatically to carry out domain adaptation, including extending the user dictionary, updating the speech recognition dictionary, building target words (translation equivalents) for unregistered words, and adjusting the weights of target words in the system dictionary.

Description

Apparatus and method of an automatic simultaneous interpretation using presentation scripts analysis

The present invention relates to an apparatus and method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials, and in particular to an apparatus and method that automatically analyze the lecturer's presentation materials in order to improve the performance of automatic simultaneous interpretation carried out in real time.

Recent rapid progress in natural language processing has enabled many applications involving spoken language. In particular, advances in neural-network-based machine learning have directly raised the quality of speech recognition and automatic translation.

As a result, technologies such as real-time simultaneous interpretation, which automatically translates continuous speech, have recently attracted much attention.

Real-time automatic simultaneous interpretation is a technology that, when the lecturer and the audience of a lecture or course have different native languages, interprets the lecture in real time and delivers the result to the audience so that they can understand its content despite the language difference.

Consider simultaneous interpretation by a human. As preparation, a simultaneous interpreter gathers information on the domain (field) of the lecture to be interpreted and glossaries related to it. This preparatory work lets the interpreter resolve the various linguistic ambiguities that may arise during the actual interpretation. In English-Korean simultaneous interpretation, for example, such ambiguities arise because many English words can be translated into several different Korean words.

For the same reason, if information about a lecture is available in advance, real-time automatic simultaneous interpretation can also better resolve translation ambiguity and generate target sentences that fit the context.

A simultaneous interpretation apparatus built from a basic speech recognizer and an automatic translator can perform simple sentence-by-sentence translation. To produce higher-quality results, however, the general translation knowledge held by the system is not enough for accurate, context-appropriate translation, because existing automatic interpretation and translation systems, and their translation knowledge, are not optimized for a specific lecture.

For example, when the English sentence “You may change its resolution or leave it unchanged.” is interpreted into Korean, it is quite difficult to choose the target word for the semantically ambiguous word "resolution" without considering the context.

In such cases, knowing in advance what the lecturer will talk about can play an important role in achieving accurate translation.

Accordingly, research is needed on providing correct translation results by automatically analyzing the lecturer's presentation materials to identify the intent, content, vocabulary, and sentences of the lecture in advance.

To solve the above technical problem, an object of the present invention is to provide an apparatus and method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials, which automatically identify context information in advance so that real-time automatic simultaneous interpretation produces the most accurate and natural translation for the lecture domain and context.

That is, the object of the present invention is to provide such an apparatus and method, which automatically scan the lecturer's presentation materials, analyze their vocabulary and sentences, and process unregistered words, proper nouns, semantic relations, and target word information into translation knowledge, so that the lecture is interpreted more accurately.

To achieve the above object, an apparatus for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to an embodiment of the present invention includes: a character recognition unit that recognizes character string information contained in a document or electronic document and converts it into text; a morpheme analysis unit that performs morphological analysis on the text recognized by the character recognition unit and extracts tokens; an unregistered word extraction unit that extracts unregistered words by comparing the tokens provided by the morpheme analysis unit with the registered words stored in a translation dictionary database; and a translation knowledge reflecting unit that registers the unregistered words extracted by the unregistered word extraction unit in the translation dictionary database, thereby updating it.

A method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to an embodiment of the present invention includes: recognizing, by a character recognition unit, character string information contained in a document or electronic document and converting it into text; performing, by a morpheme analysis unit, morphological analysis on the text recognized by the character recognition unit and extracting tokens; extracting, by an unregistered word extraction unit, unregistered words by comparing the tokens provided by the morpheme analysis unit with the registered words stored in a translation dictionary database; and registering, by a translation knowledge reflecting unit, the unregistered words extracted by the unregistered word extraction unit in the translation dictionary database, thereby updating it.

Conventional lecture simultaneous interpretation apparatuses share the problems of most automatic translation apparatuses, namely unregistered words, proper nouns, speech recognition errors, and domain adaptation errors. These problems prevent the lecture from being conveyed accurately to the audience.

To address these problems, the present invention analyzes the lecturer's presentation materials, which are the materials most closely related to the lecture, whether directly or indirectly, to identify the main vocabulary of the lecture and the semantic relations among those words. Through this analysis, a user dictionary is built, unregistered words are registered, speech recognition knowledge is extended, and target word weights are adjusted. When translation knowledge adjusted for the lecture in this way is applied to simultaneous interpretation of the lecture, the errors listed above can be reduced considerably.

FIG. 1 is a block diagram of an apparatus for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to an embodiment of the present invention.
FIG. 2 is a functional block diagram illustrating the special target word extraction unit employed in an embodiment of the present invention.
FIG. 3 is a functional block diagram illustrating the character recognition unit employed in an embodiment of the present invention.
FIG. 4 is a functional block diagram illustrating an apparatus for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to another embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to another embodiment of the present invention.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. The description focuses on the parts needed to understand the operation and effects of the invention. Technical content that is well known in the field and not directly related to the invention is omitted, so that the gist of the invention is conveyed more clearly without being obscured by unnecessary description.

In describing the components of the invention, components with the same name may be given different reference numerals in different drawings, and the same reference numeral may be used across different drawings. Even so, this does not by itself mean that a component has different functions in different embodiments or the same function in different embodiments; the function of each component should be determined from the description of that component in the relevant embodiment.

FIG. 1 is a block diagram of an apparatus for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus according to the present invention includes a character recognition unit 100, a morpheme analysis unit 200, a translation dictionary database 300, a special target word extraction unit 400, a special target word display unit 500, a user target word processing unit 600, and a translation knowledge reflecting unit 700.

The character recognition unit 100 recognizes character string information contained in a document or electronic document and converts it into text.

The morpheme analysis unit 200 performs morphological analysis on the text recognized by the character recognition unit 100 and extracts tokens.

The special target word extraction unit 400 analyzes the vocabulary appearing in the presentation materials morphologically analyzed by the morpheme analysis unit 200 and extracts special target words.

The special target word display unit 500 displays the extracted special target words.

The user target word processing unit 600 receives a user target word for each special target word displayed by the special target word display unit 500, so that the user can decide the target word directly.

The translation knowledge reflecting unit 700 may register the user target word received through the user target word processing unit 600 in the translation dictionary database 130, thereby updating it.

According to this embodiment of the present invention, the lecturer's presentation materials, which are directly or indirectly related to the lecture, are analyzed in advance to identify the main vocabulary of the lecture and the semantic relations among those words, and the registered words of the translation dictionary to be used in the actual lecture are adjusted accordingly. This reduces interpretation errors during simultaneous interpretation of the lecture.

Meanwhile, the translation knowledge reflecting unit 700 may register the user target word not only in the translation dictionary database 130 but also in a dictionary for speech recognition. Registering the user target word in the speech recognition dictionary allows information about the vocabulary of the presentation materials to be reflected in advance in the speech recognition knowledge, including the pronunciation dictionary for specific words.

Therefore, during speech recognition in the actual lecture, assigning a higher weight to the user target word can help reduce speech recognition errors.
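One way to picture the effect of such a weight is n-best rescoring, sketched below. This is only an illustration under stated assumptions: the patent does not say how the weight is applied inside the recognizer, and the function name `rescore`, the `bonus` parameter, and the hypothesis scores are all hypothetical.

```python
from typing import List, Set, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, log-probability-like score)


def rescore(hypotheses: List[Hypothesis], boosted: Set[str], bonus: float = 1.0) -> str:
    """Return the hypothesis text with the best score after boosting.

    Each occurrence of a word from `boosted` (vocabulary taken from the
    presentation materials) adds `bonus` to the hypothesis score.
    This rescoring scheme is an assumption made for illustration.
    """
    def boosted_score(hyp: Hypothesis) -> float:
        text, score = hyp
        hits = sum(1 for word in text.lower().split() if word in boosted)
        return score + bonus * hits

    return max(hypotheses, key=boosted_score)[0]


# Made-up n-best list: the acoustically preferred hypothesis is wrong.
nbest = [
    ("you may change its revolution", -4.1),
    ("you may change its resolution", -4.3),
]
best = rescore(nbest, boosted={"resolution"})
```

With the word "resolution" boosted, the second hypothesis overtakes the first even though its original score was lower.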

Here, the special target word extraction unit 400 employed in an embodiment of the present invention preferably extracts semantically ambiguous target words from the tokens obtained through morphological analysis. That is, the special target word extraction unit 400 projects the source vocabulary and the target vocabulary onto the same semantic vector space and semantically identifies the overall semantic relations and domain information of the vocabulary appearing in the lecture, which allows it to extract target words that do not fit the context.

FIG. 2 is a functional block diagram illustrating the special target word extraction unit employed in an embodiment of the present invention.

As shown in FIG. 2, the special target word extraction unit 400 employed in an embodiment of the present invention may include a proper noun recognition unit 410 that determines, using linguistic characteristics and the like, whether an extracted word is a proper noun.

In English, for example, speech-recognized lecture utterances consist entirely of lowercase letters, which makes proper nouns hard to recognize. In the presentation materials, however, proper nouns begin with a capital letter, so the proper noun recognition unit 410 can identify them easily.
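The capitalization cue can be sketched as follows. This is a minimal heuristic, assuming English slide text and treating any capitalized word that is not sentence-initial as a proper noun candidate; the function name and the sentence-splitting rule are hypothetical, and the patent does not limit the recognition unit to this rule.

```python
import re
from typing import Set


def proper_noun_candidates(text: str) -> Set[str]:
    """Collect capitalized words that are not the first word of a sentence.

    Sentence-initial words are skipped because they are capitalized
    regardless of whether they are proper nouns.
    """
    candidates: Set[str] = set()
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", sentence)
        for word in words[1:]:
            if word[0].isupper():
                candidates.add(word)
    return candidates


slide = "We trained the model on Wikipedia. The Transformer encoder runs on a GPU."
found = proper_noun_candidates(slide)
```

The heuristic over-generates on title-case headings and misses sentence-initial proper nouns, so its output is best treated as a candidate list to confirm.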

The special target word extraction unit 400 employed in an embodiment of the present invention may also include a vocabulary semantic relation analysis unit 420 that analyzes the semantic relations among the words of the entire presentation materials. The vocabulary semantic relation analysis unit 420 preferably uses word2vec technology.

The special target word extraction unit 400 may further include a weight adjustment unit 430 that uses the semantic relations to adjust the weights used to determine the target word of words whose target word selection is ambiguous. This is possible because semantically related words are generally projected near similar semantic clusters, and the words around such clusters can be regarded as strongly related to one another.

The weight adjustment unit 430 employed in another embodiment of the present invention lowers the weight of target words that do not fit the context and raises the weight of target words that do, enabling natural, context-appropriate translation.
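The weight adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the two-dimensional vectors are hand-made stand-ins for word2vec embeddings, and the names `cosine`, `centroid`, and `adjust_weights` as well as the normalization scheme are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of the context word vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def adjust_weights(candidates: Dict[str, Vector], context: List[Vector]) -> Dict[str, float]:
    """Weight each candidate target word by its closeness to the context.

    Cosine similarity is shifted from [-1, 1] to [0, 1] and normalized so
    that the weights sum to 1 (an illustrative choice).
    """
    center = centroid(context)
    scores = {word: (cosine(vec, center) + 1.0) / 2.0 for word, vec in candidates.items()}
    total = sum(scores.values())
    if total == 0.0:
        return {word: 1.0 / len(scores) for word in scores}
    return {word: score / total for word, score in scores.items()}


# Toy axes (assumed): [imaging-related, politics-related].
context_vectors = [[0.9, 0.1], [0.8, 0.2]]      # e.g. "pixel", "display"
resolution_candidates = {
    "해상도": [0.85, 0.1],                        # display resolution
    "결의": [0.1, 0.9],                           # formal resolution or resolve
}
weights = adjust_weights(resolution_candidates, context_vectors)
```

For slides about imaging, the candidate close to the context cluster receives the larger weight, which matches the intent of raising context-appropriate target words and lowering the others.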

FIG. 3 is a functional block diagram illustrating the character recognition unit employed in an embodiment of the present invention.

As shown in FIG. 3, the character recognition unit 100 employed in an embodiment of the present invention preferably includes an OCR character recognition unit 110 that, when the presentation materials are a hard copy, recognizes the character string information in the hard copy, converts it into text, and provides the text to the morpheme analysis unit 200.

The character recognition unit 100 may also include an electronic file recognition unit 120 that, when the presentation materials are an electronic file, recognizes the character string information in the electronic file, converts it into text, and provides the text to the morpheme analysis unit 200.
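For the electronic file case, text extraction might look like the sketch below. The patent does not name a file format; this example assumes a .pptx file, which is a ZIP archive whose slide parts store text in DrawingML `<a:t>` elements, and uses only the Python standard library. The function names and the in-memory demo archive are hypothetical, and the demo archive is a stand-in rather than a complete presentation file.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_pptx_text(data: bytes) -> List[str]:
    """Return the text runs of every slide, in slide-number order."""
    texts: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_RE.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            texts.extend(run.text for run in root.iter(A_NS + "t") if run.text)
    return texts


def make_demo_pptx() -> bytes:
    """Build a minimal archive with one slide part, for demonstration only."""
    slide = (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        "<a:t>Image resolution</a:t><a:t>Codec settings</a:t></p:sld>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ppt/slides/slide1.xml", slide)
    return buffer.getvalue()


demo_text = extract_pptx_text(make_demo_pptx())
```

The extracted strings would then be handed to morphological analysis; a hard copy would instead go through an OCR engine, which is outside the scope of this sketch.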

FIG. 4 is a functional block diagram illustrating an apparatus for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to another embodiment of the present invention.

As shown in FIG. 4, the apparatus according to another embodiment of the present invention may further include an unregistered word extraction unit 800.

The unregistered word extraction unit 800 extracts unregistered words by comparing the tokens provided by the morpheme analysis unit 200 with the registered words stored in the translation dictionary database 130.

The translation knowledge reflecting unit 700 then registers the unregistered words extracted by the unregistered word extraction unit 800 in the translation dictionary database 130, thereby updating it.
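The comparison and registration steps can be sketched as follows. This is a minimal sketch, assuming the translation dictionary is an in-memory mapping from a lowercase source word to a list of target words; the function names and the choice to register new entries with an empty target list are hypothetical.

```python
from typing import Dict, Iterable, List

Dictionary = Dict[str, List[str]]


def extract_unregistered(tokens: Iterable[str], dictionary: Dictionary) -> List[str]:
    """Return tokens missing from the dictionary, deduplicated, in first-seen order."""
    seen = set()
    missing: List[str] = []
    for token in tokens:
        key = token.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing


def register_unregistered(words: Iterable[str], dictionary: Dictionary) -> None:
    """Add each word with an empty target list, to be filled in later."""
    for word in words:
        dictionary.setdefault(word, [])


translation_dict: Dictionary = {"change": ["바꾸다"], "resolution": ["해상도", "결의"]}
slide_tokens = ["change", "resolution", "codec", "Codec", "bitrate"]
oov_words = extract_unregistered(slide_tokens, translation_dict)
register_unregistered(oov_words, translation_dict)
```

After this runs, "codec" and "bitrate" exist as dictionary entries awaiting target words, while the existing entries are untouched.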

According to this embodiment of the present invention, the lecturer's presentation materials, which are directly or indirectly related to the lecture, are analyzed in advance and unregistered words are registered in the translation dictionary, which reduces interpretation errors during simultaneous interpretation of the actual lecture.

A method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to an embodiment of the present invention is described with reference to FIG. 5.

First, the character recognition unit 100 recognizes character string information contained in a document or electronic document and converts it into text (S100).

Next, the morpheme analysis unit 200 performs morphological analysis on the text recognized by the character recognition unit 100 and extracts tokens (S200).

The special target word extraction unit 400 analyzes the vocabulary appearing in the presentation materials morphologically analyzed by the morpheme analysis unit 200 and extracts special target words (S300).

Next, the special target word display unit 500 displays the extracted special target words (S400).

Then, the user target word processing unit 600 receives a user target word for each special target word displayed by the special target word display unit 500, so that the user can decide the target word directly (S500).

Finally, the translation knowledge reflecting unit 700 registers the user target word received through the user target word processing unit 600 in the translation dictionary database 130, thereby updating it (S600).
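Steps S200 to S600 can be strung together as in the sketch below. It is an illustration under stated assumptions rather than the claimed method: a regular expression stands in for morphological analysis, "special" is approximated as "has more than one candidate target word", the display and user input steps are collapsed into a `choose` callback, and all names are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

Dictionary = Dict[str, List[str]]
Chooser = Callable[[str, List[str]], Optional[str]]


def tokenize(text: str) -> List[str]:
    """Stand-in for morphological analysis (S200): lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def special_words(tokens: List[str], dictionary: Dictionary) -> Dictionary:
    """Stand-in for S300: words that have more than one candidate target word."""
    found: Dictionary = {}
    for token in tokens:
        candidates = dictionary.get(token, [])
        if len(candidates) > 1 and token not in found:
            found[token] = list(candidates)
    return found


def adapt(text: str, dictionary: Dictionary, choose: Chooser) -> Dictionary:
    """Ask `choose` for each special word (S400, S500) and update the dictionary (S600)."""
    for word, candidates in special_words(tokenize(text), dictionary).items():
        chosen = choose(word, candidates)
        if chosen is not None:
            rest = [c for c in candidates if c != chosen]
            dictionary[word] = [chosen] + rest  # the user's choice is ranked first
    return dictionary


slide_text = "You may change its resolution or leave it unchanged."
mt_dict: Dictionary = {"resolution": ["결의", "해상도"], "change": ["바꾸다"]}
adapt(slide_text, mt_dict, lambda word, candidates: "해상도")
```

Here the callback plays the role of the user picking the display-related sense of "resolution", so that sense moves to the front of the entry while unambiguous entries such as "change" are left alone.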

Here, the step of extracting the special target words (S300) may further include a step (S310) in which the proper noun recognition unit 410 determines, using linguistic characteristics and the like, whether an extracted word is a proper noun.

The step of extracting the special target words (S300) may further include a step (S320) in which the vocabulary semantic relation analysis unit 420 analyzes the semantic relations among the words of the entire presentation materials. The step of analyzing the semantic relations (S320) preferably uses word2vec technology.

The step of analyzing the semantic relations (S320) may further include a step (S330) in which the weight adjustment unit 430 uses the semantic relations to adjust the weights used to determine the target word of words whose target word selection is ambiguous.

Meanwhile, the step of extracting the special target words (S300) preferably projects the source vocabulary and the target vocabulary onto the same semantic vector space and semantically identifies the overall semantic relations and domain information of the vocabulary appearing in the lecture, thereby extracting target words that do not fit the context.

In the text conversion step (S100) employed in an embodiment of the present invention, when the presentation materials are a hard copy, the OCR character recognition unit 110 preferably recognizes the character string information in the hard copy, converts it into text, and provides the text to the morpheme analysis unit 200.

In the text conversion step (S100) employed in an embodiment of the present invention, when the presentation materials are an electronic file, the electronic file recognition unit 120 may recognize the character string information in the electronic file, convert it into text, and provide the text to the morpheme analysis unit 200.

Meanwhile, a method for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials according to another embodiment of the present invention is described with reference to FIG. 6.

The morpheme analysis unit 200 performs morphological analysis on the text recognized by the character recognition unit 100 and extracts tokens (S200).

Then, the unregistered word extraction unit 800 extracts unregistered words by comparing the tokens provided by the morpheme analysis unit 200 with the registered words stored in the translation dictionary database 130 (S700).

The translation knowledge reflecting unit 700 then registers the unregistered words extracted by the unregistered word extraction unit 800 in the translation dictionary database 130, thereby updating it (S800).

The embodiments described above are examples, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations without departing from the essential characteristics of the invention. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

Claims

An apparatus for real-time simultaneous interpretation of lectures based on automatic analysis of presentation materials, comprising:
a character recognition unit that recognizes character string information contained in a document or electronic document and converts it into text;
a morpheme analysis unit that performs morphological analysis on the text recognized by the character recognition unit and extracts tokens;
a special target word extraction unit that analyzes the vocabulary appearing in the presentation materials morphologically analyzed by the morpheme analysis unit and extracts special target words;
a special target word display unit that displays the extracted special target words;
a user target word processing unit that receives a user target word for each special target word displayed by the special target word display unit, so that a user can decide the target word directly; and
a translation knowledge reflecting unit that registers the user target word received through the user target word processing unit in a translation dictionary database, thereby updating it,
wherein the special target word extraction unit projects source vocabulary and target vocabulary onto the same semantic vector space and semantically identifies the overall semantic relations and domain information of the vocabulary appearing in the lecture, thereby extracting target words that do not fit the context.

The apparatus of claim 1,
wherein the special target word extraction unit further comprises a proper noun recognition unit that determines, using linguistic characteristics, whether an extracted vocabulary item is a proper noun.

The apparatus of claim 1,
wherein the special target word extraction unit comprises a vocabulary semantic relationship analysis unit that analyzes the semantic relationships among the vocabulary in the entire presentation data.

The apparatus of claim 3,
wherein the vocabulary semantic relationship analysis unit uses word2vec technology.

The apparatus of claim 4,
wherein the special target word extraction unit further comprises a weight adjustment unit that uses the semantic relationships to adjust the weights for determining the target word of vocabulary that is ambiguous in target word selection.

delete

The apparatus of claim 1, further comprising:
an unregistered word extraction unit that extracts unregistered words by comparing the tokens provided by the morpheme analysis unit with the registered words stored in the translation dictionary database; and
a translation knowledge reflecting unit that registers the unregistered words extracted by the unregistered word extraction unit in the translation dictionary database, thereby updating it.

The apparatus of claim 1,
wherein the character recognition unit is an OCR character recognition unit that recognizes character string information in a hard copy, converts it into text, and provides the text to the morpheme analysis unit.

The apparatus of claim 1,
wherein the character recognition unit is an electronic file recognition unit that recognizes character string information in an electronic file, converts it into text, and provides the text to the morpheme analysis unit.

A method for simultaneous interpretation of real-time lectures based on automatic analysis of presentation data, comprising:
recognizing, by a character recognition unit, character string information contained in a document or an electronic document and converting it into text;
extracting, by a morpheme analysis unit, tokens by performing morpheme analysis on the text recognized by the character recognition unit;
extracting, by a special target word extraction unit, special target words by analyzing the vocabulary appearing in the presentation data morpheme-analyzed by the morpheme analysis unit;
displaying, by a special target word display unit, the extracted special target words;
receiving, by a user target word processing unit, a user target word for a special target word, so that a user can directly decide the target word for the special target word displayed by the special target word display unit; and
registering, by a translation knowledge reflecting unit, the user target word input through the user target word processing unit in a translation dictionary database, thereby updating it,
the method further comprising:
extracting, by an unregistered word extraction unit, unregistered words by comparing the tokens provided by the morpheme analysis unit with the registered words stored in the translation dictionary database; and
registering, by the translation knowledge reflecting unit, the unregistered words extracted by the unregistered word extraction unit in the translation dictionary database, thereby updating it.

The method of claim 10,
wherein the step of extracting the special target words further comprises determining, by a proper noun recognition unit, whether an extracted vocabulary item is a proper noun, using linguistic characteristics and the like.

The method of claim 10,
wherein the step of extracting the special target words further comprises analyzing, by a vocabulary semantic relationship analysis unit, the semantic relationships among the vocabulary in the entire presentation data.

The method of claim 12,
wherein the step of analyzing the semantic relationships uses word2vec technology.

The method of claim 13,
wherein the step of analyzing the semantic relationships further comprises adjusting, by a weight adjustment unit, the weights for determining the target word of vocabulary that is ambiguous in target word selection, using the semantic relationships.

The method of claim 10,
wherein the step of extracting the special target words comprises projecting the source vocabulary and the target vocabulary onto the same semantic vector space and semantically grasping the overall semantic relationships and domain information of the vocabulary appearing in the lecture, thereby extracting target words that do not fit the context.

delete

The method of claim 10,
wherein, in the step of converting into text, an OCR character recognition unit recognizes character string information in a hard copy, converts it into text, and provides the text to the morpheme analysis unit.

The method of claim 10,
wherein, in the step of converting into text, an electronic file recognition unit recognizes character string information in an electronic file, converts it into text, and provides the text to the morpheme analysis unit.