
KR20210097366A - Apparatus and method for identifying sentence and phrase of chinese character text based on conditional random field

Info

Publication number: KR20210097366A
Application number: KR1020200010977A
Authority: KR
Inventors: 김소정
Original assignee: (주)나라지식정보
Priority date: 2020-01-30
Filing date: 2020-01-30
Publication date: 2021-08-09
Also published as: KR102529987B1

Abstract

According to an embodiment of the present disclosure, an apparatus for identifying sentences and phrases in Chinese character literature comprises: a communication circuit configured to communicate with the outside; a memory; and a processor electrically connected to the communication circuit and the memory. The processor receives, using the communication circuit, input text including pre-entered tags that distinguish sentences and phrases; determines a feature function for each character included in the input text; receives translation target text using the communication circuit; calculates, using the feature function, the occurrence probability of each of a plurality of label sequences that distinguish sentences and phrases for the character sequence included in the translation target text; obtains, based on the occurrence probabilities, the label sequence corresponding to the character sequence from among the plurality of label sequences; and may insert punctuation marks that distinguish sentences and phrases into the translation target text based on the label sequence. Accordingly, sentences and phrases in classical Chinese character literature can be identified efficiently.

Description

CRF-based apparatus and method for identifying sentences and phrases in Chinese character literature

Embodiments disclosed in this document relate to an apparatus and method for identifying sentences and phrases in classical Chinese character literature.

Chinese characters, the script of China, have been used for hundreds of years as an official written language in East Asia since the 5th century. The patterns, grammar, and vocabulary used in classical Chinese character literature differ greatly from those of modern Chinese. Moreover, even among classical documents written in the same period, the patterns, grammar, and vocabulary differ greatly depending on the region in which a document was written (e.g., China, Korea, Vietnam, or Japan). In addition, classical Chinese character literature is divided only into paragraphs; within a paragraph, neither spacing nor punctuation marks are used.

Therefore, to translate classical Chinese character literature, an expert in the literature of the relevant region must divide the text of a document into sentences, divide each sentence into phrases, and enter punctuation marks (標點) that distinguish the sentences and phrases.

This punctuation-marking task accounts for a large share of the time and cost of translating classical Chinese character literature. According to the Institute for the Translation of Korean Classics (한국고전번역원), the translation of the Seungjeongwon Ilgi (승정원 일기), one of Korea's classical Chinese character documents, is currently about 21% complete, and at the current pace a complete translation is expected to take about 41 years. Therefore, to speed up translation and make use of classical Chinese character literature, the punctuation-marking task needs to be automated so as to reduce the manpower, time, and cost it requires.

Embodiments of the present invention provide an apparatus and method that use a machine learning technique to enter punctuation marks identifying the sentences and phrases of classical Chinese character literature.

An apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment disclosed in this document includes a communication circuit configured to communicate with the outside, a memory, and a processor electrically connected to the communication circuit and the memory. The processor may: receive, using the communication circuit, input text including pre-entered tags that distinguish sentences and phrases; determine a feature function for each character included in the input text; receive translation target text using the communication circuit; calculate, using the feature function, an occurrence probability for each of a plurality of label sequences that distinguish sentences and phrases for the character sequence included in the translation target text; obtain, based on the occurrence probabilities, the label sequence corresponding to the character sequence from among the plurality of label sequences; and insert, based on the label sequence, punctuation marks that distinguish sentences and phrases into the translation target text.

According to an embodiment, the pre-entered tags are placed adjacent to the first character of each phrase and the last character of each sentence in the input text, and the processor may insert a comma before each character in the translation target text whose label marks a phrase and a period after each character whose label marks a sentence.

According to an embodiment, the pre-entered tags are placed adjacent to the last character of each phrase and the last character of each sentence in the input text, and the processor may insert a comma after each character in the translation target text whose label marks a phrase and a period after each character whose label marks a sentence.

According to an embodiment, the processor may generate, based on the pre-entered tags and the characters included in the input text, training data that includes a label for each character in the input text, and may determine a feature function for each character in the input text based on the training data.

According to an embodiment, the feature function may be determined based on a specific character included in the input text, the position of that character, the label corresponding to that character, and the label corresponding to another character adjacent to it.

According to an embodiment, the processor may determine a plurality of feature functions for each character included in the input text, assign a different weight to each of the feature functions, and calculate the occurrence probability based on the return values and weights of the feature functions.

According to an embodiment, the processor may calculate the occurrence probability and obtain the label sequence according to a conditional random field (CRF).

According to an embodiment, the label sequence may be the sequence, among the plurality of label sequences, that has the highest occurrence probability for the character sequence.

According to an embodiment, the processor may provide result text that includes the translation target text and the punctuation marks.

A method for identifying sentences and phrases in Chinese character literature according to an embodiment disclosed in this document may include: receiving input text including pre-entered tags that distinguish sentences and phrases; determining a feature function for each character included in the input text; receiving translation target text; calculating, using the feature function, an occurrence probability for each of a plurality of label sequences that distinguish sentences and phrases for the character sequence included in the translation target text; obtaining, based on the occurrence probabilities, the label sequence corresponding to the character sequence from among the plurality of label sequences; and inserting, based on the label sequence, punctuation marks that distinguish sentences and phrases into the translation target text.

According to the embodiments disclosed in this document, labeling that distinguishes sentences and phrases is performed with feature functions on classical Chinese character literature that has no such distinction, so that the identification of sentences and phrases in irregular and random classical Chinese character literature can be handled simultaneously and efficiently. The effect may be especially pronounced for text composed of Chinese characters, in which sentence endings and phrase boundaries are harder to determine than in Korean, which has regular sentence-final endings and particles.

In addition, various effects that are directly or indirectly understood through this document may be provided.

FIG. 1 illustrates an operating environment of an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 2 is a block diagram illustrating the configuration of an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 3 illustrates exemplary input text received by an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 4 illustrates exemplary training data generated by an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 5 illustrates exemplary training data generated by an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 6 illustrates the processing of exemplary translation target text received by an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 7 is a flowchart illustrating a method for identifying sentences and phrases in Chinese character literature according to an embodiment.
FIG. 8 is a flowchart illustrating a method for identifying sentences and phrases in Chinese character literature according to an embodiment.
In connection with the description of the drawings, the same or similar reference numerals may be used for the same or similar components.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. This description is not intended to limit the present invention to particular embodiments, and it should be understood to cover various modifications, equivalents, and/or alternatives of the embodiments.

FIG. 1 illustrates an operating environment of an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.

Referring to FIG. 1, the apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment may be implemented as a server 100. It is not limited to this form, however, and may be implemented as any of various types of computing device. The apparatus may be implemented as a single device, as shown in FIG. 1, or as a set of two or more devices.

The server 100 according to an embodiment may receive translation target text 12 from the outside, for example from a user terminal or another server. The translation target text 12 may contain no punctuation marks or spacing. The server 100 may identify the sentences and phrases included in the translation target text 12 using, for example, an algorithm based on a conditional random field (CRF).

The server 100 may insert punctuation marks into the translation target text 12, such as periods to distinguish sentences and commas to distinguish phrases. The server 100 may provide the result text 12, into which the punctuation marks have been inserted, to the user terminal or another server. A user may then translate the classical Chinese character literature using the result text 12. Automating the process of entering punctuation marks into the translation target text 12 can reduce the manpower, time, and cost required for translation.

FIG. 2 is a block diagram illustrating the configuration of an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.

Referring to FIG. 2, the apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment may be implemented as a server 200. The server 200 according to an embodiment may include a communication circuit 210, a memory 220, and a processor 230.

The communication circuit 210 may be configured to communicate with the outside. The communication circuit 210 may include a wireless communication interface and/or a wired communication interface. For example, the communication circuit 210 may exchange data with an external device such as a user terminal and/or another external server.

The memory 220 may include volatile memory and/or non-volatile memory. The memory 220 may store various data handled by the server 200. For example, the memory 220 may store data received from a user terminal and/or another external server, as well as data processed within the server 200.

The processor 230 may be electrically connected to the communication circuit 210 and the memory 220. The processor 230 may control the communication circuit 210 and the memory 220 and may perform various data processing and computation.

According to an embodiment, the processor 230 may use the communication circuit 210 to receive input text including pre-entered tags that distinguish sentences and phrases. The received input text may be stored in the memory 220. The pre-entered tags may indicate the start of a sentence, the end of a sentence, or the separation of phrases. For example, the tags may be placed adjacent to the first character of each phrase and the last character of each sentence in the input text. As another example, the tags may be placed adjacent to the last character of each phrase and the last character of each sentence. The input text may be used as data for training. A specific example of the input text is described in detail with reference to FIG. 3.

According to an embodiment, the processor 230 may determine a feature function for each character included in the input text. A feature of a character may refer to the character positioned before or after it, a set of characters (N-gram), or a set to which the character belongs. A feature function may be a function indicating whether that feature applies, and may return a logical value representing true or false (e.g., 0 or 1). The processor 230 may store the feature function in the memory 220. The feature function may be determined based on, for example, a specific character included in the input text, the position of that character, the label corresponding to that character, and the label corresponding to another character adjacent to it.
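The binary feature functions described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the source: the argument order `(x, i, y_i, y_prev)`, the factory-function names, and the specific features chosen are all assumptions, and the labels S, N, and E are borrowed from the FIG. 4 example.

```python
from typing import Callable, Sequence

# Assumed signature: f(x, i, y_i, y_prev) -> 0 or 1
FeatureFn = Callable[[Sequence[str], int, str, str], int]

def make_char_label_feature(char: str, label: str) -> FeatureFn:
    """Fires when the character at position i is `char` and its label is `label`."""
    def f(x, i, y_i, y_prev):
        return 1 if x[i] == char and y_i == label else 0
    return f

def make_prev_char_feature(prev_char: str, label: str) -> FeatureFn:
    """Fires when the preceding character is `prev_char` and the current label is `label`."""
    def f(x, i, y_i, y_prev):
        return 1 if i > 0 and x[i - 1] == prev_char and y_i == label else 0
    return f

def make_transition_feature(prev_label: str, label: str) -> FeatureFn:
    """Fires when the neighbouring (previous) label and the current label match the pair."""
    def f(x, i, y_i, y_prev):
        return 1 if y_prev == prev_label and y_i == label else 0
    return f

# Hypothetical features over the example sentence
x = list("光海由後苑門出走")
f1 = make_char_label_feature("由", "S")   # the character itself
f2 = make_prev_char_feature("海", "S")    # the character before it
f3 = make_transition_feature("N", "E")   # the label of the adjacent character
```

Each function returns 1 only when its condition holds at the given position, which corresponds to the true/false return value described in the text.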

The processor 230 may also determine a plurality of feature functions and may assign a different weight to each of them. For example, the processor 230 may compute, through iterative calculation using maximum likelihood estimation (MLE), weights that yield accurate results on the training data. A specific example of the feature function is described in detail with reference to FIG. 4.

According to an embodiment, the processor 230 may generate, based on the pre-entered tags and the characters included in the input text, training data that includes a label for each character in the input text, and may determine a feature function for each character based on the training data. The training data may be organized as a table containing the characters in the input text, the label corresponding to each character, and an index indicating each character's position. The generated training data may be stored in the memory 220. The processor 230 may use the training data as input values of the feature functions. Specific examples of the training data are described in detail with reference to FIGS. 4 and 5.

According to an embodiment, the processor 230 may receive translation target text using the communication circuit 210. The processor 230 may receive text that a user terminal enters through an input field on a web page, or a file containing text that the user terminal uploads. The processor 230 may receive the translation target text directly from the user terminal or through another external device. The processor 230 may store the translation target text in the memory 220.

According to an embodiment, the processor 230 may use the feature functions to calculate an occurrence probability for each of a plurality of label sequences that distinguish sentences and phrases for the character sequence included in the translation target text. The processor 230 may calculate the occurrence probability according to, for example, a CRF. The processor 230 may identify every label sequence that can correspond to the character sequence included in the translation target text, and may calculate the probability that a specific label sequence occurs among them. The processor 230 may calculate the occurrence probability for each of these label sequences. An exemplary equation for calculating the probability is as follows.

[Equation 1]
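The equation itself does not appear in this text. As a hedged reconstruction, assuming the standard linear-chain CRF form and a feature-function argument order (x, i, y_i, y_{i-1}) that the source does not spell out, an expression consistent with the symbol definitions that follow would be:

```latex
p(y \mid x) =
  \frac{\exp\left( \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j \, f_j(x, i, y_i, y_{i-1}) \right)}
       {\sum_{y'} \exp\left( \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j \, f_j(x, i, y'_i, y'_{i-1}) \right)}
```

The denominator sums over every candidate label sequence y', which is what makes the probabilities of all label sequences add up to 1.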

Here, x is a random variable for the characters included in the translation target text, y is a random variable for the labels corresponding to those characters, n is the length of the character sequence, m is the number of types of feature function, y' ranges over every label sequence that can correspond to the character sequence, f is a feature function, and λ is the weight for a feature function. The probabilities of all label sequences sum to 1. Equation 1 can be used to calculate the occurrence probability of each label sequence for the character sequence. Equation 1 may also be used to calculate λ, in which case the input values correspond to the training data.

According to an embodiment, the processor 230 may obtain, based on the occurrence probabilities, the label sequence corresponding to the character sequence from among the plurality of label sequences. The processor 230 may obtain the label sequence according to, for example, a CRF. The label sequence may be the one with the highest occurrence probability for the character sequence among the plurality of label sequences. An exemplary equation for obtaining the label sequence corresponding to the character sequence is as follows.

[Equation 2]
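The equation itself does not appear in this text. Given the statement that the selected label sequence is the one with the highest occurrence probability, a form consistent with the definitions that follow would be the usual argmax; this is a reconstruction, not a quotation of the source:

```latex
y^{*} = \operatorname*{arg\,max}_{y} \; p(y \mid x)
```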

Here, x is the character sequence included in the translation target text, y ranges over every label sequence that can correspond to the character sequence, and y* is the label sequence with the highest occurrence probability for the character sequence x among the plurality of label sequences.
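The two steps described for Equations 1 and 2, scoring every candidate label sequence and then picking the most probable one, can be sketched by brute-force enumeration. This is only an illustration under stated assumptions: the label set S/N/E is taken from the FIG. 4 example, while the feature functions, the hand-picked weights, the `<BOS>` start marker, and all function names are hypothetical. Enumeration grows exponentially with sequence length, so practical CRF decoders typically use dynamic programming (the Viterbi algorithm) instead.

```python
import math
from itertools import product

LABELS = ("S", "N", "E")  # labels as in the FIG. 4 example

def score(x, y, features, weights):
    """Weighted sum of feature values over all positions (the exponent of a CRF)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<BOS>"  # hypothetical start marker
        for f, lam in zip(features, weights):
            total += lam * f(x, i, y[i], y_prev)
    return total

def label_probabilities(x, features, weights):
    """Return {label sequence: probability}, normalised over every candidate sequence."""
    candidates = list(product(LABELS, repeat=len(x)))
    exp_scores = [math.exp(score(x, y, features, weights)) for y in candidates]
    z = sum(exp_scores)  # normalising constant (the denominator)
    return {y: s / z for y, s in zip(candidates, exp_scores)}

def best_sequence(x, features, weights):
    """Pick the label sequence with the highest occurrence probability."""
    probs = label_probabilities(x, features, weights)
    return max(probs, key=probs.get)

# Toy feature functions and hand-picked weights, for illustration only
features = [
    lambda x, i, y, yp: 1 if i == 0 and y == "S" else 0,
    lambda x, i, y, yp: 1 if i == len(x) - 1 and y == "E" else 0,
    lambda x, i, y, yp: 1 if 0 < i < len(x) - 1 and y == "N" else 0,
]
weights = [2.0, 2.0, 1.0]

x = list("光海走")
probs = label_probabilities(x, features, weights)
best = best_sequence(x, features, weights)
```

In a trained model the weights would come from maximum likelihood estimation on the training data rather than being set by hand.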

According to an embodiment, the processor 230 may insert punctuation marks that distinguish sentences and phrases into the translation target text based on the label sequence. The processor 230 may recognize the phrase labels and sentence labels in the label sequence, insert punctuation marks that distinguish phrases by referring to the phrase labels, and insert punctuation marks that distinguish sentences by referring to the sentence labels. For example, the processor 230 may insert a comma before each character in the translation target text whose label marks a phrase and a period after each character whose label marks a sentence. As another example, the processor 230 may insert a comma after each character whose label marks a phrase and a period after each character whose label marks a sentence.
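The first insertion scheme above (comma before a phrase-labelled character, period after a sentence-labelled character) might look like the following sketch. The label names S and E follow the FIG. 4 example; the function name and the decision to skip the comma at the very start of the text or immediately after a sentence end are assumptions, since the source does not address those cases.

```python
def insert_punctuation(chars, labels, comma=",", period="."):
    """Insert a comma before 'S'-labelled characters and a period after 'E'-labelled ones.

    Assumption: no comma is emitted before the first character of the text
    or directly after a sentence-ending character.
    """
    out = []
    for i, (ch, lab) in enumerate(zip(chars, labels)):
        if lab == "S" and i > 0 and labels[i - 1] != "E":
            out.append(comma)
        out.append(ch)
        if lab == "E":
            out.append(period)
    return "".join(out)

chars = list("光海由後苑門出走")
labels = ["S", "N", "S", "N", "N", "N", "N", "E"]
result = insert_punctuation(chars, labels)
```

The `comma` and `period` parameters are left configurable so that full-width marks could be substituted if preferred.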

According to an embodiment, the processor 230 may provide result text that includes the translation target text and the punctuation marks. For example, the processor 230 may use the communication circuit 210 to provide the result text through a web page or to provide an electronic file containing the result text. The processor 230 may use the communication circuit 210 to provide the result text to a user terminal or another external device.

FIG. 3 illustrates exemplary input text received by an apparatus for identifying sentences and phrases in Chinese character literature according to an embodiment.

Referring to FIG. 3, the input text received by the apparatus according to an embodiment may include tags indicating the start or end of a sentence and tags indicating the start or end of a phrase. The tags may be entered in advance by a user.

For example, the original text, corresponding to part of a classical Chinese character document, may be “光海由後苑門出走”. In this case, “光海” is one phrase, “由後苑門出走” is another phrase, and “光海由後苑門出走” is one sentence.

According to an embodiment, <END>, indicating the end of a sentence, may be tagged after the last character of a sentence, and /S, indicating the start of a phrase, may be tagged after the first character of a phrase. In this case, /S is tagged after “光”, the first character of the first phrase “光海”; /S is tagged after “由”, the first character of the second phrase “由後苑門出走”; and <END> is tagged after “走”, the last character of the sentence “光海由後苑門出走”. The input text may then be “光/S海由/S後苑門出走<END>” (input text 1).

According to an embodiment, <START>, indicating the start of a sentence, may be tagged after the last character of a sentence, and /S, indicating the end of a phrase, may be tagged after the last character of a phrase. In this case, /S is tagged after “海”, the last character of the first phrase “光海”; /S is tagged after “走”, the last character of the second phrase “由後苑門出走”; and <START> is tagged after “走”, the last character of the sentence “光海由後苑門出走”. The input text may then be “光海/S由後苑門出走<START>/S” (input text 2).

Either input text 1 or input text 2 may be chosen freely depending on the implementation. The input texts described above are only examples, and input text may include tags of various forms that distinguish phrases and sentences. The input text may be used as data for training. Because the input text includes both tags that identify phrases and tags that identify sentences, using it as training data allows the punctuation-marking task to be performed efficiently on classical Chinese character literature that has no phrase or sentence boundaries.

FIG. 4 illustrates exemplary training data generated by an apparatus for identifying sentences and phrases of Chinese character literature according to an embodiment.

In the table shown in FIG. 4, x is a character included in the input text, y is the label corresponding to each character, and i is the position of the character within a sentence. The training data shown in FIG. 4 is generated based on input text 1 described with reference to FIG. 3.

Referring to FIG. 4, the characters included in the input text may be entered in item x of the training data. For example, the characters of “光海由後苑門出走” may be entered in item x in order. Numbers indicating the position of each character may be arranged in item i. For example, 0 is entered for the first character “光”, 1 for the second character “海”, and 7 for the eighth character “走”. The label corresponding to each character may be entered in item y. For example, the label S is entered for the characters “光” and “由”, which are tagged with /S; the label E is entered for the character “走”, which is tagged with <END>; and the label N is entered for the remaining untagged characters. Using this training data, training may proceed as described below, and punctuation may be added to the text to be translated. Because the training data contains both labels that identify phrases and labels that identify sentences, and the characteristic functions and weights are set from the training data, phrases and sentences can be identified efficiently in classical Chinese character literature that contains no spaces or punctuation marks.
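As a minimal sketch of how the training data of FIG. 4 could be derived from input text 1, the following Python example converts the tagged text into rows of position i, character x, and label y. The function name `build_training_rows` and the tag-to-label mapping are assumptions made for illustration; the disclosure does not specify an implementation.

```python
import re

# Assumed mapping from the tags of input text 1 to the labels of FIG. 4.
TAG_TO_LABEL = {"/S": "S", "<END>": "E"}

# Match a tag first; otherwise match a single character.
TOKEN = re.compile(r"/S|<END>|.", re.S)


def build_training_rows(tagged_text):
    """Return (i, x, y) rows: position, character, and label."""
    chars, labels = [], []
    for token in TOKEN.findall(tagged_text):
        if token in TAG_TO_LABEL:
            # In input text 1, a tag labels the character immediately before it.
            labels[-1] = TAG_TO_LABEL[token]
        else:
            chars.append(token)
            labels.append("N")  # untagged characters default to label N
    return [(i, x, y) for i, (x, y) in enumerate(zip(chars, labels))]


rows = build_training_rows("光/S海由/S後苑門出走<END>")
for row in rows:
    print(row)
```

Positions are zero-based here, which matches the indices 0, 1, and 7 given for “光”, “海”, and “走”.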

The apparatus may determine a characteristic function for each character included in the training data. There may be, for example, three types of characteristic function: f₁(x, i, y_i, y_{i-1}), f₂(x, i, y_i), and f₃(x, i, y_i, y_{i-1}, y_{i-2}). For i=6, based on the training data shown in FIG. 4, f₁(x, 6, y₆, y₅) may be set to return 1 when x=出, y₆=N, and y₅=N, and 0 otherwise. f₂(x, 6, y₆) may be set to return 1 when x=出 and y₆=N, and 0 otherwise. f₃(x, 6, y₆, y₅, y₄) may be set to return 1 when x=出, y₆=N, y₅=N, and y₄=N, and 0 otherwise. In the same manner, characteristic functions f₁, f₂, and f₃ may be set for every character included in the training data. Duplicate characteristic functions may be removed. Once the characteristic functions are determined, the apparatus may use Equation 1 to set weights λ₁, λ₂, and λ₃ for the characteristic functions f₁, f₂, and f₃, respectively. The weights may be set by maximum likelihood estimation (MLE) so that Equation 1 yields the best-fitting probabilities for the training data.
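The three types of characteristic function described above can be sketched as indicator functions collected from one labeled sentence. In the following Python example, the tuple representation of a characteristic function and the names `collect_functions` and `evaluate` are assumptions for illustration. The sketch also assumes that a function is keyed by the character and its labels rather than by the absolute position i; the disclosure does not settle this point.

```python
def collect_functions(x, y):
    """Collect the distinct characteristic functions observed in one sentence.

    Each function is stored as (type, character, labels), where labels holds
    (y_i,), (y_i, y_{i-1}) or (y_i, y_{i-1}, y_{i-2}). Using a set removes
    duplicate functions.
    """
    functions = set()
    for i in range(len(x)):
        functions.add(("f2", x[i], (y[i],)))
        if i >= 1:
            functions.add(("f1", x[i], (y[i], y[i - 1])))
        if i >= 2:
            functions.add(("f3", x[i], (y[i], y[i - 1], y[i - 2])))
    return functions


def evaluate(function, x, i, y):
    """Return 1 if the function fires at position i of (x, y), otherwise 0."""
    _, char, labels = function
    n = len(labels)
    if i - (n - 1) < 0:
        return 0  # not enough preceding labels at this position
    observed = tuple(y[i - k] for k in range(n))
    return 1 if x[i] == char and observed == labels else 0


x = "光海由後苑門出走"
y = list("SNSNNNNE")  # labels of FIG. 4 for input text 1
functions = collect_functions(x, y)

f1_at_6 = ("f1", "出", ("N", "N"))
print(f1_at_6 in functions)          # the f1 example for i=6
print(evaluate(f1_at_6, x, 6, y))    # fires on the training labels
```

With the labels of FIG. 4, the function for x=出, y₆=N, y₅=N returns 1 at i=6 and 0 for any labeling in which y₆ or y₅ differs.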

Once the characteristic functions and weights are determined, the apparatus may use Equation 1 and Equation 2 to determine the label sequence corresponding to the character sequence included in the text to be translated. The apparatus may use Equation 1 to calculate a probability value for every label sequence that can correspond to the character sequence, and may use Equation 2 to obtain the label sequence with the highest probability value. The apparatus may recognize the character corresponding to the label E in the label sequence as the last character of a sentence and insert a period after that character. The apparatus may recognize a character corresponding to the label S in the label sequence as the first character of a phrase and insert a comma before that character. When a comma and a period coincide, the comma may be removed.
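The procedure above, scoring every candidate label sequence and keeping the most probable one, can be sketched as follows. Equation 1 and Equation 2 are not reproduced in this part of the document, so the example assumes the usual log-linear CRF form: the probability of a label sequence is proportional to the exponential of the weighted sum of its characteristic functions. The weights are toy values of 1.0 chosen for illustration, not values obtained by MLE, and the names `score` and `decode` are hypothetical.

```python
import itertools
import math

LABELS = ("S", "N", "E")


def score(weights, x, y):
    """Weighted sum of the characteristic functions that fire on (x, y)."""
    total = 0.0
    for i in range(len(x)):
        total += weights.get(("f2", x[i], (y[i],)), 0.0)
        if i >= 1:
            total += weights.get(("f1", x[i], (y[i], y[i - 1])), 0.0)
        if i >= 2:
            total += weights.get(("f3", x[i], (y[i], y[i - 1], y[i - 2])), 0.0)
    return total


def decode(weights, x):
    """Enumerate all label sequences and return the most probable one."""
    candidates = list(itertools.product(LABELS, repeat=len(x)))
    scores = [score(weights, x, y) for y in candidates]
    # Normalizer over all candidates (assumed form of Equation 1).
    log_z = math.log(sum(math.exp(s) for s in scores))
    # Highest-probability sequence (assumed form of Equation 2).
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], math.exp(scores[best] - log_z)


# Toy weights: 1.0 for every function observed in the FIG. 4 training data.
x = "光海由後苑門出走"
gold = tuple("SNSNNNNE")
weights = {}
for i in range(len(x)):
    weights[("f2", x[i], (gold[i],))] = 1.0
    if i >= 1:
        weights[("f1", x[i], (gold[i], gold[i - 1]))] = 1.0
    if i >= 2:
        weights[("f3", x[i], (gold[i], gold[i - 1], gold[i - 2]))] = 1.0

best, prob = decode(weights, x)
print(best, prob)
```

Exhaustive enumeration grows exponentially with the length of the text and is used here only because it mirrors the description. Practical linear-chain CRF implementations usually find the same maximum with the Viterbi algorithm.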

FIG. 5 illustrates exemplary training data generated by an apparatus for identifying sentences and phrases of Chinese character literature according to an embodiment.

Referring to FIG. 5, the characters included in the input text may be entered in item x of the training data. For example, the characters of “光海由後苑門出走” may be entered in item x in order. Numbers indicating the position of each character may be arranged in item i. For example, 0 is entered for the first character “光”, 1 for the second character “海”, and 7 for the eighth character “走”. The label corresponding to each character may be entered in item y. For example, the label S may be entered for the character “海”, which is tagged with /S; the label ST may be entered for the character “走”, which is tagged with <START>; and the label N may be entered for the remaining untagged characters. Training may be performed using this training data, and punctuation may be added to the text to be translated. The characteristic functions and weights may be set, and the label sequence obtained, in a manner similar to that described for FIG. 4.

The apparatus may recognize the character corresponding to the label ST in the label sequence as the last character of a sentence and insert a period after that character. The apparatus may recognize a character corresponding to the label S in the label sequence as the last character of a phrase and insert a comma after that character. When a comma and a period coincide, the comma may be removed.
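The insertion rule for the labeling scheme of FIG. 5 can be sketched as follows. The function name `punctuate_end_tagged` and the choice of ", " and "。" as the inserted marks are assumptions for illustration; the label sequence used is the one described for input text 2.

```python
def punctuate_end_tagged(text, labels, comma=", ", period="。"):
    """Insert marks after phrase-final (S) and sentence-final (ST) characters."""
    out = []
    for ch, label in zip(text, labels):
        out.append(ch)
        if label == "ST":
            # A sentence-final character receives only a period, so a comma
            # never coincides with the period.
            out.append(period)
        elif label == "S":
            out.append(comma)
    return "".join(out)


labels = ["N", "S", "N", "N", "N", "N", "N", "ST"]
result = punctuate_end_tagged("光海由後苑門出走", labels)
print(result)
```

Because each character carries a single label in this sketch, a character that ends both a phrase and a sentence is labeled ST and receives only the period.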

FIG. 6 illustrates how an exemplary text to be translated, received by an apparatus for identifying sentences and phrases of Chinese character literature according to an embodiment, is processed.

Referring to FIG. 6, the apparatus according to an embodiment may receive a text to be translated. The received text is one in which phrases and sentences are not yet delimited and may be, for example, “兵人爭入寢殿燃炬搜覓火延簾因燒諸殿”.

The apparatus may use the characteristic functions and weights to obtain the label sequence corresponding to the text to be translated. The apparatus may use Equation 1 to calculate a probability for every label sequence that can correspond to the text to be translated. For example, for label sequences y₁, y₂, y₃, y₄, and so on, it may calculate the probabilities P(y₁|x), P(y₂|x), P(y₃|x), P(y₄|x), and so on. According to Equation 2, the apparatus may obtain the label sequence y₁ (S N N N N N S N N N S N N S N N E), which corresponds to the highest probability value, 0.80. The label sequence may include the label S for identifying phrases and the label E for identifying sentences.

The apparatus may insert punctuation marks into the text to be translated in accordance with the label sequence y₁. For example, it may insert a comma before the characters “兵”, “燃”, “火”, and “因”, which correspond to the label S in the label sequence y₁, and a period after the character “殿”, which corresponds to the label E. The apparatus may provide the punctuated output text “兵人爭入寢殿, 燃炬搜覓, 火延簾, 因燒諸殿。”.
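The insertion step for the labeling scheme of FIG. 4 can be sketched as follows, using the example text of FIG. 6. The function name `punctuate_start_tagged` is hypothetical, and the label sequence is written with one label per character of the 17-character example text. The sketch assumes that a comma falling at the very start of the text is omitted, which is consistent with the output text given above, and that a comma coinciding with a period is dropped.

```python
def punctuate_start_tagged(text, labels, comma=", ", period="。"):
    """Insert a comma before phrase-initial (S) characters and a period
    after sentence-final (E) characters."""
    out = []
    for ch, label in zip(text, labels):
        # Skip the comma at the start of the text (assumption) and directly
        # after a period (the comma/period overlap rule).
        if label == "S" and out and out[-1] != period:
            out.append(comma)
        out.append(ch)
        if label == "E":
            out.append(period)
    return "".join(out)


text = "兵人爭入寢殿燃炬搜覓火延簾因燒諸殿"
labels = "S N N N N N S N N N S N N S N N E".split()
result = punctuate_start_tagged(text, labels)
print(result)  # 兵人爭入寢殿, 燃炬搜覓, 火延簾, 因燒諸殿。
```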

By using the characteristic functions to obtain a label sequence that includes both labels identifying phrases and labels identifying sentences, the punctuation marks that delimit phrases and those that delimit sentences can be handled at the same time in classical Chinese character literature that contains no spaces or punctuation marks. In particular, the effect may be greater for text written in Chinese characters, where the end of a sentence and the separation of phrases are harder to determine than in Korean, which has regular sentence-final endings and particles.

FIG. 7 is a flowchart illustrating a method for identifying sentences and phrases of Chinese character literature according to an embodiment.

In the following, it is assumed that the server 200 of FIG. 2 performs the process of FIG. 7. In the description of FIG. 7, operations described as being performed by the server 200 may be understood as being controlled by the processor 230.

Referring to FIG. 7, in step 710, the server may receive input text that includes pre-entered tags distinguishing sentences and phrases. For example, the server may receive, from the outside through various interfaces, input text into which a user has inserted tags that distinguish sentences and tags that distinguish phrases.

In step 720, the server may determine a characteristic function for each character included in the input text. For example, the server may set a plurality of types of characteristic function and, for each character included in the input text, set the return value of each type of characteristic function.

In step 730, the server may receive a text to be translated. For example, the server may receive, from the outside through various interfaces, a text to be translated that consists of Chinese characters and contains no spaces or punctuation marks.

In step 740, the server may use the characteristic functions to calculate the appearance probability of each of a plurality of label sequences that distinguish sentences and phrases for the character sequence included in the text to be translated. For example, the server may input the character sequence included in the text to be translated into the characteristic functions set in step 720 and, using the return values set in step 720, calculate the appearance probability of every label sequence that can correspond to the character sequence.

In step 750, the server may obtain, based on the appearance probabilities, the label sequence corresponding to the character sequence from among the plurality of label sequences. For example, the server may obtain the label sequence with the highest appearance probability.

In step 760, the server may insert punctuation marks that distinguish sentences and phrases into the text to be translated, based on the label sequence. For example, based on the labels in the label sequence that indicate sentence boundaries and phrase boundaries, the server may insert a punctuation mark before or after the character corresponding to each such label. The resulting text, in which punctuation marks have been inserted into the text to be translated, may be provided to the outside through various interfaces.

The embodiments of this document and the terms used herein are not intended to limit the technology described in this document to specific embodiments, and should be understood to include various modifications, equivalents, and/or alternatives of the embodiments. In the description of the drawings, similar reference numerals may be used for similar components. A singular expression may include the plural unless the context clearly indicates otherwise. In this document, expressions such as "A or B", "at least one of A and/or B", "A, B, or C", or "at least one of A, B, and/or C" may include all possible combinations of the items listed together. Expressions such as "first" or "second" may modify the corresponding components regardless of order or importance, and are used only to distinguish one component from another without limiting the components. When a component is referred to as being "(functionally or communicatively) connected" or "coupled" to another component, the component may be connected to the other component directly or through yet another component.

In this document, "adapted to or configured to" may be used interchangeably with, for example, "suitable for," "having the capacity to," "modified to," "made to," "capable of," or "designed to," in hardware or in software, depending on the situation. In some situations, the expression "a device configured to" may mean that the device is "capable of" operating together with other devices or components. For example, the phrase "a processor configured (or set) to perform A, B, and C" may refer to a dedicated processor (e.g., an embedded processor) for performing the corresponding operations, or to a general-purpose processor (e.g., a CPU) capable of performing the corresponding operations by executing one or more programs stored in a memory device.

As used in this document, the term "module" includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A "module" may be an integrally formed component, or a minimum unit that performs one or more functions, or a part thereof. A "module" may be implemented mechanically or electronically and may include, for example, an application-specific integrated circuit (ASIC) chip, field-programmable gate arrays (FPGAs), or a programmable logic device, known or to be developed, that performs certain operations.

At least part of an apparatus (e.g., modules or their functions) or a method (e.g., operations) according to an embodiment may be implemented as instructions stored in a computer-readable storage medium in the form of a program module. When the instructions are executed by a processor, the processor may perform the functions corresponding to the instructions.

Each component (e.g., a module or a program module) according to an embodiment may consist of a single entity or a plurality of entities; some of the sub-components described above may be omitted, or other sub-components may be further included. Alternatively or additionally, some components (e.g., modules or program modules) may be integrated into a single entity that performs, identically or similarly, the functions performed by each of the components before integration. Operations performed by a module, a program module, or another component according to an embodiment may be executed sequentially, in parallel, repeatedly, or heuristically; at least some operations may be executed in a different order or omitted, or other operations may be added.

Claims

An apparatus for identifying sentences and phrases in Chinese character literature, the apparatus comprising:
a communication circuit configured to communicate with the outside;
a memory; and
a processor electrically connected to the communication circuit and the memory,
wherein the processor is configured to:
receive, using the communication circuit, input text including pre-entered tags for distinguishing a sentence and a phrase;
determine a characteristic function for each character included in the input text;
receive, using the communication circuit, a text to be translated;
calculate, using the characteristic function, an appearance probability of each of a plurality of label sequences for distinguishing the sentence and the phrase with respect to a character sequence included in the text to be translated;
obtain, based on the appearance probability, a label sequence corresponding to the character sequence from among the plurality of label sequences; and
insert, based on the label sequence, punctuation marks for distinguishing the sentence and the phrase into the text to be translated.

The apparatus of claim 1,
wherein the pre-entered tags are disposed adjacent to the first character of the phrase and the last character of the sentence included in the input text, and
wherein the processor is configured to
insert a comma before a character corresponding to a label for distinguishing the phrase in the text to be translated, and insert a period after a character corresponding to a label for distinguishing the sentence.

The apparatus of claim 1,
wherein the pre-entered tags are disposed adjacent to the last character of the phrase and the last character of the sentence included in the input text, and
wherein the processor is configured to
insert a comma after a character corresponding to a label for distinguishing the phrase in the text to be translated, and insert a period after a character corresponding to a label for distinguishing the sentence.

The apparatus of claim 1,
wherein the processor is configured to:
generate, based on the pre-entered tags and the characters included in the input text, training data including a label for each character included in the input text for distinguishing the sentence and the phrase; and
determine the characteristic function for each character included in the input text based on the training data.

The apparatus of claim 1,
wherein the characteristic function is determined based on a specific character included in the input text, a position of the specific character, a label corresponding to the specific character, and a label corresponding to another character adjacent to the specific character.

The apparatus of claim 1,
wherein the processor is configured to:
determine a plurality of characteristic functions for each character included in the input text;
assign a different weight to each of the plurality of characteristic functions; and
calculate the appearance probability based on return values of the plurality of characteristic functions and the weights.

The apparatus of claim 1,
wherein the processor is configured to
calculate the appearance probability and obtain the label sequence according to a conditional random field (CRF).

The apparatus of claim 1,
wherein the label sequence is the sequence having the highest appearance probability for the character sequence among the plurality of label sequences.

The apparatus of claim 1,
wherein the processor is configured to
provide a result text including the text to be translated and the punctuation marks.

A method for identifying sentences and phrases in Chinese character literature, the method comprising:
receiving input text including pre-entered tags for distinguishing a sentence and a phrase;
determining a characteristic function for each character included in the input text;
receiving a text to be translated;
calculating, using the characteristic function, an appearance probability of each of a plurality of label sequences for distinguishing the sentence and the phrase with respect to a character sequence included in the text to be translated;
obtaining, based on the appearance probability, a label sequence corresponding to the character sequence from among the plurality of label sequences; and
inserting, based on the label sequence, punctuation marks for distinguishing the sentence and the phrase into the text to be translated.