KR100897718B1 - Device and method for correcting errors of colloquial type sentence

Info

Publication number: KR100897718B1
Application number: KR1020070093657A
Authority: KR
Inventors: 한경수; 장정선; 임해창; 서준모
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2007-09-14
Filing date: 2007-09-14
Publication date: 2009-05-15
Also published as: KR20090028219A

Abstract

An apparatus and method for correcting errors in colloquial sentences are provided that can efficiently correct the various errors found in colloquial sentences in a communication environment. The error correction apparatus includes: an input unit that generates input data in response to user operation; a colloquial correction unit that extracts a colloquial string from the input data and corrects it by selectively applying processes such as repeated phrase removal, decomposed syllable combination, and string correction with the syllable as the basic unit of correction; and a correction data return unit that reflects the corrected colloquial string in the input data and returns the corrected input data. The error correction method includes: generating input data in response to user operation; extracting a colloquial string contained in the input data; correcting the extracted colloquial string by selectively applying processes such as repeated phrase removal, decomposed syllable combination, and string correction with the syllable as the basic unit of correction; and generating corrected input data by including the corrected colloquial string in the input data.

Communication network, spelling, colloquial language

Description

DEVICE AND METHOD FOR CORRECTING ERRORS OF COLLOQUIAL TYPE SENTENCE

The present invention relates to language processing in a communication network and, more particularly, to an apparatus and method for correcting errors in colloquial sentences in a communication network.

Conventional natural language processing techniques such as morphological analysis and syntactic parsing target sentences that are grammatically correct and free of non-verbal elements such as emoticons. For reference, an emoticon is a facial expression composed of symbols that can be entered through an input means such as a keypad or a mouse.

However, the language that users actually write contains errors that violate spelling and grammar rules, and many sentences include spelling errors, spacing errors, and the like. Such errors are a major cause of degraded performance in natural language processing techniques and in the applications built on them.

To address this problem, various error correction techniques such as spelling correction and spacing correction have been proposed.

However, conventional error correction techniques target written-style sentences such as those in newspaper articles, novels, reports, and essays. Errors in such sentences are usually unintentional, arising from user mistakes, inaccurate information, or insufficient knowledge of grammar. For example, writing "할 수 있다" ("can do") as "할수 있다" (a spacing error), or "샌프란시스코" ("San Francisco") as "센프란시스코" (a spelling error), can be attributed to a simple mistake or to not knowing the correct notation.

Colloquial sentences used in communication environments such as the Internet or a mobile communication network, however, contain a variety of error elements of types not found in ordinary written-style sentences. That is, sentences used in a communication environment include error types very different from those in the sentences targeted by conventional error correction techniques.

Colloquial sentences used in a communication environment are therefore difficult to correct with conventional error correction techniques. The reasons are explained in more detail below.

First, in colloquial sentences used in communication environments such as the Internet or a mobile communication network, users very frequently introduce errors on purpose, for emphasis, visual effect, or fun.

For example, suppose a user writes the sentence "나 ㅈㅈㅏ증났삼 -.-". This sentence is meant to convey "나 짜증났어" ("I'm annoyed"). To emphasize the word "짜증" ("annoyance"), the user split its first syllable into separate consonants and a vowel, which also makes the characters look larger, and added an emoticon (-.-).

As this shows, colloquial sentences in a communication environment contain not only errors the user did not intend but also errors entered on purpose.

This reflects the fact that the main users of colloquial sentences in communication environments are people in their teens and twenties, who are sensitive to trends and prefer things that are individual and unique.

Second, errors entered intentionally by the user show different characteristics from unintentional errors such as spacing or spelling errors.

For example, an intentional error usually spans several characters, and the characters used differ by error type. Emoticons mainly use symbols, whereas a case such as "ㅈㅈㅏ" above uses a consonant, a consonant, and a vowel.

The circumstances in which errors occur, that is, which words or sentences are being entered when an error appears, also differ. Conventional unintentional spelling and spacing errors show certain tendencies (for example, ones related to the keyboard layout), whereas intentional errors can occur in any situation.

Third, colloquial sentences in a communication environment contain many types of errors, and several kinds of errors often appear together in one sentence, so correcting only one kind of error yields little benefit. Multiple errors must therefore be corrected together within a single process or module.

Accordingly, the technical problem addressed by the present invention is to provide an error correction apparatus and method that efficiently recognize and correct the various errors in colloquial sentences found in a communication environment, thereby improving the performance of colloquial language analysis and natural language processing techniques and raising user satisfaction.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

To achieve the above object, an apparatus for correcting errors in colloquial sentences according to the present invention includes: an input unit that generates input data in response to user operation; a colloquial correction unit that extracts a colloquial string from the input data and generates a corrected colloquial string by selectively performing on it one or more of repeated phrase removal, decomposed syllable combination, and string correction with the syllable as the basic unit of correction; and a correction data return unit that reflects the corrected colloquial string in the input data and returns the corrected input data.

A method for correcting errors in colloquial sentences according to the present invention includes: generating input data in response to user operation; extracting a colloquial string contained in the input data; generating a corrected colloquial string by selectively performing on the colloquial string one or more of repeated phrase removal, decomposed syllable combination, and string correction with the syllable as the basic unit of correction; and generating corrected input data by including the corrected colloquial string in the input data.

The error correction apparatus and method according to the present invention, configured as described above, can efficiently recognize and correct the various errors in colloquial sentences found in a communication environment.

In addition, the error correction apparatus and method according to the present invention can improve the performance of colloquial language analysis techniques and raise user satisfaction.

Furthermore, when the error correction apparatus and method according to the present invention are applied as a preprocessing step for colloquial sentences and combined with written-language-based natural language processing techniques, the overall performance of natural language processing can be maintained even when the input contains colloquial text, which is otherwise hard to correct because its error types differ from those of written text.

Hereinafter, an error correction apparatus according to a preferred embodiment of the present invention is described in detail with reference to the accompanying FIGS. 1 and 2.

FIG. 1 is a block diagram of an apparatus for correcting errors in colloquial sentences according to an embodiment of the present invention; for convenience of description, it illustrates an error correction apparatus in the form of a mobile phone. Besides a mobile phone, the error correction apparatus may be any of various information processing devices usable in a communication environment, such as a computer, a personal digital assistant, a notebook computer, or a server.

Referring to FIG. 1, the apparatus for correcting errors in colloquial sentences according to an embodiment of the present invention includes a control unit (100), an input unit (110), a display unit (120), a data storage unit (131), a mode/state storage unit (132), a signal processing unit (140), a colloquial correction unit (150), a correction data return unit (160), and a post-processing correction unit (170).

The control unit (100) controls the overall operation of the mobile phone based on the status flag stored in the mode/state storage unit (132). It also drives the colloquial correction unit (150), the correction data return unit (160), and the post-processing correction unit (170) to perform the natural language processing function, thereby correcting errors in the input data entered by the user, and displays the corrected data on the display unit (120).

The input unit (110) generates input data corresponding to the original message entered by the user; a keypad, a touch pen, a mouse, or the like may serve as the input means. A keypad-type input unit (110) has numeric keys for entering numbers such as telephone numbers, character keys for entering characters, function keys for performing specific functions, and direction keys. When the user presses a key, the input unit (110) generates the input data corresponding to that key and sends it to the control unit (100).

In the standby state of the mobile phone, the display unit (120) shows the battery level, the radio signal strength, the date and time, the operating state of the phone, and so on. When the natural language processing function is performed, the original message and the corrected message, in which the errors of the original message have been corrected, are displayed on the display unit (120) in the message editing window.

The data storage unit (131) serves as a data buffer while the mobile phone operates; it temporarily stores data entered through the input unit (110) and stores data such as text or images received by the phone from outside.

The mode/state storage unit (132) stores the current operating mode of the mobile phone, as selected through the input unit (110), in the form of a status flag (0, 1, 2, ...). To distinguish whether the phone is in receiving mode, sending mode, storage mode, call mode, and so on, the control unit (100) assigns a unique status flag to each mode and updates the mode/state storage unit (132) accordingly.

The signal processing unit (140) is a processor that encodes and decodes voice signals and performs functions such as equalization for removing multipath noise and audio data processing.

The colloquial correction unit (150) determines whether the input data generated by user operation contains a colloquial string and extracts from the input data the colloquial string to be corrected. It then performs colloquial correction on the extracted string to generate a corrected colloquial string.

The correction data return unit (160) receives the corrected colloquial string from the colloquial correction unit (150), reflects it in the input data, and returns the corrected input data.

The error correction apparatus may further include a post-processing correction unit (170) for carrying out the natural language processing that follows the colloquial correction unit (150). The post-processing correction unit (170) performs subsequent written-language correction on the returned input data.

Here, the post-processing correction unit (170) is a written-language-based natural language processing engine that performs language processing on written-style sentences, that is, sentences free of intentional user errors (for example, sentences used in reports and newspaper articles).

The colloquial correction unit (150) corrects errors according to a classification obtained by analyzing the various errors that appear in colloquial sentences in a communication environment, with a correction method devised for each type. In this way it converts the input into sentence expressions that a written-language-based natural language processing engine, such as a morphological analyzer or syntactic parser, can use directly. That is, for a colloquial sentence entered by the user, it performs not only basic corrections such as spacing and spelling correction but also repeated phrase removal, combination of decomposed graphemes, emoticon recognition, and recognition and correction of slang and profanity.

Therefore, if colloquial error correction is applied as a preprocessing step for colloquial sentences and the corrected string it outputs is used as the input to the post-processing correction unit (170), a written-language-based natural language processing engine can be applied to colloquial text without loss of performance.

Various natural language processing applications, such as a morphological analyzer, a syntactic parser, a dialogue engine, or a machine translation engine, may be used as the post-processing correction unit (170).

FIG. 2 is a detailed block diagram of some of the components shown in FIG. 1 and illustrates the detailed configuration of the colloquial correction unit (150).

Colloquial errors can be summarized again as follows.

Unlike sentences used in ordinary documents or in everyday life, colloquial sentences used in a communication environment often contain deliberate errors, and their types are highly varied, going well beyond conventional typos and spacing errors.

For example, phrases such as "아아아아아니야." (a stretched form of "아니야", "no"), "ㅈㅈㅏ증나" (a decomposed form of "짜증나", "I'm annoyed"), and "이쁘다." (a nonstandard form of "예쁘다", "pretty") contain colloquial errors. Such errors are used to emphasize the user's feelings or simply for fun, and they rarely occur in written-style sentences.

In addition, because the content of such sentences is often simple and casual, users themselves are relatively insensitive to spelling and grammar errors. As a result, users often do not bother to fix a sentence even when they have mistyped it, or they type a word as it sounds instead of in its correct form. In the case of "이쁘다", the correct form is "예쁘다", but the word is often typed and used as it is pronounced.

The colloquial correction unit (150) handles errors according to an analysis of the various error phenomena that occur frequently in colloquial text in a communication environment. These are classified into three broad types, addressed by repeated phrase removal, decomposed syllable combination, and string correction, and the unit performs the correction appropriate to each.

When a string entered by the user is received as input data, the various error corrections are performed and the corrected string is returned as the result.

Referring to FIG. 2, the colloquial correction unit (150) includes a repeated phrase removal unit (151), a decomposed syllable combination unit (152), and a string correction unit (153).

The repeated phrase removal unit (151) extracts repeated phrases, that is, parts that are unnecessarily repeated within the colloquial string, and removes them to make the sentence clear, thereby converting it into a sentence suitable for language processing.
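The following is a minimal sketch of repeated phrase removal, not the heuristic of the invention itself, which this passage does not specify. It assumes that any unit of at most `max_unit` characters repeated at least `min_repeats` times in a row is an unnecessary repetition and collapses it to one occurrence. The function name `remove_repeats` and both thresholds are hypothetical.

```python
import re


def remove_repeats(text: str, max_unit: int = 3, min_repeats: int = 3) -> str:
    """Collapse a short unit that repeats consecutively into one occurrence.

    Assumption for illustration: a run counts as a repeated phrase only when
    the unit (1..max_unit characters) occurs at least min_repeats times.
    """
    # (.{1,N}?) lazily captures the shortest repeating unit;
    # \1{K,} then requires at least K further consecutive copies of it.
    pattern = re.compile(r"(.{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    print(remove_repeats("아아아아아니야"))
    print(remove_repeats("ㅋㅋㅋㅋ"))
```

With these thresholds, a doubled character such as the "ㅈㅈ" in "ㅈㅈㅏ" is left untouched, so that it remains available to a later decomposed syllable combination step.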

The decomposed syllable combination unit (152) extracts decomposed syllables from the colloquial string and then combines their consonants and vowels to restore the correct syllables.
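As an illustration of how decomposed graphemes can be recombined, the sketch below uses the standard Unicode arithmetic for precomposed Hangul syllables (code point = 0xAC00 + (initial index × 21 + vowel index) × 28 + final index). It is a simplified assumption-laden example: it handles only an initial consonant (optionally typed as a doubled pair such as "ㅈㅈ") followed by a vowel, ignores final consonants, and the names `compose_jamo` and `DOUBLED` are hypothetical.

```python
# Initial consonants and vowels in Unicode composition order,
# written with Hangul compatibility jamo as typed on a keyboard.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Assumption: two identical plain consonants stand for the tense consonant.
DOUBLED = {"ㄱㄱ": "ㄲ", "ㄷㄷ": "ㄸ", "ㅂㅂ": "ㅃ", "ㅅㅅ": "ㅆ", "ㅈㅈ": "ㅉ"}


def compose_jamo(text: str) -> str:
    """Combine loose consonant + vowel sequences into Hangul syllables."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DOUBLED and i + 2 < len(text) and text[i + 2] in JUNGSEONG:
            # Doubled consonant followed by a vowel, e.g. ㅈㅈㅏ
            lead, vowel, step = DOUBLED[pair], text[i + 2], 3
        elif text[i] in CHOSEONG and i + 1 < len(text) and text[i + 1] in JUNGSEONG:
            # Single consonant followed by a vowel, e.g. ㄱㅏ
            lead, vowel, step = text[i], text[i + 1], 2
        else:
            # Anything else is copied through unchanged.
            out.append(text[i])
            i += 1
            continue
        code = 0xAC00 + (CHOSEONG.index(lead) * 21 + JUNGSEONG.index(vowel)) * 28
        out.append(chr(code))
        i += step
    return "".join(out)


if __name__ == "__main__":
    print(compose_jamo("ㅈㅈㅏ증나"))
```

A fuller implementation would also have to decide when a trailing consonant is a final consonant of the current syllable and when it begins the next one.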

The string correction unit (153) learns and stores correction rules, with the syllable as the basic unit of correction, from sample strings that contain colloquial errors and from correction strings in which those errors have been corrected. It then corrects errors in a given colloquial string based on the stored correction rules.

As shown in FIG. 2, the string correction unit (153) includes a verification data dictionary (153_1), a learning unit (153_2), a correction rule dictionary (153_3), and a correction execution unit (153_4).

The verification data dictionary (153_1) stores the sample strings and the correction strings in which the errors of the sample strings have been corrected. A sufficiently large number of sample strings and correction strings should be registered so that the correction rules are reliable. In addition, because errors in colloquial sentences often differ in type from those handled by general natural language processing techniques, it is advisable for the designer to correct the sample strings manually.

The learning unit (153_2) learns correction rules from the sample strings and the correction strings in which their errors have been corrected. A correction rule operates on syllable strings. The learning unit (153_2) compares a sample string with its correction string and extracts the parts that differ. It treats the erroneous part of the sample string as the target syllable string and the corrected part of the correction string as the post-correction syllable string, and then generates a plurality of preliminary correction rules by including 1 to n syllables to the left and right of the post-correction syllable string, that is, a left syllable context and a right syllable context. It then measures the performance of the preliminary correction rules, selects those that perform well while using little context, and registers the selected rules in the correction rule dictionary (153_3) together with the target syllable string and the left and right syllable contexts. The process of learning correction rules is described in more detail with reference to FIG. 6.
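The candidate-generation part of this learning step can be sketched as follows. This is an illustrative assumption, not the procedure of FIG. 6: the differing span is located by stripping the longest common prefix and suffix, and one candidate rule is emitted for each context width from 1 to n. The `Rule` type and the function `candidate_rules` are hypothetical names, and the performance-based selection of rules is deliberately left out.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    left: str    # left syllable context
    target: str  # erroneous syllable string found in the sample
    right: str   # right syllable context
    fixed: str   # syllable string after correction


def candidate_rules(sample: str, corrected: str, n: int = 2) -> List[Rule]:
    """Generate preliminary rules with 1..n syllables of context per side."""
    shortest = min(len(sample), len(corrected))
    # Length of the longest common prefix.
    p = 0
    while p < shortest and sample[p] == corrected[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < shortest - p and sample[-1 - s] == corrected[-1 - s]:
        s += 1
    target = sample[p:len(sample) - s]
    fixed = corrected[p:len(corrected) - s]
    if target == fixed:
        return []  # the strings are identical, nothing to learn
    end = len(sample) - s
    rules = []
    for k in range(1, n + 1):
        left = sample[max(0, p - k):p]
        right = sample[end:end + k]
        rules.append(Rule(left, target, right, fixed))
    return rules


if __name__ == "__main__":
    for rule in candidate_rules("너무이쁘다", "너무예쁘다"):
        print(rule)
```

Each candidate would then be scored against held-out sample pairs, keeping the rule with the least context that still corrects reliably.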

The correction rule dictionary (153_3) stores the correction rules learned by the learning unit (153_2).

The correction execution unit (153_4) searches the correction rules registered in the correction rule dictionary (153_3) for a rule of the type corresponding to the colloquial string (for example, by checking whether any target syllable string matches part or all of the colloquial string) and performs error correction by applying the rule found to the given colloquial string (for example, by replacing part or all of the colloquial string according to the syllable string and the left and right syllable contexts registered in the rule).
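A minimal sketch of rule lookup and application is shown below. It assumes, for illustration only, that each rule is a tuple of (left context, target syllable string, right context, corrected syllable string) and that a rule fires wherever the target appears between its contexts. The function name `apply_rules` and the tuple layout are hypothetical.

```python
from typing import Iterable, Tuple

# (left context, target syllable string, right context, corrected string)
RuleTuple = Tuple[str, str, str, str]


def apply_rules(text: str, rules: Iterable[RuleTuple]) -> str:
    """Replace each target that appears within its registered context."""
    for left, target, right, fixed in rules:
        # The context is matched but preserved; only the target is rewritten.
        text = text.replace(left + target + right, left + fixed + right)
    return text


if __name__ == "__main__":
    rules = [("", "이", "쁘", "예")]
    print(apply_rules("참 이쁘다", rules))
```

Because the context must match, the same rule leaves a word such as "이것" unchanged, which is the reason for keeping context in the rule at all.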

In this way, the colloquial correction unit (150) divides the error types that occur frequently in a communication environment into three broad categories and applies a correction algorithm suited to each. It thereby effectively corrects the errors that appear in colloquial text in a communication environment and improves the performance of language processing techniques, raising user satisfaction.

For repeated phrase removal and decomposed syllable combination, a heuristic algorithm suited to each type of error may be applied. A heuristic algorithm is an empirical method that reaches a conclusion based on the subjective judgment of the designer (decision maker).

For string correction, which covers spelling error correction, spacing error correction, emoticon recognition and correction, and recognition and correction of slang and profanity, a rule-based learning algorithm may be used. Suitable correction rules are built through this algorithm and then used to correct errors.

For any given colloquial string, the colloquial correction unit (150) may apply repeated phrase removal, decomposed syllable combination, and string correction selectively, or apply more than one of them to the same string. Accordingly, the repeated phrase removal unit (151), the decomposed syllable combination unit (152), and the string correction unit (153) may be built into the colloquial correction unit (150) selectively or all together.

The colloquial correction unit (150) may also determine in advance which correction method (for example, repeated phrase removal, decomposed syllable combination, or string correction) to apply to the extracted colloquial string and then perform colloquial correction according to the method determined. In that case, the colloquial correction unit (150) checks whether the extracted colloquial string contains a repeated phrase, whether it contains decomposed graphemes, and whether any part of it corresponds to a string stored in the verification data dictionary, and decides from the results which correction methods to apply.

If the colloquial string contains a repeated phrase, repeated phrase removal is performed; if it contains decomposed graphemes, decomposed syllable combination is performed. Likewise, if part of the colloquial string corresponds to a string stored in the verification data dictionary (153_1), string correction is performed.
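The three checks above can be sketched as a small selector. The detection criteria here are assumptions chosen for illustration: a repeated phrase is a unit of up to three characters occurring three or more times in a row, decomposed graphemes are any characters in the Hangul compatibility jamo range U+3131 to U+3163, and the verification data dictionary is represented by a plain collection of known erroneous strings. The function name `select_methods` and the returned labels are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed repetition criterion: a 1-3 character unit occurring 3+ times in a row.
_REPEAT = re.compile(r"(.{1,3}?)\1{2,}")


def select_methods(text: str, known_errors: Iterable[str]) -> List[str]:
    """Return the correction methods that apply to a colloquial string."""
    methods = []
    if _REPEAT.search(text):
        methods.append("remove_repeats")
    # Loose consonants or vowels lie in the Hangul compatibility jamo range.
    if any("\u3131" <= ch <= "\u3163" for ch in text):
        methods.append("compose_syllables")
    # Any stored sample string occurring in the text triggers string correction.
    if any(err in text for err in known_errors):
        methods.append("correct_string")
    return methods


if __name__ == "__main__":
    print(select_methods("아아아아아니야", ["이쁘"]))
    print(select_methods("ㅈㅈㅏ증나", ["이쁘"]))
```

Returning a list rather than a single label allows several corrections to be applied to the same string.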

Hereinafter, an error correction method according to a preferred embodiment of the present invention is described in detail with reference to the accompanying FIGS. 3 to 7. For convenience, the error correction method of FIGS. 3 to 7 is described on the assumption that the error correction apparatus of FIGS. 1 and 2 is used.

FIG. 3 is a flowchart of a method for correcting errors in colloquial sentences according to an embodiment of the present invention.

First, in step S100, when the user enters an original message through the input unit (110), the corresponding input data is generated, and the control unit (100) sends the generated input data to the colloquial correction unit (150).

In step S110, the colloquial correction unit (150) checks whether the received input data contains a colloquial string to be corrected and thereby determines whether to perform colloquial correction. If colloquial correction is found to be necessary, the process proceeds to step S120.

In step S120, the colloquial correction unit (150) extracts the colloquial string from the input data for error correction.

S130 단계에서, 구어체 교정부(150)는 추출된 구어체 문자열의 특징에 따라 교정 방식을 판별하며, 판별 결과에 따라 S140 단계 내지 S160 단계 중 적어도 하나 이상의 단계를 수행하게 된다.In operation S130, the colloquial correction unit 150 determines a calibration method according to the extracted colloquial character string, and performs at least one or more of steps S140 to S160 according to the determination result.

At this time, the colloquial correction unit 150 examines the extracted colloquial string and determines in advance which correction method (for example, repeated-phrase removal, decomposed-syllable combination, or string correction) to apply, so that colloquial correction can be performed according to the determined method. To determine the correction method, the colloquial correction unit 150 checks whether the extracted string contains a repeated phrase, whether it contains decomposed graphemes, or whether any part of it corresponds to a string stored in the verification data dictionary, and decides on that basis whether to apply each of the correction methods of steps S140 to S160.

In steps S140 to S160, errors in the extracted colloquial string are corrected by the correction method determined in step S130. Steps S140 to S160, which perform colloquial correction on a given colloquial string, may be applied selectively as the case requires, or several of them may be applied together to the same string.

Step S140 performs repeated-phrase removal: the repeated-phrase removal unit 151 in the colloquial correction unit 150 extracts and removes phrases that are repeated unnecessarily in the colloquial string.

Step S150 performs decomposed-syllable combination: the decomposed-syllable combination unit 152 in the colloquial correction unit 150 extracts decomposed syllables from the colloquial string and then combines their consonants and vowels to restore the correct syllables.

Step S160 performs string correction: using the verification data dictionary 153_1 and the correction rule dictionary 153_3, string corrections such as spelling correction, spacing correction, emoticon correction, and slang correction can be performed at the same time. Here, the string correction unit 153 in the colloquial correction unit 150 builds the correction rule dictionary 153_3 by learning correction rules from the verification data dictionary 153_1 and storing them in the correction rule dictionary 153_3. The verification data dictionary 153_1 stores sample strings containing colloquial errors together with corrected strings in which those errors have been fixed.

Step S130, in which the correction method is determined in advance according to the characteristics of the extracted colloquial string, may be omitted.

In step S170, when the colloquial correction unit 150 outputs the error-corrected colloquial string, the corrected data return unit 160 incorporates the corrected colloquial string into the input data to generate corrected input data.
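The method-selection check of step S130 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `select_methods`, the step labels, and the two regular expressions are hypothetical and do not appear in the specification, which does not say how each check is implemented.

```python
import re

# Assumption: "decomposed graphemes" are detected as Hangul compatibility jamo (U+3131..U+3163).
JAMO = re.compile(r"[\u3131-\u3163]")
# Assumption: a "repeated phrase" is any substring repeated back-to-back at least once.
REPEAT = re.compile(r"(.+?)\1+")

def select_methods(text, rule_targets):
    """S130 (sketch): decide which of steps S140, S150, S160 apply to the string."""
    methods = []
    if REPEAT.search(text):
        methods.append("S140_remove_repeats")
    if JAMO.search(text):
        methods.append("S150_compose_syllables")
    # rule_targets stands in for the strings that string correction can match against.
    if any(target in text for target in rule_targets):
        methods.append("S160_string_correction")
    return methods
```

More than one method may be returned for a single string, which mirrors the statement that steps S140 to S160 may be applied selectively or together.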

FIGS. 4A to 6 are detailed flowcharts of some of the steps shown in FIG. 3, illustrating the repeated-phrase removal process, the decomposed-syllable combination process, and the string correction process, respectively.

The repeated-phrase removal process performed by the repeated-phrase removal unit 151 follows the sequence of operations shown in FIG. 4A.

First, when a colloquial string is received, the repeated-phrase removal unit 151 scans it from the leftmost to the rightmost character in the repeated-phrase search of steps S141 and S142 and checks whether any part is repeated. If the string contains a repeated part, step S143 is performed to extract the unnecessarily repeated phrase; if not, the repeated-phrase removal process ends.

Let the n characters that make up the colloquial string be w_1, …, w_n. The repeated-phrase search proceeds from left to right over the given string. The search algorithm is described in more detail below.

In the first step, for an arbitrary character w_i (1 ≤ i < n) among w_1, …, w_n, it is checked whether the same character exists among w_{i+1}, …, w_n.

In the second step, if a character w_j (i < j ≤ n) identical to w_i exists, it is checked whether the substrings w_i, …, w_{i+k} and w_j, …, w_{j+k} match. Here k is the distance from w_i to w_j and is defined as k = j - i - 1.

In the third step, if the substrings w_i, …, w_{i+k} and w_j, …, w_{j+k} match, w_i, …, w_{i+k} is designated the repeated string element and w_j, …, w_{j+k} is designated the repeated phrase.

In the fourth step, the procedure is repeated from the second step for the remaining string w_{j+k+1}, …, w_n that follows w_j, to check whether further repetitions exist.

This process is illustrated in FIG. 4B.

For example, given the colloquial string '이제 바이바이바이야' (roughly, 'it's bye-bye-bye now'), the repeated-phrase removal unit 151 identifies '바이바이바이' as a repeated phrase. Here '바이' ('bye') is the repeated string element and k is 2. '바이바이바이' is the phrase '바이' repeated three times.
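The left-to-right search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_repeated_phrase` and its return format are hypothetical, and the sketch works with the element length `j - i` directly instead of the offset k used in the text.

```python
def find_repeated_phrase(text):
    """Return (start, element, count) for the leftmost back-to-back repetition, or None."""
    n = len(text)
    for i in range(n - 1):                     # step 1: pick a candidate character w_i
        for j in range(i + 1, n):              # step 2: look for a later w_j equal to w_i
            if text[j] != text[i]:
                continue
            size = j - i                       # length of the candidate repeated element
            element = text[i:j]
            if text[j:j + size] != element:    # the two substrings must match in full
                continue
            count = 2                          # step 3: element plus one repetition found
            pos = j + size
            while text[pos:pos + size] == element:   # step 4: keep scanning for more copies
                count += 1
                pos += size
            return i, element, count
    return None
```

For the example string, `find_repeated_phrase("이제 바이바이바이야")` returns `(3, "바이", 3)`: the element '바이' starts at index 3 and occurs three times in a row.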

A repeated-phrase dictionary 151_1, which stores strings that are exempt from correction, is built in the repeated-phrase removal unit 151. Its purpose is to ensure that repeated-phrase removal is performed only on strings that are not exempt.

In the exception check of steps S144 and S145, the repeated-phrase removal unit 151 searches the repeated-phrase dictionary 151_1 to determine whether the repeated phrase extracted from the colloquial string is listed in it.

If the extracted repeated phrase is in the repeated-phrase dictionary 151_1, it is judged not to be a correction target and the removal process ends. The dictionary exists for cases in which the repetition is itself a word, such as '깡충깡충' (a mimetic word for hopping); when the string found by the search is listed in the dictionary, removal is not attempted. The repeated-phrase dictionary 151_1, which stores these exceptions, can be built by hand, in the same way that colloquial sentences from the communication environment are collected.

If the extracted repeated phrase is not in the repeated-phrase dictionary 151_1, step S146, the removal step, is performed and the unnecessarily duplicated phrase is removed from the colloquial string. In other words, if the repeated phrase found is not exempt from correction, the redundant part is removed and the sentence is rewritten so that the word in question (the repeated string element) appears only once within the extracted repeated phrase.

If a phrase is repeated N times in the colloquial string (N being a natural number of 2 or more), the repeated-phrase removal unit 151 reduces the repetition count one at a time until only a single occurrence remains. That is, to reduce the risk of error, it removes only one repeated string element at a time and then consults the repeated-phrase dictionary 151_1 again.

For example, if the user enters '바이바이바이', the repeated phrase is '바이바이바이' and the repeated string element is '바이', so the string is first changed from '바이바이바이' to '바이바이', and it is then checked whether the resulting string is in the repeated-phrase dictionary 151_1.

If '바이바이' is not in the repeated-phrase dictionary 151_1, another repeated string element is removed and the string becomes '바이'. This is repeated until only one repeated string element remains in the string corresponding to the repeated phrase.

If '바이바이' is listed in the repeated-phrase dictionary 151_1, no further removal is performed on it and the removal process ends at that point.
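The one-at-a-time removal with an exception lookup after each step can be sketched as below. The function name `remove_repeats` and the use of a plain Python set for the repeated-phrase dictionary are assumptions made for illustration.

```python
def remove_repeats(element, count, exceptions):
    """Drop one copy of `element` at a time; stop early if the current phrase is an exception."""
    while count > 1:
        if element * count in exceptions:   # S144/S145: phrase is listed, leave it alone
            break
        count -= 1                          # S146: remove a single repeated element
    return element * count
```

With an empty dictionary, `remove_repeats("바이", 3, set())` reduces the phrase to '바이'; if '바이바이' is listed, the same call stops at '바이바이'.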

FIG. 5 illustrates the combination algorithm for restoring decomposed syllables. A decomposed syllable arises when Hangul is entered on an information processing device used for communication, such as a computer or mobile phone, and the consonants and vowels are not assembled into a complete syllable but are entered as separate graphemes (for example, 'ㅎㅏ얀 ㄱㅕ울' instead of '하얀 겨울', 'white winter').

With reference to FIG. 5, the combination algorithm implemented by the decomposed-syllable combination unit 152 is as follows.

In step S151, the decomposed-syllable combination unit 152 sets a check pointer and designates the N-th character (N being a natural number) of the colloquial string as the reference character. The unit can designate reference characters by incrementing the check pointer, which marks the position of the character under examination, one position at a time. The check proceeds one character at a time from the leftmost to the rightmost end of the colloquial string.

In step S152, if the whole sentence making up the colloquial string has been examined, the combination process ends; otherwise, the subsequent steps S153 to S158 are carried out.

In step S153, the decomposed-syllable combination unit 152 checks whether the reference character at the current position is a Hangul consonant. For example, in a sentence of n characters w_1, …, w_n, it checks whether the character w_i at a given position is a Hangul consonant.

If the reference character is not a consonant, the unit proceeds to step S154 and starts again by checking whether the next character, at position N+1, is a consonant. If the reference character is a consonant, step S155 checks whether the next character, at position N+1, is a vowel. In the example, if w_i is a consonant, it is checked whether w_{i+1} is a vowel.

If the N-th character is a consonant and the (N+1)-th is a vowel, the unit combines the two in step S156 into a single syllable. It then returns to step S154 and resumes the consonant check from the (N+2)-th character. In the example, if w_{i+1} is a vowel, w_i and w_{i+1} are combined into one character and the consonant check resumes from w_{i+2}.

If the N-th reference character is a consonant but the (N+1)-th character is not a vowel, the unit checks in step S157 whether the (N+1)-th character is the same consonant as the N-th. In the example, if w_{i+1} is not a vowel, it is checked whether it is the same consonant as w_i.

If the N-th and (N+1)-th characters are the same consonant, the unit combines them in step S158, treats the result as the current consonant, and returns to step S155 to check whether the (N+2)-th character is a vowel. In the example, if w_i and w_{i+1} are the same consonant, the two are combined into one double consonant. For instance, if w_i = 'ㅈ' and w_{i+1} = 'ㅈ', they are combined into 'ㅉ'. It is then checked whether w_{i+2} is a vowel; if it is, the double consonant and w_{i+2} are combined into one character.

If the N-th and (N+1)-th characters are not the same consonant, the unit returns to step S154 and starts again with the consonant check, taking the (N+2)-th character as the new reference character.

Through this series of steps, decomposed-syllable combination is applied sequentially, one character at a time, from the leftmost to the rightmost end of the colloquial string.
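The pointer-based combination can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (0xAC00 + (initial × 21 + vowel) × 28). The sketch makes several assumptions that are not in the specification: the names `compose`, `CHOSEONG`, `JUNGSEONG`, and `DOUBLE` are hypothetical, input graphemes are assumed to be Hangul compatibility jamo, final consonants are not handled, and when no vowel follows a consonant the sketch advances by a single character so that the next consonant can still start a syllable.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"              # 19 initial consonants, Unicode order
JUNGSEONG = "".join(chr(c) for c in range(0x314F, 0x3164))   # 21 vowels, ㅏ through ㅣ
DOUBLE = {"ㄱ": "ㄲ", "ㄷ": "ㄸ", "ㅂ": "ㅃ", "ㅅ": "ㅆ", "ㅈ": "ㅉ"}

def compose(text):
    out, i = [], 0
    while i < len(text):                                  # S152: stop at the end of the string
        ch = text[i]
        if ch not in CHOSEONG:                            # S153: reference character is not a consonant
            out.append(ch)
            i += 1                                        # S154: move the check pointer
            continue
        consumed = 1
        if i + 1 < len(text) and text[i + 1] == ch and ch in DOUBLE:
            ch, consumed = DOUBLE[ch], 2                  # S157/S158: merge two equal consonants
        nxt = i + consumed
        if nxt < len(text) and text[nxt] in JUNGSEONG:    # S155: does a vowel follow?
            code = 0xAC00 + (CHOSEONG.index(ch) * 21 + JUNGSEONG.index(text[nxt])) * 28
            out.append(chr(code))                         # S156: emit the combined syllable
            i = nxt + 1
        else:
            out.append(text[i])                           # no vowel: keep the original character
            i += 1
    return "".join(out)
```

Applied to the example above, `compose("ㅎㅏ얀 ㄱㅕ울")` yields '하얀 겨울', and `compose("ㅈㅈㅏ")` yields '짜'.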

FIG. 6 is a flowchart illustrating the string correction process.

The string correction process performs spelling correction, spacing correction, and emoticon recognition at the same time, using a dictionary of a single uniform structure and a rule-based learning algorithm. The string correction unit 153 that carries this out is linked to the verification data dictionary 153_1 and consists mainly of a learning unit 153_2 that learns correction rules, a correction rule dictionary 153_3 that stores the learned rules, and a correction execution unit 153_4 that applies the stored rules to perform the actual correction (see FIG. 2).

Step S161 builds the verification data dictionary 153_1. The designer stores in it, in advance, a plurality of sample strings containing colloquial errors together with corrected strings in which the errors of each sample string have been fixed.

Step S162 builds the correction rule dictionary 153_3 by learning correction rules, and can be subdivided into steps S162_1 to S162_4.

The string correction unit 153 learns correction rules from the sample strings and corresponding corrected strings stored in the verification data dictionary 153_1 and stores the learned rules in the correction rule dictionary 153_3. The correction rule dictionary 153_3 stores the rules used for string correction, and each rule can be expressed in the following form.

" target syllable string → left syllable context  corrected syllable string  right syllable context "

Here, the target syllable string is the syllable string to be modified together with its left and right syllable contexts, and is used to locate the part of the original message to be corrected. The left and right syllable contexts determine the context on either side of the corrected syllable string. For example, if the original message is '너 정말 이쁘다' ('you are really pretty') and '이쁘다' is to be corrected to '예쁘다', the correction rule can be written as follows.

" _이쁘 → _ 예 쁘 "

Here, the underscore (_) denotes a blank. Emoticons, however, must be extracted from the sentence and marked separately, so the arbitrary marker 'M:' is used for them; 'M:' means that the syllable string is to be separated from the sentence. The example below shows a correction rule for recognizing the emoticon '-.-' and the result of applying it.

" Correction rule: -.- → 흠 M:[-.- 화/난감] * " (where * means the rule applies in any context; 흠 is 'hmm' and the label 화/난감 means 'anger/perplexity')

" Original message: 흠-.- 어쩌자는 거야? " (roughly, 'Hmm -.- what do you want to do?')

" Corrected message: 흠 어쩌자는 거야? "

" Extracted syllables: [-.- 화/난감] "

Such correction rules can be applied not only to emoticon recognition but also to removing unnecessary words, symbols, profanity, slang, and the like from a sentence.
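The emoticon example above can be reproduced with a small sketch. The function name `extract_emoticon`, its parameters, and the representation of the extracted item as a bracketed string are assumptions for illustration; the specification only gives the rule notation and the resulting messages.

```python
def extract_emoticon(message, target, left_context, label):
    """Remove `target` where it follows `left_context`; return (corrected, extracted)."""
    needle = left_context + target
    if needle not in message:
        return message, []                              # rule does not apply
    corrected = message.replace(needle, left_context, 1)   # keep the context, drop the emoticon
    return corrected, [f"[{target} {label}]"]           # the part marked by 'M:' is kept separately

corrected, extracted = extract_emoticon("흠-.- 어쩌자는 거야?", "-.-", "흠", "화/난감")
```

Here `corrected` is '흠 어쩌자는 거야?' and `extracted` holds '[-.- 화/난감]', matching the corrected message and extracted syllables listed above.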

To learn the correction rules to be stored in the correction rule dictionary 153_3, the designer collects a large corpus written in the colloquial style of the communication environment, builds verification data by manually correcting the errors in each sentence, and then applies a rule-based learning algorithm to this data to extract useful rules.

One algorithm that can be used for this is the candidate-elimination learning algorithm. In it, the initial set consists of all expressible concepts; as training examples are given, the candidate concepts that conflict with them are removed from the version space. The candidate concept that remains at the end of this elimination is the concept being sought.

The learning of correction rules is described step by step below.

First, in step S162_1, a sample string is compared with the corrected string stored for it. For example, a sentence A containing an error is compared with the sentence A' in which that error has been corrected.

In step S162_2, the string correction unit 153 extracts the parts in which the sample string and the corrected string differ. In the example, it finds the differing parts a and a' in the two sentences A and A'. Here a is a substring of A, and a' is the substring of A' that corresponds to a. With the erroneous sentence A, '네 여친 예뻐?', and its corrected form A', '네 여자친구 예뻐?' ('Is your girlfriend pretty?'), a and a' are marked as follows.

In step S162_3, the differing parts extracted from the sample string and the corrected string are taken as the target syllable string and the corrected syllable string, respectively. If the sample string and the corrected string differ in more than one place, the longest differing part is used.

In step S162_4, the string correction unit 153 generates preliminary correction rules by optimizing the left and right syllable contexts that flank the corrected syllable string, taking the sample string as the reference.

In the example, the string correction unit 153 takes a and a' as the target syllable string and the corrected syllable string, respectively, and extracts the left and right syllable contexts from sentence A to generate correction rules. Varying the left and right contexts yields rules for a range of contexts. Examples of rules that can be generated in this case are as follows.

여친 → 여자친구 (left context: 0, right context: 0)

여친 → _ 여자친구 (left context: 1, right context: 0)

여친 → 여자친구 _ (left context: 0, right context: 1)

여친 → _ 여자친구 _ (left context: 1, right context: 1)

여친 → 네_ 여자친구 _ (left context: 2, right context: 1)

여친 → _ 여자친구 _예 (left context: 1, right context: 2)
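The generation of candidate rules with varying context widths can be sketched as follows. Several points are assumptions rather than statements from the specification: the names `diff_span` and `candidate_rules` are hypothetical, the differing part is located by comparing space-separated words (which reproduces a = '여친', a' = '여자친구' for the example), a single differing span is assumed, and each rule is stored as a tuple (target with context, left context, corrected string, right context, context widths).

```python
def diff_span(src, dst):
    """Locate the differing word span; return (char offset in src, a, a_prime)."""
    s, d = src.split(" "), dst.split(" ")
    head = 0
    while head < min(len(s), len(d)) and s[head] == d[head]:
        head += 1                                   # common leading words
    tail = 0
    while tail < min(len(s), len(d)) - head and s[-1 - tail] == d[-1 - tail]:
        tail += 1                                   # common trailing words
    a = " ".join(s[head:len(s) - tail])
    a_prime = " ".join(d[head:len(d) - tail])
    start = len(" ".join(s[:head])) + (1 if head else 0)
    return start, a, a_prime

def candidate_rules(src, dst, max_context=2):
    """Generate one candidate rule per (left, right) context width, blanks shown as '_'."""
    start, a, a_prime = diff_span(src, dst)
    end = start + len(a)
    rules = []
    for left in range(max_context + 1):
        for right in range(max_context + 1):
            lc = src[max(0, start - left):start].replace(" ", "_")
            rc = src[end:end + right].replace(" ", "_")
            rules.append((lc + a + rc, lc, a_prime, rc, (left, right)))
    return rules

rules = candidate_rules("네 여친 예뻐?", "네 여자친구 예뻐?")
```

For the example pair this produces nine candidates, including the (2, 1) rule with left context '네_' and right context '_'. Contexts that would extend past the start or end of the sentence are simply clipped in this sketch.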

Next, in step S162_5, the string correction unit 153 measures the performance of the generated preliminary correction rules and adds the best-performing one to the correction rule dictionary 153_3. That is, it applies each preliminary rule to held-out verification data and measures its performance. Table 1 shows example performance figures for correction rules with different left and right contexts.

These measurements reveal which rules achieve the same performance with less context. In Table 1, the rules with left/right contexts of (1,1) and (1,2) both reach 86%. In such a case the rule that uses less context, here the (1,1) rule, is regarded as the more general one and is added to the correction rule dictionary 153_3.

With this learning algorithm, only the more general rules, those with wider applicability at the same performance, are extracted and added to the correction rule dictionary 153_3.
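The selection criterion, highest measured performance first and least context as the tie-breaker, can be sketched in one function. The name `pick_rule` and the input format are assumptions; of the scores below, only the 86% values for (1,1) and (1,2) come from the text, and the 50% value for (0,0) is a made-up placeholder.

```python
def pick_rule(scored):
    """scored: list of ((left, right), accuracy). Prefer higher accuracy, then less total context."""
    return min(scored, key=lambda item: (-item[1], item[0][0] + item[0][1]))

# (0, 0) score is a hypothetical placeholder; the 0.86 values follow the text.
scored = [((0, 0), 0.50), ((1, 1), 0.86), ((1, 2), 0.86)]
best = pick_rule(scored)
```

With these inputs `best` is the (1, 1) rule, the more general of the two rules that tie at 86%.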

Step S163 applies the correction rules: string correction is performed using the correction rule dictionary 153_3.

Applying the correction rules consists of three main steps: step S163_1, which searches for a target syllable string; step S163_2, which compares the left and right syllable contexts; and step S163_3, which replaces the colloquial string with the corrected syllable string.

When a colloquial string is received, the string correction unit 153 searches, through steps S163_1 to S163_3, the rules stored in the correction rule dictionary 153_3 for one whose type corresponds to the received string. It then applies the rule found, corrects the colloquial string, and returns it.

In step S163_1, the string correction unit 153 searches the verification data dictionary 153_1 for a target syllable string that matches the colloquial string; if one is found, information about that target syllable string is returned from the verification data dictionary 153_1.

In this step, the string correction unit 153 examines the syllables of the sentence from the leftmost to the rightmost end, checking whether a string identical to a target syllable string listed in the dictionary appears.

For example, if the colloquial string given by the user's input is "ABCDEFG" (each letter standing for an arbitrary syllable) and the target syllable strings are "A", "AB", "ABC", and "ABCD", the longest, "ABCD", is selected as the target syllable string.

When a matching target syllable string is found, the unit returns its position in the sentence given as the colloquial string, along with the left syllable context, right syllable context, and corrected syllable string recorded in the rule.

S163_2 단계에서, 문자열 교정부(153)는 반환된 대상음절열에 대한 정보를 기준으로 구어체 문자열의 좌우 문맥을 검사함으로써, 구어체 문자열에 교정 규칙의 적용이 가능한지 여부를 판단한다.In operation S163_2, the character string corrector 153 determines whether or not the correction rule is applicable to the spoken character string by checking the left and right context of the spoken character string based on the returned information about the target syllable string.

당해 단계를 세분화하면 다음과 같다.The steps are broken down as follows.

먼저, 문자열 교정부(153)는 교정 대상이 되는 구어체 문자열을 대상음절열로 인식하고, 인식된 대상음절열의 좌측 부분이 반환된 좌음절문맥과 일치하는지 여부를 확인한다.First, the string correction unit 153 recognizes a colloquial string to be corrected as a target syllable string, and checks whether the left portion of the recognized target syllable string matches the left syllable context.

그리고, 인식된 대상음절열의 우측 부분이 반환된 우음절문맥과 일치하는지 여부를 확인한다.Then, it is checked whether the right part of the recognized syllable sequence matches the right syllable context.

구어체 문자열의 좌우 부분이 반환된 좌음절문맥 및 우음절문맥과 모두 일치할 경우, 해당 교정 규칙이 적용 가능하다고 판단하고, 주어진 구어체 문자열을 수정후음절열로 치환하기 위한 후속 단계를 수행하게 된다.If the left and right parts of the colloquial string match the returned left syllable context and the right syllable context, it is determined that the correction rule is applicable, and the subsequent steps for replacing the given colloquial string with the modified syllable sequence are performed.

Accordingly, when the correction rule is judged applicable, step S163_3 is performed to replace the colloquial string with the corrected syllable string: the correction rule for the target syllable string is extracted from the correction rule dictionary 153_3 and applied to the colloquial string, producing the corrected colloquial string.

In this step, the string correction unit 153 replaces the target syllable string recognized in the left/right syllable context comparison step with the corrected syllable string recorded in the correction rule dictionary 153_3.
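
The context check and replacement can be sketched roughly as below. The function names, the argument order, and the use of "*" as a wildcard for either context are assumptions for illustration (the text only shows "*" for a right syllable context); the sample rule's left context is likewise invented.

```python
def context_matches(text: str, start: int, end: int,
                    left: str, right: str) -> bool:
    """Check the text on both sides of text[start:end] against a rule's contexts."""
    # "*" is treated as "matches anything" (assumed for both sides).
    left_ok = left == "*" or text[:start].endswith(left)
    right_ok = right == "*" or text[end:].startswith(right)
    return left_ok and right_ok


def apply_rule(text: str, start: int, target: str,
               left: str, right: str, corrected: str) -> str:
    """Replace `target` at `start` with `corrected` if both contexts match."""
    end = start + len(target)
    if text[start:end] != target:
        return text  # target not present at this position
    if not context_matches(text, start, end, left, right):
        return text  # rule not applicable in this context
    return text[:start] + corrected + text[end:]


# Hypothetical rule: 하구 -> 하고, left context 잘, any right context.
fixed = apply_rule("공부잘하구", 3, "하구", "잘", "*", "하고")
unchanged = apply_rule("공부잘하구", 3, "하구", "못", "*", "하고")
```

With these inputs `fixed` is "공부잘하고", while `unchanged` keeps the original text because the left context does not match.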

To aid understanding, the following illustrates how emoticon recognition and deletion are carried out during string correction. Suppose, for example, that the user enters the following sentence.

The string correction unit 153 first scans the original message from its leftmost to its rightmost position, comparing it against the target syllable strings recorded in the correction rule dictionary 153_3.

It then confirms that the target syllable string of a recorded preliminary correction rule matches part of the original message.

"Preliminary correction rule: -.- → 흠 M:[-.- 화/난감] *" (흠: "hmm"; 화/난감: "angry / at a loss")

In the preliminary correction rule above, the left syllable context is '흠' ("hmm"), of length 1, and the right syllable context covers every case (*). In addition, because the rule's entry for this target syllable string begins with "M:", the matched string is not simply corrected but is extracted and presented separately.

Next, it checks whether the left syllable context of the preliminary correction rule appears in the original message. In the example, the character '흠' is to the left of '-.-', so the left syllable contexts match.

Next, it checks whether the right syllable context of the preliminary correction rule appears in the original message. In the example the right syllable context is '*', which applies in every case, so it is considered a match.

Since both the left and right syllable contexts match, the correction is performed. In this example the string is not simply replaced: the '-.-' portion is deleted from the original message, and the following <corrected message, extracted string> pair is returned as the result.

"Extracted string: [-.- 화/난감]" ("angry / at a loss")

In this way, string correction covering spelling correction, spacing correction, and emoticon recognition and correction can be carried out.
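
The emoticon handling in this example can be sketched as follows. The reading of "M:" as a marker meaning "delete the target and return the remainder as the extracted string" follows the description above, but the function name, the return shape, and the input message are hypothetical; the input below is not the original message from the example.

```python
from typing import Optional, Tuple

EXTRACT_MARK = "M:"  # prefix seen in the rule's entry in the example


def apply_emoticon_rule(message: str, target: str, left: str, right: str,
                        corrected: str) -> Tuple[str, Optional[str]]:
    """Return (corrected message, extracted string or None)."""
    start = message.find(target)
    if start < 0:
        return message, None
    end = start + len(target)
    # "*" matches any context; otherwise compare the adjacent text.
    left_ok = left == "*" or message[:start].endswith(left)
    right_ok = right == "*" or message[end:].startswith(right)
    if not (left_ok and right_ok):
        return message, None
    if corrected.startswith(EXTRACT_MARK):
        # Delete the emoticon and hand back its label separately.
        return message[:start] + message[end:], corrected[len(EXTRACT_MARK):]
    # Ordinary rule: plain replacement, nothing extracted.
    return message[:start] + corrected + message[end:], None


# Made-up input message for illustration.
corrected_message, extracted = apply_emoticon_rule(
    "흠-.- 그래", "-.-", "흠", "*", "M:[-.- 화/난감]")
```

For this made-up input, `corrected_message` is "흠 그래" and `extracted` is "[-.- 화/난감]".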

This string correction process has the following advantages.

First, it handles all four tasks: spelling error correction, spacing error correction, emoticon recognition and correction, and slang recognition and correction. Four kinds of error recognition and correction can therefore be performed at once by the same correction algorithm, and creating and managing correction rules becomes simpler.

Second, every field of a correction rule (target syllable string, left syllable context, corrected syllable string, and right syllable context) is composed in syllable units, which yields high accuracy and recall. For example, a single rule covers many inflected forms:

"이뿌 → 예쁘": 이뿌다 (예쁘다), 이뿌다고 (예쁘다고), 이뿌거든 (예쁘거든), 이뿌쟎아 (예쁘쟎아); all are forms of "pretty".

"밥먹 → 밥 _먹": 밥먹어 (밥 먹어), 밥먹었어 (밥 먹었어), 밥먹고 있니 (밥 먹고 있니); all are forms of "eat a meal" with the missing space restored.

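The effect of syllable-level rules on inflected forms can be illustrated with a small sketch. It ignores contexts entirely and assumes that "_" in the second rule denotes an inserted space; the rule list and the function name `correct` are made up for the example.

```python
# Syllable-level rules from the examples above: (target, corrected).
# Assumption: the "_" in the second rule stands for an inserted space.
SYLLABLE_RULES = [
    ("이뿌", "예쁘"),
    ("밥먹", "밥 먹"),
]


def correct(text: str) -> str:
    """Apply every syllable-level rule wherever its target occurs."""
    for target, corrected in SYLLABLE_RULES:
        text = text.replace(target, corrected)
    return text


# One rule covers every inflected form that shares the same stem syllables.
results = [correct(w) for w in ("이뿌다", "이뿌거든", "밥먹었어", "밥먹고 있니")]
```

Because the rules operate on stem syllables rather than whole words, the two rules above correct all four inputs without listing each inflected form separately.
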
Third, when only the string to be modified is considered and the context is ignored, wrong corrections can occur frequently, whereas considering the context, as in the present invention, raises the accuracy of correction. For example:

"Correction rules: 빠~~ → 바이~~", "하구 → 하고" (빠~~ is a clipped "bye~~"; 하구 is a colloquial spelling of 하고)

"Correct correction: 공부잘하구 빠~~ → 공부 잘 하고. 바이~~" ("Study hard. Bye~~")

"Wrong correction: 오빠~~ 참 똑똑하구나 → 오바이~~ 참 똑똑하고나" (the word 오빠, "older brother", and the ending 하구나 are wrongly rewritten)

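The difference between context-free and context-aware replacement can be sketched as below. The text does not say which contexts the rule "빠~~ → 바이~~" actually carries, so the left context used here (a single space) is purely an assumption, as are the function names.

```python
def replace_without_context(text: str, target: str, corrected: str) -> str:
    """Context-free replacement: rewrites every occurrence of the target."""
    return text.replace(target, corrected)


def replace_with_left_context(text: str, target: str, corrected: str,
                              left: str) -> str:
    """Replace the target only where the text before it ends with `left`."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith(target, i) and text[:i].endswith(left):
            out.append(corrected)
            i += len(target)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Context-free: the 빠~~ inside 오빠~~ is corrupted.
bad = replace_without_context("오빠~~ 참", "빠~~", "바이~~")
# Context-aware (assumed left context " "): 오빠~~ is left alone.
kept = replace_with_left_context("오빠~~ 참", "빠~~", "바이~~", " ")
good = replace_with_left_context("공부잘하구 빠~~", "빠~~", "바이~~", " ")
```

Here `bad` reproduces the wrong "오바이~~" rewrite, `kept` leaves "오빠~~ 참" untouched, and `good` becomes "공부잘하구 바이~~".
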
FIG. 7 shows example screens that appear during error correction of a colloquial sentence according to an embodiment of the present invention, for the case where the error correction device is implemented as a mobile phone.

Referring to FIG. 7, on the first screen D100 the error correction device displays a screen on which the user can enter an original message (for example, '공부잘하구 빠~~') and then select the error correction function. The user enters the original message on the first screen D100 by operating the keypad that serves as the input means, and presses the correction button D101 to select the error correction function and proceed to the second screen D110.

When the error correction function is executed, error correction is performed on the original message (for example, '공부잘하구 빠~~'), and the corrected message (for example, '공부 잘 하고, 바이~~') is displayed as on the second screen D110.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the invention pertains will understand that the invention may be embodied in other specific forms without changing its technical spirit or essential features.

Accordingly, the embodiments described above are provided to fully convey the scope of the invention to those of ordinary skill in the art, and should be understood as illustrative in all respects and not restrictive; the invention is defined only by the scope of the claims.

The error correction device and method according to the present invention can efficiently recognize and correct the various errors in colloquial sentences that arise in a communication environment, and as a result can improve the performance of natural language processing technology and raise user satisfaction.

FIG. 1 is a block diagram of a device for correcting errors in colloquial sentences according to an embodiment of the present invention.

FIG. 2 is a detailed block diagram of some of the components shown in FIG. 1.

FIG. 3 is a flowchart of a method for correcting errors in colloquial sentences according to an embodiment of the present invention.

FIGS. 4A to 6 are detailed flowcharts of some of the steps shown in FIG. 3.

FIG. 7 shows example screens that appear during error correction of a colloquial sentence according to an embodiment of the present invention.

*** Explanation of reference numerals for the main parts of the drawings ***

110: input unit 150: colloquial correction unit

151: repeated phrase removal unit 152: disassembled syllable combination unit

153: string correction unit 160: correction data return unit

Claims

A device for correcting errors in colloquial sentences, comprising:

An input unit that generates input data according to a user operation;

A colloquial correction unit that extracts a colloquial string from the input data and performs colloquial correction by selectively applying, to the extracted colloquial string, one or more of repeated phrase removal, disassembled syllable combination, and string correction using the syllable as the basic correction unit; and

A correction data return unit that returns corrected input data by reflecting the corrected colloquial string in the input data.

The device of claim 1, wherein the colloquial correction unit comprises at least one of:

A repeated phrase removal unit that extracts and removes unnecessarily repeated phrases from the colloquial string;

A disassembled syllable combination unit that extracts disassembled syllables from the colloquial string and restores the correct syllables by combining the consonants and vowels of the disassembled syllables; and

A string correction unit that learns and stores correction rules, using the syllable as the basic correction unit, from a plurality of demonstration strings containing errors and a plurality of correction strings in which the errors of the demonstration strings are corrected, and that corrects errors in the colloquial string based on the stored correction rules.

The device of claim 2, wherein the string correction unit comprises:

A verification data dictionary that stores the plurality of demonstration strings and the plurality of correction strings;

A learning execution unit that learns the correction rules from the plurality of demonstration strings and the plurality of correction strings;

A correction rule dictionary that stores the learned correction rules; and

A correction execution unit that searches the learned correction rules for a correction rule of the type corresponding to the colloquial string and applies the found correction rule to correct the colloquial string.

The device of claim 1, further comprising a post-processing correction unit that performs written-language correction on the returned input data.

A method for correcting errors in colloquial sentences, comprising:

Generating input data according to a user operation;

Extracting a colloquial string included in the input data;

Generating a corrected colloquial string by selectively performing, on the colloquial string, one or more of repeated phrase removal, disassembled syllable combination, and string correction using the syllable as the basic correction unit; and

Generating corrected input data by including the corrected colloquial string in the input data.

The method of claim 5, wherein generating the corrected colloquial string comprises performing at least one of:

A repeated phrase removal process that extracts and removes unnecessarily repeated phrases from the colloquial string;

A disassembled syllable combination process that extracts disassembled syllables from the colloquial string and restores the correct syllables by combining consonants and vowels; and

A string correction process that learns and stores correction rules, using the syllable as the basic correction unit, from a plurality of demonstration strings containing colloquial errors and a plurality of correction strings in which the errors of the demonstration strings are corrected, and that corrects errors in the colloquial string based on the stored correction rules.

The method of claim 6, wherein, in the repeated phrase removal process, when a given phrase is repeated N times (N being a natural number of 2 or more) in the colloquial string, the number of repetitions is reduced one at a time until only the phrase itself remains.

The method of claim 7, wherein the repeated phrase removal process comprises building a repeated phrase dictionary of strings that are exempt from correction, and performing repeated phrase removal only on strings that are not in the repeated phrase dictionary.

The method of claim 6, wherein the disassembled syllable combination process comprises:

Checking whether the N-th character (the reference character) at an arbitrary position in the colloquial string is a Korean consonant;

If the reference character is a consonant, checking whether the (N+1)-th character (the next character) is a vowel;

If the next character is a vowel, combining the reference character and the next character into one syllable, and resuming the check from the (N+2)-th character;

Checking whether the next character is the same consonant as the reference character; and

If the reference character and the next character are the same consonant, combining the reference character and the next character into a single consonant,

Wherein the check is performed sequentially on each character from the leftmost to the rightmost position of the colloquial string.

The method of claim 6, wherein the string correction process comprises:

Storing the plurality of demonstration strings and the plurality of correction strings in advance to build a verification data dictionary from which correction rules are learned;

Learning correction rules from the plurality of demonstration strings and the plurality of correction strings, and storing the learned correction rules to build a correction rule dictionary;

Searching the learned correction rules for a correction rule of the type corresponding to the colloquial string; and

Correcting the colloquial string by applying the found correction rule.

The method of claim 10, wherein building the correction rule dictionary comprises:

Comparing a demonstration string containing an error with the correction string in which that error is corrected;

Extracting, according to the comparison result, the portions in which the demonstration string and the correction string differ;

Treating the extracted differing portions as the target syllable string and the corrected syllable string, respectively, and generating a preliminary correction rule by optimizing the left syllable context and the right syllable context located to the left and right of the corrected syllable string with reference to the test string; and

Measuring the performance of the generated preliminary correction rule and adding high-performing correction rules to the correction rule dictionary.

The method of claim 11, wherein, in extracting the differing portions, if the demonstration string and the correction string differ in more than one portion, the longest such portion is taken as the differing portion.

The method of claim 10, wherein correcting the colloquial string comprises:

Searching the verification data dictionary for a target syllable string that matches the colloquial string, and returning information on the target syllable string if one exists;

Determining whether a correction rule is applicable by examining the left and right contexts of the colloquial string based on the information on the target syllable string; and

If the correction rule is applicable, extracting the correction rule for the target syllable string from the correction rule dictionary and generating the corrected colloquial string by applying the extracted correction rule to the colloquial string.

The method of claim 6, wherein the string correction process performs at least one of spelling correction, spacing correction, emoticon correction, and slang correction.