KR20200057824A

KR20200057824A - Word spelling correction system

Info

Publication number: KR20200057824A
Application number: KR1020180139409A
Authority: KR
Inventors: 이반 베를로셰; 김성현
Original assignee: 주식회사 솔트룩스
Priority date: 2018-11-13
Filing date: 2018-11-13
Publication date: 2020-05-27
Also published as: KR102129575B1

Abstract

Provided is a word correction system that performs word semantic embedding on typos. According to the present invention, the word correction system comprises: a user interface that receives a user input word from a user through a network; a processing data generation unit that separates learning words and the user input word into morphemes, generates processing data for analysis by separating each separated morpheme into graphemes, and generates an n-gram list by dividing the processing data for analysis into n-grams; a word embedding learning unit that performs word embedding learning on the n-gram list through a skip-gram model; and a semantic vector generator that generates a semantic vector from the n-gram list.

Description

Word spelling correction system

The present invention relates to a word correction system, and more particularly, to a system for correcting typos.

The present invention was derived from research led and conducted by Saltlux Co., Ltd. as part of the SW Computing Industry Source Technology Development Project (SW) of the Ministry of Trade, Industry and Energy. [Research period: 2017.10.01 to 2018.09.30; research management agency: Korea Institute for Advancement of Technology; project title: Development of a multimodal question answering framework optimized for mobile; project number: N0001701]

Natural language is the language people use in everyday life, and in computer science, natural language processing is the process of reproducing human language phenomena with machines such as computers. The most essential element of natural language processing is therefore the technology that converts natural language into data a computer can understand.

As information technology has developed, attempts have been made to use word semantic embedding, which represents the meaning of words in a vector space, in various fields of natural language processing and natural language understanding.

However, existing word semantic embedding techniques can embed only words that exist in the vocabulary of the documents, and cannot obtain vector information for words that are not in the vocabulary, such as newly coined words and typos. In Korean in particular, typos are very common, for example those caused by confusion over loanword orthography or by double final consonants that are hard to distinguish by pronunciation, and existing word semantic embedding cannot handle them.

The technical problem addressed by the present invention is to provide a word correction system that performs word semantic embedding on typos.

A word correction system according to one aspect of the technical idea of the present invention for achieving the above technical problem includes: a user interface that receives a user input word from a user through a network; a processing data generation unit that separates learning words and the user input word into morphemes, separates each separated morpheme into graphemes to generate processing data for analysis, and separates the processing data for analysis into n-grams to generate an n-gram list; a word embedding learning unit that performs word embedding learning on the n-gram list through a skip-gram model; and a semantic vector generator that generates a semantic vector from the n-gram list.

The processing data for analysis may include symbols marking the start and end of a morpheme, and a syllable delimiter.

The processing data generation unit may include a morpheme separation unit that performs morphological analysis on the learning words and the user input word and separates them into morphemes, a grapheme separation unit that separates each morpheme separated by the morpheme separation unit into graphemes, and an n-gram separation unit that separates the consonants and vowels produced by the grapheme separation unit into n-grams.

The n-gram separation unit may generate items of vowel removal processing data, each obtained by removing the vowel of one syllable from the processing data for analysis, and then generate n-grams by separating the vowel removal processing data.

For the x characters that make up each item of vowel removal processing data, the n-gram separation unit may generate separated n-grams whose length ranges from 2 characters up to a number of characters greater than 2 and less than x.

The number of items of vowel removal processing data may equal the number of syllables in the morpheme used to generate the processing data for analysis.

The n-gram separation unit may generate items of vowel removal processing data, each obtained by removing the vowel of one syllable from the processing data for analysis, and items of final consonant removal processing data, each obtained by removing the final consonant of one syllable, and then generate n-grams by separating the vowel removal processing data and the final consonant removal processing data.

The system may further include: an adjacent word list output unit that compares the semantic vectors obtained from the learning words with the semantic vector obtained from the user input word, selects adjacent words for the user input word, and provides an adjacent word list; a similarity calculation unit that calculates the similarity between the semantic vector obtained from the user input word and the semantic vectors of the selected adjacent words, and selects a corrected word for the user input word; and a corrected word output unit that generates the selected corrected word in natural language and provides it to the user through the user interface.
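The neighbor selection and similarity steps described above can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses made-up two-dimensional vectors; the function names `cosine`, `nearest_words`, and `correct` are hypothetical and do not come from the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_words(query_vec, learned, top_n=3):
    """Return up to top_n (word, similarity) pairs, most similar first."""
    scored = [(word, cosine(query_vec, vec)) for word, vec in learned.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def correct(query_vec, learned, top_n=3):
    """Pick the most similar learned word as the corrected word."""
    return nearest_words(query_vec, learned, top_n)[0][0]

# Toy semantic vectors (illustrative values only, not real training output).
learned_vectors = {"오렌지": [1.0, 0.1], "사과": [0.0, 1.0]}
typo_vector = [0.9, 0.2]  # stands in for the vector of a misspelled input
print(correct(typo_vector, learned_vectors))
```

In a real system the vectors would come from the semantic vector generator rather than being written by hand.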

The semantic vector generator may generate the semantic vector by summing the vectors of all the n-grams in the n-gram list and taking their average.
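The averaging step can be sketched as follows. The n-gram vectors are toy values, and skipping n-grams that have no learned vector is an assumption of this sketch, since the text does not say how such n-grams are handled; `mean_vector` and `semantic_vector` are hypothetical names.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return [t / len(vectors) for t in total]

def semantic_vector(ngram_list, ngram_vectors):
    """Average the vectors of the n-grams in ngram_list.

    Assumption: n-grams without a learned vector are skipped.
    """
    known = [ngram_vectors[g] for g in ngram_list if g in ngram_vectors]
    return mean_vector(known)

# Toy n-gram vectors (illustrative values only).
toy_vectors = {"<ㅇ": [1.0, 0.0], "ㅇㅗ": [0.0, 1.0]}
print(semantic_vector(["<ㅇ", "ㅇㅗ", "ㅗ>"], toy_vectors))
```

Because the vector is built from n-grams rather than whole words, a misspelled word that shares most of its n-grams with the intended word ends up with a nearby vector.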

The word correction system according to the present invention learns through word embedding learning that uses a list of words to be learned from various resources, a converted corpus, and word semantic data, and compares the similarity between adjacent words, so that for unlearned patterns such as typos it can obtain vector information similar to that of the original word.

In addition, because word embedding learning is performed using the semantic information of content morphemes in the corpus, rather than by conventional word embedding, word embedding learning is possible with less training data and may take less training time.

In addition, because learning is based on the meaning of the word to be learned rather than simply on the positions in which words appear, it is also efficient for learning low-frequency words, and because it targets the n-grams of words that appear in the dictionary, more words can be represented as vectors than with conventional word embedding.

In addition, instead of conventional position-based word embedding, word vectors are generated by treating the words adjacent to a content morpheme, together with the word semantic data of those adjacent words (for example, hypernyms or synonyms), as adjacent words, so that the relationships between words can be examined through cosine similarity.

In particular, the word correction system according to the present invention does not perform word embedding learning through a skip-gram model on words separated into morphemes. Instead, it performs word embedding learning through a skip-gram model using the n-gram list, which is a list of the separated n-grams. Therefore, even for a typo, which is an unlearned word, the original word can be obtained through the word whose semantic vector is closest, and the typo can be corrected.

FIG. 1 is a schematic block diagram of a word correction system according to an exemplary embodiment of the present invention.
FIG. 2 is a block diagram illustrating the operation of a processing data generation unit of a word correction system according to an exemplary embodiment of the present invention.
FIG. 3 is a flowchart illustrating the operation of a morpheme separation unit and a grapheme separation unit of a word correction system according to an exemplary embodiment of the present invention.
FIG. 4 is a flowchart illustrating the operation of an n-gram separation unit of a word correction system according to an exemplary embodiment of the present invention.
FIG. 5 is a block diagram illustrating the operation of a processing data generation unit of a word correction system according to an exemplary embodiment of the present invention.
FIG. 6 is a schematic block diagram of a word correction system according to an exemplary embodiment of the present invention.
FIG. 7 is a schematic block diagram of a word correction system according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments are provided to describe the present invention more completely to those of ordinary skill in the art. Because the present invention may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to a specific disclosed form, and it should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar components. In the accompanying drawings, the dimensions of structures are shown enlarged or reduced relative to their actual size for clarity.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to indicate that the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification exist, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In the following drawings and description, a component shown or described as a single block, for example a '~ unit' or '~ module', may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed on a single processor.

To provide a full understanding of the configuration and effects of the present invention, preferred embodiments of the present invention are described with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of a word correction system according to an exemplary embodiment of the present invention.

Referring to FIG. 1, the word correction system 1 includes a processing data generation unit 100 that receives learning words from a processing data storage unit 1000 and generates an n-gram list 140, a word embedding learning unit 200 that performs word embedding learning based on the n-gram list 140 generated by the processing data generation unit 100, and a semantic vector generator 300. The word correction system 1 may further include a learning word semantic vector storage 350 that stores the learning word semantic vectors generated by the semantic vector generator 300.

The processing data storage unit 1000 may hold learning words, which are the data to be processed by the processing data generation unit 100 to generate the n-gram list 140. The processing data storage unit 1000 may be a space capable of storing data in any form, for example a NoSQL store, a relational database, or a file system. The processing data storage unit 1000 may be a single logically distinct storage device, a partition unit that logically divides one or more storage devices, a single physically distinct storage device, or part of a single logically distinct partition unit. In some embodiments, the processing data storage unit 1000 may be a space, included in the word correction system 1, in which learning words can be stored.

In some other embodiments, the processing data storage unit 1000 may be a space or system outside the word correction system 1 that stores learning words, in which case the word correction system 1 may be connected to the processing data storage unit 1000 through a network. The processing data storage unit 1000 may be, for example, a corpus, which is language material that collects text in computer-readable form for language research, or a lexical semantic network, which is a knowledge base that presents a large vocabulary through associative or semantic relations between words. For example, the processing data storage unit 1000 may be a single corpus or lexical semantic network, but is not limited thereto, and may be a plurality of separately built corpora or lexical semantic networks.

The processing data generation unit 100 may include a morpheme separation unit 110, a grapheme separation unit 120, and an n-gram separation unit 130.

The morpheme separation unit 110 may perform morphological analysis on the learning words received from the processing data storage unit 1000 and separate them into morphemes.

The grapheme separation unit 120 may separate each morpheme separated by the morpheme separation unit 110 into graphemes. In some embodiments, the grapheme separation unit 120 may separate each morpheme into consonants and vowels. In some other embodiments, the grapheme separation unit 120 may separate each morpheme into consonants and vowels such that the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong, or batchim) are distinguished.

In some embodiments, the grapheme separation unit 120 may insert a syllable delimiter between the separated graphemes.

The n-gram separation unit 130 separates each morpheme into n-grams using the consonants and vowels separated by the grapheme separation unit 120. Before separating into n-grams, the n-gram separation unit 130 may perform a removal process on each vowel and/or each final consonant contained in each morpheme, in which case a removal symbol may be assigned in place of the removed vowel and/or final consonant.

For example, when one morpheme consists of x characters, the n-grams of that morpheme may consist of 2 to x-1 characters. Here, in addition to consonants and vowels, the x characters that make up the morpheme may further include the delimiters assigned by the morpheme separation unit 110 and/or the grapheme separation unit 120, and/or the removal symbols assigned by the n-gram separation unit 130.

The list of n-grams separated by the n-gram separation unit 130 may be generated as the n-gram list 140.

The word embedding learning unit 200 may perform word embedding learning through a skip-gram model using the processing data received from the processing data generation unit 100, for example the n-gram list.

For example, the word embedding learning unit 200 may place the word to be learned, taken from the n-gram list (the list of n-grams for the morphemes), in the input layer of the skip-gram model used for word embedding, and generate a word vector by learning that word through word embedding learning. The word embedding learning unit 200 may perform word embedding learning through the feedforward and backpropagation processes, and in the backpropagation process may change the weight values connected to the word to be learned without changing the weight values connected to the processing data of the word to be learned.

The semantic vector generator 300 may generate the semantic vectors of the words learned by the word embedding learning unit 200. In some embodiments, the semantic vector generator 300 may use the learned word vector information to produce the n words with the highest vector similarity, in descending order of similarity, and thereby provide correction words for a typo.

The learning word semantic vector storage 350 may store the semantic vectors of the learned words. The semantic vectors of the learned words stored in the learning word semantic vector storage 350 may be used to provide correction words for a typo.

The learning word semantic vector storage 350 may be a space capable of storing data in any form, for example a NoSQL store, a relational database, or a file system. The learning word semantic vector storage 350 may be a single logically distinct storage device, a partition unit that logically divides one or more storage devices, a single physically distinct storage device, or part of a single logically distinct partition unit.

The word correction system 1 according to the present invention learns through word embedding learning that uses a list of words to be learned from various resources (for example, a corpus or a lexical semantic network), a converted corpus, and word semantic data, and compares the similarity between adjacent words, so that for unlearned patterns such as typos it can obtain vector information similar to that of the original word.

In addition, because learning is based on the meaning of the word to be learned rather than simply on the positions in which words appear, it is also efficient for learning low-frequency words, and because it targets the n-grams of words that appear in the dictionary, more words can be represented as vectors than with conventional word embedding.

In particular, the word correction system 1 according to the present invention does not perform word embedding learning through a skip-gram model on words separated into morphemes. Instead, it performs word embedding learning through a skip-gram model using the list of separated n-grams. Therefore, even for a typo, which is an unlearned word, the original word can be obtained through the word whose semantic vector is closest, and the typo can be corrected.

FIG. 2 is a block diagram illustrating the operation of the processing data generation unit of a word correction system according to an exemplary embodiment of the present invention, FIG. 3 is a flowchart illustrating the operation of the morpheme separation unit and the grapheme separation unit of the word correction system, and FIG. 4 is a flowchart illustrating the operation of the n-gram separation unit of the word correction system.

Referring to FIGS. 2 to 4 together, the processing data generation unit 100 includes a morpheme separation unit 110, a grapheme separation unit 120, and an n-gram separation unit 130. The list of n-grams separated by the n-gram separation unit 130 may be generated as the n-gram list 140.

The morpheme separation unit 110 may perform morphological analysis on the learning words received from the processing data storage unit (1000 in FIG. 1) and separate them into morphemes. For example, when the processing data storage unit 1000 holds the language material '오렌지는 맛있다' ("oranges are delicious") (S100), the processing data storage unit 1000 may provide the learning words '오렌지는' and '맛있다' to the morpheme separation unit 110 (S200).

The morpheme separation unit 110 may include a morpheme separation module 112 and a morpheme separator insertion module 114. The morpheme separation module 112 may perform morpheme separation on the words '오렌지는' and '맛있다' to obtain separated morphemes such as '오렌지', '는', '맛있', and '다' (S300). The morpheme separator insertion module 114 may insert symbols that mark the start and end of a morpheme.

The grapheme separation unit 120 may include a grapheme separation module 122 and a syllable separator insertion module 124. The grapheme separation module 122 may separate each morpheme into consonants and vowels. For example, the grapheme separation module 122 may decompose each morpheme into the Korean graphemes, a total of 40 characters made up of 19 consonants (ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅅ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, ㅎ, ㄲ, ㄸ, ㅃ, ㅆ, ㅉ) and 21 vowels (ㅏ, ㅑ, ㅓ, ㅕ, ㅗ, ㅛ, ㅜ, ㅠ, ㅡ, ㅣ, ㅐ, ㅒ, ㅔ, ㅖ, ㅘ, ㅙ, ㅚ, ㅝ, ㅞ, ㅟ, ㅢ).

The grapheme separation module 122 may first separate the morpheme '오렌지' into the syllables '오', '렌', and '지' (S400), and then separate these into the graphemes 'ㅇ', 'ㅗ', 'ㄹ', 'ㅔ', 'ㄴ', 'ㅈ', and 'ㅣ' (S500). The syllable separator insertion module 124 may insert a syllable delimiter between syllables.
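The syllable and grapheme separation of steps S400 and S500 can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables. This is an illustration, not the module's actual implementation: the function names are hypothetical, and keeping a double final consonant such as ㄳ as a single grapheme is an assumption, since the text does not say how double finals are split.

```python
# Compatibility jamo in the order used by the Unicode Hangul syllable block.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 27 finals plus "no final"

HANGUL_BASE = 0xAC00  # first precomposed syllable, '가'
HANGUL_LAST = 0xD7A3  # last precomposed syllable, '힣'

def split_syllable(syllable):
    """Split one precomposed Hangul syllable into its graphemes.

    Non-Hangul characters are returned unchanged as a 1-tuple.
    Assumption: a double final consonant stays a single grapheme.
    """
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return (syllable,)
    index = code - HANGUL_BASE
    initial = CHOSEONG[index // 588]          # 588 = 21 vowels * 28 finals
    vowel = JUNGSEONG[(index % 588) // 28]
    final = JONGSEONG[index % 28]
    return (initial, vowel, final) if final else (initial, vowel)

def split_morpheme(morpheme):
    """Syllable separation (S400) followed by grapheme separation (S500)."""
    return [split_syllable(s) for s in morpheme]

print(split_morpheme("오렌지"))
# [('ㅇ', 'ㅗ'), ('ㄹ', 'ㅔ', 'ㄴ'), ('ㅈ', 'ㅣ')]
```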

For example, the morpheme separator insertion module 114 may insert '<' and '>', the symbols that mark the start and end of a morpheme, at the start and end of each morpheme, and the syllable separator insertion module 124 may insert a syllable delimiter (rendered here as [SEP]) between syllables. Accordingly, through the morpheme separation unit 110 and the grapheme separation unit 120, the morpheme '오렌지' may be processed into processing data for analysis such as '<ㅇㅗ[SEP]ㄹㅔㄴ[SEP]ㅈㅣ>' (S600). A morpheme that has been processed so that its graphemes are separated and the morpheme separators and syllable delimiters are inserted may be referred to as processing data for analysis.

FIG. 3 shows the morpheme separators and syllable delimiters as being inserted in the final step (S600), but this is only to focus the illustration on the morpheme, syllable, and grapheme separation processes, and the invention is not limited thereto. For example, the morpheme separators may be inserted in step S300, in which the morphemes are separated, and the syllable delimiters may be inserted in step S400, in which the syllables are separated.
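Assembling the processing data for analysis (S600) can be sketched as below. The start and end markers '<' and '>' are taken from the description; the delimiter character '|' is an assumption made for this sketch, since the delimiter actually used is not visible in this text, and `to_analysis_data` is a hypothetical name.

```python
MORPHEME_START = "<"
MORPHEME_END = ">"
SYLLABLE_SEP = "|"  # assumed delimiter; the character used in the patent is not shown

def to_analysis_data(syllables, sep=SYLLABLE_SEP):
    """Join per-syllable graphemes with the syllable delimiter and wrap the
    result in the morpheme start and end markers."""
    body = sep.join("".join(graphemes) for graphemes in syllables)
    return MORPHEME_START + body + MORPHEME_END

# Graphemes of the morpheme '오렌지', one tuple per syllable.
orange = [("ㅇ", "ㅗ"), ("ㄹ", "ㅔ", "ㄴ"), ("ㅈ", "ㅣ")]
print(to_analysis_data(orange))
# <ㅇㅗ|ㄹㅔㄴ|ㅈㅣ>
```

The resulting string has 11 characters: 7 graphemes, 2 syllable delimiters, and the 2 morpheme markers.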

The n-gram separation unit 130 separates the processing data for analysis of each morpheme into n-grams. The n-gram separation unit 130 may include a vowel removal module 132, which removes each of the vowels contained in a morpheme before the n-gram separation, and an n-gram list generation module 136, which performs the separation into n-grams.

When the processing data for analysis '<ㅇㅗ[SEP]ㄹㅔㄴ[SEP]ㅈㅣ>' is provided for the morpheme '오렌지' (S600), where [SEP] is a placeholder for the syllable separator, the vowel removal module 132 removes the vowel of one syllable at a time to generate vowel removal processing data such as '<ㅇ[SEP]ㄹㅔㄴ[SEP]ㅈㅣ>', '<ㅇㅗ[SEP]ㄹㄴ[SEP]ㅈㅣ>', and '<ㅇㅗ[SEP]ㄹㅔㄴ[SEP]ㅈ>' (S700). The number of vowel removal processing data items may equal the number of syllables in the morpheme from which the processing data for analysis was generated.
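The per-syllable vowel removal of step S700 can be sketched as follows. This is a minimal example, assuming the morpheme has already been decomposed into one jamo string per syllable; the function name and the '|' separator are hypothetical choices for illustration.

```python
# The 21 medial vowels, as Hangul compatibility jamo.
VOWELS = set("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")


def vowel_removed_variants(syllables, sep="|"):
    """Return one variant per syllable, each with that syllable's vowel removed.

    syllables: list of jamo strings, e.g. ["ㅇㅗ", "ㄹㅔㄴ", "ㅈㅣ"].
    sep: hypothetical stand-in for the syllable separator symbol.
    """
    variants = []
    for i, syllable in enumerate(syllables):
        edited = list(syllables)
        edited[i] = "".join(ch for ch in syllable if ch not in VOWELS)
        variants.append("<" + sep.join(edited) + ">")
    return variants


variants = vowel_removed_variants(["ㅇㅗ", "ㄹㅔㄴ", "ㅈㅣ"])
```

Because exactly one variant is produced per syllable, the number of variants equals the number of syllables, as the description states.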

The n-gram list generation module 136 may separate each item of vowel removal processing data into n-grams. For example, when the vowel removal processing data for one morpheme consists of x characters, the n-gram list generation module 136 may generate n-grams of 2 to x-1 characters for that morpheme; that is, the n-gram range may be 2 to x-1. When each item of vowel removal processing data consists of 10 characters, as with '<ㅇ[SEP]ㄹㅔㄴ[SEP]ㅈㅣ>', '<ㅇㅗ[SEP]ㄹㄴ[SEP]ㅈㅣ>', and '<ㅇㅗ[SEP]ㄹㅔㄴ[SEP]ㅈ>' (with [SEP] as a one-character placeholder for the syllable separator), the n-gram list generation module 136 may generate n-grams of 2 to 9 characters.

Specifically, separating '<ㅇ[SEP]ㄹㅔㄴ[SEP]ㅈㅣ>' into n-grams of three characters (with [SEP] as a one-character placeholder for the syllable separator) yields '<ㅇ[SEP]', 'ㅇ[SEP]ㄹ', '[SEP]ㄹㅔ', 'ㄹㅔㄴ', 'ㅔㄴ[SEP]', 'ㄴ[SEP]ㅈ', '[SEP]ㅈㅣ', and 'ㅈㅣ>'; separating '<ㅇㅗ[SEP]ㄹㄴ[SEP]ㅈㅣ>' yields '<ㅇㅗ', 'ㅇㅗ[SEP]', 'ㅗ[SEP]ㄹ', '[SEP]ㄹㄴ', 'ㄹㄴ[SEP]', 'ㄴ[SEP]ㅈ', '[SEP]ㅈㅣ', and 'ㅈㅣ>'; and separating '<ㅇㅗ[SEP]ㄹㅔㄴ[SEP]ㅈ>' yields '<ㅇㅗ', 'ㅇㅗ[SEP]', 'ㅗ[SEP]ㄹ', '[SEP]ㄹㅔ', 'ㄹㅔㄴ', 'ㅔㄴ[SEP]', 'ㄴ[SEP]ㅈ', and '[SEP]ㅈ>' (S800).

In some embodiments, the n-gram list generation module 136 may generate n-grams whose lengths range from 2 characters up to a number that is greater than 2 and less than x-1, for example 2 to 6 characters. That is, the upper end of the n-gram range may be smaller than x-1, for example giving a range of 2 to 6.
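The n-gram separation of step S800 amounts to sliding a window of n characters over each item of vowel removal processing data. The sketch below is one way to write it, assuming the data is an ordinary string in which '|' (a hypothetical choice) stands for the one-character syllable separator; the function names are likewise illustrative.

```python
def char_ngrams(text, n):
    """All contiguous substrings of length n, in left-to-right order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngrams_in_range(text, n_min=2, n_max=None):
    """N-grams for every length from n_min to n_max (default: len(text) - 1)."""
    if n_max is None:
        n_max = len(text) - 1
    result = []
    for n in range(n_min, n_max + 1):
        result.extend(char_ngrams(text, n))
    return result


data = "<ㅇ|ㄹㅔㄴ|ㅈㅣ>"               # 10 characters
trigrams = char_ngrams(data, 3)          # the three-character case from the description
full_range = ngrams_in_range(data)       # lengths 2 to 9
limited = ngrams_in_range(data, 2, 6)    # the restricted range of some embodiments
```

Capping `n_max` below x-1, as in the last call, reduces the number of n-grams per word, which is presumably the motivation for the restricted range.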

The list of n-grams generated in this way by the n-gram list generation module 136 may be provided as the n-gram list 140. For example, with duplicate n-grams removed, the n-gram list 140 may be generated as '<ㅇ[SEP]', 'ㅇ[SEP]ㄹ', '[SEP]ㄹㅔ', 'ㄹㅔㄴ', 'ㅔㄴ[SEP]', 'ㄴ[SEP]ㅈ', '[SEP]ㅈㅣ', 'ㅈㅣ>', '<ㅇㅗ', 'ㅇㅗ[SEP]', 'ㅗ[SEP]ㄹ', '[SEP]ㄹㄴ', 'ㄹㄴ[SEP]', and '[SEP]ㅈ>' (S900), where [SEP] is a placeholder for the syllable separator. FIG. 4 illustrates, by way of example, n-grams of three characters, and the description above follows it, but the generated n-gram list 140 may further include n-grams of two characters and of four or more characters.
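Step S900 can be pictured as collecting the n-grams of every variant and dropping duplicates while keeping first-seen order. The following sketch assumes string inputs with '|' as a hypothetical syllable separator; `build_ngram_list` is an illustrative name, not one used in the patent.

```python
def char_ngrams(text, n):
    """All contiguous substrings of length n, in left-to-right order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_list(variants, n_values):
    """Collect n-grams from all variants and remove duplicates, keeping order."""
    collected = []
    for text in variants:
        for n in n_values:
            collected.extend(char_ngrams(text, n))
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(collected))


variants = ["<ㅇ|ㄹㅔㄴ|ㅈㅣ>", "<ㅇㅗ|ㄹㄴ|ㅈㅣ>", "<ㅇㅗ|ㄹㅔㄴ|ㅈ>"]
trigram_list = build_ngram_list(variants, n_values=[3])
```

For the three variants of '오렌지', the 24 raw trigrams reduce to 14 distinct entries, matching the list given in the description.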

FIG. 5 is a block diagram illustrating the operation of the processing data generation unit of a word correction system according to an exemplary embodiment of the present invention.

Referring to FIG. 5, the processing data generation unit 100 includes a morpheme separation unit 110, a grapheme separation unit 120, and an n-gram separation unit 130. The list of n-grams separated by the n-gram separation unit 130 may be provided as the n-gram list 140.

The grapheme separation unit 120 may include a grapheme separation module 122 and a syllable separator insertion module 124. In some embodiments, the grapheme separation module 122 may separate each morpheme into consonants and vowels so as to distinguish the initial consonant (choseong), the medial vowel (jungseong), and the final consonant (jongseong, or batchim). For example, the grapheme separation module 122 may decompose each morpheme into a total of 67 symbols: 19 initial consonants (ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅅ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, ㅎ, ㄲ, ㄸ, ㅃ, ㅆ, ㅉ), 21 medial vowels (ㅏ, ㅑ, ㅓ, ㅕ, ㅗ, ㅛ, ㅜ, ㅠ, ㅡ, ㅣ, ㅐ, ㅒ, ㅔ, ㅖ, ㅘ, ㅙ, ㅚ, ㅝ, ㅞ, ㅟ, ㅢ), and 27 final consonants (ㄱ, ㄲ, ㄳ, ㄴ, ㄵ, ㄶ, ㄷ, ㄹ, ㄺ, ㄻ, ㄼ, ㄽ, ㄾ, ㄿ, ㅀ, ㅁ, ㅂ, ㅄ, ㅅ, ㅆ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, ㅎ). In some embodiments, the grapheme separation module 122 may insert a no-final-consonant symbol into each syllable that has no final consonant. The grapheme separation unit 120 may then provide, for the morpheme '오렌지', processing data for analysis such as '<ㅇㅗ[NF][SEP]ㄹㅔㄴ[SEP]ㅈㅣ[NF]>', where [NF] and [SEP] are placeholders for the no-final-consonant symbol and the syllable separator, respectively. In other embodiments, the grapheme separation module 122 may not insert the no-final-consonant symbol; in that case, a syllable may be judged to have no final consonant when a vowel immediately precedes the syllable separator.
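The two alternatives described above, an explicit no-final-consonant symbol or inferring a missing final consonant from a vowel standing at the end of a syllable, can be related with a small sketch. The characters '|' and '_' are hypothetical stand-ins for the syllable separator and the no-final-consonant symbol, and treating the last syllable (before '>') the same way as the others is an assumption of this example.

```python
# The 21 medial vowels, as Hangul compatibility jamo.
VOWELS = set("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")


def has_final(syllable_jamo):
    """A decomposed syllable has a final consonant unless it ends in a vowel."""
    return syllable_jamo[-1] not in VOWELS


def mark_missing_finals(data, sep="|", no_final="_"):
    """Append no_final to every syllable of '<...>' data that lacks a final."""
    syllables = data[1:-1].split(sep)
    marked = [s if has_final(s) else s + no_final for s in syllables]
    return "<" + sep.join(marked) + ">"


flags = [has_final(s) for s in "ㅇㅗ|ㄹㅔㄴ|ㅈㅣ".split("|")]
marked = mark_missing_finals("<ㅇㅗ|ㄹㅔㄴ|ㅈㅣ>")
```

An explicit marker gives every syllable the same three-slot shape, which may make a dropped or added final consonant show up as a distinct n-gram.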

The n-gram separation unit 130 separates the processing data for analysis of each morpheme into n-grams. The n-gram separation unit 130 may include a vowel removal module 132, a final-consonant removal module 134, and an n-gram list generation module 136 that performs the separation into n-grams.

When the processing data for analysis '<ㅇㅗ[NF][SEP]ㄹㅔㄴ[SEP]ㅈㅣ[NF]>' is provided for the morpheme '오렌지', where [NF] and [SEP] are placeholders for the no-final-consonant symbol and the syllable separator, the vowel removal module 132 removes the vowel of one syllable at a time to generate vowel removal processing data such as '<ㅇ[NF][SEP]ㄹㅔㄴ[SEP]ㅈㅣ[NF]>', '<ㅇㅗ[NF][SEP]ㄹㄴ[SEP]ㅈㅣ[NF]>', and '<ㅇㅗ[NF][SEP]ㄹㅔㄴ[SEP]ㅈ[NF]>'. The final-consonant removal module 134 removes the final of one syllable at a time to generate final-consonant removal processing data such as '<ㅇㅗ[SEP]ㄹㅔㄴ[SEP]ㅈㅣ[NF]>', '<ㅇㅗ[NF][SEP]ㄹㅔ[SEP]ㅈㅣ[NF]>', and '<ㅇㅗ[NF][SEP]ㄹㅔㄴ[SEP]ㅈㅣ>'. The number of final-consonant removal processing data items may equal the number of syllables in the morpheme from which the processing data for analysis was generated.
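The final-consonant removal can be sketched in the same style as the vowel removal. This example assumes each syllable is given as a three-symbol string (initial, medial, final), with '_' as a hypothetical no-final-consonant symbol filling the third slot and '|' as a hypothetical syllable separator; removing the "final" of a syllable then means dropping whatever occupies that third slot.

```python
def final_removed_variants(syllables, sep="|"):
    """Return one variant per syllable, each with that syllable's final slot removed.

    syllables: three-symbol jamo strings, e.g. ["ㅇㅗ_", "ㄹㅔㄴ", "ㅈㅣ_"],
    where '_' (an assumed marker) means the syllable has no final consonant.
    """
    variants = []
    for i, syllable in enumerate(syllables):
        edited = list(syllables)
        edited[i] = syllable[:2]  # keep initial and medial, drop the final slot
        variants.append("<" + sep.join(edited) + ">")
    return variants


final_variants = final_removed_variants(["ㅇㅗ_", "ㄹㅔㄴ", "ㅈㅣ_"])
```

As with vowel removal, one variant is produced per syllable, so the count matches the number of syllables in the morpheme.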

The n-gram list generation module 136 may separate each item of vowel removal processing data and final-consonant removal processing data into n-grams, and the list of n-grams generated in this way may be provided as the n-gram list 140.

FIG. 6 is a schematic block diagram of a word correction system according to an exemplary embodiment of the present invention.

Referring to FIG. 6, the word correction system 2 includes a user interface (UI) 20 that receives a word from the user 10 through the network 50 and provides a corrected word.

The network 50 may include any wired or wireless means of exchanging data, such as a wired Internet service, a local area network (LAN), a wide area network (WAN), an intranet, a wireless Internet service, a mobile computing service, a wireless data communication service, a wireless Internet access service, a satellite communication service, a wireless LAN, or Bluetooth. When the network 50 connects to a smartphone or tablet, the network 50 may be a wireless data communication service such as 3G, 4G, or 5G, a wireless LAN such as Wi-Fi, Bluetooth, or the like.

The user interface 20 may provide an interface for accessing the word correction system 2 through a terminal or the like used by the user 10. The user 10 may enter a word and transmit it to the word correction system 2 through the user interface 20, and may receive the corrected word provided by the word correction system 2 through the user interface 20.

The word correction system 2 includes a processing data generation unit 100, a word embedding learning unit 200, a semantic vector generator 300, and a learning word semantic vector storage 350. The processing data generation unit 100, the word embedding learning unit 200, and the semantic vector generator 300 of the word correction system 2 are substantially the same as those described with reference to FIGS. 1 to 5. When the word entered by the user 10 through the user interface 20 is a typo that needs correction, they perform the process described with reference to FIGS. 1 to 5 on that word (the user input word). The learning word semantic vector storage 350 may store the semantic vectors of learning words, that is, of the correct (non-typo) morphemes (learning words or learning phrases) described with reference to FIGS. 1 to 5.

The word embedding learning unit 200 and the semantic vector generator 300 may obtain n-gram vectors by performing n-gram embedding on the n-gram list 140. The semantic vector of a word may be obtained by summing all n-gram vectors of the n-gram list and taking their average, which may be expressed as follows.

$$V_w = \frac{1}{k}\sum_{i=1}^{k} V_i$$

Here, V_w denotes the semantic vector of the input word w, i denotes the index into the n-gram list, k denotes the total number of n-grams in the n-gram list, and V_i denotes the vector of the i-th n-gram.
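The averaging step can be written out directly. The sketch below assumes the n-gram vectors are already available as equal-length lists of floats; how they are learned (the skip-gram training) is outside this example, and `semantic_vector` is an illustrative name.

```python
def semantic_vector(ngram_vectors):
    """Element-wise mean of the k n-gram vectors of one word."""
    k = len(ngram_vectors)
    if k == 0:
        raise ValueError("the n-gram list is empty")
    dim = len(ngram_vectors[0])
    return [sum(vec[d] for vec in ngram_vectors) / k for d in range(dim)]


# Toy two-dimensional vectors for a word with k = 2 n-grams.
v_w = semantic_vector([[1.0, 2.0], [3.0, 4.0]])
```

Because the word vector depends only on n-gram vectors, a misspelled word that never appeared in training still receives a vector as long as some of its n-grams were seen.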

The word correction system 2 further includes an adjacent word list output unit 400, a similarity calculation unit 500, and a correction word output unit 600.

When the semantic vector generator 300 generates a semantic vector for the user input word, the adjacent word list output unit 400 may select a list of adjacent words for the user input word by referring to the semantic vectors of learned words stored in the learning word semantic vector storage 350. As one example, the adjacent word list output unit 400 may compare the n-gram list of each learned word with the n-gram list of the user input word, select as adjacent words those learned words whose n-gram lists overlap with that of the user input word at or above a certain ratio, and provide the resulting list. As another example, the adjacent word list output unit 400 may compare the n-gram vectors of each learned word with those of the user input word, select as adjacent words those learned words whose n-gram vectors are highly similar to those of the user input word, and provide the resulting list.
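The first selection strategy, filtering by n-gram overlap, might look like the sketch below. The description does not say how the ratio is normalized or what the threshold is, so dividing by the number of distinct input n-grams and using a threshold of 0.5 are both assumptions of this example, as are the function names and the toy data.

```python
def overlap_ratio(input_ngrams, candidate_ngrams):
    """Share of the input word's distinct n-grams that also occur in the candidate."""
    input_set = set(input_ngrams)
    if not input_set:
        return 0.0
    return len(input_set & set(candidate_ngrams)) / len(input_set)


def select_adjacent_words(input_ngrams, learned, threshold=0.5):
    """Return learned words whose overlap ratio is at or above threshold.

    learned: mapping from a learned word to its n-gram list.
    """
    return [word for word, ngrams in learned.items()
            if overlap_ratio(input_ngrams, ngrams) >= threshold]


# Toy data with Latin placeholders instead of jamo n-grams.
learned = {"word_a": ["ab", "bc", "cd"], "word_b": ["zz"]}
adjacent = select_adjacent_words(["ab", "bc", "cd", "de"], learned)
```

A cheap set comparison like this narrows the candidates before the vector similarity is computed on the remaining words.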

The similarity calculation unit 500 may calculate the similarity between the semantic vector of the user input word and the semantic vectors of the selected adjacent words, and select a corrected word for the typo. For example, the similarity calculation unit 500 may compare the cosine similarity between the semantic vector of the user input word and the semantic vector of each selected adjacent word, and select the words whose vectors are closest as corrected words. In some embodiments, the similarity calculation unit 500 may select a plurality of words in descending order of semantic vector similarity.
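Ranking the adjacent words by cosine similarity can be sketched as follows. The function names, the `top_k` parameter, and the toy vectors are illustrative assumptions; treating a zero vector as having similarity 0 is a choice made here to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_corrections(input_vector, adjacent_vectors, top_k=3):
    """Return up to top_k adjacent words, most similar first."""
    ranked = sorted(adjacent_vectors.items(),
                    key=lambda item: cosine_similarity(input_vector, item[1]),
                    reverse=True)
    return [word for word, _ in ranked[:top_k]]


# Toy two-dimensional semantic vectors for three hypothetical adjacent words.
adjacent_vectors = {"word_a": [1.0, 0.1], "word_b": [0.0, 1.0], "word_c": [1.0, 1.0]}
best = rank_corrections([1.0, 0.0], adjacent_vectors, top_k=2)
```

Returning several candidates in descending order corresponds to the embodiment in which a plurality of corrected words is offered to the user.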

The correction word output unit 600 may render the selected corrected word in natural language and provide it to the user 10 through the user interface 20. In some embodiments, the correction word output unit 600 may render a plurality of words with high semantic vector similarity in natural language and provide them to the user 10 in descending order of similarity.

FIG. 7 is a schematic block diagram of a word correction system according to an exemplary embodiment of the present invention. The word correction system 1 shown in FIG. 1 obtains semantic vectors for correctly spelled words (correct morphemes rather than typos) and stores them in the learning word semantic vector storage 350 for use in word correction, and the word correction system 2 shown in FIG. 6 uses the learning word semantic vector storage 350, in which the semantic vectors of correctly spelled words are already stored, to find corrected words for typos. The word correction system 3 shown in FIG. 7, by contrast, may be a system that does both: it obtains semantic vectors for correctly spelled words and stores them in the learning word semantic vector storage 350, and it uses the learning word semantic vector storage 350 to find corrected words for typos. Accordingly, descriptions that overlap with those given for FIGS. 1 to 6 may be omitted.

Referring to FIG. 7, the word correction system 3 may include a user interface 20, a processing data generation unit 100, a word embedding learning unit 200, a semantic vector generator 300, a learning word semantic vector storage 350, an adjacent word list output unit 400, a similarity calculation unit 500, and a correction word output unit 600.

The word correction system 3 receives learning phrases from the processing data storage unit 1000, generates an n-gram list 140 for the learning phrases in the processing data generation unit 100, performs word embedding learning in the word embedding learning unit 200, and then generates learning word semantic vectors in the semantic vector generator 300 and stores them in the learning word semantic vector storage 350.

The word correction system 3 also receives from the user 10, through the network 50, a user input word that is a typo needing correction, generates an n-gram list 140 for the user input word in the processing data generation unit 100, performs word embedding learning in the word embedding learning unit 200, and then generates the semantic vector of the user input word in the semantic vector generator 300.

The adjacent word list output unit 400 may select and provide adjacent words for the user input word by comparing the n-gram lists of the learning words stored in the learning word semantic vector storage 350 with the n-gram list of the user input word, or by comparing the similarity of their n-gram vectors.

The similarity calculation unit 500 may then calculate the similarity between the semantic vector of the user input word and the semantic vectors of the selected adjacent words to select a corrected word for the mistyped user input word, and the correction word output unit 600 may render the selected corrected word in natural language and provide it to the user 10 through the user interface 20.

Although the present invention has been described in detail above with reference to preferred embodiments, the present invention is not limited to those embodiments, and various modifications and changes may be made by those of ordinary skill in the art within the technical spirit and scope of the present invention.

1, 2, 3: word correction system, 10: user, 20: user interface, 50: network, 100: processing data generation unit, 200: word embedding learning unit, 300: semantic vector generator, 350: learning word semantic vector storage, 400: adjacent word list output unit, 500: similarity calculation unit, 600: correction word output unit, 1000: processing data storage unit

Claims

A word correction system comprising:
a user interface that receives a user input word from a user through a network;
a processing data generation unit that separates a learning word and the user input word into morphemes, separates each separated morpheme into graphemes to generate processing data for analysis, and separates the processing data for analysis into n-grams to generate an n-gram list;
a word embedding learning unit that performs word embedding learning on the n-gram list through a skip-gram model; and
a semantic vector generator that generates a semantic vector from the n-gram list.

The word correction system according to claim 1,
wherein the processing data for analysis includes symbols marking the start and end of each morpheme, and a syllable separator.

The word correction system according to claim 2,
wherein the processing data generation unit comprises:
a morpheme separation unit that performs morphological analysis on the learning word and the user input word to separate them into morphemes; a grapheme separation unit that separates each morpheme separated by the morpheme separation unit into consonants and vowels; and an n-gram separation unit that performs separation into n-grams using the separated consonants and vowels.

The word correction system according to claim 3,
wherein the n-gram separation unit generates vowel removal processing data by removing the vowel of each syllable from the processing data for analysis, and then separates the vowel removal processing data to generate n-grams.

The word correction system according to claim 4,
wherein, for the x characters constituting each item of vowel removal processing data, the n-gram separation unit generates n-grams consisting of at least 2 and fewer than x characters.

The word correction system according to claim 4,
wherein the number of vowel removal processing data items is equal to the number of syllables in the morpheme used to generate the processing data for analysis.

The word correction system according to claim 3,
wherein the n-gram separation unit generates vowel removal processing data by removing the vowel of each syllable from the processing data for analysis and final-consonant removal processing data by removing the final consonant of each syllable, and then separates the vowel removal processing data and the final-consonant removal processing data to generate n-grams.

The word correction system according to claim 1, further comprising:
an adjacent word list output unit that compares the semantic vector obtained from the learning word with the semantic vector obtained from the user input word, selects adjacent words for the user input word, and provides an adjacent word list;
a similarity calculation unit that calculates the similarity between the semantic vector obtained from the user input word and the semantic vectors of the selected adjacent words, and selects a corrected word for the user input word; and
a correction word output unit that renders the selected corrected word in natural language and provides it to the user through the user interface.

The word correction system according to claim 1,
wherein the semantic vector generator generates the semantic vector by summing the vectors of all n-grams in the n-gram list and then taking their average.