KR20100138194A

KR20100138194A - System and method for recommending Japanese language automatically using transformation of romaji

Info

Publication number: KR20100138194A
Application number: KR1020090056609A
Authority: KR
Inventors: 고병일; 기윤서; 김태일; 서희철
Original assignee: 엔에이치엔(주)
Priority date: 2009-06-24
Filing date: 2009-06-24
Publication date: 2010-12-31
Also published as: JP2011008784A; KR101086550B1; JP5097802B2

Abstract

PURPOSE: A system and method for automatically recommending Japanese words using romaji conversion are provided, which convert the pronunciation of an input Japanese word into Roman letters and search for similar words based on the converted Roman letters. CONSTITUTION: A romaji conversion unit (103) converts the pronunciation of a word written in hiragana or katakana into Roman letters. A similar-word search unit (104) searches for similar words for the word based on the converted Roman letters, taking into account a similarity score of the romanized word. A similar-word recommendation unit (105) converts the retrieved similar word into one of hiragana, katakana, or kanji.

Description

Japanese automatic recommendation system and method using romaji conversion {SYSTEM AND METHOD FOR RECOMMENDING JAPANESE LANGUAGE AUTOMATICALLY USING TRANSFORMATION OF ROMAJI}

The present invention relates to a system and method for recommending similar words for an input Japanese word, and more particularly to a system and method that recommend similar words by converting the pronunciation of the input Japanese word into Roman letters (romaji).

A user performs a search by entering a word into the search box of a search engine to obtain desired information. If the user mistypes the word, the typo degrades the quality of the retrieved documents or leaves almost no documents retrieved. To address this, search engines have identified such words as typos and recommended the word the user actually intended to enter.

Moreover, even when a user enters a word without error, it is only rarely the optimal word for obtaining the desired result. In that case the search engine returns results, but the user is likely to be dissatisfied with them. To address this, a search engine can improve search accuracy by offering related or similar words for the word the user entered.

These situations are especially problematic for Japanese search. Conventionally, it has been difficult to guarantee accuracy when judging a Japanese input to be a typo and suggesting a correction, or when providing similar words for a Japanese input. Above all, because Japanese can be written in kanji, hiragana, and katakana, recommending an appropriate word for the user's input has been difficult. A method is therefore needed that recommends an appropriate word regardless of the script in which the Japanese input is written.

The present invention can provide a system and method that improve the accuracy of similar-word search for Japanese by converting the pronunciation of an input Japanese word into romaji and searching for similar words based on the converted romaji.

The present invention can provide a system and method that determine whether an input Japanese word is a typo and, if it is, search for similar words and provide a correct word, so that an appropriate correct word is recommended and search accuracy improves even when the user mistypes a search query.

The present invention can provide a system and method that, when the input Japanese word is in kanji, split the word into tokens using training data generated through machine learning and convert the split tokens into hiragana, thereby performing fast and accurate kanji-to-hiragana conversion.

The present invention can provide a system and method that search for and recommend similar words in a script different from that of the Japanese word the user entered, enabling the user to search more accurately.

A Japanese automatic recommendation system according to an embodiment of the present invention may include a romaji conversion unit that converts the pronunciation of a word written in hiragana or katakana into romaji, and a similar-word search unit that searches for similar words for the word based on the converted romaji.

The Japanese automatic recommendation system according to an embodiment of the present invention may further include a similar-word recommendation unit that converts the retrieved similar word into one of hiragana, katakana, or kanji and recommends it.

The Japanese automatic recommendation system according to an embodiment of the present invention may further include a typo determination unit that analyzes the input word and determines whether the word is a typo.

The Japanese automatic recommendation system according to an embodiment of the present invention may further include a correct-word selection unit that, when the input word is a typo, selects a correct word for the word from among the retrieved similar words in consideration of the similarity score or an edit distance that depends on word occurrence frequency.

The Japanese automatic recommendation system according to an embodiment of the present invention may further include a kanji-hiragana conversion unit that, when the input word is in kanji, splits the word into tokens using token segmentation training data and converts each token into the corresponding hiragana using kanji-hiragana conversion training data.

A Japanese automatic recommendation method according to an embodiment of the present invention may include converting the pronunciation of a word written in hiragana or katakana into romaji, and searching for similar words for the word based on the converted romaji.

According to an embodiment of the present invention, the accuracy of similar-word search for Japanese can be improved by converting the pronunciation of an input Japanese word into romaji and searching for similar words based on the converted romaji.

According to an embodiment of the present invention, whether an input Japanese word is a typo is determined and, if it is, similar words are searched and a correct word is provided, so that an appropriate correct word is recommended and search accuracy improves even when the user mistypes a search query.

According to an embodiment of the present invention, when the input Japanese word is in kanji, the word is split into tokens using training data generated through machine learning and the split tokens are converted into hiragana, so that fast and accurate kanji-to-hiragana conversion can be performed.

According to an embodiment of the present invention, similar words in a script different from that of the Japanese word the user entered are searched for and recommended, enabling the user to search more accurately.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The present invention is not, however, restricted or limited by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 is a block diagram showing the overall configuration of a Japanese automatic recommendation system according to an embodiment of the present invention.

Referring to FIG. 1, the Japanese automatic recommendation system (100) may include a typo determination unit (101), a kanji-hiragana conversion unit (102), a romaji conversion unit (103), a similar-word search unit (104), a similar-word recommendation unit (105), and a correct-word selection unit (106).

In a Japanese search, the user may enter Japanese text to find desired information. The user may enter word A (107) in kanji, hiragana, or katakana. By romanizing the pronunciation of the entered word (107), the Japanese automatic recommendation system (100) can recommend a more accurate Japanese word B (108).

According to one embodiment of the present invention, when the user enters a typo, the Japanese automatic recommendation system (100) may select and provide a correct word for the typo using the units from the typo determination unit (101) to the correct-word selection unit (106). According to another embodiment of the present invention, when the user enters a correctly spelled word rather than a typo, the Japanese automatic recommendation system (100) may provide similar words using the units from the kanji-hiragana conversion unit (102) to the similar-word recommendation unit (105). The description below focuses on the case where the user enters a typo.

The typo determination unit (101) may analyze the word (107) entered by the user to determine whether it is a typo. If the word (107) is a typo, the romaji conversion unit (103) may convert it into romaji.

For example, the typo determination unit (101) may decide whether the word (107) is a typo by checking whether it appears in preset typo data. Specifically, the typo data may be determined using dictionary entries, a content DB list built by the search engine, manual inspection, and the like, and the typo determination unit (101) may judge the word (107) to be a typo if it is contained in that typo data.

As another example, the typo determination unit (101) may decide whether the word (107) is a typo by considering whether its input frequency or document occurrence frequency is lower than a preset reference frequency.

Here, the input frequency of the word (107) is the number of times users have entered it; the typo determination unit (101) may judge a word (107) with a low input frequency to be a typo. The document occurrence frequency is the number of documents returned as search results when searching with the word (107); the typo determination unit (101) may likewise judge a word (107) with a low document occurrence frequency to be a typo.

Alternatively, the typo determination unit (101) may judge the word (107) to be a typo when its document occurrence frequency is lower than its query frequency. It may also judge a sequence of consecutive words (107) with a low document occurrence frequency to be a typo.
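The dictionary and frequency heuristics above can be sketched roughly as follows. This is only an illustration: the names `WordStats` and `looks_like_typo` and the two threshold values are assumptions made for the example, and the description does not say how the individual checks are combined.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    query_count: int     # how many times users have entered the word
    document_count: int  # how many documents a search for the word returns


def looks_like_typo(word, stats, known_typos, min_queries=5, min_documents=10):
    """Return True if any of the described heuristics flags the word.

    min_queries and min_documents stand in for the "preset reference
    frequency"; their values here are hypothetical.
    """
    if word in known_typos:                      # preset typo data
        return True
    if stats.query_count < min_queries:          # low input frequency
        return True
    if stats.document_count < min_documents:     # low document occurrence frequency
        return True
    # Document occurrence frequency lower than query frequency.
    return stats.document_count < stats.query_count
```

A word that users type often but that returns comparatively few documents would be flagged by the last check even though it passes both thresholds.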

As yet another example, the typo determination unit (101) may decide whether the word (107) is a typo by considering whether it can be separated into morphemes. If the input word is separated into morphemes by a morphological analyzer or a part-of-speech tagger, the typo determination unit (101) may judge that the word (107) is not a typo. In other words, a typo cannot easily be separated into morphemes, so the typo determination unit (101) may judge a word that separates readily into morphemes to be correctly spelled.

When the input word (107) is in kanji, the kanji-hiragana conversion unit (102) may split the word into tokens using token segmentation training data, and then convert each token into the corresponding hiragana using kanji-hiragana conversion training data. Because the same kanji is read differently in Japanese depending on how it is used, converting kanji into the correct hiragana is important. The kanji-hiragana conversion unit (102) is described in detail with reference to FIG. 3.

The romaji conversion unit (103) may convert the pronunciation of a word (107) written in hiragana or katakana into romaji. If the word is in kanji, it may first be converted into hiragana by the kanji-hiragana conversion unit (102) and then converted into romaji. For example, if the input word is the kanji 映画 (movie), the kanji-hiragana conversion unit (102) converts it into えいが, and the romaji conversion unit (103) converts its pronunciation into the romaji eiga. An example of conversion by the romaji conversion unit (103) is described in detail with reference to FIG. 4.
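As a rough illustration of the kana-to-romaji step, the sketch below romanizes a handful of kana. The lookup table is deliberately partial, and the handling of the geminate mark (doubling the next consonant) and the long-vowel mark (repeating the previous vowel) is an assumption for the example; the description does not specify a romanization scheme.

```python
# Partial hiragana table; a real system would cover every kana and digraph.
KANA_TO_ROMAJI = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
    "が": "ga", "ぎ": "gi", "ぐ": "gu", "げ": "ge", "ご": "go",
    "た": "ta", "と": "to", "ど": "do", "め": "me", "び": "bi",
    "ら": "ra", "り": "ri", "る": "ru", "れ": "re", "ろ": "ro",
    "ん": "n",
}


def to_hiragana(text):
    """Map katakana to hiragana; the two Unicode blocks are offset by 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def kana_to_romaji(text):
    """Romanize a kana string; raises KeyError for kana outside the table."""
    out = []
    double_next = False
    for ch in to_hiragana(text):
        if ch == "っ":               # geminate mark: double the next consonant
            double_next = True
            continue
        if ch == "ー":               # long-vowel mark: repeat the previous vowel
            if out:
                out.append(out[-1][-1])
            continue
        syllable = KANA_TO_ROMAJI[ch]
        if double_next:
            syllable = syllable[0] + syllable
            double_next = False
        out.append(syllable)
    return "".join(out)


print(kana_to_romaji("えいが"))    # eiga
print(kana_to_romaji("オリコン"))  # orikon
```

Normalizing katakana to hiragana first keeps the table half the size, since both scripts encode the same sounds.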

The similar-word search unit (104) may search for similar words for the word (107) based on the converted romaji. For example, it may search for similar words in consideration of a similarity score of the romanized word. Measuring similarity directly on hiragana, katakana, or kanji is inaccurate because the edit distance has very low resolution in those scripts, so the present invention measures similarity after converting the pronunciation of the word into romaji. For example, rather than comparing オリゴン and オリコン directly, converting them into romaji and comparing origon with orikon allows their similarity to be compared more accurately.
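The resolution argument can be illustrated with a plain Levenshtein distance. In this sketch the function name and the extra comparison word オリマン (oriman) are assumptions added for illustration, and the romaji strings are written out by hand rather than produced by a converter.

```python
def levenshtein(a, b):
    """Classic edit distance: unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


# On kana, both pairs differ by exactly one character.
print(levenshtein("オリコン", "オリゴン"))  # 1
print(levenshtein("オリコン", "オリマン"))  # 1

# On romaji, the pair that differs only in voicing stays closer.
print(levenshtein("orikon", "origon"))  # 1
print(levenshtein("orikon", "oriman"))  # 2
```

Because each kana packs a consonant and a vowel into one symbol, a kana-level distance cannot tell a one-sound slip from a two-sound one, whereas the romaji-level distance can.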

Here, the similarity score may be determined based on at least one of the following: the input frequency as a function of word length; an edit distance that depends on whether the word contains a long-vowel mark, a middle dot, a geminate consonant mark (sokuon), or a voiced sound mark; or the degree of match between the words in their original form. For example, when the word is in kanji, the similar-word search unit (104) may determine the similarity score by considering the comparison result in romanized form, the comparison result in hiragana form, and the comparison result in the original kanji form. Similar-word search is described in detail with reference to FIG. 2.

The similar-word recommendation unit (105) may convert the retrieved similar word into a word (108) in one of hiragana, katakana, or kanji and recommend it. The user may then search by entering the recommended word (108).

For example, the similar-word recommendation unit (105) may convert the retrieved similar word into a word (108) in a script different from that of the word (107) the user entered. Even if the user enters a word (107) in hiragana, for instance, the similar-word recommendation unit (105) may convert a similar word for that input into a kanji word (108) and recommend it.

When the word (107) entered by the user is a typo, the correct-word selection unit (106) may select a correct word (108) for it from among the retrieved similar words, in consideration of the similarity score or an edit distance that depends on the input frequency of the word. That is, when several similar words are candidates for a typo in the input word (107), the correct-word selection unit (106) may select and provide as the correct word (108) a similar word with a high similarity score or a high input frequency.

FIG. 2 illustrates a process of automatically recommending Japanese words through romaji conversion of an input word, according to an embodiment of the present invention.

When a Japanese word is entered by the user, the typo determination unit (101) may determine whether the input word is a typo. As described above, it may do so by considering whether the word is contained in preset typo data, whether the input frequency or document occurrence frequency of the word is lower than a preset reference frequency, or whether the word can be separated into morphemes.

If the input word is a typo, the correct-word selection unit (106) may select and provide a correct word from among the similar words of the input word. Conversely, if the input word is correctly spelled, the correct-word selection unit (106) does not operate.

As shown in FIG. 2, the input word may be in hiragana, katakana, or kanji. When the input word is in hiragana or katakana, the romaji conversion unit (103) may convert the pronunciation of the word into romaji.

When the input word is in kanji, it is difficult to convert the kanji directly into romaji, so the word may first be normalized into hiragana by the kanji-hiragana conversion unit (102). Specifically, the kanji-hiragana conversion unit (102) may split the kanji into tokens using token segmentation training data and convert each token into the corresponding hiragana using kanji-hiragana conversion training data. The romaji conversion unit (103) may then convert the pronunciation of the resulting hiragana into romaji.

The similar-word search unit (104) may then search for similar words for the word based on the converted romaji. Specifically, it may search for similar words in consideration of a similarity score of the romanized word.

For example, the similarity score may be determined based on at least one of the following: the input frequency as a function of word length; an edit distance that depends on whether the word contains a long-vowel mark, a middle dot, a geminate consonant mark (sokuon), or a voiced sound mark; or the degree of match between the words in their original form.

Word length: information - information [edit distance, similarity]

Long-vowel mark: ハロワーク (typo), ハローワーク (typo), ハローワーク (correct)

Middle dot: ピートローズ (typo), ピート・ローズ (correct)

Voiced sound mark: オリゴン (typo), オリコン (correct)

Geminate consonant mark (sokuon): ビクカメラ (typo), ビックカメラ (correct)

Original form: 花よりだんごファイナル (typo), 花より男子ファイナル (correct)

Because shorter words are entered more frequently, the similar-word search unit (104) may increase the similarity score as the word gets shorter.

Because the Japanese long-vowel mark (ー) is inserted or deleted more easily than other characters, the similar-word search unit (104) may assign a smaller edit-distance weight when a word contains a long-vowel mark, which raises the similarity score. Likewise, because the Japanese middle dot (中點, ・) is inserted or deleted more easily than other characters, the similar-word search unit (104) may assign a smaller edit-distance weight when a word contains a middle dot, which raises the similarity score. Because the Japanese geminate consonant mark (っ) is often omitted or miswritten as a similar-sounding form, the similar-word search unit (104) may assign a smaller edit-distance weight when a word contains it, which raises the similarity score.
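One way to realize the reduced weighting is an edit distance in which inserting or deleting one of these marks costs less than any other edit. The sketch below is an assumption-laden illustration: it works on kana strings, the set `CHEAP_MARKS` and the cost `CHEAP_COST = 0.25` are hypothetical, and the description does not give actual weights.

```python
# Marks that are easily dropped or added (both kana forms of the small tsu).
CHEAP_MARKS = {"ー", "・", "っ", "ッ"}
CHEAP_COST = 0.25  # hypothetical weight for inserting or deleting such a mark


def indel_cost(ch):
    """Cost of inserting or deleting a single character."""
    return CHEAP_COST if ch in CHEAP_MARKS else 1.0


def weighted_edit_distance(a, b):
    """Edit distance with cheap insert/delete for easily confused marks."""
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel_cost(cb))
    for ca in a:
        cur = [prev[0] + indel_cost(ca)]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel_cost(ca),                  # delete ca
                cur[j - 1] + indel_cost(cb),               # insert cb
                prev[j - 1] + (0.0 if ca == cb else 1.0),  # substitute
            ))
        prev = cur
    return prev[-1]


print(weighted_edit_distance("ビクカメラ", "ビックカメラ"))      # 0.25
print(weighted_edit_distance("ピートローズ", "ピート・ローズ"))  # 0.25
print(weighted_edit_distance("オリゴン", "オリコン"))            # 1.0
```

A missing mark thus counts as a quarter of an ordinary edit, so candidates that differ from the query only by such a mark rank above candidates that differ by a full character.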

In addition to the romanized form, the similar-word search unit (104) may also compare the words in their original form and reflect the result in the similarity score. Comparing the original forms compensates for errors that arise when similar words are searched only in the romaji-normalized state. For example, if the input word is うとん (uton), the similar-word search unit (104) may give a higher similarity score to うどん (udon), whose original form is closer to the input, than to うろん (uron), thereby compensating for the error that arises when similarity is judged through romaji conversion alone.
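In romaji, uton is one edit away from both udon and uron, so an original-form comparison has to break the tie. The sketch below shows one possible approach, which is an assumption rather than the method of the description: kana that differ only in a voicing mark are detected through Unicode decomposition, and the score `voicing_only=0.9` is a hypothetical value.

```python
import unicodedata

# Combining voiced (dakuten) and semi-voiced (handakuten) sound marks.
VOICING_MARKS = {"\u3099", "\u309a"}


def strip_voicing(text):
    """Remove voicing marks, e.g. ど becomes と."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in VOICING_MARKS)
    return unicodedata.normalize("NFC", kept)


def original_form_score(query, candidate, voicing_only=0.9):
    """Score in [0, 1]; higher means closer in the original script."""
    if query == candidate:
        return 1.0
    if strip_voicing(query) == strip_voicing(candidate):
        return voicing_only  # differ only in voicing marks
    # Fallback: share of positions holding the same character.
    matches = sum(a == b for a, b in zip(query, candidate))
    return matches / max(len(query), len(candidate))


print(original_form_score("うとん", "うどん"))  # 0.9
print(original_form_score("うとん", "うろん") < 0.9)  # True
```

With this score, うどん outranks うろん for the query うとん even though the two candidates tie in romaji.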

For example, when the word is in kanji, the similar-word search unit (104) may determine the similarity score by considering the comparison result in romanized form, the comparison result in hiragana form, and the comparison result in the original kanji form. Specifically, when the word is in kanji, the similar-word search unit (104) may determine the similarity score according to Equation 1 below.

SCORE(q, t) = a * score_romaji(q_romaji, t_romaji) + b * score_hiragana(q_hiragana, t_hiragana) + c * score_original(q_original, t_original)

Here, q is the Japanese word entered by the user (the query) and t is a similar word. The coefficients a, b, and c are constants, which may be derived through machine learning or similar means.
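Equation 1 can be sketched as a weighted sum over the three forms of each word. Everything concrete here is an assumption for illustration: the `Forms` container, the use of a normalized Levenshtein similarity for all three component scores, and the weights a=0.5, b=0.25, c=0.25 (the description says the constants may be learned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forms:
    romaji: str
    hiragana: str
    original: str


def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score(q, t, a=0.5, b=0.25, c=0.25):
    """Weighted sum of the romaji, hiragana, and original-form scores."""
    return (a * similarity(q.romaji, t.romaji)
            + b * similarity(q.hiragana, t.hiragana)
            + c * similarity(q.original, t.original))


query = Forms("hanayoridangofainaru", "はなよりだんごふぁいなる", "花よりだんごファイナル")
target = Forms("hanayoridangofainaru", "はなよりだんごふぁいなる", "花より男子ファイナル")
print(score(query, target))  # below 1.0 only because the original forms differ
```

In this example the two words sound identical, so the romaji and hiragana terms are both 1.0 and only the original-form term lowers the total.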

Once similar words have been retrieved through this process, as shown in FIG. 2, the similar-word recommendation unit (105) may convert a retrieved similar word into hiragana, katakana, or kanji and recommend it. For example, when the input word is in hiragana, the similar-word recommendation unit (105) may recommend the similar word in hiragana, katakana, or kanji. That is, the similar-word recommendation unit (105) may convert the retrieved similar word into a script different from that of the input word.

For example, the similar-word recommendation unit (105) may decline to recommend a similar word when the difference between its similarity in the romanized state and its similarity in the non-romanized state exceeds a preset threshold. As another example, the similar-word recommendation unit (105) may decline to recommend a similar word when the input word is used more often than the similar word.

When the input word is a typo, the correct-word selection unit (106) may select a correct word for it from among the retrieved similar words, in consideration of the similarity score or an edit distance that depends on the input frequency of the word. Specifically, the correct-word selection unit (106) may select as the correct word the similar word that has the highest similarity score, or that has a low edit distance owing to a high input frequency.
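The filtering and selection rules above might be combined as in the following sketch. The `Candidate` fields, the threshold `max_gap=0.5`, the tie-breaking order, and the sample numbers are all assumptions for illustration; the description leaves these details open.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    word: str
    romaji_similarity: float    # similarity to the query in romanized form
    original_similarity: float  # similarity to the query in original form
    input_frequency: int        # how often users enter this word


def should_recommend(c: Candidate, query_frequency: int, max_gap: float = 0.5) -> bool:
    # Skip candidates whose romanized and original similarities disagree too much.
    if abs(c.romaji_similarity - c.original_similarity) > max_gap:
        return False
    # Skip candidates that are used less often than the query itself.
    return c.input_frequency >= query_frequency


def select_correct_word(candidates: Sequence[Candidate],
                        query_frequency: int) -> Optional[str]:
    eligible = [c for c in candidates if should_recommend(c, query_frequency)]
    if not eligible:
        return None
    # Highest similarity first; input frequency breaks ties.
    best = max(eligible, key=lambda c: (c.romaji_similarity, c.input_frequency))
    return best.word


candidates = [
    Candidate("オリコン", 0.83, 0.75, 12000),
    Candidate("オリオン", 0.83, 0.75, 300),
    Candidate("オレゴン", 0.66, 0.50, 50000),
]
print(select_correct_word(candidates, query_frequency=40))  # オリコン
```

Returning `None` when no candidate survives the filters corresponds to the case where no recommendation is made.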

도 3은 본 발명의 일실시예에 따라 한자로부터 히라가나로 변환하는 과정을 도시한 도면이다.3 is a diagram illustrating a process of converting a Chinese character into hiragana according to an embodiment of the present invention.

본 발명의 일실시예에 따른 한자-히라가나 변환부는 입력된 한자에 대해 히라가나로 변환할 수 있다. 그러면, 로마자 변환부는 히라가나를 로마자로 변환할 수 있다.Kanji-Hiragana conversion unit according to an embodiment of the present invention may convert the input Chinese characters to hiragana. Then, the roman conversion unit may convert hiragana to roman characters.

일례로, 한자-히라가나 변환부는 토큰 분할 학습 데이터(302)를 이용하여 토큰 분할(305)에 따라 한자(304)를 토큰 별로 분할하고, 한자-히라가나 변환 학습 데이터(303)를 이용하여 한자-히라가나 변환(306)을 통해 분할된 토큰(305)에 대응하는 히라가나(307)로 변환할 수 있다.For example, the kanji-hiragana conversion unit divides the kanji 304 by token according to the token division 305 using the token division learning data 302 and the kanji-hiragana using the kanji-hiragana conversion learning data 303. The transform 306 may convert the split token 305 into a hiragana 307 corresponding to the split token 305.

In the case of 僕と彼女の生きる道, the token split learning data (302) is used to split the input into the tokens 僕, と, 彼女, の, 生き, る, and 道, and the hiragana state sequence with the maximum probability over the token bigrams is selected. The result may be as follows: 僕-ぼく　と　彼女-かのじょ　の　生きる-いきる　道-みち. The input can thus finally be converted to ぼくとかのじょのいきるみち.

Here, the learning data may be produced by building a hiragana training document corresponding to the kanji (304) from Japanese documents (301), such as Japanese news articles or documents posted on Japanese blogs, and then selecting and combining the hiragana according to the input form through a machine learning method based on that training document.

For example, the token split learning data (302) may be determined through Hidden Markov Model (HMM) based word-spacing learning using a corpus divided into kanji morpheme tokens. In this case, the token split learning data (302) may be determined through word-spacing learning based on a syllable trigram HMM.

For example, the kanji-hiragana conversion learning data (303) may include a unigram dictionary (303-1) and a bigram dictionary (303-2) determined through learning based on a corpus separated into morpheme tokens of the kanji (304). The unigram dictionary (303-1) may be built from the frequencies between a token and its hiragana (token - hiragana). The bigram dictionary (303-2) may be built from the frequencies between tokens (token 1 - token 2). That is, the kanji-hiragana conversion unit may convert the kanji (304) into hiragana (307) using the token split learning data (302) and the kanji-hiragana conversion learning data (303), both determined from the documents (301) through the learning process.
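As a minimal sketch of how the two dictionaries might be counted, assume the corpus is already available as sentences of (token, hiragana reading) pairs. That corpus layout and the function name `build_dictionaries` are illustrative assumptions, not details given in the specification.

```python
from collections import Counter


def build_dictionaries(corpus):
    """Count token-reading and token-token frequencies from a tokenized corpus.

    corpus: list of sentences, each a list of (token, reading) pairs (assumed layout).
    """
    unigram = Counter()  # (token, reading) -> frequency, as in dictionary 303-1
    bigram = Counter()   # (token1, token2) -> frequency, as in dictionary 303-2
    for sentence in corpus:
        for token, reading in sentence:
            unigram[(token, reading)] += 1
        # Adjacent token pairs within the same sentence.
        for (t1, _), (t2, _) in zip(sentence, sentence[1:]):
            bigram[(t1, t2)] += 1
    return unigram, bigram
```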

For example, for the tokens split from the kanji (304) using the token split learning data (302), the kanji-hiragana conversion unit may search the bigram dictionary (303-2) two tokens at a time and select the tokens having the maximum probability. The kanji-hiragana conversion unit (102) may then convert the finally selected tokens into the hiragana (307) that corresponds to them in the unigram dictionary (303-1). If the bigram dictionary (303-2) does not contain enough information, the kanji-hiragana conversion unit may instead use the unigram dictionary (303-1) to select the tokens having the maximum probability.
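The bigram lookup with a unigram fallback might be sketched as follows. Several things here are assumptions rather than the patent's stated method: that the splitter proposes a few candidate token sequences, that a sequence is scored by summed log-probabilities, the add-one smoothing in the fallback, and all function and parameter names.

```python
import math


def transition_logprob(t1, t2, bigram, token_freq, total):
    """log P(t2 | t1) from bigram counts, backing off to the unigram frequency of t2."""
    pair = bigram.get((t1, t2), 0)
    if pair > 0 and token_freq.get(t1, 0) > 0:
        return math.log(pair / token_freq[t1])
    # Bigram information is missing: fall back to a smoothed unigram estimate.
    return math.log((token_freq.get(t2, 0) + 1) / (total + 1))


def to_hiragana(segmentations, bigram, readings):
    """Choose the most probable token sequence, then emit each token's most frequent reading.

    segmentations: candidate token lists; readings: token -> {reading: count}.
    """
    token_freq = {tok: sum(r.values()) for tok, r in readings.items()}
    total = sum(token_freq.values())

    def score(tokens):
        return sum(transition_logprob(a, b, bigram, token_freq, total)
                   for a, b in zip(tokens, tokens[1:]))

    best = max(segmentations, key=score)
    # Tokens without a known reading (for example kana) are passed through unchanged.
    return "".join(max(readings[t], key=readings[t].get) if t in readings else t
                   for t in best)
```

With toy counts in which 生きる is followed by 道 in the bigram dictionary, the sequence 生きる, 道 outscores 生, きる, 道 and is read as いきるみち.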

FIG. 4 is a diagram illustrating an example of conversion into romaji according to an embodiment of the present invention.

FIG. 4 shows an example of conversion into romaji for the "a" row and the "ka" row. The romaji conversion unit may convert the pronunciation of a word expressed in Japanese hiragana or katakana into romaji. If the input word is in kanji, the kanji-hiragana conversion unit may first convert the kanji into hiragana.

As shown in FIG. 4, for the あ row, the romaji conversion unit may convert あ into the romaji "a" and い into the romaji "i". Likewise, the romaji conversion unit may convert う into "u", え into "e", and お into "o". Through this process, the Japanese automatic recommendation system converts hiragana or katakana into romaji and can thereby search for similar words of the input word more precisely.
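A table-driven conversion of this kind can be sketched in a few lines. The あ-row values follow the text; the か-row values, the katakana folding step, and the names `KANA_TO_ROMAJI`, `katakana_to_hiragana`, and `to_romaji` are assumptions added for illustration, and a real table would have to cover every kana, including small kana and long-vowel marks.

```python
# Partial mapping: あ row as described in the text, か row assumed for illustration.
KANA_TO_ROMAJI = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
}


def katakana_to_hiragana(text):
    """Fold katakana (U+30A1..U+30F6) onto hiragana by the fixed 0x60 code point offset."""
    return "".join(chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text)


def to_romaji(text):
    """Convert kana to romaji character by character; unknown characters pass through."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in katakana_to_hiragana(text))
```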

As described above, when similar words are searched using hiragana and katakana as they are, the resolution of the edit distance is low, so a machine such as a server, unlike a human, has difficulty distinguishing オリゴン from オリコン. In this case, comparing the romanized forms of オリゴン and オリコン, namely origon and orikon, yields a more precise similarity score and improves the accuracy of similar word recommendation.
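The resolution argument can be made concrete with a plain Levenshtein distance. The normalization used below (one minus distance divided by the longer length) is an assumed scoring formula chosen for illustration; the specification does not give one.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1] (assumed formula)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Both pairs differ by a single edit, but that edit is one character out of four in kana and one out of six in romaji, so the romanized comparison changes the score in smaller steps: 0.75 for オリゴン versus オリコン, and about 0.83 for origon versus orikon.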

FIG. 5 is a flowchart illustrating the overall process of a Japanese automatic recommendation method according to an embodiment of the present invention.

Referring to FIG. 5, the Japanese automatic recommendation system may determine whether a word input by the user is a typo (S501). If the input word is a typo, the Japanese automatic recommendation system may select and provide the correct word from among the similar words for that word (S507).

The Japanese automatic recommendation system may automatically recommend similar words for the input word not only when the input word is a typo but also when it is correctly spelled. The Japanese automatic recommendation system may determine whether the input word is in kanji (S502). If the word is in kanji, the Japanese automatic recommendation system may convert the kanji into hiragana (S503), after which step S504 is performed. If the input word is not in kanji, no separate conversion is performed.

Specifically, the Japanese automatic recommendation system may split the word into tokens using the token split learning data, and convert the split tokens into the corresponding hiragana using the kanji-hiragana conversion learning data.

Here, the token split learning data may be determined through Hidden Markov Model (HMM) based word-spacing learning using a corpus divided into kanji morpheme tokens. The kanji-hiragana conversion learning data may include a bigram dictionary and a unigram dictionary determined through learning based on a corpus separated into kanji morpheme tokens. The bigram dictionary may be built from the frequencies between tokens, and the unigram dictionary may be built from the frequencies between a token and its hiragana.

The Japanese automatic recommendation system may then search the bigram dictionary for the split tokens, select the tokens having the maximum probability, and convert the selected tokens into the hiragana that corresponds to them in the unigram dictionary.

The Japanese automatic recommendation system may convert the pronunciation of a word expressed in Japanese hiragana or katakana into romaji (S504). The Japanese automatic recommendation system may then search for similar words for the word based on the converted romaji (S505).

For example, the Japanese automatic recommendation system may search for similar words for the word in consideration of a similarity score of the romanized word. The similarity score may be determined based on at least one of: an input frequency according to the length of the word; an edit distance according to whether the word contains a long vowel, a middle dot, a geminate consonant (sokuon), or a voiced sound (dakuon); or a degree of comparison of the word in its original form.

The Japanese automatic recommendation system may then convert the retrieved similar word into one of the Japanese forms hiragana, katakana, or kanji and recommend it (S506). In doing so, the similar word recommendation unit may convert the retrieved similar word into a form different from the Japanese form of the input word and recommend it.

For example, when the difference between the similarity computed on the romanized form and the similarity computed on the non-romanized form exceeds a preset threshold, the Japanese automatic recommendation system may decline to recommend the similar word. As another example, when the input word is used more frequently than the recommended similar word, the Japanese automatic recommendation system may decline to recommend the similar word.

When the input word was determined to be a typo in step S501, the Japanese automatic recommendation system may select the correct word for that word from among the retrieved similar words, in consideration of the similarity score or an edit distance that depends on the frequency of word appearance (S507).
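The flow of steps S501 to S507 can be summarized as a short driver function. Every callable passed in is a hypothetical stand-in for the corresponding unit, and the function name and return shape are assumptions; the sketch only fixes the order of the steps as the flowchart describes it.

```python
def recommend(word, *, is_typo, is_kanji, kanji_to_hiragana, to_romaji,
              search_similar, to_japanese_form, select_correct):
    """Run the S501-S507 flow with injected (hypothetical) components."""
    typo = is_typo(word)                                          # S501
    kana = kanji_to_hiragana(word) if is_kanji(word) else word    # S502, S503
    romaji = to_romaji(kana)                                      # S504
    similar = search_similar(romaji)                              # S505
    recommended = [to_japanese_form(s) for s in similar]          # S506
    # S507 applies only when the input was judged to be a typo.
    correct = select_correct(recommended) if typo and recommended else None
    return recommended, correct
```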

For details not specifically described with reference to FIG. 5, refer to the descriptions of FIGS. 1 to 4.

The Japanese automatic recommendation method according to an embodiment of the present invention also includes a computer-readable medium containing program instructions for performing various computer-implemented operations. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to those embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be understood only from the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: Japanese automatic recommendation system

101: typo determination unit

102: kanji-hiragana conversion unit

103: romaji conversion unit

104: similar word search unit

105: similar word recommendation unit

106: correct word selection unit

Claims

A roman conversion part for converting a pronunciation of a word expressed in Japanese hiragana form or katakana form into a roman character; And

A similar word search unit for searching similar words for the word based on the converted roman characters

Japanese automatic recommendation system that includes.

The method of claim 1,

The similar word search unit,

Searching for a similar word for the word in consideration of a similarity score of the word converted to the roman character,

The similarity score is,

determined based on at least one of an input frequency according to the length of the word, an edit distance according to whether the word contains a long vowel, a middle dot, a geminate consonant (sokuon), or a voiced sound (dakuon), or a degree of comparison of the original form of the word.

The method of claim 2,

The similar word search unit,

And if the word is a Chinese character, the similarity score is determined in consideration of the comparison result of the form converted to Roman characters, the comparison result of the form converted to Hiragana, and the comparison result of the original form of the Chinese character.

The method of claim 1,

A similar word recommending unit for converting the searched similar words into a Japanese form of any one of the hiragana, katakana, and kanji.

Japanese automatic recommendation system that includes more.

The method of claim 4, wherein

The analogous word recommendation unit,

(1) if the difference between the similarity of the state converted to Roman characters and the similarity of the state not converted to Roman characters exceeds a preset criterion, the similar word is not recommended, or

(2) if the input word is used more frequently than the recommended similar word, the similar word is not recommended.

The method of claim 4, wherein

The analogous word recommendation unit,

converts the retrieved similar word into a form different from the Japanese form of the input word and recommends it.

The method of claim 1,

An error determination unit that analyzes the input word to determine whether the word is a typo

More,

The roman conversion unit,

converts the word into romaji when the input word is a typo.

The method of claim 7, wherein

The typo determination unit,

determines whether the word is a typo in consideration of whether the word is included in preset typo data, whether the input frequency or document appearance frequency of the word is lower than a preset reference frequency, or whether the word is morphologically separated.

The method of claim 7, wherein

When the word is a typo, a correct word selection unit for selecting the correct word for the word among the searched similar words in consideration of the similarity score or the editing distance according to the frequency of input of the word

Japanese automatic recommendation system that includes more.

The method of claim 1,

When the input word is a Chinese character, the Chinese character-hiragana conversion unit divides the word by token using token division learning data, and converts the word into hiragana corresponding to the divided token using Chinese character-Hiragana conversion learning data.

Japanese automatic recommendation system that includes more.

The method of claim 10,

The token split learning data is,

The Japanese automatic recommendation system, characterized in that it is determined through a hidden Markov Model (HMM) -based spacing learning using a corpus divided by the morpheme token of the Chinese character.

The method of claim 10,

The kanji-hiragana conversion learning data,

A bigram dictionary and a unigram dictionary determined through learning based on a corpus separated by morpheme tokens of Chinese characters,

The bigram dictionary is constructed with a frequency between tokens,

The unigram dictionary is a Japanese automatic recommendation system, characterized in that the frequency is built between the token and Hiragana.

The method of claim 12,

The Kanji-Hiragana conversion unit,

The Japanese automatic recommendation system, characterized in that for searching the bigram dictionary for the divided token to select a token representing the maximum probability, and converts the selected token to the hiragana corresponding to the unigram dictionary.

Converting a pronunciation of a word expressed in Japanese hiragana form or katakana form into romaji; And

Searching for similar words for the word based on the converted roman characters

Japanese automatic recommendation method that includes.

The method of claim 14,

Searching for a similar word for the word,

Characterized in that to search for a similar word for the word in consideration of the similarity score of the word converted to the roman alphabet,

The similarity score is,

determined based on at least one of an input frequency according to the length of the word, an edit distance according to whether the word contains a long vowel, a middle dot, a geminate consonant (sokuon), or a voiced sound (dakuon), or a degree of comparison of the original form of the word.

The method of claim 15,

Searching for a similar word for the word,

The method of claim 14,

Recommending the searched analogous word by converting the searched analogous word into a Japanese form of the hiragana letter, katakana word or kanji character.

Japanese automatic recommendation method that includes more.

The method of claim 17,

Recommend by converting the searched analogous words into any one of the Japanese form of the hiragana, katakana or kanji,

(2) Japanese automatic recommendation method, characterized in that the similar words are not recommended if the words are used more than the recommended similar words.

The method of claim 17,

wherein the retrieved similar word is converted into a form different from the Japanese form of the input word.

The method of claim 14,

Analyzing the input words to determine whether the words are typos

More,

Converting the pronunciation of the word to Roman,

comprises converting the word into romaji when the input word is a typo.

21. The method of claim 20,

Determining whether the word is a typo,

comprises determining whether the word is a typo in consideration of whether the word is included in preset typo data, whether the input frequency or document appearance frequency of the word is lower than a preset reference frequency, or whether the word is morphologically separated.

21. The method of claim 20,

If the word is a typo, selecting a correct answer word for the word among the searched similar words in consideration of a similarity score or an edit distance according to a frequency of input of the word;

Japanese automatic recommendation method that includes more.

The method of claim 14,

When the input word is a Chinese character, dividing the word by token using token split learning data and converting the word into hiragana corresponding to the split token using the Chinese character hiragana conversion learning data.

Japanese automatic recommendation method that includes more.

24. The method of claim 23,

The token split learning data is,

The Japanese automatic recommendation method characterized in that it is determined through the spacing learning based on hidden Markov model using the corpus divided by the morpheme token of the kanji.

24. The method of claim 23,

The kanji-hiragana conversion learning data,

Includes a bigram dictionary and a unigram dictionary determined through learning based on a corpus separated by morpheme tokens of Chinese characters,

The bigram dictionary is constructed with a frequency between tokens,

The unigram dictionary is a Japanese automatic recommendation method, characterized in that the frequency is built between the token and Hiragana.

24. The method of claim 23,

Converting to the hiragana corresponding to the divided token,

Searching a bigram dictionary for the divided tokens and selecting a token representing a maximum probability; And

Converting to hiragana corresponding to a unigram dictionary for the selected token

Japanese automatic recommendation method that includes.

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 14 to 26.