JP2011008784A

JP2011008784A - System and method for automatically recommending japanese word by using roman alphabet conversion

Info

Publication number: JP2011008784A
Application number: JP2010141508A
Authority: JP
Inventors: Byeong Il Ko; ビョンイルコ; Yoon Suh Ki; ユンソキ; Tae Il Kim; テイルキム; Hee Cheol Seo; ヒ‐チョルソ
Original assignee: NHN Corp
Current assignee: NHN Corp
Priority date: 2009-06-24
Filing date: 2010-06-22
Publication date: 2011-01-13
Anticipated expiration: 2030-06-22
Also published as: KR101086550B1; JP5097802B2; KR20100138194A

Abstract

PROBLEM TO BE SOLVED: To provide a system and method for automatically recommending Japanese words by using Roman alphabet (romaji) conversion.

SOLUTION: The system for automatically recommending Japanese words includes a romaji conversion unit that converts the pronunciation of a word expressed in Japanese hiragana or katakana into romaji, and a similar word search unit that searches for words similar to the word on the basis of the converted romaji.

Description

The present invention relates to a system and method for recommending similar words for an input Japanese word, and more particularly to a system and method that recommend similar words by converting the pronunciation of the input Japanese word into romaji (Roman letters).

A user searches by entering a word in the search window of a search engine to obtain desired information. If the user mistypes the word, the typo degrades the quality of the retrieved documents (search results) or leaves almost no documents to retrieve. To address this, conventional search engines judge such a word to be a typo and recommend the word (search term) that the user actually intended to enter.

Meanwhile, even when a user enters a word and searches, only in a minority of cases has the user entered the optimal word for obtaining the desired search results. Consequently, even if the search engine returns results for the word as entered, the user is left dissatisfied with them. Conventional search engines therefore try to improve search accuracy by offering related words or similar words for the word the user entered.

However, each of the situations described above has had the following problem, particularly for searches in Japanese. When a Japanese search term entered by the user was judged to be a typo and a correct word was presented, or when similar words for the entered Japanese word were provided, the accuracy of the result could not be guaranteed. Above all, Japanese is written in kanji, hiragana, and katakana, and a single word can take these three forms, which makes it difficult to recommend an appropriate word for the word the user entered. A method is therefore strongly needed that recommends appropriate words whether the Japanese input is in kanji, hiragana, or katakana.

An object of the present invention is to provide a system and method that improve the accuracy of similar word search for Japanese by converting the pronunciation of an input Japanese word into romaji and searching for similar words for the word based on the converted romaji.

Another object of the present invention is to provide a system and method that determine whether an input Japanese word is a typo and, if it is, search for similar words and provide the correct word, so that an appropriate correct word is recommended and search accuracy improves even when the user mistypes the search term (query).

Another object of the present invention is to provide a system and method that, when the input Japanese word is in kanji, split the word into tokens using learning data generated by machine learning and convert the split tokens into hiragana, thereby performing fast and accurate kanji-hiragana conversion.

Another object of the present invention is to provide a system and method that enable the user to search more accurately by searching for and recommending similar words in a form different from the form of the Japanese word the user entered.

An automatic Japanese recommendation system according to an embodiment of the present invention includes a romaji conversion unit that converts the pronunciation of a word expressed in Japanese hiragana or katakana into romaji, and a similar word search unit that searches for similar words for the word based on the converted romaji.

The automatic Japanese recommendation system according to an embodiment of the present invention may further include a similar word recommendation unit that converts the retrieved similar words into any one of the Japanese forms of hiragana, katakana, or kanji and recommends them.

The automatic Japanese recommendation system according to an embodiment of the present invention may further include a typo determination unit that analyzes an input word and determines whether the word is a typo.

The automatic Japanese recommendation system according to an embodiment of the present invention may further include a correct word selection unit that, when the input word is a typo, selects the correct word for the word from among the retrieved similar words based on a similarity score or on an edit distance that depends on the input frequency of words.

The automatic Japanese recommendation system according to an embodiment of the present invention may further include a kanji-hiragana conversion unit that, when the input word is in kanji, splits the word into tokens using token-splitting learning data and converts the split tokens into the corresponding hiragana using kanji-hiragana conversion learning data.

An automatic Japanese recommendation method according to an embodiment of the present invention may include a step of converting the pronunciation of a word expressed in Japanese hiragana or katakana into romaji, and a step of searching for similar words for the word based on the converted romaji.

According to an embodiment of the present invention, the accuracy of similar word search for Japanese can be improved by converting the pronunciation of an input Japanese word into romaji and searching for similar words for the word based on the converted romaji.

According to an embodiment of the present invention, the system determines whether an input Japanese word is a typo and, if it is, searches for similar words and provides the correct word, so that an appropriate correct word is recommended and search accuracy improves even when the user mistypes the search query.

According to an embodiment of the present invention, when the input Japanese word is in kanji, the word is split into tokens using learning data generated through machine learning and the split tokens are converted into hiragana, so that fast and accurate kanji-hiragana conversion can be performed.

According to an embodiment of the present invention, the user can search more accurately because similar words in a form different from the form of the Japanese word the user entered are searched for and recommended.

FIG. 1 is a block diagram showing the overall configuration of an automatic Japanese recommendation system according to an embodiment of the present invention.

FIG. 2 is a diagram showing the process of automatically recommending Japanese words via romaji conversion for an input word according to an embodiment of the present invention.

FIG. 3 is a diagram showing the process of converting kanji into hiragana according to an embodiment of the present invention.

FIG. 4 is a diagram showing an example of conversion into romaji according to an embodiment of the present invention.

FIG. 5 is a flowchart showing the overall process of an automatic Japanese recommendation method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The present invention is not, however, restricted or limited by the embodiments described below. The same reference numerals in the drawings denote the same members.

FIG. 1 is a block diagram showing the overall configuration of an automatic Japanese recommendation system according to an embodiment of the present invention.

The automatic Japanese recommendation system 100 of this embodiment is implemented on a computer, and units 101 to 106, described below, are realized when the control unit (CPU) of the computer loads a predetermined program. In one aspect, the automatic Japanese recommendation system 100 is incorporated into part of a web server that provides a search service or a search server (search engine) that performs search processing using a search term (query), or is configured as a separate computer device connected to the web server or search server.

In this case, the automatic Japanese recommendation system 100 automatically extracts one or more similar words, described below, for a Japanese search term (query) entered from a user terminal into a predetermined search window. The extracted similar words (including the correct word extracted from among them when the search term is judged to be a typo) are displayed on the search results page for the search term. The system 100 can perform this similar word extraction separately from the search server's generation of search results for the search term.

When the system is incorporated into a web server or search server, the server, triggered by entry of the search term into the search window, automatically extracts one or more similar words for the Japanese search term (query), generates a search results page that includes the extracted similar words, and sends the page to the user terminal of the user who requested the search. When the system is connected to the web server or search server as a separate device, it receives the search term entered in the search window from the web server or search server over a network, automatically extracts one or more similar words for the received Japanese search term (query), and sends the extracted similar words to the web server or search server.

As shown in FIG. 1, the automatic Japanese recommendation system 100 of this embodiment includes a typo determination unit 101, a kanji-hiragana conversion unit 102, a romaji conversion unit 103, a similar word search unit 104, a similar word recommendation unit 105, and a correct word selection unit 106.

In a Japanese search, the user enters Japanese text to find desired information, and may enter a Japanese word A 107 in kanji, hiragana, or katakana. The automatic Japanese recommendation system 100 recommends a more accurate Japanese word B 108 by converting the pronunciation of the word 107, in whichever form the user entered it, into romaji.

In one embodiment of the present invention, the typo determination unit 101 determines whether a Japanese word entered by the user on a predetermined screen is a typo. If the user entered a typo, the automatic Japanese recommendation system 100 extracts similar words through the kanji-hiragana conversion unit 102, the romaji conversion unit 103, the similar word search unit 104, and the similar word recommendation unit 105, and the correct word selection unit 106 selects the correct word for the typo from among the similar words extracted for it and provides that word. In another embodiment, when the typo determination unit 101 determines that the entered Japanese word is not a typo, or when the user enters a correct word independently of any determination by the typo determination unit 101, the system 100 can provide similar words through the kanji-hiragana conversion unit 102, the romaji conversion unit 103, the similar word search unit 104, and the similar word recommendation unit 105. The following description focuses on the case where the user enters a typo.

The typo determination unit 101 analyzes the word 107 entered by the user through the user terminal and determines whether the word 107 is a typo. In this case, the romaji conversion unit 103 converts the word 107 into romaji when the word 107 is determined to be a typo.

As one example, the typo determination unit 101 can determine whether the word 107 is a typo based on whether the word 107 entered by the user is included in preset typo data. Specifically, the typo data is determined from previously registered words, a content DB catalog built by the search engine, manual review, and the like, and is stored in a predetermined storage area. The typo determination unit 101 judges the word 107 to be a typo when it is included in this typo data.
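The lookup described above can be sketched as a set membership test. This is a minimal illustration, not the patented implementation: the contents of `TYPO_DATA` and the function name `is_known_typo` are assumptions chosen for the example, and the patent does not specify how the typo data is stored.

```python
# Hypothetical typo data. In the patent it is built from registered words,
# a content DB catalog, manual review, and so on; here it is hard-coded.
TYPO_DATA = frozenset({"ビクカメラ", "オリゴン", "ピートローズ"})


def is_known_typo(word, typo_data=TYPO_DATA):
    """Return True when the entered word appears in the preset typo data."""
    return word in typo_data
```

With this sample data, `is_known_typo("オリゴン")` is true and `is_known_typo("オリコン")` is false.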

As another example, the typo determination unit 101 may determine whether the word 107 is a typo based on whether the input frequency or the document appearance frequency of the word 107 entered by the user is lower than a preset reference frequency.

Here, the input frequency of the word 107 means the number of times the word 107 has been entered by users. That is, the typo determination unit 101 can judge a word 107 with a low input frequency to be a typo. The document appearance frequency means the number of documents extracted as search results when documents are searched using the entered word 107, in other words, the number of documents that contain the word 107. When the number of documents containing the word 107 is smaller than a predetermined reference number, the typo determination unit 101 determines that the word has a low document appearance frequency and judges the word 107 to be a typo. For this purpose, the automatic Japanese recommendation system 100 can include a function that counts the number of inputs for each word users enter and a function that obtains the number of documents containing a word entered by a user. It stores information such as the per-word input count, the document count, and the reference frequency preset for each of them (reference input count, reference document count) in a predetermined storage area.

The typo determination unit 101 can also judge the word 107 to be a typo when its document appearance frequency is lower than its query frequency (the number of times a query using the word 107 has been entered, for example, the input frequency of the word 107). The typo determination unit 101 may further be configured to judge as a typo a run-together word 107 (words joined to one another in sequence) whose document appearance frequency is low.
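The frequency-based checks can be sketched as follows. The thresholds, the `WordStats` record, and the choice to combine the checks with a logical OR are assumptions made for illustration. The patent presents them as alternative examples and gives no concrete reference values.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    input_count: int     # number of times users entered the word
    document_count: int  # number of documents that contain the word


def is_typo_by_frequency(stats, min_inputs=5, min_documents=10):
    """Judge a word a typo from its frequencies (thresholds are assumed)."""
    # Check 1: input frequency or document frequency below its reference value.
    if stats.input_count < min_inputs or stats.document_count < min_documents:
        return True
    # Check 2: the word is queried more often than it appears in documents.
    return stats.document_count < stats.input_count
```

For instance, a word entered 200 times but found in only 50 documents is flagged by the second check, while a word entered 1,000 times and found in 50,000 documents is not flagged.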

As yet another example, the typo determination unit 101 can determine whether the word 107 is a typo based on whether the word 107 entered by the user can be separated into morphemes. If the entered word is separated into morphemes by a morphological analyzer or a part-of-speech tagger, the typo determination unit 101 can determine that the word 107 is not a typo. In other words, a typo cannot be separated into morphemes (the smallest meaningful units of a language, obtained by dividing a word until further division would destroy its meaning), so when the word can be separated into morphemes the typo determination unit 101 can judge the word 107 entered by the user to be correctly spelled (not a typo).

When the input word 107 is in kanji, the kanji-hiragana conversion unit 102 splits the word into tokens using token-splitting learning data. The kanji-hiragana conversion unit 102 then converts the word or character corresponding to each split token into hiragana using kanji-hiragana conversion learning data. Because the same kanji in Japanese is read differently depending on usage, converting kanji into the correct hiragana is important. The detailed processing of the kanji-hiragana conversion unit 102 is described later with reference to FIG. 3.
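The two steps, token splitting followed by per-token reading lookup, can be illustrated with a toy sketch. Note the substitution: the patent uses learning data generated by machine learning, whereas this sketch swaps in a small hand-written dictionary (`READINGS`) and greedy longest-match splitting. All names and entries here are assumptions for illustration only.

```python
# Hypothetical reading dictionary standing in for the learned data.
READINGS = {
    "映画": "えいが",
    "花": "はな",
    "男子": "だんし",
    # The same kanji can read differently in context, so a longer entry wins.
    "花より男子": "はなよりだんご",
}


def split_tokens(word, readings=READINGS):
    """Greedy longest-match split; unknown characters become single tokens."""
    tokens = []
    max_len = max(map(len, readings))
    i = 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in readings:
                tokens.append(piece)
                i += size
                break
        else:
            # No dictionary entry starts here: keep the character as-is.
            tokens.append(word[i])
            i += 1
    return tokens


def kanji_to_hiragana(word, readings=READINGS):
    """Replace each token with its reading; kana tokens pass through."""
    return "".join(readings.get(t, t) for t in split_tokens(word, readings))
```

With this dictionary, `kanji_to_hiragana("花より男子")` yields `はなよりだんご`, while `kanji_to_hiragana("花と男子")` yields `はなとだんし`, which shows why the reading has to be chosen per token in context.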

The romaji conversion unit 103 converts the word 107, expressed in Japanese hiragana or katakana, into romaji based on its pronunciation. When the word 107 is in kanji, the kanji-hiragana conversion unit 102 first converts it into hiragana, and the romaji conversion unit 103 then converts the hiragana word into romaji using the Roman letters that correspond to the pronunciation of each hiragana character. For example, when the input word is the kanji 「映画」, the kanji-hiragana conversion unit 102 converts it into 「えいが」, and the romaji conversion unit 103 converts that into the romaji "eiga" based on the pronunciation of the hiragana word. An example of conversion by the romaji conversion unit 103 is described later with reference to FIG. 4.
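A minimal sketch of kana-to-romaji conversion follows. The table `KANA_TO_ROMAJI` is deliberately partial and covers only the characters needed for the examples in this document. The handling of the small っ (doubling the next consonant) and of the long vowel mark ー (repeating the previous vowel) is an assumed convention, not something the patent specifies.

```python
# Partial, illustrative table; a real converter needs the full syllabary.
KANA_TO_ROMAJI = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
    "が": "ga", "ぎ": "gi", "ぐ": "gu", "げ": "ge", "ご": "go",
    "ら": "ra", "り": "ri", "る": "ru", "れ": "re", "ろ": "ro",
    "は": "ha", "び": "bi", "め": "me", "わ": "wa", "ん": "n",
}


def to_hiragana(text):
    """Map katakana (U+30A1..U+30F6) onto hiragana by a fixed code offset."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def to_romaji(word):
    """Convert a hiragana or katakana word to romaji, syllable by syllable."""
    result = ""
    geminate = False
    for ch in to_hiragana(word):
        if ch == "っ":  # small tsu: double the next consonant
            geminate = True
            continue
        if ch == "ー":  # long vowel mark: repeat the previous vowel
            if result and result[-1] in "aiueo":
                result += result[-1]
            continue
        syllable = KANA_TO_ROMAJI.get(ch, ch)  # unknown characters pass through
        if geminate:
            syllable = syllable[0] + syllable
            geminate = False
        result += syllable
    return result
```

Under these assumptions, `to_romaji("えいが")` gives `eiga`, `to_romaji("オリゴン")` gives `origon`, and `to_romaji("ビックカメラ")` gives `bikkukamera`.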

The similar word search unit 104 searches a predetermined group of similar words for (extracts) similar words for the word 107 based on the romaji produced by the romaji conversion unit 103. As one example, the similar word search unit 104 can extract similar words for the word based on the similarity (similarity score) of the words after conversion into romaji. Measuring the similarity between the input word and a candidate similar word directly in hiragana, katakana, or kanji gives the edit distance very low resolution and reduces accuracy. The present invention therefore converts both words into romaji based on their pronunciation and measures the similarity on the romaji. For example, rather than comparing 「オリゴン」 with 「オリコン」 directly, converting them into romaji and comparing "origon" with "orikon" allows the similarity to be compared more accurately.
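The resolution argument can be made concrete with the standard Levenshtein edit distance. The normalization in `similarity` (one minus distance divided by the longer length) is an assumption chosen for illustration. The patent does not give a formula for the similarity score.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest
```

Both 「オリゴン」/「オリコン」 and "origon"/"orikon" differ by one edit, but the kana pair is only four characters long, so one edit costs a quarter of the word (similarity 0.75). In romaji the same edit is one of six letters (about 0.83), and it touches only the consonant g/k while the shared vowel still counts as a match.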

The similarity score is determined (calculated) based on at least one of the following: the input frequency according to the length of the word, an edit distance that depends on whether the word contains a long vowel mark, middle dot, geminate consonant (sokuon), or voiced sound mark, and the degree of match between the words in their base forms. As one example, when the word is in kanji, the similar word search unit 104 can determine the similarity score based on the comparison result after conversion into romaji (similarity between the words in romaji), the comparison result after conversion into hiragana (similarity between the words in hiragana), and the comparison result in kanji (similarity between the words in their kanji form). The similar word search is described later with reference to FIG. 2.
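One way to combine the three per-form comparisons into a single score is a weighted sum, sketched below. The weights in `WEIGHTS` and the use of `difflib.SequenceMatcher.ratio()` as the per-form similarity are assumptions for illustration. The patent states only that the score can be based on the three comparison results, not how they are combined.

```python
from difflib import SequenceMatcher

# Assumed weights: romaji counts most, the kanji surface form least.
WEIGHTS = {"romaji": 0.5, "hiragana": 0.3, "kanji": 0.2}


def form_similarity(a, b):
    """Stand-in per-form similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_score(query_forms, candidate_forms, weights=WEIGHTS):
    """Weighted sum of the similarities measured in each written form."""
    return sum(
        weight * form_similarity(query_forms[form], candidate_forms[form])
        for form, weight in weights.items()
    )


# Example: the entered form differs only in its surface spelling.
query = {"kanji": "花よりだんご", "hiragana": "はなよりだんご",
         "romaji": "hanayoridango"}
candidate = {"kanji": "花より男子", "hiragana": "はなよりだんご",
             "romaji": "hanayoridango"}
score = similarity_score(query, candidate)
```

Here the hiragana and romaji forms match exactly, so the score stays high even though the surface forms differ.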

The similar word recommendation unit 105 converts the retrieved similar word into a word 108 in any one of the Japanese forms of hiragana, katakana, or kanji and recommends it. The user can then search by entering the recommended word 108. In doing so, the similar word recommendation unit 105 displays the word 108, which is the similar word extracted for the input word 107, on a predetermined page or screen. It may also display the word 108 on the search results page that shows the search server's results, or send the word 108 to the web server or search server that generates the search results page.

As one example, the similar word recommendation unit 105 may convert the retrieved similar word into a word 108 in a form different from the Japanese form of the word 107 the user entered and recommend it. For example, when the user enters a word 107 in hiragana, the similar word recommendation unit 105 may convert the similar word for the input word 107 into a word 108 in kanji and recommend it to the user.

When the word 107 entered by the user is a typo, the correct word selection unit 106 selects the correct word 108 for the word 107 from among the retrieved similar words based on the similarity score or on an edit distance that depends on the input frequency of words. That is, when several similar words are retrieved for the mistyped word 107, the correct word selection unit 106 can select and provide as the correct word 108 the similar word whose similarity score is the highest or exceeds a predetermined reference value, or the similar word whose input frequency exceeds a predetermined reference value. The edit distance is the direct basis for judging the similarity between words: the lower the edit distance, the higher the similarity. An edit distance based on input frequency means, for example, that the higher a word's input frequency, the lower the edit distance assigned to it. In other words, a similar word can be selected as the correct word based on its input frequency.
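The selection step can be sketched as a filter followed by a maximum. The `Candidate` record, the `min_score` threshold, and the use of input frequency as a tie-breaker are assumptions for illustration. The patent lists similarity score and input frequency as alternative criteria without fixing how they interact.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    word: str
    score: float      # similarity score against the mistyped word
    input_count: int  # how often users enter this candidate


def select_correct_word(candidates: Sequence[Candidate],
                        min_score: float = 0.7) -> Optional[str]:
    """Pick the best candidate at or above an assumed score threshold."""
    eligible = [c for c in candidates if c.score >= min_score]
    if not eligible:
        return None  # no sufficiently similar word to recommend
    # Highest similarity first; input frequency breaks ties.
    best = max(eligible, key=lambda c: (c.score, c.input_count))
    return best.word
```

Returning `None` when nothing clears the threshold lets the caller fall back to showing no correction rather than a poor one.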

FIG. 2 is a diagram showing the process of automatically recommending Japanese words by romaji conversion for an input word according to an embodiment of the present invention.

When a Japanese word is entered by the user through the user terminal, the typo determination unit 101 determines whether the entered word is a typo. As described above, the typo determination unit 101 makes this determination based on whether the word is included in the preset typo data, whether the input frequency or document appearance frequency of the word is lower than the preset reference frequency, or whether the word can be separated into morphemes.

When the word entered by the user is determined to be a typo, the correct word selection unit 106 selects, from among the similar words retrieved for the entered word, a similar word that satisfies a predetermined criterion and provides it as the correct word. When the entered word is determined not to be a typo, that is, to be correctly spelled, the correct word selection unit 106 does not operate.

As shown in FIG. 2, the entered Japanese word is in one of hiragana, katakana, or kanji. When the entered word is in hiragana or katakana, the romaji conversion unit 103 converts it into romaji based on the pronunciation of the word as expressed in hiragana or katakana.

When the input word is in kanji form, on the other hand, it is difficult to convert the kanji directly into romaji, so the word can first be normalized into hiragana form by the kanji-hiragana conversion unit 102. Specifically, the kanji-hiragana conversion unit 102 can divide the kanji into tokens using token division learning data, and convert the word or character corresponding to each token into hiragana using kanji-hiragana conversion learning data. The romaji conversion unit 103 then converts the hiragana produced by the kanji-hiragana conversion unit 102 into the romaji corresponding to its pronunciation.

The similar word search unit 104 searches a predetermined group of similar words for words similar to the input word, based on the converted romaji. Specifically, the similar word search unit 104 searches for similar words based on a similarity score computed for the word converted into romaji.

As an example, the similarity score is determined based on at least one of: an input frequency according to the length of the word; an edit distance that depends on whether the word contains a long vowel mark (chōon), a middle dot, a geminate consonant (sokuon), or a voiced sound (dakuon); or the degree of match when the words are compared in their original, unconverted forms.

Word length: information vs. information [edit distance, similarity]
Long vowel mark: ハロワーク (typo), ハロ-ワ-ク (typo), ハローワーク (correct)
Middle dot: ピートローズ (typo), ピート・ローズ (correct)
Semi-voiced sound: オリゴン (typo), オリコン (correct)
Geminate consonant: ビクカメラ (typo), ビックカメラ (correct)
Original form: 花よりだんごファイナル (typo), 花より男子ファイナル (correct)

Because shorter words are entered more frequently, the similar word search unit 104 can assign a higher similarity score the shorter the word is. In other words, the similarity score based on input frequency according to word length is a score assigned according to the length of the word, reflecting the relationship that input frequency increases as word length decreases.

The Japanese long vowel mark (ー) is inserted or deleted more easily than other characters, so when a word contains a long vowel mark the similar word search unit 104 can weight the edit distance downward and thereby increase the similarity score. Specifically, a long vowel mark in a word increases the edit distance by the amount attributable to that mark; because the mark is so easily inserted or deleted, a small weight (for example, a number of at least 0 and less than 1) is applied (multiplied) to the edit distance involving the mark, which reduces the edit distance of the word and raises its similarity score. Similarly, the Japanese middle dot (・) is inserted or deleted more easily than other characters, so the similar word search unit 104 may weight the edit distance downward and increase the similarity score when a word contains a middle dot. Furthermore, the Japanese geminate consonant mark (っ) is often omitted or mistakenly replaced by a similar-sounding character, so the similar word search unit 104 can also weight the edit distance downward and increase the similarity score when the input word contains it.
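The down-weighting described above can be sketched as a Levenshtein distance in which inserting or deleting one of the "easy" characters costs less than a normal edit. The names `CHEAP_CHARS`, `weighted_edit_distance`, and the cost 0.3 are illustrative assumptions; the text only says a weight in the range from 0 up to (but not including) 1 is applied, and it does not say whether the weight acts per operation (as here) or on the whole distance. The sketch also works directly on kana for readability.

```python
# Characters the description treats as easily inserted, dropped, or confused.
CHEAP_CHARS = {"ー", "・", "っ", "ッ"}


def weighted_edit_distance(a: str, b: str, cheap_cost: float = 0.3) -> float:
    """Levenshtein distance where inserting/deleting a CHEAP_CHARS member costs cheap_cost."""
    def indel(ch: str) -> float:
        return cheap_cost if ch in CHEAP_CHARS else 1.0

    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel(cb))

    for ca in a:
        cur = [prev[0] + indel(ca)]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else 1.0)
            delete = prev[j] + indel(ca)
            insert = cur[j - 1] + indel(cb)
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]
```

With these assumed costs, a missing long vowel mark in ハロワーク is penalized far less than the ゴ/コ substitution in オリゴン, which still counts as a full edit.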

In addition to the romaji-converted form, the similar word search unit 104 can reflect the result of comparing the words in their original forms in the similarity score. Comparing the original forms compensates for errors that arise when similar words are searched for in the romaji-normalized state. For example, when the input word is 「うとん」, the similar word search unit 104 assigns a higher similarity score to 「うどん」, whose original form is closer to the input, than to 「うろん」, thereby compensating for errors in judging similarity by romaji conversion alone.
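In romaji, "uton", "udon", and "uron" are all one substitution apart, so the romaji comparison alone cannot separate the two candidates. One way to capture "closeness of the original form", offered purely as an assumption, is to strip voicing marks via Unicode NFD decomposition and check whether the base kana coincide. The function names and the 0/1 bonus are hypothetical; the description does not specify how the original-form comparison is computed.

```python
import unicodedata

# Combining dakuten (U+3099) and handakuten (U+309A).
VOICING_MARKS = {"\u3099", "\u309A"}


def strip_voicing(text: str) -> str:
    """Decompose kana and drop voicing marks, so that ど becomes と."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in VOICING_MARKS)


def original_form_bonus(query: str, candidate: str) -> float:
    """Return 1.0 when the two words differ only by voicing marks, else 0.0."""
    return 1.0 if strip_voicing(query) == strip_voicing(candidate) else 0.0
```

Adding such a bonus to the romaji-based score would rank うどん above うろん for the query うとん, which is the outcome the paragraph describes.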

As a further example, when the word is in kanji, the similar word search unit 104 can determine the similarity score based on the comparison result in the romaji-converted form, the comparison result in the hiragana-converted form, and the comparison result in the original kanji form. Specifically, when the word is in kanji, the similar word search unit 104 can determine the similarity score according to the following Formula 1.

Here, q denotes the Japanese word entered by the user (the query), and t denotes a similar word. a, b, and c are constants, which can be derived by machine learning or a similar technique.
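Formula 1 itself does not survive in this text. One form consistent with the definitions of q, t, a, b, and c, and with the three comparison results named above, would be a weighted linear combination. This is an assumed reading, not the formula from the publication, and the names Sim_romaji, Sim_hiragana, and Sim_kanji are introduced here only for illustration:

```latex
\mathrm{Score}(q, t) = a \cdot \mathrm{Sim}_{\mathrm{romaji}}(q, t)
                     + b \cdot \mathrm{Sim}_{\mathrm{hiragana}}(q, t)
                     + c \cdot \mathrm{Sim}_{\mathrm{kanji}}(q, t)
```

Under this reading, a, b, and c control how much each representation contributes, which matches the statement that they can be learned.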

Once similar words have been retrieved (extracted) through this process, the similar word recommendation unit 105 converts each retrieved similar word into one of the Japanese forms (hiragana, katakana, or kanji) and recommends it, as shown in FIG. 2. For example, when the input word is in hiragana form, the similar word recommendation unit 105 can convert the retrieved similar word into hiragana, katakana, or kanji form and recommend it. That is, the similar word recommendation unit 105 can recommend the retrieved similar word in a form different from the Japanese form of the input word.

As a further example, when the difference between the similarity in the romaji-converted state and the similarity in the unconverted state exceeds a preset criterion, the similar word recommendation unit 105 can decline to recommend the similar word in question even if its similarity in the romaji-converted state is high. As yet another example, the similar word recommendation unit 105 need not recommend a similar word when the input word is used more often than that similar word. That is, the query frequency of the similar word to be recommended is compared with the query frequency (input frequency) of the word 107 entered by the user, and if the similar word has the lower query frequency, the less frequently used similar word is not recommended.
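The two suppression rules above can be sketched as a simple filter. The function name, the parameter names, and the threshold `max_gap` are hypothetical; the description only refers to "a preset criterion".

```python
def should_recommend(sim_romaji: float,
                     sim_original: float,
                     query_freq: int,
                     candidate_freq: int,
                     max_gap: float = 0.3) -> bool:
    """Return False when either suppression rule from the description applies."""
    # Rule 1: the romaji similarity and the unconverted similarity disagree too much.
    if abs(sim_romaji - sim_original) > max_gap:
        return False
    # Rule 2: the user's own word is queried more often than the candidate.
    if query_freq > candidate_freq:
        return False
    return True
```

Whether the gap should be measured as an absolute difference, as assumed here, or only in one direction is not stated in the text.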

When the input word is a typo, the correct word selection unit 106 may select the correct word for the input word from among the retrieved similar words, based on the similarity score or on an edit distance that takes word input frequency into account. Specifically, the correct word selection unit 106 can select, as the correct word, the similar word with the highest similarity score, or a similar word with a high input frequency and a small edit distance.
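A minimal sketch of this selection step follows, assuming the candidates already carry a similarity score, an input frequency, and an edit distance. The `Candidate` type and the tie-breaking order (score first, then frequency, then smaller distance) are assumptions made for illustration; the description presents the criteria as alternatives without fixing a priority.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    word: str
    score: float          # similarity score
    input_freq: int       # how often users enter this word
    edit_distance: float  # distance to the input word


def select_correct_word(candidates: Sequence[Candidate]) -> Optional[str]:
    """Pick the best candidate, or None when nothing was retrieved."""
    if not candidates:
        return None
    best = max(candidates,
               key=lambda c: (c.score, c.input_freq, -c.edit_distance))
    return best.word
```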

FIG. 3 is a diagram illustrating a process of converting kanji into hiragana according to an embodiment of the present invention.

The kanji-hiragana conversion unit 102 according to an embodiment of the present invention converts input kanji into hiragana. The romaji conversion unit 103 may convert into romaji both the hiragana produced by the kanji-hiragana conversion unit 102 and any hiragana or katakana that was input directly.

As an example, the kanji-hiragana conversion unit 102 performs the token division process 305 using the token division learning data 302, dividing the input kanji 304 into tokens. It then performs the kanji-hiragana conversion process 306 using the kanji-hiragana conversion learning data 303, converting the tokens produced by the token division process 305 into the corresponding hiragana 307.

For example, when the input word is 「僕と彼女の生きる道」, the token division learning data 302 is used to divide it into the tokens 「僕、と、彼女、の、生き、る、道」, and the hiragana state sequence with the highest probability is selected from the token bigrams. Specifically, the tokens are converted as 僕 → ぼく, と, 彼女 → かのじょ, の, 生きる → いきる, 道 → みち, and the word is finally converted into the hiragana form 「ぼくとかのじょのいきるみち」.
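The conversion step in this example can be sketched with a toy token-to-reading table. The table contents and counts below are invented for illustration, and the sketch simply picks the most frequent reading per token; it does not implement the bigram-based sequence selection the description mentions, and it assumes the tokens have already been produced by the division step.

```python
from typing import Dict, List

# Hypothetical unigram dictionary: token -> {hiragana reading: count}.
UNIGRAM: Dict[str, Dict[str, int]] = {
    "僕": {"ぼく": 120, "しもべ": 3},
    "彼女": {"かのじょ": 200},
    "生き": {"いき": 90},
    "道": {"みち": 150, "どう": 40},
}


def tokens_to_hiragana(tokens: List[str]) -> str:
    """Replace each token with its most frequent reading; pass kana through unchanged."""
    out = []
    for tok in tokens:
        readings = UNIGRAM.get(tok)
        if readings:
            out.append(max(readings, key=readings.get))
        else:
            out.append(tok)  # already kana, or unknown to the toy table
    return "".join(out)
```

Called with the seven tokens from the example, the function returns ぼくとかのじょのいきるみち.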

The learning data can be determined by creating hiragana learning documents corresponding to the kanji 304 found in Japanese documents 301, such as documents posted on Japanese news sites or Japanese blogs, and then, based on those learning documents, using a predetermined machine learning algorithm to select and combine the hiragana appropriate to the input form.

As an example, the token division learning data 302 can be determined by a word segmentation learning algorithm based on a hidden Markov model (HMM), using a corpus segmented into kanji morpheme tokens. The token division learning data 302 can also be determined by a segmentation learning algorithm based on a syllable trigram HMM.

As a further example, the kanji-hiragana conversion learning data 303 can include a unigram dictionary 303-1 and a bigram dictionary 303-2, both determined by a learning algorithm based on a corpus segmented into morpheme tokens of the kanji 304. In this case, the unigram dictionary 303-1 can be built from frequency counts between tokens and hiragana (token, hiragana), and the bigram dictionary 303-2 can be built from frequency counts between adjacent tokens (token 1, token 2). In short, the kanji-hiragana conversion unit 102 can convert the kanji 304 into the hiragana 307 using the token division learning data 302 and the kanji-hiragana conversion learning data 303, both of which are determined from the Japanese documents 301 by a predetermined learning process.
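Building the two dictionaries from a segmented, reading-annotated corpus amounts to counting pairs. The sketch below assumes a hypothetical corpus format in which each sentence is a list of (token, hiragana reading) pairs; the description does not specify the storage format or any smoothing.

```python
from collections import Counter
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, hiragana reading) pairs


def build_dictionaries(corpus: List[Sentence]) -> Tuple[Counter, Counter]:
    """Count (token, reading) pairs and adjacent (token1, token2) pairs."""
    unigram: Counter = Counter()
    bigram: Counter = Counter()
    for sentence in corpus:
        for token, reading in sentence:
            unigram[(token, reading)] += 1
        for (tok1, _), (tok2, _) in zip(sentence, sentence[1:]):
            bigram[(tok1, tok2)] += 1
    return unigram, bigram
```

Relative frequencies derived from these counts could then serve as the probabilities used when choosing among candidate tokens and readings.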

As another example, for the tokens divided from the kanji 304 based on the token division learning data 302, the kanji-hiragana conversion unit 102 can look up every pair of adjacent tokens in the bigram dictionary 303-2 and select the tokens with the highest probability. The kanji-hiragana conversion unit 102 then converts the finally selected tokens into the hiragana 307 that corresponds to them in the unigram dictionary 303-1. When the bigram dictionary 303-2 does not contain enough information, the kanji-hiragana conversion unit 102 can select the tokens with the highest probability using the unigram dictionary 303-1 instead.

FIG. 4 is a diagram illustrating an example of converting hiragana or katakana into romaji according to an embodiment of the present invention.

The figure shows an example of converting the 「あ」 row and the 「か」 row into romaji. The romaji conversion unit 103 converts the pronunciation of a word expressed in Japanese hiragana or katakana form into romaji. If the input word is in kanji, the kanji-hiragana conversion unit 102 first converts the kanji into hiragana.

As shown in the figure, for the 「あ」 row the romaji conversion unit 103 converts hiragana 「あ」 into romaji "a" and hiragana 「い」 into romaji "i". Likewise, it converts 「う」 into "u", 「え」 into "e", and 「お」 into "o". Through this conversion, the automatic Japanese word recommendation system 100 converts hiragana or katakana into romaji and uses the converted romaji to search more precisely for words similar to the input word.
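The conversion for the two rows mentioned above can be sketched as a lookup table. The table covers only the あ and か rows, and the function names are illustrative; katakana is folded to hiragana using the fixed 0x60 offset between the two Unicode blocks, and characters outside the table pass through unchanged.

```python
# Romaji for the あ row and the か row only; other kana are left as-is.
KANA_TO_ROMAJI = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
}


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters (ァ..ヶ) down by 0x60 to their hiragana counterparts."""
    return "".join(chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
                   for ch in text)


def to_romaji(text: str) -> str:
    """Convert hiragana or katakana to romaji using the partial table above."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in katakana_to_hiragana(text))
```

A full converter would also need the remaining rows, small kana combinations such as きゃ, the geminate mark, and the long vowel mark, none of which this sketch handles.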

As described above, when similar words are searched for using hiragana and katakana directly, the resolution of the edit distance is low, so a machine such as a server, unlike a human, has difficulty distinguishing 「オリゴン」 from 「オリコン」. Comparing the two as the romaji "origon" and "orikon" instead makes it possible to compute a more precise similarity score and improve the accuracy of similar word recommendation.
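The difference in resolution can be seen with a plain Levenshtein distance normalized by string length. The normalization used here, one minus distance over the longer length, is an assumption chosen for illustration; the description does not define how the similarity score is derived from the edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance with unit costs for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # substitute or match
                           prev[j] + 1,               # delete from a
                           cur[j - 1] + 1))           # insert into a
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance into [0, 1]; identical strings score 1.0."""
    longest = max(len(a), len(b), 1)
    return 1.0 - levenshtein(a, b) / longest
```

In kana, the ゴ/コ difference replaces one of four characters; in romaji, the same difference is one letter (g versus k) out of six, so the mismatch is localized to the consonant and the score changes in finer steps.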

FIG. 5 is a flowchart showing the overall processing flow of the automatic Japanese word recommendation method according to an embodiment of the present invention.

Referring to the figure, the automatic Japanese word recommendation system 100 determines whether a word input by the user through a predetermined page or screen displayed on the user terminal is a typo (S501). If the input word is a typo, the automatic Japanese word recommendation system 100 selects the correct word from among the similar words for that word and provides it (S507).

Even when the input word is not a typo but is correctly spelled, the automatic Japanese word recommendation system 100 can automatically recommend similar words for it. The automatic Japanese word recommendation system 100 determines whether the input word is in kanji (S502). If the word is determined to be in kanji, the system converts the kanji into hiragana (S503) and then performs step S504. If the input word is not in kanji, the conversion in step S503 is skipped.

Specifically, when the word is determined to be in kanji, or the input word is determined to contain kanji, the automatic Japanese word recommendation system 100 divides the word into tokens using the token division learning data and then converts each token into the corresponding hiragana using the kanji-hiragana conversion learning data.

The token division learning data can be determined by a word segmentation learning algorithm based on a hidden Markov model, using a corpus segmented into kanji morpheme tokens. The kanji-hiragana conversion learning data can include a bigram dictionary and a unigram dictionary determined by a learning algorithm based on a corpus segmented into kanji morpheme tokens. Here, the bigram dictionary is built from frequency counts between tokens, and the unigram dictionary is built from frequency counts between tokens and hiragana.

In this case, the automatic Japanese word recommendation system 100 looks up the divided tokens in the bigram dictionary, selects the tokens with the highest probability, and converts the selected tokens into the hiragana that corresponds to them in the unigram dictionary.

The automatic Japanese word recommendation system 100 converts the pronunciation of the word expressed in Japanese hiragana or katakana form into romaji (S504). It then searches for words similar to the input word based on the converted romaji (S505).

As an example, the automatic Japanese word recommendation system 100 can search for words similar to the input word based on the similarity score of the word converted into romaji. The similarity score can be determined based on at least one of the following, or on a combination of them: an input frequency according to the length of the word; an edit distance that depends on whether the word contains a long vowel mark, a middle dot, a geminate consonant, or a voiced sound; or the degree of match when the words are compared in their original forms.

The automatic Japanese word recommendation system 100 can also convert each retrieved similar word into one of the Japanese forms (hiragana, katakana, or kanji) and recommend it to the user (S506). In doing so, the similar word recommendation unit 105 can recommend the retrieved similar word in a form different from the Japanese form of the input word.

As another example, the automatic Japanese word recommendation system 100 can be configured not to recommend a similar word when the difference between its similarity in the romaji-converted state and its similarity in the unconverted state exceeds a preset criterion. As yet another example, the system need not recommend a similar word when the input word is used more often than that similar word.

When the input word is determined to be a typo in step S501, the automatic Japanese word recommendation system 100 selects the correct word for the input word from among the retrieved similar words, based on the similarity score or on an edit distance that takes word appearance frequency (for example, word input frequency) into account, and provides it (S507).

For aspects of FIG. 5 not specifically described here, refer to the description of FIGS. 1 to 4.
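The flow of steps S501 to S507 can be sketched as one function that wires the stages together. Every stage is passed in as a callable with a hypothetical name, so this shows only the ordering and branching described for FIG. 5, not any particular implementation of the individual units.

```python
from typing import Callable, Dict, List, Optional


def recommend(word: str, *,
              is_typo: Callable[[str], bool],
              contains_kanji: Callable[[str], bool],
              kanji_to_hiragana: Callable[[str], str],
              to_romaji: Callable[[str], str],
              search_similar: Callable[[str], List[str]],
              pick_correct: Callable[[List[str]], Optional[str]]) -> Dict[str, object]:
    """Run the FIG. 5 steps in order and return recommendations plus any correction."""
    typo = is_typo(word)                                               # S501
    kana = kanji_to_hiragana(word) if contains_kanji(word) else word   # S502, S503
    romaji = to_romaji(kana)                                           # S504
    similar = search_similar(romaji)                                   # S505
    result: Dict[str, object] = {"recommended": similar,               # S506
                                 "correct": None}
    if typo:
        result["correct"] = pick_correct(similar)                      # S507
    return result
```

Similar words are returned whether or not the input is a typo, while the correct word is filled in only on the typo branch, mirroring the two paths in the flowchart description.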

The automatic Japanese word recommendation method according to an embodiment of the present invention also encompasses a computer-readable recording medium containing program instructions for executing various computer-implemented operations. The recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The recording medium and program instructions may be specially designed and configured for the purposes of the present invention, or may be known and available to those skilled in the computer software art. Examples of computer-readable recording media include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The recording medium may also be a transmission medium, such as an optical or metal line or a waveguide, including a carrier wave that transmits signals specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that generated by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the relevant technical field will understand that the invention can be modified and changed in various ways without departing from the technical idea and scope of the invention set forth in the claims. That is, the technical scope of the present invention is defined by the claims and is not limited by the best mode for carrying out the invention.

Description of symbols
100: Automatic Japanese word recommendation system
101: Typo determination unit
102: Kanji-hiragana conversion unit
103: Romaji conversion unit
104: Similar word search unit
105: Similar word recommendation unit
106: Correct word selection unit

Claims

An automatic Japanese word recommendation system, comprising:
a romaji conversion unit that converts the pronunciation of a word expressed in Japanese hiragana or katakana form into romaji; and
a similar word search unit that searches for similar words for the word based on the converted romaji.

The automatic Japanese word recommendation system according to claim 1, wherein the similar word search unit searches for similar words for the word based on a similarity score of the word converted into romaji, and
the similarity score is determined based on at least one of: an input frequency according to the length of the word; an edit distance that depends on whether the word contains a long vowel mark, a middle dot, a geminate consonant, or a voiced sound; or a degree of match when the words are compared in their original forms.

The automatic Japanese word recommendation system according to claim 2, wherein, when the word is in kanji, the similar word search unit determines the similarity score based on a comparison result in the romaji-converted form, a comparison result in the hiragana-converted form, and a comparison result in the original kanji form.

The automatic Japanese word recommendation system according to claim 1, further comprising a similar word recommendation unit that converts the retrieved similar word into any one of the Japanese forms of hiragana, katakana, or kanji and recommends it.

The automatic Japanese word recommendation system according to claim 4, wherein the similar word recommendation unit does not recommend the similar word (1) when the difference between the similarity in the romaji-converted state and the similarity in the unconverted state exceeds a preset criterion, or (2) when the word is used more frequently than the similar word to be recommended.

The automatic Japanese word recommendation system according to claim 4, wherein the similar word recommendation unit converts the retrieved similar word into a form different from the Japanese form of the word and recommends it.

The automatic Japanese word recommendation system according to claim 1, further comprising a typo determination unit that analyzes the input word to determine whether the word is a typo,
wherein the romaji conversion unit converts the word into romaji when the input word is a typo.

The automatic Japanese word recommendation system according to claim 7, wherein the typo determination unit determines whether the word is a typo based on whether the word is included in preset typo data, whether the input frequency or document appearance frequency of the word is lower than a preset reference frequency, or whether the word can be separated into morphemes.

The automatic Japanese word recommendation system according to claim 7, further comprising a correct word selection unit that, when the word is a typo, selects the correct word for the word from among the retrieved similar words based on a similarity score or on an edit distance that takes word input frequency into account.

The automatic Japanese word recommendation system according to claim 1, further comprising a kanji-hiragana conversion unit that, when the input word is in kanji, divides the word into tokens using token division learning data and converts the divided tokens into the corresponding hiragana using kanji-hiragana conversion learning data.

The automatic Japanese word recommendation system according to claim 10, wherein the token division learning data is determined by hidden Markov model-based word segmentation learning using a corpus segmented into kanji morpheme tokens.

The automatic Japanese word recommendation system according to claim 10, wherein the kanji-hiragana conversion learning data includes a bigram dictionary and a unigram dictionary determined by learning based on a corpus segmented into kanji morpheme tokens,
the bigram dictionary is built from frequency counts between tokens, and
the unigram dictionary is built from frequency counts between tokens and hiragana.

The automatic Japanese word recommendation system according to claim 12, wherein the kanji-hiragana conversion unit looks up the divided tokens in the bigram dictionary, selects the tokens with the highest probability, and converts the selected tokens into the hiragana that corresponds to them in the unigram dictionary.

An automatic Japanese word recommendation method whose steps are performed by a computer, the method comprising:
converting the pronunciation of a word expressed in Japanese hiragana or katakana form into romaji; and
searching for similar words for the word based on the converted romaji.

The automatic Japanese word recommendation method according to claim 14, wherein the step of searching for similar words searches for similar words for the word based on a similarity score of the word converted into romaji, and
the similarity score is determined based on at least one of: an input frequency according to the length of the word; an edit distance that depends on whether the word contains a long vowel mark, a geminate consonant, or a voiced sound; or a degree of match when the words are compared in their original forms.

The automatic Japanese word recommendation method according to claim 15, wherein, when the word is in kanji, the step of searching for similar words determines the similarity score based on a comparison result in the romaji-converted form, a comparison result in the hiragana-converted form, and a comparison result in the original kanji form.

The automatic Japanese word recommendation method according to claim 14, further comprising the step of converting the retrieved similar word into any one of the Japanese forms of hiragana, katakana, or kanji and recommending it.

The automatic Japanese word recommendation method according to claim 17, wherein the step of converting the retrieved similar word into any one of the Japanese forms of hiragana, katakana, or kanji and recommending it does not recommend the similar word (1) when the difference between the similarity in the romaji-converted state and the similarity in the unconverted state exceeds a preset criterion, or (2) when the word is used more frequently than the similar word to be recommended.

The automatic Japanese recommendation method according to claim 17, wherein the step of converting the retrieved similar word into the Japanese form of any one of hiragana, katakana, or kanji and recommending it converts the retrieved similar word into a form different from the Japanese form of the word and recommends it.

The automatic Japanese recommendation method according to claim 14, further comprising analyzing the input word to determine whether the word is a typo,
wherein the step of converting the pronunciation of the word into romaji converts the word into romaji when the input word is a typo.

The automatic Japanese recommendation method according to claim 20, wherein the step of determining whether the word is a typo makes the determination based on whether the word is included in preset typo data, whether the input frequency or the document appearance frequency of the word is lower than a preset reference frequency, or whether the word is separated into morphemes.
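
The three typo signals named in this claim can be sketched as a single predicate. All parameter names are hypothetical, and the handling of the morpheme signal is an assumption: the claim only says that morpheme separation is considered, so this sketch treats a word that a caller-supplied `segment` function cannot split into morphemes (returns `None`) as suspect.

```python
def is_typo(word, typo_data, input_freq, doc_freq, reference_freq, segment):
    """Return True if any of the three signals flags `word` as a typo.

    typo_data      -- set of known typos (preset typo data)
    input_freq     -- dict: word -> how often users typed it
    doc_freq       -- dict: word -> how often it appears in documents
    reference_freq -- preset reference frequency
    segment        -- hypothetical morphological analyzer; returns a list of
                      morphemes, or None when the word cannot be analyzed
    """
    # Signal 1: the word is listed in the preset typo data.
    if word in typo_data:
        return True
    # Signal 2: input frequency or document frequency is below the reference.
    if input_freq.get(word, 0) < reference_freq or doc_freq.get(word, 0) < reference_freq:
        return True
    # Signal 3 (assumed direction): the analyzer fails to split the word.
    return segment(word) is None
```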

The automatic Japanese recommendation method according to claim 20, further comprising selecting, when the word is a typo, a correct word for the word from among the retrieved similar words based on a similarity score or an edit distance based on a word input frequency.

The automatic Japanese recommendation method according to claim 14, further comprising, when the input word is kanji, dividing the word into tokens using token division learning data, and converting the divided tokens into the corresponding hiragana using kanji-hiragana conversion learning data.

The automatic Japanese recommendation method according to claim 23, wherein the token division learning data is determined by hidden Markov model-based segmentation learning using a corpus in which kanji are separated into morpheme tokens.
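
To illustrate how parameters learned by a hidden Markov model could be used to divide a kanji word into tokens, the sketch below runs Viterbi decoding over two states, "B" (character begins a token) and "I" (character continues a token). The two-state design and every probability shown are invented for this example; in the claimed method such values would come from learning on the segmented corpus.

```python
def viterbi_segment(text, start, trans, emit, floor=1e-6):
    """Split `text` into tokens with a two-state (B/I) HMM and Viterbi decoding."""
    if not text:
        return []
    states = ("B", "I")
    # best[s] = (probability, tag path) of the best path ending in state s.
    best = {s: (start[s] * emit[s].get(text[0], floor), [s]) for s in states}
    for ch in text[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s].get(ch, floor), path + [s])
        best = new_best
    tags = max(best.values(), key=lambda item: item[0])[1]
    tokens = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not tokens:
            tokens.append(ch)      # start a new token
        else:
            tokens[-1] += ch       # extend the current token
    return tokens


# Hypothetical parameters, standing in for values learned from a corpus.
START = {"B": 1.0, "I": 0.0}
TRANS = {"B": {"B": 0.4, "I": 0.6}, "I": {"B": 0.7, "I": 0.3}}
EMIT = {
    "B": {"東": 0.4, "京": 0.05, "都": 0.3, "庁": 0.05},
    "I": {"東": 0.05, "京": 0.4, "都": 0.1, "庁": 0.4},
}
print(viterbi_segment("東京都庁", START, TRANS, EMIT))
```

For short words, multiplying raw probabilities is adequate; a longer input would normally use log probabilities to avoid underflow.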

The automatic Japanese recommendation method according to claim 23, wherein the kanji-hiragana conversion learning data includes a bigram dictionary and a unigram dictionary determined by learning based on a corpus in which kanji are separated into morpheme tokens,
the bigram dictionary is built from frequencies between tokens, and
the unigram dictionary is built from frequencies between tokens and hiragana.
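
As a rough illustration of the two dictionaries, the sketch below counts token-to-token frequencies (bigram) and token-to-hiragana frequencies (unigram) from a segmented corpus. The corpus format, a list of sentences each given as `(token, hiragana)` pairs, is an assumption made for this example.

```python
from collections import Counter, defaultdict


def build_dictionaries(corpus):
    """Build (bigram, unigram) frequency dictionaries from a segmented corpus.

    corpus -- list of sentences; each sentence is a list of
              (token, hiragana_reading) pairs (assumed format).
    """
    bigram = defaultdict(Counter)   # bigram[prev][cur] = count of prev followed by cur
    unigram = defaultdict(Counter)  # unigram[token][reading] = count of that reading
    for sentence in corpus:
        for token, reading in sentence:
            unigram[token][reading] += 1
        for (prev, _), (cur, _) in zip(sentence, sentence[1:]):
            bigram[prev][cur] += 1
    return bigram, unigram


corpus = [
    [("東京", "とうきょう"), ("都庁", "とちょう")],
    [("東京", "とうきょう"), ("都", "と")],
]
bigram, unigram = build_dictionaries(corpus)
```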

The automatic Japanese recommendation method according to claim 23, wherein the step of converting the divided tokens into the corresponding hiragana comprises:
searching the bigram dictionary for the divided tokens to select the token showing the maximum probability; and
converting the selected token into the corresponding hiragana using the unigram dictionary.
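
The claim wording leaves open exactly how the bigram lookup selects a token. One possible reading, sketched below, is that several candidate token sequences are scored with bigram frequencies, the most probable sequence is kept, and each of its tokens is then replaced by its most frequent hiragana reading from the unigram dictionary. The function names, the dictionary layout, and the candidate-sequence input are all assumptions of this sketch.

```python
def sequence_probability(tokens, bigram_counts):
    """Product of relative bigram frequencies over adjacent token pairs."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        following = bigram_counts.get(prev, {})
        total = sum(following.values())
        # An unseen predecessor or pair makes the whole sequence impossible.
        prob *= (following.get(cur, 0) / total) if total else 0.0
    return prob


def to_hiragana(candidates, bigram_counts, unigram_counts):
    """Pick the most probable token sequence, then read each token in hiragana."""
    best = max(candidates, key=lambda toks: sequence_probability(toks, bigram_counts))
    # For each token, take the reading with the highest unigram frequency.
    return "".join(max(unigram_counts[t], key=unigram_counts[t].get) for t in best)


# Hypothetical dictionary contents for illustration only.
bigram_counts = {"東京": {"都庁": 8, "都": 2}, "東": {"京都": 1}}
unigram_counts = {
    "東京": {"とうきょう": 10},
    "都庁": {"とちょう": 5, "みやこちょう": 1},
}
candidates = [["東京", "都庁"], ["東", "京都", "庁"]]
print(to_hiragana(candidates, bigram_counts, unigram_counts))
```

Note that a single-token candidate has no adjacent pairs and therefore scores 1.0 in this sketch, so a practical version would need smoothing or a length-aware score.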

A computer-readable recording medium recording a program for causing a computer to execute the method according to any one of claims 14 to 26.