JP5470620B2

JP5470620B2 - Different notation acquisition device, different notation acquisition method, and program

Info

Publication number: JP5470620B2
Application number: JP2009299287A
Authority: JP
Inventors: 真樹村田; 正裕小島; 健太郎鳥澤; 淳一風間; 航黒田; 篤藤田
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2009-12-30
Filing date: 2009-12-30
Publication date: 2014-04-16
Anticipated expiration: 2029-12-30
Also published as: JP2011138440A

Description

The present invention relates to a different notation acquisition device and the like that extracts pairs of terms written in different notations.

Conventional techniques for extracting different notations include the research of Aramaki et al. (see Non-Patent Document 1), which extracted different notations of technical terms in the medical field. Here, a different notation is a term that is synonymous with a given term but written differently, for example 「スパゲッティ」 for 「スパゲティ」 (both meaning "spaghetti"). Two terms that are different notations of each other are called a different notation pair.

A first way of defining a different notation pair is as follows. Take, for example, the term pairs of Example 1 (問い合わせメール, 問合わせメール; two spellings of "inquiry email") and Example 2 (学園闘争, 学園紛争; "campus struggle" and "campus dispute"). Example 1 is treated as a different notation pair, whereas Example 2 is treated as a Japanese synonym pair rather than a different notation pair. That is, under the first definition, two terms form a different notation pair if they are variant forms of the same word, and do not form one if they are different words, even when their meanings are equivalent. 闘争 ("struggle") and 紛争 ("dispute") have almost the same meaning but are not the same word, so Example 2 is not a different notation pair. In contrast, 「問い合わせ」 and 「問合わせ」 in Example 1 are written differently but can be judged to be variant forms of the same word, so Example 1 is a different notation pair.

A second way of defining a different notation pair also treats synonyms as different notations. Under the second definition, not only Example 1 above but also Example 2 (学園闘争, 学園紛争) is a different notation pair.

Furthermore, a definition similar to those above may also be used; the terms "different notation" and "different notation pair" are to be interpreted broadly.

In addition, techniques relating to machine learning methods exist as conventional art (see, for example, Non-Patent Documents 2 to 4).

Non-Patent Document 1: Eiji Aramaki, Takeshi Imai, Kengo Miyo, Kazuhiko Ohe: Orthographic Disambiguation Incorporating Transliterated Probability, International Joint Conference on Natural Language Processing (IJCNLP2008), pp.48-55, 2008.
Non-Patent Document 2: Masaki Murata: Language processing based on machine learning, Faculty of Science and Engineering, Ryukoku University, invited lecture, 2004. http://www2.nict.go.jp/x/x161/member/murata/ps/kougi-ml-siryou-new2.pdf
Non-Patent Document 3: Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hitoshi Isahara: Japanese-English translation of tense, aspect, and modality using support vector machines, IEICE Language Understanding and Communication Study Group, NLC2000-78, 2001.
Non-Patent Document 4: Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara: CRL's approach to the SENSEVAL2J dictionary task, IEICE Language Understanding and Communication Study Group, NLC2001-40, 2001.

However, the conventional techniques do not handle general different notations in Japanese, and applying them to general Japanese different notations did not yield sufficient extraction accuracy.

The different notation acquisition device of the first invention comprises: a term pair storage unit that can store one or more term pairs having an edit distance of 1 or more; a learning data storage unit that can store two or more pieces of learning data, each associating a plurality of features with positive/negative information indicating whether a term pair is a different notation pair, the plurality of features including one or more of a character-type-related feature, which is a feature concerning the character type of the edited portion (the characters that differ between the two terms of the pair), a dictionary-related feature, which is a feature acquired using a term dictionary, and a similarity feature, which is a feature indicating the similarity between the two terms of the pair; a feature acquisition unit that acquires, for each term pair in the term pair storage unit, a plurality of features including one or more of the character-type-related feature, the dictionary-related feature, and the similarity feature; a machine learning unit that determines, for each term pair, by a supervised machine learning method using the two or more pieces of learning data in the learning data storage unit and the plurality of features acquired by the feature acquisition unit, whether the term pair in the term pair storage unit is a different notation pair; and an output unit that outputs the determination result of the machine learning unit.

With this configuration, different notation pairs can be extracted with high accuracy regardless of the field of the term pairs.

The different notation acquisition device of the second invention is the device of the first invention in which the character-type-related feature is information indicating whether the edited portions of the two terms of a term pair differ in character type and are numbers, and the feature acquisition unit determines, for each term pair in the term pair storage unit, whether the pair satisfies the condition that the edited portions of its two terms differ in character type and are numbers of the same value, and acquires the determination result as the character-type-related feature.

The different notation acquisition device of the third invention is the device of the first invention in which the character-type-related feature is information indicating whether the character type of the two terms of a term pair is Roman letters and the edited portions of the two terms differ only in upper and lower case, and the feature acquisition unit determines, for each term pair in the term pair storage unit, whether the pair satisfies the condition that the character type of the edited portions of its two terms is Roman letters and the edited portions differ only in upper and lower case, and acquires the determination result as the character-type-related feature.

The different notation acquisition device of the fourth invention is the device of the first invention further comprising a term dictionary that can store one or more pieces of term information, each having a term and a representative notation of the term, wherein the dictionary-related feature is information indicating whether the representative notations of the two terms of a term pair are identical, and the feature acquisition unit acquires, for each term pair in the term pair storage unit, the representative notations of its two terms from the term dictionary, determines whether the two acquired representative notations are identical, and acquires the determination result as the dictionary-related feature.

The different notation acquisition device of the fifth invention is the device of the first invention in which the dictionary-related feature is the result of determining, using a stacking algorithm, whether a term pair is a different notation pair by a classification method different from the supervised machine learning method, or by the same classification method applied with different learning data, and the feature acquisition unit determines, for each term pair in the term pair storage unit, whether the pair is a different notation pair by such a classification method and acquires the determination result as the dictionary-related feature.

The different notation acquisition device of the sixth invention is the device of the first invention further comprising a term dictionary that can store one or more pieces of term information, each having a term and a reading of the term, wherein the dictionary-related feature is information indicating whether the readings of the two terms of a term pair match, and the feature acquisition unit acquires, for each term pair in the term pair storage unit, the readings of its two terms from the term dictionary, determines whether the two readings match, and acquires the determination result as the dictionary-related feature.

The different notation acquisition device of the seventh invention is the device of any of the first to sixth inventions in which the machine learning unit determines whether each term pair in the term pair storage unit is a different notation pair and also acquires a score indicating the certainty that the pair is a different notation pair, and the output unit outputs the score acquired by the machine learning unit.

With this configuration, different notation pairs can be extracted with even higher accuracy regardless of the field of the term pairs.

The different notation acquisition device of the eighth invention is the device of the seventh invention in which the output unit comprises: threshold storage means that stores a threshold for the score; threshold determination means that determines whether the score acquired by the machine learning unit is equal to or greater than the threshold, or greater than the threshold; and output means that treats each term pair whose score the threshold determination means has determined to be equal to or greater than the threshold, or greater than the threshold, as having been determined to be a different notation pair, and outputs one or more of the determination result, the different notation pairs, and the term pairs that are not different notation pairs.
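The threshold determination described for the eighth invention can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_threshold`, the `inclusive` flag (which selects between "equal to or greater than" and "greater than"), and the scores are all assumptions introduced here.

```python
from typing import Iterable

TermPair = tuple[str, str]


def split_by_threshold(
    scored_pairs: Iterable[tuple[TermPair, float]],
    threshold: float,
    inclusive: bool = True,
) -> tuple[list[TermPair], list[TermPair]]:
    """Return (different notation pairs, other pairs) for the given threshold."""
    positives: list[TermPair] = []
    negatives: list[TermPair] = []
    for pair, score in scored_pairs:
        # inclusive=True means "score >= threshold"; False means "score > threshold".
        passed = score >= threshold if inclusive else score > threshold
        (positives if passed else negatives).append(pair)
    return positives, negatives


# Scores below are made-up illustration values, not results from the patent.
scored = [
    (("問い合わせメール", "問合わせメール"), 0.9),
    (("学園闘争", "学園紛争"), 0.5),
]
variants, others = split_by_threshold(scored, threshold=0.5, inclusive=False)
```

With the strict comparison, the pair whose score equals the threshold falls on the "not a different notation pair" side; with `inclusive=True` it would be accepted.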

The different notation acquisition device of the ninth invention comprises: a learning data storage unit that can store two or more pieces of learning data, each associating a plurality of features with positive/negative information indicating whether a term pair is a different notation pair, the plurality of features including one or more of a character-type-related feature, which is a feature concerning the character type of the edited portion (the characters that differ between the two terms of the pair), a dictionary-related feature, which is a feature acquired using a term dictionary, and a similarity feature, which is a feature indicating the similarity between the two terms of the pair; a different notation pattern storage unit that can store one or more different notation patterns, each having as a pair a first character string and a second character string that indicate a pattern of different notation; a reception unit that receives one or more terms; a term pair generation unit that applies each of the one or more different notation patterns in the different notation pattern storage unit to each of the one or more terms received by the reception unit to generate one or more terms, and generates one or more different notation candidate term pairs, each being a candidate different notation pair that has a received term and a term generated from it; a feature acquisition unit that acquires, for each different notation candidate term pair generated by the term pair generation unit, a plurality of features including one or more of the character-type-related feature, the dictionary-related feature, and the similarity feature; a machine learning unit that determines, for each different notation candidate term pair generated by the term pair generation unit, by a supervised machine learning method using the two or more pieces of learning data in the learning data storage unit and the plurality of features acquired by the feature acquisition unit, whether each different notation candidate term pair in the term pair storage unit is a different notation pair; and an output unit that outputs the determination result of the machine learning unit.

With this configuration, candidate different notation pairs can be generated automatically.

The different notation acquisition device of the tenth invention is the device of the ninth invention further comprising: a different notation pair storage unit that can store one or more different notation pairs having an edit distance of 1; an edited portion acquisition unit that acquires the edited portion of each of the one or more different notation pairs stored in the different notation pair storage unit; a different notation pattern acquisition unit that acquires, from the edited portion acquired by the edited portion acquisition unit, a different notation pattern having a first character string and a second character string as a pair; and a different notation pattern accumulation unit that accumulates the different notation pattern acquired by the different notation pattern acquisition unit in the different notation pattern storage unit.

With this configuration, the different notation patterns used to automatically generate candidate different notation pairs can themselves be acquired automatically.
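The pattern acquisition of the tenth invention and the candidate generation of the ninth invention can be sketched together as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the edited portion is found by stripping the common prefix and suffix, pure insertions and deletions (where one side of the pattern would be empty) are skipped, and each pattern is applied in both directions. The patent text above does not specify any of these choices.

```python
def edited_parts(a: str, b: str) -> tuple[str, str]:
    """Strip the common prefix and suffix; what remains is the edited portion."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    j = 0
    while j < limit - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[i:len(a) - j], b[i:len(b) - j]


def acquire_patterns(variant_pairs: list[tuple[str, str]]) -> set[tuple[str, str]]:
    """Collect (first string, second string) patterns from known variant pairs."""
    patterns = set()
    for a, b in variant_pairs:
        first, second = edited_parts(a, b)
        if first and second:  # assumption: skip pure insertions and deletions
            patterns.add((first, second))
    return patterns


def generate_candidates(term: str, patterns: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Apply every pattern, in both directions, at every matching position."""
    candidates = set()
    for first, second in patterns:
        for src, dst in ((first, second), (second, first)):
            start = term.find(src)
            while start != -1:
                generated = term[:start] + dst + term[start + len(src):]
                candidates.add((term, generated))
                start = term.find(src, start + 1)
    return candidates


patterns = acquire_patterns([("ウィルス", "ウイルス")])
candidates = generate_candidates("ウィンドウ", patterns)
```

The generated pairs are only candidates; in the device described above they would still be passed to the feature acquisition unit and the machine learning unit for judgment.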

The different notation acquisition device of the eleventh invention is the device of any of the first to tenth inventions in which the edit distance of the term pair is 2; the feature acquisition unit comprises difference character acquisition means that acquires each of the two sets of difference characters of the term pair, and feature acquisition means that treats the two difference characters acquired by the difference character acquisition means independently and acquires two sets of a plurality of features, each including one or more of the character-type-related feature, the dictionary-related feature, and the similarity feature; and the machine learning unit determines, for each of the two sets of features acquired by the feature acquisition means, by a supervised machine learning method using the features of that set and the two or more pieces of learning data in the learning data storage unit, whether the set of features corresponds to a different notation pair, and uses the two determination results to determine whether the term pair with an edit distance of 2 is a different notation pair.

With this configuration, whether a term pair is a different notation pair can be determined with high accuracy even when its edit distance is 2.
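The two-step judgment for term pairs with an edit distance of 2 might be sketched as below. Several points are assumptions made only for illustration: the sketch handles only equal-length terms whose two differences are substitutions, it combines the two results with a logical AND (the text above says only that the two determination results are used), and `toy_judge` is a hypothetical stand-in for the trained classifier that looks up character pairs in a small hand-made set.

```python
from typing import Callable


def difference_characters(a: str, b: str) -> list[tuple[str, str]]:
    """Return the differing character pairs of two equal-length terms."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles terms of equal length")
    return [(ca, cb) for ca, cb in zip(a, b) if ca != cb]


def judge_distance_two(a: str, b: str, judge_one: Callable[[str, str], bool]) -> bool:
    """Judge each difference independently, then combine (assumed: logical AND)."""
    diffs = difference_characters(a, b)
    if len(diffs) != 2:
        raise ValueError("expected exactly two differing characters")
    return all(judge_one(ca, cb) for ca, cb in diffs)


# Hypothetical stand-in for the supervised classifier: a lookup of character
# substitutions assumed to be orthographic variants.
KNOWN_VARIANT_CHARS = {("ィ", "イ"), ("３", "三")}


def toy_judge(ca: str, cb: str) -> bool:
    return (ca, cb) in KNOWN_VARIANT_CHARS or (cb, ca) in KNOWN_VARIANT_CHARS


is_variant = judge_distance_two("ウィスキィ", "ウイスキイ", toy_judge)
is_not_variant = judge_distance_two("学園闘争", "学校紛争", toy_judge)
```

In the device itself, each of the two judgments would be made from a full feature set rather than from the characters alone.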

According to the different notation acquisition device of the present invention, different notation pairs can be extracted with high accuracy regardless of the field of the term pairs.

Block diagram of the different notation acquisition device in Embodiment 1 of the present invention
Flowchart explaining the operation of the different notation acquisition device
Flowchart explaining the feature acquisition processing
Diagram showing an example of the term dictionary
Diagram showing the concept of margin maximization in the support vector machine method
Diagram showing the breakdown, among the Japanese term pairs with an edit distance of 1 used in the experiment, of the pairs judged by majority vote to be or not to be Japanese different notation pairs
Diagram showing the agreement evaluation method of Landis et al.
Diagram showing the results of applying the baseline method to the closed data and the open data
Diagram showing the results of examining, by the bootstrap method, whether each feature is effective
Diagram showing examples of features
Diagram showing the results of examining what proportion of the term pairs that the proposed method classified, from the large-scale similar word list, as Japanese different notation pairs with an edit distance of 1 are contained in various dictionaries
Diagram showing the classification accuracy of the SVM
Diagram showing, for the various dictionaries and the term pair DB, five randomly selected Japanese term pairs that were classified as Japanese different notation pairs with an edit distance of 1 and five that were not
Diagram showing the evaluation criteria for the threshold
Diagram showing the ratio of recall to precision
Diagram showing experimental results obtained with the baseline method
Diagram showing experimental results obtained with the baseline method
Diagram showing experimental results obtained with the baseline method
Block diagram of the different notation acquisition device in Embodiment 2 of the present invention
Flowchart explaining the operation of the different notation acquisition device
Diagram showing examples of different notation patterns
External view of the computer system in the above embodiments
Block diagram of the computer system

Embodiments of the different notation acquisition device and the like are described below with reference to the drawings. Components given the same reference numeral in the embodiments operate in the same way, so repeated description of them may be omitted.

(Embodiment 1)

This embodiment describes a different notation acquisition device that extracts, from term pairs having an edit distance of 1, or of 2 or more, a plurality of features including one or more of a feature concerning the character type of the differing characters, a feature acquired using a term dictionary, and the similarity between the two terms, and uses those features to determine, by a supervised machine learning method, whether each term pair is a different notation pair.

FIG. 1 is a block diagram of the different notation acquisition device 1 in this embodiment.
The different notation acquisition device 1 includes a term pair storage unit 11, a learning data storage unit 12, a term dictionary 13, a reception unit 14, a feature acquisition unit 15, a machine learning unit 16, and an output unit 17. The feature acquisition unit 15 includes difference character acquisition means 151 and feature acquisition means 152. The output unit 17 includes threshold storage means 171, threshold determination means 172, and output means 173.

The term pair storage unit 11 can store one or more term pairs having an edit distance of 1, or of 2 or more. The edit distance is the number of characters that differ. A term pair is a pair of two terms. A term pair with an edit distance of 2 consists of two terms that differ in two characters. A term is usually a noun or a noun phrase, but may belong to another part of speech, such as an adjective.
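As a concrete illustration of the edit distance, the following sketch computes it for two of the term pairs mentioned earlier. It assumes the standard Levenshtein distance, in which each insertion, deletion, or substitution of one character costs 1; the text above says only that the edit distance is the number of differing characters, so this reading, like the function name `edit_distance`, is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two terms (assumed definition)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(
                prev[j] + 1,         # delete ca
                cur[j - 1] + 1,      # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


print(edit_distance("スパゲティ", "スパゲッティ"))  # one inserted character
print(edit_distance("学園闘争", "学園紛争"))        # one substituted character
```

Both pairs have an edit distance of 1, so both would be eligible for storage in the term pair storage unit 11; whether they are different notation pairs is a separate question that the later units decide.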

The learning data storage unit 12 can store two or more pieces of learning data. Each piece of learning data has a plurality of features of a term pair and positive/negative information, and may also have the term pair itself. Here, the plurality of features of a term pair is assumed to include one or more of the character-type-related feature, the dictionary-related feature, and the similarity feature. A feature is information that the different notation acquisition device 1 uses as a clue when learning.

The character-type-related feature is a feature concerning the character type of the edited portion, that is, the characters that differ between the two terms of a term pair. The character-type-related feature is, for example, information indicating whether the edited portions of the two terms differ in character type and are numbers of the same value. It may also be information indicating whether the two terms have the same number of characters, their edited portions differ in character type, and the edited portions are numbers of the same value. It may also be information indicating whether the character type of the edited portions of the two terms is Roman letters and the edited portions differ only in upper and lower case. It may also be information indicating whether the two terms have the same number of characters, the character type of their edited portions is Roman letters, and the edited portions differ only in upper and lower case.
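The upper/lower case variants of the character-type-related feature can be sketched as follows. The sketch rests on assumptions not stated in the text: the edited portion is taken to be what remains after stripping the common prefix and suffix, only ASCII letters count as Roman letters (full-width letters are not handled), and all function names are hypothetical.

```python
def edited_parts(a: str, b: str) -> tuple[str, str]:
    """Strip the common prefix and suffix; what remains is the edited portion."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    j = 0
    while j < limit - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[i:len(a) - j], b[i:len(b) - j]


def is_roman(s: str) -> bool:
    """True for a non-empty string of ASCII letters (assumption: ASCII only)."""
    return s != "" and all(c.isascii() and c.isalpha() for c in s)


def case_difference_feature(a: str, b: str, require_same_length: bool = False) -> bool:
    """True if the edited portions are Roman letters differing only in case."""
    if require_same_length and len(a) != len(b):
        return False
    x, y = edited_parts(a, b)
    return is_roman(x) and is_roman(y) and x != y and x.lower() == y.lower()


print(case_difference_feature("Webサイト", "WEBサイト"))
print(case_difference_feature("学園闘争", "学園紛争"))
```

The `require_same_length` flag switches between the variant that also requires the two terms to have the same number of characters and the one that does not.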

The dictionary-related feature is a feature acquired using the term dictionary 13. The dictionary-related feature is, for example, the result of determining, using a stacking algorithm, whether a term pair is a different notation pair by a classification method different from the supervised machine learning method used by the machine learning unit 16, or by the same classification method applied with different learning data. Here, "different learning data" covers, for example, the case where the set of term pairs underlying the learning data differs and the case where the features contained in the learning data differ. The dictionary-related feature may also be information indicating whether the representative notations of the two terms of a term pair are identical. It may also be information indicating whether the readings of the two terms match. It may also be information indicating whether the two terms have the same number of characters and their readings match. A classification method here means a method of classifying whether a term pair is a different notation pair, and a classification method different from the supervised machine learning method means one whose classification procedure or algorithm differs from that of the supervised machine learning method.
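The representative-notation and reading variants of the dictionary-related feature can be illustrated with a small lookup. Everything concrete here is an assumption for illustration: the `TermInfo` structure, the dictionary entries, the function names, and the choice to return `False` when either term is missing from the dictionary.

```python
from typing import NamedTuple


class TermInfo(NamedTuple):
    representative: str  # representative notation of the term
    reading: str         # reading of the term, in hiragana


# Hypothetical term dictionary; these entries are illustration data only.
TERM_DICTIONARY = {
    "問い合わせ": TermInfo("問い合わせ", "といあわせ"),
    "問合わせ": TermInfo("問い合わせ", "といあわせ"),
    "闘争": TermInfo("闘争", "とうそう"),
    "紛争": TermInfo("紛争", "ふんそう"),
}


def same_representative(a: str, b: str, dictionary=TERM_DICTIONARY) -> bool:
    """True if both terms are in the dictionary with the same representative notation."""
    info_a, info_b = dictionary.get(a), dictionary.get(b)
    if info_a is None or info_b is None:
        return False  # assumption: an unknown term yields a negative feature value
    return info_a.representative == info_b.representative


def same_reading(a: str, b: str, dictionary=TERM_DICTIONARY) -> bool:
    """True if both terms are in the dictionary with the same reading."""
    info_a, info_b = dictionary.get(a), dictionary.get(b)
    if info_a is None or info_b is None:
        return False
    return info_a.reading == info_b.reading


print(same_representative("問い合わせ", "問合わせ"))
print(same_reading("闘争", "紛争"))
```

Note that the two terms of a pair do not need to be linked to each other in the dictionary; each is looked up separately and the retrieved values are compared.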

The similarity feature is a feature indicating the similarity between the two terms of a term pair. The similarity between two terms is obtained using information on whether the terms appear in similar contexts on the Web. Techniques for acquiring the similarity between terms are described in, for example, Jun'ichi Kazama, Stijn De Saeger, Kentaro Torisawa, Masaki Murata, "Creating a large-scale similar word list using probabilistic clustering of dependencies", 15th Annual Meeting of the Association for Natural Language Processing (NLP2009). That is, methods for acquiring the similarity between two terms are known, and any method of calculating the similarity may be used.

The positive/negative information is information indicating whether a term pair is a different notation pair. It marks the pair as a positive example (for example, "1") if the pair is a different notation pair, and as a negative example (for example, "0") if it is not.
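One piece of learning data, combining features with positive/negative information, might be represented as in the following sketch. The class name, the feature names, and the feature values are all hypothetical; the labels follow the first definition of a different notation pair given earlier, under which the synonym pair is a negative example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class LearningExample:
    features: dict[str, Union[bool, float]]     # feature name -> value
    label: int                                  # 1 = positive example, 0 = negative example
    term_pair: Optional[tuple[str, str]] = None  # the pair itself is optional


# Illustration data only; the feature values are invented.
examples = [
    LearningExample(
        {"same_reading": True, "case_difference": False, "similarity": 0.8},
        1,
        ("問い合わせメール", "問合わせメール"),
    ),
    LearningExample(
        {"same_reading": False, "case_difference": False, "similarity": 0.6},
        0,
        ("学園闘争", "学園紛争"),
    ),
]


def to_vector(example: LearningExample, feature_names: list[str]) -> list[float]:
    """Turn the named features into a numeric vector in a fixed order."""
    return [float(example.features[name]) for name in feature_names]


names = ["same_reading", "case_difference", "similarity"]
vectors = [to_vector(e, names) for e in examples]
labels = [e.label for e in examples]
```

Vectors and labels in this form are what a supervised learner such as a support vector machine would typically take as training input.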

The term dictionary is a collection of information that includes information on terms in different notations. Examples of the term dictionary and of its data structure are described later.

Another feature is the edited portion character feature, which is information on the characters of the edited portion or on the characters surrounding them.

The term dictionary 13 can store one or more pieces of term information. The term dictionary 13 is, for example, a collection of information that includes information on terms in different notations. In the term dictionary 13, two terms that are different notations of each other need not be explicitly associated. The term information has, for example, a term and a representative notation of the term, or a term and a reading of the term.

The reception unit 14 receives input from the user. This input is, for example, an operation instruction for operating the different notation acquisition device 1. The reception unit 14 may also receive a term pair that is to be judged as to whether it is a different-notation term pair. Any input means may be used for operation instructions and the like, such as a numeric keypad, a keyboard, a mouse, or a menu screen. The reception unit 14 can be realized by a device driver for input means such as a numeric keypad or keyboard, control software for a menu screen, and the like.

素性取得部１５は、用語対格納部１１の用語対ごとに、字種関連素性、辞書関連素性、類似度素性のうちの一以上を含む複数の素性を取得する。複数の素性とは、例えば、後述する６８の素性である。 The feature acquisition unit 15 acquires a plurality of features including at least one of a character type related feature, a dictionary related feature, and a similarity feature for each term pair in the term pair storage unit 11. The plurality of features are, for example, 68 features described later.

For each term pair in the term pair storage unit 11, the feature acquisition unit 15 determines whether the pair satisfies the condition that the edited portions of its two terms differ in character type and represent numbers of the same value, and acquires the determination result as a character type related feature. For example, for the term pair “３者会談” and “三者会談” (both meaning "three-party talks"), the edited portions are “３” and “三”, so the feature acquisition unit 15 determines that the condition is satisfied. Likewise, for the term pair “１２５件” and “百二十五件” (both meaning "125 cases"), the edited portions are “１２５” and “百二十五”; although the two terms do not have the same number of characters, the feature acquisition unit 15 determines that the condition is satisfied.
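The same-value-number condition can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names are hypothetical, and the kanji parser is assumed to handle only values below 10,000 written with 〇 to 九 and the units 十, 百, 千.

```python
KANJI_DIGITS = {c: i for i, c in enumerate("〇一二三四五六七八九")}
KANJI_UNITS = {"十": 10, "百": 100, "千": 1000}

def kanji_to_int(s):
    """Parse a simple kanji numeral (assumed < 10,000); return None if not numeric."""
    if not s:
        return None
    total, current = 0, 0
    for ch in s:
        if ch in KANJI_DIGITS:
            current = current * 10 + KANJI_DIGITS[ch]
        elif ch in KANJI_UNITS:
            # A unit with no preceding digit counts once (十 = 10, 百 = 100).
            total += (current or 1) * KANJI_UNITS[ch]
            current = 0
        else:
            return None
    return total + current

def to_number(s):
    """Return (value, script) for an edited portion, or None if it is not a number."""
    if s.isdecimal():          # ASCII and full-width Arabic digits
        return int(s), "arabic"
    value = kanji_to_int(s)
    return None if value is None else (value, "kanji")

def same_number_different_script(edit1, edit2):
    """1 if both edited portions are numbers of equal value in different scripts."""
    n1, n2 = to_number(edit1), to_number(edit2)
    if n1 is None or n2 is None:
        return 0
    return int(n1[0] == n2[0] and n1[1] != n2[1])
```

With this sketch, the edited portions “３” / “三” and “１２５” / “百二十五” both yield 1, while two non-numeric or unequal portions yield 0.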

For example, for each term pair in the term pair storage unit 11, the feature acquisition unit 15 determines whether the pair satisfies the condition that the edited portions of its two terms are both Roman letters and differ only in being uppercase versus lowercase, and acquires the determination result as a character type related feature. For example, the feature acquisition unit 15 determines that a term pair whose edited portions are “Ａ” and “ａ” satisfies the condition, and that a term pair whose edited portions are “Ａ” and “ｂ” does not.
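A minimal sketch of this case-difference check is shown below. The helper names are hypothetical, and treating a character as a Roman letter when its Unicode name contains "LATIN" is an assumption made for this illustration so that full-width letters such as “Ａ” are covered.

```python
import unicodedata

def is_latin(s):
    """True if s is non-empty and every character is a Latin letter (ASCII or full-width)."""
    return bool(s) and all("LATIN" in unicodedata.name(ch, "") for ch in s)

def case_only_difference(edit1, edit2):
    """1 if both edited portions are Latin letters differing only in case, else 0."""
    return int(
        is_latin(edit1)
        and is_latin(edit2)
        and edit1 != edit2
        and edit1.lower() == edit2.lower()
    )
```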

For example, for each term pair in the term pair storage unit 11, the feature acquisition unit 15 determines whether the pair is a different-notation term pair by using a classification method that differs from the supervised machine learning method used by the machine learning unit 16, or by using the same classification method with different learning data, and acquires the determination result as a dictionary related feature. A machine learning method that uses such a dictionary related feature is called a method based on a stacking algorithm. When the machine learning method used by the machine learning unit 16 is an SVM, a classification method that differs from it is, for example, a machine learning method other than SVM such as a decision tree, or the rule-based classification method described later.

More specifically, the stacking algorithm is, for example, a classification method that follows the procedure below. First, teacher data is created using the JUMAN dictionary. That is, word pairs whose edit distance is one character are extracted from the set of words in the JUMAN dictionary; 904,612 such pairs can be extracted. Of these, 25,934 pairs have the same representative notation. The word pairs whose representative notations in the JUMAN dictionary are equal are treated as positive examples, and the remaining pairs as negative examples. The teacher data is created in this way.
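The teacher-data construction above can be sketched as follows. This is an illustration under stated assumptions: the dictionary is represented as a hypothetical list of (word, representative notation) tuples rather than the actual JUMAN format, and the all-pairs loop is only practical for small inputs.

```python
from itertools import combinations

def is_edit_distance_one(a, b):
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # one substitution
    return a[i:] == b[i + 1:]           # one insertion into the shorter word

def build_teacher_data(entries):
    """entries: list of (word, representative). Returns (word1, word2, label) triples."""
    data = []
    for (w1, r1), (w2, r2) in combinations(entries, 2):
        if is_edit_distance_one(w1, w2):
            data.append((w1, w2, 1 if r1 == r2 else 0))
    return data

# Toy entries for illustration only (not taken from the JUMAN dictionary).
toy_entries = [("ご苦労", "御苦労"), ("御苦労", "御苦労"), ("闘争", "闘争"), ("紛争", "紛争")]
teacher = build_teacher_data(toy_entries)
```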

次に、その教師データを学習データとした機械学習を行う。なお、教師データは、上述した教師データに限らず、他の教師データを用いてもよい。また、機械学習の際に利用する素性は、本発明の全素性（Ｓ１からＳ６８の素性）のうち、Ｓ５４の素性を取り除いた素性を利用する。なお、機械学習の際に利用する素性は、他の素性を用いてもよい。 Next, machine learning using the teacher data as learning data is performed. The teacher data is not limited to the teacher data described above, and other teacher data may be used. In addition, as a feature used in machine learning, a feature obtained by removing the feature of S54 from all the features of the present invention (the features of S1 to S68) is used. Note that other features may be used as the features used in machine learning.

そして、実際に、Ｓ５４の素性を付与したいデータを、上記学習結果を利用して、分類する。分類結果において正例となったか、負例となったかの情報をＳ５４の素性として、そのデータに付与する。 In practice, the data to be given the feature of S54 is classified using the learning result. Information indicating whether the classification result is a positive example or a negative example is added to the data as the feature of S54.

Then, machine learning is performed using the learning data to which the feature S54 has been added (and which therefore has 68 features), and whether each term pair in question is a different-notation term pair is determined.

In the method based on the stacking algorithm, the result of learning whether representative notations match in the JUMAN dictionary can be added as a feature. Consequently, even for a term pair that is not actually listed in the JUMAN dictionary, information can be added on whether the pair is of a kind that the JUMAN dictionary would tend to treat as having the same representative notation.
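The two-stage flow of the stacking method can be sketched as below. Note the assumptions: the first-stage classifier here is a deliberately simple lookup over edited character pairs, standing in for the machine learning over the features other than S54 that the text describes, and all function names and the dict-based feature layout are hypothetical.

```python
from collections import Counter

def train_lookup_classifier(teacher_data):
    """teacher_data: iterable of ((s1, s2), label), where (s1, s2) is the edited
    character pair and label is 1 (same representative notation) or 0."""
    votes = Counter()
    for edit, label in teacher_data:
        votes[edit] += 1 if label == 1 else -1

    def predict(edit):
        # Unseen edits get 0 votes and are classified as negative.
        return 1 if votes[edit] > 0 else 0

    return predict

def add_s54(instances, predict):
    """Return copies of the feature dicts with the first-stage output stored as S54."""
    return [dict(f, S54=predict((f["S1"], f["S2"]))) for f in instances]

first_stage = train_lookup_classifier(
    [(("ご", "御"), 1), (("ご", "御"), 1), (("闘", "紛"), 0)]
)
enriched = add_s54(
    [{"S1": "ご", "S2": "御"}, {"S1": "闘", "S2": "紛"}, {"S1": "イ", "S2": "ィ"}],
    first_stage,
)
```

The enriched instances would then be passed, together with the other features, to the second-stage supervised learner.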

素性取得部１５は、例えば、用語対格納部１１の用語対ごとに、用語対が有する２つの用語の代表表記を、用語辞書１３から取得し、取得した２つの代表表記が同一であるか否かを判断し、判断結果を辞書関連素性として取得する。 For example, the feature acquisition unit 15 acquires, for each term pair in the term pair storage unit 11, representative representations of two terms included in the term pair from the term dictionary 13, and whether or not the two representative representations acquired are the same. And the determination result is acquired as a dictionary-related feature.

For example, for each term pair in the term pair storage unit 11, the feature acquisition unit 15 acquires the readings of the two terms from the term dictionary 13, determines whether the readings match, and acquires the determination result as a dictionary related feature. Alternatively, for each term pair in the term pair storage unit 11, the feature acquisition unit 15 may determine whether the two terms of the pair have the same number of characters and, in addition, whether the readings of the two terms acquired from the term dictionary 13 match, and may acquire the determination result as a dictionary related feature.

なお、上述した判断結果とは、例えば、上記条件に合致する場合の判断結果は「１」、その他の場合の判断結果は「０」などである。 Note that the above-described determination result is, for example, “1” when the condition is met, and “0” in other cases.
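The dictionary related features described above (matching representative notation, matching reading) reduce to simple lookups that return 1 or 0. The following is a minimal sketch; the dictionary contents and the nested-dict layout are assumptions for illustration, not the actual data of the term dictionary 13.

```python
# Toy dictionary for illustration only: term -> attributes.
TOY_DICT = {
    "出来る": {"reading": "できる", "representative": "出来る"},
    "できる": {"reading": "できる", "representative": "出来る"},
    "闘争": {"reading": "とうそう", "representative": "闘争"},
    "紛争": {"reading": "ふんそう", "representative": "紛争"},
}

def same_attribute(dictionary, term1, term2, attr):
    """1 if both terms are in the dictionary and share the given attribute, else 0."""
    r1, r2 = dictionary.get(term1), dictionary.get(term2)
    if r1 is None or r2 is None:
        return 0
    return int(r1[attr] == r2[attr])

def same_reading_and_length(dictionary, term1, term2):
    """Variant that additionally requires the two terms to have equal length."""
    return int(len(term1) == len(term2)
               and same_attribute(dictionary, term1, term2, "reading") == 1)
```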

The difference character acquisition means 151 acquires two sets of difference characters for a term pair whose edit distance is 2. Consider, for example, the following term pairs with an edit distance of 2: (1) “できる” and “出来る” (both meaning "can do"), (2) “理解できる” ("can understand") and “できる”, and (3) “ＩＸ” (the Roman numeral 9) and “９”. In (1), both terms have the same number of characters. In (2), one term has two more characters than the other. In (3), one term has one more character than the other. In case (1), the difference character acquisition means 151 numbers the characters of “できる” and “出来る” 1, 2, 3, ... from front to back, treats characters that have the same character number but differ as difference characters, and acquires the two sets (“で”, “出”) and (“き”, “来”). In case (2), the difference character acquisition means 151 acquires the two sets (“理”, “”) and (“解”, “”), where “” is NULL. In case (3), the difference character acquisition means 151 acquires either the two sets (“Ｉ”, “９”) and (“Ｘ”, “”), or the two sets (“Ｉ”, “”) and (“Ｘ”, “９”).

The difference character acquisition means 151 also acquires one set of difference characters for a term pair whose edit distance is 1. For example, suppose the term pairs with an edit distance of 1 are (1) “ご苦労” and “御苦労”, (2) “Ｆｉｒｅｆｏｘ” and “ＦｉｒｅＦｏｘ”, (3) “肝炎ウイルス” and “肝炎ウィルス”, (4) “文学史上” and “文学史”, (5) “咲き分け” and “咲分け”, (6) “クロゼット” and “クローゼット”, (7) “大人・子供” and “大人子供”, and (8) “第１位” and “第一位”. The difference character acquisition means 151 then acquires, respectively, (1) “ご” and “御”, (2) “ｆ” and “Ｆ”, (3) “イ” and “ィ”, (4) “上” and “”, (5) “き” and “”, (6) “” and “ー”, (7) “・” and “”, and (8) “１” and “一”.

Furthermore, the difference character acquisition means 151 acquires three or more sets of difference characters for a term pair whose edit distance is 3 or more. For example, when the term pair with an edit distance of 4 is “１０２５位” and “千二十五位” (both meaning "1025th place"), the difference character acquisition means 151 acquires the four sets (“１”, “千”), (“０”, “二”), (“２”, “十”), and (“５”, “五”). Here, difference characters are the characters at which the two terms differ.
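One way to obtain such sets of difference characters for any edit distance is to backtrace a standard edit-distance table, as in the sketch below. This is an assumed approach, not necessarily the one used by the difference character acquisition means 151, and the function name is hypothetical. When several minimal alignments exist (as for “IX” and “9”), the sketch returns only one of them.

```python
def diff_chars(a, b):
    """Return a list of (char_in_a, char_in_b) difference pairs; "" stands for NULL."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)

    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                pairs.append((a[i - 1], b[j - 1]))   # substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], ""))             # character only in a
            i -= 1
        else:
            pairs.append(("", b[j - 1]))             # character only in b
            j -= 1
    pairs.reverse()
    return pairs
```

The number of returned pairs equals the edit distance, so the same function covers the one-set, two-set, and three-or-more-set cases.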

The feature acquisition means 152 treats the two sets of difference characters acquired by the difference character acquisition means 151 independently, and acquires two sets of features, each including one or more of a character type related feature, a dictionary related feature, and a similarity feature. Consider again the term pairs with an edit distance of 2: (1) “できる” and “出来る”, (2) “理解できる” and “できる”, and (3) “ＩＸ” and “９”. For term pair (1), the feature acquisition means 152 extracts features for each of the two sets of difference characters, (“で”, “出”) and (“き”, “来”), regards the features extracted from each set as separate, and creates two kinds of test data. For example, for the character type related feature indicating whether the edited portions of the two terms differ in character type and represent numbers of the same value, the feature acquisition means 152 determines that the edited portions “で” and “出” are not numbers of the same value and acquires the value “0”. The feature acquisition means 152 also acquires, for example, the value “1” for the dictionary related feature indicating whether the readings of the two terms acquired from the term dictionary 13 match: it acquires the reading “で” of “出” from the term dictionary 13 and determines that the readings of “で” and “出” match. For the features consisting of the characters before and after the difference characters (edited portion), the feature acquisition means 152 acquires, for the difference characters “で” and “出”, the preceding character feature “” (none) and the following character features “き” and “来”. Similarly, for the difference characters “き” and “来”, it acquires the preceding character features “出” and “で” and the following character feature “る”. As a result of this processing, the other set of difference characters is also included in the features.
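For the equal-length case (1), creating one test instance per difference position, each with its preceding and following characters, might look like the following sketch. The function name and the dict layout are hypothetical, and the sketch is assumed to apply only to terms of the same length.

```python
def context_instances(term1, term2):
    """One instance per differing position, with the characters before and after it."""
    if len(term1) != len(term2):
        raise ValueError("this sketch only handles terms of equal length")
    instances = []
    for k, (c1, c2) in enumerate(zip(term1, term2)):
        if c1 != c2:
            instances.append({
                "diff": (c1, c2),
                # "" means there is no preceding/following character.
                "prev": (term1[k - 1] if k else "", term2[k - 1] if k else ""),
                "next": (term1[k + 1:k + 2], term2[k + 1:k + 2]),
            })
    return instances
```

Because the neighbouring characters are taken from the full terms, the other difference character appears in the context of each instance, as the text notes.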

For term pair (2), as in (1), the feature acquisition means 152 extracts features for each of the two sets of difference characters, (“理”, “”) and (“解”, “”), regards the features extracted from each set as separate, and creates two kinds of test data.

さらに、（３）の用語対について、素性取得手段１５２は、（１）（２）と同様に、例えば、「Ｉ」「９」と「Ｘ」「」の２組の差分文字の組のそれぞれを対象に素性の抽出を行い、それぞれ差分文字から抽出した素性は、別のものと考え、２種類のテストデータを作成する。 Further, for the term pair (3), the feature acquisition unit 152, for example, each of two sets of difference characters “I”, “9”, “X”, and “”, as in (1) and (2). The features are extracted from the target, and the features extracted from the difference characters are considered to be different, and two types of test data are created.

また、素性取得手段１５２は、差分文字取得手段１５１が取得した１組以上の差分文字を用いて、字種関連素性、辞書関連素性、類似度素性のうちの一以上を含む複数の素性を取得する。なお、字種関連素性、辞書関連素性、類似度素性などの素性を取得する具体的な方法は後述する。 The feature acquisition unit 152 acquires a plurality of features including one or more of a character type related feature, a dictionary related feature, and a similarity feature using one or more sets of difference characters acquired by the difference character acquiring unit 151. To do. A specific method for acquiring features such as character type-related features, dictionary-related features, and similarity features will be described later.

機械学習部１６は、各用語対に対して、学習データ格納部１２の２以上の学習データと、素性取得部１５が取得した複数の素性とを用いて、教師あり機械学習法により、用語対格納部１１の各用語対が異表記の用語対であるか否かを判断する。 For each term pair, the machine learning unit 16 uses two or more learning data in the learning data storage unit 12 and a plurality of features acquired by the feature acquisition unit 15 to perform the term pairing by the supervised machine learning method. It is determined whether or not each term pair in the storage unit 11 is a term pair having a different notation.

機械学習部１６は、用語対格納部１１の各用語対が異表記の用語対であるか否かを判断するとともに、異表記の用語対である確度を示すスコアも取得しても良い。 The machine learning unit 16 may determine whether each term pair in the term pair storage unit 11 is a differently expressed term pair, and may also acquire a score indicating the accuracy of the differently expressed term pair.

For each of the two sets of features acquired by the feature acquisition means 152, the machine learning unit 16 uses the features of that set and the two or more pieces of learning data in the learning data storage unit 12 to determine, by a supervised machine learning method, whether the set of features corresponds to a different-notation term pair. It then uses the two determination results to determine whether the term pair with an edit distance of 2 is a different-notation term pair.

Any algorithm may be used for the supervised machine learning method. The supervised machine learning method is, for example, a support vector machine (SVM). SVMs are described in, for example, “http://chasen.org/~taku/software/TinySVM/” and “http://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%9D%E3%83%BC%E3%83%88%E3%83%99%E3%82%AF%E3%82%BF%E3%83%BC%E3%83%9E%E3%82%B7%E3%83%B3” (retrieved December 12, 2009). Details of the supervised machine learning method will be described later.

"Using the two determination results" above may mean any of the following. The pair may be judged a different-notation term pair when both results are positive examples, or judged not to be one when both are negative examples. The score closer to 0 of the two scores may be adopted, and the pair judged a positive example (different-notation terms) if the adopted score is positive and a negative example (not different-notation terms) if it is negative. The score with the larger absolute value may be adopted and the pair judged in the same way by its sign. Alternatively, the smaller of the two scores may be acquired, and the pair judged a positive example if that score is positive and a negative example if it is negative. In short, the way the two determination results are used is not limited. Regarding case (2) above (where one term has two more characters than the other), about 15,000 pairs from the large-scale similar word list were tagged, and among the two-character difference data of this pattern no term pair was judged to be a different-notation pair.
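The score-based options listed above can be written compactly as in the sketch below. The function and strategy names are hypothetical, and treating a score of exactly 0 as negative is an assumption of this sketch, since the text only covers positive and negative scores.

```python
def combine_two_scores(score1, score2, strategy="min"):
    """Pick one of the two per-difference scores and judge the pair by its sign.

    Returns (adopted_score, is_different_notation_pair).
    """
    if strategy == "closest_to_zero":
        adopted = min((score1, score2), key=abs)
    elif strategy == "largest_abs":
        adopted = max((score1, score2), key=abs)
    elif strategy == "min":
        adopted = min(score1, score2)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return adopted, adopted > 0
```

The "min" strategy is the strictest of the three: the pair is accepted only if both difference positions receive a positive score.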

Furthermore, when there are two possible ways of forming the two sets of difference characters (for example, (“Ｉ”, “９”) and (“Ｘ”, “”), or (“Ｉ”, “”) and (“Ｘ”, “９”)), that is, when two problems (problem 1 and problem 2) arise, features are extracted for each set of difference characters, the features extracted from each set are regarded as separate, and four kinds of test data are created. Then, for each of the two problems, the calculated score closer to 0 may be acquired; of the resulting per-problem scores, the one with the larger absolute value may be taken as the score of the term pair; and the pair may be judged a positive example if that score is positive and a negative example if it is negative. For example, when the term pair with an edit distance of 2 is (3) “ＩＸ” and “９”, the problems (“Ｉ”, “９”) and (“Ｘ”, “”), and (“Ｉ”, “”) and (“Ｘ”, “９”) arise. The machine learning unit 16 acquires the smaller of the scores for (“Ｉ”, “９”) and (“Ｘ”, “”), acquires the smaller of the scores for (“Ｉ”, “”) and (“Ｘ”, “９”), and takes the larger of the two acquired scores as the score of the term pair “ＩＸ” and “９”. The machine learning unit 16 may then judge the pair a positive example if this score is positive and a negative example if it is negative. Alternatively, for example, the machine learning unit 16 may acquire the score closer to 0 for (“Ｉ”, “９”) and (“Ｘ”, “”), acquire the score closer to 0 for (“Ｉ”, “”) and (“Ｘ”, “９”), and take the one with the larger absolute value of the two acquired scores as the score of the term pair “ＩＸ” and “９”. In short, the way the score is calculated from the determination results for the four kinds of test data is not limited.
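The first variant above (smaller score within each problem, larger score across problems) can be sketched as follows. The function name is hypothetical and the scores in the usage line are made-up numbers for illustration, not experimental values.

```python
def score_over_alignments(alignment_scores):
    """alignment_scores: one (score_a, score_b) tuple per alternative alignment.

    Within each alignment the smaller score is kept (both edits must be
    plausible); across alignments the larger kept score is adopted.
    Returns (adopted_score, is_different_notation_pair).
    """
    per_alignment = [min(pair) for pair in alignment_scores]
    adopted = max(per_alignment)
    return adopted, adopted > 0

# Hypothetical scores for the two alignments of "IX" / "9".
result = score_over_alignments([(0.9, -0.4), (0.3, 0.6)])
```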

出力部１７は、機械学習部１６における判断結果を出力する。また、出力部１７は、機械学習部１６が取得したスコアを出力しても良い。判断結果とは、各用語対が異表記の用語対であるか否かを示す情報、または異表記の１以上の用語対、または異表記でない１以上の用語対などである。また、出力部１７は、判断結果とスコアの両方を出力しても良いし、一方を出力しても良い。 The output unit 17 outputs the determination result in the machine learning unit 16. The output unit 17 may output the score acquired by the machine learning unit 16. The determination result includes information indicating whether each term pair is a different notation, or one or more term pairs having different notations, or one or more term pairs not having different notations. Moreover, the output part 17 may output both a judgment result and a score, and may output one.

また、出力とは、ディスプレイへの表示、プロジェクターを用いた投影、プリンタへの印字、音出力、外部の装置への送信、記録媒体への蓄積、他の処理装置や他のプログラムなどへの処理結果の引渡しなどを含む概念である。 Also, output means display on a display, projection using a projector, printing on a printer, sound output, transmission to an external device, storage in a recording medium, processing to another processing device or other program, etc. It is a concept that includes delivery of results.

閾値格納手段１７１は、スコアの閾値を格納している。 The threshold storage unit 171 stores a score threshold.

閾値判断手段１７２は、機械学習部１６が取得したスコアが閾値以上または閾値より大きいか否かを判断する。 The threshold determination unit 172 determines whether the score acquired by the machine learning unit 16 is equal to or greater than the threshold or greater than the threshold.

出力手段１７３は、閾値判断手段１７２が閾値以上または閾値より大きいと判断したスコアに対応する用語対を、異表記の用語対であるとの判断結果とし、判断結果または異表記の用語対または異表記でない用語対のいずれか１以上を出力する。 The output unit 173 sets the term pair corresponding to the score determined by the threshold judgment unit 172 to be equal to or greater than the threshold value or greater than the threshold value as a judgment result that is a different term term pair. Output one or more of non-notation term pairs.

用語対格納部１１、学習データ格納部１２、用語辞書１３、および閾値格納手段１７１は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。 The term pair storage unit 11, the learning data storage unit 12, the term dictionary 13, and the threshold storage unit 171 are preferably non-volatile recording media, but can also be realized by volatile recording media.

用語対格納部１１、学習データ格納部１２、および用語辞書１３に格納されている情報が記憶される過程は問わない。 The process in which the information stored in the term pair storage unit 11, the learning data storage unit 12, and the term dictionary 13 is stored is not limited.

素性取得部１５、機械学習部１６、閾値判断手段１７２は、通常、ＭＰＵやメモリ等から実現され得る。素性取得部１５等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The feature acquisition unit 15, the machine learning unit 16, and the threshold determination unit 172 can usually be realized by an MPU, a memory, or the like. The processing procedure of the feature acquisition unit 15 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

出力部１７は、ディスプレイやスピーカー等の出力デバイスを含むと考えても含まないと考えても良い。出力部１７は、出力デバイスのドライバーソフトまたは、出力デバイスのドライバーソフトと出力デバイス等で実現され得る。 The output unit 17 may or may not include an output device such as a display or a speaker. The output unit 17 can be realized by output device driver software, or output device driver software and an output device.

Next, the operation of the different notation acquisition device 1 will be described with reference to the flowchart of FIG. 2. Here, the different notation acquisition device 1 determines, for term pairs with an edit distance of 1, whether each is a different-notation term pair.

(Step S201) The reception unit 14 determines whether an instruction to start operation has been received. If an instruction has been received, the process proceeds to step S202; if not, the process returns to step S201.

（ステップＳ２０２）素性取得部１５は、カウンタｉに１を代入する。 (Step S202) The feature acquisition unit 15 substitutes 1 for a counter i.

（ステップＳ２０３）素性取得部１５は、ｉ番目の用語対が用語対格納部１１に存在するか否かを判断する。ｉ番目の用語対が存在すればステップＳ２０４に行き、存在しなければ処理を終了する。 (Step S203) The feature acquisition unit 15 determines whether or not the i-th term pair exists in the term pair storage unit 11. If the i-th term pair exists, the process goes to step S204, and if it does not exist, the process ends.

（ステップＳ２０４）素性取得部１５は、用語対格納部１１から、ｉ番目の用語対を読み出す。 (Step S204) The feature acquisition unit 15 reads the i-th term pair from the term pair storage unit 11.

（ステップＳ２０５）素性取得部１５は、ｉ番目の用語対の素性を取得する処理を行う。素性取得処理について、図３のフローチャートを用いて説明する。 (Step S205) The feature acquisition unit 15 performs a process of acquiring the feature of the i-th term pair. The feature acquisition process will be described with reference to the flowchart of FIG.

（ステップＳ２０６）機械学習部１６は、ステップＳ２０５で取得された複数の素性と、学習データ格納部１２の２以上の学習データとを用いて、教師あり機械学習を行い、スコアを取得する。 (Step S206) The machine learning unit 16 performs supervised machine learning using the plurality of features acquired in step S205 and two or more learning data of the learning data storage unit 12, and acquires a score.

(Step S207) The output unit 17 uses the score acquired in step S206 to determine whether the i-th term pair is a different-notation term pair. For example, the threshold determination means 172 of the output unit 17 reads the threshold from the threshold storage means 171. If the score acquired in step S206 is greater than (or greater than or equal to) the threshold, it determines that the i-th term pair is a different-notation term pair; if the score is less than or equal to (or less than) the threshold, it determines that the i-th term pair is not a different-notation term pair.

(Step S208) If the determination result in step S207 is that the pair is a different-notation term pair, the output unit 17 proceeds to step S209; otherwise, it proceeds to step S210.

（ステップＳ２０９）出力部１７は、ｉ番目の用語対を異表記の用語対であるとして出力する。 (Step S209) The output unit 17 outputs the i-th term pair as a differently expressed term pair.

（ステップＳ２１０）素性取得部１５は、カウンタｉを１，インクリメントする。ステップＳ２０３に戻る。 (Step S210) The feature acquisition unit 15 increments the counter i by one. The process returns to step S203.
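Steps S202 to S210 amount to a loop over the stored term pairs. The sketch below mirrors that flow under stated assumptions: `get_features` and `score_fn` are hypothetical callables standing in for the feature acquisition unit 15 and the machine learning unit 16, and the `inclusive` flag selects between "greater than or equal to" and "greater than" the threshold.

```python
def run_device(term_pairs, get_features, score_fn, threshold=0.0, inclusive=True):
    """Return the term pairs judged to be different-notation pairs."""
    output = []
    i = 1                                        # S202: initialise counter
    while i <= len(term_pairs):                  # S203: does the i-th pair exist?
        pair = term_pairs[i - 1]                 # S204: read the i-th pair
        features = get_features(pair)            # S205: feature acquisition
        score = score_fn(features)               # S206: supervised learner's score
        if inclusive:                            # S207: threshold judgment
            is_variant = score >= threshold
        else:
            is_variant = score > threshold
        if is_variant:                           # S208: branch on the judgment
            output.append(pair)                  # S209: output the pair
        i += 1                                   # S210: increment and loop
    return output

# Toy stand-ins for illustration only.
pairs = [("ご苦労", "御苦労"), ("闘争", "紛争")]
toy_features = lambda p: {"same_length": len(p[0]) == len(p[1]), "first": p[0][0]}
toy_score = lambda f: 1.0 if f["first"] == "ご" else -1.0
selected = run_device(pairs, toy_features, toy_score)
```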

次に、ステップＳ２０５の素性取得処理について、図３のフローチャートを用いて説明する。 Next, the feature acquisition process in step S205 will be described with reference to the flowchart of FIG.

（ステップＳ３０１）素性取得部１５を構成する差分文字取得手段１５１は、２つの用語の編集箇所を取得する。 (Step S301) The difference character acquisition means 151 which comprises the feature acquisition part 15 acquires the edit location of two terms.

（ステップＳ３０２）素性取得部１５の素性取得手段１５２は、ステップＳ３０１で取得された編集箇所を用いて、字種関連素性を取得する。字種関連素性の具体的な取得方法については後述する。 (Step S302) The feature acquisition unit 152 of the feature acquisition unit 15 acquires a character type-related feature using the edited portion acquired in step S301. A specific method for acquiring the character type-related features will be described later.

（ステップＳ３０３）素性取得手段１５２は、用語辞書１３を用いて、辞書関連素性を取得する。辞書関連素性の具体的な取得方法については後述する。 (Step S303) The feature acquisition unit 152 acquires a dictionary-related feature using the term dictionary 13. A specific method for acquiring dictionary-related features will be described later.

（ステップＳ３０４）素性取得手段１５２は、２つの用語の類似度を取得する。この類似度は、類似度素性である。 (Step S304) The feature acquisition unit 152 acquires the similarity between two terms. This similarity is a similarity feature.

（ステップＳ３０５）素性取得手段１５２は、その他、予め決められた素性を取得する。その他の予め決められた素性の例は、後述する。 (Step S305) The feature acquisition unit 152 acquires other predetermined features. Examples of other predetermined features will be described later.

（ステップＳ３０６）素性取得手段１５２は、スタッキングアルゴリズムを使用して、ステップＳ３０２からステップＳ３０５において取得した複数の素性を用いて、ステップＳ２０６における教師あり機械学習法とは異なる分類方法により、用語対が異表記の用語対であるか否かを判断し、その判断結果を取得する。 (Step S306) The feature acquisition unit 152 uses a plurality of features acquired in steps S302 to S305 by using a stacking algorithm, and uses a classification method different from the supervised machine learning method in step S206 to generate a term pair. It is determined whether or not a pair of terms is notated, and the determination result is acquired.

以下、本実施の形態における異表記取得装置１の具体的な動作について説明する。 Hereinafter, a specific operation of the different notation acquisition apparatus 1 in the present embodiment will be described.

Assume now that the term dictionary 13 has, for example, the structure shown in FIG. 4. In FIG. 4, the information for one term forms one record. Each record has attribute values for "term", "reading", "part of speech", "representative notation", "category", and "domain". The term dictionary 13 is, for example, the JUMAN dictionary (see “http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html”, retrieved December 13, 2009). The term dictionary 13 may also be, for example, the Japanese WordNet dictionary (see http://nlpwww.nict.go.jp/wn-ja/index.ja.html, retrieved December 13, 2009), a variant character dictionary, or the EDR electronic dictionary (see http://www2.nict.go.jp/r/r312/EDR/J_index.html, retrieved December 13, 2009). A variant character dictionary is a dictionary containing pairs of variant characters. Variant characters are characters that are read and used in the same way but differ in part of their shape. They are common among kanji that have both an old form and a new form; for example, “沢” and “澤” are variants of each other. The variant character dictionary may also contain pairs of kanji that are not variant characters but are interchangeable in the same way as variant characters. Furthermore, the term dictionary 13 may include, separately from the variant character dictionary, a dictionary containing pairs of kanji that are interchangeable in the same way as variant characters.
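As an illustration of the record structure described for FIG. 4, one record per term with six attributes could be modelled as below. The class and field names are hypothetical English renderings of the attribute names, and the sample values are invented for illustration rather than copied from the JUMAN dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

@dataclass(frozen=True)
class TermRecord:
    term: str                       # 用語
    reading: str                    # 読み
    part_of_speech: str             # 品詞
    representative: str             # 代表表記
    category: Optional[str] = None  # カテゴリ
    domain: Optional[str] = None    # ドメイン

def build_index(records: Iterable[TermRecord]) -> Dict[str, TermRecord]:
    """Index records by term so that readings and representative notations can be looked up."""
    return {r.term: r for r in records}

# Invented sample records for illustration only.
index = build_index([
    TermRecord("出来る", "できる", "動詞", "出来る"),
    TermRecord("できる", "できる", "動詞", "出来る"),
])
```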

また、学習データ格納部１２に格納されている学習データが有する複数の素性、および素性取得部１５が取得する複数の素性は、ここでは、６８種類である、とする。以下に、６８の素性（Ｓ１からＳ６８）について説明する。また、以下、用語対の具体例として、用語対「ショウウインドウ」「ショーウインドウ」を用いて、素性を例示する。 Here, it is assumed that the learning data stored in the learning data storage unit 12 and the feature acquisition unit 15 both use 68 kinds of features. The 68 features (S1 to S68) are described below. In what follows, the features are illustrated using the term pair 「ショウウインドウ」 and 「ショーウインドウ」 (two spellings of "show window") as the specific example.

Ｓ１は、「一つ目の表記の編集箇所」であり、上記具体例では、「ウ」である。素性Ｓ１を取得する場合、差分文字取得手段１５１は、用語対の構成する２つの用語を１文字ずつずらしながら文字を比較し、編集箇所を得る。例えば、差分文字取得手段１５１は、「ショウウインドウ」の１文字目「シ」と、「ショーウインドウ」の１文字目「シ」とから比較し、同一と判断し、２文字目も同一と判断し、３文字目「ウ」と「ー」とが異なると判断し、一つ目の表記の編集箇所「ウ」と二つ目の表記の編集箇所「ー」とを取得する。 S1 is "the edited portion of the first notation", and is 「ウ」 in the above specific example. When acquiring the feature S1, the difference character acquisition unit 151 compares the two terms constituting the term pair character by character, advancing one character at a time, and obtains the edited portion. For example, the difference character acquisition unit 151 starts by comparing the first character 「シ」 of 「ショウウインドウ」 with the first character 「シ」 of 「ショーウインドウ」 and determines that they are the same, determines that the second characters are also the same, and determines that the third characters 「ウ」 and 「ー」 differ. It thereby acquires 「ウ」 as the edited portion of the first notation and 「ー」 as the edited portion of the second notation.

Ｓ２は、「二つ目の表記の編集箇所」であり、上記具体例では、「ー」である。一つ目の表記とは用語対を構成する一つ目の用語（例えば、「ショウウインドウ」）であり、二つ目の表記とは用語対を構成する二つ目の用語（例えば、「ショーウインドウ」）である。 S2 is "the edited portion of the second notation", and is 「ー」 in the above specific example. The first notation is the first term of the term pair (for example, 「ショウウインドウ」), and the second notation is the second term of the term pair (for example, 「ショーウインドウ」).
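The character-by-character comparison that yields S1 and S2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `find_edit` is hypothetical, and the right-to-left scan (used so that one-sided insertions such as 「サーバ」/「サーバー」 are handled) is an addition not spelled out in the text. The returned index is 0-based, whereas the text counts positions from 1.

```python
def find_edit(term1: str, term2: str):
    """Return (edit1, edit2, start) for a term pair.

    edit1 / edit2 are the differing substrings (candidates for S1 / S2),
    and start is the 0-based index where the difference begins.
    """
    # Scan from the left while the characters are identical.
    start = 0
    while start < len(term1) and start < len(term2) and term1[start] == term2[start]:
        start += 1
    # Scan from the right while the characters are identical,
    # never crossing the left boundary found above.
    end1, end2 = len(term1), len(term2)
    while end1 > start and end2 > start and term1[end1 - 1] == term2[end2 - 1]:
        end1 -= 1
        end2 -= 1
    return term1[start:end1], term2[start:end2], start
```

For the running example, `find_edit("ショウウインドウ", "ショーウインドウ")` stops at the third character and returns 「ウ」 and 「ー」 with start index 2.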

Ｓ３は「編集箇所の前方の１文字」であり、上記具体例では、「ョ」である。 S3 is "the one character immediately before the edited portion", and is 「ョ」 in the above specific example.

Ｓ４は「編集箇所の後方の１文字」であり、上記具体例では、「ウ」である。 S4 is "the one character immediately after the edited portion", and is 「ウ」 in the above specific example.

Ｓ５は、「編集箇所の前方の連続する２文字」であり、上記具体例では、「ショ」である。 S5 is "the two consecutive characters before the edited portion", and is 「ショ」 in the above specific example.

Ｓ６は、「編集箇所の前方の連続する３文字」であり、上記具体例では、「ショ」である。 S6 is "the three consecutive characters before the edited portion", and is 「ショ」 in the above specific example, because only two characters precede the edited portion.

Ｓ７は、「編集箇所の前方２文字目の文字」であり、上記具体例では、「シ」である。 S7 is "the second character before the edited portion", and is 「シ」 in the above specific example.

Ｓ８は「編集箇所の前方３文字目の文字」であり、上記具体例では、「(del)」である。(del)とは、文字が無いことを示す。 S8 is "the third character before the edited portion", and is "(del)" in the above specific example. (del) indicates that there is no character.

Ｓ９は「編集箇所の後方の２文字」であり、上記具体例では、「ウイ」である。 S9 is "the two characters after the edited portion", and is 「ウイ」 in the above specific example.

Ｓ１０は「編集箇所の後方の３文字」であり、上記具体例では、「ウイン」である。 S10 is "the three characters after the edited portion", and is 「ウイン」 in the above specific example.

Ｓ１１は「編集箇所の後方２文字目の文字」であり、上記具体例では、「イ」である。 S11 is "the second character after the edited portion", and is 「イ」 in the above specific example.

Ｓ１２は「編集箇所の後方３文字目の文字」であり、上記具体例では、「ン」である。 S12 is "the third character after the edited portion", and is 「ン」 in the above specific example.

Ｓ１３は「'Ｓ１の情報−Ｓ２の情報'とした文字列」であり、上記具体例では、「ウ−ー」である。 S13 is the character string "S1 information − S2 information", and is 「ウ−ー」 in the above specific example.

Ｓ１４は「'Ｓ３の情報−Ｓ１３の情報'とした文字列」であり、上記具体例では、「ョ−ウ−ー」である。 S14 is the character string "S3 information − S13 information", and is 「ョ−ウ−ー」 in the above specific example.

Ｓ１５は「'Ｓ５の情報−Ｓ１３の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー」である。 S15 is the character string "S5 information − S13 information", and is 「ショ−ウ−ー」 in the above specific example.

Ｓ１６は「'Ｓ６の情報−Ｓ１３の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー」である。 S16 is the character string "S6 information − S13 information", and is 「ショ−ウ−ー」 in the above specific example.

Ｓ１７は「'Ｓ１３の情報−Ｓ４の情報'」であり、上記具体例では、「ウ−ー−ウ」である。 S17 is "S13 information − S4 information", and is 「ウ−ー−ウ」 in the above specific example.

Ｓ１８は「'Ｓ３の情報−Ｓ１３の情報−Ｓ４の情報'」であり、上記具体例では、「ョ−ウ−ー−ウ」である。 S18 is "S3 information − S13 information − S4 information", and is 「ョ−ウ−ー−ウ」 in the above specific example.

Ｓ１９は「'Ｓ５の情報−Ｓ１３の情報−Ｓ４の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー−ウ」である。 S19 is the character string "S5 information − S13 information − S4 information", and is 「ショ−ウ−ー−ウ」 in the above specific example.

Ｓ２０は「'Ｓ６の情報−Ｓ１３の情報−Ｓ４の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー−ウ」である。 S20 is the character string "S6 information − S13 information − S4 information", and is 「ショ−ウ−ー−ウ」 in the above specific example.

Ｓ２１は「'Ｓ１３の情報−Ｓ７の情報'とした文字列」であり、上記具体例では、「ウ−ー−シ」である。 S21 is the character string "S13 information − S7 information", and is 「ウ−ー−シ」 in the above specific example.

Ｓ２２は「'Ｓ３の情報−Ｓ１３の情報−Ｓ７の情報'とした文字列」であり、上記具体例では、「ョ−ウ−ー−シ」である。 S22 is the character string "S3 information − S13 information − S7 information", and is 「ョ−ウ−ー−シ」 in the above specific example.

Ｓ２３は「'Ｓ５の情報−Ｓ１３の情報−Ｓ７の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー−シ」である。 S23 is the character string "S5 information − S13 information − S7 information", and is 「ショ−ウ−ー−シ」 in the above specific example.

Ｓ２４は「'Ｓ６の情報−Ｓ１３の情報−Ｓ７の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー−シ」である。 S24 is the character string "S6 information − S13 information − S7 information", and is 「ショ−ウ−ー−シ」 in the above specific example.

Ｓ２５は「'Ｓ１３の情報−Ｓ８の情報'とした文字列」であり、上記具体例では、「ウ−ー−（ｄｅｌ）」である。 S25 is the character string "S13 information − S8 information", and is 「ウ−ー−(del)」 in the above specific example.

Ｓ２６は「'Ｓ３の情報−Ｓ１３の情報−Ｓ８の情報'とした文字列」であり、上記具体例では、「ョ−ウ−ー−（ｄｅｌ）」である。 S26 is the character string "S3 information − S13 information − S8 information", and is 「ョ−ウ−ー−(del)」 in the above specific example.

Ｓ２７は「'Ｓ５の情報−Ｓ１３の情報−Ｓ８の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー−（ｄｅｌ）」である。 S27 is the character string "S5 information − S13 information − S8 information", and is 「ショ−ウ−ー−(del)」 in the above specific example.

Ｓ２８は「'Ｓ６の情報−Ｓ１３の情報−Ｓ８の情報'とした文字列」であり、上記具体例では、「ショ−ウ−ー−（ｄｅｌ）」である。なお、２つの用語が与えられ、編集箇所が判断できれば、単なる文字列の処理（操作）により、素性取得手段１５２は、Ｓ３からＳ２８の素性を取得できる。 S28 is the character string "S6 information − S13 information − S8 information", and is 「ショ−ウ−ー−(del)」 in the above specific example. Given the two terms and the edited portion, the feature acquisition unit 152 can acquire the features S3 to S28 by simple character string processing (operations).
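The string operations behind S3 to S28 can be sketched as below. This is a minimal illustration, not the device's actual implementation: the function name `context_features` and the dictionary layout are hypothetical, the context is taken from the first term, and "(del)" is used whenever a position has no character, following the convention stated for S8.

```python
DEL = "(del)"   # marker used in the text for "no character"
SEP = "\u2212"  # the minus sign used as separator in the examples

def context_features(term: str, start: int, edit1: str, edit2: str) -> dict:
    """Build S1..S28 from the first term, the 0-based start of its
    edited portion, and the two edited portions."""
    before = term[:start]
    after = term[start + len(edit1):]

    def nth_before(n):  # n-th character counting backwards from the edit
        return before[-n] if len(before) >= n else DEL

    def nth_after(n):   # n-th character counting forwards from the edit
        return after[n - 1] if len(after) >= n else DEL

    f = {
        "S1": edit1, "S2": edit2,
        "S3": nth_before(1), "S4": nth_after(1),
        "S5": before[-2:] or DEL, "S6": before[-3:] or DEL,
        "S7": nth_before(2), "S8": nth_before(3),
        "S9": after[:2] or DEL, "S10": after[:3] or DEL,
        "S11": nth_after(2), "S12": nth_after(3),
    }
    # S13..S28: the core "S1-S2" optionally prefixed by S3/S5/S6
    # and optionally suffixed by S4/S7/S8, in the order of the text.
    core = SEP.join([edit1, edit2])
    n = 13
    for tail in (None, "S4", "S7", "S8"):
        for head in (None, "S3", "S5", "S6"):
            parts = ([f[head]] if head else []) + [core] + ([f[tail]] if tail else [])
            f[f"S{n}"] = SEP.join(parts)
            n += 1
    return f
```

The nested loop relies on the regular layout of S13 to S28: each block of four features shares one suffix (none, S4, S7, S8) and cycles through the prefixes (none, S3, S5, S6).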

Ｓ２９は「Ｓ１の字種」であり、上記具体例では、「カタカナ」である。文字を与えられた場合、当該文字の字種（漢字、ひらがな、かたかな、アルファベット等）を取得する技術は公知技術である。 S29 is "character type of S1", and in the above specific example, "Katakana". Obtaining the character type (kanji, hiragana, katakana, alphabet, etc.) of a given character is a known technique.
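One common way to obtain a character type is to test the character's Unicode code point against block ranges. The sketch below assumes that approach; the function name `char_type`, the returned labels, and the exact ranges (basic Hiragana, Katakana, and CJK Unified Ideographs blocks only) are choices made for illustration, not something the text specifies.

```python
import unicodedata

def char_type(ch: str) -> str:
    """Classify a single character by Unicode block (illustrative ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # includes the long-vowel mark U+30FC
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (basic block)
        return "kanji"
    if ch.isdigit():               # ASCII and full-width digits
        return "digit"
    # Full-width Latin letters fold to ASCII under NFKC normalization.
    if ch.isalpha() and unicodedata.normalize("NFKC", ch).isascii():
        return "alphabet"
    return "other"
```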

Ｓ３０は「Ｓ２の字種」であり、上記具体例では、「カタカナ」である。 S30 is “character type of S2”, and in the above specific example, “Katakana”.

Ｓ３１は「Ｓ３の字種」であり、上記具体例では、「カタカナ」である。 S31 is “S3 character type”, and in the above specific example, “Katakana”.

Ｓ３２は「Ｓ４の字種」であり、上記具体例では、「カタカナ」である。 S32 is “character type of S4”, and in the specific example, “Katakana”.

Ｓ３３は「Ｓ１３の字種」であり、上記具体例では、「カタカナ」である。 S33 is “character type of S13”, and in the above specific example, “Katakana”.

Ｓ３４は「Ｓ１４の字種」であり、上記具体例では、「カタカナ」である。 S34 is “character type of S14”, and in the above specific example, “Katakana”.

Ｓ３５は「Ｓ１７の字種」であり、上記具体例では、「カタカナ」である。 S35 is “character type of S17”, and in the specific example, “Katakana”.

Ｓ３６は「Ｓ１８の字種」であり、上記具体例では、「カタカナ」である。 S36 is “character type of S18”, and in the above specific example, “Katakana”.

Ｓ３７は「Ｓ１の品詞」であり、上記具体例では、「名詞」である。ここで、文字の品詞は、その文字（ここではＳ１）が属している用語の品詞である。例えば、用語に対して、形態素解析をかけ、用語を単語に区切り、品詞情報を取得する。そして、文字の品詞は、当該取得した品詞情報が示す品詞である。 S37 is “part of speech of S1”, and in the above specific example is “noun”. Here, the part of speech of a character is the part of speech of the term to which the character (here, S1) belongs. For example, morphological analysis is applied to the term, the term is divided into words, and the part of speech information is acquired. The part of speech of the character is the part of speech indicated by the acquired part of speech information.

Ｓ３８は「Ｓ２の品詞」であり、上記具体例では、「名詞」である。 S38 is “part of speech of S2”, and in the above specific example is “noun”.

Ｓ３９は「Ｓ３の品詞」であり、上記具体例では、「名詞」である。 S39 is “part of speech of S3”, and in the above specific example is “noun”.

Ｓ４０は「Ｓ４の品詞」であり、上記具体例では、「名詞」である。 S40 is “part of speech of S4”, and in the above specific example, “noun”.

Ｓ４１は「Ｓ１３の品詞」であり、上記具体例では、「名詞」である。 S41 is “part of speech of S13”, and in the above specific example, “noun”.

Ｓ４２は「Ｓ１４の品詞」であり、上記具体例では、「名詞」である。 S42 is “part of speech of S14”, and in the above specific example is “noun”.

Ｓ４３は「Ｓ１７の品詞」であり、上記具体例では、「名詞」である。 S43 is “part of speech of S17”, and in the above specific example is “noun”.

Ｓ４４は「Ｓ１８の品詞」であり、上記具体例では、「名詞」である。 S44 is “part of speech of S18”, and in the above specific example is “noun”.

Ｓ４５は「Ｓ１の品詞と位置情報」であり、上記具体例では、「名詞，３」である。ここで「３」は、３文字目であることを示す。 S45 is “part of speech and position information of S1”, and in the above specific example, “noun, 3”. Here, “3” indicates the third character.

Ｓ４６は「Ｓ２の品詞と位置情報」であり、上記具体例では、「名詞，３」である。 S46 is “part of speech and position information of S2,” which is “noun, 3” in the above specific example.

Ｓ４７は「Ｓ３の品詞と位置情報」であり、上記具体例では、「名詞，２」である。 S47 is “part of speech and position information of S3”, and “noun, 2” in the above specific example.

Ｓ４８は「Ｓ４の品詞と位置情報」であり、上記具体例では、「名詞，４」である。 S48 is “part of speech and position information of S4”, and “noun, 4” in the above specific example.

Ｓ４９は「Ｓ１３の品詞と位置情報」であり、上記具体例では、「名詞，３」である。 S49 is “part of speech and position information of S13”, and “noun, 3” in the above specific example.

Ｓ５０は「Ｓ１４の品詞と位置情報」であり、上記具体例では、「名詞，２」である。 S50 is “part of speech and position information of S14”, and “noun, 2” in the above specific example.

Ｓ５１は「Ｓ１７の品詞と位置情報」であり、上記具体例では、「名詞，６」である。 S51 is “part of speech and position information of S17”, and “noun, 6” in the above specific example.

Ｓ５２は「Ｓ１８の品詞と位置情報」であり、上記具体例では、「名詞，３」である。 S52 is “part of speech and position information of S18”, and “noun, 3” in the above specific example.

Ｓ５３は「日本語用語対の類似度」であり、上記具体例では、例えば、０．９である。 S53 is “similarity of Japanese term pairs”, and in the above specific example, for example, 0.9.

Ｓ５４は「スタッキングアルゴリズムを使用して、日本語用語対のＪＵＭＡＮ辞書の代表表記が一致するかどうか」を示す情報であり、上記具体例では、「１」である。つまり、ここでは、機械学習部１６が利用する教師あり機械学習法とは異なる分類方法は、用語対を構成する２つの用語の、ＪＵＭＡＮ辞書における代表表記が一致するか否かにより分類する以下の方法である。まず、ＪＵＭＡＮ辞書の単語の集合から、編集距離が１文字の単語対を取り出す。ここで、編集距離が１文字の単語対は、９０４６１２組、取り出せる。そのうち、代表表記が等しい単語対（２５９３４組）を取り出す。次に、ＪＵＭＡＮ辞書で、代表表記が等しい単語対を正例、そうでないものを負例とする。以上により、教師データを作成する。次に、その教師データを学習データとした機械学習を行う。なお、教師データは、上述した教師データに限らず、他の教師データを用いてもよい。また、機械学習の際に利用する素性は、本発明の全素性（Ｓ１からＳ６８の素性）のうち、Ｓ５４の素性を取り除いた素性を利用する。なお、機械学習の際に利用する素性は、他の素性を用いてもよい。そして、実際に、Ｓ５４の素性を付与したいデータを、上記学習結果を利用して、分類する。分類結果において正例となったか、負例となったかの情報をＳ５４の素性として、そのデータに付与する。スタッキングアルゴリズムによる方法では、ＪＵＭＡＮ辞書で、代表表記が一致するか否かについて学習した結果を素性として付与できるので、実際にＪＵＭＡＮ辞書に記載されていない用語対に対しても、ＪＵＭＡＮ辞書で、代表表記が一致するとされる傾向のある用語対か否かの情報を付与できることとなる。 S54 is information indicating "whether the representative notations of the Japanese term pair in the JUMAN dictionary match, as estimated using a stacking algorithm", and is "1" in the above specific example. That is, the classification method used here, which differs from the supervised machine learning method used by the machine learning unit 16, classifies a term pair according to whether the representative notations of its two terms in the JUMAN dictionary match, as follows. First, word pairs with an edit distance of one character are extracted from the set of words in the JUMAN dictionary; 904,612 such pairs can be extracted. Of these, the word pairs with the same representative notation (25,934 pairs) are extracted. Next, word pairs whose representative notations match in the JUMAN dictionary are taken as positive examples and all others as negative examples. This yields the teacher data. Machine learning is then performed using this teacher data as the learning data. The teacher data is not limited to that described above, and other teacher data may be used. The features used in this machine learning are all the features of the present invention (S1 to S68) except the feature S54, although other features may be used. The data to which the feature S54 is to be attached is then classified using the learning result, and whether it was classified as a positive example or a negative example is attached to that data as the feature S54. Because the stacking method attaches, as a feature, the result of learning whether representative notations match in the JUMAN dictionary, it can indicate, even for term pairs that are not actually listed in the JUMAN dictionary, whether the pair is of a kind whose representative notations tend to match there.
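The two-stage (stacking) idea for S54 can be sketched as follows. To keep the example short, the first-stage learner is replaced by a deliberately simple stand-in: a majority vote keyed on the edited portions (S1, S2). The text instead trains a machine learning model on the features other than S54, so this sketch shows only the data flow. All names (`train_first_stage`, `s54_feature`) and the toy training pairs are hypothetical.

```python
from collections import Counter, defaultdict

def train_first_stage(examples):
    """examples: iterable of ((edit1, edit2), label), where label is 1 if the
    dictionary gives both words the same representative notation, else 0.
    Returns a lookup table mapping each edit pattern to its majority label."""
    votes = defaultdict(Counter)
    for edit, label in examples:
        votes[edit][label] += 1
    return {edit: counts.most_common(1)[0][0] for edit, counts in votes.items()}

def s54_feature(model, edit, default=0):
    """Second stage input: the first-stage prediction becomes feature S54.
    Unseen edit patterns fall back to `default` (an assumption)."""
    return model.get(edit, default)

# Toy first-stage teacher data (invented for illustration only).
dictionary_pairs = [
    (("ウ", "ー"), 1), (("ウ", "ー"), 1), (("ウ", "ー"), 0),
    (("闘", "紛"), 0),
]
model = train_first_stage(dictionary_pairs)
```

The value returned by `s54_feature` would then be appended to the other features of a term pair before the main classifier is trained or applied, which is what allows pairs absent from the dictionary to receive an S54 value.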

Ｓ５５は「日本語用語対の文字数が同数で編集箇所が両方とも数字の場合であり、同じ値か違う値かどうか」であり、上記具体例では、「０」である。なお、「２次キャッシュ」と「二次キャッシュ」の用語対の場合、「一週間あたり」「１週間あたり」の用語対の場合は、Ｓ５５の素性は「１」となる。なお、文字数が同数である条件をはずし、Ｓ５５は、「日本語用語対の編集箇所が両方とも数字の場合であり、同じ値か違う値かどうか」が好適である。 S55 is "whether, when the two terms have the same number of characters and both edited portions are numerals, the numerals have the same value or different values", and is "0" in the above specific example. For the term pair 「２次キャッシュ」 and 「二次キャッシュ」, or the term pair 「一週間あたり」 and 「１週間あたり」, the feature S55 is "1". Preferably, the condition that the number of characters is the same is dropped, so that S55 is "whether, when both edited portions of the Japanese term pair are numerals, the numerals have the same value or different values".
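A check of this kind can be sketched with the standard `unicodedata.numeric` function, which returns the numeric value of full-width digits and of kanji numerals alike. The function name `s55_same_number` is hypothetical, and the sketch handles single-character edited portions only, which is an assumption.

```python
import unicodedata

def s55_same_number(edit1: str, edit2: str) -> int:
    """Return 1 if both edited portions are single numeric characters
    with the same value (e.g. full-width digit vs. kanji numeral), else 0."""
    try:
        value1 = unicodedata.numeric(edit1)
        value2 = unicodedata.numeric(edit2)
    except (TypeError, ValueError):
        # TypeError: not exactly one character; ValueError: not numeric.
        return 0
    return int(value1 == value2)
```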

Ｓ５６は「日本語用語対の文字数が同数で編集箇所が両方ともひらがなの場合であり、同じ音声か違う音声かどうか」であり、上記具体例では、「０」である。なお、「おかぁちゃん」「おかあちゃん」の用語対の場合は、Ｓ５６の素性は「１」となる。なお、文字数が同数である条件をはずし、Ｓ５６は、「日本語用語対の編集箇所が両方ともひらがなの場合であり、同じ音声か違う音声かどうか」が好適である。 S56 is "whether, when the two terms have the same number of characters and both edited portions are hiragana, they represent the same sound or different sounds", and is "0" in the above specific example. For the term pair 「おかぁちゃん」 and 「おかあちゃん」, the feature S56 is "1". Preferably, the condition that the number of characters is the same is dropped, so that S56 is "whether, when both edited portions of the Japanese term pair are hiragana, they represent the same sound or different sounds".

Ｓ５７は「日本語用語対の文字数が同数で編集箇所が両方ともカタカナの場合であり、同じ音声か違う音声かどうか」であり、上記具体例では、「１」である。なお、「オリーブ・オイル」「オリーヴ・オイル」の用語対の場合、「ウインドウ」「ウィンドウ」の用語対の場合も、Ｓ５７の素性は「１」となる。なお、文字数が同数である条件をはずし、Ｓ５７は、「日本語用語対の編集箇所が両方ともカタカナの場合であり、同じ音声か違う音声かどうか」が好適である。 S57 is "whether, when the two terms have the same number of characters and both edited portions are katakana, they represent the same sound or different sounds", and is "1" in the above specific example. For the term pair 「オリーブ・オイル」 and 「オリーヴ・オイル」, and for the term pair 「ウインドウ」 and 「ウィンドウ」, the feature S57 is also "1". Preferably, the condition that the number of characters is the same is dropped, so that S57 is "whether, when both edited portions of the Japanese term pair are katakana, they represent the same sound or different sounds".

Ｓ５８は「日本語用語対の文字数が同数で編集箇所が両方ともローマ字の場合であり、大文字と小文字の違いだけかどうか」であり、上記具体例では、「０」である。なお、「３００ｋｂｐｓ」「３００Ｋｂｐｓ」の用語対の場合、「Ｗｉｎｄｏｗｓ上」「ｗｉｎｄｏｗｓ上」の用語対の場合は、Ｓ５８の素性は「１」となる。なお、文字数が同数である条件をはずし、Ｓ５８は、「日本語用語対の編集箇所が両方ともローマ字の場合であり、大文字と小文字の違いだけかどうか」が好適である。また、「Ｗｉｎｄｏｗｓ」は登録商標です。 S58 is "whether, when the two terms have the same number of characters and both edited portions are Roman letters, the only difference is between uppercase and lowercase", and is "0" in the above specific example. For the term pair 「３００ｋｂｐｓ」 and 「３００Ｋｂｐｓ」, or the term pair 「Ｗｉｎｄｏｗｓ上」 and 「ｗｉｎｄｏｗｓ上」, the feature S58 is "1". Preferably, the condition that the number of characters is the same is dropped, so that S58 is "whether, when both edited portions of the Japanese term pair are Roman letters, the only difference is between uppercase and lowercase". "Windows" is a registered trademark.
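The case-only check can be sketched as below. Because the examples use full-width letters, the sketch first applies NFKC normalization to fold them to ASCII; that normalization step and the function name `s58_case_only` are assumptions made for illustration.

```python
import unicodedata

def s58_case_only(edit1: str, edit2: str) -> int:
    """Return 1 if both edited portions are Roman letters that differ
    only in letter case, else 0."""
    a = unicodedata.normalize("NFKC", edit1)  # full-width -> ASCII
    b = unicodedata.normalize("NFKC", edit2)
    both_roman = a.isascii() and a.isalpha() and b.isascii() and b.isalpha()
    if not both_roman:
        return 0
    return int(a != b and a.lower() == b.lower())
```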

Ｓ５９は「日本語用語対の文字数が同数で一方の編集箇所に濁点をつけるともう一方の編集箇所になるかどうか」であり、上記具体例では、「０」である。なお、「触れるくらい」「触れるぐらい」の用語対の場合、「飲むぐらい」「飲むくらい」の用語対の場合は、Ｓ５９の素性は「１」となる。なお、Ｓ５９は、文字数が同数である条件をはずし、「日本語用語対の一方の編集箇所に濁点をつけるともう一方の編集箇所になるかどうか」が好適である。 S59 is "whether, when the two terms have the same number of characters, adding a dakuten (voiced sound mark) to one edited portion yields the other edited portion", and is "0" in the above specific example. For the term pair 「触れるくらい」 and 「触れるぐらい」, or the term pair 「飲むぐらい」 and 「飲むくらい」, the feature S59 is "1". Preferably, the condition that the number of characters is the same is dropped, so that S59 is "whether adding a dakuten to one edited portion of the Japanese term pair yields the other edited portion".

Ｓ６０は「日本語用語対の文字数が同数で一方の編集箇所に半濁点をつけるともう一方の編集箇所になるかどうか」であり、上記具体例では、「０」である。なお、Ｓ６０は、文字数が同数である条件をはずし、「日本語用語対の一方の編集箇所に半濁点をつけるともう一方の編集箇所になるかどうか」が好適である。 S60 is "whether, when the two terms have the same number of characters, adding a handakuten (semi-voiced sound mark) to one edited portion yields the other edited portion", and is "0" in the above specific example. Preferably, the condition that the number of characters is the same is dropped, so that S60 is "whether adding a handakuten to one edited portion of the Japanese term pair yields the other edited portion".
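The dakuten and handakuten checks of S59 and S60 can be sketched with Unicode composition: appending the combining mark U+3099 or U+309A and normalizing to NFC produces the voiced or semi-voiced kana when one exists. The function names are hypothetical, and the sketch treats single-character edited portions only.

```python
import unicodedata

DAKUTEN = "\u3099"     # combining voiced sound mark
HANDAKUTEN = "\u309A"  # combining semi-voiced sound mark

def _adds_mark(plain: str, marked: str, mark: str) -> bool:
    """True if `plain` plus the combining mark composes to `marked`."""
    return unicodedata.normalize("NFC", plain + mark) == marked

def _mark_feature(edit1: str, edit2: str, mark: str) -> int:
    if len(edit1) != 1 or len(edit2) != 1 or edit1 == edit2:
        return 0
    # Either side may be the one that carries the mark.
    return int(_adds_mark(edit1, edit2, mark) or _adds_mark(edit2, edit1, mark))

def s59_dakuten(edit1: str, edit2: str) -> int:
    return _mark_feature(edit1, edit2, DAKUTEN)

def s60_handakuten(edit1: str, edit2: str) -> int:
    return _mark_feature(edit1, edit2, HANDAKUTEN)
```

For 「触れるくらい」 and 「触れるぐらい」 the edited portions are 「く」 and 「ぐ」, so `s59_dakuten` returns 1 regardless of the order of the two terms.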

Ｓ６１は「編集箇所が日本語用語対の一方にしかなく、その編集箇所が'化'、'系'、'類'、'型'、'形'、'氏'、'ー'、'・'かどうか」であり、上記具体例では、「０」である。なお、「サーバ」「サーバー」の用語対の場合、「ハンセン病患者」「ハンセン氏病患者」の用語対の場合、「日本語パッチ」「日本語化パッチ」の用語対の場合、「３０種類ほど」「３０種ほど」の用語対の場合は、Ｓ６１の素性は「１」となる。 S61 is "whether the edited portion exists in only one term of the Japanese term pair and that edited portion is '化', '系', '類', '型', '形', '氏', 'ー', or '・'", and is "0" in the above specific example. For the term pairs 「サーバ」 and 「サーバー」, 「ハンセン病患者」 and 「ハンセン氏病患者」, 「日本語パッチ」 and 「日本語化パッチ」, and 「３０種類ほど」 and 「３０種ほど」, the feature S61 is "1".

Ｓ６２は「編集箇所が日本語用語対の一方にしかなく、その編集箇所の用語が日本語用語対の最後の文字と一致するかどうか」であり、上記具体例では、「０」である。なお、「妊娠・授乳中」「妊娠中・授乳中」の用語対の場合、「国産・輸入車」「国産車・輸入車」の用語対の場合は、Ｓ６２の素性は「１」となる。 S62 is "whether the edited portion exists in only one term of the Japanese term pair and the text of that edited portion matches the last character of the terms", and is "0" in the above specific example. For the term pair 「妊娠・授乳中」 and 「妊娠中・授乳中」, or the term pair 「国産・輸入車」 and 「国産車・輸入車」, the feature S62 is "1".

Ｓ６３は「編集箇所が日本語用語対の一方にしかなく、その編集箇所が桁数をあらわす用語かどうか（例えば、"千""万"など）」であり、上記具体例では、「０」である。なお、「２万５０００人」「２５０００人」の用語対の場合、「１万６５００円」「１６５００円」の用語対の場合は、Ｓ６３の素性は「１」となる。 S63 is "whether the edited portion exists in only one term of the Japanese term pair and that edited portion is a term expressing a digit position (for example, 千 (thousand) or 万 (ten thousand))", and is "0" in the above specific example. For the term pair 「２万５０００人」 and 「２５０００人」, or the term pair 「１万６５００円」 and 「１６５００円」, the feature S63 is "1".
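The three one-sided-edit features S61 to S63 can be sketched together, taking the edited portions as input, with an empty string standing for the side that has no edited portion. The function names are hypothetical. The affix set is the list given for S61; the digit-unit set extends the two examples in the text (千, 万) with 十, 百, 億, and 兆, which is an assumption.

```python
AFFIXES = {"化", "系", "類", "型", "形", "氏", "ー", "・"}   # from the S61 definition
DIGIT_UNITS = {"十", "百", "千", "万", "億", "兆"}          # assumed extension of "千", "万", etc.

def one_sided_insert(edit1: str, edit2: str):
    """Return the inserted string if exactly one side is empty, else None."""
    if bool(edit1) == bool(edit2):
        return None
    return edit1 or edit2

def s61_affix(edit1: str, edit2: str) -> int:
    return int(one_sided_insert(edit1, edit2) in AFFIXES)

def s62_matches_last_char(term1: str, term2: str, edit1: str, edit2: str) -> int:
    inserted = one_sided_insert(edit1, edit2)
    return int(inserted is not None
               and term1.endswith(inserted) and term2.endswith(inserted))

def s63_digit_unit(edit1: str, edit2: str) -> int:
    return int(one_sided_insert(edit1, edit2) in DIGIT_UNITS)
```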

Ｓ６４は「日本語用語対のＪＵＭＡＮ辞書の定義されている代表表記が一致するかどうか」であり、上記具体例（用語対「ショウウインドウ」「ショーウインドウ」）では、例えば、「１」である。なお、素性取得手段１５２は、用語対を構成する各用語の代表表記を、用語辞書１３から取得し、比較することにより、素性を取得する。 S64 is "whether the representative notations defined in the JUMAN dictionary for the two terms of the Japanese term pair match", and in the above specific example (the term pair 「ショウウインドウ」 and 「ショーウインドウ」) it is, for example, "1". The feature acquisition unit 152 acquires this feature by obtaining the representative notation of each term of the pair from the term dictionary 13 and comparing them.
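The lookup-and-compare step for S64 can be sketched as a plain dictionary access. The record layout, the key name `representative`, and the two toy entries are invented for illustration and are not taken from the actual JUMAN dictionary; treating a term missing from the dictionary as a non-match is also an assumption.

```python
def s64_same_representative(term_dictionary: dict, term1: str, term2: str) -> int:
    """Return 1 if both terms are in the dictionary and share the same
    representative notation, else 0."""
    rep1 = term_dictionary.get(term1, {}).get("representative")
    rep2 = term_dictionary.get(term2, {}).get("representative")
    return int(rep1 is not None and rep1 == rep2)

# Hypothetical records, one per term.
toy_dictionary = {
    "ショウウインドウ": {"representative": "ショーウインドウ"},
    "ショーウインドウ": {"representative": "ショーウインドウ"},
    "闘争": {"representative": "闘争"},
}
```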

Ｓ６５は「日本語用語対が日本語ワードネット辞書に類義語対として定義されているかどうか」であり、上記具体例（用語対「ショウウインドウ」「ショーウインドウ」）では、例えば、「１」である。素性取得手段１５２は、用語対を構成する各用語をキーとして、日本語ワードネット辞書を検索し、類義語対として定義されているか否かを判断する。本処理は、通常の検索処理である。 S65 is "whether the Japanese term pair is defined as a synonym pair in the Japanese WordNet dictionary", and in the above specific example (the term pair 「ショウウインドウ」 and 「ショーウインドウ」) it is, for example, "1". The feature acquisition unit 152 searches the Japanese WordNet dictionary using each term of the pair as a key and determines whether the pair is defined as a synonym pair. This is an ordinary search process.

Ｓ６６は「日本語用語対の編集箇所が異体字辞書に異体字として定義されているかどうか」であり、上記具体例（用語対「ショウウインドウ」「ショーウインドウ」）では、例えば、「０」である。異体字辞書は、２つの異体字の対の情報を有する。 S66 is "whether the edited portions of the Japanese term pair are defined as variant characters in the variant character dictionary", and in the above specific example (the term pair 「ショウウインドウ」 and 「ショーウインドウ」) it is, for example, "0". The variant character dictionary holds information on pairs of variant characters.

Ｓ６７は「日本語用語対の文字数が同数で編集箇所が漢字とひらがなの場合であり、ＪＵＭＡＮ辞書の読みが一致するかどうか」であり、上記具体例（用語対「ショウウインドウ」「ショーウインドウ」）では、「０」である。 S67 is "whether, when the two terms have the same number of characters and the edited portions are a kanji and a hiragana, their readings in the JUMAN dictionary match", and is "0" in the above specific example (the term pair 「ショウウインドウ」 and 「ショーウインドウ」).

Ｓ６８は「日本語用語対の文字数が同数で編集箇所が両方とも漢字の場合であり、ＪＵＭＡＮ辞書の読みが一致するかどうか」であり、上記具体例（用語対「ショウウインドウ」「ショーウインドウ」）では、「０」である。 S68 is "whether, when the two terms have the same number of characters and both edited portions are kanji, their readings in the JUMAN dictionary match", and is "0" in the above specific example (the term pair 「ショウウインドウ」 and 「ショーウインドウ」).

また、上記の６８の素性は、上述したように、字種関連素性、辞書関連素性、類似度素性、編集箇所文字素性などが含まれる。 Further, the above 68 features include a character type related feature, a dictionary related feature, a similarity feature, an edited portion character feature, and the like as described above.

また、上記の６８の素性をグループ化すると、例えば、以下のＧ１からＧ７のグループに分かれる、と考えられる。 Further, when the above 68 features are grouped, for example, it can be considered that they are divided into the following groups G1 to G7.

Ｇ１は、Ｓ１からＳ５２の素性であり、編集箇所とその周辺の文字列に関する情報である編集箇所文字素性である。 G1 is a feature from S1 to S52, and is an edited portion character feature that is information about the edited portion and the surrounding character string.

Ｇ２は、Ｓ５３の素性であり、類似度素性である。 G2 is a feature of S53 and a similarity feature.

Ｇ３は、Ｓ５４の素性であり、スタッキングアルゴリズムを使用した情報である素性である。 G3 is a feature of S54 and is a feature that is information using a stacking algorithm.

Ｇ４は、Ｓ５５からＳ６０の素性であり、編集箇所に関する情報である編集箇所関連素性である。 G4 is the feature of S55 to S60, and is an edit location related feature that is information related to the edit location.

Ｇ５は、Ｓ６１からＳ６３の素性であり、用語対のパターンに関する情報である用語対パターン素性である。 G5 is a feature of S61 to S63, and is a term pair pattern feature which is information on a term pair pattern.

Ｇ６は、Ｓ６４からＳ６６の素性であり、種々の辞書による情報である辞書関連素性である。 G6 is a feature from S64 to S66, and is a dictionary-related feature that is information by various dictionaries.

Ｇ７は、Ｓ６７からＳ６８の素性であり、読みに関する情報である読み関連素性である。 G7 is a feature from S67 to S68, and is a reading-related feature that is information related to reading.

そして、まず、異表記取得装置１において、予め正しい異表記の用語対のデータ（正例）を人手で構築しておき、正例の用語対と、正例であることを示す正負情報（例えば、「１」）とを対応付けて、学習データ格納部１２に格納しておく。また、異表記取得装置１において、予め異表記でない用語対のデータ（負例）を人手で構築しておき、負例の用語対と、負例であることを示す正負情報（例えば、「０」）とを対応付けて、学習データ格納部１２に格納しておく。 First, in the different notation acquisition apparatus 1, data of term pairs that are correct different notations (positive examples) is constructed manually in advance, and each positive term pair is stored in the learning data storage unit 12 in association with positive/negative information indicating that it is a positive example (for example, "1"). Likewise, data of term pairs that are not different notations (negative examples) is constructed manually in advance, and each negative term pair is stored in the learning data storage unit 12 in association with positive/negative information indicating that it is a negative example (for example, "0").

次に、異表記取得装置１の素性取得部１５により、各用語対の、上述した６８の素性を取得し、正負情報または用語対と対応付けて、６８の素性を学習データ格納部１２に蓄積する。 Next, the above-described 68 features of each term pair are acquired by the feature acquisition unit 15 of the different notation acquisition device 1, and the 68 features are stored in the learning data storage unit 12 in association with the positive / negative information or the term pair. To do.

以上の処理により、学習データ格納部１２の学習データが構築された。 Through the above processing, learning data in the learning data storage unit 12 is constructed.

次に、異表記の用語対であるか否かを判断したい１以上の用語対を用語対格納部１１に格納する。 Next, the term pair storage unit 11 stores one or more term pairs for which it is desired to determine whether the term pairs are different notations.

そして、ユーザは、異表記取得装置１に、動作開始の指示を入力する。すると、受付部１４は、動作開始の指示を受け付ける。 Then, the user inputs an operation start instruction to the different notation acquisition apparatus 1. Then, the reception unit 14 receives an operation start instruction.

次に、用語対格納部１１に格納されている用語対を順に、以下のように処理する。つまり、素性取得部１５は、各用語対の素性を取得する処理を行う。かかる素性取得処理については説明済みである。 Next, the term pairs stored in the term pair storage unit 11 are sequentially processed as follows. That is, the feature acquisition unit 15 performs a process of acquiring the feature of each term pair. Such a feature acquisition process has already been described.

次に、機械学習部１６は、取得された６８の素性と、学習データ格納部１２の学習データとを用いて、教師あり機械学習を行い、スコアを取得する。 Next, the machine learning unit 16 performs supervised machine learning using the acquired 68 features and the learning data stored in the learning data storage unit 12 to obtain a score.

次に、出力部１７は、取得された各用語対のスコアを用いて、各用語対は異表記の用語対であるか否かを判断する。 Next, the output unit 17 determines whether each term pair is a term pair of different notation using the acquired score of each term pair.

次に、出力部１７は、異表記の用語対であると判断した用語対のみ出力する。ここで、出力とは、たとえば、予め決められた記憶媒体への蓄積である。 Next, the output unit 17 outputs only the term pairs determined to be term pairs of different notation. Here, output means, for example, storage in a predetermined storage medium.

（機械学習について） (About machine learning)

以下、機械学習部１６が行う機械学習、および機械学習部１６が行う教師あり機械学習法とは異なる分類方法（スタッキングアルゴリズムで利用）について説明する。 Hereinafter, the machine learning performed by the machine learning unit 16 and the classification method (used in the stacking algorithm) different from the supervised machine learning method performed by the machine learning unit 16 will be described.

まず、機械学習法とは、問題-解の組のセットを多く用意し、それで学習を行ない、どういう問題のときにどういう解になるかを学習し、その学習結果を利用して、新しい問題のときも解を推測できるようにする方法である(例えば、非特許文献２〜非特許文献４参照)。 First, a machine learning method prepares many problem-solution pairs, learns from them which solution corresponds to which kind of problem, and uses the learning result so that the solution can also be estimated for a new problem (see, for example, Non-Patent Documents 2 to 4).

どういう問題のときにどういう解になるかという、問題の状況を機械に伝える際に、素性(解析に用いる情報で問題を構成する各要素)が必要になる。問題を素性によって表現するのである。 To convey the situation of a problem to the machine, that is, which solution corresponds to which kind of problem, features are needed (the individual elements that make up the problem, as information used in the analysis). The problem is expressed by its features.

すなわち、機械学習の手法は、素性の集合-解の組のセットを多く用意し、それで学習を行ない、どういう素性の集合のときにどういう解になるかを学習し、その学習結果を利用して、新しい問題のときもその問題から素性の集合を取り出し、その素性の場合の解を推測する方法である。 In other words, the machine learning method prepares many sets of feature set-solution pairs, learns with it, learns what kind of feature set the solution will be, and uses the learning results. This is a method of extracting a set of features from a new problem and inferring a solution in the case of the feature.

機械学習の手法として、例えば、ｋ近傍法、シンプルベイズ法、決定リスト法、最大エントロピー法、サポートベクトルマシン法などの手法を用いることができる。 As the machine learning method, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method can be used.

ｋ近傍法は、最も類似する一つの事例のかわりに、最も類似するｋ個の事例を用いて、このｋ個の事例での多数決によって分類先（解）を求める手法である。ｋは、あらかじめ定める整数の数字であって、一般的に、１から９の間の奇数を用いる。 The k-nearest neighbor method is a method for obtaining a classification destination (solution) by using the k most similar cases instead of the most similar case, and by majority decision of the k cases. k is a predetermined integer number, and generally an odd number between 1 and 9 is used.
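The majority vote among the k most similar cases can be sketched as follows. The text does not define the similarity measure, so the sketch assumes one: the number of feature values shared by the two cases. The function name `knn_classify` and the toy cases are hypothetical.

```python
from collections import Counter

def knn_classify(training_cases, query_features, k=3):
    """training_cases: list of (feature_dict, label).
    Returns the majority label among the k cases most similar to the query."""
    def similarity(case_features):
        # Assumed measure: count of features whose values agree.
        return sum(1 for name, value in case_features.items()
                   if query_features.get(name) == value)

    ranked = sorted(training_cases, key=lambda case: similarity(case[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy cases (invented): label 1 = different-notation pair, 0 = not.
cases = [
    ({"S1": "ウ", "S2": "ー"}, 1),
    ({"S1": "ウ", "S2": "ー", "S29": "katakana"}, 1),
    ({"S1": "闘", "S2": "紛"}, 0),
    ({"S1": "闘", "S2": "紛", "S29": "kanji"}, 0),
    ({"S1": "ウ", "S2": "ィ"}, 0),
]
```

Using an odd k, as the text suggests, avoids ties in a two-class vote.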

シンプルベイズ法は、ベイズの定理にもとづいて各分類になる確率を推定し、その確率値が最も大きい分類を求める分類先とする方法である。 The Simple Bayes method is a method of estimating the probability of each classification based on Bayes' theorem and determining the classification having the highest probability value as a classification destination.

シンプルベイズ法において、文脈ｂで分類ａを出力する確率は、以下の数式１で与えられる。
In the simple Bayes method, the probability of outputting the classification a in the context b is given by the following formula 1.
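Equation 1 itself is not reproduced in this text. A form consistent with the definitions that follow (the features f_i of context b, the constant p(b), and the estimated probabilities written with a tilde) would be the usual simple Bayes factorization; it is given here as a reconstruction under that assumption, not as a transcription of the original equation:

```latex
\begin{aligned}
p(a \mid b) &= \frac{p(a)}{p(b)} \prod_{i} p(f_i \mid a) \\
            &\simeq \frac{\tilde{p}(a)}{p(b)} \prod_{i} \tilde{p}(f_i \mid a)
\end{aligned}
```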

ただし、ここで文脈ｂは、あらかじめ設定しておいた素性ｆ_j（∈Ｆ，１≦ｊ≦ｋ）の集合である。ｐ（ｂ）は、文脈ｂの出現確率である。ここで、分類ａに非依存であって定数のために計算しない。Ｐ（ａ）（ここでＰはｐの上部にチルダ）とＰ（ｆ_i｜ａ）は、それぞれ教師データ（判断情報と同意義）から推定された確率であって、分類ａの出現確率、分類ａのときに素性ｆ_iを持つ確率を意味する。Ｐ（ｆ_i｜ａ）として最尤推定を行って求めた値を用いると、しばしば値がゼロとなり、数式２の２行目の式の値がゼロで分類先を決定することが困難な場合が生じる。そのため、スームージングを行う。ここでは、以下の数式２を用いてスームージングを行ったものを用いる。
Here, the context b is a set of preset features f_j (∈ F, 1 ≤ j ≤ k). p(b) is the probability of occurrence of the context b; it is independent of the classification a and constant, so it is not calculated. P(a) (here P stands for p with a tilde above it) and P(f_i|a) are probabilities estimated from the teacher data (synonymous with judgment information): the probability of occurrence of the classification a, and the probability of having the feature f_i when the classification is a, respectively. If values obtained by maximum likelihood estimation are used for P(f_i|a), the value is often zero, in which case the value of the expression on the second line of Equation 2 becomes zero and it can be difficult to determine the classification. Smoothing is therefore performed. Here, values smoothed using Equation 2 below are used.

ただし、ｆｒｅｑ（ｆ_i，ａ）は、素性ｆ_iを持ちかつ分類がａである事例の個数、ｆｒｅｑ（ａ）は、分類がａである事例の個数を意味する。 Here, freq (f _i , a) means the number of cases having the feature f _i and the classification a, and freq (a) means the number of cases having the classification a.
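A simple Bayes classifier built from the counts freq(f_i, a) and freq(a) can be sketched as below. Equation 2 is not reproduced in this text, so the smoothing used here, adding a small constant `eps` times freq(a) to the numerator and twice that to the denominator, is an assumed stand-in and not the patent's formula. p(b) is omitted because it is constant across classes, as the text notes. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_simple_bayes(examples):
    """examples: list of (set_of_features, label).
    Returns (freq_a, freq_fa): class counts and (feature, class) counts."""
    freq_a, freq_fa = Counter(), Counter()
    for features, label in examples:
        freq_a[label] += 1
        for f in features:
            freq_fa[(f, label)] += 1
    return freq_a, freq_fa

def classify_simple_bayes(model, features, eps=0.01):
    """Return the class a maximizing log p~(a) + sum_i log p~(f_i | a)."""
    freq_a, freq_fa = model
    total = sum(freq_a.values())
    best_label, best_score = None, -math.inf
    for a, count_a in freq_a.items():
        score = math.log(count_a / total)
        for f in features:
            # Assumed smoothing; keeps the probability strictly positive.
            p = (freq_fa[(f, a)] + eps * count_a) / (count_a + 2 * eps * count_a)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = a, score
    return best_label

toy_examples = [
    ({"S1=ウ", "S2=ー"}, 1), ({"S1=ウ", "S2=ー"}, 1),
    ({"S1=闘", "S2=紛"}, 0), ({"S1=闘", "S2=紛"}, 0),
]
toy_model = train_simple_bayes(toy_examples)
```

Working in log space avoids numeric underflow when many features are multiplied, and the smoothing prevents a single unseen feature from forcing a class probability to zero, which is the problem the text describes.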

決定リスト法は、素性と分類先の組とを規則とし、それらをあらかじめ定めた優先順序でリストに蓄えおき、検出する対象となる入力が与えられたときに、リストで優先順位の高いところから入力のデータと規則の素性とを比較し、素性が一致した規則の分類先をその入力の分類先とする方法である。 The decision list method uses features and combinations of classification destinations as rules, stores them in the list in a predetermined priority order, and when input to be detected is given, from the highest priority in the list This is a method in which input data is compared with the feature of the rule, and the classification destination of the rule having the same feature is set as the classification destination of the input.

決定リスト方法では、あらかじめ設定しておいた素性ｆ_j(∈Ｆ，１≦ｊ≦ｋ）のうち、いずれか一つの素性のみを文脈として各分類の確率値を求める。ある文脈ｂで分類ａを出力する確率は以下の数式３によって与えられる。
In the decision list method, the probability value of each classification is obtained using only one of the preset features f_j (∈ F, 1 ≤ j ≤ k) as the context. The probability of outputting the classification a in a context b is given by Equation 3 below.

ただし、ｆ_maxは以下の数式４によって与えられる。
However, f _max is given by Equation 4 below.
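Equations 3 and 4 are not reproduced in this text. Given the description (only one feature is used as the context, and the estimated probabilities are written with a tilde), a consistent form would be the following, where A is taken to be the set of classifications; this is a reconstruction under those assumptions rather than a transcription:

```latex
\begin{aligned}
p(a \mid b) &= p(a \mid f_{max}) \simeq \tilde{p}(a \mid f_{max}) \\
f_{max} &= \operatorname*{argmax}_{f_j \in F} \; \max_{a_i \in A} \; \tilde{p}(a_i \mid f_j)
\end{aligned}
```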

また、Ｐ（ａ_i｜ｆ_j）（ここでＰはｐの上部にチルダ）は、素性ｆ_jを文脈に持つ場合の分類ａ_iの出現の割合である。 P(a_i|f_j) (here P stands for p with a tilde above it) is the proportion of occurrences of the classification a_i among the cases that have the feature f_j in their context.
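A decision list of this kind can be sketched by turning every (feature, classification) pair seen in the training data into a rule, scoring it by the proportion of cases with that feature that carry that classification, and ordering the rules by score. Ordering by this proportion is one way to realize the "predetermined priority order"; it is an assumption here, as are the function names and the toy data.

```python
from collections import Counter

def train_decision_list(examples):
    """examples: list of (set_of_features, label).
    Returns rules (proportion, feature, label) sorted by descending proportion."""
    feature_total, feature_label = Counter(), Counter()
    for features, label in examples:
        for f in features:
            feature_total[f] += 1
            feature_label[(f, label)] += 1
    rules = [(count / feature_total[f], f, label)
             for (f, label), count in feature_label.items()]
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules

def classify_decision_list(rules, features, default=None):
    """Apply the first (highest-priority) rule whose feature is present."""
    for _, f, label in rules:
        if f in features:
            return label
    return default

dl_examples = [
    ({"S1=ウ", "S29=katakana"}, 1),
    ({"S1=ウ", "S29=katakana"}, 1),
    ({"S1=ィ", "S29=katakana"}, 0),
    ({"S1=闘", "S29=kanji"}, 0),
]
dl_rules = train_decision_list(dl_examples)
```

Because the rules are scanned from the highest proportion down, the first match corresponds to the single most reliable feature in the input, so only that one feature decides the classification.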

最大エントロピー法は、あらかじめ設定しておいた素性ｆ_j（１≦ｊ≦ｋ）の集合をＦとするとき、以下の所定の条件式（数式５）を満足しながらエントロピーを意味する式（数式６）を最大にするときの確率分布ｐ（ａ，ｂ）を求め、その確率分布にしたがって求まる各分類の確率のうち、最も大きい確率値を持つ分類を求める分類先とする方法である。
In the maximum entropy method, letting F be the set of preset features f_j (1 ≤ j ≤ k), the probability distribution p(a, b) that maximizes the expression representing entropy (Equation 6) while satisfying the given conditional expression (Equation 5) is obtained, and among the probabilities of the classifications derived from that distribution, the classification with the largest probability value is taken as the classification destination.

ただし、Ａ、Ｂは分類と文脈の集合を意味し、ｇ_j（ａ，ｂ）は文脈ｂに素性ｆ_jがあって、なおかつ分類がａの場合１となり、それ以外で０となる関数を意味する。また、Ｐ（ａ_i｜ｆ_j）（ここでＰはｐの上部にチルダ）は、既知データでの（ａ，ｂ）の出現の割合を意味する。 Here, A and B denote the set of classifications and the set of contexts, and g_j(a, b) is a function that equals 1 when the context b contains the feature f_j and the classification is a, and 0 otherwise. The tilde quantity (p with a tilde above it; the text writes P(a_i | f_j), although the description suggests that p̃(a, b) is meant) is the proportion of occurrences of (a, b) in the known data.

数式５は、確率ｐと出力と素性の組の出現を意味する関数ｇをかけることで出力と素性の組の頻度の期待値を求めることになっており、右辺の既知データにおける期待値と、左辺の求める確率分布に基づいて計算される期待値が等しいことを制約として、エントロピー最大化(確率分布の平滑化)を行なって、出力と文脈の確率分布を求めるものとなっている。最大エントロピー法の詳細については、以下の参考文献（１）および参考文献（２）に記載されている。 Equation 5 multiplies the probability p by the function g, which indicates the occurrence of an output-feature pair, to obtain the expected frequency of that pair. Under the constraint that the expected value in the known data (right-hand side) equals the expected value computed from the probability distribution being sought (left-hand side), entropy is maximized (that is, the probability distribution is smoothed) to obtain the probability distribution over outputs and contexts. Details of the maximum entropy method are given in References (1) and (2) below.
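The images for Equations 5 and 6 are not reproduced in this text. Based on the description above (expected feature counts under p on the left, under the known data on the right, and an entropy objective), they can be sketched as follows. The notation p̃(a, b) for the known-data proportion is an assumption.

```latex
% Equation 5 (reconstructed): expectation constraint for every feature f_j.
\sum_{a \in A,\, b \in B} p(a, b)\, g_j(a, b)
  = \sum_{a \in A,\, b \in B} \tilde{p}(a, b)\, g_j(a, b)
  \qquad (1 \le j \le k)

% Equation 6 (reconstructed): entropy to be maximized.
H(p) = - \sum_{a \in A,\, b \in B} p(a, b) \log p(a, b)
```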

参考文献（１）：Eric Sven Ristad, Maximum Entropy Modeling for Natural Language, (ACL/EACL Tutorial Program, Madrid, 1997) Reference (1): Eric Sven Ristad, Maximum Entropy Modeling for Natural Language (ACL/EACL Tutorial Program, Madrid, 1997)

参考文献（２）：Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release1.6beta, (http://www.mnemonic.com/software/memt,1998) Reference (2): Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta (http://www.mnemonic.com/software/memt, 1998)

サポートベクトルマシン法は、空間を超平面で分割することにより、二つの分類からなるデータを分類する手法である。 The support vector machine method is a method of classifying data composed of two classifications by dividing a space by a hyperplane.

図５にサポートベクトルマシン法のマージン最大化の概念を示す。図５において、白丸は正例、黒丸は負例を意味し、実線は空間を分割する超平面を意味し、破線はマージン領域の境界を表す面を意味する。図５（Ａ）は、正例と負例の間隔が狭い場合（スモールマージン）の概念図、図５（Ｂ）は、正例と負例の間隔が広い場合（ラージマージン）の概念図である。 FIG. 5 illustrates the concept of margin maximization in the support vector machine method. In FIG. 5, white circles are positive examples, black circles are negative examples, the solid line is the hyperplane that divides the space, and the broken lines are the surfaces bounding the margin region. FIG. 5(A) is a conceptual diagram of the case where the gap between positive and negative examples is narrow (small margin), and FIG. 5(B) is a conceptual diagram of the case where the gap is wide (large margin).

このとき、二つの分類が正例と負例からなるものとすると、学習データにおける正例と負例の間隔（マージン)が大きいものほどオープンデータで誤った分類をする可能性が低いと考えられ、図５（Ｂ）に示すように、このマージンを最大にする超平面を求めそれを用いて分類を行なう。 At this time, if the two classifications consist of positive examples and negative examples, the larger the interval (margin) between the positive examples and the negative examples in the learning data, the lower the possibility of incorrect classification with open data. As shown in FIG. 5B, a hyperplane that maximizes this margin is obtained, and classification is performed using it.

基本的には上記のとおりであるが、通常、学習データにおいてマージンの内部領域に少数の事例が含まれてもよいとする手法の拡張や、超平面の線形の部分を非線型にする拡張（カーネル関数の導入)がなされたものが用いられる。 This is the basic idea. In practice, however, an extended version is normally used: one extension allows a small number of training cases to fall inside the margin region, and another makes the linear part of the hyperplane nonlinear (by introducing a kernel function).

この拡張された方法は、以下の識別関数（ｆ（ｘ））を用いて分類することと等価であり、その識別関数の出力値が正か負かによって二つの分類を判別することができる。
This extended method is equivalent to classification using the following discriminant function (f (x)), and the two classes can be discriminated depending on whether the output value of the discriminant function is positive or negative.

ただし、ｘは識別したい事例の文脈（素性の集合)を、ｘ_iとｙ_j（ｉ＝１，…，ｌ，ｙj∈｛１，−１｝）は学習データの文脈と分類先を意味し、関数ｓｇｎは、
ｓｇｎ（ｘ）＝１（ｘ≧０）
−１（otherwise）
であり、また、各α_iは数式８の式（８−２）と式（８−３）の制約のもと、式（８−１）を最大にする場合のものである。
Here, x is the context (set of features) of the case to be classified, x_i and y_i (i = 1, …, l; y_i ∈ {1, −1}) are the contexts and classification destinations of the training data, and the function sgn is
sgn(x) = 1 (x ≥ 0)
sgn(x) = −1 (otherwise).
Each α_i is the value that maximizes Expression (8-1) under the constraints of Expressions (8-2) and (8-3) of Equation 8.

また、関数Ｋはカーネル関数と呼ばれ、様々なものが用いられるが、本形態では、例えば、以下の多項式（数式９）のものを用いる。
The function K is called a kernel function, and various functions are used. In this embodiment, for example, the following polynomial (formula 9) is used.
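The images for Equations 7 to 9 are not reproduced in this text. Assuming the standard soft-margin support vector machine formulation, which matches the surrounding definitions of sgn, α_i, y_i, C, d, and K, they can be sketched as follows. The exact form of the bias term b and of the polynomial kernel are assumptions.

```latex
% Equation 7 (reconstructed): discriminant function.
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right),
\quad
b = -\frac{\max_{i,\, y_i = -1} b_i + \min_{i,\, y_i = 1} b_i}{2},
\quad
b_i = \sum_{j=1}^{l} \alpha_j y_j K(x_j, x_i)

% Equation 8 (reconstructed): dual objective (8-1) and constraints (8-2), (8-3).
L(\alpha) = \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\qquad
0 \le \alpha_i \le C \quad (i = 1, \dots, l)
\qquad
\sum_{i=1}^{l} \alpha_i y_i = 0

% Equation 9 (reconstructed): polynomial kernel of degree d.
K(x, y) = (x \cdot y + 1)^{d}
```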

数式８、数式９において、Ｃ、ｄは実験的に設定される定数である。例えば、Ｃはすべての処理を通して１に固定した。また、ｄは、１と２の二種類を試している。ここで、α_i＞０となるｘ_iは、サポートベクトルと呼ばれ、通常、数式７の和をとっている部分は、この事例のみを用いて計算される。つまり、実際の解析には学習データのうちサポートベクトルと呼ばれる事例のみしか用いられない。 In Equations 8 and 9, C and d are experimentally set constants. For example, C was fixed at 1 throughout all processing, and two values of d, 1 and 2, were tried. The x_i for which α_i > 0 are called support vectors, and the summation in Equation 7 is normally computed using only these cases. In other words, only the training cases called support vectors are used in the actual analysis.
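As a minimal sketch of how the discriminant function uses only the support vectors, the following Python code evaluates the sign of the kernel-weighted sum. The kernel form (x·y + 1)^d, the bias argument `b`, and all function names are assumptions made for illustration; the α values would come from a trained model such as one produced by TinySVM.

```python
def polynomial_kernel(x, y, d=2):
    """Polynomial kernel, assumed here to be (x . y + 1) ** d."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1) ** d


def svm_decision(x, support_vectors, labels, alphas, b=0.0, d=2):
    """Return +1 or -1 for the case x.

    Only support vectors (training cases with alpha > 0) need to be
    passed in, because all other cases contribute nothing to the sum.
    """
    total = sum(
        alpha * y * polynomial_kernel(sv, x, d)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if total + b >= 0 else -1
```

With d = 1 the decision boundary stays linear; d = 2 corresponds to the second setting the text says was tried.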

なお、拡張されたサポートベクトルマシン法の詳細については、以下の参考文献（３）および参考文献（４）に記載されている。 Details of the extended support vector machine method are described in the following references (3) and (4).

参考文献（３）：Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods,(Cambridge University Press,2000) Reference (3): Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, (Cambridge University Press, 2000)

参考文献（４）：Taku Kudoh, Tinysvm:Support Vector machines,(http://cl.aistnara.ac.jp/taku-ku//software/Tiny SVM/index.html,2000) Reference (4): Taku Kudoh, Tinysvm: Support Vector machines, (http://cl.aistnara.ac.jp/taku-ku//software/Tiny SVM / index.html, 2000)

サポートベクトルマシン法は、分類の数が２個のデータを扱うものである。したがって、分類の数が３個以上の事例を扱う場合には、通常、これにペアワイズ法またはワンＶＳレスト法などの手法を組み合わせて用いることになる。 The support vector machine method handles data with two classifications. Therefore, when handling cases with three or more classifications, a pair-wise method or a one-VS rest method is usually used in combination with this.

ペアワイズ法は、ｎ個の分類を持つデータの場合に、異なる二つの分類先のあらゆるペア（ｎ（ｎ−１）／２個）を生成し、各ペアごとにどちらがよいかを二値分類器、すなわちサポートベクトルマシン法処理モジュールで求めて、最終的に、ｎ（ｎ−１）／２個の二値分類による分類先の多数決によって、分類先を求める方法である。 In the pairwise method, in the case of data having n classifications, every pair (n (n-1) / 2) of two different classification destinations is generated, and a binary classifier indicates which is better for each pair. That is, it is obtained by the support vector machine method processing module and finally obtains the classification destination by majority decision of the classification destination by n (n−1) / 2 binary classification.
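The pairwise voting described above can be sketched as follows. The dictionary `binary_classifiers`, keyed by label pairs in the order they appear in `labels`, is a hypothetical stand-in for the n(n−1)/2 trained support vector machines.

```python
from collections import Counter
from itertools import combinations


def pairwise_classify(labels, binary_classifiers, x):
    """Majority vote over all n(n-1)/2 pairwise binary classifiers.

    binary_classifiers[(label_i, label_j)] is a callable that returns
    either label_i or label_j for the input x.
    """
    votes = Counter()
    for pair in combinations(labels, 2):
        winner = binary_classifiers[pair](x)
        votes[winner] += 1
    # Ties are resolved by first-seen order; the text does not specify a tie rule.
    return votes.most_common(1)[0][0]
```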

ワンＶＳレスト法は、例えば、ａ、ｂ、ｃという三つの分類先があるときは、分類先ａとその他、分類先ｂとその他、分類先ｃとその他、という三つの組を生成し、それぞれの組についてサポートベクトルマシン法で学習処理する。そして、学習結果による推定処理において、その三つの組のサポートベクトルマシンの学習結果を利用する。推定するべき問題が、その三つのサポートベクトルマシンではどのように推定されるかを見て、その三つのサポートベクトルマシンのうち、その他でないほうの分類先であって、かつサポートベクトルマシンの分離平面から最も離れた場合のものの分類先を求める解とする方法である。例えば、ある解くべき問題が、「分類先ａとその他」の組の学習処理で作成したサポートベクトルマシンにおいて分離平面から最も離れた場合には、その解くべき問題の分類先は、aと推定する。 In the one-vs-rest method, when there are, for example, three classification destinations a, b, and c, three sets are generated: "a versus the rest", "b versus the rest", and "c versus the rest", and a support vector machine is trained on each set. In the estimation process, the training results of these three support vector machines are used. The problem to be estimated is evaluated by each of the three machines, and the answer is the classification destination of the machine that assigns the problem to its own (non-"rest") class and places it farthest from its separating plane. For example, if the problem to be solved lies farthest from the separating plane in the support vector machine trained on "a versus the rest", the classification destination of that problem is estimated to be a.
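A minimal sketch of the one-vs-rest decision follows. It assumes each trained machine exposes a signed distance from its separating plane, positive on the side of its own class; the `scorers` mapping is hypothetical.

```python
def one_vs_rest_classify(scorers, x):
    """Pick the class whose 'class versus the rest' machine is most confident.

    scorers maps each label to a callable returning the signed distance
    of x from that machine's separating plane (positive means the label
    itself, negative means 'the rest').
    """
    return max(scorers, key=lambda label: scorers[label](x))
```

Taking the maximum signed distance selects the machine that places the case farthest on its own class side, which is one common reading of the rule described above.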

機械学習部１６が推定する、解くべき問題についての、どのような解（分類先）になりやすいかの度合いの求め方は、機械学習部１６が機械学習の手法として用いる様々な方法によって異なる。 The method of obtaining the degree of the solution (classification destination) that the machine learning unit 16 estimates about the problem to be solved is different depending on various methods used by the machine learning unit 16 as a machine learning method.

例えば、本発明の実施の形態において、機械学習部１６が、機械学習の手法としてｋ近傍法を用いる場合、機械学習部１６は、教師データの事例同士で、その事例から抽出された素性の集合のうち重複する素性の割合（同じ素性をいくつ持っているかの割合）にもとづく事例同士の類似度を定義して、前記定義した類似度と事例とを学習結果情報として学習データ格納部１２に記憶しておく。 For example, in the embodiment of the present invention, when the machine learning unit 16 uses the k-nearest neighbor method, it defines the similarity between cases of the teacher data based on the proportion of features shared between the feature sets extracted from those cases (that is, the proportion of identical features they have), and stores the defined similarity and the cases in the learning data storage unit 12 as learning result information.

そして、機械学習部１６は、素性取得部１５によって解くべき問題の素性が抽出されたときに、記憶された類似度と事例を参照して、素性取得部１５によって抽出された解くべき問題の素性について、その解くべき問題の素性の類似度が高い順にｋ個の事例を選択し、選択したｋ個の事例での多数決によって決まった分類先を、解くべき問題の分類先（解）として推定する。すなわち、機械学習部１６では、解くべき問題についての、どのような解（分類先）になりやすいかの度合いを、選択したｋ個の事例での多数決の票数とする。 Then, the machine learning unit 16 refers to the stored similarity and case when the feature of the problem to be solved is extracted by the feature acquisition unit 15, and the feature of the problem to be solved extracted by the feature acquisition unit 15 For k, the k cases are selected in descending order of the similarity of the features of the problem to be solved, and the classification destination determined by the majority vote in the selected k cases is estimated as the classification destination (solution) of the problem to be solved. . That is, in the machine learning unit 16, the degree of the kind of solution (classification destination) that is likely to be the problem to be solved is set as the number of votes of majority vote in the selected k cases.
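The k-nearest neighbor procedure described above can be sketched as follows. The text only says that similarity is based on the proportion of shared features; using the Jaccard coefficient for that proportion is an assumption, as are the function names.

```python
from collections import Counter


def overlap_similarity(features_a, features_b):
    """Proportion of shared features (Jaccard coefficient, an assumption)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_classify(training_cases, query_features, k=3):
    """Return (label, votes) decided by majority among the k most similar cases.

    training_cases is a list of (features, label) pairs. The vote count
    plays the role of the 'degree of likelihood' mentioned in the text.
    """
    ranked = sorted(
        training_cases,
        key=lambda case: overlap_similarity(case[0], query_features),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0]
```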

また、機械学習手法として、シンプルベイズ法を用いる場合には、教師データの事例について、前記事例の解と素性の集合との組を学習データとして学習データ格納部１２に記憶する。そして、機械学習部１６は、素性取得部１５によって解くべき問題の素性が抽出されたときに、学習データ格納部１２の判断情報の解と素性の集合との組をもとに、ベイズの定理にもとづいて素性取得部１５で取得した解くべき問題の素性の集合の場合の各分類になる確率を算出して、その確率の値が最も大きい分類を、その解くべき問題の素性の分類（解）と推定する。すなわち、機械学習部１６では、解くべき問題の素性の集合の場合にある解となりやすさの度合いを、各分類になる確率とする。 When the simple Bayes method is used as a machine learning method, a set of a solution of the case and a set of features is stored as learning data in the learning data storage unit 12 for the case of the teacher data. Then, when the feature acquisition unit 15 extracts the feature of the problem to be solved, the machine learning unit 16 uses the Bayes' theorem based on the combination of the judgment information solution and the feature set in the learning data storage unit 12. Based on the above, the probability of each classification in the case of a set of the features of the problem to be solved acquired by the feature acquisition unit 15 is calculated, and the classification having the highest probability value is classified into the classification of the features of the problem to be solved (solution ). That is, in the machine learning unit 16, the degree of ease of becoming a solution in the case of a set of features of the problem to be solved is set as the probability of being classified.

また、機械学習手法として決定リスト法を用いる場合には、教師データの事例について、素性と分類先との規則を所定の優先順序で並べたリストを、予め、何らかの手段により、学習データ格納部１２に記憶させる。そして、素性取得部１５によって解くべき問題の素性が抽出されたときに、機械学習部１６は、学習データ格納部１２のリストの優先順位の高い順に、抽出された解くべき問題の素性と規則の素性とを比較し、素性が一致した規則の分類先をその解くべき問題の分類先（解）として推定する。 Further, when the decision list method is used as the machine learning method, a list in which rules of features and classification destinations are arranged in a predetermined priority order with respect to the example of the teacher data is previously stored by the learning data storage unit 12 by some means. Remember me. When the features of the problem to be solved are extracted by the feature acquisition unit 15, the machine learning unit 16 extracts the features and rules of the extracted problem to be solved in descending order of the priority of the list in the learning data storage unit 12. The features are compared with each other, and the classification destination of the rule having the same feature is estimated as the classification destination (solution) of the problem to be solved.
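A minimal sketch of a decision list follows. The text leaves open how the priority order is fixed ("by some means"); ordering rules by the observed proportion of their best class is an assumption, and the names are illustrative.

```python
from collections import Counter, defaultdict


def build_decision_list(cases):
    """Build rules (confidence, feature, label) sorted by priority.

    cases is an iterable of (features, label) pairs with string features.
    Each feature yields one rule pointing to its most frequent label.
    """
    feature_label = defaultdict(Counter)
    for features, label in cases:
        for feature in set(features):
            feature_label[feature][label] += 1
    rules = []
    for feature, counts in feature_label.items():
        label, hits = counts.most_common(1)[0]
        rules.append((hits / sum(counts.values()), feature, label))
    # Highest confidence first; ties broken by feature name for determinism.
    rules.sort(key=lambda rule: (-rule[0], rule[1]))
    return rules


def classify(rules, features, default=None):
    """Return the label of the first rule whose feature appears in the input."""
    features = set(features)
    for _, feature, label in rules:
        if feature in features:
            return label
    return default
```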

また、機械学習手法として最大エントロピー法を使用する場合には、教師データの事例から解となりうる分類を特定し、所定の条件式を満足し、かつエントロピーを示す式を最大にするときの素性の集合と解となりうる分類の二項からなる確率分布を求めて、学習データ格納部１２に記憶する。そして、素性取得部１５によって解くべき問題の素性が抽出されたときに、機械学習部１６は、学習データ格納部１２の確率分布を利用して、抽出された解くべき問題の素性の集合についてその解となりうる分類の確率を求めて、最も大きい確率値を持つ解となりうる分類を特定し、その特定した分類をその解くべき問題の解と推定する。すなわち、機械学習部１６では、解くべき問題の素性の集合の場合にある解となりやすさの度合いを、各分類になる確率とする。 In addition, when using the maximum entropy method as a machine learning method, the classification that can be a solution is identified from the example of the teacher data, the predetermined conditional expression is satisfied, and the feature when maximizing the expression showing entropy is determined. A probability distribution composed of two terms of a set and a class that can be a solution is obtained and stored in the learning data storage unit 12. When the features of the problem to be solved are extracted by the feature acquisition unit 15, the machine learning unit 16 uses the probability distribution of the learning data storage unit 12 to extract the set of features of the extracted problem to be solved. A probability of a class that can be a solution is obtained, a class that can be a solution having the largest probability value is specified, and the specified class is estimated as a solution of the problem to be solved. That is, in the machine learning unit 16, the degree of ease of becoming a solution in the case of a set of features of the problem to be solved is set as the probability of being classified.

また、機械学習手法としてサポートベクトルマシン法を使用する場合には、教師データの事例から解となりうる分類を特定し、分類を正例と負例に分割して、カーネル関数を用いた所定の実行関数にしたがって事例の素性の集合を次元とする空間上で、その事例の正例と負例の間隔を最大にし、かつ正例と負例を超平面で分割する超平面を求めて学習データ格納部１２に記憶する。そして、素性取得部１５によって解くべき問題の素性が抽出されたときに、機械学習部１６は、学習データ格納部１２の超平面を利用して、解くべき問題の素性の集合が超平面で分割された空間において正例側か負例側のどちらにあるかを特定し、その特定された結果にもとづいて定まる分類を、その解くべき問題の解と推定する。すなわち、機械学習部１６では、解くべき問題の素性の集合の場合にある解となりやすさの度合いを、分離平面からのその解くべき問題の事例への距離の大きさとする。
（実験結果１） In addition, when using the support vector machine method as a machine learning method, the classification that can be a solution is identified from the example of the teacher data, the classification is divided into positive examples and negative examples, and predetermined execution using a kernel function is performed. Stores learning data by finding the hyperplane that maximizes the interval between the positive and negative examples of the case and divides the positive and negative examples by the hyperplane in a space whose dimension is a set of case features according to the function Store in unit 12. When the features of the problem to be solved are extracted by the feature acquisition unit 15, the machine learning unit 16 uses the hyperplane of the learning data storage unit 12 to divide the set of features of the problem to be solved by the hyperplane. In the specified space, it is specified whether it is on the positive example side or the negative example side, and the classification determined based on the specified result is estimated as the solution of the problem to be solved. That is, in the machine learning unit 16, the degree of ease of becoming a solution in the case of a set of features of the problem to be solved is set as the magnitude of the distance from the separation plane to the case of the problem to be solved.
(Experimental result 1)

以下、異表記取得装置１の実験結果について説明する。まず、実験に利用するデータについて説明する。 Hereinafter, the experimental result of the different notation acquisition apparatus 1 will be described. First, data used for the experiment will be described.

実験で用いるデータは、大規模類似語リストである。大規模類似語リストとは、検索エンジン研究基盤ＴＳＵＢＡＫＩ（http://tsubaki.ixnlp.nii.ac.jp/se/index.cgi参照［平成２１年１２月１３日検索］）の約１億ページ・６０億文のデータから１００万語を抽出し、その１００万語の各々の語に対して最大５００個の類義語を類似度付きで生成したものである。この大規模類似語リストに含まれる１００万語の日本語用語と、その日本語用語の各々の類義語の日本語用語を日本語用語対とする。 The data used in the experiment is a large-scale similar word list. The large-scale similar word list is about 100 million pages of the search engine research base TSUBAKI (see http://tsubaki.ixnlp.nii.ac.jp/se/index.cgi [searched on December 13, 2009]) • One million words are extracted from 6 billion sentence data, and a maximum of 500 synonyms are generated with similarities for each of the one million words. One million Japanese terms included in this large-scale similar word list and Japanese terms corresponding to the respective Japanese terms are taken as Japanese term pairs.

そして、大規模類似語リストから、編集距離が１の日本語用語対をランダムに１４１８５組取り出した。その取り出した日本語用語対が日本語異表記対であるか、日本語異表記対でないかのタグ付けを３人の評定者の多数決により行った。３人の評定者のタグ付けがどれくらい一致しているのか、カッパ統計量Ｋを用いて判定する。 Then, 14185 Japanese word pairs with an edit distance of 1 were randomly extracted from the large-scale similar word list. Tagging whether the extracted Japanese term pairs are Japanese variant pairs or not was performed by a majority vote of three evaluators. It is determined using the Kappa statistic K how much the tagging of the three evaluators matches.

右の用語と左の用語の２つの用語を有する日本語用語対の組み合わせが左にあるか、右にあるかにより、異なる情報になる素性がある。その素性に対応し、情報量を増やすために、日本語用語対の組み合わせを左右入れ替えたデータも用いる。つまり、本実験では、大規模類似語リストから取り出した１４１８５組に加え、合計２８３７０組の実験データを用いる。 Some features take different values depending on whether a term of a Japanese term pair appears on the left or on the right. To accommodate such features and to increase the amount of information, data in which the left and right terms of each pair are swapped is also used. That is, this experiment uses the 14185 pairs extracted from the large-scale similar word list together with their swapped counterparts, for a total of 28370 pairs of experimental data.

また、２８３７０組の実験データを１つのまとまったデータであるとすると、実験の公正性が失われるのではないかと考え、２８３７０組ある実験データの半分を素性の考案を行うデータとして用いる。残り半分の実験データをクローズドデータで考案された素性が、他のデータにおいても有効であるかどうかの検討を行うデータとして用いる。素性の考案を行うデータをクローズドデータと呼ぶ。検討を行うデータをオープンデータと呼ぶ。クローズドデータは１０分割クロスバリデーションによる学習により精度の測定を行う。なお、１０分割クロスバリデーションとは、実験対象のデータを、第一から第十の１０に分割し、以下の（１）から（１０）の学習を行う。（１）第一をテストデータとし、第二から第十を学習データとして、学習を行う。（２）第二をテストデータとし、第一、第三から第十を学習データとして、学習を行う。（３）第三をテストデータとし、第一、第二、第四から第十を学習データとして、学習を行う。（４）第四をテストデータとし、他を学習データとして、学習を行う。（５）第五をテストデータとし、他を学習データとして、学習を行う。（６）第六をテストデータとし、他を学習データとして、学習を行う。（７）第七をテストデータとし、他を学習データとして、学習を行う。（８）第八をテストデータとし、他を学習データとして、学習を行う。（９）第九をテストデータとし、他を学習データとして、学習を行う。（１０）第十をテストデータとし、他を学習データとして、学習を行う。なお、１０分割クロスバリデーションは、公知技術である。 Because treating the 28370 pairs of experimental data as a single set might compromise the fairness of the experiment, half of the 28370 pairs are used as the data for devising features. This half is called the closed data. The remaining half, called the open data, is used to examine whether the features devised on the closed data are also effective on other data. Accuracy on the closed data is measured by training with 10-fold cross-validation. In 10-fold cross-validation, the data under test is divided into ten parts, the first through the tenth, and the following training runs (1) to (10) are performed. (1) The first part is the test data and the second through tenth parts are the training data. (2) The second part is the test data and the first and the third through tenth parts are the training data. (3) The third part is the test data and the first, second, and fourth through tenth parts are the training data. (4) The fourth part is the test data and the others are the training data. (5) The fifth part is the test data and the others are the training data. (6) The sixth part is the test data and the others are the training data. (7) The seventh part is the test data and the others are the training data. (8) The eighth part is the test data and the others are the training data. (9) The ninth part is the test data and the others are the training data. (10) The tenth part is the test data and the others are the training data. 10-fold cross-validation is a well-known technique.
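The ten training runs listed above can be generated mechanically. The sketch below is one way to do so; how the data is partitioned into ten parts is not stated in the text, so the interleaved split used here is an assumption.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs, using each of the k parts once as test data."""
    # Interleaved partition: item i goes to part i % k (an assumption).
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test
```

Each run trains on nine parts and tests on the remaining one, so every case is used as test data exactly once.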

また、クローズドデータを学習データ（学習データ格納部１２に格納されるデータ）、オープンデータをテストデータ（異表記の用語対であるか否かを判断されるデータ）とし、オープンクローズによる学習により精度の測定を行う。 In addition, closed data is used as learning data (data stored in the learning data storage unit 12), and open data is used as test data (data for determining whether or not a term pair is different). Measure.

図６に、実験で用いた編集距離が１の日本語用語対の中に、多数決により日本語異表記対であるか日本語異表記対でないかを判定した内訳を示す。 FIG. 6 shows a breakdown of the Japanese term pairs having an editing distance of 1 used in the experiment, which are determined by majority voting as a Japanese different notation pair or not a Japanese different notation pair.

なお、カッパ統計量Ｋとは、Ｋ人評定者のカテゴリ評定における一致度を表す数値のことである。カッパ統計量Ｋの算出方法は公知であるので、説明を省略する。 The kappa statistic K is a numerical value representing the degree of coincidence in the category rating of K person raters. Since the method for calculating the kappa statistic K is known, the description thereof is omitted.

評定者の間に完全な一致があればＫの値は１になる。チャンスレベルでの一致であればＫの値は０である。一致度が高くなればＫの値は０から１に近づく。図７に、Ｌａｎｄｉｓらによる一致度の評価方法を示す。本実験では、１４１８５組の日本語用語対を対象に、３人で日本語異表記対であるか日本語異表記対でないかの２カテゴリでカッパ統計量Ｋを求めたところ、一致度は０．８４であった。これは０．８以上の一致度であるため、ほぼ完全な一致であると評価できる。 If the raters agree completely, the value of K is 1. If their agreement is at chance level, the value of K is 0. As the degree of agreement increases, the value of K moves from 0 toward 1. FIG. 7 shows the method of evaluating the degree of agreement according to Landis et al. In this experiment, the kappa statistic K was computed for the 14185 Japanese term pairs, with three raters and two categories (Japanese variant notation pair or not), and the degree of agreement was 0.84. Because this is 0.8 or higher, the agreement can be evaluated as almost perfect.
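The text treats the computation of the kappa statistic as known. As an illustration only, the sketch below computes Fleiss' kappa, a common multi-rater agreement statistic; whether the experiment used exactly this variant is an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j].

    counts[i][j] is the number of raters who assigned item i to
    category j; every item must be rated by the same number of raters.
    Undefined (division by zero) when all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of ratings in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement among rater pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

For the setting in the text, each row would hold the counts of the three raters over the two categories, for example `[3, 0]` when all three judged a pair to be a variant notation pair.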

次に、異表記取得装置１における異表記の用語対であるか否かの判断手法が優れていることを示すために、異表記取得装置１の判断手法と比較対照となるベースライン手法について説明する。 Next, in order to show that the method for determining whether or not a different pair of terminology is used in the different notation acquisition device 1 is described, the determination method of the different notation acquisition device 1 and the baseline method as a comparison are explained. To do.

編集距離の小さい（例えば、編集距離が１）日本語異表記対の抽出を行う対象の日本語用語対に対して、ベースライン手法では、以下のルールを適用し、機械的に日本語異表記対であるか日本語異表記対でないかについて判定を行う。
（ルール１）文字数が同じ日本語用語対の編集箇所が同じ値の数字を表す場合、日本語異表記対であると判定する。
（ルール２）文字数が同じ日本語用語対の編集箇所が同じ意味のアルファベットを表す場合、日本語異表記対であると判定する。
（ルール３）文字数が同じ日本語用語対がＪＵＭＡＮを使い読み方を調べることでき、読み方が一致する場合、日本語異表記対であると判定する。
（ルール４）ルール１、ルール２、ルール３と一致しなかった場合、日本語異表記対でないと判定する。 For Japanese term pairs with a small edit distance (for example, an edit distance of 1) that are candidates for extraction as Japanese variant notation pairs, the baseline method applies the following rules and mechanically determines whether each pair is a Japanese variant notation pair.
(Rule 1) If the two terms have the same number of characters and the edited characters are numerals representing the same value, the pair is judged to be a Japanese variant notation pair.
(Rule 2) If the two terms have the same number of characters and the edited characters are letters of the alphabet with the same meaning, the pair is judged to be a Japanese variant notation pair.
(Rule 3) If the two terms have the same number of characters, their readings can be looked up using JUMAN, and the readings match, the pair is judged to be a Japanese variant notation pair.
(Rule 4) If none of Rules 1 to 3 applies, the pair is judged not to be a Japanese variant notation pair.

上記のルール１からルール４を適用するベースライン手法において、日本語用語対「第２版」「第二版」については、以下のように判断される。この日本語用語対における編集箇所は、「２」と「二」であり数字を表している。ルール１を適用し、同じ値を表しているため、ベースライン手法では、この日本語用語対は日本語異表記対であると判定される。 In the baseline method applying Rules 1 to 4 above, the Japanese term pair 「第２版」 and 「第二版」 (both meaning "second edition") is judged as follows. The edited characters in this pair are the Arabic numeral 「２」 and the kanji numeral 「二」, both of which are numerals. Rule 1 applies, and because they represent the same value, the baseline method judges this pair to be a Japanese variant notation pair.

日本語用語対「Ｔｅａ」「ｔｅａ」については、以下のように判断される。この日本語用語対における編集箇所は、「Ｔ」と「ｔ」でありアルファベットを表している。そして、ルール２が適用され、この日本語用語対は、同じ意味の語であるためこの日本語用語対は日本語異表記対であると判定される。 The Japanese term pair 「Ｔｅａ」 and 「ｔｅａ」 is judged as follows. The edited characters in this pair are 「Ｔ」 and 「ｔ」, both of which are letters of the alphabet. Rule 2 applies, and because the letters have the same meaning, this pair is judged to be a Japanese variant notation pair.

日本語用語対「誉める」「褒める」については、以下のように判断される。この日本語用語対は、「ほめる」と「ほめる」にＪＵＭＡＮを使い読み方を調べることできる。そして、ルール３が適用され、読み方が一致し、この日本語用語対は日本語異表記対であると判定される。 The Japanese term pair 「誉める」 and 「褒める」 (both meaning "to praise") is judged as follows. The readings of both terms can be looked up using JUMAN, giving 「ほめる」 (homeru) for each. Rule 3 applies, and because the readings match, this pair is judged to be a Japanese variant notation pair.

日本語用語対「シルキーホワイト」「ミルキーホワイト」については、以下のように判断される。この日本語用語対における編集箇所は、「シ」と「ミ」でありカタカナを表している。シルキーホワイトとミルキーホワイトはＪＵＭＡＮ辞書において未定義であるため、読み方を調べることができない。よってルール４が適用され、日本語異表記対でないと判定される。 The Japanese terms “silky white” and “milky white” are judged as follows. The edited parts in this Japanese term pair are “shi” and “mi”, representing katakana. Since Silky White and Milky White are undefined in the JUMAN dictionary, how to read them cannot be checked. Therefore, rule 4 is applied and it is determined that the pair is not a Japanese variant.
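The four baseline rules and the worked examples above can be sketched as follows. JUMAN is not called here: the `readings` dictionary is a hypothetical stand-in for a reading lookup, and treating "same meaning" letters as case-insensitive, width-normalized equality is an assumption.

```python
import unicodedata


def same_number(a, b):
    """Rule 1 helper: both characters are numerals with the same value."""
    try:
        return unicodedata.numeric(a) == unicodedata.numeric(b)
    except ValueError:  # at least one character has no numeric value
        return False


def same_letter(a, b):
    """Rule 2 helper: both characters are the same Latin letter."""
    na = unicodedata.normalize("NFKC", a)
    nb = unicodedata.normalize("NFKC", b)
    return (na.isascii() and na.isalpha() and nb.isascii() and nb.isalpha()
            and na.lower() == nb.lower())


def baseline_is_variant(left, right, readings):
    """Apply Rules 1 to 4 to a term pair; True means variant notation pair."""
    if len(left) != len(right):
        return False  # Rules 1 to 3 all require equal length, so Rule 4
    diff = [i for i, (a, b) in enumerate(zip(left, right)) if a != b]
    if len(diff) == 1:
        a, b = left[diff[0]], right[diff[0]]
        if same_number(a, b):  # Rule 1
            return True
        if same_letter(a, b):  # Rule 2
            return True
    left_reading = readings.get(left)
    right_reading = readings.get(right)
    if left_reading is not None and left_reading == right_reading:  # Rule 3
        return True
    return False  # Rule 4
```

Terms missing from `readings` correspond to terms that are undefined in the JUMAN dictionary, which is why the シルキーホワイト and ミルキーホワイト pair falls through to Rule 4.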

次に、本実験で用いた機械学習部１６の機械学習手法について、詳細に説明する。本機械学習手法は、サポートベクトルマシン法である。サポートベクトルマシン法は、上述したように、空間を超平面で分割することにより、２つの分類からなるデータを分類する手法である。このとき２つの分類が正例と負例からなるとすると、学習データにおいてこの２つの間隔が大きいものほど誤った分類をする可能性が低いと判断される。この間隔を最大にする超平面を求め、それを求めて分類を行うことが基本とされる。しかし、ここでは、学習データにおいて間隔の内部領域に少数の事例を含んでもよいとする手法や超平面の線形の部分を非線形にするなどの拡張がされたものを用いる。これらの拡張された方法は、識別関数を用いて分類することと等価となり、識別関数の出力値が正か負かによって２つに分類を判別することができる。また、３つ以上からなるデータを扱う場合にはペアワイズ手法というのを並行して用いる。ペアワイズ手法はＮ個の分類をもつデータの場合、異なる２つの分類先のあらゆるペアを作り、各ペアごとにどちらがよいかを２値分類器（サポートベクトルマシン法）で求め最終的に分類先の多数決により求める方法である。以降、サポートベクトルマシン法はＳＶＭと、適宜、表記する。 Next, the machine learning method of the machine learning unit 16 used in this experiment will be described in detail. This machine learning method is a support vector machine method. As described above, the support vector machine method is a method of classifying data composed of two classifications by dividing a space by a hyperplane. At this time, if the two classifications are a positive example and a negative example, it is determined that the larger the interval between the two in the learning data, the lower the possibility of erroneous classification. It is fundamental to obtain a hyperplane that maximizes this interval and classify it by obtaining it. However, here, a method in which a small number of cases may be included in the inner region of the interval in the learning data or an extension such as making the linear portion of the hyperplane nonlinear is used. These extended methods are equivalent to classification using a discriminant function, and can classify into two depending on whether the output value of the discriminant function is positive or negative. Further, when handling data consisting of three or more, the pair-wise method is used in parallel. 
In the case of data with N classifications, the pairwise method creates every pair of two different classification destinations, finds which one is better for each pair by a binary classifier (support vector machine method), and finally determines the classification destination It is a method to seek by majority vote. Hereinafter, the support vector machine method is appropriately expressed as SVM.

ＳＶＭによって編集距離の小さい（例えば、編集距離が１）日本語異表記対を抽出するために用いる素性は、上述したＳ１からＳ６８の素性である。これらの素性は、大規模類似語リストからランダムで取り出した編集距離の小さい日本語用語対から取り出す。素性によってそれぞれの機械学習は、日本語用語対が日本語異表記対であるか日本語異表記対でないかを判定をする。日本語用語対からできるだけ多くの情報を得るために、種々の素性を用いた。また、それぞれの素性について、上述したように、Ｇ１からＧ７に分類できる。Ｇ１、Ｇ２、Ｇ３は、すべての編集距離の小さい日本語用語対に対応できる素性である。字種は対象の文字がひらがな、カタカナ、数字、アルファベット、その他のどの種類を表しているかの情報である。品詞は用語にＪＵＭＡＮを用いて形態素解析をかけ、用語を単語に区切り、品詞情報を取得する。そして、対象の文字がどの品詞に属しているかの情報である。位置情報は、対象の文字が品詞に属している中でさらに、その品詞の先頭、最後尾、それ以外のどの位置を示しているかの情報である。類似度は、大規模類似語リストを生成する際に用いた類似度の情報である。 The features used by the SVM to extract Japanese variant notation pairs with a small edit distance (for example, an edit distance of 1) are the features S1 to S68 described above. These features are extracted from Japanese term pairs with a small edit distance that were randomly taken from the large-scale similar word list. Based on the features, the machine learning determines whether a Japanese term pair is or is not a Japanese variant notation pair. A variety of features was used in order to obtain as much information as possible from the term pairs. As described above, the features can be classified into groups G1 to G7. G1, G2, and G3 are features applicable to every Japanese term pair with a small edit distance. The character type indicates whether the target character is hiragana, katakana, a numeral, a letter of the alphabet, or something else. For the part of speech, the term is morphologically analyzed with JUMAN and split into words to obtain part-of-speech information; the feature indicates which part of speech the target character belongs to. The position information indicates whether the target character is at the beginning, at the end, or elsewhere within the word of that part of speech. The similarity is the similarity value used when generating the large-scale similar word list.

スタッキングアルゴリズムとは、上述したように、実験データを本来の目的とは別の分類方法で分類させたデータを機械学習で学習させ、学習結果の分類情報を素性に加えることである。本実験において、スタッキングアルゴリズムに使用するデータは、実験で用いる２８３７０組の日本語用語対以外の、大規模類似語リストから得られたＪＵＭＡＮの代表表記が判別できる９０４６１２組の日本語用語対を用いる。９０４６１２組の中で、正例は２５９３４組、負例は８７８６７８組である。これにより、ＪＵＭＡＮ辞書において未定義の日本語用語対にも、近似的ではあるがＳ６４の素性の情報を付与することができる。また、Ｇ４、Ｇ５は、特徴がある編集距離の小さい日本語用語対に特化した素性である。Ｇ４は置換によって等しい文字列で、Ｇ５は削除によって等しい文字列になる日本語用語対が対象である。Ｇ６、Ｇ７は、ＪＵＭＡＮ辞書、日本語ワードネット辞書、ＥＤＲ辞書を用いた素性である。ＪＵＭＡＮ辞書については、未定義とされている用語が出てくる日本語用語対に対して、素性の情報は付与しないこととする。 As described above, the stacking algorithm is to learn, by machine learning, data obtained by classifying experimental data by a classification method different from the original purpose, and to add classification information of a learning result to a feature. In this experiment, the data used for the stacking algorithm uses 904612 pairs of Japanese terms that can discriminate the representative representation of JUMAN obtained from the large-scale similar word list other than 28370 pairs of Japanese terms used in the experiment. . Among the 904612, the positive example is 25934, and the negative example is 878678. Thereby, although it is approximate, the information of the feature of S64 can be given also to a Japanese term pair undefined in the JUMAN dictionary. G4 and G5 are features specialized for a Japanese term pair having a characteristic and a small editing distance. G4 is an equal character string by substitution, and G5 is a Japanese term pair that becomes an equal character string by deletion. G6 and G7 are features using a JUMAN dictionary, a Japanese word net dictionary, and an EDR dictionary. Regarding the JUMAN dictionary, feature information is not given to Japanese term pairs in which undefined terms appear.

How effective features S1 to S68 are is examined by testing for significant differences, using the bootstrap method. The bootstrap method takes the data as classified by each of two classification methods. From each of the two sets of classified data, items are drawn at random with replacement, without changing the number of items (for example, if there are 1,400 questions, 1,400 items are drawn). The F value is computed for each resampled set, and the two F values are compared. This resampling and comparison step is repeated 10,000 times. The F value is defined by Equation 2 as the harmonic mean of recall and precision, that is, F = 2 × recall × precision / (recall + precision).

If, over the 10,000 repetitions, the F value of one method exceeds that of the other in 9,500 or more repetitions (95%), the method with the higher F value is judged significantly better at the 5% significance level. If neither method reaches 9,500 repetitions (95%), no significant difference can be concluded at the 5% significance level. A 5% significance level is applied here, but a 10% significance level, for example, may be applied instead.
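The bootstrap comparison described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, labels are assumed to be 1 (positive) and 0 (negative), and the same resampled indices are assumed to be applied to both methods, which the specification does not state explicitly.

```python
import random


def f_value(gold, pred):
    """F value (harmonic mean of recall and precision) for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_wins(gold, pred_a, pred_b, rounds=10000, seed=0):
    """Count how often each method has the higher F value on resampled data."""
    rng = random.Random(seed)
    n = len(gold)
    wins_a = wins_b = 0
    for _ in range(rounds):
        # Draw n items with replacement; the data size is unchanged.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        fa = f_value(g, [pred_a[i] for i in idx])
        fb = f_value(g, [pred_b[i] for i in idx])
        if fa > fb:
            wins_a += 1
        elif fb > fa:
            wins_b += 1
    return wins_a, wins_b


def is_significant(wins, rounds=10000, percent=95):
    """True if the win count reaches the given percentage of the rounds."""
    return wins * 100 >= percent * rounds
```

With 10,000 rounds and `percent=95`, `is_significant` corresponds to the 9,500-repetition criterion; passing `percent=90` would correspond to the 10% significance level mentioned as an alternative.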

In this experiment, the SVM results obtained with the full feature set are also compared with those obtained after removing exactly one feature from it. The full feature set is the combination of all 68 features described above, and each of S1 to S68 is removed in turn. The significance test is carried out on two kinds of SVM results: 10-fold cross-validation (10CV) on the closed data, and open-close evaluation (OC) using the closed data and the open data. Open-close evaluation means an experiment in which the closed data are used as training data and the open data as test data. Hereinafter, the SVM experiment with 10-fold cross-validation is denoted 10CV, and the SVM experiment with open-close evaluation is denoted OC.

Next, the results of experiments with the baseline method and with the method using machine learning are reported.

In this experiment, each of the 28,370 Japanese term pairs with a small edit distance contained in the large-scale similar word list described above was judged as to whether or not it is a Japanese different notation pair. FIG. 8 shows the results of applying the baseline method to the prepared closed data and open data, together with the 10CV and OC results. TinySVM was used as the SVM implementation, with a first-degree polynomial kernel and the soft margin parameter C set to 1. In each table, "all features" denotes an experiment using all of features S1 to S68, and "feature selection" denotes an experiment using all features except the omitted one.

The accuracy in FIG. 8 is the proportion of the experimental data for which it was correctly judged whether a pair is or is not a Japanese different notation pair with an edit distance of 1. The F value in FIG. 8 is the F value for extracting Japanese different notation pairs with an edit distance of 1 from each set of experimental data. For both 10CV and OC, the accuracy and the F value of the SVM using all features are higher than those of the baseline method. The F value for extracting Japanese different notation pairs from Japanese term pairs with an edit distance of 1 was higher for the SVM using all features than for the baseline by 0.433 under 10CV and by 0.460 on the open data. The baseline results indicate that extracting Japanese different notation pairs from Japanese term pairs with an edit distance of 1 is difficult, but the method proposed in this report, which combines various features with machine learning, extracts more Japanese different notation pairs than the baseline method.

Next, FIG. 9 shows the results of using the bootstrap method described above to examine whether each feature is effective. Features S1 to S68 are listed in FIG. 10. In FIG. 9, the "omitted feature" is the feature whose significance is being tested by the bootstrap method. "All features" is the method using every feature handled in this experiment, and "feature selection" is the method using all features except the omitted one. Each value is the number of times the F value with all features was higher than with feature selection, or the number of times the F value with feature selection was higher than with all features. In this experiment, if all features yields the higher F value in 9,500 or more repetitions (95%), the omitted feature contributes to accuracy and can be regarded as effective. The features for which this held were S47, S55, S58, and S67 under 10CV, and S52, S54, S55, S58, and S67 under OC. The features for which it held under both 10CV and OC were S55, S58, and S67. From this result, S55, S58, and S67 can be regarded as features that are effective for any data from which Japanese different notation pairs with a small edit distance are extracted. S47 and S52 are effective for the data used in their respective experiments, but may cease to be effective if the data of Japanese term pairs with a small edit distance change. S47 and S52 therefore cannot be said to be effective for every such data set.

Next, a comparison against several synonym dictionaries is described, made to see whether the method proposed for the different notation acquisition device 1 (hereinafter simply "the proposed method") was able to extract Japanese different notation pairs with an edit distance of 1. The synonym dictionaries are the EDR dictionary, the Japanese WordNet dictionary, and the JUMAN dictionary. The number of Japanese term pairs with an edit distance of 1 was found to be 21,224,779 in the EDR dictionary, 890,616 in the Japanese WordNet dictionary, and 23,348 in the JUMAN dictionary. The EDR dictionary contains words relating to personal names. In this experiment, personal names were judged not to be synonyms and were removed, leaving 933,037 Japanese term pairs with an edit distance of 1 in the EDR dictionary. For the JUMAN dictionary, word pairs sharing the same representative notation were treated as Japanese term pairs.

FIG. 11 shows the result of examining what proportion of the term pairs that the proposed method classified as Japanese different notation pairs with an edit distance of 1 in the large-scale similar word list are contained in each dictionary. Hereinafter, the set of all Japanese term pairs with an edit distance of 1 in the large-scale similar word list is called the Japanese term pair database and written "term pair DB". Within the term pair DB, the set of all Japanese term pairs that the proposed method classified as Japanese different notation pairs is called the Japanese different notation pair database and written "different notation DB", and the set of all Japanese term pairs that it classified as not being Japanese different notation pairs is called the non-different-notation pair database and written "non-different notation DB". The EDR dictionary contained Japanese different notation pairs of the different notation DB at a rate of 20.45%, Japanese WordNet at a rate of 1.71%, and the JUMAN dictionary at a rate of 6.52%. In none of the dictionaries is the proportion high. These results show that the different notations obtained by the different notation acquisition device described in this specification overlap little with the existing dictionaries, so the device can add many different notations to an existing dictionary. For the EDR dictionary, for example, the coverage is about 20%, which can nonetheless be regarded as substantial.

In addition, the Japanese term pairs with an edit distance of 1 contained in each dictionary were classified by the proposed method as to whether or not they are Japanese different notation pairs with an edit distance of 1. FIG. 12 shows the accuracy of the SVM classification.

FIG. 13 shows five pairs each, drawn at random, of the Japanese term pairs that were classified as Japanese different notation pairs with an edit distance of 1 and of those that were not, for each dictionary and for the term pair DB. The training data were the combination of the open data and the closed data, and the term pair DB and each dictionary were classified under OC, with the term pair DB or the respective dictionary as the test data.

In FIG. 12, the classification accuracy for Japanese WordNet is low because, as FIG. 13 shows, it contains many Japanese synonym pairs rather than Japanese different notation pairs. As FIG. 12 shows, the JUMAN dictionary was classified with a high accuracy of 80%. The JUMAN dictionary seldom contains pairs that are not Japanese different notation pairs, and the proposed method extracts different notations appropriately, which is considered to be why the high accuracy of 80% was achieved.

Next, the evaluation of extracting Japanese different notation pairs with an edit distance of 1 is described. An SVM can classify data according to whether the output value of its discriminant function (the score output by the machine learning unit 16) is positive or negative. Here, however, the data are classified as positive or negative by an arbitrary value (threshold) rather than by the sign of the output value, and the extraction is evaluated on that basis. That is, the threshold determination means 172 determines whether the score acquired by the machine learning unit 16 is greater than or equal to (or strictly greater than) the threshold stored in the threshold storage means 171. Because the accuracy of classifying by sign is not 100%, a Japanese term pair that is not a Japanese different notation pair may be wrongly judged to be one and extracted. Accordingly, when reliable extraction is wanted even at the cost of a smaller yield, setting the threshold high allows Japanese different notation pairs to be extracted reliably. Conversely, when exhaustive extraction is wanted even if some erroneous data are included, this can be achieved by setting the threshold low. FIG. 14 shows the evaluation for each threshold. As FIG. 14 shows, setting the threshold to -0.2 gave the highest F value, 0.9323. FIG. 15 shows the relationship between recall and precision. According to FIG. 15, raising recall requires raising coverage, which lowers precision; raising precision requires lowering coverage, which lowers recall.
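The threshold-based evaluation described above can be sketched as follows. The function names and the toy scores are hypothetical and serve only to illustrate how sweeping a threshold over discriminant-function scores trades precision against recall; they are not data from the experiment.

```python
def prf_at_threshold(scores, gold, threshold):
    """Precision, recall, and F value when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_threshold(scores, gold, candidates):
    """Return the candidate threshold with the highest F value."""
    return max(candidates, key=lambda t: prf_at_threshold(scores, gold, t)[2])


if __name__ == "__main__":
    # Toy scores (hypothetical): three true pairs and two non-pairs.
    scores = [1.3, 0.4, -0.1, -0.3, -0.9]
    gold = [True, True, True, False, False]
    for t in [0.5, 0.0, -0.2, -0.5]:
        print(t, prf_at_threshold(scores, gold, t))
```

A higher threshold keeps only high-scoring pairs (higher precision, lower recall), while a lower threshold admits more pairs (higher recall, lower precision).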
(Experimental result 2)

In the second experiment, a model trained on training data containing 745 positive examples (for example, the pair 「スパゲティ」 and 「スパゲッティ」) and 13,440 negative examples (pairs that are not positive examples) was used for extraction on test data containing 725 positive examples and 13,460 negative examples, and the F value was 0.93. The features and training data in Experimental Result 2 are not exactly the same as those used in Experimental Result 1, but they overlap sufficiently to demonstrate the effectiveness of the proposed method.

With a baseline that judges everything to be a positive example, the F value was about 0.0972. With a conventional method that, for the term pair being judged, uses only features of the character at the edited position and the characters around it, the F value was 0.85. It can thus be seen that a method using a large number of features (68 here), as in the proposed method, has a marked effect.
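As a check on the all-positive baseline, the F value can be worked out by hand. The calculation below assumes that the baseline was evaluated on the test data of this experiment (725 positive and 13,460 negative examples), which the text does not state explicitly; under that assumption the recall R is 1 and the precision P is the proportion of positive examples.

```latex
P = \frac{725}{725 + 13460} = \frac{725}{14185} \approx 0.0511, \qquad R = 1,
\qquad
F = \frac{2PR}{P + R} \approx \frac{2 \times 0.0511}{1.0511} \approx 0.0972
```

This agrees with the reported value of about 0.0972.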

Furthermore, it was confirmed that using the features based on existing different notation dictionaries and the features based on the stacking method (the dictionary-related features above) gives a significantly higher F value than not using them, which confirms the effectiveness of these techniques as well.

As a rule-based approach, a method was also tried that deterministically judges a pair to be a different notation pair when the characters at the edited position are kanji numerals or Arabic numerals, when they are the same alphabetic letter, or when an existing different notation dictionary identifies the pair as a different notation pair. The F value in this case was 0.4202, which shows that supervised machine learning is preferable to the rule-based approach.

By learning, from difference data of correct different notation pairs, the difference patterns that tend to produce different notations, generating a different notation candidate B for a given term A from those difference patterns, and then judging whether term A and term B form a different notation pair, the number of different notations that can be acquired increases markedly. This operation is described in Embodiment 2.
(Experimental result 3)

In the third experiment, 100,000 words were used together with the 100 words similar to each of them. Among all pairs formed between the 100,000 words and their 100 similar words, 1.7 million term pairs differed by only one character. The features and training data in Experimental Result 3 are not exactly the same as those used in Experimental Result 1, but they overlap sufficiently to demonstrate the effectiveness of the proposed method. Using the technique of the different notation acquisition device 1, 70,000 different notation pairs can be extracted from these term pairs. Examples of different notation pairs that can be built are shown below.
- ＢｕｓｉｎｅｓｓＷｅｅｋ / ＢｕｓｉｎｅｓｓＷｅｅｋ
- ＪＡＶＡＳｃｒｉｐｔ / ＪＡＶＡＳｃｒｉｐｔ
- 書いてた頃 / 書いていた頃
- アイリッシュトラッド / アイリッシュ・トラッド
- 自サーバ / 自サーバー
- でない場合 / 出ない場合
- ＷＷＷサーバ上 / ＷＷＷサーバー上
- 日光彫 / 日光彫り
- 隣同士 / 隣り同士

The EDR (Electronic Dictionary Research) electronic dictionary contains 24,185 different notations that differ by one character. Japanese WordNet contains 82,270 entries that resemble different notations differing by one character. However, Japanese WordNet also contains many entries that are not different notations (synonyms), and it is difficult to extract the different notations from it appropriately. The JUMAN dictionary contains 23,348 different notations that differ by one character. Comparison with these figures also shows the effectiveness of the proposed technique. "JAVA" is a registered trademark.
(Experimental result 4)

In the fourth experiment, the accuracy of the baseline method described above (the method applying Rules 1 to 4 above) is calculated. In the baseline method, the F value is obtained by treating a pair as a positive example when the features with highly significant differences (S55, S58, S67) are judged "yes", and as a negative example when all of them are judged "no".

FIG. 16 shows the result of a 10-fold cross-validation experiment with the baseline method using all of the open data and the closed data. In FIG. 16, "0" denotes a negative example and "1" a positive example. The "0" and "1" in the leftmost column indicate the correct classification, and the "0" and "1" in the top row indicate the output of the method under test (the baseline method in FIG. 16). That is, 26,892 items had correct classification "0" and output "0"; 1,018 had correct classification "0" and output "1"; 8 had correct classification "1" and output "0"; and 452 had correct classification "1" and output "1". For the negative examples ("0"), recall was 99.97% and precision was 96.35%. For the positive examples ("1"), recall was 30.75% and precision was 98.26%. Over all data, recall was 96.38% and precision was 96.38%. "Total" is the number of experimental data items. Substituting these recall and precision values into Equation 10 gave an F value of 0.9813 for the negative examples and 0.4684 for the positive examples. The data in FIGS. 17 to 21 have the same meaning as in FIG. 16, so their description is omitted.
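The figures reported for FIG. 16 can be recomputed from the four counts with a short Python sketch. One point is an assumption of this sketch rather than something stated in the text: the reported recall and precision are reproduced when the 1,018 items are counted as true different notation pairs that the baseline labeled negative, and the 8 items as non-pairs that it labeled positive. The helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F value from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Counts from FIG. 16 under the assumed reading described in the lead-in.
hit_positive = 452       # true pairs labeled positive
missed_positive = 1018   # true pairs labeled negative (assumption)
false_positive = 8       # non-pairs labeled positive (assumption)
hit_negative = 26892     # non-pairs labeled negative

# Positive class: a miss is a false negative.
positive = prf(tp=hit_positive, fp=false_positive, fn=missed_positive)
# Negative class: a missed positive counts as a false "negative" detection.
negative = prf(tp=hit_negative, fp=missed_positive, fn=false_positive)

total = hit_positive + missed_positive + false_positive + hit_negative
accuracy = (hit_positive + hit_negative) / total

if __name__ == "__main__":
    print("positive P/R/F:", positive)
    print("negative P/R/F:", negative)
    print("total:", total, "accuracy:", accuracy)
```

Because the F value is symmetric in recall and precision, the F values come out the same whichever of the two readings of the table is taken.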

FIG. 17 shows the result of a 10-fold cross-validation experiment with the baseline method using only the closed data. In FIG. 17, the F value was 0.9814 for the negative examples and 0.4833 for the positive examples.

FIG. 18 shows the result of the baseline method under open-close evaluation. In FIG. 18, the F value was 0.9812 for the negative examples and 0.4529 for the positive examples.

In Experimental Result 4, the F value of the baseline method for the positive examples is far lower than the F value of the proposed method (for example, 0.912 in Experimental Result 1), so the proposed method can be said to be highly effective.
(Experimental result 5)

In the fifth experiment, the accuracy of a baseline method that treats everything as a positive example was calculated. With this all-positive baseline, recall was 100% and precision was 5.25% (0.0525). Substituting this recall and precision into Equation 10 gave an F value of 0.0998 for the positive examples ("1"). The F value of the all-positive baseline for the positive examples is far lower than the F value of the proposed method, so the proposed method can be said to be highly effective. The proposed method achieved an accuracy of 99.12%, a recall of 99.07%, a precision of 92.29%, and an F value of 0.912. The features and training data used in this experiment are not exactly the same as those used in evaluating the proposed method, but they overlap sufficiently to demonstrate the effectiveness of the proposed method.
(Application examples)

An application example of the different notation acquisition device 1 is described below. The application example is an information retrieval device that incorporates the different notation acquisition device 1. The information retrieval device comprises the different notation acquisition device 1 and a search unit. Specifically, the reception unit 14 receives a keyword (KW1). The different notation acquisition device 1 then acquires a different notation term (KW2) for the received KW1. The search unit then performs information retrieval using the search expression KW1 + KW2 (where + denotes OR). Needless to say, the target of the information retrieval is not restricted. The search unit may also simply invoke a so-called Web search engine.

When a user performing a keyword search with this information retrieval device enters 「スパゲティ」, the information retrieval device acquires 「スパゲッティ」, which is a different notation of 「スパゲティ」. The information retrieval device then searches using both 「スパゲティ」 and 「スパゲッティ」 as keywords. As a result, information written with either 「スパゲティ」 or 「スパゲッティ」 is hit, so information retrieval with few omissions can be realized.
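The OR-expanded search described above can be sketched in Python. Everything named here is hypothetical: `variant_lookup` is a plain dictionary standing in for the output of the different notation acquisition device 1, and `search` is a naive substring match standing in for the search unit.

```python
def build_or_terms(keyword, variant_lookup):
    """Return the keyword followed by its different notations (KW1 + KW2, + meaning OR)."""
    variants = [v for v in variant_lookup.get(keyword, []) if v != keyword]
    return [keyword] + variants


def search(documents, terms):
    """Return the documents that contain at least one of the terms."""
    return [doc for doc in documents if any(term in doc for term in terms)]


if __name__ == "__main__":
    # Stand-in for the acquired different notations (hypothetical data).
    variant_lookup = {"スパゲティ": ["スパゲッティ"]}
    documents = ["スパゲティの作り方", "スパゲッティ専門店", "うどんの作り方"]
    terms = build_or_terms("スパゲティ", variant_lookup)
    print(search(documents, terms))
```

Searching with the keyword alone would miss the second document, which is the kind of retrieval omission the application example aims to avoid.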

The information retrieval device is particularly effective for searching patent information, where omissions cannot be tolerated. Consider, for example, the use of the information retrieval device in a patent search system. Patent documents such as specifications, claims, and abstracts contain both 「コンピュータ」 and 「コンピューター」, for example, so a search will have omissions unless 「コンピュータ＋コンピューター」 is entered as the keyword. A searcher has therefore had to take great care at search time to think of the different notations of each keyword. Terms that are synonymous but written differently, such as 「デジタル」 and 「ディジタル」, are especially common in patent gazettes. By adopting this information retrieval device, patent information can be searched without omissions and without such care.

As described above, according to the present embodiment, different notation term pairs can be extracted with high accuracy regardless of the field to which the term pairs belong.

The present embodiment has mainly described how to judge whether a term pair with an edit distance of 1 is a different notation term pair. As noted above, however, the different notation acquisition device 1 can also judge whether a term pair with an edit distance of 2 is a different notation term pair.

Specifically, for a term pair with an edit distance of 2, the difference character acquisition means 151 of the feature acquisition unit 15 acquires each of the two sets of difference characters. Consider, for example, the following three term pairs: (1) 「できる」 and 「出来る」, (2) 「理解できる」 and 「できる」, and (3) 「ＩＸ」 (the Roman numeral for 9) and 「９」. For term pair (1), the difference character acquisition means 151 acquires two sets of difference characters: 「で」 and 「出」, and 「き」 and 「来」. For term pair (2), it acquires two sets: 「理」 and 「」, and 「解」 and 「」 (where 「」 is NULL). For term pair (3), it acquires either the two sets 「Ｉ」 and 「９」, and 「Ｘ」 and 「」, or the two sets 「Ｉ」 and 「」, and 「Ｘ」 and 「９」.
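One possible way to obtain such sets of difference characters is sketched below with Python's standard `difflib` module. This is an assumption made for illustration, not the method of the difference character acquisition means 151; the function name `difference_characters` is hypothetical, and for term pair (3) the sketch returns only one of the two alignments mentioned in the text.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def difference_characters(term_a, term_b):
    """Return (char_a, char_b) pairs for differing positions; "" stands for NULL."""
    matcher = SequenceMatcher(None, term_a, term_b, autojunk=False)
    pairs = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretch, no difference characters
        # Pair up the differing characters one by one; pad the shorter side with NULL.
        pairs.extend(zip_longest(term_a[a_start:a_end],
                                 term_b[b_start:b_end],
                                 fillvalue=""))
    return pairs


if __name__ == "__main__":
    for a, b in [("できる", "出来る"), ("理解できる", "できる"), ("ＩＸ", "９")]:
        print(a, b, difference_characters(a, b))
```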

The feature acquisition unit 152 then treats the two sets of difference characters acquired by the difference character acquisition unit 151 independently and acquires two sets of features, each a plurality of features including one or more of a character-type-related feature, a dictionary-related feature, and a similarity feature. That is, for term pair (1), the feature acquisition unit 152 extracts features for each of the two sets 「で」「出」 and 「き」「来」, regards the features extracted from each set as distinct, and creates two kinds of test data. For example, for the character-type-related feature that indicates whether the edited portions of the two terms differ in character type and are numerals of the same value, the feature acquisition unit 152 determines that the edited portions 「で」 and 「出」 are not numerals of the same value and acquires the value "0". For 「で」 and 「出」, it also acquires, for example, the dictionary-related feature "1", which indicates whether the readings of the two match: it obtains the reading 「で」 of 「出」 from the term dictionary 13 and determines that the readings of 「で」 and 「出」 match. As for the features representing the characters before and after the difference characters (the edited portion), for the difference characters 「で」「出」 it acquires, for example, the preceding-character feature 「」 (none) and the following-character features 「き」 and 「来」; for the difference characters 「き」「来」 it acquires the preceding-character features 「出」 and 「で」 and the following-character feature 「る」. Through this processing, the other difference characters are also included among the features.
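As an illustration of the preceding- and following-character features, the following minimal sketch handles the substitution case, where both difference characters are present. The names `char_at` and `context_features` and the dictionary keys are hypothetical; the patent does not prescribe a data layout, and the case where one side is NULL is left out.

```python
def char_at(term, index):
    """Character at index, or "" (none) when the index is out of range."""
    return term[index] if 0 <= index < len(term) else ""


def context_features(term_a, term_b, idx_a, idx_b):
    """Characters adjacent to the difference characters term_a[idx_a], term_b[idx_b]."""
    return {
        "prev_a": char_at(term_a, idx_a - 1),
        "prev_b": char_at(term_b, idx_b - 1),
        "next_a": char_at(term_a, idx_a + 1),
        "next_b": char_at(term_b, idx_b + 1),
    }


# One feature set per difference-character set, i.e. two kinds of test data.
for position in (0, 1):
    print(context_features("できる", "出来る", position, position))
```

For the second set 「き」「来」, the preceding characters are 「で」 and 「出」, so each kind of test data carries the other difference characters as context.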

For term pair (2), as in (1), the feature acquisition unit 152 extracts features for each of the two sets of difference characters 「理」「」 and 「解」「」, regards the features extracted from each set as distinct, and creates two kinds of test data. For term pair (3), as in (1) and (2), it extracts features for each of, for example, the two sets 「Ｉ」「９」 and 「Ｘ」「」, regards the features extracted from each set as distinct, and creates two kinds of test data.

Next, for each of (1), (2), and (3), the machine learning unit 16 determines, for each of the two kinds of test data, whether it represents a different notation term pair. If, for example, both kinds of test data are determined to be different notation term pairs, the original term pair (for example, 「できる」「出来る」) is regarded as a different notation term pair, and the output unit 17 outputs the determination result. As described above, the output unit 17 may combine the two scores obtained for the two kinds of test data in any of the following ways: it may adopt the score closer to 0; it may adopt the score with the larger absolute value; or it may adopt the smaller of the two scores. In each case, the term pair is judged a positive example (a different notation term pair) if the adopted score is positive and a negative example (not a different notation term pair) if the adopted score is negative.
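The three ways of combining the two scores can be sketched as follows. The function names and the rule labels are hypothetical, and the scores are made-up values standing in for classifier outputs.

```python
def combine_scores(scores, rule):
    """Pick one score from the per-test-data scores according to the rule."""
    if rule == "closest_to_zero":
        return min(scores, key=abs)
    if rule == "largest_abs":
        return max(scores, key=abs)
    if rule == "smallest":
        return min(scores)
    raise ValueError(f"unknown rule: {rule}")


def is_different_notation(score):
    """Positive score: positive example; negative score: negative example."""
    return score > 0


scores = [0.8, -0.3]  # illustrative values only
for rule in ("closest_to_zero", "largest_abs", "smallest"):
    adopted = combine_scores(scores, rule)
    print(rule, adopted, is_different_notation(adopted))
```

With these illustrative scores the rules disagree: adopting the score closer to 0, or the smaller score, yields a negative example, whereas adopting the score with the larger absolute value yields a positive example.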

When, as in (3), there are two alternative sets of difference-character pairs (for example, 「Ｉ」「９」 and 「Ｘ」「」, or 「Ｉ」「」 and 「Ｘ」「９」), that is, when two problems (problem 1 and problem 2) arise, features are extracted for each set of difference characters, the features extracted from each set are regarded as distinct, and four kinds of test data are created. Then, for each of the two problems, the calculated score closer to 0 may be taken, and of the resulting per-problem scores the one with the larger absolute value may be used as the overall score; the term pair is judged a positive example if this score is positive and a negative example if it is negative. For example, when the term pair is (3) 「ＩＸ」「９」, the problems 「Ｉ」「９」 and 「Ｘ」「」, and 「Ｉ」「」 and 「Ｘ」「９」, arise. The machine learning unit 16 takes the smaller of the scores for 「Ｉ」「９」 and 「Ｘ」「」, takes the smaller of the scores for 「Ｉ」「」 and 「Ｘ」「９」, and uses the larger of the two values so obtained as the score for the term pair 「ＩＸ」「９」. The machine learning unit 16 may then judge the term pair a positive example if this score is positive and a negative example if it is negative. Alternatively, the machine learning unit 16 may take the score closer to 0 for 「Ｉ」「９」 and 「Ｘ」「」, take the score closer to 0 for 「Ｉ」「」 and 「Ｘ」「９」, and use whichever of the two has the larger absolute value as the score for the term pair 「ＩＸ」「９」, again judging the term pair a positive example if the score is positive and a negative example if it is negative.
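The two-level combination for a term pair with two problems can be sketched as below. The function name `score_term_pair` and the example scores are hypothetical; each inner list holds the two scores of one problem.

```python
def score_term_pair(problems, within, across):
    """Reduce each problem's scores with `within`, then reduce across problems."""
    per_problem = [within(scores) for scores in problems]
    return across(per_problem)


def closest_to_zero(scores):
    return min(scores, key=abs)


def largest_abs(scores):
    return max(scores, key=abs)


# Illustrative scores for the two problems of one term pair (four test data).
problems = [[0.9, -0.4], [0.6, 0.2]]

# Variant 1: smaller score within each problem, larger value across problems.
print(score_term_pair(problems, within=min, across=max))
# Variant 2: score closer to 0 within each problem, larger absolute value across.
print(score_term_pair(problems, within=closest_to_zero, across=largest_abs))
```

With these made-up scores the first variant gives 0.2 (a positive example) and the second gives -0.4 (a negative example), so the choice of variant can change the final judgment.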

In this embodiment, a term pair with an edit distance of 3 or more may be handled in the same way as a term pair with an edit distance of 2: three or more pieces of test data are created, and their determination results are used to determine whether the original term pair is a different notation term pair. In that case, new features may be introduced into the machine learning method, for example by using one or two of the three or more difference characters as features.

In this embodiment, the different notation acquisition device 1 was able to determine, for example, that the Japanese term pair 「あなた」 and 「あんた」 (both meaning "you") is a different notation term pair. It can likewise determine that a term pair in a language other than Japanese (for example, English), such as "colour" and "color", is a different notation term pair.

In this embodiment, when the edited portions of the two terms constituting a term pair consist of two or more characters, the different notation acquisition device 1 may, instead of matching the characters one by one, group the edited portion together and apply machine learning to the grouped character strings as they are. That is, for the term pair 「１２３組」 and 「百二十三組」 (the same number written in Arabic and in kanji numerals), the edited portions may be grouped as 「１２３」 and 「百二十三」 and processed. For this term pair, the following are obtained, for example: S1 (edited portion of the first notation) "１２３"; S2 (edited portion of the second notation) "百二十三"; S3 (the one character before the edited portion) "" (none); S4 (the one character after the edited portion) "組"; S55 (when both edited portions are numerals, whether they have the same value or different values) "1" (same value); S56 (when both edited portions of a Japanese term pair are hiragana, whether they have the same pronunciation or different pronunciations) "0"; and so on. Learning data for the term pair 「１２３組」「百二十三組」 may then be constructed, accumulated in the learning data storage unit 12, and used. Likewise, for test data of a term pair whose edited portions consist of two or more characters, the feature acquisition unit 15 of the different notation acquisition device 1 may process the edited portions as a group and acquire, for example, the 68 features described above, and the machine learning unit 16 may determine whether the test data is a different notation pair.
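A minimal sketch of grouping the edited portion, assuming it can be found by stripping the longest common prefix and suffix of the two terms (the patent does not state how the grouped span is located). The function name `merged_edit_span` is hypothetical; the keys S1 to S4 follow the feature labels in the text.

```python
def merged_edit_span(a, b):
    """Group the differing middle of two terms and return features S1 to S4."""
    # Length of the common prefix.
    p = 0
    while p < len(a) and p < len(b) and a[p] == b[p]:
        p += 1
    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while s < len(a) - p and s < len(b) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    return {
        "S1": a[p:len(a) - s],                  # edited portion, first notation
        "S2": b[p:len(b) - s],                  # edited portion, second notation
        "S3": a[p - 1] if p > 0 else "",        # one character before the span
        "S4": a[len(a) - s] if s > 0 else "",   # one character after the span
    }


print(merged_edit_span("１２３組", "百二十三組"))
```

Features such as S55 (same numeric value) would additionally need a conversion of kanji numerals to numbers, which is outside this sketch.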

Furthermore, the processing in this embodiment may be realized by software. The software may be distributed by software download or the like, or may be recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification. The software that realizes the information processing apparatus in this embodiment is the following program. That is, the program assumes a storage medium that stores term pairs having an edit distance of 1 or more, and two or more pieces of learning data each associating a plurality of features with positive/negative information indicating whether the term pair is a different notation term pair, the plurality of features including one or more of a character-type-related feature (a feature relating to the character type of the edited portion, that is, the characters that differ between the terms of the pair), a dictionary-related feature (a feature acquired using a term dictionary), and a similarity feature (a feature indicating the similarity of the two terms constituting the term pair). The program causes a computer to function as: a feature acquisition unit that acquires, for each term pair in the storage medium, a plurality of features including one or more of the character-type-related feature, the dictionary-related feature, and the similarity feature; a machine learning unit that determines, for each term pair, by a supervised machine learning method using the two or more pieces of learning data in the storage medium and the plurality of features acquired by the feature acquisition unit, whether each term pair in the storage medium is a different notation term pair; and an output unit that outputs the determination result of the machine learning unit.

(Embodiment 2)

This embodiment describes a different notation acquisition device 2 that holds replacement character pairs, uses them to generate term pairs from terms, and obtains different notation term pairs from those term pairs by machine learning. In addition to the functions of the different notation acquisition device 1, the different notation acquisition device 2 has a function of generating different notation term pairs using patterns.

FIG. 19 is a block diagram of the different notation acquisition device 2 in this embodiment.

The different notation acquisition device 2 includes a term pair storage unit 11, a different notation term pair storage unit 21, a learning data storage unit 12, a term dictionary 13, a different notation pattern storage unit 22, a reception unit 23, an edit location acquisition unit 24, a different notation pattern acquisition unit 25, a different notation pattern accumulation unit 26, a term pair generation unit 27, a feature acquisition unit 15, a machine learning unit 16, and an output unit 17.

The different notation term pair storage unit 21 can store one or more different notation term pairs whose edit distance is 1.

The different notation pattern storage unit 22 can store one or more different notation patterns, each consisting of a pair of a first character string and a second character string that represents a pattern of notational variation.

The reception unit 23 receives input from the user. The reception unit 23 also receives one or more terms, each being a term from which term pairs are generated. The reception unit 23 may receive a term through user input, by reading it from a storage medium, or by receiving it via communication means.

The edit location acquisition unit 24 acquires the edited portions of the one or more different notation term pairs stored in the different notation term pair storage unit 21.

The different notation pattern acquisition unit 25 acquires, from the edited portions acquired by the edit location acquisition unit 24, different notation patterns each having a first character string and a second character string as a pair. For example, from the term pair 「２番目」 and 「二番目」 (both meaning "second"), it acquires the different notation pattern 「２」「二」, which pairs the first character string 「２」 with the second character string 「二」. Likewise, from the term pair 「自サーバ」 and 「自サーバー」 (both meaning "own server"), it acquires the different notation pattern 「del」「ー」.
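Pattern acquisition from an edit-distance-1 term pair can be sketched as follows, assuming the differing character is located by stripping the common prefix and suffix. The names `extract_pattern` and `accumulate` are hypothetical, the marker string "del" follows the text, and storing a pattern only when it is not yet present is an assumption of this sketch.

```python
DEL = "del"  # marker for "no character on this side", as in the text


def extract_pattern(a, b):
    """Return the (first, second) pattern for an edit-distance-1 pair, else None."""
    p = 0
    while p < len(a) and p < len(b) and a[p] == b[p]:
        p += 1
    s = 0
    while s < len(a) - p and s < len(b) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    rest_a, rest_b = a[p:len(a) - s], b[p:len(b) - s]
    if len(rest_a) > 1 or len(rest_b) > 1 or (not rest_a and not rest_b):
        return None  # identical terms, or edit distance greater than 1
    return (rest_a or DEL, rest_b or DEL)


def accumulate(patterns, term_pairs):
    """Append each newly seen pattern to the list (assumed de-duplication)."""
    for a, b in term_pairs:
        pattern = extract_pattern(a, b)
        if pattern is not None and pattern not in patterns:
            patterns.append(pattern)
    return patterns


print(accumulate([], [("２番目", "二番目"), ("自サーバ", "自サーバー")]))
```

The two example pairs from the text produce the patterns 「２」「二」 and 「del」「ー」.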

The different notation pattern accumulation unit 26 accumulates the different notation patterns acquired by the different notation pattern acquisition unit 25 in the different notation pattern storage unit 22.

The term pair generation unit 27 applies each of the one or more different notation patterns in the different notation pattern storage unit 22 to each of the one or more terms received by the reception unit 23 to generate one or more terms, and generates one or more different notation candidate term pairs, each being a candidate different notation term pair consisting of a received term and a term generated from it.

The different notation term pair storage unit 21 and the different notation pattern storage unit 22 are preferably non-volatile recording media, but can also be realized by volatile recording media.

The process by which different notation term pairs come to be stored in the different notation term pair storage unit 21 is not limited.

The edit location acquisition unit 24, the different notation pattern acquisition unit 25, the different notation pattern accumulation unit 26, and the term pair generation unit 27 can usually be realized by an MPU, memory, and the like. The processing procedures of the edit location acquisition unit 24 and the other units are usually realized by software, which is recorded on a recording medium such as a ROM. They may, however, be realized by hardware (dedicated circuits).

Next, the operation of the different notation acquisition device 2 is described with reference to the flowchart of FIG. 20. The flowchart of FIG. 20 covers the processing for accumulating different notation patterns and the processing for generating term pairs. The processing for determining different notation term pairs and the processing for outputting the determination results, which are the same as in the different notation acquisition device 1, are not described.

(Step S2001) The reception unit 23 determines whether an instruction to generate different notation patterns has been received. If the instruction has been received, the process goes to step S2002; if not, it goes to step S2009.

(Step S2002) The edit location acquisition unit 24 assigns 1 to a counter i.

(Step S2003) The edit location acquisition unit 24 determines whether an i-th term pair exists in the different notation term pair storage unit 21. If the i-th term pair exists, the process goes to step S2004; if not, it returns to step S2001.

(Step S2004) The edit location acquisition unit 24 acquires the difference characters (edited portion) of the i-th term pair.

(Step S2005) The different notation pattern acquisition unit 25 constructs a different notation pattern from the difference characters (edited portion) acquired in step S2004.

(Step S2006) The different notation pattern accumulation unit 26 determines whether the different notation pattern acquired in step S2005 exists in the different notation pattern storage unit 22. If it exists, the process goes to step S2007; if not, it goes to step S2008.

(Step S2007) The different notation pattern accumulation unit 26 accumulates the different notation pattern acquired in step S2005 in the different notation pattern storage unit 22.

(Step S2008) The edit location acquisition unit 24 increments the counter i by 1. The process returns to step S2003.

(Step S2009) The reception unit 23 determines whether a term has been received. If a term has been received, the process goes to step S2010; if not, it returns to step S2001.

(Step S2010) The term pair generation unit 27 assigns 1 to the counter i.

(Step S2011) The term pair generation unit 27 determines whether an i-th different notation pattern exists in the different notation pattern storage unit 22. If it exists, the process goes to step S2012; if not, the process ends.

(Step S2012) The term pair generation unit 27 determines whether the term received in step S2009 matches the i-th different notation pattern. If it matches, the process goes to step S2013; if not, it goes to step S2016. Note that the different notation pattern 「２」「二」 does not match the term 「ＷＷＷサーバ」 ("WWW server"), because both elements of the pattern are characters and the term contains neither of them. The different notation pattern 「del」「ー」, by contrast, matches the term 「ＷＷＷサーバ」. When a different notation pattern contains "del", every term matches the pattern.

(Step S2013) The term pair generation unit 27 applies the i-th different notation pattern to the term received in step S2009 and acquires one or more different notation terms. When the term is 「アイトラッキング」 ("eye tracking") and the i-th different notation pattern is 「del」「・」, the term pair generation unit 27 applies the pattern by inserting 「・」 between each pair of adjacent characters, generating the seven different notation terms 「ア・イトラッキング」, 「アイ・トラッキング」, 「アイト・ラッキング」, 「アイトラ・ッキング」, 「アイトラッ・キング」, 「アイトラッキ・ング」, and 「アイトラッキン・グ」. When the term is 「一番目」 ("first") and the i-th different notation pattern is 「一」「１」, the term pair generation unit 27 applies the pattern to the term and generates 「１番目」.

(Step S2014) The term pair generation unit 27 generates one or more term pairs using the term received in step S2009 and the one or more different notation terms generated in step S2013. For example, when the term is 「アイトラッキング」 and the i-th different notation pattern is 「del」「・」, the term pair generation unit 27 generates the seven term pairs 「アイトラッキング」「ア・イトラッキング」, 「アイトラッキング」「アイ・トラッキング」, 「アイトラッキング」「アイト・ラッキング」, 「アイトラッキング」「アイトラ・ッキング」, 「アイトラッキング」「アイトラッ・キング」, 「アイトラッキング」「アイトラッキ・ング」, and 「アイトラッキング」「アイトラッキン・グ」. When the term is 「一番目」 and the i-th different notation pattern is 「一」「１」, the term pair generation unit 27 generates the term pair 「一番目」「１番目」.

(Step S2015) The term pair generation unit 27 accumulates the one or more term pairs generated in step S2013 in the term pair storage unit 11.

(Step S2016) The term pair generation unit 27 increments the counter i by 1. The process returns to step S2011.
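The loop of steps S2010 to S2016 can be sketched as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than statements of the patent: pattern elements are single characters, and a pattern without "del" is applied in both directions, replacing one occurrence at a time.

```python
DEL = "del"  # marker used in the different notation patterns


def matches(term, pattern):
    """Step S2012: a pattern containing "del" matches every term."""
    first, second = pattern
    return DEL in pattern or first in term or second in term


def apply_pattern(term, pattern):
    """Step S2013: generate the different notation terms for one pattern."""
    first, second = pattern
    if DEL in pattern:
        ch = second if first == DEL else first
        # Insert the character between each pair of adjacent characters.
        return [term[:i] + ch + term[i:] for i in range(1, len(term))]
    variants = []
    for src, dst in ((first, second), (second, first)):
        for i, c in enumerate(term):
            if c == src:
                variants.append(term[:i] + dst + term[i + 1:])
    return variants


def generate_term_pairs(term, patterns):
    """Steps S2010 to S2016: try every stored pattern against the term."""
    pairs = []
    for pattern in patterns:
        if matches(term, pattern):
            pairs.extend((term, variant) for variant in apply_pattern(term, pattern))
    return pairs


for pair in generate_term_pairs("アイトラッキング", [(DEL, "・"), ("イ", "ィ")]):
    print(pair)
```

For the eight-character term 「アイトラッキング」 there are seven gaps between adjacent characters, which is why a "del" pattern yields seven candidate pairs.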

The specific operation of the different notation acquisition device 2 in this embodiment is described below.

FIG. 21 shows examples of the different notation patterns acquired by the different notation pattern acquisition unit 25 and accumulated in the different notation pattern storage unit 22 by the different notation pattern accumulation unit 26. In FIG. 21, "del" indicates that the other pattern character is deleted. That is, for the pattern character paired with "del", the different notation pattern acquisition unit 25 generates term pairs for all the terms used in the large-scale similar word list.

In this situation, as described above, the term pair generation unit 27 determines that the received term 「アイトラッキング」 matches the first different notation pattern 「del」「・」. When the term 「アイトラッキング」 is input, the first different notation pattern 「del」「・」 is applied: the term pair generation unit 27 inserts 「・」 between each pair of adjacent characters of the term and generates the seven different notation terms 「ア・イトラッキング」, 「アイ・トラッキング」, 「アイト・ラッキング」, 「アイトラ・ッキング」, 「アイトラッ・キング」, 「アイトラッキ・ング」, and 「アイトラッキン・グ」. Next, the term pair generation unit 27 generates the seven term pairs 「アイトラッキング」「ア・イトラッキング」, 「アイトラッキング」「アイ・トラッキング」, 「アイトラッキング」「アイト・ラッキング」, 「アイトラッキング」「アイトラ・ッキング」, 「アイトラッキング」「アイトラッ・キング」, 「アイトラッキング」「アイトラッキ・ング」, and 「アイトラッキング」「アイトラッキン・グ」. The different notation pattern accumulation unit 26 then accumulates the seven term pairs in the different notation pattern storage unit 22.

Next, the term pair generation unit 27 determines that the received term 「アイトラッキング」 matches the second different notation pattern 「del」「−」. The second different notation pattern 「del」「−」 is then applied to the term, and the term pair generation unit 27 generates 「ア−イトラッキング」, 「アイ−トラッキング」, 「アイト−ラッキング」, 「アイトラ−ッキング」, 「アイトラッ−キング」, 「アイトラッキ−ング」, and 「アイトラッキン−グ」. Next, the term pair generation unit 27 generates the seven term pairs 「アイトラッキング」「ア−イトラッキング」, 「アイトラッキング」「アイ−トラッキング」, 「アイトラッキング」「アイト−ラッキング」, 「アイトラッキング」「アイトラ−ッキング」, 「アイトラッキング」「アイトラッ−キング」, 「アイトラッキング」「アイトラッキ−ング」, and 「アイトラッキング」「アイトラッキン−グ」. The different notation pattern accumulation unit 26 then accumulates the seven term pairs in the different notation pattern storage unit 22.

Next, the term pair generation unit 27 determines that the received term 「アイトラッキング」 matches the third different notation pattern 「del」「い」. The third different notation pattern 「del」「い」 is then applied to the term, and the term pair generation unit 27 generates 「アいイトラッキング」, 「アイいトラッキング」, 「アイトいラッキング」, 「アイトラいッキング」, 「アイトラッいキング」, 「アイトラッキいング」, and 「アイトラッキンいグ」. Next, the term pair generation unit 27 generates the seven term pairs 「アイトラッキング」「アいイトラッキング」, 「アイトラッキング」「アイいトラッキング」, 「アイトラッキング」「アイトいラッキング」, 「アイトラッキング」「アイトラいッキング」, 「アイトラッキング」「アイトラッいキング」, 「アイトラッキング」「アイトラッキいング」, and 「アイトラッキング」「アイトラッキンいグ」. The different notation pattern accumulation unit 26 then accumulates the seven term pairs in the different notation pattern storage unit 22.

Next, the term pair generation unit 27 determines that the received term 「アイトラッキング」 does not match the fourth different notation pattern 「−」「１」, because the term contains neither of the characters constituting the pattern.

Next, the term pair generation unit 27 determines that the received term 「アイトラッキング」 matches the fifth different notation pattern 「イ」「ィ」, because the term contains the character 「イ」 that constitutes the pattern. The fifth different notation pattern 「イ」「ィ」 is then applied to the term, and the term pair generation unit 27 generates 「アィトラッキング」. Next, the term pair generation unit 27 generates the single term pair 「アイトラッキング」「アィトラッキング」. The different notation pattern accumulation unit 26 then accumulates the one term pair in the different notation pattern storage unit 22.

The term pair generation unit 27 then applies the sixth and subsequent different notation patterns in the same way.

The term pair generation unit 27 then accumulates the new term pairs in the term pair storage unit 11.

As described above, according to this embodiment, candidates for different notation term pairs can be generated automatically. In addition, according to this embodiment, the different notation patterns used to generate those candidates can themselves be acquired automatically.

The software that realizes the information processing apparatus in this embodiment is the following program. That is, the program assumes a storage medium that stores two or more pieces of learning data, each associating a plurality of features with positive/negative information indicating whether the term pair is a different notation term pair, the plurality of features including one or more of a character-type-related feature (a feature relating to the character type of the edited portion, that is, the characters that differ between the terms of the pair), a dictionary-related feature (a feature acquired using a term dictionary), and a similarity feature (a feature indicating the similarity of the two terms constituting the term pair), and a storage medium that stores one or more different notation patterns, each having as a pair a first character string and a second character string that represent a pattern of notational variation. The program causes a computer to function as: a reception unit that receives one or more terms; a term pair generation unit that applies each of the one or more different notation patterns in the storage medium to each of the one or more terms received by the reception unit to generate one or more terms, and generates one or more different notation candidate term pairs, each being a candidate different notation term pair having one of the received terms and a generated term; a feature acquisition unit that acquires, for each of the one or more different notation candidate term pairs generated by the term pair generation unit, a plurality of features including one or more of the character-type-related feature, the dictionary-related feature, and the similarity feature; a machine learning unit that determines, for each different notation candidate term pair generated by the term pair generation unit, by a supervised machine learning method using the two or more pieces of learning data in the storage medium and the plurality of features acquired by the feature acquisition unit, whether each different notation candidate term pair in the term pair storage unit is a different notation term pair; and an output unit that outputs the determination result of the machine learning unit.

FIG. 22 shows the external appearance of a computer that executes the programs described in this specification and realizes the different notation acquisition device and the like of the embodiments described above. The embodiments described above can be realized by computer hardware and a computer program executed on it. FIG. 22 is an overview of this computer system 340, and FIG. 23 is a block diagram of the computer system 340.

In FIG. 22, the computer system 340 includes a computer 341 that includes an FD drive and a CD-ROM drive, a keyboard 342, a mouse 343, and a monitor 344.

In FIG. 23, the computer 341 includes, in addition to the FD drive 3411 and the CD-ROM drive 3412: an MPU 3413; a bus 3414 connected to the CD-ROM drive 3412 and the FD drive 3411; a ROM 3415 for storing programs such as a boot-up program; a RAM 3416, connected to these components, for temporarily storing the instructions of application programs and providing temporary storage space; and a hard disk 3417 for storing application programs, system programs, and data. Although not shown, the computer 341 may further include a network card that provides a connection to a LAN.

The program that causes the computer system 340 to execute the functions of the different notation acquisition device and the like of the embodiments described above may be stored on the CD-ROM 3501 or the FD 3502, inserted into the CD-ROM drive 3412 or the FD drive 3411, and then transferred to the hard disk 3417. Alternatively, the program may be transmitted to the computer 341 via a network (not shown) and stored on the hard disk 3417. The program is loaded into the RAM 3416 when it is executed. The program may also be loaded directly from the CD-ROM 3501, the FD 3502, or the network.

The program need not include an operating system (OS), third-party programs, or the like that cause the computer 341 to execute the functions of the different notation acquisition device and the like of the embodiments described above. The program need only include the instructions that call the appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 340 operates is well known, so a detailed description is omitted.

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

In each of the above embodiments, two or more communication means present in one device may, needless to say, be physically realized by a single medium.

In each of the above embodiments, each process (each function) may be realized by centralized processing on a single device (system) or by distributed processing across a plurality of devices.

The present invention is not limited to the embodiments described above. Various modifications are possible, and these are, needless to say, also included within the scope of the present invention.

As described above, the different notation acquisition device according to the present invention has the effect of enabling highly accurate extraction of different notation term pairs regardless of the field of the term pairs, and is useful as a different notation acquisition device and the like.

Description of symbols

1, 2 Different notation acquisition device
11 Term pair storage unit
12 Learning data storage unit
13 Term dictionary
14, 23 Reception unit
15 Feature acquisition unit
16 Machine learning unit
17 Output unit
21 Different notation term pair storage unit
22 Different notation pattern storage unit
24 Edit location acquisition unit
25 Different notation pattern acquisition unit
26 Different notation pattern accumulation unit
27 Term pair generation unit
151 Difference character acquisition means
152 Feature acquisition means
171 Threshold storage means
172 Threshold determination means
173 Output means

Claims

A different notation acquisition device comprising:
a term pair storage unit capable of storing one or more term pairs having an edit distance of 1 or more;
a learning data storage unit capable of storing two or more pieces of learning data, each associating a plurality of features with positive/negative information, the plurality of features including one or more of a character type-related feature, which is a feature relating to the character type of the edit location (the characters that differ between the terms of a term pair), a dictionary-related feature, which is a feature acquired using a term dictionary, and a similarity feature, which is a feature indicating the similarity of the two terms constituting the term pair, and the positive/negative information indicating whether the term pair is a different notation term pair;
a feature acquisition unit that acquires, for each term pair in the term pair storage unit, a plurality of features including one or more of the character type-related feature, the dictionary-related feature, and the similarity feature;
a machine learning unit that determines, for each term pair, by a supervised machine learning method using the two or more pieces of learning data in the learning data storage unit and the plurality of features acquired by the feature acquisition unit, whether each term pair in the term pair storage unit is a different notation term pair;
an output unit that outputs the determination result of the machine learning unit; and
a term dictionary capable of storing one or more pieces of term information, each having a term and a representative notation of the term,
wherein the dictionary-related feature is information indicating whether the representative notations of the two terms of a term pair are the same, and
the feature acquisition unit acquires, for each term pair in the term pair storage unit, the representative notations of the two terms of the term pair from the term dictionary, determines whether the two acquired representative notations are the same, and acquires the determination result as the dictionary-related feature.
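
The dictionary-related feature described above can be illustrated with a minimal sketch. The dictionary contents, the function name `dictionary_feature`, and the choice to return `False` when a term is missing from the dictionary are all assumptions made for illustration; the claim itself only requires looking up the two representative notations and comparing them.

```python
from typing import Dict, Tuple

# Hypothetical term dictionary: term -> representative notation.
# The entries are illustrative only and are not taken from the patent.
TERM_DICTIONARY: Dict[str, str] = {
    "スパゲティ": "スパゲッティ",
    "スパゲッティ": "スパゲッティ",
    "問い合わせ": "問い合わせ",
    "問合わせ": "問い合わせ",
    "闘争": "闘争",
    "紛争": "紛争",
}


def dictionary_feature(pair: Tuple[str, str], dictionary: Dict[str, str]) -> bool:
    """Return True when both terms have the same representative notation.

    Assumption: a term that is absent from the dictionary yields False.
    """
    first = dictionary.get(pair[0])
    second = dictionary.get(pair[1])
    return first is not None and first == second


if __name__ == "__main__":
    for pair in [("スパゲティ", "スパゲッティ"), ("闘争", "紛争")]:
        print(pair, dictionary_feature(pair, TERM_DICTIONARY))
```

The boolean result would be one element of the feature set passed to the machine learning unit together with the character type-related and similarity features.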

The different notation acquisition device according to claim 1, wherein
the character type-related feature is information indicating whether the term pair satisfies the condition that the edit locations of the two terms of the term pair differ in character type and are numbers having the same value, and
the feature acquisition unit determines, for each term pair in the term pair storage unit, whether the edit locations of the two terms of the term pair differ in character type and are numbers having the same value, and acquires the determination result as the character type-related feature.

The different notation acquisition device according to claim 1, wherein
the character type-related feature is information indicating whether the term pair satisfies the condition that the character type of the edit locations of the two terms of the term pair is Roman letters and the edit locations differ only in uppercase versus lowercase, and
the feature acquisition unit determines, for each term pair in the term pair storage unit, whether the term pair satisfies the condition that the character type of the edit locations of the two terms is Roman letters and the edit locations differ only in uppercase versus lowercase, and acquires the determination result as the character type-related feature.
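
The two character type-related features in the preceding claims (same-valued numbers written in different character types, and Roman letters differing only in case) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the edit location is found by stripping the common prefix and suffix, the character type inventory in `char_type` (including treating ASCII and full-width characters as distinct types) is a hypothetical choice, and numeric values come from Python's `unicodedata.numeric`, which only handles single characters.

```python
import unicodedata
from typing import Optional, Tuple


def edit_locations(a: str, b: str) -> Tuple[str, str]:
    """Strip the longest common prefix and suffix; return what remains of each term."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return a[prefix:len(a) - suffix], b[prefix:len(b) - suffix]


def char_type(ch: str) -> str:
    """Hypothetical character type inventory based on Unicode ranges."""
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isascii() and ch.isdigit():
        return "ascii_digit"
    if "\uff10" <= ch <= "\uff19":
        return "fullwidth_digit"
    if ch.isascii() and ch.isalpha():
        return "roman"
    if "\uff21" <= ch <= "\uff3a" or "\uff41" <= ch <= "\uff5a":
        return "fullwidth_roman"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def numeric_value(text: str) -> Optional[float]:
    """Numeric value of a single character, or None if it has none."""
    try:
        return unicodedata.numeric(text)
    except (TypeError, ValueError):
        return None


def same_number_in_different_type(a: str, b: str) -> bool:
    """True when the edit locations are same-valued numbers of different character types."""
    ea, eb = edit_locations(a, b)
    va, vb = numeric_value(ea), numeric_value(eb)
    if va is None or vb is None:
        return False
    return va == vb and char_type(ea) != char_type(eb)


ROMAN_TYPES = {"roman", "fullwidth_roman"}


def roman_case_difference(a: str, b: str) -> bool:
    """True when the edit locations are Roman letters that differ only in case."""
    ea, eb = edit_locations(a, b)
    if not ea or not eb:
        return False
    if any(char_type(ch) not in ROMAN_TYPES for ch in ea + eb):
        return False
    return ea != eb and ea.lower() == eb.lower()
```

For example, `same_number_in_different_type("第3章", "第三章")` compares the edit locations `3` and `三`, and `roman_case_difference("Webサイト", "webサイト")` compares `W` and `w`.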

The different notation acquisition device according to claim 1, wherein
the term information also has, in association with the term, a reading of the term,
the dictionary-related feature is information indicating whether the representative notations of the two terms of a term pair are the same, and
the feature acquisition unit acquires, for each term pair in the term pair storage unit, the representative notations of the two terms of the term pair from the term dictionary, determines whether the two acquired representative notations are the same, and acquires the determination result as the dictionary-related feature.

The different notation acquisition device according to any one of claims 1 to 4, wherein
the machine learning unit determines whether each term pair in the term pair storage unit is a different notation term pair and also acquires a score indicating the likelihood that the term pair is a different notation term pair, and
the output unit also outputs the score acquired by the machine learning unit.

The different notation acquisition device according to claim 5, wherein the output unit comprises:
threshold storage means for storing a threshold for the score;
threshold determination means for determining whether the score acquired by the machine learning unit is equal to or greater than the threshold, or greater than the threshold; and
output means that treats a term pair whose score the threshold determination means has determined to be equal to or greater than the threshold, or greater than the threshold, as having the determination result of being a different notation term pair, and outputs one or more of the determination result, the term pairs that are different notation term pairs, and the term pairs that are not different notation term pairs.
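
The threshold determination above can be sketched as a simple split of scored term pairs. The function name `split_by_threshold`, the `inclusive` flag (covering the "equal to or greater than" versus "greater than" alternatives), and the example scores are hypothetical; the scores themselves would come from the machine learning unit.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def split_by_threshold(
    scores: Dict[Pair, float], threshold: float, inclusive: bool = True
) -> Tuple[List[Pair], List[Pair]]:
    """Split term pairs into (different notation pairs, other pairs) by score.

    inclusive=True  -> accept when score >= threshold
    inclusive=False -> accept when score >  threshold
    """
    accepted: List[Pair] = []
    rejected: List[Pair] = []
    for pair, score in scores.items():
        is_variant = score >= threshold if inclusive else score > threshold
        (accepted if is_variant else rejected).append(pair)
    return accepted, rejected


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example_scores = {
        ("問い合わせメール", "問合わせメール"): 0.9,
        ("学園闘争", "学園紛争"): 0.2,
    }
    print(split_by_threshold(example_scores, threshold=0.5))
```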

A different notation acquisition device comprising:
a learning data storage unit capable of storing two or more pieces of learning data, each associating a plurality of features with positive/negative information, the plurality of features including one or more of a character type-related feature, which is a feature relating to the character type of the edit location (the characters that differ between the terms of a term pair), a dictionary-related feature, which is information indicating whether the representative notations of the two terms of the term pair are the same, and a similarity feature, which is a feature indicating the similarity of the two terms constituting the term pair, and the positive/negative information indicating whether the term pair is a different notation term pair;
a different notation pattern storage unit capable of storing one or more different notation patterns, each having a pair of a first character string and a second character string indicating a pattern of different notation;
a reception unit that accepts one or more terms;
a term pair generation unit that determines whether either character string of each of the one or more different notation patterns in the different notation pattern storage unit is contained in each of the one or more terms accepted by the reception unit, generates, when it is contained, one or more terms by replacing the portion of the term that matches the contained character string with the other character string of the different notation pattern, and generates one or more different notation candidate term pairs, each having the accepted term and the generated term;
a feature acquisition unit that acquires, for each of the one or more different notation candidate term pairs generated by the term pair generation unit, a plurality of features including one or more of the character type-related feature, the dictionary-related feature, and the similarity feature;
a machine learning unit that determines, for each different notation candidate term pair generated by the term pair generation unit, by a supervised machine learning method using the two or more pieces of learning data in the learning data storage unit and the plurality of features acquired by the feature acquisition unit, whether each different notation candidate term pair in the term pair storage unit is a different notation term pair;
an output unit that outputs the determination result of the machine learning unit;
a different notation term pair storage unit capable of storing one or more different notation term pairs having an edit distance of 1;
an edit location acquisition unit that acquires the edit locations of the one or more different notation term pairs stored in the different notation term pair storage unit;
a different notation pattern acquisition unit that, from the edit locations acquired by the edit location acquisition unit, acquires for each different notation term pair the edit location contained in the character string of one term of the pair as the first character string and the edit location contained in the character string of the other term of the pair as the second character string, and acquires a different notation pattern having the first character string and the second character string as a pair; and
a different notation pattern accumulation unit that accumulates the different notation patterns acquired by the different notation pattern acquisition unit in the different notation pattern storage unit.
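
The pattern acquisition and term pair generation steps above can be sketched as follows. This is an illustration under assumptions, not the patented procedure: the edit location is taken to be what remains after stripping the common prefix and suffix, each occurrence of a pattern string is replaced one at a time, and a pattern side that is empty (as arises from an insertion or deletion) is never used as the search string because an empty string would match everywhere. All function names and the example terms are hypothetical.

```python
from typing import List, Tuple

Pattern = Tuple[str, str]


def extract_pattern(a: str, b: str) -> Pattern:
    """Return (first string, second string): the differing parts of a known variant pair."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return a[prefix:len(a) - suffix], b[prefix:len(b) - suffix]


def generate_candidates(term: str, patterns: List[Pattern]) -> List[Tuple[str, str]]:
    """Apply each pattern in both directions and pair the term with each generated term."""
    pairs: List[Tuple[str, str]] = []
    for first, second in patterns:
        for source, target in ((first, second), (second, first)):
            if not source:  # assumption: skip empty search strings
                continue
            start = term.find(source)
            while start != -1:
                generated = term[:start] + target + term[start + len(source):]
                candidate = (term, generated)
                if generated != term and candidate not in pairs:
                    pairs.append(candidate)
                start = term.find(source, start + 1)
    return pairs


if __name__ == "__main__":
    # Known variant pair with edit distance 1 (illustrative example).
    patterns = [extract_pattern("ウイルス", "ウィルス")]
    print(patterns)
    print(generate_candidates("ウイスキー", patterns))
```

Each generated candidate pair would then go through feature acquisition and the supervised classifier, since a pattern learned from one pair does not guarantee that the generated term is a real variant.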

The different notation acquisition device according to any one of claims 1 to 7, wherein
the edit distance of the term pair is 2,
the feature acquisition unit comprises:
difference character acquisition means for acquiring each of the two pairs of difference characters of the term pair; and
feature acquisition means for acquiring two sets of features, each set including one or more of the character type-related feature, the dictionary-related feature, and the similarity feature, by treating the two pairs of difference characters acquired by the difference character acquisition means independently, and
the machine learning unit determines, for each of the two sets of features acquired by the feature acquisition means, by a supervised machine learning method using the features of that set and the two or more pieces of learning data in the learning data storage unit, whether the features of that set are features corresponding to a different notation term pair, and uses the two determination results to determine whether the term pair having an edit distance of 2 is a different notation term pair.
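
The handling of term pairs with an edit distance of 2 can be sketched as judging each differing character pair independently and then combining the two results. Several points are assumptions made for illustration: the sketch only covers two substitutions (equal-length terms), the combination rule is a logical AND (the claim only says the two determination results are used), and `judge_single` is a hypothetical stand-in for feature acquisition plus the supervised classifier.

```python
from typing import Callable, List, Tuple

DiffPair = Tuple[str, str]


def difference_characters(a: str, b: str) -> List[DiffPair]:
    """Return the differing character pairs of two equal-length terms."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles substitutions (equal-length terms)")
    return [(x, y) for x, y in zip(a, b) if x != y]


def judge_edit_distance_two(
    a: str, b: str, judge_single: Callable[[str, str, DiffPair], bool]
) -> bool:
    """Judge each of the two differing character pairs, then combine with AND."""
    diffs = difference_characters(a, b)
    if len(diffs) != 2:
        raise ValueError("expected exactly two differing character pairs")
    return all(judge_single(a, b, diff) for diff in diffs)


if __name__ == "__main__":
    # Toy judge: accepts only the small/large katakana i alternation.
    known = {frozenset(("イ", "ィ"))}

    def toy_judge(a: str, b: str, diff: DiffPair) -> bool:
        return frozenset(diff) in known

    print(judge_edit_distance_two("ウイスキイ", "ウィスキィ", toy_judge))
```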

A different notation acquisition method realized by a feature acquisition unit, a machine learning unit, and an output unit, wherein a storage medium stores:
term pairs each having an edit distance of 1 or more;
a term dictionary capable of storing one or more pieces of term information, each having a term and a representative notation of the term; and
two or more pieces of learning data, each associating a plurality of features with positive/negative information, the plurality of features including one or more of a character type-related feature, which is a feature relating to the character type of the edit location (the characters that differ between the terms of a term pair), a dictionary-related feature, which is a feature acquired using the term dictionary, and a similarity feature, which is a feature indicating the similarity of the two terms constituting the term pair, and the positive/negative information indicating whether the term pair is a different notation term pair,
the method comprising:
a feature acquisition step in which the feature acquisition unit acquires, for each term pair of the storage medium, a plurality of features including one or more of the character type-related feature, the dictionary-related feature, and the similarity feature;
a machine learning step in which the machine learning unit determines, for each term pair, by a supervised machine learning method using the two or more pieces of learning data of the storage medium and the plurality of features acquired in the feature acquisition step, whether each term pair of the storage medium is a different notation term pair; and
an output step in which the output unit outputs the determination result of the machine learning step,
wherein the dictionary-related feature is information indicating whether the representative notations of the two terms of a term pair are the same, and
in the feature acquisition step, for each term pair of the storage medium, the representative notations of the two terms of the term pair are acquired from the term dictionary, whether the two acquired representative notations are the same is determined, and the determination result is acquired as the dictionary-related feature.

A program, wherein a storage medium stores:
term pairs each having an edit distance of 1 or more;
a term dictionary capable of storing one or more pieces of term information, each having a term and a representative notation of the term; and
two or more pieces of learning data, each associating a plurality of features with positive/negative information, the plurality of features including one or more of a character type-related feature, which is a feature relating to the character type of the edit location (the characters that differ between the terms of a term pair), a dictionary-related feature, which is a feature acquired using the term dictionary, and a similarity feature, which is a feature indicating the similarity of the two terms constituting the term pair, and the positive/negative information indicating whether the term pair is a different notation term pair,
the program causing a computer to function as:
a feature acquisition unit that acquires, for each term pair of the storage medium, a plurality of features including one or more of the character type-related feature, the dictionary-related feature, and the similarity feature;
a machine learning unit that determines, for each term pair, by a supervised machine learning method using the two or more pieces of learning data of the storage medium and the plurality of features acquired by the feature acquisition unit, whether each term pair of the storage medium is a different notation term pair; and
an output unit that outputs the determination result of the machine learning unit,
wherein the dictionary-related feature is information indicating whether the representative notations of the two terms of a term pair are the same, and
the program causes the computer to function such that the feature acquisition unit acquires, for each term pair in the term pair storage unit, the representative notations of the two terms of the term pair from the term dictionary, determines whether the two acquired representative notations are the same, and acquires the determination result as the dictionary-related feature.