JP6427466B2 - Synonym pair acquisition apparatus, method and program

Info

Publication number: JP6427466B2
Application number: JP2015106871A
Authority: JP
Inventors: いつみ斉藤; 九月貞光; 久子浅野; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-05-26
Filing date: 2015-05-26
Publication date: 2018-11-21
Anticipated expiration: 2035-05-26
Also published as: JP2016224482A

Description

The present invention relates to a synonym pair acquisition apparatus, method, and program, and more particularly to a synonym pair acquisition apparatus, method, and program for acquiring synonym pairs.

Methods have conventionally been proposed for acquiring broken-notation words, that is, nonstandard spelling variants of a standard-notation word. Methods that use training data include those described in Non-Patent Document 1 and Non-Patent Document 2.

Methods that do not use training data include those described in Non-Patent Document 3 and Non-Patent Document 4.

[Non-Patent Document 1]: Naoaki Okazaki and Jun'ichi Tsujii, "Automatic Acquisition of Abbreviation Definitions Using an Alignment Discrimination Model," 14th Annual Meeting of the Association for Natural Language Processing (NLP2008), pp. 139-142.

[Non-Patent Document 2]: Yoshinari Fujinuma, Hikaru Yokono, and Akiko Aizawa, "Detection and Analysis of Broken Notations, Taking 'おはよう' on Twitter (R) as an Example," 27th Annual Conference of the Japanese Society for Artificial Intelligence, 2013.06.

[Non-Patent Document 3]: Takeshi Masuyama and Satoshi Sekine, "Automatic Construction of a List of Katakana Spelling Variants from a Large-Scale Corpus," 14th Annual Meeting of the Association for Natural Language Processing (NLP2004).

[Non-Patent Document 4]: Kazushi Ikeda, Tadashi Yanagihara, Kazunori Matsumoto, and Yasuhiro Takishima, "A Method for Automatically Generating Normalization Rules for Analyzing Colloquial Expressions with High Accuracy," Transactions of the Information Processing Society of Japan, vol. 3, no. 3, pp. 68-77, 2010.

[Non-Patent Document 5]: Kudo, T., Japanese Morphological Analyzer, Internet <URL: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html>

However, when broken-notation words are extracted by the methods using training data described in Non-Patent Document 1 and Non-Patent Document 2, correct pairs such as those shown in FIG. 7 must be created manually from Web data, so the cost of generating the correct pairs is high.

Methods that do not use training data, on the other hand, restrict the broken-notation words they can acquire to a limited set of candidates (katakana words, words that an existing analyzer treats as unknown words, and the like), and therefore cannot acquire a wide variety of broken notations. This is because existing analyzers often analyze broken-notation words incorrectly. Japanese has no delimiters such as spaces between words, so it is generally difficult to find the correct morpheme boundaries in ordinary text. In addition, the Web contains many broken-notation words written in hiragana, in a mix of kanji and hiragana, in a mix of katakana and hiragana, and so on, which makes analysis difficult. Examples include "すげー", "やば", "さみい", "サムい", and "寒っ". FIG. 8 shows an example of the result of analyzing a sentence containing broken-notation words with Mecab (IPAdic) described in Non-Patent Document 5.

The present invention has been made to solve the above problems, and an object thereof is to provide a synonym pair acquisition apparatus, method, and program capable of efficiently acquiring synonym pairs.

To achieve the above object, a synonym pair acquisition apparatus according to a first invention includes: a word segmentation candidate generation unit that generates, from a document, a plurality of word segmentation candidates, each of which is either a standard-notation word or a broken-notation word that is a candidate for a variant notation of a standard-notation word; a semantic vector calculation unit that calculates a semantic vector of a word for each of the plurality of word segmentation candidates, based on the plurality of word segmentation candidates generated by the word segmentation candidate generation unit; and a synonym pair acquisition unit that, for each word segmentation candidate that is a standard-notation word, selects a word segmentation candidate from the plurality of word segmentation candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate as a synonym pair.

In the synonym pair acquisition apparatus according to the first invention, the synonym pair acquisition unit may, for each word segmentation candidate that is a standard-notation word, select a word segmentation candidate from the plurality of word segmentation candidates based on the semantic similarity and the sound similarity, and acquire the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate as a synonym pair; and may further, for each of the selected word segmentation candidates, select a word segmentation candidate from the plurality of word segmentation candidates based on the semantic similarity and the sound similarity, and acquire the pair of the word segmentation candidate that is a standard-notation word and the newly selected word segmentation candidate as a synonym pair.

A synonym pair acquisition method according to a second invention includes: a step in which a word segmentation candidate generation unit generates, from a document, a plurality of word segmentation candidates, each of which is either a standard-notation word or a broken-notation word that is a candidate for a variant notation of a standard-notation word; a step in which a semantic vector calculation unit calculates a semantic vector of a word for each of the plurality of word segmentation candidates, based on the plurality of word segmentation candidates generated by the word segmentation candidate generation unit; and a step in which a synonym pair acquisition unit, for each word segmentation candidate that is a standard-notation word, selects a word segmentation candidate from the plurality of word segmentation candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate as a synonym pair.

In the synonym pair acquisition method according to the second invention, the step performed by the synonym pair acquisition unit may, for each word segmentation candidate that is a standard-notation word, select a word segmentation candidate from the plurality of word segmentation candidates based on the semantic similarity and the sound similarity, and acquire the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate as a synonym pair; and may further, for each of the selected word segmentation candidates, select a word segmentation candidate from the plurality of word segmentation candidates based on the semantic similarity and the sound similarity, and acquire the pair of the word segmentation candidate that is a standard-notation word and the newly selected word segmentation candidate as a synonym pair.

A program according to a third invention is a program for causing a computer to function as each unit of the synonym pair acquisition apparatus according to the first invention.

According to the synonym pair acquisition apparatus, method, and program of the present invention, a plurality of word segmentation candidates, each being a standard-notation word or a broken-notation word, are generated from a document; a semantic vector of a word is calculated for each of the plurality of word segmentation candidates; and, for each word segmentation candidate that is a standard-notation word, a word segmentation candidate is selected from the plurality of word segmentation candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate is acquired as a synonym pair. This provides the effect that synonym pairs can be acquired efficiently.

FIG. 1 is a block diagram showing the configuration of a synonym pair acquisition apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of the sound similarity.

FIG. 3 is a conceptual diagram showing an example of the acquisition of synonym pairs.

FIG. 4 is a diagram showing an example of selecting word segmentation candidates starting from a standard-notation word.

FIG. 5 is a diagram showing an example of selecting further word segmentation candidates starting from a selected word segmentation candidate.

FIG. 6 is a flowchart showing the synonym pair acquisition processing routine in the synonym pair acquisition apparatus according to the embodiment of the present invention.

FIG. 7 is a diagram showing an example of combinations of standard-notation words and broken-notation words.

FIG. 8 is a diagram showing an example of the result of analyzing a sentence containing broken-notation words with Mecab.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Configuration of the Synonym Pair Acquisition Apparatus According to the Embodiment of the Present Invention>

The configuration of the synonym pair acquisition apparatus according to the embodiment of the present invention will now be described. As shown in FIG. 1, the synonym pair acquisition apparatus 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the synonym pair acquisition processing routine described later. Functionally, as shown in FIG. 1, the synonym pair acquisition apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

The input unit 10 receives a document set consisting of documents that contain broken-notation words.

The calculation unit 20 includes a dictionary database 28, a word segmentation candidate generation unit 30, a semantic vector calculation unit 32, and a synonym pair acquisition unit 34.

The dictionary database 28 stores the dictionary (readings, notations, and parts of speech) needed for dictionary lookup.

The word segmentation candidate generation unit 30 generates, from each of the documents in the document set received by the input unit 10, a plurality of word segmentation candidates, each of which is either a standard-notation word or a broken-notation word that is a candidate for a variant notation of a standard-notation word.

Specifically, the word segmentation candidate generation unit 30 generates word segmentation candidates by applying to the documents each of the first to third methods below, all of which are existing word segmentation methods. The methods are chosen so that broken-notation words that do not exist in the dictionary database 28 can also be output as segmentation candidates.

As the first method, the word segmentation candidate generation unit 30 applies a word segmentation method based on pointwise estimation to each of the documents included in the document set to generate word segmentation candidates. In the word segmentation method based on pointwise estimation, a document is divided into a plurality of word segmentation candidates using a model of the boundaries between characters whose features include character n-grams, character-type n-grams, and the like.
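As an illustration of the kind of features such a boundary model might use, the following is a minimal sketch. The function names, the feature labels (`L:`, `R:`, `TL:`, `TR:`), the window size `n`, and the coarse character classes are all assumptions made for this example and are not specified in the source; the classifier that would consume these features is not shown.

```python
def char_type(ch: str) -> str:
    """Coarse character class (assumed): H=hiragana, K=katakana, C=kanji, O=other."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "H"
    if 0x30A0 <= code <= 0x30FF:
        return "K"
    if 0x4E00 <= code <= 0x9FFF:
        return "C"
    return "O"


def boundary_features(text: str, i: int, n: int = 2) -> list:
    """Features for the candidate boundary between text[i-1] and text[i].

    Uses the character n-gram and the character-type n-gram on each side
    of the boundary; a binary classifier (not shown) would decide whether
    a word boundary exists at position i.
    """
    left, right = text[max(0, i - n):i], text[i:i + n]
    left_types = "".join(char_type(c) for c in left)
    right_types = "".join(char_type(c) for c in right)
    return [f"L:{left}", f"R:{right}", f"TL:{left_types}", f"TR:{right_types}"]


if __name__ == "__main__":
    print(boundary_features("寒っ今日", 2))
```

Because each boundary is judged independently from local context, a model of this kind does not need the word itself to be in a dictionary, which is why it can propose broken-notation words as candidates.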

As the second method, the word segmentation candidate generation unit 30 applies a word segmentation method based on unsupervised analysis to the document set to generate word segmentation candidates. In the word segmentation method based on unsupervised analysis, occurrence frequencies and the like are calculated for sampled word segmentation candidates, and each of the documents is divided into word segmentation candidates so that an objective function is optimized.

As the third method, the word segmentation candidate generation unit 30 obtains, for each of the documents included in the document set, the analysis result of Mecab or the like, and generates word segmentation candidates by partially merging that result according to predetermined rules. Examples of such rules are merging a run of consecutive unknown words and merging a run of consecutive nouns. The following method may also be used as a rule. When short sentences are cut out of Twitter (R) or the like and used as word segmentation candidates, a plurality of delimiters (for example, line breaks, symbolic expressions such as "！", "ｗ", and "♪", and punctuation marks such as "、" and "。") are defined, and the text is split at the delimiters. With this setting, for the sentence "やっべぇぇｗｗｗｗｗｗｗｗｗｗｗ", the string "やっべぇぇ" preceding "ｗ" can be obtained as a word segmentation candidate. Likewise, for the sentence "おっはよお♪ ってお昼だけど・・・今起きた・・・", the string "おっはよお" preceding "♪" can be obtained as a word segmentation candidate. Character strings obtained in this way that are n characters or shorter may be added to the morpheme dictionary before the analysis is performed.
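The delimiter-based cutting described above can be sketched as follows. This is only an illustration: the exact delimiter set, the whitespace trimming, and the default length limit `max_len` (standing in for the n of the text) are assumptions chosen for the example.

```python
import re

# Assumed delimiter set: line break, symbolic expressions, punctuation.
_DELIMITERS = re.compile(r"[\n！ｗ♪、。]+")


def short_candidates(text: str, max_len: int = 8) -> list:
    """Split text at delimiters and keep pieces of at most max_len characters."""
    pieces = (piece.strip() for piece in _DELIMITERS.split(text))
    return [piece for piece in pieces if piece and len(piece) <= max_len]


if __name__ == "__main__":
    print(short_candidates("やっべぇぇｗｗｗｗｗｗｗｗｗｗｗ"))
    print(short_candidates("おっはよお♪ ってお昼だけど・・・今起きた・・・"))
```

In the second sentence the remainder after "♪" is longer than the assumed limit, so only "おっはよお" survives as a candidate.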

The semantic vector calculation unit 32 calculates a semantic vector of a word for each of the plurality of word segmentation candidates, based on the plurality of word segmentation candidates generated by the word segmentation candidate generation unit 30.

Specifically, the semantic vector calculation unit 32 takes the document set to which word boundaries have been added so as to enumerate the plurality of word segmentation candidates generated by the word segmentation candidate generation unit 30, and calculates the semantic vector of each word that appears as a word segmentation candidate. An existing method can be used to obtain the semantic vector of each word. A representative example is word2vec, described in Non-Patent Document 6.

[Non-Patent Document 6]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

For each word segmentation candidate that is a standard-notation word, the synonym pair acquisition unit 34 selects, from the plurality of word segmentation candidates, a word segmentation candidate whose semantic similarity calculated from the semantic vectors is at or above a threshold and whose sound similarity calculated from the readings of the words is at or above a threshold, and acquires the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate as a synonym pair. The synonym pair acquisition unit 34 further selects, for each of the selected word segmentation candidates, a word segmentation candidate whose semantic similarity is at or above a threshold and whose sound similarity is at or above a threshold from the plurality of word segmentation candidates, and acquires the pair of the word segmentation candidate that is a standard-notation word and the newly selected word segmentation candidate as a synonym pair. FIG. 3 shows a conceptual diagram of the processing of the synonym pair acquisition unit 34.

Specifically, the synonym pair acquisition unit 34 first calculates, for each word segmentation candidate that is a standard-notation word, the semantic similarity with each of the other word segmentation candidates. The semantic similarity is calculated as the cosine similarity of the per-word semantic vectors obtained by the semantic vector calculation unit 32.
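For reference, the cosine similarity of two semantic vectors can be computed as in the sketch below. The function name and the choice to return 0.0 for a zero vector are assumptions of this example; the source specifies only that cosine similarity is used.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Cosine similarity ignores vector length, so two words are judged similar when their vectors point in the same direction regardless of how frequently each word occurs.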

Next, the synonym pair acquisition unit 34 calculates, for each word segmentation candidate that is a standard-notation word, the sound similarity with each of the other word segmentation candidates. In the present embodiment, a sound similarity distance is calculated as the sound similarity, based on the readings of the word segmentation candidates. For kanji notation the reading is estimated, and katakana notation is converted to hiragana. The conversion costs are set as follows. Conversion between identical characters costs 0. Deletion of a vowel (including the small kana ぁ, ぃ, ぅ, ぇ, ぉ), the geminate consonant mark (っ), the moraic nasal (ん), or a long-vowel mark also costs 0, except that a deletion at the beginning of a word costs 1 and increments the sound similarity distance. Substitution between characters in the same row or the same column of the Japanese hiragana gojūon (50-sound) table costs 0, where a voiced or semi-voiced character is treated as occupying the same position as its unvoiced counterpart; substitution between a vowel and the geminate consonant mark, between a vowel and a long-vowel mark, and between two vowels also costs 0. For example, "ぶ" and "ぷ" are treated as "ふ", so substitution with any character in the same row or column (はひふへほうくすつぬむゆる) costs 0. Every other conversion costs 1 and increments the sound similarity distance. FIG. 2 shows an example of calculating the sound similarity distance. In the present embodiment, keeping candidates whose sound similarity is at or above a threshold is implemented by keeping candidates whose sound similarity distance is at or below a threshold.
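One way to realize these conversion costs is a weighted edit distance computed by dynamic programming. The sketch below is an interpretation under stated assumptions, not the implementation of the embodiment: it assumes the inputs are already kana readings (kanji reading estimation is not shown), that a deletion may occur in either string, and that the total distance is the minimum summed cost over all alignments. All names are hypothetical.

```python
import unicodedata

# Gojūon table; "_" marks an empty cell.
GOJUON_ROWS = ["あいうえお", "かきくけこ", "さしすせそ", "たちつてと", "なにぬねの",
               "はひふへほ", "まみむめも", "や_ゆ_よ", "らりるれろ", "わ___を"]
POSITION = {ch: (r, c) for r, row in enumerate(GOJUON_ROWS)
            for c, ch in enumerate(row) if ch != "_"}
SMALL_TO_LARGE = str.maketrans("ぁぃぅぇぉゃゅょゎ", "あいうえおやゆよわ")
VOWELS = set("あいうえおぁぃぅぇぉ")
FREE_TO_DELETE = VOWELS | {"っ", "ん", "ー"}


def to_hiragana(text: str) -> str:
    """Convert katakana letters to hiragana; other characters are unchanged."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def position(ch: str):
    """(row, column) in the table, ignoring (semi-)voicing marks and small size."""
    base = unicodedata.normalize("NFD", ch)[0].translate(SMALL_TO_LARGE)
    return POSITION.get(base)


def substitution_cost(a: str, b: str) -> int:
    if a == b:
        return 0
    special = VOWELS | {"っ", "ー"}
    if a in special and b in special and (a in VOWELS or b in VOWELS):
        return 0  # vowel-vowel, vowel-geminate, vowel-long
    pa, pb = position(a), position(b)
    if pa is not None and pb is not None and (pa[0] == pb[0] or pa[1] == pb[1]):
        return 0  # same row or same column
    return 1


def deletion_cost(ch: str, index: int) -> int:
    """Free for vowels, っ, ん, ー unless the character is word-initial."""
    return 0 if index > 0 and ch in FREE_TO_DELETE else 1


def sound_distance(x: str, y: str) -> int:
    a, b = to_hiragana(x), to_hiragana(y)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + deletion_cost(a[i - 1], i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + deletion_cost(b[j - 1], j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + deletion_cost(a[i - 1], i - 1),
                          d[i][j - 1] + deletion_cost(b[j - 1], j - 1),
                          d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]))
    return d[-1][-1]


if __name__ == "__main__":
    print(sound_distance("さむい", "さみぃ"), sound_distance("さむい", "ねこ"))
```

Under this reading of the rules, "さむい" and "さみぃ" are at distance 0, while unrelated readings such as "さむい" and "ねこ" are at distance 2. Because whole rows and columns are treated as equivalent, the sound distance alone is permissive, which is consistent with the text combining it with the semantic similarity filter.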

Next, the synonym pair acquisition unit 34 acquires synonym pairs by performing the first acquisition process and the second acquisition process described below for each of the standard-notation word segmentation candidates obtained from the document set. In the first acquisition process, the synonym pair acquisition unit 34 performs the following processing for each of the standard-notation word segmentation candidates obtained from the document set.

First, for the standard-notation word segmentation candidate in question, the other word segmentation candidates that appear in the document set are filtered so that only those whose semantic similarity with it is at or above a predetermined threshold are kept. Next, the remaining word segmentation candidates are filtered so that only those whose sound similarity with the standard-notation word is at or above a predetermined threshold (that is, whose sound similarity distance is at or below a threshold) are kept. Then, from the remaining word segmentation candidates, those whose notation exists in the dictionary database 28 as a standard-notation word with the same part of speech as the standard-notation word in question are deleted. The synonym pair acquisition unit 34 selects the word segmentation candidates that remain after this deletion. In this way, the pair of the standard-notation word segmentation candidate and each selected word segmentation candidate is acquired as a synonym pair. FIG. 4 shows an example of the first acquisition process. In FIG. 4, word segmentation candidates are selected starting from the standard-notation word segmentation candidate "さむい".
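The three filters of the first acquisition process can be sketched as below. The function signature, the representation of the dictionary as a mapping from notation to part of speech, and the toy similarity values in the usage example are all hypothetical; the source does not give concrete thresholds or data.

```python
def first_acquisition(seed, candidates, sem_sim, sound_dist, pos_of,
                      sem_min, dist_max):
    """Return (seed, candidate) pairs that pass all three filters.

    sem_sim(a, b)    -> semantic similarity (higher is more similar)
    sound_dist(a, b) -> sound similarity distance (lower is more similar)
    pos_of           -> dict: notation -> part of speech, for dictionary words
    """
    pairs = []
    for cand in candidates:
        if cand == seed:
            continue
        if sem_sim(seed, cand) < sem_min:       # filter 1: semantic similarity
            continue
        if sound_dist(seed, cand) > dist_max:   # filter 2: sound similarity distance
            continue
        # filter 3: drop dictionary words with the same part of speech as the seed
        if cand in pos_of and pos_of[cand] == pos_of.get(seed):
            continue
        pairs.append((seed, cand))
    return pairs


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    sem = {"さみぃ": 0.9, "あつい": 0.8, "ねこ": 0.1}
    dist = {"さみぃ": 0, "あつい": 0, "ねこ": 2}
    pos = {"さむい": "adjective", "あつい": "adjective", "ねこ": "noun"}
    print(first_acquisition("さむい", ["さみぃ", "あつい", "ねこ"],
                            lambda s, c: sem[c], lambda s, c: dist[c],
                            pos, sem_min=0.5, dist_max=0))
```

The dictionary filter is what removes ordinary words that happen to be both semantically and phonetically close to the seed, so that only nonstandard variants remain.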

Next, for the standard-notation word segmentation candidate in question, the synonym pair acquisition unit 34 performs the second acquisition process, which starts from the word segmentation candidates that were selected as synonym pairs for that standard-notation word in the first acquisition process, as follows. First, for each of the word segmentation candidates selected as synonym pairs for the standard-notation word in the first acquisition process, the semantic similarity and the sound similarity distance with the other word segmentation candidates are calculated. Next, the following processing is performed for each of those selected word segmentation candidates.

For the selected word segmentation candidate in question, the other word segmentation candidates that appear in the document set are filtered so that only those whose semantic similarity with it is at or above a predetermined threshold are kept. Next, the remaining word segmentation candidates are filtered so that only those whose sound similarity distance from the selected word segmentation candidate is at or below a predetermined threshold are kept. Then, from the remaining word segmentation candidates, those whose notation exists in the dictionary database 28 as a standard-notation word with the same part of speech as the standard-notation word in question are deleted. The synonym pair acquisition unit 34 selects the word segmentation candidates that remain after this deletion. In this way, the pair of the standard-notation word segmentation candidate and each selected word segmentation candidate is acquired as a synonym pair. FIG. 5 shows an example of the second acquisition process. In FIG. 5, word segmentation candidates are selected starting from the word segmentation candidate "さみぃ", which was selected for the standard-notation word segmentation candidate "さむい" in the first acquisition process. Furthermore, the synonym pair acquisition unit 34 repeats the same processing as the second acquisition process a predetermined number of times, each time starting from the word segmentation candidates selected in the preceding round, and acquires the pair of the standard-notation word segmentation candidate and each selected word segmentation candidate as a synonym pair.
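The repeated acquisition can be viewed as a bounded, breadth-first expansion from the standard-notation word. The sketch below assumes a hypothetical callable `select(origin)` that applies the semantic, sound, and dictionary filters and returns the candidates selected from `origin`; skipping candidates that were already found is also an assumption of this example, since the source does not say how repeats are handled.

```python
def expand_synonyms(seed, select, rounds):
    """Collect (seed, variant) pairs over a fixed number of rounds.

    Round 1 corresponds to the first acquisition process (starting from the
    seed); later rounds start from the candidates selected in the round before.
    """
    pairs = []
    seen = {seed}
    frontier = [seed]
    for _ in range(rounds):
        next_frontier = []
        for origin in frontier:
            for cand in select(origin):
                if cand in seen:        # assumed: do not revisit known words
                    continue
                seen.add(cand)
                pairs.append((seed, cand))   # always paired with the original seed
                next_frontier.append(cand)
        if not next_frontier:
            break
        frontier = next_frontier
    return pairs


if __name__ == "__main__":
    # Toy neighbor table, invented for illustration only.
    neighbors = {"さむい": ["さみぃ"], "さみぃ": ["さみー", "さむい"], "さみー": ["さみ"]}
    print(expand_synonyms("さむい", lambda w: neighbors.get(w, []), rounds=2))
```

Chaining through intermediate variants lets the process reach notations that are too far from the standard-notation word to pass the filters directly, while the fixed number of rounds limits drift away from the original word.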

<Operation of the Synonym Pair Acquisition Apparatus According to the Embodiment of the Present Invention>

Next, the operation of the synonym pair acquisition apparatus 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a document set consisting of documents that contain broken-notation words, the synonym pair acquisition apparatus 100 executes the synonym pair acquisition processing routine shown in FIG. 6.

First, in step S100, a plurality of word segmentation candidates are generated from each of the documents in the document set received by the input unit 10.

Next, in step S102, a semantic vector of a word is calculated for each of the word segmentation candidates, based on the plurality of word segmentation candidates generated in step S100.

In step S104, for each of the word segmentation candidates generated in step S100 that is a standard-notation word, the semantic similarity with each of the other word segmentation candidates is calculated based on the semantic vectors calculated in step S102.

In step S106, for each of the word segmentation candidates generated in step S100 that is a standard-notation word, the sound similarity distance from each of the other word segmentation candidates is calculated based on the readings of the word segmentation candidates.

In step S108, for each word segmentation candidate that is a standard-notation word, a word segmentation candidate whose semantic similarity calculated in step S104 is at or above a threshold and whose sound similarity distance calculated in step S106 is at or below a threshold is selected from the plurality of word segmentation candidates, and the pair of the word segmentation candidate that is a standard-notation word and the selected word segmentation candidate is acquired as a synonym pair.

In step S110, for each word segmentation candidate that is a standard-notation word, and for each of the word segmentation candidates selected for it in step S108 or in the previous execution of step S110, a word segmentation candidate whose semantic similarity, calculated in the same way as in step S104, is at or above a threshold and whose sound similarity distance, calculated in the same way as in step S106, is at or below a threshold is selected from the plurality of word segmentation candidates, and the pair of that standard-notation word segmentation candidate and the selected word segmentation candidate is acquired as a synonym pair.

In step S112, it is determined whether the processing of step S110 has been repeated a predetermined number of times. If it has, the process proceeds to step S114; if it has not, the process returns to step S110 and repeats it.
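
The loop formed by steps S110 and S112 can be sketched as follows. The predicate `is_match` stands in for the combined semantic-similarity and sound-distance threshold test, and skipping already acquired candidates is an assumption of this sketch rather than something the text states; all names and the toy chain of words are hypothetical.

```python
def expand_pairs(regular_word, first_selected, candidates, is_match, rounds):
    """Grow the synonym pairs of one regular-notation word for a fixed number of rounds."""
    acquired = set(first_selected)   # candidates already paired (assumed dedup)
    frontier = list(first_selected)  # candidates selected in the previous round
    pairs = [(regular_word, word) for word in first_selected]
    for _ in range(rounds):          # S112: repeat a predetermined number of times
        next_frontier = []
        for seed in frontier:        # S110: start from each previously selected candidate
            for cand in candidates:
                if cand == regular_word or cand in acquired:
                    continue
                if is_match(seed, cand):
                    acquired.add(cand)
                    next_frontier.append(cand)
                    # The new candidate is paired with the regular-notation word,
                    # not with the seed that reached it.
                    pairs.append((regular_word, cand))
        frontier = next_frontier
    return pairs


# Toy chain: each word only matches its immediate neighbour.
toy_chain = ["a", "b", "c", "d"]


def toy_is_match(x, y):
    return abs(toy_chain.index(x) - toy_chain.index(y)) == 1


one_round = expand_pairs("a", ["b"], toy_chain, toy_is_match, rounds=1)
two_rounds = expand_pairs("a", ["b"], toy_chain, toy_is_match, rounds=2)
```

In the toy chain, "c" and "d" do not match "a" directly, yet they are reached through "b" and then "c", which mirrors how candidates filtered out relative to the regular-notation word can still be acquired through an intermediate candidate.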

In step S114, the synonym pairs acquired in steps S108 and S110 are output to the output unit 50, and the process ends.

As described above, the synonym pair acquisition apparatus according to the embodiment of the present invention generates, from a document, a plurality of word division candidates, each of which is a regular-notation word or a broken-notation word; calculates a semantic vector of a word for each of the word division candidates, based on the plurality of word division candidates; for each word division candidate that is a regular-notation word, selects from the plurality of word division candidates those whose semantic similarity calculated from the semantic vectors is at or above a threshold and whose sound similarity distance calculated from the word readings is at or below a threshold; and acquires each pair of the regular-notation word division candidate and a selected word division candidate as a synonym pair. Synonym pairs can thereby be acquired efficiently.

In addition, taking both the semantic similarity and the sound similarity into account makes it possible to acquire pairs of synonym candidates accurately.

In addition, even word division candidates that would be filtered out if acquisition started only from regular-notation words can be reached by acquiring new synonym pairs starting from the selected word division candidates, which makes it possible to acquire a wider variety of broken-notation words.

In addition, compared with conventional methods, word division candidates can be generated as the correct segmentation of a wide variety of broken-notation words.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

Reference Signs List
10 input unit
20 operation unit
28 dictionary database
30 word division candidate generation unit
32 semantic vector calculation unit
34 synonym pair acquisition unit
50 output unit
100 synonym pair acquisition apparatus

Claims

A synonym pair acquisition apparatus comprising:
a word division candidate generation unit that generates, from a document, a plurality of word division candidates including word division candidates that are regular-notation words and word division candidates that are broken-notation words, each broken-notation word being a candidate for a variant notation of a regular-notation word;
a semantic vector calculation unit that calculates a semantic vector of a word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a synonym pair acquisition unit that, for each of the word division candidates that is a regular-notation word, filters the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, selects, from the filtered plurality of word division candidates, the word division candidates other than any word division candidate that has the same notation as a predetermined regular-notation word and the same part of speech as that regular-notation word, and acquires a pair of the word division candidate that is the regular-notation word and each selected word division candidate as a synonym pair.

The synonym pair acquisition apparatus according to claim 1, wherein the synonym pair acquisition unit further, for each of the selected word division candidates, filters the plurality of word division candidates based on the semantic similarity and the sound similarity, selects, from the filtered plurality of word division candidates, the word division candidates other than any word division candidate that has the same notation as a predetermined regular-notation word and the same part of speech as that regular-notation word, and acquires a pair of the word division candidate that is the regular-notation word and each selected word division candidate as a synonym pair.

A synonym pair acquisition method comprising:
a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates including word division candidates that are regular-notation words and word division candidates that are broken-notation words, each broken-notation word being a candidate for a variant notation of a regular-notation word;
a step of calculating a semantic vector of a word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a step in which a synonym pair acquisition unit, for each of the word division candidates that is a regular-notation word, filters the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, selects, from the filtered plurality of word division candidates, the word division candidates other than any word division candidate that has the same notation as a predetermined regular-notation word and the same part of speech as that regular-notation word, and acquires a pair of the word division candidate that is the regular-notation word and each selected word division candidate as a synonym pair.

The synonym pair acquisition method according to claim 3, wherein the step performed by the synonym pair acquisition unit further includes, for each of the selected word division candidates, filtering the plurality of word division candidates based on the semantic similarity and the sound similarity, selecting, from the filtered plurality of word division candidates, the word division candidates other than any word division candidate that has the same notation as a predetermined regular-notation word and the same part of speech as that regular-notation word, and acquiring a pair of the word division candidate that is the regular-notation word and each selected word division candidate as a synonym pair.

A program for causing a computer to function as each unit of the synonym pair acquisition apparatus according to claim 1 or claim 2.