JP2016224482A

JP2016224482A - Synonym pair acquisition device, method and program

Info

Publication number: JP2016224482A
Application number: JP2015106871A
Authority: JP
Inventors: Itsumi Saito; Kugatsu Sadamitsu; Hisako Asano; Yoshihiro Matsuo
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-05-26
Filing date: 2015-05-26
Publication date: 2016-12-28
Anticipated expiration: 2035-05-26
Also published as: JP6427466B2

Abstract

PROBLEM TO BE SOLVED: To acquire a synonym pair efficiently.

SOLUTION: A word division candidate creation part 30 creates, from a document, plural word division candidates, each of which is a normal notation word or an informal notation word. A meaning vector calculation part 32 calculates a word meaning vector for each of the plural word division candidates, based on the plural word division candidates. For each word division candidate that is a normal notation word, a synonym pair acquisition part 34 selects, from the word division candidates, those whose meaning similarity calculated from the meaning vectors is equal to or higher than a threshold and whose sound similarity distance calculated from the readings of the words is equal to or lower than a threshold, and then acquires the pair of the normal notation word and each selected word division candidate as a synonym pair.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a synonym pair acquisition device, method, and program, and more particularly to a synonym pair acquisition device, method, and program for acquiring synonym pairs.

Methods have been proposed for acquiring broken notation words, that is, informal spellings that deviate from a regular notation word. Methods that use supervised training data include those described in Non-Patent Document 1 and Non-Patent Document 2.

Methods that do not use training data include those described in Non-Patent Document 3 and Non-Patent Document 4.

[Non-Patent Document 1]: Naoaki Okazaki, Jun'ichi Tsujii, "Automatic Acquisition of Abbreviation Definitions Using an Alignment Discrimination Model", 14th Annual Meeting of the Association for Natural Language Processing (NLP2008), pp. 139-142.

[Non-Patent Document 2]: Yoshinari Fujinuma, Hikaru Yokono, Akiko Aizawa, "Detection and Analysis of Broken Notation on Twitter (R), Using 「おはよう」 (ohayou, 'good morning') as an Example", 27th Annual Conference of the Japanese Society for Artificial Intelligence, 2013.06.

[Non-Patent Document 3]: Takeshi Masuyama, Satoshi Sekine, "Automatic Construction of a List of Katakana Spelling Variants from a Large Corpus", 14th Annual Meeting of the Association for Natural Language Processing (NLP2004).

[Non-Patent Document 4]: Kazushi Ikeda, Tadashi Yanagihara, Kazunori Matsumoto, Yasuhiro Takishima, "Automatic Generation of Normalization Rules for High-Accuracy Analysis of Casual Expressions", IPSJ Journal, vol. 3, no. 3, pp. 68-77, 2010.

[Non-Patent Document 5]: Kudo, T., Japanese Morphological Analyzer, Internet <URL: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html>

However, extracting broken notation words with the training-data-based methods of Non-Patent Document 1 and Non-Patent Document 2 requires correct pairs, such as those shown in FIG. 7, to be created manually from Web data, so the cost of generating the correct pairs is high.

Methods that do not use training data, on the other hand, restrict the broken-word candidates to a narrow set (katakana words, words that an existing analyzer treats as unknown words, and so on), and therefore cannot acquire diverse broken notations. The reason is that existing analyzers often misanalyze broken notation words. Japanese has no delimiter such as a space between words, so it is difficult to find the correct morpheme boundaries in ordinary text. Moreover, the Web contains many broken notation words written in hiragana, in a mix of kanji and hiragana, in a mix of katakana and hiragana, and so on, which makes analysis difficult. Examples are 「すげー」 (sugee), 「やば」 (yaba), 「さみい」 (samii), 「サムい」 (samui), and 「寒っ」 (samu). FIG. 8 shows an example of the result of analyzing a sentence containing broken notation words with Mecab (IPAdic), which is described in Non-Patent Document 5.

The present invention has been made to solve the above problems, and an object thereof is to provide a synonym pair acquisition device, method, and program capable of acquiring synonym pairs efficiently.

To achieve the above object, the synonym pair acquisition device according to the first invention includes: a word division candidate generation unit that generates, from a document, a plurality of word division candidates, each being a regular notation word or a broken notation word that is a candidate for a notation deviating from a regular notation word; a semantic vector calculation unit that calculates a semantic vector of the word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and a synonym pair acquisition unit that, for each word division candidate that is a regular notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires each pair of the regular notation word and a selected word division candidate as a synonym pair.

In the synonym pair acquisition device according to the first invention, the synonym pair acquisition unit may, for each word division candidate that is a regular notation word, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair of the regular notation word and a selected word division candidate as a synonym pair, and may further, for each of the selected word division candidates, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair of the regular notation word and a newly selected word division candidate as a synonym pair.

The synonym pair acquisition method according to the second invention includes: a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates, each being a regular notation word or a broken notation word that is a candidate for a notation deviating from a regular notation word; a step in which a semantic vector calculation unit calculates a semantic vector of the word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and a step in which a synonym pair acquisition unit, for each word division candidate that is a regular notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires each pair of the regular notation word and a selected word division candidate as a synonym pair.

In the synonym pair acquisition method according to the second invention, the step performed by the synonym pair acquisition unit may, for each word division candidate that is a regular notation word, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair of the regular notation word and a selected word division candidate as a synonym pair, and may further, for each of the selected word division candidates, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair of the regular notation word and a newly selected word division candidate as a synonym pair.

The program according to the third invention is a program for causing a computer to function as each unit of the synonym pair acquisition device according to the first invention.

According to the synonym pair acquisition device, method, and program of the present invention, a plurality of word division candidates, each being a regular notation word or a broken notation word, are generated from a document; a semantic vector of the word is calculated for each of the word division candidates based on the plurality of word division candidates; for each word division candidate that is a regular notation word, word division candidates are selected from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words; and each pair of the regular notation word and a selected word division candidate is acquired as a synonym pair. This has the effect that synonym pairs can be acquired efficiently.

FIG. 1 is a block diagram showing the configuration of the synonym pair acquisition device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the sound similarity.
FIG. 3 is a conceptual diagram showing an example of synonym pair acquisition.
FIG. 4 is a diagram showing an example of selecting word division candidates starting from a regular notation word.
FIG. 5 is a diagram showing an example of selecting further word division candidates starting from a selected word division candidate.
FIG. 6 is a flowchart showing the synonym pair acquisition processing routine in the synonym pair acquisition device according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of combinations of regular notation words and broken notation words.
FIG. 8 is a diagram showing an example of the result of analyzing a sentence containing broken notation words with Mecab.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Configuration of the Synonym Pair Acquisition Device According to the Embodiment of the Present Invention>

The configuration of the synonym pair acquisition device according to the embodiment of the present invention is described first. As shown in FIG. 1, the synonym pair acquisition device 100 according to the embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the synonym pair acquisition processing routine described later. Functionally, the synonym pair acquisition device 100 includes an input unit 10, a calculation unit 20, and an output unit 50, as shown in FIG. 1.

The input unit 10 receives a document set consisting of documents that contain broken notation words.

The calculation unit 20 includes a dictionary database 28, a word division candidate generation unit 30, a semantic vector calculation unit 32, and a synonym pair acquisition unit 34.

The dictionary database 28 stores the dictionary (reading, notation, and part of speech of each entry) needed for dictionary lookup.

The word division candidate generation unit 30 generates, from each document in the document set received by the input unit 10, a plurality of word division candidates, each being a regular notation word or a broken notation word that is a candidate for a notation deviating from a regular notation word.

Specifically, the word division candidate generation unit 30 generates word division candidates by applying each of the following first to third methods, all of which are existing word division methods, to the documents. The methods used here are ones that can also output, as boundary candidates, broken notation words that do not exist in the dictionary database 28.

As the first method, the word division candidate generation unit 30 applies a word division method based on pointwise estimation to each document in the document set to generate word division candidates. This method divides a document into a plurality of word division candidates using a model of boundaries between characters whose features include character n-grams and character-type n-grams.
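The feature side of such a boundary model can be illustrated with a minimal sketch. The function names `char_type` and `boundary_features`, the window size, and the feature string format are assumptions made for illustration; the patent does not specify them, and the classifier that would consume these features is not shown.

```python
def char_type(ch):
    """Coarse character class: H = hiragana, K = katakana, C = kanji, O = other."""
    if "ぁ" <= ch <= "ゖ":
        return "H"
    if "ァ" <= ch <= "ヺ" or ch == "ー":
        return "K"
    if "\u4e00" <= ch <= "\u9fff":
        return "C"
    return "O"


def boundary_features(text, i, window=2, n_max=2):
    """Character and character-type n-gram features around the candidate
    boundary between text[i-1] and text[i] (hypothetical feature format)."""
    start, end = max(0, i - window), min(len(text), i + window)
    types = "".join(char_type(c) for c in text)
    feats = []
    for n in range(1, n_max + 1):
        for s in range(start, end - n + 1):
            offset = s - i  # position of the n-gram relative to the boundary
            feats.append(f"c{n}[{offset}]={text[s:s + n]}")
            feats.append(f"t{n}[{offset}]={types[s:s + n]}")
    return feats
```

A binary classifier trained on such features would decide, independently for each position, whether a word boundary is present.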

As the second method, the word division candidate generation unit 30 applies a word division method based on unsupervised analysis to the document set to generate word division candidates. This method calculates statistics such as the appearance frequency of sampled word division candidates and divides each document into word division candidates so that an objective function is optimized.

As the third method, the word division candidate generation unit 30 obtains, for each document in the document set, an analysis result from Mecab or the like, and generates word division candidates by joining parts of the result according to predetermined rules. Examples of such rules are joining consecutive unknown words and joining consecutive nouns. The following method may also be used as a rule. For example, when short sentences are cut out of Twitter (R) posts or the like and used as word division candidates, a set of delimiters (for example, line breaks, symbolic expressions such as 「！」, 「ｗ」, and 「♪」, and punctuation marks such as 「、」 and 「。」) is defined, and the text is split at the delimiters. With this setting, from the sentence 「やっべぇぇｗｗｗｗｗｗｗｗｗｗｗ」, the part before 「ｗ」, namely 「やっべぇぇ」, is obtained as a word division candidate. Likewise, from the sentence 「おっはよお♪ ってお昼だけど・・・今起きた・・・」, the part before 「♪」, namely 「おっはよお」, is obtained as a word division candidate. Character strings obtained in this way whose length is n characters or fewer may be added to the morpheme dictionary before the analysis is performed.
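Both rules of the third method can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the names `short_segments` and `merge_runs`, the default `max_len=10` (standing in for the unspecified n), and the part-of-speech labels "unknown" and "noun" are hypothetical, and the morphological analysis itself is assumed to have been done elsewhere.

```python
import re
from itertools import groupby

# Delimiters listed in the text: line break, symbolic expressions, punctuation.
DELIMITERS = "\n！ｗ♪、。"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def short_segments(text, max_len=10):
    """Split text at runs of delimiters and keep segments of at most max_len characters."""
    parts = (p.strip() for p in _SPLIT_RE.split(text))
    return [p for p in parts if p and len(p) <= max_len]


def merge_runs(tokens, mergeable=("unknown", "noun")):
    """Join consecutive (surface, pos) tokens that share a mergeable part of speech."""
    merged = []
    for pos, run in groupby(tokens, key=lambda t: t[1]):
        run = list(run)
        if pos in mergeable:
            merged.append(("".join(surface for surface, _ in run), pos))
        else:
            merged.extend(run)
    return merged
```

Applied to the two example sentences from the text, `short_segments` keeps only 「やっべぇぇ」 and 「おっはよお」 respectively, because the remaining segment of the second sentence exceeds the assumed length limit.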

The semantic vector calculation unit 32 calculates a semantic vector of the word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit 30.

Specifically, for the document set to which word boundaries have been assigned so as to enumerate the plurality of word division candidates generated by the word division candidate generation unit 30, the semantic vector calculation unit 32 calculates a semantic vector for each word that appears as a word division candidate. An existing method can be used to obtain the semantic vector of each word; word2vec, described in Non-Patent Document 6, is a representative example.

[Non-Patent Document 6]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

For each word division candidate that is a regular notation word, the synonym pair acquisition unit 34 selects, from the plurality of word division candidates, those whose semantic similarity calculated from the semantic vectors is equal to or greater than a threshold and whose sound similarity calculated from the readings of the words is equal to or greater than a threshold, and acquires each pair of the regular notation word and a selected word division candidate as a synonym pair. The synonym pair acquisition unit 34 further, for each of the selected word division candidates, selects from the plurality of word division candidates those whose semantic similarity is equal to or greater than a threshold and whose sound similarity is equal to or greater than a threshold, and acquires each pair of the regular notation word and a newly selected word division candidate as a synonym pair. FIG. 3 shows a conceptual diagram of the processing of the synonym pair acquisition unit 34.

Specifically, the synonym pair acquisition unit 34 first calculates, for each word division candidate that is a regular notation word, the semantic similarity with each of the other word division candidates. The semantic similarity is the cosine similarity of the per-word semantic vectors obtained by the semantic vector calculation unit 32.
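For reference, cosine similarity between two semantic vectors can be computed as follows. This is a plain-Python sketch; the function name and the choice to return 0.0 for a zero vector are assumptions, not part of the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The value depends only on the direction of the vectors, so two words used in similar contexts score close to 1 regardless of how frequent each word is.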

Next, the synonym pair acquisition unit 34 calculates, for each word division candidate that is a regular notation word, the sound similarity with each of the other word division candidates. In the present embodiment, a sound similarity distance is calculated from the readings of the word division candidates and used as the sound similarity. Readings are estimated for kanji notation, and katakana notation is converted to hiragana. The conversion costs are set as follows. Converting a character to the same character has cost 0. Deleting a vowel (including small kana such as ぁ, ぃ, ぅ, ぇ, ぉ), the geminate marker (っ), the moraic nasal (ん), or a long-vowel mark has cost 0, except that a deletion at the beginning of a word has cost 1 and increments the sound similarity distance. Substituting a character with another character in the same row or the same column of the Japanese hiragana syllabary table (a voiced or semi-voiced character is treated as occupying the same position as its unvoiced counterpart), substituting between a vowel and the geminate marker, between a vowel and a long-vowel mark, or between two vowels also has cost 0. For example, 「ぶ」 and 「ぷ」 are treated as 「ふ」, so substituting them with any character in the same row or column (はひふへほ, うくすつぬむゆる) has cost 0. Any other conversion has cost 1 and increments the sound similarity distance. FIG. 2 shows an example of calculating the sound similarity distance. Because the present embodiment keeps candidates whose sound similarity is equal to or greater than a threshold, it keeps candidates whose sound similarity distance is equal to or less than a threshold.
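The cost rules above can be read as a weighted edit distance over hiragana readings. The sketch below is one possible reading of those rules, not the patent's implementation: it assumes readings are already available (kanji reading estimation is out of scope), treats insertion symmetrically with deletion, and uses hypothetical names throughout.

```python
import unicodedata

# Hiragana syllabary table; "＿" marks an empty cell. Index = (row, column).
_ROWS = ["あいうえお", "かきくけこ", "さしすせそ", "たちつてと", "なにぬねの",
         "はひふへほ", "まみむめも", "や＿ゆ＿よ", "らりるれろ", "わ＿＿＿を"]
_POS = {ch: (r, c) for r, row in enumerate(_ROWS)
        for c, ch in enumerate(row) if ch != "＿"}
_SMALL = str.maketrans("ぁぃぅぇぉゃゅょ", "あいうえおやゆよ")
_VOWELS = set("あいうえお")
_FREE = _VOWELS | {"っ", "ん", "ー"}  # deletable at cost 0 away from the word head


def to_hiragana(text):
    """Shift katakana code points to hiragana; the long-vowel mark is left as is."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def _base(ch):
    """Strip voicing marks (が -> か, ぱ -> は) and map small kana to full size."""
    return unicodedata.normalize("NFD", ch)[0].translate(_SMALL)


def _indel_cost(ch, index):
    return 0 if index > 0 and _base(ch) in _FREE else 1


def _sub_cost(x, y):
    x, y = _base(x), _base(y)
    if x == y:
        return 0
    # vowel <-> vowel, vowel <-> geminate marker, vowel <-> long-vowel mark
    if {x, y} <= _VOWELS | {"っ", "ー"} and (x in _VOWELS or y in _VOWELS):
        return 0
    px, py = _POS.get(x), _POS.get(y)
    if px is not None and py is not None and (px[0] == py[0] or px[1] == py[1]):
        return 0  # same row or same column of the syllabary table
    return 1


def sound_distance(a, b):
    """Weighted edit distance between two kana readings."""
    a, b = to_hiragana(a), to_hiragana(b)
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + _indel_cost(a[i - 1], i - 1)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + _indel_cost(b[j - 1], j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + _indel_cost(a[i - 1], i - 1),
                          d[i][j - 1] + _indel_cost(b[j - 1], j - 1),
                          d[i - 1][j - 1] + _sub_cost(a[i - 1], b[j - 1]))
    return d[n][m]
```

Under these rules the same-row and same-column substitutions are permissive, which is why the distance is combined with the semantic similarity and the dictionary check rather than used alone.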

Next, for each regular-notation word division candidate obtained from the document set, the synonym pair acquisition unit 34 performs the first acquisition process and the second acquisition process described below to acquire synonym pairs. In the first acquisition process, the following processing is performed for each regular-notation word division candidate obtained from the document set.

First, from the other word division candidates that appear in the document set, those whose semantic similarity with the regular-notation word division candidate is equal to or greater than a predetermined threshold are retained. Next, from the retained candidates, those whose sound similarity with the regular notation word is equal to or greater than a predetermined threshold (that is, whose sound similarity distance is equal to or less than a threshold) are retained. Then, from the retained candidates, those whose notation exists in the dictionary database 28 as a regular notation word with the same part of speech as the regular notation word in question are removed. The synonym pair acquisition unit 34 selects the word division candidates that remain after the removal, and each pair of the regular-notation word division candidate and a selected word division candidate is acquired as a synonym pair. FIG. 4 shows an example of the first acquisition process, in which word division candidates are selected starting from the regular-notation word division candidate 「さむい」 (samui, "cold").
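The three filters of the first acquisition process can be sketched as a single function. Everything here is illustrative: the function name, the callables `sem_sim` and `sound_dist`, the dictionary shape (notation mapped to a set of parts of speech), and the default thresholds are assumptions rather than values given in the source.

```python
def first_acquisition(seed, seed_pos, candidates, sem_sim, sound_dist, dictionary,
                      sem_threshold=0.5, dist_threshold=0):
    """Return (seed, candidate) synonym pairs for one regular notation word.

    sem_sim(a, b)    -> semantic similarity (higher is more similar)
    sound_dist(a, b) -> sound similarity distance (lower is more similar)
    dictionary       -> {notation: set of parts of speech} of regular notation words
    """
    pairs = []
    for cand in candidates:
        if cand == seed:
            continue
        if sem_sim(seed, cand) < sem_threshold:      # filter 1: meaning
            continue
        if sound_dist(seed, cand) > dist_threshold:  # filter 2: sound
            continue
        if seed_pos in dictionary.get(cand, ()):     # filter 3: already a regular word
            continue
        pairs.append((seed, cand))
    return pairs
```

The dictionary check removes candidates that are themselves regular notation words with the same part of speech as the seed, which helps keep ordinary words that merely sound alike out of the result.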

Next, for the regular-notation word division candidate in question, the synonym pair acquisition unit 34 performs the second acquisition process, which starts from the word division candidates selected as its synonym pairs in the first acquisition process, as follows. First, for each word division candidate selected as a synonym pair for the regular-notation word division candidate in the first acquisition process, the semantic similarity and the sound similarity distance with each of the other word division candidates are calculated. Then the following processing is performed for each of those selected word division candidates.

From the other word division candidates that appear in the document set, those whose semantic similarity with the selected word division candidate is equal to or greater than a predetermined threshold are retained. Next, from the retained candidates, those whose sound similarity distance from the selected word division candidate is equal to or less than a predetermined threshold are retained. Then, from the retained candidates, those whose notation exists in the dictionary database 28 as a regular notation word with the same part of speech as the regular notation word in question are removed. The synonym pair acquisition unit 34 selects the word division candidates that remain after the removal, and each pair of the regular-notation word division candidate and a selected word division candidate is acquired as a synonym pair. FIG. 5 shows an example of the second acquisition process, in which word division candidates are selected starting from the word division candidate 「さみぃ」 (samii), which was selected for the regular-notation word division candidate 「さむい」 in the first acquisition process. Furthermore, the synonym pair acquisition unit 34 repeats the same processing as the second acquisition process a predetermined number of times, each time starting from the word division candidates selected in the preceding round, and acquires each pair of the regular-notation word division candidate and a selected word division candidate as a synonym pair.
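The first acquisition process, the second acquisition process, and its repetition can be viewed as a bounded breadth-first expansion from the regular notation word. The sketch below illustrates that reading; the function name, the `extra_rounds` parameter (number of rounds after the first), the default thresholds, and the suppression of duplicates are assumptions not stated in the source.

```python
def acquire_pairs(seed, seed_pos, candidates, sem_sim, sound_dist, dictionary,
                  sem_threshold=0.5, dist_threshold=0, extra_rounds=1):
    """Expand from a regular notation word and pair every selected candidate with it."""

    def select(origin):
        # Same three filters in every round, measured against the current origin.
        return [c for c in candidates
                if c != origin and c != seed
                and sem_sim(origin, c) >= sem_threshold
                and sound_dist(origin, c) <= dist_threshold
                and seed_pos not in dictionary.get(c, ())]

    found = []
    frontier = [seed]                      # round 1 starts from the regular notation word
    for _ in range(1 + extra_rounds):
        next_frontier = []
        for origin in frontier:
            for cand in select(origin):
                if cand not in found:      # assumption: each candidate is paired once
                    found.append(cand)
                    next_frontier.append(cand)
        frontier = next_frontier           # later rounds start from the newly selected words
    return [(seed, cand) for cand in found]
```

Pairs are always formed with the original regular notation word, so a variant reached only through an intermediate variant (for example via 「さみぃ」) is still attached to 「さむい」.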

<Operation of the Synonym Pair Acquisition Device According to the Embodiment of the Present Invention>

Next, the operation of the synonym pair acquisition device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a document set consisting of documents that contain broken notation words, the synonym pair acquisition device 100 executes the synonym pair acquisition processing routine shown in FIG. 6.

First, in step S100, a plurality of word division candidates are generated from each document in the document set received by the input unit 10.

Next, in step S102, a semantic vector of the word is calculated for each of the word division candidates, based on the plurality of word division candidates generated in step S100.

In step S104, for each word division candidate generated in step S100 that is a regular notation word, the semantic similarity with each of the other word division candidates is calculated based on the semantic vectors calculated in step S102.

In step S106, for each word division candidate generated in step S100 that is a regular notation word, the sound similarity distance from each of the other word division candidates is calculated based on the readings of the word division candidates.

In step S108, for each word division candidate that is a regular notation word, the word division candidates whose semantic similarity calculated in step S104 is equal to or greater than a threshold and whose sound similarity distance calculated in step S106 is equal to or less than a threshold are selected from the plurality of word division candidates, and each pair of the regular notation word and a selected word division candidate is acquired as a synonym pair.

In step S110, for each word division candidate that is a regular notation word, and for each of the word division candidates selected in step S108 or in the previous execution of step S110, the word division candidates whose semantic similarity, calculated in the same manner as in step S104, is equal to or greater than the threshold and whose sound similarity distance, calculated in the same manner as in step S106, is equal to or less than the threshold are selected from the plurality of word division candidates. Each pair consisting of that regular notation word and a newly selected word division candidate is acquired as a synonym pair.

In step S112, it is determined whether the processing of step S110 has been repeated a predetermined number of times. If so, the routine proceeds to step S114; if not, it returns to step S110 and repeats the processing.
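The loop formed by steps S108 to S112 can be sketched as a fixed number of expansion rounds, each seeded by the candidates selected in the previous round. The sketch assumes a caller-supplied predicate `is_match(a, b)` that applies both thresholds, and it skips candidates that were already acquired. Neither detail is stated in this passage, and `expand_synonym_pairs` is a hypothetical name.

```python
def expand_synonym_pairs(regular_word, candidates, is_match, iterations):
    """Sketch of steps S108 to S112 for a single regular notation word.

    `is_match(a, b)` should return True when the semantic similarity of
    a and b is at or above its threshold and their sound similarity
    distance is at or below its threshold.
    `iterations` is the predetermined repeat count checked in step S112.
    """
    # Step S108: candidates that match the regular notation word itself.
    frontier = [c for c in candidates
                if c != regular_word and is_match(regular_word, c)]
    acquired = list(frontier)
    seen = set(acquired) | {regular_word}

    # Steps S110 and S112: repeat a fixed number of times, using the
    # candidates selected in the previous round as new starting points.
    for _ in range(iterations):
        next_frontier = []
        for seed in frontier:
            for c in candidates:
                if c not in seen and is_match(seed, c):
                    seen.add(c)
                    next_frontier.append(c)
        acquired.extend(next_frontier)
        frontier = next_frontier

    # Every acquired candidate is paired with the original regular word.
    return [(regular_word, c) for c in acquired]


# Toy match relation (assumed): each form matches only its neighbor.
links = {("sugoi", "sugee"), ("sugee", "sugeee"), ("sugeee", "sugeeee")}


def toy_match(a, b):
    return (a, b) in links or (b, a) in links


candidates = ["sugoi", "sugee", "sugeee", "sugeeee", "ringo"]
one_round = expand_synonym_pairs("sugoi", candidates, toy_match, iterations=1)
```

In the toy data, "sugeee" does not match "sugoi" directly, but it is reached in one round through "sugee" and is still paired with the regular notation word "sugoi".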

In step S114, the synonym pairs acquired in steps S108 and S110 are output to the output unit 50, and the processing ends.

As described above, the synonym pair acquisition apparatus according to the embodiment of the present invention generates, from a document, a plurality of word division candidates that are regular notation words or collapsed notation words, and calculates a word semantic vector for each of them based on the plurality of word division candidates. For each word division candidate that is a regular notation word, it selects from the plurality of word division candidates those whose semantic similarity, calculated from the semantic vectors, is equal to or greater than a threshold and whose sound similarity distance, calculated from the readings of the words, is equal to or less than a threshold. By acquiring each pair of the regular notation word and a selected word division candidate as a synonym pair, synonym pairs can be acquired efficiently.

Further, by taking both the semantic similarity and the sound similarity into account, pairs of synonym candidates can be acquired with high accuracy.

Further, even word division candidates that would be filtered out if acquisition started only from regular notation words can be reached by acquiring new synonym pairs starting from the selected word division candidates, which makes it possible to acquire a wider variety of collapsed notation words.

Further, compared with conventional methods, word division candidates can be generated that correctly delimit a wide variety of collapsed notation words.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Computation unit
28 Dictionary database
30 Word division candidate generation unit
32 Semantic vector calculation unit
34 Synonym pair acquisition unit
50 Output unit
100 Synonym pair acquisition apparatus

Claims

A synonym pair acquisition apparatus comprising:
a word division candidate generation unit that generates, from a document, a plurality of word division candidates, each being a regular notation word or a collapsed notation word that is a candidate for a variant notation of a regular notation word;
a semantic vector calculation unit that calculates a word semantic vector for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a synonym pair acquisition unit that, for each word division candidate that is a regular notation word, selects a word division candidate from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires the pair of the word division candidate that is the regular notation word and the selected word division candidate as a synonym pair.

The synonym pair acquisition apparatus according to claim 1, wherein the synonym pair acquisition unit:
for each word division candidate that is a regular notation word, selects a word division candidate from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquires the pair of the word division candidate that is the regular notation word and the selected word division candidate as a synonym pair; and
for each of the selected word division candidates, selects a word division candidate from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquires the pair of the word division candidate that is the regular notation word and the selected word division candidate as a synonym pair.

A synonym pair acquisition method comprising:
a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates, each being a regular notation word or a collapsed notation word that is a candidate for a variant notation of a regular notation word;
a step in which a semantic vector calculation unit calculates a word semantic vector for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a step in which a synonym pair acquisition unit, for each word division candidate that is a regular notation word, selects a word division candidate from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires the pair of the word division candidate that is the regular notation word and the selected word division candidate as a synonym pair.

The synonym pair acquisition method according to claim 3, wherein, in the step of acquiring by the synonym pair acquisition unit:
for each word division candidate that is a regular notation word, a word division candidate is selected from the plurality of word division candidates based on the semantic similarity and the sound similarity, and the pair of the word division candidate that is the regular notation word and the selected word division candidate is acquired as a synonym pair; and
for each of the selected word division candidates, a word division candidate is selected from the plurality of word division candidates based on the semantic similarity and the sound similarity, and the pair of the word division candidate that is the regular notation word and the selected word division candidate is acquired as a synonym pair.

A program for causing a computer to function as each unit of the synonym pair acquisition apparatus according to claim 1 or claim 2.