JP2016224482A - Synonym pair acquisition device, method and program - Google Patents

Synonym pair acquisition device, method and program

Info

Publication number
JP2016224482A
Authority
JP
Japan
Prior art keywords
word
word division
candidates
notation
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2015106871A
Other languages
Japanese (ja)
Other versions
JP6427466B2 (en)
Inventor
Itsumi Saito
Kugatsu Sadamitsu
Hisako Asano
Yoshihiro Matsuo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2015106871A
Publication of JP2016224482A
Application granted
Publication of JP6427466B2
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To acquire synonym pairs efficiently.

SOLUTION: A word division candidate generation unit 30 generates, from a document, a plurality of word division candidates, each of which is a normal-notation word or an informal-notation word. A semantic vector calculation unit 32 calculates a semantic vector for each of the word division candidates, based on the plurality of word division candidates. For each word division candidate that is a normal-notation word, a synonym pair acquisition unit 34 selects, from the word division candidates, those whose semantic similarity, calculated from the semantic vectors, is equal to or higher than a threshold and whose sound similarity distance, calculated from the readings of the words, is equal to or lower than a threshold. It then acquires each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a synonym pair acquisition device, method, and program for acquiring synonym pairs.

Methods have been proposed for acquiring informal-notation words, which are notations that deviate from a normal-notation word. Methods that use training data include those described in Non-Patent Documents 1 and 2.

Methods that do not use training data include those described in Non-Patent Documents 3 and 4.

[Non-Patent Document 1]: Naoaki Okazaki and Jun'ichi Tsujii, "Automatic Acquisition of Abbreviation Definitions Using an Alignment Discrimination Model," 14th Annual Meeting of the Association for Natural Language Processing (NLP2008), pp. 139-142.
[Non-Patent Document 2]: Yoshinari Fujinuma, Hikaru Yokono, and Akiko Aizawa, "Detection and Analysis of Informal Notation on Twitter(R), Using 'Ohayou' (Good Morning) as an Example," 27th Annual Conference of the Japanese Society for Artificial Intelligence, June 2013.
[Non-Patent Document 3]: Takeshi Masuyama and Satoshi Sekine, "Automatic Construction of a List of Katakana Spelling Variants from a Large Corpus," 14th Annual Meeting of the Association for Natural Language Processing (NLP2004).
[Non-Patent Document 4]: Kazushi Ikeda, Tadashi Yanagihara, Kazunori Matsumoto, and Yasuhiro Takishima, "Automatic Generation of Normalization Rules for Highly Accurate Analysis of Informal Expressions," IPSJ Journal, Vol. 3, No. 3, pp. 68-77, 2010.
[Non-Patent Document 5]: T. Kudo, Japanese Morphological Analyzer, Internet <URL: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html>

However, extracting informal-notation words with the training-data-based methods of Non-Patent Documents 1 and 2 requires correct pairs such as those shown in FIG. 7 to be created manually from Web data, so the cost of generating correct pairs is high.

Methods that do not use training data, on the other hand, restrict the informal-notation words that can be acquired to a limited set of candidates (katakana words, words that an existing analyzer treats as unknown words, and so on), and therefore cannot acquire diverse informal notations. The reason is that existing analyzers often analyze informal-notation words incorrectly. Japanese has no delimiters such as spaces between words, so it is difficult to determine the correct morpheme boundaries in ordinary text. Moreover, the Web contains many informal-notation words written in hiragana, in a mixture of kanji and hiragana, in a mixture of katakana and hiragana, and so on, which makes analysis difficult. Examples include 「すげー」, 「やば」, 「さみい」, 「サムい」, and 「寒っ」. FIG. 8 shows an example of the result of analyzing a sentence that contains informal-notation words with MeCab (IPAdic), described in Non-Patent Document 5.

The present invention has been made to solve the above problems, and its object is to provide a synonym pair acquisition device, method, and program that can acquire synonym pairs efficiently.

To achieve the above object, a synonym pair acquisition device according to a first invention includes: a word division candidate generation unit that generates, from a document, a plurality of word division candidates, each of which is a normal-notation word or an informal-notation word that is a candidate for a notation deviating from a normal-notation word; a semantic vector calculation unit that calculates, based on the plurality of word division candidates generated by the word division candidate generation unit, a semantic vector of the word for each of the plurality of word division candidates; and a synonym pair acquisition unit that, for each word division candidate that is a normal-notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair.

In the synonym pair acquisition device according to the first invention, the synonym pair acquisition unit may, for each word division candidate that is a normal-notation word, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair, and may further, for each of the selected word division candidates, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair consisting of the normal-notation word and a word division candidate selected in this way as a synonym pair.

A synonym pair acquisition method according to a second invention includes: a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates, each of which is a normal-notation word or an informal-notation word that is a candidate for a notation deviating from a normal-notation word; a step in which a semantic vector calculation unit calculates, based on the plurality of word division candidates generated by the word division candidate generation unit, a semantic vector of the word for each of the plurality of word division candidates; and a step in which a synonym pair acquisition unit, for each word division candidate that is a normal-notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair.

In the synonym pair acquisition method according to the second invention, the step performed by the synonym pair acquisition unit may, for each word division candidate that is a normal-notation word, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair, and may further, for each of the selected word division candidates, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity and acquire each pair consisting of the normal-notation word and a word division candidate selected in this way as a synonym pair.

A program according to a third invention is a program for causing a computer to function as each unit of the synonym pair acquisition device according to the first invention.

According to the synonym pair acquisition device, method, and program of the present invention, synonym pairs can be acquired efficiently by: generating, from a document, a plurality of word division candidates, each of which is a normal-notation word or an informal-notation word; calculating, based on the plurality of word division candidates, a semantic vector of the word for each of them; selecting, for each word division candidate that is a normal-notation word, word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words; and acquiring each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair.

FIG. 1 is a block diagram showing the configuration of the synonym pair acquisition device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the sound similarity.
FIG. 3 is a conceptual diagram showing an example of the acquisition of synonym pairs.
FIG. 4 is a diagram showing an example in which word division candidates are selected starting from a normal-notation word.
FIG. 5 is a diagram showing an example in which further word division candidates are selected starting from a selected word division candidate.
FIG. 6 is a flowchart showing the synonym pair acquisition processing routine in the synonym pair acquisition device according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of combinations of normal-notation words and informal-notation words.
FIG. 8 is a diagram showing an example of the result of analyzing a sentence that contains informal-notation words with MeCab.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Configuration of the Synonym Pair Acquisition Device According to the Embodiment of the Present Invention>

The configuration of the synonym pair acquisition device according to the embodiment of the present invention is described below. As shown in FIG. 1, the synonym pair acquisition device 100 according to the embodiment can be implemented by a computer that includes a CPU, a RAM, and a ROM storing a program for executing the synonym pair acquisition processing routine described later, together with various data. Functionally, as shown in FIG. 1, the synonym pair acquisition device 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

The input unit 10 receives a document set consisting of documents that contain informal-notation words.

The calculation unit 20 includes a dictionary database 28, a word division candidate generation unit 30, a semantic vector calculation unit 32, and a synonym pair acquisition unit 34.

The dictionary database 28 stores the dictionary (readings, notations, and parts of speech) needed for dictionary lookup.

The word division candidate generation unit 30 generates, from each document in the document set received by the input unit 10, a plurality of word division candidates, each of which is a normal-notation word or an informal-notation word that is a candidate for a notation deviating from a normal-notation word.

Specifically, the word division candidate generation unit 30 generates word division candidates by applying each of the following first to third methods, all of which are existing word division methods, to the documents. The methods are chosen so that informal-notation words that are not in the dictionary database 28 can also be output as segmentation candidates.

As the first method, the word division candidate generation unit 30 applies a word division method based on pointwise estimation to each document in the document set. This method divides a document into a plurality of word division candidates using a model of the boundaries between characters whose features include character n-grams and character-type n-grams.
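As a rough illustration of the kind of features such a boundary model might use, the sketch below builds one character n-gram feature and one character-type n-gram feature for a candidate boundary. The function names, the window size, the feature string format, and the four character classes are assumptions made for this example; the source does not specify them, and the classifier that would consume the features is omitted.

```python
def char_type(ch: str) -> str:
    """Return a coarse character class (labels are illustrative)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "H"  # hiragana
    if 0x30A0 <= code <= 0x30FF:
        return "K"  # katakana, including the long-vowel mark
    if 0x4E00 <= code <= 0x9FFF:
        return "C"  # CJK ideograph (kanji)
    return "O"      # anything else


def boundary_features(text: str, i: int, window: int = 2) -> list[str]:
    """Features for the candidate boundary between text[i-1] and text[i].

    The feature set is a hypothetical minimum: the characters on each side
    of the boundary and the corresponding character types.
    """
    left = text[max(0, i - window):i]
    right = text[i:i + window]
    left_types = "".join(char_type(c) for c in left)
    right_types = "".join(char_type(c) for c in right)
    return [
        f"char:{left}|{right}",
        f"type:{left_types}|{right_types}",
    ]
```

In a full system, features like these would be fed to a binary classifier that decides, independently at each position, whether a word boundary is present.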

As the second method, the word division candidate generation unit 30 applies a word division method based on unsupervised analysis to the document set. This method calculates statistics such as the frequency of occurrence of sampled word division candidates and divides each document into word division candidates so that an objective function is optimized.

As the third method, the word division candidate generation unit 30 obtains the result of analysis by MeCab or a similar analyzer for each document in the document set and generates word division candidates by joining parts of the result according to predetermined rules. Examples of such rules are joining consecutive unknown words and joining consecutive nouns. The following approach may also be used as a rule. When short sentences are cut out of Twitter(R) posts or the like and used as word division candidates, a set of delimiters is defined (for example, line breaks, symbolic expressions such as 「!」, 「w」, and 「♪」, and punctuation marks such as 「、」 and 「。」), and the text is split at the delimiters. With this setting, from the sentence 「やっべぇぇwwwwwwwwwww」, the string 「やっべぇぇ」 preceding the 「w」 is obtained as a word division candidate. Likewise, from the sentence 「おっはよお♪ ってお昼だけど・・・ 今起きた・・・」, the string 「おっはよお」 preceding the 「♪」 is obtained as a word division candidate. Character strings of n or fewer characters obtained in this way may be added to the morpheme dictionary before the analysis is performed.
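The delimiter-based extraction of short strings can be sketched as follows. This is a minimal illustration, not the patented implementation: the delimiter set is taken from the examples in the text, while the function name `short_fragments` and the default length limit of 8 characters (standing in for n) are assumptions.

```python
import re

# Hypothetical delimiter set based on the examples in the text:
# line breaks, symbolic expressions, and punctuation marks.
DELIMITER_PATTERN = re.compile(r"[\n!！wｗ♪、。]+")


def short_fragments(text: str, max_len: int = 8) -> list[str]:
    """Split text at delimiters and keep non-empty fragments of at most
    max_len characters (max_len plays the role of n in the text)."""
    fragments = (piece.strip() for piece in DELIMITER_PATTERN.split(text))
    return [f for f in fragments if f and len(f) <= max_len]
```

With these assumed settings, `short_fragments("やっべぇぇwwwwwwwwwww")` yields only 「やっべぇぇ」, and the second example sentence yields only 「おっはよお」 because the remainder exceeds the length limit. Treating 「w」 as a delimiter would also split Latin-alphabet words, so a practical delimiter set would need more care.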

The semantic vector calculation unit 32 calculates a semantic vector of the word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit 30.

Specifically, the semantic vector calculation unit 32 takes the document set with word boundaries assigned so that the plurality of word division candidates generated by the word division candidate generation unit 30 are enumerated, and calculates a semantic vector for each word that appears as a word division candidate. An existing method can be used to obtain the semantic vector of each word; a representative example is word2vec, described in Non-Patent Document 6.

[Non-Patent Document 6]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

For each word division candidate that is a normal-notation word, the synonym pair acquisition unit 34 selects, from the plurality of word division candidates, those whose semantic similarity, calculated from the semantic vectors, is equal to or greater than a threshold and whose sound similarity, calculated from the readings of the words, is equal to or greater than a threshold, and acquires each pair consisting of the normal-notation word and a selected word division candidate as a synonym pair. The synonym pair acquisition unit 34 further selects, for each of the selected word division candidates, the word division candidates whose semantic similarity and sound similarity to it are each equal to or greater than the respective threshold, and acquires each pair consisting of the normal-notation word and a word division candidate selected in this way as a synonym pair. FIG. 3 shows a conceptual diagram of the processing of the synonym pair acquisition unit 34.

Specifically, the synonym pair acquisition unit 34 first calculates, for each word division candidate that is a normal-notation word, the semantic similarity to each of the other word division candidates. The semantic similarity is calculated as the cosine similarity between the per-word semantic vectors obtained by the semantic vector calculation unit 32.
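For reference, the cosine similarity between two semantic vectors can be computed as in the sketch below. The function name and the convention of returning 0.0 for a zero vector are choices made for this example, not details given in the source.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

The value ranges from -1 to 1, so a threshold on it selects candidates whose vectors point in nearly the same direction as that of the normal-notation word.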

Next, the synonym pair acquisition unit 34 calculates, for each word division candidate that is a normal-notation word, the sound similarity to each of the other word division candidates. In this embodiment, a sound similarity distance is calculated as the sound similarity, based on the readings of the word division candidates. Readings are estimated for kanji notation, and katakana notation is converted to hiragana. The conversion costs are set as follows.

  • Converting a character to the same character costs 0.
  • Deleting a vowel (including small kana such as ぁ, ぃ, ぅ, ぇ, ぉ), the geminate consonant mark (っ), the moraic nasal (ん), or a long-vowel mark costs 0. Deletion at the beginning of a word, however, costs 1 and increments the sound similarity distance.
  • Substituting a character with another character in the same row or the same column of the Japanese hiragana gojuon (50-sound) table costs 0, as do substitutions between a vowel and the geminate consonant mark, between a vowel and a long-vowel mark, and between vowels. A voiced or semi-voiced character is treated as occupying the same position as its unvoiced counterpart. For example, 「ぶ」 or 「ぷ」 → 「ふ」, and substitutions among characters in the same row or column (はひふへほ, うくすつぬむゆる), cost 0.
  • Any other conversion costs 1 and increments the sound similarity distance.

FIG. 2 shows an example of calculating the sound similarity distance. In this embodiment, selecting candidates whose sound similarity is equal to or greater than a threshold is implemented by selecting candidates whose sound similarity distance is equal to or less than a threshold.
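One way to realize these cost rules is a weighted edit distance over hiragana readings, sketched below. Several points are assumptions not stated in the source: insertions are charged like deletions so that the distance is symmetric, deleting any character outside the listed classes costs 1, and all identifiers are illustrative. Reading estimation for kanji is out of scope, so the inputs are assumed to be kana readings.

```python
import unicodedata

# Rows of the gojuon table; "_" marks an empty cell.
GOJUON_ROWS = [
    "あいうえお", "かきくけこ", "さしすせそ", "たちつてと", "なにぬねの",
    "はひふへほ", "まみむめも", "や_ゆ_よ", "らりるれろ", "わ___を",
]
POSITION = {
    ch: (row, col)
    for row, chars in enumerate(GOJUON_ROWS)
    for col, ch in enumerate(chars)
    if ch != "_"
}
SMALL_TO_FULL = str.maketrans("ぁぃぅぇぉゃゅょ", "あいうえおやゆよ")
VOWELS = set("あいうえおぁぃぅぇぉ")
FREE_DELETE = VOWELS | {"っ", "ん", "ー"}  # deletable at cost 0 (not word-initially)
VOWEL_LIKE = VOWELS | {"っ", "ー"}         # freely interchangeable with a vowel


def to_hiragana(text: str) -> str:
    """Convert katakana letters to hiragana; other characters are kept."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def table_position(ch: str):
    """Gojuon position, ignoring voicing marks and small-kana size."""
    base = unicodedata.normalize("NFD", ch)[0].translate(SMALL_TO_FULL)
    return POSITION.get(base)


def substitution_cost(a: str, b: str) -> int:
    if a == b:
        return 0
    if (a in VOWELS and b in VOWEL_LIKE) or (b in VOWELS and a in VOWEL_LIKE):
        return 0
    pa, pb = table_position(a), table_position(b)
    if pa is not None and pb is not None and (pa[0] == pb[0] or pa[1] == pb[1]):
        return 0  # same row or same column
    return 1


def deletion_cost(ch: str, index: int) -> int:
    return 0 if ch in FREE_DELETE and index > 0 else 1


def sound_distance(reading_a: str, reading_b: str) -> int:
    """Weighted edit distance between two kana readings."""
    a, b = to_hiragana(reading_a), to_hiragana(reading_b)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + deletion_cost(a[i - 1], i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + deletion_cost(b[j - 1], j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + deletion_cost(a[i - 1], i - 1),
                d[i][j - 1] + deletion_cost(b[j - 1], j - 1),
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
    return d[len(a)][len(b)]
```

Under these assumptions, 「さむい」 and 「さみぃ」 are at distance 0, because む and み share a row and ぃ is a small vowel. The rules are deliberately lenient, which is why the distance is used together with the semantic similarity and the dictionary check rather than on its own.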

Next, the synonym pair acquisition unit 34 performs the first acquisition process and the second acquisition process described below for each normal-notation word division candidate obtained from the document set, and thereby acquires synonym pairs. In the first acquisition process, the following is performed for each normal-notation word division candidate obtained from the document set.

First, from the other word division candidates that appear in the document set, those whose semantic similarity to the normal-notation word division candidate is equal to or greater than a predetermined threshold are retained. Next, from the retained candidates, those whose sound similarity to the normal-notation word is equal to or greater than a predetermined threshold (that is, whose sound similarity distance is equal to or less than the threshold) are retained. Then, from these, any candidate whose notation exists in the dictionary database 28 as a normal-notation word with the same part of speech as the normal-notation word in question is removed. The synonym pair acquisition unit 34 selects the candidates that remain, and each pair consisting of the normal-notation word division candidate and a selected word division candidate is acquired as a synonym pair. FIG. 4 shows an example of the first acquisition process, in which word division candidates are selected starting from the normal-notation word division candidate 「さむい」.
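The three filters of the first acquisition process can be sketched as a single pass over the candidates. This is an illustration under stated assumptions: the similarity and distance functions are passed in as callables, the dictionary is modeled as a mapping from notation to a set of part-of-speech labels, and every name and parameter here is hypothetical.

```python
from typing import Callable


def first_acquisition(
    seed: str,
    seed_pos: str,
    candidates: list[str],
    semantic_similarity: Callable[[str, str], float],
    sound_distance: Callable[[str, str], int],
    dictionary_pos: dict[str, set[str]],
    min_similarity: float,
    max_distance: int,
) -> list[tuple[str, str]]:
    """Return (seed, candidate) pairs that survive all three filters."""
    pairs = []
    for cand in candidates:
        if cand == seed:
            continue
        # Filter 1: semantic similarity must reach the threshold.
        if semantic_similarity(seed, cand) < min_similarity:
            continue
        # Filter 2: sound similarity distance must not exceed the threshold.
        if sound_distance(seed, cand) > max_distance:
            continue
        # Filter 3: drop dictionary words with the seed's part of speech.
        if seed_pos in dictionary_pos.get(cand, set()):
            continue
        pairs.append((seed, cand))
    return pairs
```

The dictionary check removes candidates that are themselves normal-notation words of the same part of speech, which would otherwise pass the lenient sound filter even though they are different words rather than informal variants of the seed.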

Next, for the normal-notation word division candidate, the synonym pair acquisition unit 34 performs the second acquisition process, which starts from the word division candidates that were selected as synonym pairs for that normal-notation word in the first acquisition process. First, for each word division candidate so selected, the semantic similarity and the sound similarity distance to the other word division candidates are calculated. Then the following is performed for each of those selected word division candidates.

For the word division candidate in question, the other word division candidates that appear in the document set and whose semantic similarity to it is equal to or greater than a predetermined threshold are retained. Next, from the retained candidates, those whose sound similarity distance to the word division candidate in question is equal to or less than a predetermined threshold are retained. Then, from these, any candidate whose notation exists in the dictionary database 28 as a normal-notation word with the same part of speech as the normal-notation word in question is removed. The synonym pair acquisition unit 34 selects the candidates that remain, and each pair consisting of the normal-notation word division candidate and a selected word division candidate is acquired as a synonym pair. FIG. 5 shows an example of the second acquisition process, in which word division candidates are selected starting from 「さみぃ」, which was selected for the normal-notation word division candidate 「さむい」 in the first acquisition process. Furthermore, the synonym pair acquisition unit 34 repeats the same processing as the second acquisition process a predetermined number of times, each time starting from the word division candidates selected in the preceding second acquisition process, and acquires each pair consisting of the normal-notation word division candidate and a selected word division candidate as a synonym pair.
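The repeated expansion from newly selected candidates can be sketched as a breadth-first loop. The function below is an assumption-laden illustration: `select` stands for whatever filtering is applied relative to a starting word (semantic similarity, sound similarity distance, dictionary check), `rounds` counts the first acquisition process as round 1, and the names are hypothetical.

```python
from typing import Callable


def expand_synonym_pairs(
    seed: str,
    select: Callable[[str], list[str]],
    rounds: int,
) -> list[tuple[str, str]]:
    """Expand from a normal-notation seed for a fixed number of rounds.

    Every candidate found in any round is paired with the original seed,
    not with the intermediate word it was reached from.
    """
    found: list[str] = []
    frontier = [seed]
    for _ in range(rounds):
        next_frontier = []
        for start in frontier:
            for cand in select(start):
                if cand != seed and cand not in found:
                    found.append(cand)
                    next_frontier.append(cand)
        frontier = next_frontier
    return [(seed, cand) for cand in found]
```

Starting each round only from the candidates found in the previous round lets variants that are far from the seed in sound, but close to an already accepted variant, still be paired with the normal-notation word. The fixed number of rounds bounds how far this chaining can drift.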

<Operation of the Synonym Pair Acquisition Device According to the Embodiment of the Present Invention>

Next, the operation of the synonym pair acquisition device 100 according to the embodiment of the present invention is described. When the input unit 10 receives a document set consisting of documents that contain informal-notation words, the synonym pair acquisition device 100 executes the synonym pair acquisition processing routine shown in FIG. 6.

First, in step S100, a plurality of word division candidates are generated from each document in the document set received by the input unit 10.

Next, in step S102, a word semantic vector is calculated for each word division candidate, based on the plurality of word division candidates generated in step S100.
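
This passage does not state how the semantic vector is computed. As a stand-in only, the following Python sketch builds windowed co-occurrence count vectors from sequences of word division candidates. The function name `cooccurrence_vectors`, the `window` parameter, and the use of raw counts are all assumptions for illustration and should not be read as the method of the embodiment.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(
    sentences: List[List[str]],
    window: int = 2,
) -> Dict[str, Counter]:
    """Count, for every word, the words seen within `window` positions of it.

    Each returned Counter is a sparse vector indexed by context word.
    """
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return dict(vectors)


# Two toy sentences that differ only in the regular vs. informal form.
sentences = [
    ["きょう", "は", "さむい", "ね"],
    ["きょう", "は", "さみぃ", "ね"],
]
vectors = cooccurrence_vectors(sentences, window=2)
```

In this toy input the regular form and the informal form occur in identical contexts, so they receive identical vectors, which is the property that a later semantic similarity step relies on.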

In step S104, for each word division candidate generated in step S100 that is a regular notation word, the semantic similarity to each of the other word division candidates is calculated based on the semantic vectors calculated in step S102.
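
As a minimal sketch of step S104, the following Python code scores every other candidate against one regular notation word. Cosine similarity is assumed here because it is a common choice for vector comparison; this passage does not name the similarity measure, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarities(
    regular_word: str,
    vectors: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Similarity of `regular_word` to each of the other candidates."""
    base = vectors[regular_word]
    return {
        word: cosine_similarity(base, vec)
        for word, vec in vectors.items()
        if word != regular_word
    }


# Toy two-dimensional vectors, invented for illustration.
vectors = {
    "さむい": [1.0, 0.0],
    "さみぃ": [1.0, 0.0],
    "あつい": [0.0, 1.0],
}
sims = semantic_similarities("さむい", vectors)
```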

In step S106, for each word division candidate generated in step S100 that is a regular notation word, the sound similarity distance to each of the other word division candidates is calculated based on the readings of the word division candidates.
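
One plausible way to obtain a distance from two readings is a character-level edit distance. The sketch below uses plain Levenshtein distance over reading strings; this is an assumption, since the passage only says that the distance is calculated from the readings, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Romanized readings, chosen for illustration.
d = edit_distance("samui", "samii")
```

A smaller value means the two readings sound more alike, which matches the "equal to or less than a threshold" test applied to the sound similarity distance in the selection step.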

In step S108, for each word division candidate that is a regular notation word, the word division candidates whose semantic similarity calculated in step S104 is equal to or greater than a threshold and whose sound similarity distance calculated in step S106 is equal to or less than a threshold are selected from the plurality of word division candidates, and each pair of the regular-notation word division candidate and a selected word division candidate is acquired as a synonym pair.
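
The two-threshold selection of step S108 can be sketched as follows. The scoring functions are passed in as parameters, and the lookup tables, threshold values, and function names are invented for illustration; they are not values from the embodiment.

```python
from typing import Callable, Iterable, List, Tuple


def select_synonym_pairs(
    regular_words: Iterable[str],
    candidates: List[str],
    semantic_similarity: Callable[[str, str], float],
    sound_distance: Callable[[str, str], float],
    sim_threshold: float,
    dist_threshold: float,
) -> List[Tuple[str, str]]:
    """Pair each regular notation word with every candidate that is
    semantically close (similarity >= threshold) AND phonetically close
    (distance <= threshold)."""
    pairs = []
    for regular in regular_words:
        for cand in candidates:
            if cand == regular:
                continue
            if (semantic_similarity(regular, cand) >= sim_threshold
                    and sound_distance(regular, cand) <= dist_threshold):
                pairs.append((regular, cand))
    return pairs


# Toy scores standing in for the results of steps S104 and S106.
SIMS = {("さむい", "さみぃ"): 0.9, ("さむい", "あつい"): 0.8, ("さむい", "さとう"): 0.1}
DISTS = {("さむい", "さみぃ"): 1, ("さむい", "あつい"): 3, ("さむい", "さとう"): 2}

pairs = select_synonym_pairs(
    regular_words=["さむい"],
    candidates=["さむい", "さみぃ", "あつい", "さとう"],
    semantic_similarity=lambda a, b: SIMS[(a, b)],
    sound_distance=lambda a, b: DISTS[(a, b)],
    sim_threshold=0.5,
    dist_threshold=2,
)
```

In the toy data, "あつい" is semantically close but phonetically distant, and "さとう" is phonetically close but semantically distant, so only "さみぃ" passes both tests.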

In step S110, for each word division candidate that is a regular notation word, and for each word division candidate selected for it in step S108 or in the previous execution of step S110, the word division candidates whose semantic similarity, calculated in the same manner as in step S104, is equal to or greater than a threshold and whose sound similarity distance, calculated in the same manner as in step S106, is equal to or less than a threshold are selected from the plurality of word division candidates, and each pair of that regular-notation word division candidate and a selected word division candidate is acquired as a synonym pair.

In step S112, it is determined whether the processing of step S110 has been repeated a predetermined number of times. If so, the routine proceeds to step S114; if not, it returns to step S110 and repeats the processing.
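
The loop formed by steps S110 and S112 can be sketched as a bounded expansion from the previously selected candidates. In this Python sketch, `is_similar` is a hypothetical predicate standing for both threshold tests, and two details are assumptions added for the sketch rather than taken from the description: already acquired candidates are skipped, and the loop stops early when a round selects nothing new. The dictionary-based deletion is omitted for brevity.

```python
from typing import Callable, Iterable, List, Set, Tuple


def expand_synonym_pairs(
    regular: str,
    first_selected: Iterable[str],
    candidates: List[str],
    is_similar: Callable[[str, str], bool],
    max_rounds: int,
) -> Set[Tuple[str, str]]:
    """Repeat the selection up to `max_rounds` times, each round starting
    from the candidates selected in the previous round, and pair every
    selected candidate with the original regular notation word."""
    acquired = set(first_selected)
    frontier = set(first_selected)
    for _ in range(max_rounds):
        next_frontier = set()
        for seed in frontier:
            for cand in candidates:
                if cand != regular and cand not in acquired and is_similar(seed, cand):
                    next_frontier.add(cand)
        if not next_frontier:  # assumption: stop early when nothing new is found
            break
        acquired |= next_frontier
        frontier = next_frontier
    return {(regular, word) for word in acquired}


# Toy similarity relation: only "さみぃ" and "さみー" pass both tests.
SIMILAR = {frozenset(("さみぃ", "さみー"))}

pairs = expand_synonym_pairs(
    regular="さむい",
    first_selected=["さみぃ"],
    candidates=["さむい", "さみぃ", "さみー", "あつい"],
    is_similar=lambda a, b: frozenset((a, b)) in SIMILAR,
    max_rounds=3,
)
```

Here "さみー" is reached only through "さみぃ", which illustrates how a form that fails the tests against the regular notation word itself can still be paired with it.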

In step S114, the synonym pairs acquired in steps S108 and S110 are output to the output unit 50, and the processing ends.

As described above, the synonym pair acquisition device according to the embodiment of the present invention generates, from documents, a plurality of word division candidates that are regular notation words or informal notation words; calculates a word semantic vector for each of the plurality of word division candidates based on those candidates; and, for each word division candidate that is a regular notation word, selects from the plurality of word division candidates those whose semantic similarity calculated from the semantic vectors is equal to or greater than a threshold and whose sound similarity distance calculated from the readings of the words is equal to or less than a threshold, acquiring each pair of the regular-notation word division candidate and a selected word division candidate as a synonym pair. Synonym pairs can thereby be acquired efficiently.

Moreover, by taking both semantic similarity and sound similarity into account, pairs of synonym candidates can be acquired with high accuracy.

Moreover, even for word division candidates that would be filtered out if acquisition started only from regular notation words, new synonym pairs are acquired starting from the selected word division candidates, which makes it possible to acquire a wider variety of informal notation words.

Moreover, compared with conventional methods, word division candidates can be generated with the correct boundaries for a wide variety of informal notation words.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Arithmetic unit
28 Dictionary database
30 Word division candidate generation unit
32 Semantic vector calculation unit
34 Synonym pair acquisition unit
50 Output unit
100 Synonym pair acquisition device

Claims (5)

A synonym pair acquisition device comprising:
a word division candidate generation unit that generates, from a document, a plurality of word division candidates, each being a regular notation word or an informal notation word that is a candidate for a notation varying from the regular notation word;
a semantic vector calculation unit that calculates a word semantic vector for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a synonym pair acquisition unit that, for each of the word division candidates that is a regular notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires, as a synonym pair, a pair of the word division candidate that is the regular notation word and each selected word division candidate.
The synonym pair acquisition device according to claim 1, wherein the synonym pair acquisition unit:
for each of the word division candidates that is a regular notation word, selects word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquires, as a synonym pair, a pair of the word division candidate that is the regular notation word and each selected word division candidate; and
for each of the selected word division candidates, selects word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquires, as a synonym pair, a pair of the word division candidate that is the regular notation word and each selected word division candidate.
A synonym pair acquisition method comprising:
a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates, each being a regular notation word or an informal notation word that is a candidate for a notation varying from the regular notation word;
a step in which a semantic vector calculation unit calculates a word semantic vector for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a step in which a synonym pair acquisition unit, for each of the word division candidates that is a regular notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires, as a synonym pair, a pair of the word division candidate that is the regular notation word and each selected word division candidate.
The synonym pair acquisition method according to claim 3, wherein the step of acquiring by the synonym pair acquisition unit includes:
for each of the word division candidates that is a regular notation word, selecting word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquiring, as a synonym pair, a pair of the word division candidate that is the regular notation word and each selected word division candidate; and
for each of the selected word division candidates, selecting word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquiring, as a synonym pair, a pair of the word division candidate that is the regular notation word and each selected word division candidate.
A program for causing a computer to function as each unit of the synonym pair acquisition device according to claim 1 or claim 2.
JP2015106871A 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program Active JP6427466B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2015106871A JP6427466B2 (en) 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2015106871A JP6427466B2 (en) 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program

Publications (2)

Publication Number Publication Date
JP2016224482A true JP2016224482A (en) 2016-12-28
JP6427466B2 JP6427466B2 (en) 2018-11-21

Family

ID=57746569

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2015106871A Active JP6427466B2 (en) 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program

Country Status (1)

Country Link
JP (1) JP6427466B2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000222427A (en) * 1999-02-02 2000-08-11 Mitsubishi Electric Corp Related word extracting device, related word extracting method and recording medium with related word extraction program recorded therein
JP2008299868A (en) * 2008-08-04 2008-12-11 Nec Corp Synonym dictionary preparation support system, synonym dictionary preparation support method, and synonym dictionary preparation support program
JP2009176148A (en) * 2008-01-25 2009-08-06 Nec Corp Unknown word determining system, method and program

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020050706A1 (en) * 2018-09-06 2020-03-12 엘지전자 주식회사 Word vector correcting method
US11256869B2 (en) 2018-09-06 2022-02-22 Lg Electronics Inc. Word vector correction method
WO2020188883A1 (en) 2019-03-20 2020-09-24 株式会社Screenホールディングス Synonym determination method, computer-readable recording medium in which synonym determination program is recorded, and synonym determination device
KR20200123544A (en) * 2019-04-22 2020-10-30 넷마블 주식회사 Mehtod for extracting synonyms
KR102189688B1 (en) * 2019-04-22 2020-12-11 넷마블 주식회사 Mehtod for extracting synonyms
US11593420B2 (en) 2020-02-28 2023-02-28 SCREEN Holdings Co., Ltd. Similarity calculation apparatus, recording medium, and similarity calculation method
CN112579794A (en) * 2020-12-25 2021-03-30 清华大学 Method and system for predicting semantic tree for Chinese and English word pairs

Also Published As

Publication number Publication date
JP6427466B2 (en) 2018-11-21

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20170621

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20180514

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20180605

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20180803

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20181023

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20181029

R150 Certificate of patent or registration of utility model

Ref document number: 6427466

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150