JP6427466B2 - Synonym pair acquisition apparatus, method and program - Google Patents

Info

Publication number
JP6427466B2
Authority
JP
Japan
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2015106871A
Other languages
Japanese (ja)
Other versions
JP2016224482A (en
Inventor
いつみ 斉藤
九月 貞光
久子 浅野
義博 松尾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp

Description

The present invention relates to a synonym pair acquisition device, method, and program, and in particular to a device, method, and program for acquiring synonym pairs.

Methods have previously been proposed for acquiring variant-notation words (崩れ表記語), that is, informal or irregular spellings that deviate from a standard-notation word (正規表記語). Methods that use supervised training data include those described in Non-Patent Documents 1 and 2.

Methods that do not use training data include those described in Non-Patent Documents 3 and 4.

[Non-Patent Document 1]: Naoaki Okazaki and Jun'ichi Tsujii, "Automatic Acquisition of Abbreviation Definitions Using an Alignment Discrimination Model," 14th Annual Meeting of the Association for Natural Language Processing (NLP2008), pp. 139-142.
[Non-Patent Document 2]: Yoshinari Fujinuma, Hikaru Yokono, and Akiko Aizawa, "Detection and Analysis of Variant Notations on Twitter (R), Using 'おはよう' as an Example," 27th Annual Conference of the Japanese Society for Artificial Intelligence, 2013.06.
[Non-Patent Document 3]: Takeshi Masuyama and Satoshi Sekine, "Automatic Construction of a List of Katakana Spelling Variants from a Large-Scale Corpus," 14th Annual Meeting of the Association for Natural Language Processing (NLP2004).
[Non-Patent Document 4]: Kazushi Ikeda, Tadashi Yanagihara, Kazunori Matsumoto, and Yasuhiro Takishima, "An Automatic Normalization-Rule Generation Method for Analyzing Colloquial Expressions with High Accuracy," Journal of the Information Processing Society of Japan, vol. 3, no. 3, pp. 68-77, 2010.
[Non-Patent Document 5]: T. Kudo, Japanese Morphological Analyzer, Internet <URL: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html>

However, extracting variant-notation words with the supervised methods of Non-Patent Documents 1 and 2 requires manually creating correct pairs, such as those shown in FIG. 7, from Web data, so the cost of producing the correct pairs is high.

Methods that do not use training data, on the other hand, restrict the variant words that can be acquired to a limited set of candidates (katakana words, words that an existing analyzer treats as unknown, and so on), and therefore cannot acquire a wide variety of variant notations. This is because existing analyzers often mis-analyze variant-notation words. Japanese has no delimiter such as a space between words, so it is generally difficult to determine the correct morpheme boundaries in ordinary text. In addition, the Web contains many variant-notation words written in hiragana, in a mix of kanji and hiragana, in a mix of katakana and hiragana, and so on, which are difficult to analyze. Examples include 「すげー」, 「やば」, 「さみい」, 「サムい」, and 「寒っ」. FIG. 8 shows an example of the result of analyzing a sentence containing variant-notation words with Mecab (IPAdic), described in Non-Patent Document 5.

The present invention has been made to solve the above problems, and its object is to provide a synonym pair acquisition device, method, and program that can acquire synonym pairs efficiently.

To achieve the above object, the synonym pair acquisition device according to the first invention comprises: a word division candidate generation unit that generates, from a document, a plurality of word division candidates, each of which is either a standard-notation word or a variant-notation word that is a candidate for a notation deviating from a standard-notation word; a semantic vector calculation unit that calculates a semantic vector of the word for each of the plurality of word division candidates, based on the candidates generated by the word division candidate generation unit; and a synonym pair acquisition unit that, for each word division candidate that is a standard-notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires each pair of the standard-notation word division candidate and a selected word division candidate as a synonym pair.

In the synonym pair acquisition device according to the first invention, the synonym pair acquisition unit may, for each word division candidate that is a standard-notation word, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquire each pair of the standard-notation word division candidate and a selected word division candidate as a synonym pair; it may then, for each of the selected word division candidates, select further word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquire each pair of the standard-notation word division candidate and a further selected word division candidate as a synonym pair.

The synonym pair acquisition method according to the second invention comprises: a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates, each of which is either a standard-notation word or a variant-notation word that is a candidate for a notation deviating from a standard-notation word; a step in which a semantic vector calculation unit calculates a semantic vector of the word for each of the plurality of word division candidates, based on the candidates generated by the word division candidate generation unit; and a step in which a synonym pair acquisition unit, for each word division candidate that is a standard-notation word, selects word division candidates from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words, and acquires each pair of the standard-notation word division candidate and a selected word division candidate as a synonym pair.

In the synonym pair acquisition method according to the second invention, the acquiring step performed by the synonym pair acquisition unit may, for each word division candidate that is a standard-notation word, select word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquire each pair of the standard-notation word division candidate and a selected word division candidate as a synonym pair; it may then, for each of the selected word division candidates, select further word division candidates from the plurality of word division candidates based on the semantic similarity and the sound similarity, and acquire each pair of the standard-notation word division candidate and a further selected word division candidate as a synonym pair.

The program according to the third invention is a program for causing a computer to function as each unit of the synonym pair acquisition device according to the first invention.

According to the synonym pair acquisition device, method, and program of the present invention, a plurality of word division candidates, each a standard-notation word or a variant-notation word, are generated from a document; a semantic vector of the word is calculated for each of the word division candidates; for each word division candidate that is a standard-notation word, word division candidates are selected from the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from the readings of the words; and each pair of the standard-notation word division candidate and a selected word division candidate is acquired as a synonym pair. This has the effect that synonym pairs can be acquired efficiently.

FIG. 1 is a block diagram showing the configuration of the synonym pair acquisition device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the sound similarity.
FIG. 3 is a conceptual diagram showing an example of synonym pair acquisition.
FIG. 4 is a diagram showing an example of selecting word division candidates starting from a standard-notation word.
FIG. 5 is a diagram showing an example of selecting further word division candidates starting from a selected word division candidate.
FIG. 6 is a flowchart showing the synonym pair acquisition processing routine in the synonym pair acquisition device according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of combinations of standard-notation words and variant-notation words.
FIG. 8 is a diagram showing an example of the result of analyzing a sentence containing variant-notation words with Mecab.

An embodiment of the present invention is described in detail below with reference to the drawings.

<Configuration of the Synonym Pair Acquisition Device According to the Embodiment of the Present Invention>

The configuration of the synonym pair acquisition device according to the embodiment of the present invention is described first. As shown in FIG. 1, the synonym pair acquisition device 100 can be implemented as a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the synonym pair acquisition processing routine described later. Functionally, as shown in FIG. 1, the synonym pair acquisition device 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

The input unit 10 receives a document set consisting of documents that contain variant-notation words.

The calculation unit 20 includes a dictionary database 28, a word division candidate generation unit 30, a semantic vector calculation unit 32, and a synonym pair acquisition unit 34.

The dictionary database 28 stores the dictionary (readings, notations, and parts of speech) needed for dictionary lookup.

The word division candidate generation unit 30 generates, from each document in the document set received by the input unit 10, a plurality of word division candidates, each of which is either a standard-notation word or a variant-notation word that is a candidate for a notation deviating from a standard-notation word.

Specifically, the word division candidate generation unit 30 generates word division candidates by applying each of the following first to third methods, which are existing word division methods, to the documents. The methods are chosen so that variant-notation words that do not exist in the dictionary database 28 can also be output as segmentation candidates.

As the first method, the word division candidate generation unit 30 applies a word division method based on pointwise estimation to each document in the document set to generate word division candidates. This method divides a document into word division candidates using a model of boundaries between characters whose features include character n-grams and character-type n-grams.
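The kind of features such a boundary model might use can be sketched as follows. This is only an illustration under stated assumptions: the function names `char_type` and `boundary_features`, the window size, and the coarse character classes are hypothetical and are not specified in the source, which only says that character n-grams and character-type n-grams serve as features.

```python
import unicodedata


def char_type(ch: str) -> str:
    """Return a coarse character class derived from the Unicode name (assumed classes)."""
    name = unicodedata.name(ch, "")
    if "HIRAGANA" in name:
        return "H"
    if "KATAKANA" in name:
        return "K"
    if "CJK UNIFIED" in name:
        return "C"  # kanji
    if ch.isdigit():
        return "N"
    if ch.isalpha():
        return "A"
    return "O"


def boundary_features(text: str, i: int, window: int = 2) -> list:
    """Features for the potential boundary between text[i-1] and text[i]."""
    left = text[max(0, i - window):i]
    right = text[i:i + window]
    types_left = "".join(char_type(c) for c in left)
    types_right = "".join(char_type(c) for c in right)
    return [
        f"L:{left}", f"R:{right}", f"LR:{left}|{right}",                         # character n-grams
        f"TL:{types_left}", f"TR:{types_right}", f"TLR:{types_left}|{types_right}",  # character-type n-grams
    ]


# Features around the boundary between 「は」 and 「サ」 in 「今日はサムい」
print(boundary_features("今日はサムい", 3))
```

A classifier trained on such features would then decide, independently at each position, whether a word boundary is present; the classifier itself is outside the scope of this sketch.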

As the second method, the word division candidate generation unit 30 applies a word division method based on unsupervised analysis to the document set to generate word division candidates. This method computes statistics such as appearance frequencies for sampled word division candidates and divides each document into word division candidates so that an objective function is optimized.

As the third method, the word division candidate generation unit 30 obtains, for each document in the document set, an analysis result from Mecab or a similar analyzer and generates word division candidates by partially merging the analyzed units according to predetermined rules. Example rules are to merge consecutive unknown words and to merge consecutive nouns. The following approach may also be used as a rule. When short sentences are cut out of Twitter (R) or a similar source and used as word division candidates, a set of delimiter characters is defined (for example, line breaks, symbolic expressions such as 「!」, 「w」, and 「♪」, and punctuation marks such as 「、」 and 「。」), and the text is split at the delimiters. With this setting, for the sentence 「やっべぇぇwwwwwwwwwww」, the string 「やっべぇぇ」 preceding 「w」 is obtained as a word division candidate. Likewise, for the sentence 「おっはよお♪ ってお昼だけど・・・ 今起きた・・・」, the string 「おっはよお」 preceding 「♪」 is obtained as a word division candidate. Character strings of n characters or fewer obtained in this way may be added to the morpheme dictionary before the analysis is performed.
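The delimiter-based extraction of short strings described above can be sketched as follows. The delimiter set is taken from the examples in the text (with full-width variants added as an assumption); the function name `short_candidates` and the default length limit `max_len=6` are hypothetical, since the source only speaks of "n characters or fewer".

```python
# Assumed delimiter set: line break, "!", "w", "♪", "、", "。" and full-width variants.
DELIMITERS = set("\n!！wｗ♪、。")


def short_candidates(text, max_len=6):
    """Split text at delimiter characters and keep the non-empty pieces
    that are at most max_len characters long."""
    pieces, current = [], []
    for ch in text:
        if ch in DELIMITERS:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    stripped = (p.strip() for p in pieces)
    return [p for p in stripped if 0 < len(p) <= max_len]


print(short_candidates("やっべぇぇwwwwwwwwwww"))
print(short_candidates("おっはよお♪ ってお昼だけど・・・ 今起きた・・・"))
```

Note that treating 「w」 as a delimiter suits informal Japanese text, where it marks laughter, but it would also split ordinary Latin-script words; a practical system would probably need a more careful rule.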

The semantic vector calculation unit 32 calculates a semantic vector of the word for each of the plurality of word division candidates, based on the word division candidates generated by the word division candidate generation unit 30.

Specifically, the semantic vector calculation unit 32 takes the document set with word boundaries inserted so that the word division candidates generated by the word division candidate generation unit 30 are enumerated, and calculates a semantic vector for each word that appears as a word division candidate. An existing method can be used to obtain the semantic vector of each word; a representative example is word2vec, described in Non-Patent Document 6.

[Non-Patent Document 6]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

For each word division candidate that is a standard-notation word, the synonym pair acquisition unit 34 selects, from the plurality of word division candidates, those whose semantic similarity calculated from the semantic vectors is at or above a threshold and whose sound similarity calculated from the readings of the words is also at or above a threshold, and acquires each pair of the standard-notation word division candidate and a selected word division candidate as a synonym pair. The synonym pair acquisition unit 34 further selects, for each of the selected word division candidates, word division candidates whose semantic similarity and sound similarity to it are both at or above the thresholds, and acquires each pair of the standard-notation word division candidate and a further selected word division candidate as a synonym pair. FIG. 3 shows a conceptual diagram of the processing of the synonym pair acquisition unit 34.

Specifically, the synonym pair acquisition unit 34 first calculates, for each word division candidate that is a standard-notation word, the semantic similarity to each of the other word division candidates. The semantic similarity is the cosine similarity of the per-word semantic vectors obtained by the semantic vector calculation unit 32.
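For reference, the cosine similarity of two vectors u and v is their dot product divided by the product of their Euclidean norms. A minimal sketch follows; the function name and the convention of returning 0.0 for a zero vector are assumptions made for this illustration, not details given in the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors (not real word2vec output)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```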

Next, the synonym pair acquisition unit 34 calculates, for each word division candidate that is a standard-notation word, the sound similarity to each of the other word division candidates. In this embodiment, a sound similarity distance is calculated as the sound similarity, based on the readings of the word division candidates. Readings are estimated for kanji notation, and katakana notation is converted to hiragana. The conversion costs are set as follows. Converting a character to the same character costs 0. Deleting a vowel (including the small forms ぁ, ぃ, ぅ, ぇ, ぉ), the geminate consonant mark (っ), the moraic nasal (ん), or a long-vowel mark costs 0; however, such a deletion at the beginning of a word costs 1 and increments the sound similarity distance. Substituting a character with another character in the same row or the same column of the Japanese hiragana syllabary (gojūon) table costs 0, where a voiced or semi-voiced character is treated as occupying the same position as its unvoiced counterpart; substitutions between a vowel and the geminate consonant mark, between a vowel and a long-vowel mark, and between two vowels also cost 0. For example, 「ぶ」 or 「ぷ」 → 「ふ」 costs 0, as does substitution among the characters in the same row or column as 「ふ」 (はひふへほ, うくすつぬむゆる). Any other conversion costs 1 and increments the sound similarity distance. FIG. 2 shows an example of calculating the sound similarity distance. Because this embodiment keeps candidates whose sound similarity is at or above a threshold, it keeps those whose sound similarity distance is at or below a threshold.
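One way to read the cost rules above is as a weighted edit distance. The sketch below is an interpretation, not the patented implementation: it assumes readings are already given in kana (kanji reading estimation is omitted), treats insertion symmetrically with deletion, and uses Unicode NFD decomposition to map voiced and semi-voiced kana to their unvoiced base. All function and constant names are hypothetical.

```python
import unicodedata

# Gojuon table rows; "_" marks empty cells.
GOJUON = ["あいうえお", "かきくけこ", "さしすせそ", "たちつてと", "なにぬねの",
          "はひふへほ", "まみむめも", "や_ゆ_よ", "らりるれろ", "わ___を"]
POSITION = {ch: (r, c) for r, row in enumerate(GOJUON)
            for c, ch in enumerate(row) if ch != "_"}
SMALL = dict(zip("ぁぃぅぇぉゃゅょ", "あいうえおやゆよ"))
VOWELS = set("あいうえお")
FREE_DELETE = set("あいうえおぁぃぅぇぉっんー")  # deletable at cost 0 unless word-initial


def to_hiragana(text):
    """Shift katakana code points to hiragana; other characters are unchanged."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def base(ch):
    """Map small kana to full size and strip (semi-)voicing marks."""
    ch = SMALL.get(ch, ch)
    return unicodedata.normalize("NFD", ch)[0]


def deletion_cost(word, k):
    return 0 if (word[k] in FREE_DELETE and k > 0) else 1


def substitution_cost(x, y):
    if x == y:
        return 0
    bx, by = base(x), base(y)
    if bx == by:  # e.g. ぶ or ぷ versus ふ
        return 0
    if (bx in VOWELS and y in "っー") or (by in VOWELS and x in "っー"):
        return 0
    if bx in POSITION and by in POSITION:
        (r1, c1), (r2, c2) = POSITION[bx], POSITION[by]
        if r1 == r2 or c1 == c2:  # same row or same column
            return 0
    return 1


def sound_distance(a, b):
    """Weighted edit distance between two kana readings."""
    a, b = to_hiragana(a), to_hiragana(b)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + deletion_cost(a, i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + deletion_cost(b, j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + deletion_cost(a, i - 1),
                d[i][j - 1] + deletion_cost(b, j - 1),
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
    return d[len(a)][len(b)]


print(sound_distance("さむい", "さみぃ"))  # same-row and small-vowel substitutions only
print(sound_distance("いぬ", "ぬ"))        # word-initial vowel deletion is penalized
```

Under these rules many unrelated words are also at distance 0 (for instance, every character of 「さむい」 shares a column with the corresponding character of 「あつい」), which suggests why the sound criterion is combined with the semantic similarity and the dictionary check rather than used alone.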

Next, the synonym pair acquisition unit 34 acquires synonym pairs by performing the first acquisition process and the second acquisition process described below for each standard-notation word division candidate obtained from the document set. In the first acquisition process, the following processing is performed for each standard-notation word division candidate obtained from the document set.

First, for the standard-notation word division candidate in question, the other word division candidates appearing in the document set are filtered to retain those whose semantic similarity to it is at or above a predetermined threshold. Next, the retained candidates are filtered to keep those whose sound similarity to the standard-notation word is at or above a predetermined threshold (that is, whose sound similarity distance is at or below a threshold). Then, from the retained candidates, any candidate whose notation exists in the dictionary database 28 as a standard-notation word with the same part of speech as the standard-notation word in question is removed. The synonym pair acquisition unit 34 selects the candidates that remain after this removal. In this way, each pair of the standard-notation word division candidate and a selected word division candidate is acquired as a synonym pair. FIG. 4 shows an example of the first acquisition process, in which word division candidates are selected starting from the standard-notation word division candidate 「さむい」.
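The three filtering stages of the first acquisition process can be sketched as below. This is a simplified illustration: the similarity and distance functions are passed in as callables, the dictionary is modeled as a plain mapping from notation to part of speech, and the threshold defaults are arbitrary. The dictionary check follows one reading of the text (remove a candidate that is itself a dictionary word with the same part of speech as the starting word); all names are hypothetical.

```python
def first_acquisition(seed, candidates, semantic_similarity, sound_distance,
                      dictionary, sim_threshold=0.5, dist_threshold=0):
    """Return (seed, candidate) pairs that pass all three filters."""
    seed_pos = dictionary.get(seed)
    selected = []
    for cand in candidates:
        if cand == seed:
            continue
        # Stage 1: semantic similarity must be at or above the threshold.
        if semantic_similarity(seed, cand) < sim_threshold:
            continue
        # Stage 2: sound similarity distance must be at or below the threshold.
        if sound_distance(seed, cand) > dist_threshold:
            continue
        # Stage 3: drop dictionary words with the same part of speech as the seed.
        if cand in dictionary and dictionary[cand] == seed_pos:
            continue
        selected.append(cand)
    return [(seed, cand) for cand in selected]


# Toy data, made up for illustration only.
sims = {"さみぃ": 0.8, "あつい": 0.7, "ねこ": 0.1}
dists = {"さみぃ": 0, "あつい": 0, "ねこ": 2}
dictionary = {"さむい": "adjective", "あつい": "adjective", "ねこ": "noun"}
pairs = first_acquisition("さむい", ["さむい", "さみぃ", "あつい", "ねこ"],
                          lambda a, b: sims[b], lambda a, b: dists[b], dictionary)
print(pairs)
```

In this toy run 「あつい」 passes both similarity filters but is removed by the dictionary check, which illustrates how that stage can keep distinct standard words from being paired as spelling variants.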

Next, for the standard-notation word division candidate in question, the synonym pair acquisition unit 34 performs the second acquisition process, which starts from the word division candidates that were selected as synonym pairs for that standard-notation word in the first acquisition process. First, for each word division candidate selected in the first acquisition process, the semantic similarity and the sound similarity distance to each of the other word division candidates are calculated. Then the following processing is performed for each word division candidate selected as a synonym pair for the standard-notation word division candidate.

For the word division candidate in question, the other word division candidates appearing in the document set are filtered to retain those whose semantic similarity to it is at or above a predetermined threshold. Next, the retained candidates are filtered to keep those whose sound similarity distance to the word division candidate in question is at or below a predetermined threshold. Then, from the retained candidates, any candidate whose notation exists in the dictionary database 28 as a standard-notation word with the same part of speech as the standard-notation word in question is removed. The synonym pair acquisition unit 34 selects the candidates that remain after this removal. In this way, each pair of the standard-notation word division candidate and a selected word division candidate is acquired as a synonym pair. FIG. 5 shows an example of the second acquisition process, in which word division candidates are selected starting from 「さみぃ」, the word division candidate selected for the standard-notation word division candidate 「さむい」 in the first acquisition process. Furthermore, the synonym pair acquisition unit 34 repeats the same processing as the second acquisition process a predetermined number of times, each time starting from the word division candidates selected in the preceding pass, and acquires each pair of the standard-notation word division candidate and a selected word division candidate as a synonym pair.
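The chained selection described above, where each pass starts from the candidates selected in the previous pass while every acquired pair still refers back to the original standard-notation word, can be sketched as follows. The predicate `is_match` stands in for the combined semantic, sound, and dictionary filters; it, the function name, the `passes` parameter, and the choice to skip already selected candidates are assumptions of this sketch.

```python
def acquire_pairs(seed, candidates, is_match, passes=2):
    """Expand from the seed for a fixed number of passes.

    Pass 1 corresponds to the first acquisition process (starting at the seed);
    later passes start from the candidates selected in the preceding pass.
    """
    found = []
    frontier = [seed]
    for _ in range(passes):
        next_frontier = []
        for origin in frontier:
            for cand in candidates:
                if cand == seed or cand in found:
                    continue  # assumed: do not re-select known words
                if is_match(origin, cand):
                    found.append(cand)
                    next_frontier.append(cand)
        frontier = next_frontier
    # Every selected candidate is paired with the original standard-notation word.
    return [(seed, cand) for cand in found]


# Toy match relation, made up for illustration: each variant matches only its neighbor.
edges = {("さむい", "さみぃ"), ("さみぃ", "さみー"), ("さみー", "さみーー")}
candidates = ["さみぃ", "さみー", "さみーー", "ねこ"]
print(acquire_pairs("さむい", candidates, lambda a, b: (a, b) in edges, passes=2))
```

Starting later passes from already selected variants lets the procedure reach spellings that are too far from the standard word to pass the thresholds directly, at the cost of possible drift, which the fixed number of passes limits.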

<Operation of the Synonym Pair Acquisition Device According to the Embodiment of the Present Invention>

Next, the operation of the synonym pair acquisition device 100 according to the embodiment of the present invention is described. When the input unit 10 receives a document set consisting of documents that contain variant-notation words, the synonym pair acquisition device 100 executes the synonym pair acquisition processing routine shown in FIG. 6.

First, in step S100, a plurality of word division candidates are generated from each document in the document set received by the input unit 10.

Next, in step S102, a semantic vector of the word is calculated for each of the word division candidates, based on the plurality of word division candidates generated in step S100.

In step S104, for each word division candidate generated in step S100 that is a standard-notation word, the semantic similarity to each of the other word division candidates is calculated based on the semantic vectors calculated in step S102.

In step S106, for each word division candidate generated in step S100 that is a regular notation word, the sound similarity distance from each of the other word division candidates is calculated based on the readings of the word division candidates.
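One simple way to obtain a distance from two readings is a character-level edit distance, sketched below. This is an assumption made for illustration only: this passage does not define the sound similarity distance in detail, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two reading strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

A smaller value means the two readings are closer, which matches the "equal to or less than a threshold" condition used when candidates are selected.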

In step S108, for each word division candidate that is a regular notation word, the word division candidates whose semantic similarity calculated in step S104 is equal to or greater than a threshold, and whose sound similarity distance calculated in step S106 is equal to or less than a threshold, are selected from the plurality of word division candidates. Each pair of the regular-notation word division candidate and a selected word division candidate is acquired as a synonym pair.

In step S110, for each word division candidate that is a regular notation word, each word division candidate selected in step S108 or in the previous execution of step S110 is taken as a starting point. The word division candidates whose semantic similarity to the starting point, calculated as in step S104, is equal to or greater than the threshold, and whose sound similarity distance from it, calculated as in step S106, is equal to or less than the threshold, are selected from the plurality of word division candidates. Each pair of that regular-notation word division candidate and a selected word division candidate is acquired as a synonym pair.

In step S112, it is determined whether the processing of step S110 has been repeated a predetermined number of times. If so, the routine proceeds to step S114; if not, it returns to step S110 and repeats the processing.
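The control flow of steps S108 to S112 can be sketched as an expansion loop that restarts the selection from the candidates chosen in the previous round while always pairing the result with the original regular notation word. The function `acquire_synonym_pairs`, the callable `select` (standing in for the threshold-based selection) and the `seen` set that avoids re-adding candidates are hypothetical details added for this illustration.

```python
def acquire_synonym_pairs(regular_words, select, rounds):
    """Collect (regular word, candidate) pairs over repeated selection rounds.

    `select(origin)` is assumed to return the candidates that pass the
    similarity thresholds relative to `origin`.
    """
    pairs = set()
    for regular in regular_words:
        # Step S108: select starting from the regular notation word itself.
        frontier = set(select(regular))
        seen = set(frontier)
        pairs.update((regular, word) for word in frontier)
        # Steps S110 and S112: repeat a predetermined number of times,
        # starting from the candidates selected in the previous round.
        for _ in range(rounds):
            next_frontier = set()
            for origin in frontier:
                for word in select(origin):
                    if word != regular and word not in seen:
                        next_frontier.add(word)
            seen |= next_frontier
            pairs.update((regular, word) for word in next_frontier)
            frontier = next_frontier
    return pairs
```

With `rounds` set to 0 only the first acquisition is performed; each additional round can reach broken notations that are too far from the regular notation word to be selected directly.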

In step S114, the synonym pairs acquired in steps S108 and S110 are output to the output unit 50, and the processing ends.

As described above, the synonym pair acquisition apparatus according to the embodiment of the present invention generates, from documents, a plurality of word division candidates that are regular notation words or broken notation words, and calculates a semantic vector of the word for each of the plurality of word division candidates based on those candidates. For each word division candidate that is a regular notation word, it selects from the plurality of word division candidates those whose semantic similarity calculated from the semantic vectors is equal to or greater than a threshold and whose sound similarity distance calculated from the word readings is equal to or less than a threshold, and acquires each pair of the regular-notation word division candidate and a selected word division candidate as a synonym pair. Synonym pairs can thereby be acquired efficiently.

In addition, considering both the semantic similarity and the sound similarity makes it possible to acquire synonym candidate pairs with high accuracy.

In addition, even for word division candidates that would be filtered out by acquisition starting only from the regular notation word, acquiring new synonym pairs starting from the selected word division candidates makes it possible to acquire a wider variety of broken notation words.

In addition, compared with conventional methods, word division candidates can be generated as correct segmentations of a wide variety of broken notation words.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

Reference Signs List
10 input unit
20 operation unit
28 dictionary database
30 word division candidate generation unit
32 semantic vector calculation unit
34 synonym pair acquisition unit
50 output unit
100 synonym pair acquisition apparatus

Claims (5)

A synonym pair acquisition apparatus comprising:
a word division candidate generation unit that generates, from a document, a plurality of word division candidates including word division candidates that are regular notation words and word division candidates that are broken notation words, a broken notation word being a candidate for a variant notation of a regular notation word;
a semantic vector calculation unit that calculates a semantic vector of a word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a synonym pair acquisition unit that, for each word division candidate that is a regular notation word, filters the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from word readings, selects from the filtered word division candidates those remaining after excluding any word division candidate that has the same notation as a predetermined regular notation word and the same part of speech as that regular notation word, and acquires each pair of the word division candidate that is the regular notation word and a selected word division candidate as a synonym pair.
The synonym pair acquisition apparatus according to claim 1, wherein the synonym pair acquisition unit,
for each of the selected word division candidates, filters the plurality of word division candidates based on the semantic similarity and the sound similarity, further selects from the filtered word division candidates those remaining after excluding any word division candidate that has the same notation as a predetermined regular notation word and the same part of speech as that regular notation word, and acquires each pair of the word division candidate that is the regular notation word and a further selected word division candidate as a synonym pair.
A synonym pair acquisition method comprising:
a step in which a word division candidate generation unit generates, from a document, a plurality of word division candidates including word division candidates that are regular notation words and word division candidates that are broken notation words, a broken notation word being a candidate for a variant notation of a regular notation word;
a step in which a semantic vector calculation unit calculates a semantic vector of a word for each of the plurality of word division candidates, based on the plurality of word division candidates generated by the word division candidate generation unit; and
a step in which a synonym pair acquisition unit, for each word division candidate that is a regular notation word, filters the plurality of word division candidates based on a semantic similarity calculated from the semantic vectors and a sound similarity calculated from word readings, selects from the filtered word division candidates those remaining after excluding any word division candidate that has the same notation as a predetermined regular notation word and the same part of speech as that regular notation word, and acquires each pair of the word division candidate that is the regular notation word and a selected word division candidate as a synonym pair.
The synonym pair acquisition method according to claim 3, wherein the step of acquiring by the synonym pair acquisition unit includes,
for each of the selected word division candidates, filtering the plurality of word division candidates based on the semantic similarity and the sound similarity, further selecting from the filtered word division candidates those remaining after excluding any word division candidate that has the same notation as a predetermined regular notation word and the same part of speech as that regular notation word, and acquiring each pair of the word division candidate that is the regular notation word and a further selected word division candidate as a synonym pair.
A program for causing a computer to function as each unit of the synonym pair acquisition apparatus according to claim 1 or claim 2.
JP2015106871A 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program Active JP6427466B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2015106871A JP6427466B2 (en) 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2015106871A JP6427466B2 (en) 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program

Publications (2)

Publication Number Publication Date
JP2016224482A JP2016224482A (en) 2016-12-28
JP6427466B2 JP6427466B2 (en) 2018-11-21

Family

ID=57746569

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2015106871A Active JP6427466B2 (en) 2015-05-26 2015-05-26 Synonym pair acquisition apparatus, method and program

Country Status (1)

Country Link
JP (1) JP6427466B2 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020050706A1 (en) * 2018-09-06 2020-03-12 엘지전자 주식회사 Word vector correcting method
US11256869B2 (en) 2018-09-06 2022-02-22 Lg Electronics Inc. Word vector correction method
JP7323308B2 (en) 2019-03-20 2023-08-08 株式会社Screenホールディングス Synonym determination method, synonym determination program, and synonym determination device
KR102189688B1 (en) * 2019-04-22 2020-12-11 넷마블 주식회사 Mehtod for extracting synonyms
JP7457531B2 (en) 2020-02-28 2024-03-28 株式会社Screenホールディングス Similarity calculation device, similarity calculation program, and similarity calculation method
CN112579794B (en) * 2020-12-25 2022-11-11 清华大学 Method and system for predicting semantic tree for Chinese and English word pairs

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000222427A (en) * 1999-02-02 2000-08-11 Mitsubishi Electric Corp Related word extracting device, related word extracting method and recording medium with related word extraction program recorded therein
JP2009176148A (en) * 2008-01-25 2009-08-06 Nec Corp Unknown word determining system, method and program
JP4245078B2 (en) * 2008-08-04 2009-03-25 日本電気株式会社 Synonym dictionary creation support system, synonym dictionary creation support method, and synonym dictionary creation support program

Also Published As

Publication number Publication date
JP2016224482A (en) 2016-12-28

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20170621

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20180514

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20180605

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20180803

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20181023

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20181029

R150 Certificate of patent or registration of utility model

Ref document number: 6427466

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150