JP5524138B2 - Synonym dictionary generating apparatus, method and program thereof - Google Patents

Synonym dictionary generating apparatus, method and program thereof

Info

Publication number
JP5524138B2
Authority
JP
Japan
Prior art keywords
vocabulary
synonym
similarity
context
notation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2011148198A
Other languages
Japanese (ja)
Other versions
JP2013016011A (en
Inventor
真詞 田本
敏 高橋
理 吉岡
浩和 政瀧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2011148198A priority Critical patent/JP5524138B2/en
Publication of JP2013016011A publication Critical patent/JP2013016011A/en
Application granted granted Critical
Publication of JP5524138B2 publication Critical patent/JP5524138B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

The present invention relates to a synonym dictionary generation method, a synonym dictionary generation device, and a program therefor that determine whether words are synonymous, register words in a synonymous relationship in association with each other, and generate a synonym dictionary.

A synonym dictionary is a dictionary that groups words having the same meaning but different notations. For example, when a single word is entered as a search term in information retrieval, the synonym dictionary is used to expand that search term so that the user can easily find the intended information.

Patent Document 1 is known as a conventional synonym dictionary generation method. It discloses a synonym dictionary generation system that generates a synonym dictionary by defining the degree of association between words based on the search behavior of specific users.

Patent Document 1: JP H11-312168 A

Conventional synonym dictionary creation methods build a synonym dictionary from document text and do not assume that a dictionary will be built from speech text such as speech recognition results. Consequently, when speech text containing word deletions, insertions, recognition errors, and the like is used to determine whether multiple words are synonymous, the accuracy is expected to degrade. Here, document text means text information originally created as a document, such as newspapers, magazines, or the web. Speech text means text information originally derived from speech, for example the text obtained by applying speech recognition to recorded audio of a monologue by one speaker (a lecture, a speech, etc.), a dialogue between two speakers, or a conversation among three or more speakers.

An object of the present invention is to provide a synonym dictionary generation technique capable of creating a highly accurate synonym dictionary based not only on document text but also on speech text.

To solve the above problem, according to a first aspect of the present invention, three similarities are calculated: the similarity between a context containing a reference vocabulary, which serves as the reference when creating a synonym dictionary, and a context containing a related vocabulary related to the reference vocabulary; the similarity between the notation of the reference vocabulary and the notation of the related vocabulary; and the similarity between the reading of the reference vocabulary and the reading of the related vocabulary. A synonym index, which indicates the likelihood that the reference vocabulary and the related vocabulary are synonyms, indicates a higher likelihood the more similar the two contexts are, a higher likelihood the more similar the two notations are, and a higher likelihood the less similar the two readings are. The synonym index for the reference vocabulary and the related vocabulary is obtained from the calculated context, notation, and reading similarities, and whether the related vocabulary is a synonym of the reference vocabulary is determined based on the magnitude of that synonym index.

The synonym dictionary generation technique according to the present invention has the effect that a highly accurate synonym dictionary can be created based not only on document text but also on speech text.

FIG. 1 is a functional block diagram of the synonym dictionary generation device 11.
FIG. 2 is a diagram showing the processing flow of the synonym dictionary generation device 11.
FIG. 3 is a diagram showing an example of data stored in the storage unit 22.
FIG. 4A is a diagram showing an example of data stored in the vocabulary information storage unit 16 (combinations of reference vocabulary and related vocabulary), and FIG. 4B is a diagram showing an example of data stored in the vocabulary information storage unit 16 (vocabulary information).
FIG. 5 is a diagram showing an example of data stored in the synonym information storage unit 21.
FIG. 6 is a diagram showing an example of data stored in the vocabulary information storage unit 16 (vocabulary information and concept vectors).

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are denoted by the same reference numerals, and redundant description is omitted.

<Synonym Dictionary Generation Device 11 According to First Embodiment>
A synonym dictionary generation device 11 according to an embodiment of the present invention is described with reference to FIGS. 1 and 2. The synonym dictionary generation device 11 may be implemented on a known computer that includes program execution means such as a CPU, storage means such as a memory and a hard disk storage device, input means such as a keyboard and a mouse, and display means such as a monitor. Alternatively, it may be implemented as a dedicated device for synonym dictionary generation that includes equivalent means.

Functionally, as shown in FIG. 1, the synonym dictionary generation device 11 includes a related vocabulary acquisition unit 12, a text information storage unit 13, a context acquisition unit 14, a context similarity calculation unit 15, a vocabulary information storage unit 16, a notation similarity calculation unit 17, a reading similarity calculation unit 18, a part-of-speech similarity calculation unit 19, a synonym determination unit 20, a synonym information storage unit 21, and a storage unit 22.

<Processing Flow>
An outline of the processing flow of the synonym dictionary generation device 11 is described with reference to FIG. 2. The synonym dictionary generation device 11 acquires a reference vocabulary from the input means or from another device (s201) and stores it in the storage unit 22 (see c01 in FIG. 3). A vocabulary is a concept consisting of a word and its meaning; in this embodiment, a vocabulary is represented by the notation, reading, and part of speech of a word. Information containing the notation, reading, and part of speech of a word is called vocabulary information. The reference vocabulary is the vocabulary that serves as the reference when creating a synonym dictionary, and it is the vocabulary looked up when searching the synonym dictionary for synonyms. In other words, a system or user of the synonym dictionary can search the dictionary using the reference vocabulary as a key and obtain the synonyms of that reference vocabulary.

The related vocabulary acquisition unit 12 receives the reference vocabulary from the storage unit 22, acquires related vocabularies related to the reference vocabulary from the vocabulary information storage unit 16, also acquires the notation, reading, and part of speech of the reference vocabulary and of each related vocabulary (s202), and stores them in the storage unit 22 (c02, c03, c04 in FIG. 3). A related vocabulary is a vocabulary related to the reference vocabulary; in other words, it is a candidate synonym of the reference vocabulary, that is, a vocabulary for which it is to be determined whether it is a synonym of the reference vocabulary. For example, an existing synonym dictionary may be stored in the vocabulary information storage unit 16. In that case, each reference vocabulary is stored in combination with one or more related vocabularies for it (FIG. 4A), and the vocabulary information storage unit 16 also stores the vocabulary information (FIG. 4B).

Next, the context acquisition unit 14 receives the reference vocabulary and the related vocabulary from the storage unit 22, acquires contexts containing the reference vocabulary and the related vocabulary from the text information storage unit 13 (s203), and stores them in the storage unit 22 (c05 in FIG. 3). In this embodiment, a context means a sequence of words or a set of words (hereinafter also called a "word string") and consists of (I) speech text, (II) a set of words obtained from speech text, or (III) collocation data (described in detail later). In the following, a context containing the reference vocabulary is called a reference vocabulary context, and a context containing the related vocabulary is called a related vocabulary context.

Next, the context similarity calculation unit 15 receives the reference vocabulary context and the related vocabulary context from the storage unit 22, obtains the similarity between them (hereinafter "context similarity") (s204), and stores it in the storage unit 22 (c06 in FIG. 3).

In addition, the notation similarity calculation unit 17, the reading similarity calculation unit 18, and the part-of-speech similarity calculation unit 19 receive from the storage unit 22 the notations, readings, and parts of speech, respectively, of the reference vocabulary and the related vocabulary, obtain the similarity of the notations, readings, and parts of speech (hereinafter "notation similarity", "reading similarity", and "part-of-speech similarity", respectively) (s206), and store them in the storage unit 22 (c07, c08, c09 in FIG. 3).

The synonym determination unit 20 receives the context similarity, notation similarity, reading similarity, and part-of-speech similarity from the storage unit 22 and uses these values to obtain a synonym index for the reference vocabulary and the related vocabulary (s207). The synonym index indicates the likelihood that the reference vocabulary and the related vocabulary are synonyms. It indicates a higher likelihood the more similar the context of the reference vocabulary and the context of the related vocabulary are, a higher likelihood the more similar their notations are, and a higher likelihood the less similar their readings are.

The synonym determination unit 20 then determines, based on the magnitude of the obtained synonym index, whether the related vocabulary is a synonym of the reference vocabulary (s208).

When it determines that the related vocabulary is not a synonym, the synonym determination unit 20 ends the processing.

When it determines that the related vocabulary is a synonym, the synonym determination unit 20 stores the following in combination in the synonym information storage unit 21: the reference vocabulary and the related vocabulary (hereinafter, a related vocabulary determined to be a synonym is treated as a "synonym"), the vocabulary information of the reference vocabulary and the synonym, the context similarity, notation similarity, reading similarity, and part-of-speech similarity, and the synonym index (s209; see FIG. 5).

The synonym dictionary generation device 11 outputs the information stored in the synonym information storage unit 21, or a part of it (any information that contains at least the reference vocabulary and its synonyms), as the synonym dictionary.

The processing performed by each unit is described below.

<Related Vocabulary Acquisition Unit 12 and Vocabulary Information Storage Unit 16>
Using the reference vocabulary, the related vocabulary acquisition unit 12 acquires at least one related vocabulary of the reference vocabulary from the vocabulary information storage unit 16. The related vocabulary acquired here may be (1) taken from an existing synonym dictionary (see FIGS. 4A and 4B) or (2) a highly related word based on co-occurrence relationships in a large amount of text information. The large amount of text information may be the speech text stored in the text information storage unit 13 or may be other document text or the like. Case (2) is described here. In case (2), vocabulary information and a "concept base" are stored in the related vocabulary acquisition unit 12 (see FIG. 6 and Reference 1).
[Reference 1] JP 2009-277099 A

This "concept base" is a database of pairs, each consisting of a word and the concept vector corresponding to that word, intended for determining the similarity between words and for retrieving words of the same concept. A corpus concept base, which is created from a corpus of a large number of collected documents, is known. The "concept vector" of a given word is calculated from the frequency with which that word co-occurs with each of a predetermined set of co-occurrence words within the range (for example, a sentence) to which the word belongs. Words that appear with high frequency in the corpus are used as the co-occurrence words of a corpus concept base. A co-occurrence matrix is created with one row per word and one column per co-occurrence word, whose elements are the co-occurrence frequencies of the word and the co-occurrence word. In the corpus concept base, singular value decomposition is used to create a matrix in which the column dimension of the co-occurrence matrix is compressed, and each row vector of this compressed matrix is a concept vector. A concept base created in this way has the property that the higher the similarity between two words, the shorter the distance between their concept vectors, so it is effective for determining the similarity between words. That is, the shorter the distance between the concept vectors of two words, the higher the similarity between the two words can be judged to be.

Before use, the concept base described above is built from a large amount of text information using conventional techniques and stored in the vocabulary information storage unit 16. The related vocabulary acquisition unit 12 acquires the concept vector of the reference vocabulary from the vocabulary information storage unit 16 and finds concept vectors that are close to it. For example, it finds the concept vector with the highest cosine similarity, or the several concept vectors with the highest cosine similarities, takes the words corresponding to those concept vectors as related vocabularies, and acquires them together with their vocabulary information.

For example, the list of related vocabularies created by a similar-word search based on the concept vector of the reference vocabulary "セットトップボックス" (set-top box) is {"中断" (interruption), "チューナー" (tuner), "リモコン" (remote control), "STB", "中の" (inside), ...}.
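The similar-word search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `top_k` parameter, and the three-dimensional toy vectors are all assumptions made for the example, and a real concept base would hold SVD-compressed vectors built from a large corpus.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_vocabulary(reference, concept_base, top_k=3):
    """Return the top_k words whose concept vectors are closest to the reference word."""
    reference_vector = concept_base[reference]
    scored = [
        (cosine_similarity(reference_vector, vector), word)
        for word, vector in concept_base.items()
        if word != reference
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:top_k]]


# Hypothetical toy concept base; the values are invented for illustration only.
toy_concept_base = {
    "set-top box": [0.9, 0.1, 0.3],
    "STB": [0.8, 0.2, 0.3],
    "tuner": [0.7, 0.1, 0.6],
    "weather": [0.0, 1.0, 0.1],
}

candidates = related_vocabulary("set-top box", toy_concept_base, top_k=2)
```

As the example list in the text shows, such a search returns candidates only; words like "interruption" can rank highly without being synonyms, which is why the later similarity checks are needed.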

<Context Acquisition Unit 14 and Text Information Storage Unit 13>
A large amount of text information is stored in the text information storage unit 13, and the context acquisition unit 14 acquires the reference vocabulary context and the related vocabulary context from the text information storage unit 13.

The text information storage unit 13 includes, for example, a hard disk storage device, and stores as contexts a plurality of (I) speech texts generated by a plurality of speech recognition servers (not shown) connected to a network. It may also store as contexts a plurality of (II) sets of words obtained from speech text, or a plurality of items of (III) "collocation data" (see References 2 and 3).
[Reference 2] JP 2010-117764 A
[Reference 3] Yuichiro Terada et al., "Development of Japanese Collocation Data", Fukuoka University Engineering Bulletin, No. 79, September 2007, pp. 53-57

Case (II) is described first. A person may extract keywords from the speech text and create a set of them. Alternatively, a word string containing a given vocabulary (that is, a set of words with high word co-occurrence probability) may be generated by a statistical model based on the word co-occurrence probabilities of the vocabulary contained in the speech text, and used as a context. With this configuration, a word set that is coherent at the level of word concepts and has a statistically sufficient number of samples can be acquired as a context.

Case (III) is described next. "Collocation data" is a database of pairs, each consisting of a headword and the word string that follows it. It defines semantic units and focuses on the properties of word chains, for the purpose of retrieving contexts that contain a vocabulary. The "collocation data" of a given word is characterized by the degree of probabilistic binding between the word and the word string connected to it (how readily the element words co-occur with each other), lexical unity (how hard it is for other words to intervene between the element words), and idiomaticity (how poorly the principle of compositionality holds). Whether a word string has the properties of a collocation is judged from statistical features in addition to the compiler's introspection. In this way, a word string that is coherent as a word chain and has a statistically sufficient number of samples can be acquired as a context. According to this aspect, the synonym determination processing described below can additionally take into account the word-chain coherence and the number of samples of the reference vocabulary context and the related vocabulary context, which enables more accurate synonym determination and the generation of a high-quality synonym dictionary.

The contexts stored in the text information storage unit 13 are added to and updated periodically.

The context acquisition unit 14 acquires the reference vocabulary context and the related vocabulary context from the text information storage unit 13. The vocabulary context acquired here may be any of (I) to (III). In case (III), for example, it may be a local word-chain sequence that contains the reference vocabulary or the related vocabulary and whose probabilistic binding values satisfy, for example, a connection probability of 0.90 or more and a degree of convergence of 0.60 or more. Alternatively, it may be a vocabulary set whose lexical unity values satisfy a word interruption count of 1 or less.
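The selection criteria for case (III) can be sketched as a simple filter. The record layout (`CollocationEntry` and its field names) is a hypothetical one chosen for this example; the source gives only the threshold values, not how collocation data is stored.

```python
from dataclasses import dataclass


@dataclass
class CollocationEntry:
    """Hypothetical record for one item of collocation data."""
    words: tuple                   # the word chain, already segmented
    connection_probability: float  # probabilistic binding measure
    convergence: float             # probabilistic binding measure
    interruptions: int             # lexical unity measure (word interruption count)


def is_usable_context(entry, vocabulary,
                      min_connection=0.90, min_convergence=0.60,
                      max_interruptions=1):
    """Accept an entry that contains the vocabulary and meets either criterion."""
    if vocabulary not in entry.words:
        return False
    tightly_bound = (entry.connection_probability >= min_connection
                     and entry.convergence >= min_convergence)
    lexically_unified = entry.interruptions <= max_interruptions
    return tightly_bound or lexically_unified


entries = [
    CollocationEntry(("STB", "power", "on"), 0.95, 0.70, 3),
    CollocationEntry(("STB", "maybe", "later"), 0.40, 0.20, 4),
    CollocationEntry(("tuner", "setting"), 0.50, 0.30, 0),
]
stb_contexts = [e for e in entries if is_usable_context(e, "STB")]
```

The default thresholds mirror the example values in the text; treating the two criteria as alternatives follows the "or" in the description.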

<Context Similarity Calculation Unit 15>
The context similarity calculation unit 15 calculates the similarity between the reference vocabulary context and the related vocabulary context. For example, when a context acquired by the context acquisition unit 14 is speech text, the context similarity calculation unit 15 divides it into morphemes using a predetermined morphological analysis algorithm. When the context is collocation data, it is assumed to be already divided into morphemes. Next, the similarity between the reference vocabulary context and the related vocabulary context is calculated. For example, the context similarity is calculated based on the co-occurrence relationships of all vocabularies in the reference vocabulary context and the co-occurrence relationships of all vocabularies in the related vocabulary context. Specifically, the concept vector of each morpheme is acquired from the vocabulary information storage unit 16, and the sum of the cosine similarities between the concept vectors of the morphemes of the two contexts is normalized to give the similarity between the reference vocabulary context and the related vocabulary context.
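A minimal sketch of this calculation follows. The source does not state how the sum is normalized, so two assumptions are made here and labeled in the code: the sum is divided by the number of morpheme pairs, and negative cosine values are clamped to 0 so that the result stays within the range 0 to 1. Morphological analysis is assumed to have been done already.

```python
import math
from itertools import product


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def context_similarity(reference_context, related_context, concept_vectors):
    """Mean pairwise cosine similarity between the morphemes of two contexts.

    Assumption: normalization divides by the number of morpheme pairs, and
    negative cosines are clamped to 0 to keep the score in [0, 1].
    Morphemes without a concept vector are skipped.
    """
    pairs = [
        (concept_vectors[m1], concept_vectors[m2])
        for m1, m2 in product(reference_context, related_context)
        if m1 in concept_vectors and m2 in concept_vectors
    ]
    if not pairs:
        return 0.0
    return sum(max(0.0, cosine(a, b)) for a, b in pairs) / len(pairs)


# Hypothetical two-dimensional concept vectors for illustration.
toy_vectors = {"power": [1.0, 0.0], "switch": [1.0, 0.0], "rain": [0.0, 1.0]}
score = context_similarity(["power", "rain"], ["switch"], toy_vectors)
```

Here `score` averages one perfectly matching pair and one orthogonal pair.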

<Notation Similarity Calculation Unit 17>
The notation similarity calculation unit 17 calculates the similarity between the notation of the reference vocabulary and the notation of the related vocabulary. For example, the notation similarity calculation unit 17 divides the notation of each vocabulary, acquired from the vocabulary information storage unit 16, into individual characters. Specifically, each minimum unit used to describe a vocabulary in a program or on a medium, such as an encoded character or a character code, is extracted as a separate element. Next, the similarity of the notations of the reference vocabulary and the related vocabulary is calculated based on the match rate of their characters. For example, the reference vocabulary and the related vocabulary are each divided into characters, the resulting code sequences are regarded as two patterns, and each code is regarded as a separate element. Matching by dynamic programming (DP matching), a method that computes similarity efficiently while aligning the two sequences, is then applied, and the notation similarity of the reference vocabulary and the related vocabulary is calculated as a normalized match rate.
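One common way to realize DP matching is the Levenshtein edit distance, sketched below. The source specifies neither the cost function nor the normalization, so unit costs and division by the longer sequence length are assumptions of this example, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            substitution = previous[j - 1] + (x != y)
            current.append(min(previous[j] + 1,     # delete x
                               current[j - 1] + 1,  # insert y
                               substitution))       # match or substitute
        previous = current
    return previous[-1]


def match_similarity(a, b):
    """Normalized match rate in [0, 1]: 1 for identical sequences, 0 for no overlap.

    Assumption: normalization divides the edit distance by the longer length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Character-level comparison of two notations.
notation_score = match_similarity("セットトップボックス", "STB")

# The same function works on phoneme lists (hypothetical phoneme sequences).
reading_score = match_similarity(["s", "e", "t", "o"], ["s", "e", "t", "t", "o"])
```

Because the function accepts any sequence, it can be applied to characters for notation similarity and to phoneme lists for reading similarity.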

<Reading Similarity Calculation Unit 18>
The reading similarity calculation unit 18 calculates the similarity between the reading of the reference vocabulary and the reading of the related vocabulary. For example, the reading similarity calculation unit 18 divides the reading of each vocabulary, acquired from the vocabulary information storage unit 16, into phonemes. Phonemes are generally described with a vowel, a moraic nasal (hatsuon), or a geminate consonant (sokuon) as one unit each, and anything else as two units, a consonant and a vowel; a reading described in phonemes is called a phoneme notation. Next, the similarity between the phoneme notations of the reference vocabulary and the related vocabulary is calculated based on the match rate of their phonemes. For example, DP matching is used in the same way as in the notation similarity calculation unit 17, and the match rate is normalized to give the reading similarity of the reference vocabulary and the related vocabulary.

<Part-of-Speech Similarity Calculation Unit 19>
The part-of-speech similarity calculation unit 19 calculates the similarity between the part of speech of the reference vocabulary and the part of speech of the related vocabulary. Here, each part of speech is assumed to be located in a classification system that has the set of all parts of speech as its root and is subdivided in a tree-like manner starting from major categories (see Reference 4).
[Reference 4] Satoshi Shirai, Yoshifumi Oyama, Satoru Ikehara, Masahiro Miyazaki, Akio Yokoo, "On the Japanese Vocabulary System" (Nihongo Goi-Taikei), Information Processing Society of Japan Research Report, IM, November 1998, Vol. 1998, No. 106, pp. 47-52

For example, the part-of-speech similarity of the reference vocabulary and the related vocabulary is calculated based on the distance between their parts of speech. Taking as the base point the major category in the part-of-speech system that is common to the parts of speech of both the reference vocabulary and the related vocabulary, the sum of the differences in hierarchy level between the base point and each of the two parts of speech is defined as the part-of-speech distance. For polysemous words, the smallest value is adopted. The reciprocal of the distance is normalized to give the part-of-speech similarity of the reference vocabulary and the related vocabulary.
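The tree distance can be sketched as follows. The small part-of-speech tree is hypothetical and far simpler than the system in Reference 4. The source says only that the reciprocal of the distance is normalized; this example assumes the form 1 / (1 + distance), which yields 1 for identical parts of speech and avoids division by zero.

```python
# Hypothetical part-of-speech tree given as child -> parent links.
POS_PARENT = {
    "noun": "all",
    "verb": "all",
    "common noun": "noun",
    "proper noun": "noun",
}


def path_to_root(pos, parent):
    """List of nodes from the given part of speech up to the root."""
    path = [pos]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def pos_distance(pos_a, pos_b, parent=POS_PARENT):
    """Sum of level differences from the nearest common category to each part of speech."""
    steps_from_b = {node: steps for steps, node in enumerate(path_to_root(pos_b, parent))}
    for steps_from_a, node in enumerate(path_to_root(pos_a, parent)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("parts of speech share no common category")


def pos_similarity(candidates_a, candidates_b, parent=POS_PARENT):
    """Similarity in (0, 1]; polysemous words use their smallest distance.

    Assumption: the reciprocal is normalized as 1 / (1 + distance).
    """
    distance = min(pos_distance(a, b, parent)
                   for a in candidates_a for b in candidates_b)
    return 1.0 / (1.0 + distance)


sibling_score = pos_similarity(["common noun"], ["proper noun"])
```

Passing several candidates for a polysemous word, for example `pos_similarity(["verb", "common noun"], ["common noun"])`, selects the closest pair.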

<Synonym Determination Unit 20 and Synonym Information Storage Unit 21>
The synonym determination unit 20 obtains a synonym index for the reference vocabulary and the related vocabulary using the context similarity, notation similarity, reading similarity, and part-of-speech similarity, and determines, based on the magnitude of the synonym index, whether the related vocabulary is a synonym of the reference vocabulary. Next, the synonym determination unit 20 outputs the reference vocabulary in combination with each related vocabulary determined to be its synonym and stores the combination in the synonym information storage unit 21.

For example, the context similarity, reading similarity, notation similarity, and part-of-speech similarity are each normalized so that the value is 1 when the contexts, readings, notations, or parts of speech of the reference vocabulary and the related vocabulary match, and 0 when they do not match at all. The synonym index is then obtained by weighting and combining these values.

Specifically, the synonym index becomes larger as the context similarity, notation similarity, and part-of-speech similarity become larger. On the other hand, when the reading similarity is large but the notation similarity and part-of-speech similarity are small, the related vocabulary can be regarded as an erroneous word produced by misrecognition. As one example, the synonym index of the reference vocabulary and the related vocabulary is expressed as a linear combination of the similarities:
$$S_{\mathrm{vocab}}(u,v) = S_{\mathrm{context}}(u,v) + \beta \cdot S_{\mathrm{POS}}(u,v) + \gamma \cdot S_{\mathrm{describe}}(u,v) + \delta \cdot S_{\mathrm{pronounce}}(u,v) \tag{1}$$
$$0 \le S_{\mathrm{context}}(u,v),\ S_{\mathrm{POS}}(u,v),\ S_{\mathrm{describe}}(u,v),\ S_{\mathrm{pronounce}}(u,v) \le 1, \quad \beta \ge 0,\ \gamma > 0,\ \delta < 0$$
Here, for the reference vocabulary $u$ and the related vocabulary $v$, $S_{\mathrm{context}}$, $S_{\mathrm{POS}}$, $S_{\mathrm{describe}}$, and $S_{\mathrm{pronounce}}$ represent the context similarity, part-of-speech similarity, notation similarity, and reading similarity, respectively. $\beta$, $\gamma$, and $\delta$ are weighting coefficients; $|\beta|$, $|\gamma|$, and $|\delta|$ are preferably smaller than 1.
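As a minimal sketch of equation (1), the function below combines the four normalized similarities linearly. The function name and the default weights (0.5, 0.5, -0.5) are illustrative assumptions; the text only constrains the signs of the weights and suggests magnitudes below 1.

```python
def synonym_index_linear(s_context, s_pos, s_describe, s_pronounce,
                         beta=0.5, gamma=0.5, delta=-0.5):
    """Equation (1): weighted linear combination of similarities in [0, 1].

    The default weights are placeholders, not values given in the text.
    """
    for s in (s_context, s_pos, s_describe, s_pronounce):
        if not 0.0 <= s <= 1.0:
            raise ValueError("each similarity must lie in [0, 1]")
    if beta < 0 or gamma <= 0 or delta >= 0:
        raise ValueError("requires beta >= 0, gamma > 0, delta < 0")
    return s_context + beta * s_pos + gamma * s_describe + delta * s_pronounce


# A likely synonym: similar context, same part of speech, partly shared notation.
likely_synonym = synonym_index_linear(0.8, 1.0, 0.5, 0.0)
# A likely misrecognition: similar context and reading, but different notation and part of speech.
likely_error = synonym_index_linear(0.8, 0.0, 0.0, 1.0)
```

Because the weight on the reading similarity is negative, a pair that sounds alike but shares neither notation nor part of speech scores lower than a pair with the same context similarity and a dissimilar reading.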

As another example, a synonym index that introduces a sigmoid function, so that the notation similarity and reading similarity are emphasized more strongly than the context similarity and part-of-speech similarity, is expressed by the following equation:
$$S_{\mathrm{vocab}}(u,v) = \bigl(S_{\mathrm{context}}(u,v) + \beta \cdot S_{\mathrm{POS}}(u,v)\bigr) \times s_{\alpha}\bigl(S_{\mathrm{describe}}(u,v) - S_{\mathrm{pronounce}}(u,v)\bigr) \quad (\alpha > 0) \tag{2}$$
Here, $s_{\alpha}$ is a sigmoid function with gain $\alpha$. For example, $\alpha$ takes a value of about 3.0 to 5.0.
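Equation (2) can be sketched as below. The text does not spell out the sigmoid, so the standard logistic form 1 / (1 + exp(-αx)) is assumed here; the function names and the default values of `alpha` and `beta` are likewise illustrative.

```python
import math


def sigmoid(x, alpha):
    """Sigmoid with gain alpha (assumed logistic form; the text gives no explicit formula)."""
    return 1.0 / (1.0 + math.exp(-alpha * x))


def synonym_index_sigmoid(s_context, s_pos, s_describe, s_pronounce,
                          alpha=4.0, beta=0.5):
    """Equation (2): context and part-of-speech evidence, gated by notation minus reading."""
    return (s_context + beta * s_pos) * sigmoid(s_describe - s_pronounce, alpha)


# Same context evidence, opposite notation/reading patterns.
same_notation = synonym_index_sigmoid(0.8, 1.0, 1.0, 0.0)   # gate close to 1
sounds_alike = synonym_index_sigmoid(0.8, 1.0, 0.0, 1.0)    # gate close to 0
```

Under this assumed form, the gate equals 0.5 when notation and reading similarity are equal, approaches 1 when the notation is similar and the reading is not, and approaches 0 in the opposite case, which is the pattern attributed to misrecognized words.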

"Determining whether the related vocabulary is a synonym based on the magnitude of the synonym index" means, for example, either of the following: (i) when the obtained synonym index exceeds a threshold κ, the related vocabulary corresponding to that synonym index is determined to be a synonym; or (ii) synonym indices are obtained for a plurality of related vocabularies, and when the largest synonym index is significantly larger than the synonym indices of the other related vocabularies, the related vocabulary corresponding to the largest value is determined to be a synonym.
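The two decision rules can be sketched as follows. The text does not define "significantly larger", so rule (ii) is approximated here by a fixed margin over the runner-up; that margin test, the function names, and the sample scores are assumptions made for illustration.

```python
def synonyms_by_threshold(indices, kappa):
    """Rule (i): every related vocabulary whose synonym index exceeds kappa."""
    return [vocab for vocab, score in indices.items() if score > kappa]


def synonym_by_margin(indices, margin):
    """Rule (ii): the top-scoring related vocabulary, if it leads the runner-up by
    at least `margin` (an assumed stand-in for 'significantly larger')."""
    if len(indices) < 2:
        return None
    ranked = sorted(indices.items(), key=lambda item: item[1], reverse=True)
    (best_vocab, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_vocab if best_score - second_score >= margin else None


# Hypothetical synonym indices for three related vocabularies of one reference vocabulary.
scores = {"related_a": 0.9, "related_b": 0.4, "related_c": 0.35}
```

Rule (i) can return several synonyms for one reference vocabulary, whereas rule (ii) returns at most one and returns nothing when the top candidates are close together.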

An example of how to obtain α, β, and the threshold κ in equation (2) is as follows. β is the weight of the part-of-speech similarity relative to the context similarity. Its value is determined with a training set, consisting of a document collection and a small collection of vocabulary known to be synonymous, so that the discrimination performance between synonymous pairs of reference and related vocabulary and unrelated vocabulary is maximized. The boundary value of the first term on the right-hand side (the factor combining context and part-of-speech similarity) that separates synonymous from non-synonymous pairs is taken as the threshold κ. For example, with the second term on the right-hand side (the sigmoid factor) set to 1, the weighting coefficient β is varied, and the β and κ that most accurately separate synonyms from non-synonyms in the training set are found experimentally. Next, α is determined, using a training set in which the related vocabulary is known to be the result of misrecognition, as the value that minimizes synonym determination errors caused by misrecognition. Specifically, an existing small-scale synonym dictionary is used, and the documents from which synonyms are extracted are a transcript of speech and the speech recognition result for the same speech.
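One way to carry out the experimental search for β and κ is a small grid search, sketched below. The text does not prescribe a search procedure or an accuracy measure, so the grid, the use of classification accuracy, the midpoint thresholds, and the toy training samples are all assumptions. Fitting α on a misrecognition training set could follow the same pattern and is not shown.

```python
def fit_beta_kappa(samples, betas):
    """Grid search for (beta, kappa) with the sigmoid factor of equation (2) fixed to 1.

    samples: list of (s_context, s_pos, is_synonym) tuples from a training set.
    Returns (beta, kappa, accuracy) for the best separation found, or None.
    """
    best = None
    for beta in betas:
        scored = sorted((s_context + beta * s_pos, is_synonym)
                        for s_context, s_pos, is_synonym in samples)
        values = [score for score, _ in scored]
        # Candidate thresholds: midpoints between consecutive scores.
        for kappa in ((a + b) / 2 for a, b in zip(values, values[1:])):
            correct = sum((score > kappa) == label for score, label in scored)
            accuracy = correct / len(scored)
            if best is None or accuracy > best[2]:
                best = (beta, kappa, accuracy)
    return best


# Hypothetical training pairs: (context similarity, part-of-speech similarity, is synonym).
training = [
    (0.6, 1.0, True),
    (0.7, 1.0, True),
    (0.65, 0.0, False),
    (0.5, 0.0, False),
]
beta, kappa, accuracy = fit_beta_kappa(training, betas=[0.0, 0.5])
```

In this toy data the context similarity alone cannot separate the two classes, but adding the part-of-speech term with β = 0.5 does, so the search selects that β together with a threshold between the two groups of scores.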

Next, the reference vocabulary and the related vocabulary determined to be its synonym are registered in the synonym dictionary. Specifically, the vocabulary information acquired from the vocabulary information storage unit 16, together with the context similarity, notation similarity, reading similarity, part-of-speech similarity, and synonym index, is stored in the synonym information storage unit 21 (see FIG. 5).

A user of the synonym dictionary generation device 11 inputs as many reference vocabularies as needed. The synonym dictionary generation device 11 then obtains the synonyms of each input reference vocabulary and stores them in the synonym information storage unit 21. The user can use the stored information as a synonym dictionary.

<Effect>
According to the synonym dictionary generation device 11 described above, whether the reference vocabulary and the related vocabulary are synonyms is determined from the context similarity, reading similarity, notation similarity, and part-of-speech similarity. The determination exploits two characteristics: synonyms tend to appear in similar contexts, and recognition errors tend to have similar readings but dissimilar notations and parts of speech. As a result, whether the reference vocabulary and the related vocabulary are synonyms can be determined accurately from speech text, without being affected by dropped words, inserted words, recognition errors, and the like, and a synonym dictionary can be generated.

<Other variations>
In the present embodiment, data is exchanged between the units via the storage unit 22, but the units may instead send and receive data directly without going through the storage unit 22.

Note that the text information storage unit 13 may store, instead of speech text, (I) document text, (II) a set of words obtained from document text, or (III) "collocation data" obtained from document text. A synonym dictionary can also be created from document text.

In the present embodiment, the part-of-speech similarity is used in the synonym determination processing (s208, s209), but it is not essential. It is sufficient to perform the synonym determination processing using at least the context similarity, reading similarity, and notation similarity. In this case, a vocabulary is expressed by the notation and reading of a word. The synonym dictionary generation device 11 then need not include the units related to part-of-speech similarity (the part-of-speech similarity calculation unit 19 and so on), and the processing related to part-of-speech similarity (the handling of parts of speech and part-of-speech similarity in s202, s206, s207, and s209) and the corresponding data (the part-of-speech and part-of-speech similarity data among the data stored in the vocabulary information storage unit 16, the synonym information storage unit 21, and the storage unit 22) can be omitted. The vocabulary information therefore includes the reading and notation but not the part of speech. The synonym determination unit 20 calculates the synonym index by the following equation:
$$S_{\mathrm{vocab}}(u,v) = S_{\mathrm{context}}(u,v) + \gamma \cdot S_{\mathrm{describe}}(u,v) + \delta \cdot S_{\mathrm{pronounce}}(u,v) \tag{3}$$
$$0 \le S_{\mathrm{context}}(u,v),\ S_{\mathrm{describe}}(u,v),\ S_{\mathrm{pronounce}}(u,v) \le 1, \quad \gamma > 0,\ \delta < 0$$
or
$$S_{\mathrm{vocab}}(u,v) = S_{\mathrm{context}}(u,v) \times s_{\alpha}\bigl(S_{\mathrm{describe}}(u,v) - S_{\mathrm{pronounce}}(u,v)\bigr) \quad (\alpha > 0) \tag{4}$$
These equations are equations (1) and (2) with $S_{\mathrm{POS}} = 0$. Because the part of speech is not used as evidence for deciding whether a word is a synonym, the accuracy may decrease slightly. However, the characteristics of recognition errors can still be exploited from the notation and reading alone, so nearly comparable accuracy can be expected while the amount of computation and the like is reduced.
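The relationship between the full and reduced indices can be checked numerically with a short sketch. It assumes the standard logistic form for the sigmoid, which the text does not state explicitly; the function names and sample values are illustrative.

```python
import math


def gate(s_describe, s_pronounce, alpha):
    """Assumed logistic sigmoid with gain alpha, applied to (notation - reading)."""
    return 1.0 / (1.0 + math.exp(-alpha * (s_describe - s_pronounce)))


def index_with_pos(s_context, s_pos, s_describe, s_pronounce, alpha, beta):
    """Equation (2): uses the part-of-speech similarity."""
    return (s_context + beta * s_pos) * gate(s_describe, s_pronounce, alpha)


def index_without_pos(s_context, s_describe, s_pronounce, alpha):
    """Equation (4): no part-of-speech input is required."""
    return s_context * gate(s_describe, s_pronounce, alpha)


# With the part-of-speech similarity set to 0, equation (2) reduces to equation (4).
full = index_with_pos(0.7, 0.0, 0.9, 0.2, alpha=4.0, beta=0.5)
reduced = index_without_pos(0.7, 0.9, 0.2, alpha=4.0)
```

The reduced form needs neither the part-of-speech hierarchy nor the weight β, which is where the saving in computation and stored data comes from.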

The present invention is not limited to the embodiment and variations described above. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as required. Other modifications may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The synonym dictionary generation device described above can also be realized by a computer. In that case, a program for causing the computer to function as the target device (a device having the functional configuration shown in the figures of the embodiments), or a program for causing the computer to execute each step of the processing procedure (as shown in the embodiments), is loaded into the computer from a recording medium such as a CD-ROM, magnetic disk, or semiconductor storage device, or downloaded via a communication line, and the program is then executed.

Claims (7)

A synonym dictionary generation device comprising:
a context similarity calculation unit that calculates the similarity between a context containing a reference vocabulary, which serves as the reference when creating a synonym dictionary, and a context containing a related vocabulary related to the reference vocabulary;
a notation similarity calculation unit that calculates the similarity between the notation of the reference vocabulary and the notation of the related vocabulary;
a reading similarity calculation unit that calculates the similarity between the reading of the reference vocabulary and the reading of the related vocabulary; and
a synonym determination unit that, where a synonym index for a reference vocabulary and a related vocabulary indicates the likelihood that the reference vocabulary and the related vocabulary are synonyms, indicating a higher likelihood the more similar their contexts are, a higher likelihood the more similar their notations are, and a higher likelihood the less similar their readings are, obtains the synonym index for the reference vocabulary and the related vocabulary using the calculated context, notation, and reading similarities, and determines whether the related vocabulary is a synonym of the reference vocabulary based on the magnitude of the synonym index.
The synonym dictionary generation device according to claim 1, wherein,
with the context, notation, and reading similarities denoted $S_{\mathrm{context}}(u,v)$, $S_{\mathrm{describe}}(u,v)$, and $S_{\mathrm{pronounce}}(u,v)$, respectively, the synonym index denoted $S_{\mathrm{vocab}}(u,v)$, and $s_{\alpha}$ being a sigmoid function with gain $\alpha$, the synonym determination unit obtains the synonym index as
$$S_{\mathrm{vocab}}(u,v) = S_{\mathrm{context}}(u,v) \times s_{\alpha}\bigl(S_{\mathrm{describe}}(u,v) - S_{\mathrm{pronounce}}(u,v)\bigr).$$
The synonym dictionary generation device according to claim 1, further comprising
a part-of-speech similarity calculation unit that calculates the similarity between the part of speech of the reference vocabulary and the part of speech of the related vocabulary, wherein,
with the context, notation, reading, and part-of-speech similarities denoted $S_{\mathrm{context}}(u,v)$, $S_{\mathrm{describe}}(u,v)$, $S_{\mathrm{pronounce}}(u,v)$, and $S_{\mathrm{POS}}(u,v)$, respectively, the synonym index denoted $S_{\mathrm{vocab}}(u,v)$, $s_{\alpha}$ being a sigmoid function with gain $\alpha$, and $\beta$ being a weighting coefficient, the synonym determination unit obtains the synonym index as
$$S_{\mathrm{vocab}}(u,v) = \bigl(S_{\mathrm{context}}(u,v) + \beta \cdot S_{\mathrm{POS}}(u,v)\bigr) \times s_{\alpha}\bigl(S_{\mathrm{describe}}(u,v) - S_{\mathrm{pronounce}}(u,v)\bigr).$$
The synonym dictionary generation device according to any one of claims 1 to 3, further comprising:
a related vocabulary acquisition unit that uses the reference vocabulary to acquire a related vocabulary related to the reference vocabulary;
a text information storage unit in which a large amount of text information is stored; and
a context acquisition unit that acquires, from the text information storage unit, a context containing the reference vocabulary and a context containing the related vocabulary,
wherein the synonym determination unit outputs the reference vocabulary in combination with the related vocabulary determined to be a synonym of the reference vocabulary.
The synonym dictionary generation device according to claim 4, further comprising
a synonym information storage unit that stores the reference vocabulary, the related vocabulary determined to be a synonym of the reference vocabulary, the vocabulary information of the reference vocabulary and the related vocabulary, each of the similarities between the reference vocabulary and the related vocabulary, and the synonym index.
A synonym dictionary generation method comprising:
a context similarity calculation step in which a context similarity calculation unit calculates the similarity between a context containing a reference vocabulary, which serves as the reference when creating a synonym dictionary, and a context containing a related vocabulary related to the reference vocabulary;
a notation similarity calculation step in which a notation similarity calculation unit calculates the similarity between the notation of the reference vocabulary and the notation of the related vocabulary;
a reading similarity calculation step in which a reading similarity calculation unit calculates the similarity between the reading of the reference vocabulary and the reading of the related vocabulary; and
a synonym determination step in which, where a synonym index indicating the likelihood that a reference vocabulary and a related vocabulary are synonyms indicates a higher likelihood the more similar their contexts are, a higher likelihood the more similar their notations are, and a higher likelihood the less similar their readings are, a synonym determination unit obtains the synonym index for the reference vocabulary and the related vocabulary using the calculated context, notation, and reading similarities, and determines whether the related vocabulary is a synonym of the reference vocabulary based on the magnitude of the synonym index.
A program for causing a computer to function as the synonym dictionary generation device according to any one of claims 1 to 5.
JP2011148198A 2011-07-04 2011-07-04 Synonym dictionary generating apparatus, method and program thereof Active JP5524138B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011148198A JP5524138B2 (en) 2011-07-04 2011-07-04 Synonym dictionary generating apparatus, method and program thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2011148198A JP5524138B2 (en) 2011-07-04 2011-07-04 Synonym dictionary generating apparatus, method and program thereof

Publications (2)

Publication Number Publication Date
JP2013016011A JP2013016011A (en) 2013-01-24
JP5524138B2 true JP5524138B2 (en) 2014-06-18

Family

ID=47688647

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011148198A Active JP5524138B2 (en) 2011-07-04 2011-07-04 Synonym dictionary generating apparatus, method and program thereof

Country Status (1)

Country Link
JP (1) JP5524138B2 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6106616B2 (en) * 2014-02-13 2017-04-05 日本電信電話株式会社 Database creation device, word search device, information terminal, word search method, program
JP2019049873A (en) * 2017-09-11 2019-03-28 株式会社Screenホールディングス Synonym dictionary creation apparatus, synonym dictionary creation program, and synonym dictionary creation method
JP6509391B1 (en) * 2018-01-31 2019-05-08 株式会社Fronteo Computer system
JP6571231B1 (en) * 2018-03-12 2019-09-04 株式会社ソケッツ Search apparatus and method
JP7168334B2 (en) * 2018-03-20 2022-11-09 ヤフー株式会社 Information processing device, information processing method and program
JP7029813B2 (en) * 2019-02-28 2022-03-04 株式会社ミラボ Dictionary creation device, dictionary creation method and dictionary creation program
CN111488735B (en) * 2020-04-09 2023-10-27 中国银行股份有限公司 Test corpus generation method and device and electronic equipment
WO2022244189A1 (en) * 2021-05-20 2022-11-24 三菱電機株式会社 Information processing device, processing method, and processing program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5094486B2 (en) * 2008-03-14 2012-12-12 日本電信電話株式会社 Synonymity determination device, method, program, and recording medium
JP5356197B2 (en) * 2009-12-01 2013-12-04 株式会社日立製作所 Word semantic relation extraction device

Also Published As

Publication number Publication date
JP2013016011A (en) 2013-01-24

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20130710

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20140131

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140212

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140310

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20140401

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20140409

R150 Certificate of patent or registration of utility model

Ref document number: 5524138

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150