JP2004326479A - Similarity calculating program and device between words - Google Patents

Similarity calculating program and device between words

Info

Publication number
JP2004326479A
JP2004326479A (application number JP2003120899A)
Authority
JP
Japan
Prior art keywords
word
similarity
words
occurrence
ideographic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003120899A
Other languages
Japanese (ja)
Other versions
JP4055638B2 (en)
Inventor
Naoto Akira
Yasutsugu Morimoto
Atsuko Koizumi
Kazutake Kurenishi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Priority to JP2003120899A priority Critical patent/JP4055638B2/en
Publication of JP2004326479A publication Critical patent/JP2004326479A/en
Application granted granted Critical
Publication of JP4055638B2 publication Critical patent/JP4055638B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Abstract

PROBLEM TO BE SOLVED: To calculate the similarity between words from document data quickly and with high precision.

SOLUTION: In a program for calculating the similarity between words from document data, a co-occurrence vector whose elements are the frequencies of the ideographic characters having a co-occurrence relationship with each word of interest (the words targeted for similarity calculation) is generated from the document data and stored in a storage means, and the similarity between words of interest is calculated from the similarity of their co-occurrence vectors. Because the co-occurrence vectors have ideographic-character frequencies as their elements, the similarity between words can be calculated quickly.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a program and an apparatus for calculating the similarity between two words.
[0002]
[Prior art]
Conventionally, there is a technique for calculating the similarity between words in which the features of a word are defined by a co-occurrence vector whose elements are the frequencies of the words co-occurring with that word, and the similarity between two words is calculated from the similarity of their co-occurrence vectors (see, for example, Patent Document 1).
[Patent Document 1] Japanese Patent Application Laid-Open No. 2000-137718
[Problems to be Solved by the Invention]
The above conventional technique calculates the similarity between words from the similarity of co-occurrence vectors whose elements are word frequencies. When the vocabulary of words that make up the co-occurrence vectors is large, the number of dimensions of each vector, which equals that vocabulary size, also becomes large; similarity calculation with such vectors then takes a long time, making real-time processing difficult. The vectors are also large, making them difficult to store in a storage device such as a memory or a hard disk. Furthermore, a low-frequency word co-occurs with few words, so it shares few or no co-occurrence targets with other words, which makes its similarity difficult to calculate.
[0003]
An object of the present invention is to speed up the calculation of similarity between words so that the similarity between a word of interest and a plurality of other words can be calculated in real time, and to reduce the size of the co-occurrence vectors so that they can be stored in a storage device such as a memory or a hard disk.
Another object of the present invention is to improve the accuracy of calculating the similarity for low-frequency words.
[0004]
[Means for Solving the Problems]
The outline of the invention disclosed in the present application to achieve the above objects is as follows. The inter-word similarity calculation program of the present invention reads document data from a storage device, divides the document data into words by morphological analysis, extracts, for each word of interest whose similarity is to be calculated, the words having a co-occurrence relationship with that word, extracts ideographic characters from those co-occurring words, creates for each word of interest a co-occurrence vector whose elements are the frequencies of the ideographic characters co-occurring with it, and stores the vectors in a storage device such as a memory or a hard disk.
Next, the similarity between the word of interest, for which similar words are to be extracted, and each other word is calculated from the similarity of their co-occurrence vectors, and words with a high similarity to the word of interest are extracted.
[0005]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a flowchart showing the procedure for calculating the similarity between words according to the first embodiment of the present invention. First, document data is read from a storage device such as a memory or a hard disk (S11), the document data is divided into words by morphological analysis (S12), and, using the part-of-speech information of the words, the words having a co-occurrence relationship with each word of interest whose similarity is to be calculated are extracted (S13). Here, the words having a co-occurrence relationship with a word of interest include words in a dependency relation with it, words appearing in a document containing it, and words within a specified number of characters before and after it; any other word contained in the context in which the word of interest appears may also be used. For example, if the words of interest are nouns and the co-occurring words are the verbs in a dependency relation with them, the sentence "パソコンを起動する" ("start up the personal computer") yields the word of interest "パソコン" (personal computer) and the dependent word "起動する" (start up).
[0006]
Next, ideographic characters are extracted from the words having a co-occurrence relationship with each word of interest (S14), and by tallying the frequencies of those characters, a co-occurrence vector whose elements are the frequencies of the ideographic characters co-occurring with each word of interest is created (S15) and stored in a storage device such as a memory or a hard disk. Here the ideographic characters are kanji, but characters such as kana or alphanumerics may also be used, as long as they co-occur with a limited set of words of interest and characterize those words. The elements of the co-occurrence vector need not be raw frequencies, as long as they indicate the distribution of the ideographic characters. For example, the word-word co-occurrence pair "パソコン−起動する" yields the word-character co-occurrence pairs "パソコン−起" and "パソコン−動". From the example in FIG. 2, in which the ideographic characters co-occurring with each word of interest are extracted from a plurality of document data and their frequencies tallied, the word of interest "パソコン" yields the co-occurrence vector
[170 (入), 160 (用), 100 (購), 80 (利), 80 (使), 20 (動), 15 (起), …],
where the single kanji are drawn from co-occurring words (for example, 購 and 入 from 購入, "purchase").
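As a concrete illustration of steps S13 to S15, the following minimal Python sketch (not part of the patent) tallies kanji frequencies over the words co-occurring with one word of interest. It assumes the co-occurring words have already been extracted, e.g. by a morphological analyzer and dependency parser (not shown), and it selects kanji by Unicode range; the input list and resulting counts are toy values, not the figures of FIG. 2.

```python
from collections import Counter

def is_kanji(ch: str) -> bool:
    # CJK Unified Ideographs range; the patent's examples use kanji.
    return '\u4e00' <= ch <= '\u9fff'

def cooccurrence_vector(cooccurring_words: list[str]) -> Counter:
    """Build the kanji-frequency co-occurrence vector (S14-S15)
    for one word of interest from its co-occurring words (S13)."""
    counts = Counter()
    for word in cooccurring_words:
        counts.update(ch for ch in word if is_kanji(ch))
    return counts

# Hypothetical co-occurring words observed for パソコン:
words = ['購入'] * 100 + ['利用'] * 80 + ['使用'] * 80 + ['起動'] * 15
print(cooccurrence_vector(words).most_common(3))
# [('用', 160), ('購', 100), ('入', 100)] -- 用 is shared by 利用 and 使用
```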
[0007]
Next, the similarity between one word of interest, for which similar words are to be extracted, and each of the other words of interest is calculated from the similarity of the co-occurrence vectors (S16). For example, if the words of interest are "パソコン" (personal computer), "PC", "HDD", "メモリ" (memory), and "プリンタ" (printer), then "パソコン" is the single word of interest, the other words of interest are "PC", "HDD", "メモリ", and "プリンタ", and "PC", whose co-occurrence vector is similar, is the word that will be extracted. Words with a high similarity to the word of interest are then extracted according to a criterion such as a specified number of words or a similarity at or above a specified threshold (S17). Any formula may be used to calculate the similarity, as long as it captures the similarity of the co-occurrence vectors, such as the cosine distance, which expresses the angle between the co-occurrence vectors of the two words being compared. For example, calculating the similarity with the cosine distance from the co-occurrence vector of "パソコン", [170 (入), 160 (用), 100 (購), 80 (利), 80 (使)], and the co-occurrence vector of "PC", [140 (入), 120 (用), 80 (購), 70 (利), 50 (使)], gives
[0008]
(Equation 1)

$$\mathrm{sim}(X, Y) = \frac{\sum_{i} x_i\, y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}} \approx 0.997$$
[0009]
which is the similarity obtained. Here, each parenthesized item is an ideographic character co-occurring with the word of interest, and the number preceding it is the frequency with which that character co-occurs with the word of interest.
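The cosine computation of Equation 1 can be sketched as follows; the vectors are the ones given in the text, stored sparsely as kanji-to-frequency mappings. This is a generic cosine implementation, not code from the patent.

```python
import math

def cosine(x: dict, y: dict) -> float:
    """Cosine similarity of two sparse co-occurrence vectors (S16)."""
    dot = sum(v * y.get(k, 0) for k, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

pasokon = {'入': 170, '用': 160, '購': 100, '利': 80, '使': 80}
pc      = {'入': 140, '用': 120, '購': 80,  '利': 70, '使': 50}
print(round(cosine(pasokon, pc), 3))  # 0.997
```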
[0010]
Thus, in a document processing apparatus such as the text mining apparatus or document search apparatus whose configuration is shown in FIG. 3, words with a high similarity to a word of interest specified through an input device (304) such as a keyboard or mouse can be extracted from the stored document data (306); as shown in FIG. 4, this is used to display words similar to the word of interest, such as synonyms, near-synonyms, and related words, on a display device (303). The configuration of the present application is realized by reading programs (307, 308) stored in a storage device such as a hard disk into the memory (302) and having the CPU execute control.
[0011]
According to the present embodiment, the ideographic characters, kanji, are limited to at most 2,965 characters even if all JIS level-1 kanji are used. Using a co-occurrence vector whose elements are the frequencies of JIS level-1 kanji therefore greatly reduces the number of dimensions compared with using the frequencies of words, whose population is unknown and whose vocabulary is large, which speeds up the similarity calculation. Although it depends on the size of the document data used to create the vectors, a co-occurrence vector that has thousands to hundreds of thousands of dimensions when words are used is reduced to several hundred to 2,965 dimensions (when JIS level-1 kanji are used) with ideographic characters, so the processing speed with the cosine distance can be increased severalfold to several tens of times, and the size of the co-occurrence vector can be reduced by a factor of several to several tens.
[0012]
Also, by using kanji, each dimension of the co-occurrence vector can be identified directly by a character code, so the correspondence table between words and vector dimensions that is needed when words are used becomes unnecessary, reducing memory usage and increasing processing speed. Furthermore, words that are treated as distinct despite similar meanings, such as "利用" and "使用" (both meaning "use"), come to share vector elements when ideographic characters are used, since both contain "用". Because co-occurrence vector elements that would differ with word co-occurrence targets become partly shared with ideographic-character targets, similarity can be calculated even for low-frequency words that have few co-occurring ideographic characters.
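The character-code-to-dimension correspondence can be illustrated with a dense vector indexed directly by code point, so that no word-to-dimension table is kept; the use of the Unicode CJK block rather than the JIS code space is an assumption of this sketch.

```python
import array

KANJI_BASE = 0x4E00                   # start of the CJK Unified Ideographs block
KANJI_DIM = 0x9FFF - KANJI_BASE + 1   # fixed dimensionality, known in advance

def dense_vector(counts: dict) -> array.array:
    """Place each frequency at the index given by the character code
    itself; the code is the dimension, so no mapping table is stored."""
    v = array.array('I', [0]) * KANJI_DIM
    for ch, freq in counts.items():
        v[ord(ch) - KANJI_BASE] = freq
    return v

vec = dense_vector({'入': 170, '用': 160, '購': 100})
print(vec[ord('用') - KANJI_BASE])  # 160
```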
[0013]
Next, a second embodiment of the present invention will be described. In the second embodiment, as shown in the flowchart of FIG. 5, the step of extracting the words having a co-occurrence relationship with the word of interest in the first embodiment is omitted, and the ideographic characters co-occurring with the word of interest, such as characters within a specified number of characters before and after it or characters co-occurring in the same document, are extracted directly. For example, for the word "パソコン", the frequencies of the co-occurring ideographic characters "入, 用, 購, 利, 使" are obtained without first extracting the words that co-occur with "パソコン". According to this embodiment, since the step of extracting co-occurring words is omitted, the time required to create the co-occurrence vectors is reduced.
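A minimal sketch of this direct extraction, using a fixed character window around each occurrence of the word of interest; the window size and the one-sentence corpus are illustrative choices, not values from the patent.

```python
from collections import Counter

def window_kanji_vector(text: str, word: str, window: int = 10) -> Counter:
    """Collect kanji within `window` characters before and after each
    occurrence of `word`, skipping word extraction entirely (FIG. 5)."""
    counts = Counter()
    pos = text.find(word)
    while pos != -1:
        left = text[max(0, pos - window):pos]
        right = text[pos + len(word):pos + len(word) + window]
        counts.update(ch for ch in left + right
                      if '\u4e00' <= ch <= '\u9fff')
        pos = text.find(word, pos + 1)
    return counts

print(window_kanji_vector('新しいパソコンを購入して利用する。', 'パソコン'))
# Counter({'新': 1, '購': 1, '入': 1, '利': 1, '用': 1})
```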
[0014]
Next, a third embodiment of the present invention will be described. In the third embodiment, the step of calculating the similarity in either the first or second embodiment takes into account which ideographic characters contribute to inter-word similarity and which do not: a large weight is defined for characters that contribute to the similarity calculation and a small weight for characters that do not, each element of the co-occurrence vector is multiplied by its weight, and the similarity is then calculated with the weighted vectors as in the first embodiment. Here the weights are multiplied into the vector elements, but any method may be used as long as heavily weighted characters are reflected in the similarity calculation. The weight of an ideographic character is set by displaying the character whose weight is to be defined or corrected on a display device and receiving the input of its weight; for example, the weights can be defined with an ideographic-character weight editor such as the one shown in FIG. 6. For characters whose weight is undefined, a preset default value may be used.
[0015]
As an example, define the ideographic-character weights as "8 (入), 3 (用), 10 (購), 5 (利), 6 (使)" and let the co-occurrence vectors of "パソコン" and "PC" be
パソコン ⇒ [170 (入), 160 (用), 100 (購), 80 (利), 80 (使)]
PC ⇒ [140 (入), 120 (用), 80 (購), 70 (利), 50 (使)].
The co-occurrence vectors used for the similarity calculation then become
パソコン ⇒ [170×8 (入), 160×3 (用), 100×10 (購), 80×5 (利), 80×6 (使)]
PC ⇒ [140×8 (入), 120×3 (用), 80×10 (購), 70×5 (利), 50×6 (使)].
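A sketch of the weighting step: each vector element is multiplied by its character weight before the cosine of the first embodiment is applied. The weights are the illustrative values above; treating an undefined weight as 1 is an assumed default.

```python
WEIGHTS = {'入': 8, '用': 3, '購': 10, '利': 5, '使': 6}
DEFAULT_WEIGHT = 1  # assumed preset value for characters with no weight defined

def apply_weights(vec: dict) -> dict:
    """Multiply each kanji frequency by that character's weight."""
    return {ch: freq * WEIGHTS.get(ch, DEFAULT_WEIGHT)
            for ch, freq in vec.items()}

weighted = apply_weights({'入': 170, '用': 160, '購': 100, '利': 80, '使': 80})
# {'入': 1360, '用': 480, '購': 1000, '利': 400, '使': 480}
# The weighted vectors are then compared with cosine() as in the first embodiment.
```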
[0016]
According to this embodiment, a loss of accuracy can be prevented by assigning small weights to ideographic characters that would degrade the similarity calculation, and the contribution of characters that help the calculation can be raised by assigning them large weights, so the accuracy of the similarity calculation is improved.
[0017]
Next, a fourth embodiment of the present invention will be described. In the fourth embodiment, degrees of association between ideographic characters are defined as in FIG. 7, and in the similarity calculation step of any of the first to third embodiments, co-occurring ideographic characters that are not shared by the two words being compared are also used in the similarity calculation. For example, the degree of association between the ideographic characters "使" and "用" shown in FIG. 7 is defined as 8; since "使" and "用" are strongly associated characters, the similarity of their frequencies can be taken into account when the similarity is calculated.
[0018]
Here, the degree of association between ideographic characters can be set by displaying the two characters whose association is to be defined on a display device and receiving the input of the degree of association between them. For example, the degree of association between ideographic characters can be defined or corrected by displaying an association editor such as the one shown in FIG. 8 and receiving settings from the user via the input device.
[0019]
Any formula may be used for the similarity in this embodiment, as long as it takes the degrees of association between different ideographic characters into account. For example, a similarity formula for words W1 and W2 can be stated with the i-th co-occurring character of W1 as Ci, the j-th co-occurring character of W2 as Cj, the degree of association between Ci and Cj as Rel(Ci, Cj), the co-occurrence vector of W1 as X = {x1, x2, …, xI}, and the co-occurrence vector of W2 as Y = {y1, y2, …, yJ}, giving
[0020]
(Equation 2) [equation image not reproduced; a formula over Rel(Ci, Cj) and the elements of X and Y]
[0021]
as the calculation formula.
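Because the equation image is not reproduced, the following soft-cosine-style formula is only one plausible reading consistent with the symbols defined above: cross terms weighted by Rel(Ci, Cj) let non-shared but related characters contribute. The Rel table, its scaling to [0, 1], and Rel(C, C) = 1 are all assumptions of this sketch, and the exact published formula may differ.

```python
import math

REL = {('使', '用'): 8}  # illustrative degree of association from FIG. 7

def rel(ci: str, cj: str) -> float:
    if ci == cj:
        return 1.0  # assumed: every character is fully related to itself
    return REL.get((ci, cj), REL.get((cj, ci), 0.0)) / 10.0  # assumed scale

def rel_similarity(x: dict, y: dict) -> float:
    """One plausible Rel-weighted similarity between two vectors."""
    def cross(a: dict, b: dict) -> float:
        return sum(av * bv * rel(ac, bc)
                   for ac, av in a.items() for bc, bv in b.items())
    denom = math.sqrt(cross(x, x)) * math.sqrt(cross(y, y))
    return cross(x, y) / denom if denom else 0.0

# Nonzero similarity despite no shared kanji, because 使 and 用 are related:
print(rel_similarity({'使': 10}, {'用': 9}))  # 0.8
```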
[0022]
According to this embodiment, the similarity can be calculated even when the two words being compared share few or no co-occurring ideographic characters, so the similarity between a word with few co-occurrence targets, such as a low-frequency word, and other words can be calculated.
Next, a fifth embodiment of the present invention will be described. In the fifth embodiment, the similarity is calculated for every pair among the words targeted for similarity calculation using the method of any of the first to fourth embodiments, and data consisting of each pair of words and the similarity between them is stored in a storage device such as a memory or a hard disk, as shown in FIG. 9. Then, when a word of interest for which similar words are to be extracted is specified, the stored data is searched for the pairs containing that word, and words with a high similarity to it are extracted according to a criterion such as a specified number of words or a similarity at or above a specified threshold.
[0023]
According to this embodiment, when a word of interest is specified, there is no need to calculate similarities between words, so words similar to the word of interest can be extracted quickly. The data consisting of word pairs and their similarities must be updated as document data is updated and new words appear, but because the similarities are calculated with co-occurrence vectors whose elements are ideographic-character frequencies, this data can be created quickly.
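A sketch of the precomputation and lookup, reusing the cosine() function and example vectors from the earlier sketch; an in-memory dict keyed by unordered word pairs stands in for the stored format of FIG. 9.

```python
from itertools import combinations

def build_pair_table(vectors: dict) -> dict:
    """Precompute the similarity of every pair of target words once."""
    return {frozenset(pair): cosine(vectors[pair[0]], vectors[pair[1]])
            for pair in combinations(vectors, 2)}

def similar_words(table: dict, word: str, threshold: float = 0.9) -> list:
    """Answer a query by lookup alone; no similarity is recomputed."""
    hits = [(next(iter(pair - {word})), sim)
            for pair, sim in table.items()
            if word in pair and sim >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])

table = build_pair_table({'パソコン': pasokon, 'PC': pc})
print(similar_words(table, 'パソコン'))  # [('PC', 0.996...)]
```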
[0024]
Next, a sixth embodiment of the present invention will be described. In the sixth embodiment, synonyms of each word of interest are extracted by the methods of the first to fifth embodiments, and each pair of a user-specified word of interest and its synonym is stored, together with context information about the contexts in which the synonymy holds, in a storage device such as a memory or a hard disk as a synonym dictionary like the one shown in FIG. 10. Here, the context information is defined as a context vector whose elements are the frequencies of the ideographic characters appearing in that context; for synonyms whose synonymy always holds regardless of context, no context information need be stored. This context information is used to determine the sense of a polysemous word when several senses exist. As an example, take the pair of the word of interest "米国" (United States) and its synonym "米", and the pair of the word of interest "お米" (rice) and its synonym "米". When "米" is to be replaced with "米国" or "お米" using the synonym dictionary, the context vectors recording the contexts in which each synonymy holds are
「米国−米」 ⇒ [100 (旅), 80 (政), 70 (府), 60 (発), 55 (表), …]
「お米−米」 ⇒ [200 (食), 150 (買), 80 (炊), 30 (作), 20 (育), …].
If the context vector formed from the ideographic characters in the context under consideration is [2 (食), 1 (炊), 1 (飯), …], then comparing it with the context vectors of 「米国−米」 and 「お米−米」 shows that it is closer to the context vector of 「お米−米」, so the word to substitute here is "お米".
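A sketch of this sense selection, again reusing cosine(): the kanji context vector observed around the ambiguous "米" is compared with the stored context vector of each candidate pair, and the closer candidate wins. The stored vectors are the example values from the text.

```python
CONTEXTS = {
    '米国': {'旅': 100, '政': 80, '府': 70, '発': 60, '表': 55},
    'お米': {'食': 200, '買': 150, '炊': 80, '作': 30, '育': 20},
}

def disambiguate(observed: dict) -> str:
    """Pick the expansion of 米 whose stored context vector is
    closest, by cosine, to the kanji observed around the word."""
    return max(CONTEXTS, key=lambda sense: cosine(CONTEXTS[sense], observed))

print(disambiguate({'食': 2, '炊': 1, '飯': 1}))  # お米
```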
According to this embodiment, the appropriate synonym can be selected automatically for synonym pairs involving polysemous words. Moreover, since the context vectors have ideographic-character frequencies as their elements, their dimensionality is small and they can be stored in a storage device such as a memory or a hard disk.
[0025]
[Effects of the Invention]
According to the present invention, using co-occurrence vectors whose elements are ideographic-character frequencies for inter-word similarity calculation greatly reduces the dimensionality of the vectors compared with vectors whose elements are word frequencies, which speeds up the similarity calculation and reduces the size of the vectors. Also, co-occurring words that were treated as distinct because their written forms differ despite similar meanings come to share characters when ideographic characters are used, and this increase in shared co-occurrence targets improves the accuracy of the similarity calculation. In addition, using ideographic-character frequencies as the elements of the context vectors used for word-sense determination keeps the context vectors small, so they can be stored in a storage device such as a memory or a hard disk.
[Brief description of the drawings]
FIG. 1 is a flowchart showing a procedure for calculating an inter-word similarity according to the first embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of the frequency of ideographic characters co-occurring with a word of interest.
FIG. 3 is a diagram illustrating a configuration of a document processing apparatus.
FIG. 4 is a diagram showing a screen displaying a word similar to a word of interest.
FIG. 5 is a flowchart illustrating a procedure for calculating a similarity between words according to the second embodiment of the present invention.
FIG. 6 is a diagram showing a screen for correcting a weight indicating a degree of contribution of an ideographic character to similarity calculation.
FIG. 7 is a diagram showing a definition example of a degree of association between ideographic characters.
FIG. 8 is a diagram showing a screen for correcting the degree of association between ideographic characters.
FIG. 9 is a diagram showing a storage format of similarity between words.
FIG. 10 is a diagram illustrating a set of words that are synonyms and a storage format of context information in which a synonymous relationship is established.

Claims (6)

[Claim 1] A program for causing a computer to carry out an inter-word similarity calculation method comprising the steps of:
reading document data from a storage means;
dividing the document data into words;
extracting, from the word data, words having a co-occurrence relationship with a word of interest;
decomposing the extracted words into ideographic characters and extracting the co-occurrence relationships between the word of interest and the ideographic characters;
calculating the similarity between words from the similarity of the distributions of the extracted co-occurrence relationships; and
outputting, using the calculated similarity, words having a high similarity to the word of interest.

[Claim 2] A program for causing a computer to carry out an inter-word similarity calculation method comprising the steps of:
reading document data from a storage means;
dividing the document data into words;
extracting, from the word data, ideographic characters having a co-occurrence relationship with a word of interest;
calculating the similarity between words from the similarity of the distributions of the extracted co-occurrence relationships; and
outputting, using the calculated similarity, words having a high similarity to the word of interest.

[Claim 3] The program according to claim 1 or 2, further comprising the step of displaying ideographic characters on a display means and receiving the input or correction of a weight value for each ideographic character, wherein the similarity calculation step uses the weight of each ideographic character.

[Claim 4] The program according to any one of claims 1 to 3, further comprising the step of displaying ideographic characters on a display means and receiving the input or correction of degrees of association between the ideographic characters, wherein the similarity calculation step uses the degrees of association between the ideographic characters.

[Claim 5] A document processing apparatus comprising a storage means for storing information on the distribution of co-occurrence relationships between each of a plurality of words and a plurality of ideographic characters, and an input means, wherein, upon receiving one of the plurality of words via the input means, the apparatus retrieves from the storage means co-occurrence distributions similar to the co-occurrence distribution of that word and displays the words having the retrieved co-occurrence distributions.

[Claim 6] The document processing apparatus according to claim 5, comprising a storage means for storing information on co-occurrence distributions consisting of the frequencies of the ideographic characters corresponding to each word sense, and an input means, wherein the sense of a polysemous word having a plurality of senses, input via the input means, is determined from the similarity between the frequencies of the ideographic characters contained in the context in which the word appears and the stored co-occurrence distributions.
JP2003120899A 2003-04-25 2003-04-25 Document processing device Expired - Fee Related JP4055638B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003120899A JP4055638B2 (en) 2003-04-25 2003-04-25 Document processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003120899A JP4055638B2 (en) 2003-04-25 2003-04-25 Document processing device

Publications (2)

Publication Number Publication Date
JP2004326479A true JP2004326479A (en) 2004-11-18
JP4055638B2 JP4055638B2 (en) 2008-03-05

Family

ID=33499601

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003120899A Expired - Fee Related JP4055638B2 (en) 2003-04-25 2003-04-25 Document processing device

Country Status (1)

Country Link
JP (1) JP4055638B2 (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010152561A (en) * 2008-12-24 2010-07-08 Toshiba Corp Similar expression extraction device, server unit, and program
JP2011175574A (en) * 2010-02-25 2011-09-08 Nippon Hoso Kyokai <Nhk> Document simplification device, simplification rule table creation device, and program
JP2013105210A (en) * 2011-11-10 2013-05-30 Nippon Telegr & Teleph Corp <Ntt> Device and method for estimating word attribute, and program
JP2013109125A (en) * 2011-11-21 2013-06-06 Nippon Telegr & Teleph Corp <Ntt> Word addition device, word addition method and program
WO2016101133A1 (en) * 2014-12-23 2016-06-30 Microsoft Technology Licensing, Llc Surfacing relationships between datasets
US11256687B2 (en) 2014-12-23 2022-02-22 Microsoft Technology Licensing, Llc Surfacing relationships between datasets
JP2017151665A (en) * 2016-02-24 2017-08-31 日本電気株式会社 Information processing device, information processing method, and program
JP2019133478A (en) * 2018-01-31 2019-08-08 株式会社Fronteo Computing system

Also Published As

Publication number Publication date
JP4055638B2 (en) 2008-03-05

Similar Documents

Publication Publication Date Title
US7752032B2 (en) Apparatus and method for translating Japanese into Chinese using a thesaurus and similarity measurements, and computer program therefor
WO2009123260A1 (en) Cooccurrence dictionary creating system and scoring system
JP2011118872A (en) Method and device for determining category of unregistered word
US9164981B2 (en) Information processing apparatus, information processing method, and program
JP2005301856A (en) Method and program for document retrieval, and document retrieving device executing the same
JP4055638B2 (en) Document processing device
JP2009151390A (en) Information analyzing device and information analyzing program
JP2003263441A (en) Keyword determination database preparing method, keyword determining method, device, program and recording medium
JP4361299B2 (en) Evaluation expression extraction apparatus, program, and storage medium
JP4552401B2 (en) Document processing apparatus and method
JP2004334341A (en) Document retrieval system, document retrieval method, and recording medium
JP2011129006A (en) Semantic classification device, semantic classification method, and semantic classification program
JP2003108571A (en) Document summary device, control method of document summary device, control program of document summary device and recording medium
JP6181890B2 (en) Literature analysis apparatus, literature analysis method and program
JP4426893B2 (en) Document search method, document search program, and document search apparatus for executing the same
JP2008282328A (en) Text sorting device, text sorting method, text sort program, and recording medium with its program recorded thereon
JP2005267397A (en) Phrase classification system, phrase classification method and phrase classification program
JP4985096B2 (en) Document analysis system, document analysis method, and computer program
JP2010140262A (en) Word and phrase input support device and program
JP2005025555A (en) Thesaurus construction system, thesaurus construction method, program for executing the method, and storage medium with the program stored thereon
JP2004152041A (en) Program, recording medium and apparatus for extracting key phrase
WO2022137440A1 (en) Search system, search method, and computer program
JP2001290826A (en) Device and method for document classification and recording medium with recorded document classifying program
JP2007102723A (en) Document retrieval device, document retrieval method and document retrieval program
JP2003178057A (en) Phrase producing device, phrase producing method, and program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20050916

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20060420

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20070703

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20070823

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20070911

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20071031

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20071120

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20071203

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20101221

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees