JP2004326479A - Similarity calculating program and device between words - Google Patents

Similarity calculating program and device between words

Info

Publication number
JP2004326479A
JP2004326479A (application number JP2003120899A)
Authority
JP
Japan
Prior art keywords
word
similarity
words
occurrence
ideographic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003120899A
Other languages
Japanese (ja)
Other versions
JP4055638B2 (en)
Inventor
Naoto Akira
Yasutsugu Morimoto
Atsuko Koizumi
Kazutake Kurenishi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Priority to JP2003120899A priority Critical patent/JP4055638B2/en
Publication of JP2004326479A publication Critical patent/JP2004326479A/en
Application granted granted Critical
Publication of JP4055638B2 publication Critical patent/JP4055638B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Abstract

PROBLEM TO BE SOLVED: To calculate the similarity between words from document data quickly and with high precision.

SOLUTION: In a program for calculating the similarity between words from document data, a co-occurrence vector whose elements are the frequencies of the ideographic characters having a co-occurrence relationship with each word of interest (the words targeted for similarity calculation) is generated from the document data and stored in a storage means, and the similarity between words of interest is calculated from the similarity of their co-occurrence vectors. Because the co-occurrence vectors have ideographic-character frequencies as their elements, the similarity between words can be calculated quickly.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a program and an apparatus for calculating the similarity between two words.
[0002]
[Prior art]
Conventionally, there is a technique for calculating the similarity between words in which the features of a word are defined by a co-occurrence vector whose elements are the frequencies of the words co-occurring with that word, and the similarity between two words is calculated from the similarity of their co-occurrence vectors (see, for example, Patent Document 1).
[Patent Document 1] Japanese Patent Application Laid-Open No. 2000-137718
[Problems to be Solved by the Invention]
The above conventional technique calculates the similarity between words from the similarity of co-occurrence vectors whose elements are word frequencies. When the vocabulary of words that make up the co-occurrence vectors is large, the number of dimensions of each vector, which equals that vocabulary size, also becomes large; similarity calculation with such vectors then takes a long time, making real-time processing difficult. The vectors are also large, making them difficult to store in a storage device such as a memory or a hard disk. Furthermore, a low-frequency word co-occurs with few words, so it shares few or no co-occurrence targets with other words, which makes its similarity difficult to calculate.
[0003]
An object of the present invention is to speed up the calculation of similarity between words so that the similarity between a word of interest and a plurality of other words can be calculated in real time, and to reduce the size of the co-occurrence vectors so that they can be stored in a storage device such as a memory or a hard disk.
Another object of the present invention is to improve the accuracy of calculating the similarity for low-frequency words.
[0004]
[Means for Solving the Problems]
The outline of the invention disclosed in the present application to achieve the above objects is as follows. The inter-word similarity calculation program of the present invention reads document data from a storage device, divides the document data into words by morphological analysis, extracts, for each word of interest whose similarity is to be calculated, the words having a co-occurrence relationship with that word, extracts ideographic characters from those co-occurring words, creates for each word of interest a co-occurrence vector whose elements are the frequencies of the ideographic characters co-occurring with it, and stores the vectors in a storage device such as a memory or a hard disk.
Next, the similarity between the word of interest, for which similar words are to be extracted, and each other word is calculated from the similarity of their co-occurrence vectors, and words with a high similarity to the word of interest are extracted.
[0005]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a flowchart showing the procedure for calculating the similarity between words according to the first embodiment of the present invention. First, document data is read from a storage device such as a memory or a hard disk (S11), the document data is divided into words by morphological analysis (S12), and, using the part-of-speech information of the words, the words having a co-occurrence relationship with each word of interest whose similarity is to be calculated are extracted (S13). Here, the words having a co-occurrence relationship with a word of interest include words in a dependency relation with it, words appearing in a document containing it, and words within a specified number of characters before and after it; any other word contained in the context in which the word of interest appears may also be used. For example, if the words of interest are nouns and the co-occurring words are the verbs in a dependency relation with them, the sentence "パソコンを起動する" ("start up the personal computer") yields the word of interest "パソコン" (personal computer) and the dependent word "起動する" (start up).
[0006]
Next, ideographic characters are extracted from the words having a co-occurrence relationship with each word of interest (S14), and by tallying the frequencies of those characters, a co-occurrence vector whose elements are the frequencies of the ideographic characters co-occurring with each word of interest is created (S15) and stored in a storage device such as a memory or a hard disk. Here the ideographic characters are kanji, but characters such as kana or alphanumerics may also be used, as long as they co-occur with a limited set of words of interest and characterize those words. The elements of the co-occurrence vector need not be raw frequencies, as long as they indicate the distribution of the ideographic characters. For example, the word-word co-occurrence pair "パソコン−起動する" yields the word-character co-occurrence pairs "パソコン−起" and "パソコン−動". From the example in FIG. 2, in which the ideographic characters co-occurring with each word of interest are extracted from a plurality of document data and their frequencies tallied, the word of interest "パソコン" yields the co-occurrence vector
[170 (入), 160 (用), 100 (購), 80 (利), 80 (使), 20 (動), 15 (起), …],
where the single kanji are drawn from co-occurring words (for example, 購 and 入 from 購入, "purchase").
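As a concrete illustration of steps S13 to S15, the following minimal Python sketch (not part of the patent) tallies kanji frequencies over the words co-occurring with one word of interest. It assumes the co-occurring words have already been extracted, e.g. by a morphological analyzer and dependency parser (not shown), and it selects kanji by Unicode range; the input list and resulting counts are toy values, not the figures of FIG. 2.

```python
from collections import Counter

def is_kanji(ch: str) -> bool:
    # CJK Unified Ideographs range; the patent's examples use kanji.
    return '\u4e00' <= ch <= '\u9fff'

def cooccurrence_vector(cooccurring_words: list[str]) -> Counter:
    """Build the kanji-frequency co-occurrence vector (S14-S15)
    for one word of interest from its co-occurring words (S13)."""
    counts = Counter()
    for word in cooccurring_words:
        counts.update(ch for ch in word if is_kanji(ch))
    return counts

# Hypothetical co-occurring words observed for パソコン:
words = ['購入'] * 100 + ['利用'] * 80 + ['使用'] * 80 + ['起動'] * 15
print(cooccurrence_vector(words).most_common(3))
# [('用', 160), ('購', 100), ('入', 100)] -- 用 is shared by 利用 and 使用
```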
[0007]
Next, the similarity between one word of interest, for which similar words are to be extracted, and each of the other words of interest is calculated from the similarity of the co-occurrence vectors (S16). For example, if the words of interest are "パソコン" (personal computer), "PC", "HDD", "メモリ" (memory), and "プリンタ" (printer), then "パソコン" is the single word of interest, the other words of interest are "PC", "HDD", "メモリ", and "プリンタ", and "PC", whose co-occurrence vector is similar, is the word that will be extracted. Words with a high similarity to the word of interest are then extracted according to a criterion such as a specified number of words or a similarity at or above a specified threshold (S17). Any formula may be used to calculate the similarity, as long as it captures the similarity of the co-occurrence vectors, such as the cosine distance, which expresses the angle between the co-occurrence vectors of the two words being compared. For example, calculating the similarity with the cosine distance from the co-occurrence vector of "パソコン", [170 (入), 160 (用), 100 (購), 80 (利), 80 (使)], and the co-occurrence vector of "PC", [140 (入), 120 (用), 80 (購), 70 (利), 50 (使)], gives
[0008]
(Equation 1)

$$\mathrm{sim}(X, Y) = \frac{\sum_{i} x_i\, y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}} \approx 0.997$$
[0009]
which is the similarity obtained. Here, each parenthesized item is an ideographic character co-occurring with the word of interest, and the number preceding it is the frequency with which that character co-occurs with the word of interest.
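The cosine computation of Equation 1 can be sketched as follows; the vectors are the ones given in the text, stored sparsely as kanji-to-frequency mappings. This is a generic cosine implementation, not code from the patent.

```python
import math

def cosine(x: dict, y: dict) -> float:
    """Cosine similarity of two sparse co-occurrence vectors (S16)."""
    dot = sum(v * y.get(k, 0) for k, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

pasokon = {'入': 170, '用': 160, '購': 100, '利': 80, '使': 80}
pc      = {'入': 140, '用': 120, '購': 80,  '利': 70, '使': 50}
print(round(cosine(pasokon, pc), 3))  # 0.997
```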
[0010]
Thus, in a document processing apparatus such as the text mining apparatus or document search apparatus whose configuration is shown in FIG. 3, words with a high similarity to a word of interest specified through an input device (304) such as a keyboard or mouse can be extracted from the stored document data (306); as shown in FIG. 4, this is used to display words similar to the word of interest, such as synonyms, near-synonyms, and related words, on a display device (303). The configuration of the present application is realized by reading programs (307, 308) stored in a storage device such as a hard disk into the memory (302) and having the CPU execute control.
[0011]
According to the present embodiment, the ideographic characters, kanji, are limited to at most 2,965 characters even if all JIS level-1 kanji are used. Using a co-occurrence vector whose elements are the frequencies of JIS level-1 kanji therefore greatly reduces the number of dimensions compared with using the frequencies of words, whose population is unknown and whose vocabulary is large, which speeds up the similarity calculation. Although it depends on the size of the document data used to create the vectors, a co-occurrence vector that has thousands to hundreds of thousands of dimensions when words are used is reduced to several hundred to 2,965 dimensions (when JIS level-1 kanji are used) with ideographic characters, so the processing speed with the cosine distance can be increased severalfold to several tens of times, and the size of the co-occurrence vector can be reduced by a factor of several to several tens.
[0012]
Also, by using kanji, each dimension of the co-occurrence vector can be identified directly by a character code, so the correspondence table between words and vector dimensions that is needed when words are used becomes unnecessary, reducing memory usage and increasing processing speed. Furthermore, words that are treated as distinct despite similar meanings, such as "利用" and "使用" (both meaning "use"), come to share vector elements when ideographic characters are used, since both contain "用". Because co-occurrence vector elements that would differ with word co-occurrence targets become partly shared with ideographic-character targets, similarity can be calculated even for low-frequency words that have few co-occurring ideographic characters.
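The character-code-to-dimension correspondence can be illustrated with a dense vector indexed directly by code point, so that no word-to-dimension table is kept; the use of the Unicode CJK block rather than the JIS code space is an assumption of this sketch.

```python
import array

KANJI_BASE = 0x4E00                   # start of the CJK Unified Ideographs block
KANJI_DIM = 0x9FFF - KANJI_BASE + 1   # fixed dimensionality, known in advance

def dense_vector(counts: dict) -> array.array:
    """Place each frequency at the index given by the character code
    itself; the code is the dimension, so no mapping table is stored."""
    v = array.array('I', [0]) * KANJI_DIM
    for ch, freq in counts.items():
        v[ord(ch) - KANJI_BASE] = freq
    return v

vec = dense_vector({'入': 170, '用': 160, '購': 100})
print(vec[ord('用') - KANJI_BASE])  # 160
```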
[0013]
Next, a second embodiment of the present invention will be described. In the second embodiment, as shown in the flowchart of FIG. 5, the step of extracting the words having a co-occurrence relationship with the word of interest in the first embodiment is omitted, and the ideographic characters co-occurring with the word of interest, such as characters within a specified number of characters before and after it or characters co-occurring in the same document, are extracted directly. For example, for the word "パソコン", the frequencies of the co-occurring ideographic characters "入, 用, 購, 利, 使" are obtained without first extracting the words that co-occur with "パソコン". According to this embodiment, since the step of extracting co-occurring words is omitted, the time required to create the co-occurrence vectors is reduced.
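A minimal sketch of this direct extraction, using a fixed character window around each occurrence of the word of interest; the window size and the one-sentence corpus are illustrative choices, not values from the patent.

```python
from collections import Counter

def window_kanji_vector(text: str, word: str, window: int = 10) -> Counter:
    """Collect kanji within `window` characters before and after each
    occurrence of `word`, skipping word extraction entirely (FIG. 5)."""
    counts = Counter()
    pos = text.find(word)
    while pos != -1:
        left = text[max(0, pos - window):pos]
        right = text[pos + len(word):pos + len(word) + window]
        counts.update(ch for ch in left + right
                      if '\u4e00' <= ch <= '\u9fff')
        pos = text.find(word, pos + 1)
    return counts

print(window_kanji_vector('新しいパソコンを購入して利用する。', 'パソコン'))
# Counter({'新': 1, '購': 1, '入': 1, '利': 1, '用': 1})
```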
[0014]
Next, a third embodiment of the present invention will be described. In the third embodiment, the step of calculating the similarity in either the first or second embodiment takes into account which ideographic characters contribute to inter-word similarity and which do not: a large weight is defined for characters that contribute to the similarity calculation and a small weight for characters that do not, each element of the co-occurrence vector is multiplied by its weight, and the similarity is then calculated with the weighted vectors as in the first embodiment. Here the weights are multiplied into the vector elements, but any method may be used as long as heavily weighted characters are reflected in the similarity calculation. The weight of an ideographic character is set by displaying the character whose weight is to be defined or corrected on a display device and receiving the input of its weight; for example, the weights can be defined with an ideographic-character weight editor such as the one shown in FIG. 6. For characters whose weight is undefined, a preset default value may be used.
[0015]
As an example, define the ideographic-character weights as "8 (入), 3 (用), 10 (購), 5 (利), 6 (使)" and let the co-occurrence vectors of "パソコン" and "PC" be
パソコン ⇒ [170 (入), 160 (用), 100 (購), 80 (利), 80 (使)]
PC ⇒ [140 (入), 120 (用), 80 (購), 70 (利), 50 (使)].
The co-occurrence vectors used for the similarity calculation then become
パソコン ⇒ [170×8 (入), 160×3 (用), 100×10 (購), 80×5 (利), 80×6 (使)]
PC ⇒ [140×8 (入), 120×3 (用), 80×10 (購), 70×5 (利), 50×6 (使)].
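A sketch of the weighting step: each vector element is multiplied by its character weight before the cosine of the first embodiment is applied. The weights are the illustrative values above; treating an undefined weight as 1 is an assumed default.

```python
WEIGHTS = {'入': 8, '用': 3, '購': 10, '利': 5, '使': 6}
DEFAULT_WEIGHT = 1  # assumed preset value for characters with no weight defined

def apply_weights(vec: dict) -> dict:
    """Multiply each kanji frequency by that character's weight."""
    return {ch: freq * WEIGHTS.get(ch, DEFAULT_WEIGHT)
            for ch, freq in vec.items()}

weighted = apply_weights({'入': 170, '用': 160, '購': 100, '利': 80, '使': 80})
# {'入': 1360, '用': 480, '購': 1000, '利': 400, '使': 480}
# The weighted vectors are then compared with cosine() as in the first embodiment.
```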
[0016]
According to this embodiment, a loss of accuracy can be prevented by assigning small weights to ideographic characters that would degrade the similarity calculation, and the contribution of characters that help the calculation can be raised by assigning them large weights, so the accuracy of the similarity calculation is improved.
[0017]
Next, a fourth embodiment of the present invention will be described. In the fourth embodiment, degrees of association between ideographic characters are defined as in FIG. 7, and in the similarity calculation step of any of the first to third embodiments, co-occurring ideographic characters that are not shared by the two words being compared are also used in the similarity calculation. For example, the degree of association between the ideographic characters "使" and "用" shown in FIG. 7 is defined as 8; since "使" and "用" are strongly associated characters, the similarity of their frequencies can be taken into account when the similarity is calculated.
[0018]
Here, the degree of association between ideographic characters can be set by displaying the two characters whose association is to be defined on a display device and receiving the input of the degree of association between them. For example, the degree of association between ideographic characters can be defined or corrected by displaying an association editor such as the one shown in FIG. 8 and receiving settings from the user via the input device.
[0019]
Any formula may be used for the similarity in this embodiment, as long as it takes the degrees of association between different ideographic characters into account. For example, a similarity formula for words W1 and W2 can be stated with the i-th co-occurring character of W1 as Ci, the j-th co-occurring character of W2 as Cj, the degree of association between Ci and Cj as Rel(Ci, Cj), the co-occurrence vector of W1 as X = {x1, x2, …, xI}, and the co-occurrence vector of W2 as Y = {y1, y2, …, yJ}, giving
[0020]
(Equation 2) [equation image not reproduced; a formula over Rel(Ci, Cj) and the elements of X and Y]
[0021]
as the calculation formula.
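Because the equation image is not reproduced, the following soft-cosine-style formula is only one plausible reading consistent with the symbols defined above: cross terms weighted by Rel(Ci, Cj) let non-shared but related characters contribute. The Rel table, its scaling to [0, 1], and Rel(C, C) = 1 are all assumptions of this sketch, and the exact published formula may differ.

```python
import math

REL = {('使', '用'): 8}  # illustrative degree of association from FIG. 7

def rel(ci: str, cj: str) -> float:
    if ci == cj:
        return 1.0  # assumed: every character is fully related to itself
    return REL.get((ci, cj), REL.get((cj, ci), 0.0)) / 10.0  # assumed scale

def rel_similarity(x: dict, y: dict) -> float:
    """One plausible Rel-weighted similarity between two vectors."""
    def cross(a: dict, b: dict) -> float:
        return sum(av * bv * rel(ac, bc)
                   for ac, av in a.items() for bc, bv in b.items())
    denom = math.sqrt(cross(x, x)) * math.sqrt(cross(y, y))
    return cross(x, y) / denom if denom else 0.0

# Nonzero similarity despite no shared kanji, because 使 and 用 are related:
print(rel_similarity({'使': 10}, {'用': 9}))  # 0.8
```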
[0022]
According to this embodiment, the similarity can be calculated even when the two words being compared share few or no co-occurring ideographic characters, so the similarity between a word with few co-occurrence targets, such as a low-frequency word, and other words can be calculated.
Next, a fifth embodiment of the present invention will be described. In the fifth embodiment, the similarity is calculated for every pair among the words targeted for similarity calculation using the method of any of the first to fourth embodiments, and data consisting of each pair of words and the similarity between them is stored in a storage device such as a memory or a hard disk, as shown in FIG. 9. Then, when a word of interest for which similar words are to be extracted is specified, the stored data is searched for the pairs containing that word, and words with a high similarity to it are extracted according to a criterion such as a specified number of words or a similarity at or above a specified threshold.
[0023]
According to this embodiment, when a word of interest is specified, there is no need to calculate similarities between words, so words similar to the word of interest can be extracted quickly. The data consisting of word pairs and their similarities must be updated as document data is updated and new words appear, but because the similarities are calculated with co-occurrence vectors whose elements are ideographic-character frequencies, this data can be created quickly.
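A sketch of the precomputation and lookup, reusing the cosine() function and example vectors from the earlier sketch; an in-memory dict keyed by unordered word pairs stands in for the stored format of FIG. 9.

```python
from itertools import combinations

def build_pair_table(vectors: dict) -> dict:
    """Precompute the similarity of every pair of target words once."""
    return {frozenset(pair): cosine(vectors[pair[0]], vectors[pair[1]])
            for pair in combinations(vectors, 2)}

def similar_words(table: dict, word: str, threshold: float = 0.9) -> list:
    """Answer a query by lookup alone; no similarity is recomputed."""
    hits = [(next(iter(pair - {word})), sim)
            for pair, sim in table.items()
            if word in pair and sim >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])

table = build_pair_table({'パソコン': pasokon, 'PC': pc})
print(similar_words(table, 'パソコン'))  # [('PC', 0.996...)]
```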
[0024]
Next, a sixth embodiment of the present invention will be described. In the sixth embodiment, synonyms of each word of interest are extracted by the methods of the first to fifth embodiments, and each pair of a user-specified word of interest and its synonym is stored, together with context information about the contexts in which the synonymy holds, in a storage device such as a memory or a hard disk as a synonym dictionary like the one shown in FIG. 10. Here, the context information is defined as a context vector whose elements are the frequencies of the ideographic characters appearing in that context; for synonyms whose synonymy always holds regardless of context, no context information need be stored. This context information is used to determine the sense of a polysemous word when several senses exist. As an example, take the pair of the word of interest "米国" (United States) and its synonym "米", and the pair of the word of interest "お米" (rice) and its synonym "米". When "米" is to be replaced with "米国" or "お米" using the synonym dictionary, the context vectors recording the contexts in which each synonymy holds are
「米国−米」 ⇒ [100 (旅), 80 (政), 70 (府), 60 (発), 55 (表), …]
「お米−米」 ⇒ [200 (食), 150 (買), 80 (炊), 30 (作), 20 (育), …].
If the context vector formed from the ideographic characters in the context under consideration is [2 (食), 1 (炊), 1 (飯), …], then comparing it with the context vectors of 「米国−米」 and 「お米−米」 shows that it is closer to the context vector of 「お米−米」, so the word to substitute here is "お米".
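A sketch of this sense selection, again reusing cosine(): the kanji context vector observed around the ambiguous "米" is compared with the stored context vector of each candidate pair, and the closer candidate wins. The stored vectors are the example values from the text.

```python
CONTEXTS = {
    '米国': {'旅': 100, '政': 80, '府': 70, '発': 60, '表': 55},
    'お米': {'食': 200, '買': 150, '炊': 80, '作': 30, '育': 20},
}

def disambiguate(observed: dict) -> str:
    """Pick the expansion of 米 whose stored context vector is
    closest, by cosine, to the kanji observed around the word."""
    return max(CONTEXTS, key=lambda sense: cosine(CONTEXTS[sense], observed))

print(disambiguate({'食': 2, '炊': 1, '飯': 1}))  # お米
```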
According to this embodiment, the appropriate synonym can be selected automatically for synonym pairs involving polysemous words. Moreover, since the context vectors have ideographic-character frequencies as their elements, their dimensionality is small and they can be stored in a storage device such as a memory or a hard disk.
[0025]
[Effects of the Invention]
According to the present invention, using co-occurrence vectors whose elements are ideographic-character frequencies for inter-word similarity calculation greatly reduces the dimensionality of the vectors compared with vectors whose elements are word frequencies, which speeds up the similarity calculation and reduces the size of the vectors. Also, co-occurring words that were treated as distinct because their written forms differ despite similar meanings come to share characters when ideographic characters are used, and this increase in shared co-occurrence targets improves the accuracy of the similarity calculation. In addition, using ideographic-character frequencies as the elements of the context vectors used for word-sense determination keeps the context vectors small, so they can be stored in a storage device such as a memory or a hard disk.
[Brief description of the drawings]
FIG. 1 is a flowchart showing a procedure for calculating an inter-word similarity according to the first embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of the frequency of ideographic characters co-occurring with a word of interest.
FIG. 3 is a diagram illustrating a configuration of a document processing apparatus.
FIG. 4 is a diagram showing a screen displaying a word similar to a word of interest.
FIG. 5 is a flowchart illustrating a procedure for calculating a similarity between words according to the second embodiment of the present invention.
FIG. 6 is a diagram showing a screen for correcting a weight indicating a degree of contribution of an ideographic character to similarity calculation.
FIG. 7 is a diagram showing a definition example of a degree of association between ideographic characters.
FIG. 8 is a diagram showing a screen for correcting the degree of association between ideographic characters.
FIG. 9 is a diagram showing a storage format of similarity between words.
FIG. 10 is a diagram illustrating a set of words that are synonyms and a storage format of context information in which a synonymous relationship is established.

Claims (6)

[Claim 1] A program for causing a computer to carry out an inter-word similarity calculation method comprising the steps of:
reading document data from a storage means;
dividing the document data into words;
extracting, from the word data, words having a co-occurrence relationship with a word of interest;
decomposing the extracted words into ideographic characters and extracting the co-occurrence relationships between the word of interest and the ideographic characters;
calculating the similarity between words from the similarity of the distributions of the extracted co-occurrence relationships; and
outputting, using the calculated similarity, words having a high similarity to the word of interest.

[Claim 2] A program for causing a computer to carry out an inter-word similarity calculation method comprising the steps of:
reading document data from a storage means;
dividing the document data into words;
extracting, from the word data, ideographic characters having a co-occurrence relationship with a word of interest;
calculating the similarity between words from the similarity of the distributions of the extracted co-occurrence relationships; and
outputting, using the calculated similarity, words having a high similarity to the word of interest.

[Claim 3] The program according to claim 1 or 2, further comprising the step of displaying ideographic characters on a display means and receiving the input or correction of a weight value for each ideographic character, wherein the similarity calculation step uses the weight of each ideographic character.

[Claim 4] The program according to any one of claims 1 to 3, further comprising the step of displaying ideographic characters on a display means and receiving the input or correction of degrees of association between the ideographic characters, wherein the similarity calculation step uses the degrees of association between the ideographic characters.

[Claim 5] A document processing apparatus comprising a storage means for storing information on the distribution of co-occurrence relationships between each of a plurality of words and a plurality of ideographic characters, and an input means, wherein, upon receiving one of the plurality of words via the input means, the apparatus retrieves from the storage means co-occurrence distributions similar to the co-occurrence distribution of that word and displays the words having the retrieved co-occurrence distributions.

[Claim 6] The document processing apparatus according to claim 5, comprising a storage means for storing information on co-occurrence distributions consisting of the frequencies of the ideographic characters corresponding to each word sense, and an input means, wherein the sense of a polysemous word having a plurality of senses, input via the input means, is determined from the similarity between the frequencies of the ideographic characters contained in the context in which the word appears and the stored co-occurrence distributions.
JP2003120899A 2003-04-25 2003-04-25 Document processing device Expired - Fee Related JP4055638B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003120899A JP4055638B2 (en) 2003-04-25 2003-04-25 Document processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003120899A JP4055638B2 (en) 2003-04-25 2003-04-25 Document processing device

Publications (2)

Publication Number Publication Date
JP2004326479A true JP2004326479A (en) 2004-11-18
JP4055638B2 JP4055638B2 (en) 2008-03-05

Family

ID=33499601

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003120899A Expired - Fee Related JP4055638B2 (en) 2003-04-25 2003-04-25 Document processing device

Country Status (1)

Country Link
JP (1) JP4055638B2 (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010152561A (en) * 2008-12-24 2010-07-08 Toshiba Corp Similar expression extraction device, server unit, and program
JP2011175574A (en) * 2010-02-25 2011-09-08 Nippon Hoso Kyokai <Nhk> Document simplification device, simplification rule table creation device, and program
JP2013105210A (en) * 2011-11-10 2013-05-30 Nippon Telegr & Teleph Corp <Ntt> Device and method for estimating word attribute, and program
JP2013109125A (en) * 2011-11-21 2013-06-06 Nippon Telegr & Teleph Corp <Ntt> Word addition device, word addition method and program
WO2016101133A1 (en) * 2014-12-23 2016-06-30 Microsoft Technology Licensing, Llc Surfacing relationships between datasets
US11256687B2 (en) 2014-12-23 2022-02-22 Microsoft Technology Licensing, Llc Surfacing relationships between datasets
JP2017151665A (en) * 2016-02-24 2017-08-31 日本電気株式会社 Information processing device, information processing method, and program
JP2019133478A (en) * 2018-01-31 2019-08-08 株式会社Fronteo Computing system

Also Published As

Publication number Publication date
JP4055638B2 (en) 2008-03-05

Similar Documents

Publication Publication Date Title
US7752032B2 (en) Apparatus and method for translating Japanese into Chinese using a thesaurus and similarity measurements, and computer program therefor
WO2009123260A1 (en) Cooccurrence dictionary creating system and scoring system
JP2011118872A (en) Method and device for determining category of unregistered word
US9164981B2 (en) Information processing apparatus, information processing method, and program
JP2005301856A (en) Method and program for document retrieval, and document retrieving device executing the same
JP4055638B2 (en) Document processing device
JP2009151390A (en) Information analyzing device and information analyzing program
JP2003263441A (en) Keyword determination database preparing method, keyword determining method, device, program and recording medium
JP4361299B2 (en) Evaluation expression extraction apparatus, program, and storage medium
JP4552401B2 (en) Document processing apparatus and method
JP2004334341A (en) Document retrieval system, document retrieval method, and recording medium
JP2011129006A (en) Semantic classification device, semantic classification method, and semantic classification program
JP2003108571A (en) Document summary device, control method of document summary device, control program of document summary device and recording medium
JP6181890B2 (en) Literature analysis apparatus, literature analysis method and program
JP4426893B2 (en) Document search method, document search program, and document search apparatus for executing the same
JP2008282328A (en) Text sorting device, text sorting method, text sort program, and recording medium with its program recorded thereon
JP2005267397A (en) Phrase classification system, phrase classification method and phrase classification program
JP4985096B2 (en) Document analysis system, document analysis method, and computer program
JP2010140262A (en) Word and phrase input support device and program
JP2005025555A (en) Thesaurus construction system, thesaurus construction method, program for executing the method, and storage medium with the program stored thereon
JP2004152041A (en) Program, recording medium and apparatus for extracting key phrase
WO2022137440A1 (en) Search system, search method, and computer program
JP2001290826A (en) Device and method for document classification and recording medium with recorded document classifying program
JP2007102723A (en) Document retrieval device, document retrieval method and document retrieval program
JP2003178057A (en) Phrase producing device, phrase producing method, and program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20050916

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20060420

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20070703

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20070823

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20070911

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20071031

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20071120

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20071203

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20101221

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees