JP4468608B2

JP4468608B2 - Semantic information estimation device, semantic information estimation method, and program

Info

Publication number: JP4468608B2
Application number: JP2001131379A
Authority: JP
Inventors: 雅子望主
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-04-27
Filing date: 2001-04-27
Publication date: 2010-05-26
Anticipated expiration: 2021-04-27
Also published as: JP2002328943A

Description

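[0001]
[Technical Field of the Invention]
The present invention relates to a semantic information estimation device, a semantic information estimation method, and a program.
[0002]
[Prior Art]
Word dictionaries used in natural language processing register, in addition to each word's notation, reading, and conjugation, information about the word's meaning (classification). Semantic information registered in this way is highly useful for document retrieval and document classification.
[0003]
Keyword search is the usual way to retrieve a given document from a large document database. In practice, a user may search with a specific keyword such as "○× Electric" (○×電気), a manufacturer that makes and sells office automation (OA) equipment, or with a keyword for a superordinate concept of "○× Electric", such as "OA equipment manufacturer".
[0004]
Ordinary documents, however, contain proper nouns such as the company name "○× Electric" but usually do not state the term for its superordinate concept ("OA equipment manufacturer").
[0005]
One way to address this problem is to attach semantic information, such as the industry a company belongs to, to proper nouns such as company names. Attaching such meaning and knowledge to a word dictionary, or building such a dictionary, demands considerable labor, knowledge, and skill, so various techniques for doing so have been considered.
[0006]
As an efficient way to acquire knowledge information, Japanese Unexamined Patent Publication No. H6-223109 discloses analyzing natural language sentences and registering specific targets in a database.
[0007]
As a way to estimate words from syntactic information, Japanese Unexamined Patent Publication No. H7-85071 discloses extracting information, including relations between words, in a fixed format using syntactic information. In doing so, that technique estimates unregistered words and their meanings. For example, for the expression "○× Electric developed a sputtering device" (○×電気がスパッタリング装置を開発した), even if "○× Electric" is an unknown word, it is recognized as a device developer from the co-occurring expression "developed a device".
[0008]
[Problems to Be Solved by the Invention]
However, the technique of Japanese Unexamined Patent Publication No. H6-223109 presupposes that some semantic information is already registered in the word dictionary. The meanings of the target domain and the expressions associated with them must be defined in advance, and this setup itself takes a great deal of labor.
[0009]
The technique of Japanese Unexamined Patent Publication No. H7-85071 acquires information through detailed parsing with parsing rules, but parsing is itself a difficult technique, and the analysis can fail when a word is unregistered or has no semantic information. Moreover, this method yields the same result for a sentence such as "○× Electric developed a communication device", so it cannot provide detailed semantic information such as what kind of device the developer develops.
[0010]
An object of the present invention is to assign semantic information automatically to words registered in a word dictionary, without manual work or techniques such as parsing.
[0011]
Another object of the present invention is to acquire automatically, from the information in an input document, words not registered in the word dictionary together with their semantic information.
[0012]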
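[Means for Solving the Problems]
The semantic information estimation device of the invention according to claim 1 comprises: word dictionary storage means for storing a plurality of word notations; meaning estimation means for searching the notations of words other than one word stored in the word dictionary storage means for a common character string shared with a character string forming at least part of the notation of that one word and, when the common character string is found at least a predetermined number of times, estimating the common character string to be the semantic information of the words whose notations contain it; semantic information storage means for storing the semantic information estimated by the meaning estimation means in the word dictionary storage means in association with the notations of the words containing the common character string; co-occurrence pattern dictionary storage means for storing a first co-occurrence pattern, which defines a relation between co-occurring words, in association with the semantic information of each word of the first co-occurrence pattern, and for storing a second co-occurrence pattern, which defines a hierarchical or associative relation between any word of the first co-occurrence pattern and another word, in association with the semantic information of that other word; document receiving means for receiving input of a document from an input unit; and co-occurrence information extraction means for extracting, from the document received by the document receiving means, the words that satisfy the first co-occurrence pattern stored in the co-occurrence pattern dictionary storage means. The semantic information storage means further stores, in the word dictionary storage means, the notation of each word extracted by the co-occurrence information extraction means in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means and, when a hierarchical or associative relation between any of the extracted words and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, stores the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.
[0013]
Because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple. Semantic information can thus be assigned automatically to words registered in the word dictionary without manual work or techniques such as parsing. Using co-occurrence patterns that appear in a document also makes it possible to acquire automatically words not registered in the word dictionary together with their semantic information. Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Furthermore, associated co-occurrence patterns link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.
[0014]
In the invention according to claim 2, in the semantic information estimation device according to claim 1, the meaning estimation means searches for the common character string in the trailing or leading character strings of the notations of the words other than the one word.
[0015]
Restricting the position of the common character string thus enables more accurate meaning estimation.
[0026]
The semantic information estimation method of the invention according to claim 3 comprises: a meaning estimation step in which meaning estimation means searches the notations of words other than one word, stored in word dictionary storage means that stores a plurality of word notations, for a common character string shared with a character string forming at least part of the notation of that one word and, when the common character string is found at least a predetermined number of times, estimates the common character string to be the semantic information of the words whose notations contain it; a first semantic information storage step in which semantic information storage means stores the semantic information estimated in the meaning estimation step in the word dictionary storage means in association with the notations of the words containing the common character string; a document receiving step in which document receiving means receives input of a document from an input unit; a co-occurrence information extraction step in which co-occurrence information extraction means extracts, from the document received in the document receiving step, the words that satisfy a first co-occurrence pattern stored in co-occurrence pattern dictionary storage means, the co-occurrence pattern dictionary storage means storing the first co-occurrence pattern, which defines a relation between co-occurring words, in association with the semantic information of each word of the first co-occurrence pattern, and storing a second co-occurrence pattern, which defines a hierarchical or associative relation between any word of the first co-occurrence pattern and another word, in association with the semantic information of that other word; and a second semantic information storage step in which the semantic information storage means further stores, in the word dictionary storage means, the notation of each word extracted in the co-occurrence information extraction step in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means and, when a hierarchical or associative relation between any of the extracted words and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, stores the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.
[0027]
Because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple. Semantic information can thus be assigned automatically to words registered in the word dictionary without manual work or techniques such as parsing. Using co-occurrence patterns that appear in a document also makes it possible to acquire automatically words not registered in the word dictionary together with their semantic information. Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Furthermore, associated co-occurrence patterns link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.
[0028]
In the invention according to claim 4, in the semantic information estimation method according to claim 3, the meaning estimation means, in the meaning estimation step, searches for the common character string in the trailing or leading character strings of the notations of the words other than the one word.
[0029]
Restricting the position of the common character string thus enables more accurate meaning estimation.
[0040]
The program of the invention according to claim 5 is a program for causing a computer to estimate the semantic information of words. It causes the computer to execute: a meaning estimation function of searching the notations of words other than one word, stored in word dictionary storage means that stores a plurality of word notations, for a common character string shared with a character string forming at least part of the notation of that one word and, when the common character string is found at least a predetermined number of times, estimating the common character string to be the semantic information of the words whose notations contain it; a first semantic information storage function of storing the semantic information estimated by the meaning estimation function in the word dictionary storage means in association with the notations of the words containing the common character string; a document receiving function of receiving input of a document from an input unit; a co-occurrence information extraction function of extracting, from the document received by the document receiving function, the words that satisfy a first co-occurrence pattern stored in co-occurrence pattern dictionary storage means, the co-occurrence pattern dictionary storage means storing the first co-occurrence pattern, which defines a relation between co-occurring words, in association with the semantic information of each word of the first co-occurrence pattern, and storing a second co-occurrence pattern, which defines a hierarchical or associative relation between any word of the first co-occurrence pattern and another word, in association with the semantic information of that other word; and a second semantic information storage function of storing, in the word dictionary storage means, the notation of each word extracted by the co-occurrence information extraction function in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means and, when a hierarchical or associative relation between any of the extracted words and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, storing the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.
[0041]
Because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple. Semantic information can thus be assigned automatically to words registered in the word dictionary without manual work or techniques such as parsing. Using co-occurrence patterns that appear in a document also makes it possible to acquire automatically words not registered in the word dictionary together with their semantic information. Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Furthermore, associated co-occurrence patterns link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.
[0042]
In the invention according to claim 6, in the program according to claim 5, the meaning estimation function searches for the common character string in the trailing or leading character strings of the notations of the words other than the one word.
[0043]
Restricting the position of the common character string thus enables more accurate meaning estimation.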
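[0056]
[Embodiments of the Invention]
A first embodiment of the present invention is described with reference to FIGS. 1 to 5.
[0057]
FIG. 1 is a block diagram schematically showing the hardware configuration of the semantic information estimation device 1. As shown in FIG. 1, the device 1 has a CPU (Central Processing Unit) 2 that centrally controls each part of the device. A ROM (Read Only Memory) 3, a read-only memory storing the BIOS and the like, and a RAM (Random Access Memory) 4, which stores various data rewritably, are connected to the CPU 2 by a bus 5. Also connected to the bus 5, via I/O not shown, are an HDD (Hard Disk Drive) 6 serving as external storage, a CD-ROM drive 8 that reads a CD (Compact Disc)-ROM 7, a communication control device 10 that handles communication between the device 1 and a network 9, an input device 11 such as a keyboard or mouse that functions as the input unit, and an output device 12 such as a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display).
[0058]
Because the RAM 4 stores various data rewritably, it serves as the work area of the CPU 2, acting for example as an input buffer and an analysis buffer.
[0059]
The HDD 6 stores program files containing various programs and also the word dictionary 13, which holds word notations and semantic information. As shown in FIG. 2, the word dictionary 13 of this embodiment initially holds only the names of companies and the like (word notations).
[0060]
The CD-ROM 7 shown in FIG. 1 embodies the storage medium of this invention and stores a predetermined program. The CPU 2 reads the program stored on the CD-ROM 7 with the CD-ROM drive 8 and installs it on the HDD 6. The device 1 is then ready to perform the various kinds of processing described below.
[0061]
The storage medium is not limited to the CD-ROM 7. Media of various types can be used, including optical discs such as DVDs, magneto-optical discs, magnetic disks such as floppy disks, and semiconductor memory. The program may also be downloaded from the network 9, such as the Internet, via the communication control device 10 and installed on the HDD 6. In that case, the storage device holding the program on the sending server is also a storage medium of this invention. The program may run on a predetermined OS (Operating System), in which case it may hand part of the processing described below over to the OS, or it may be included as part of a group of program files making up a predetermined application, such as word processing software, or the OS.
[0062]
Next, the processing that the CPU 2 of the device 1 executes under the program is described. The device 1 of the present invention, in outline, estimates the semantic information of words. As shown in FIG. 3, when the CPU 2 operates under the program, a semantic information estimation unit 14, which estimates the semantic information of words on the basis of the word dictionary 13, is formed in the device 1.
[0063]
The flow of the semantic information estimation processing in the semantic information estimation unit 14 of this embodiment is described with reference to FIG. 4. While an unprocessed word remains among the words registered in the word dictionary 13 (Y in step S1), the process proceeds to step S2. There the word is compared, as a character string, with the other words in the word dictionary 13. If a common character string is found, that string is stored, the counter associated with it is incremented by one, and the matched portion in the word dictionary 13 is marked. If the portion being compared already carries a mark that exactly coincides with the common character string, it is not counted, so as to avoid double counting. This step performs the function of the meaning estimation means.
[0064]
When no unprocessed word remains, that is, when every word registered in the word dictionary 13 has been processed (N in step S1), the process proceeds to step S3. When the count of a common character string is at or above a fixed number, the common character string is taken as semantic information and assigned to the words in the word dictionary 13 that contain it. The fixed number can be changed according to, for example, the number of words in the word dictionary 13. This step performs the function of the semantic information storage means.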
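[0065]
The processing above is illustrated with a concrete example, in which the words of the word dictionary 13 shown in FIG. 2 are processed in order.
[0066]
First, "ＡＡ商事" (AA Trading) is processed. Character strings that it shares with the other words in the word dictionary 13 are searched for. "ＡＡ商事" shares the string "商事" (trading) with "ＢＢＢ商事" and "ＺＺＺ商事", so "商事" is stored as a common character string with a count of 2. The "商事" portions of "ＡＡ商事", "ＢＢＢ商事", and "ＺＺＺ商事" in the word dictionary 13 are marked as counted.
[0067]
The following words "ＢＢＢ商事" and "ＺＺＺ商事" also share "商事", but it has already been counted, so the counter is not incremented.
[0068]
Next, "ＸＹビール" (XY Beer) is processed. Character strings that it shares with the other words in the word dictionary 13 are searched for. "ＸＹビール" shares the string "ビール" (beer) with "ＹＹＹビール", so "ビール" is stored as a common character string with a count of 1. The "ビール" portions of "ＸＹビール" and "ＹＹＹビール" in the word dictionary 13 are marked as counted.
[0069]
The following word "ＹＹＹビール" also shares "ビール", but it has already been counted, so the counter is not incremented.
[0070]
When all words registered in the word dictionary 13 have been processed, the counts of the common character strings are examined. Those with a count of, for example, 1 or more are taken as semantic information, and these common character strings ("商事", "ビール") are assigned in association with the notations of the corresponding words in the word dictionary 13.
[0071]
FIG. 5 shows an example of the word dictionary 13 after the semantic information estimation unit 14 has assigned semantic information. As shown in FIG. 5, each company name has been given its industry field as semantic information.
[0072]
Because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple, so semantic information can be assigned automatically to the words registered in the word dictionary 13 without manual work or techniques such as parsing.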
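The counting and marking of steps S1 to S3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the patent: the names `longest_common_substring`, `estimate_semantics`, `threshold`, and `min_len` are hypothetical, "common character string" is taken to mean the longest common substring of two notations, and a minimum length of two characters is assumed so that a single stray shared character is ignored. The sample dictionary is modeled on the worked example, with one placeholder name changed so that the beer names share no character outside "ビール".

```python
from collections import Counter
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous string shared by a and b ('' if none)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def estimate_semantics(words, threshold=1, min_len=2):
    """Assign shared substrings to the words that contain them.

    threshold and min_len are illustrative parameters (assumptions).
    """
    counts = Counter()                  # common string -> count
    marks = {w: set() for w in words}   # word -> strings already marked on it
    for word in words:                  # S1: take the next unprocessed word
        for other in words:             # S2: compare with every other word
            if other == word:
                continue
            common = longest_common_substring(word, other)
            # Skip strings that are too short or already marked on the other word.
            if len(common) < min_len or common in marks[other]:
                continue
            counts[common] += 1
            marks[other].add(common)
            marks[word].add(common)
    # S3: keep only the strings whose count reaches the threshold.
    meanings = {
        w: sorted(s for s in marks[w] if counts[s] >= threshold) for w in words
    }
    return meanings, counts


# Placeholder dictionary modeled on the worked example (an assumption).
SAMPLE_WORDS = ["ＡＡ商事", "ＢＢＢ商事", "ＺＺＺ商事", "ＸＹビール", "ＷＷＷビール"]

if __name__ == "__main__":
    meanings, counts = estimate_semantics(SAMPLE_WORDS)
    for word, meaning in meanings.items():
        print(word, meaning)
```

With this marking rule, "商事" is counted twice and "ビール" once, as in the worked example. A plain longest-common-substring comparison can also pick up accidental overlaps between names, which is one reason the threshold matters in practice.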
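[0073]
A second embodiment of the present invention is described next with reference to FIGS. 6 to 8. Parts identical to those of the preceding embodiment carry the same reference signs and are not described again. This embodiment is a variation of the semantic information estimation processing performed by the semantic information estimation unit 14 of the device 1 described in the first embodiment.
[0074]
The flow of the processing in this embodiment is described with reference to FIG. 6. While an unprocessed word remains among the words registered in the word dictionary 13 (Y in step S11), the process proceeds to step S12. There the word is compared, as a character string, with the other words in the word dictionary 13. If a common character string lies at the end or the beginning of the word, that string is stored, the counter associated with it is incremented by one, and the matched portion in the word dictionary 13 is marked. If the portion being compared already carries a mark that exactly coincides with the common character string, it is not counted, so as to avoid double counting.
[0075]
When no unprocessed word remains, that is, when every word registered in the word dictionary 13 has been processed (N in step S11), the process proceeds to step S13. When the count of a common character string is at or above a fixed number, the common character string is taken as semantic information and assigned to the words in the word dictionary 13 that contain it. The fixed number can be changed according to, for example, the number of words in the word dictionary 13.
[0076]
In other words, this processing differs from that of the first embodiment in that, when judging whether characters are shared, the position of the common character string in the compared words is restricted to the end or the beginning of the word.
[0077]
The processing above is illustrated with a concrete example, in which the words of the word dictionary 13 shown in FIG. 7 are processed in order.
[0078]
First, "（株）○○" is processed, where "（株）" abbreviates kabushiki-gaisha (stock company). Character strings that it shares with the other words in the word dictionary 13 are searched for. "（株）○○" shares the string "（株）" with "（株）××証券" and "（株）ＡＡ証券", so "（株）" is stored as a common character string with a count of 2. The "（株）" portions of "（株）○○", "（株）××証券", and "（株）ＡＡ証券" in the word dictionary 13 are marked as counted.
[0079]
Next, "（株）××証券" is processed. Character strings that it shares with the other words in the word dictionary 13 are searched for. "（株）××証券" shares the string "証券" (securities) with "（株）ＡＡ証券", so "証券" is stored as a common character string with a count of 1.
[0080]
The following word "（株）ＡＡ証券" shares "（株）" and "証券", but both have already been counted, so the counters are not incremented.
[0081]
Next, "ＸＹビール" is processed. Character strings that it shares with the other words in the word dictionary 13 are searched for. "ＸＹビール" shares the string "ビール" with "ＺＺＺビール", so "ビール" is stored as a common character string with a count of 1. The "ビール" portions of "ＸＹビール" and "ＺＺＺビール" in the word dictionary 13 are marked as counted.
[0082]
The following word "ＺＺＺビール" also shares "ビール", but it has already been counted, so the counter is not incremented.
[0083]
When all words registered in the word dictionary 13 have been processed, the counts of the common character strings are examined. Those with a count of, for example, 1 or more are taken as semantic information, and these common character strings ("（株）", "証券", "ビール") are assigned in association with the notations of the corresponding words in the word dictionary 13.
[0084]
FIG. 8 shows an example of the word dictionary 13 after the semantic information estimation unit 14 has assigned semantic information. As shown in FIG. 8, restricting the common character string to the end of the word extracts, for company names and the like, only the terms that correspond to the superordinate class of what the name denotes. Here the industry fields "ビール" and "証券" are assigned as semantic information. Restricting the common character string to the beginning of the word extracts expressions that characterize the meaning of the whole word, such as "（株）". Here the company form "（株）" is assigned as semantic information.
[0085]
Restricting the position of the common character string in this way achieves more accurate meaning estimation.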
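The head/tail restriction of steps S11 to S13 might be sketched as below. The function names, the `threshold` parameter, and the choice to record each mark as a `(position, string)` pair are assumptions made for illustration; the word list reproduces the names quoted in the worked example, though the actual layout of FIG. 7 is not visible here.

```python
import os
from collections import Counter


def common_affixes(a: str, b: str):
    """Return the shared leading and trailing strings of a and b."""
    head = os.path.commonprefix([a, b])
    tail = os.path.commonprefix([a[::-1], b[::-1]])[::-1]
    return (("head", head), ("tail", tail))


def estimate_by_affix(words, threshold=1):
    """Count common strings only at the beginning or end of a word."""
    counts = Counter()                  # (position, string) -> count
    marks = {w: set() for w in words}   # word -> affixes already marked
    for word in words:                  # S11: next unprocessed word
        for other in words:             # S12: compare with every other word
            if other == word:
                continue
            for affix in common_affixes(word, other):
                # Skip empty matches and affixes already marked on the other word.
                if not affix[1] or affix in marks[other]:
                    continue
                counts[affix] += 1
                marks[other].add(affix)
                marks[word].add(affix)
    result = {}
    for w in words:                     # S13: apply the threshold
        kept = [a for a in marks[w] if counts[a] >= threshold]
        result[w] = {
            "head": sorted(s for pos, s in kept if pos == "head"),
            "tail": sorted(s for pos, s in kept if pos == "tail"),
        }
    return result, counts


# Names quoted in the worked example; their order is an assumption.
FIG7_WORDS = ["（株）○○", "（株）××証券", "（株）ＡＡ証券", "ＸＹビール", "ＺＺＺビール"]

if __name__ == "__main__":
    result, counts = estimate_by_affix(FIG7_WORDS)
    for word, affixes in result.items():
        print(word, affixes)
```

Keeping head and tail matches in separate lists mirrors the distinction the text draws between the company form ("（株）" at the head) and the industry field ("証券", "ビール" at the tail).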
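[0086]
A third embodiment of the present invention is described next with reference to FIGS. 9 to 11. Parts identical to those of the preceding embodiments carry the same reference signs and are not described again. This embodiment is a variation of the semantic information estimation processing performed by the semantic information estimation unit 14 of the device 1 described in the first embodiment. Unlike the first and second embodiments, in which the word dictionary 13 holds no semantic information at all, this embodiment handles the case where some words in the word dictionary 13 lack semantic information or where a word has been newly registered.
[0087]
The flow of the processing in this embodiment is described with reference to FIG. 9. While a word without registered semantic information remains in the word dictionary 13 (Y in step S21), the process proceeds to step S22. There the word is compared, as a character string, with the other words in the word dictionary 13. If a common character string is found, that string is stored and the counter associated with it is incremented by one. This step performs the function of the second meaning estimation means.
[0088]
When no word without registered semantic information remains in the word dictionary 13 (N in step S21), the process proceeds to step S23. A common character string whose count is at or above a fixed number, or the semantic information of a word that contains that common character string, is assigned to the words in the word dictionary 13 that contain the string. The fixed number can be changed according to, for example, the number of words in the word dictionary 13. This step performs the function of the second semantic information storage means.
[0089]
The processing above is illustrated with a concrete example, using the words of the word dictionary 13 shown in FIG. 10.
[0090]
The word dictionary 13 contains words without registered semantic information ("ＺＺＺ商事" and "ＸＹＺビール"), so "ＺＺＺ商事" is processed first. Character strings that it shares with the other words in the word dictionary 13 are searched for. "ＺＺＺ商事" shares the string "商事" with "ＡＡ商事" and "ＢＢＢ商事", so "商事" is stored as a common character string with a count of 2.
[0091]
Next, "ＸＹＺビール" is processed. Character strings that it shares with the other words in the word dictionary 13 are searched for. "ＸＹＺビール" shares the string "ビール" with "ＸＹビール", so "ビール" is stored as a common character string with a count of 1.
[0092]
After the common character strings have been found for the words without registered semantic information ("ＺＺＺ商事" and "ＸＹＺビール"), the counts are examined. Those with a count of, for example, 1 or more are taken as the semantic information of the words lacking it, and these common character strings ("商事", "ビール") are assigned in association with the notations of the corresponding words in the word dictionary 13.
[0093]
FIG. 11 shows an example of the word dictionary 13 after the semantic information estimation unit 14 has assigned semantic information. As shown in FIG. 11, each word that lacked semantic information has been given its industry field as semantic information.
[0094]
Semantic information can thus be estimated automatically from existing dictionary information for words that have none registered.
[0095]
This embodiment has been described for the case where the word dictionary 13 contains words without registered semantic information, but a newly registered word can be processed in the same way.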
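Steps S21 to S23 can be sketched along the following lines. The dictionary contents stand in for FIG. 10 and are an assumption, as are the names `fill_missing`, `threshold`, and `min_len`. The sketch covers both options of step S23: it assigns the semantic information of a registered word that shares the string when there is one, and the common string itself otherwise.

```python
from collections import Counter
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous string shared by a and b ('' if none)."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    return a[m.a:m.a + m.size]


def fill_missing(dictionary, threshold=1, min_len=2):
    """Estimate meanings only for words whose meaning is None.

    dictionary maps a word notation to its meaning, or to None if unregistered.
    threshold and min_len are illustrative parameters (assumptions).
    """
    counts = Counter()
    candidates = {}                             # word -> [(common, other's meaning)]
    for word, meaning in dictionary.items():
        if meaning is not None:                 # S21: only words lacking a meaning
            continue
        for other in dictionary:                # S22: compare with the other words
            if other == word:
                continue
            common = longest_common_substring(word, other)
            if len(common) >= min_len:
                counts[common] += 1
                candidates.setdefault(word, []).append((common, dictionary[other]))
    filled = dict(dictionary)
    for word, found in candidates.items():      # S23: assign above the threshold
        for common, other_meaning in found:
            if counts[common] >= threshold:
                # Prefer the registered meaning; fall back to the string itself.
                filled[word] = other_meaning or common
                break
    return filled, counts


# Stand-in for the dictionary of FIG. 10 (contents assumed).
SAMPLE_DICT = {
    "ＡＡ商事": "商事",
    "ＢＢＢ商事": "商事",
    "ＸＹビール": "ビール",
    "ＺＺＺ商事": None,
    "ＸＹＺビール": None,
}

if __name__ == "__main__":
    filled, counts = fill_missing(SAMPLE_DICT)
    print(filled)
```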
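[0096]
A fourth embodiment of the present invention is described next with reference to FIGS. 12 to 15. Parts identical to those of the preceding embodiments carry the same reference signs and are not described again. Whereas the first to third embodiments estimate the semantic information of words in the word dictionary 13 from the word dictionary 13 itself, this embodiment estimates the semantic information of words from a document entered via the input device 11 or the like.
[0097]
FIG. 12 is a functional block diagram of the device 1 of this embodiment. As shown in FIG. 12, when the CPU 2 operates under the program, a co-occurrence information extraction unit 15 and a semantic information estimation unit 16 are formed in the device 1. The co-occurrence information extraction unit 15, which realizes the co-occurrence information extraction means, extracts matching expressions from a document entered via the input device 11 or the like, using the co-occurrence pattern dictionary 17 (see FIG. 13). As shown in FIG. 13, the co-occurrence pattern dictionary 17 describes the sets of co-occurring expressions (co-occurrence patterns) to be extracted from documents. In the example of FIG. 13, the co-occurring expressions are specified by notation and part of speech. The semantic information estimation unit 16, for each expression extracted by the co-occurrence information extraction unit 15, registers the word and its semantic information when the word itself, or its semantic information, is not registered in the word dictionary 13.
[0098]
The processing of the co-occurrence information extraction unit 15 is described first. When a given word in the input document is followed by a word not registered in the word dictionary 13 or by a sequence of nouns, and that is in turn followed by a particle, the unit assigns the given word as semantic information to the unregistered word or noun sequence. In the example of the co-occurrence pattern dictionary 17 shown in FIG. 13, "パソコン" (personal computer) and "メーカー" (manufacturer) are assigned as semantic information to the unregistered word or noun sequence (underlined in FIG. 13: the co-occurring word) that follows the corresponding word, such as "パソコン", in the input document. A table or the like can also be used to assign different semantic information to each co-occurring word. The input document may or may not have been processed by morphological analysis or the like. Without morphological analysis, parts of speech are identified using the word dictionary 13.
[0099]
The processing of the semantic information estimation unit 16 is described next. FIG. 14 is a flowchart of the semantic information estimation processing in the semantic information estimation unit 16 of this embodiment. As shown in FIG. 14, the co-occurrence patterns of the co-occurrence pattern dictionary 17 are matched against the input document one at a time (step S31: document receiving means). When the designated co-occurring word in a pattern (underlined in FIG. 13) is unregistered in the word dictionary 13 or has no registered semantic information (Y in step S32, Y in step S33), semantic information is assigned as directed by the co-occurrence pattern dictionary 17, or the description in the co-occurrence pattern dictionary 17 is converted to another expression and assigned (step S34). Steps S32 to S34 are performed for every co-occurrence pattern.
[0100]
The processing above is illustrated with a concrete example. Suppose the input document contains the expression "大手メーカー○○電気がパソコンＡシリーズを発売した" (the major manufacturer ○○ Electric released the personal computer A series).
[0101]
First, on the basis of a co-occurrence pattern in the co-occurrence pattern dictionary 17 of FIG. 13, the document is checked for "パソコン" followed by an unregistered word or noun sequence (the co-occurring word) and then a particle. Here "パソコンＡシリーズを" matches. Next, on the basis of the co-occurrence pattern dictionary 17, "Ａシリーズ" is extracted with the semantic information "パソコン". It is looked up in the word dictionary 13 and, if unregistered, is registered in the word dictionary 13 as shown in FIG. 15.
[0102]
Next, on the basis of a co-occurrence pattern in the co-occurrence pattern dictionary 17 of FIG. 13, the document is checked for "メーカー" followed by an unregistered word or noun sequence (the co-occurring word) and then a particle. Here "メーカー○○電気が" matches. On the basis of the co-occurrence pattern dictionary 17, "○○電気" is extracted with the semantic information "メーカー". It is looked up in the word dictionary 13 and, if unregistered, "○○電気" is registered in the word dictionary 13 with "メーカー" as its semantic information.
[0103]
Using the co-occurrence information that appears in a document thus makes it possible to acquire automatically words not registered in the word dictionary 13 together with their semantic information. In particular, using information about word formation allows the meaning to be acquired more reliably.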
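A rough sketch of the trigger, co-occurring word, and particle pattern of this embodiment follows. It is heavily simplified and rests on assumptions: no morphological analysis is performed, the "unregistered word or noun sequence" is approximated by the shortest run of characters before the next particle, and `PARTICLES`, `PATTERNS`, and `acquire` are hypothetical names. The two trigger words are the ones the text attributes to FIG. 13.

```python
import re

# A small, assumed set of single-character particles used as right delimiters.
PARTICLES = "がをはにでとも"

# Trigger word -> semantic information to assign (after the description of FIG. 13).
PATTERNS = {"パソコン": "パソコン", "メーカー": "メーカー"}


def acquire(text, dictionary):
    """Register co-occurring words found in text with the trigger's meaning.

    dictionary maps a word to its meaning (or None); it is updated and returned.
    """
    for trigger, meaning in PATTERNS.items():           # S31: one pattern at a time
        # Trigger, then the shortest run of characters, then a particle (lookahead).
        regex = re.compile(re.escape(trigger) + r"(.+?)(?=[" + PARTICLES + r"])")
        for m in regex.finditer(text):
            word = m.group(1)
            # S32/S33: word unregistered, or registered without a meaning.
            if dictionary.get(word) is None:
                dictionary[word] = meaning               # S34: assign the meaning
    return dictionary


SAMPLE = "大手メーカー○○電気がパソコンＡシリーズを発売した"

if __name__ == "__main__":
    print(acquire(SAMPLE, {}))
```

A word that already has semantic information is left untouched, which matches the condition of steps S32 and S33. A real system would need proper tokenization, since a particle character can also occur inside a name.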
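[0104]
A fifth embodiment of the present invention is described next with reference to FIGS. 16 to 20. Parts identical to those of the preceding embodiments carry the same reference signs and are not described again. This embodiment is a variation of the processing performed by the co-occurrence information extraction unit 15 and the semantic information estimation unit 16 of the device 1 described in the fourth embodiment.
[0105]
As shown in FIG. 16, the word dictionary 13 of this embodiment stores words together with their semantic information.
[0106]
The co-occurrence pattern dictionary 17 used by the co-occurrence information extraction unit 15 of this embodiment is described next. As shown in FIG. 17, it describes the co-occurrence patterns to be matched and the semantic relations of the words within each pattern. The semantic relation of a word within a pattern may be a superordinate or subordinate relation or any other relation. In the example of FIG. 17, "（）" denotes a word having the stated semantic relation, and "［］" specifies the notation of the word itself. "［Ｘ｜Ｙ］" means that either X or Y is acceptable.
[0107]
In addition, as shown in FIG. 17, the co-occurrence pattern dictionary 17 of this embodiment allows hierarchical or associative descriptions within co-occurrence patterns. In the example of FIG. 17, "（Ａ）：（Ｂ）（Ｃ）" indicates that the chain "（Ｂ）（Ｃ）" as a whole can express the meaning "（Ａ）". The notation "(名称：名詞連続ｏｒ未登録語)" (name: noun sequence or unregistered word) indicates that an expression meaning a name consists of a sequence of nouns or an unregistered word, and this serves as the condition for a match. As a result, an expression in a co-occurrence pattern that is built from several words, such as a chain of constituent words inside a word, can be described separately, which makes patterns easier to write. The example of FIG. 17 shows that a word meaning a product, written "(商品)" in a co-occurrence pattern, can be expressed by the combination （製品）（名称：名詞連続ｏｒ未登録語）, that is, a product-type word followed by a name.
[0108]
The processing of the semantic information estimation unit 16 is described next. FIG. 18 is a flowchart of the semantic information estimation processing in the semantic information estimation unit 16 of this embodiment. As shown in FIG. 18, the processing matches the expressions in the input document in order and checks whether a co-occurrence pattern of the co-occurrence pattern dictionary 17 is present. If the input document contains words having a semantic relation "（）", matching is performed including the meanings given by "（）" (step S42). If no co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in this way (N in step S42), the process proceeds to step S43 and matching is performed with the meaning parts "（）" excluded.
[0109]
If a co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document (Y in step S43), the process proceeds to step S44, which checks whether a meaning part "（）" is defined by another co-occurrence pattern. If it is, the meaning part "（）" and its corresponding character string are matched against that other co-occurrence pattern.
[0110]
When the meaning part "（）" and its corresponding character string have been matched against another co-occurrence pattern (Y in step S44), the matched co-occurrence pattern with the fewest unmatched parts is selected (step S45). If the number of unmatched parts is 1 (Y in step S46), the unmatched part is recognized as having that meaning, and the matched co-occurrence pattern and the corresponding character string (word) are stored (step S47).
[0111]
Likewise, when a co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document by the matching that includes meanings (Y in step S42), the matched co-occurrence pattern and the corresponding character string (word) are stored (step S47).
[0112]
On the other hand, when no co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document (N in step S43), when the meaning part "（）" and its corresponding character string are not matched against another co-occurrence pattern (N in step S44), or when the number of unmatched parts is not 1 (N in step S46), the process returns to step S41.
[0113]
Steps S42 to S47 are performed for every expression pattern in the input document.
[0114]
When every expression pattern in the input document has been processed (N in step S41), the description of the semantic relation in the corresponding entry of the co-occurrence pattern dictionary 17 is written into the word dictionary 13, on the basis of the matched co-occurrence patterns, the corresponding character strings (words), and their meanings (step S48).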
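[0115]
The processing above is illustrated with a concrete example. Suppose the input document contains the expressions
"○○電気がパソコンＡシリーズを発売した" (○○ Electric released the personal computer A series)
"○○電気がスパッタリング装置を開発した" (○○ Electric developed a sputtering device).
[0116]
First, "○○電気がパソコンＡシリーズを発売した" is matched against the co-occurrence patterns of the co-occurrence pattern dictionary 17 shown in FIG. 17. For the first pattern, "（メーカー）が（商品）を発売" ((manufacturer) released (product)), matching is performed including the meanings "（）", and then matching is performed on the parts other than the meanings "（）". Here everything other than the meaning parts （メーカー） and （商品） matches.
[0117]
Whether the parts specified by a meaning "（）" match is not yet known, so for each meaning the co-occurrence pattern dictionary 17 is checked for another pattern that defines it. For （商品）, the dictionary contains the pattern "（商品）：（製品）（名称：名詞連続ｏｒ未登録語）", which is matched against the corresponding part "パソコンＡシリーズ". Because "パソコン" is registered with the meaning "製品" (product type) in the word dictionary 13 shown in FIG. 16, this pattern applies: "パソコンＡシリーズ" is recognized as （製品）（名称）, and the part as a whole is confirmed to be a （商品）. In other words, the （商品） part of the first pattern "（メーカー）が（商品）を発売" matches.
[0118]
For the remaining "（メーカー）", the word dictionary 13 shown in FIG. 16 has no entry that satisfies the string match. In this example the first pattern "（メーカー）が（商品）を発売" is the best match and has exactly one unmatched part, so it is deemed to have matched. The unmatched part "○○電気" is estimated to be a (メーカー). "○○電気", which is unregistered in the word dictionary 13, is recognized as having the meaning "メーカー" and "Ａシリーズ" as having the meaning "名称" (name), and the matched co-occurrence pattern and these words are stored.
[0119]
Next, "○○電気がスパッタリング装置を開発した" is matched. For the second pattern of the co-occurrence pattern dictionary 17 shown in FIG. 17, "（メーカー）が（技術）を開発" ((manufacturer) developed (technology)), matching is performed including the meanings "（）", and then matching is performed on the parts other than the meanings "（）". Here everything other than the meaning parts （メーカー） and （技術） matches.
[0120]
Whether the parts specified by a meaning "（）" match is not yet known, so for each meaning the co-occurrence pattern dictionary 17 is checked for another pattern that defines it. For （技術）, the dictionary contains the pattern "（技術）：（技術）｜（技術）［装置｜システム］", which is matched against the corresponding part "スパッタリング装置" (sputtering device). This pattern applies: "スパッタリング装置" is recognized as （技術）［装置｜システム］, and the part as a whole is confirmed to be a （技術）. In other words, the （技術） part of the second pattern "（メーカー）が（技術）を開発" matches.
[0121]
For the remaining "（メーカー）", the word dictionary 13 shown in FIG. 16 has no entry that satisfies the string match. In this example the second pattern "（メーカー）が（技術）を開発" is the best match and has exactly one unmatched part, so it is deemed to have matched. The unmatched part "○○電気" is estimated to be a (メーカー). "○○電気", which is unregistered in the word dictionary 13, is recognized as having the meaning "メーカー" and "スパッタリング" as having the meaning "技術" (technology), and the matched co-occurrence pattern and these words are stored.
[0122]
After all co-occurrence patterns have been matched, semantic information is assigned to the words in the word dictionary 13 for the patterns that matched. Here the meaning "メーカー" for the word "○○電気", the meaning "名称" for the word "Ａシリーズ", and the meaning "技術" for the word "スパッタリング" can be registered in the word dictionary 13 as shown in FIG. 19. The order in which semantic relations are associated can be changed to suit the purpose (for example, placing the manufacturer on the left in FIG. 19). As shown in FIG. 20, meanings can also be attached within the word dictionary 13 itself.
[0123]
As described above, even when a word or its meaning is unregistered, co-occurrence patterns allow it to be inferred from the co-occurring expressions. Because meanings are also associated with one another, detailed matching is possible, and detailed semantic information can be estimated from the matching results.
[0124]
Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Associated co-occurrence patterns also link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.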
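The interplay between a top-level pattern and a sub-pattern can be sketched for the first example sentence as follows. This is an illustrative reduction, not the flow of FIG. 18 in full: only the pattern "（メーカー）が（商品）を発売" and the sub-pattern "（商品）：（製品）（名称）" are encoded, the surface part of the pattern is hard-coded as a regular expression, and `WORD_DICT`, `TOP_PATTERN`, `SLOT_MEANING`, `split_product`, and `infer` are hypothetical names. The dictionary entry for "パソコン" follows what the text says about FIG. 16.

```python
import re

# Assumed excerpt of the word dictionary of FIG. 16: word -> meaning.
WORD_DICT = {"パソコン": "製品"}

# Surface form of "(メーカー)が(商品)を発売" with one named group per meaning slot.
TOP_PATTERN = re.compile(r"(?P<maker>.+?)が(?P<item>.+?)を発売")
SLOT_MEANING = {"maker": "メーカー", "item": "商品"}


def split_product(text, dictionary):
    """Sub-pattern (商品): (製品)(名称). Return (product, name) or None."""
    for word, meaning in dictionary.items():
        if meaning == "製品" and text.startswith(word) and len(text) > len(word):
            return word, text[len(word):]
    return None


def infer(sentence, dictionary):
    """Return {word: (meaning, related word or None)} learned from sentence."""
    m = TOP_PATTERN.search(sentence)        # match with the meaning parts excluded
    if m is None:
        return {}
    learned, unmatched = {}, []
    for slot, meaning in SLOT_MEANING.items():
        text = m.group(slot)
        if dictionary.get(text) == meaning:  # slot already confirmed by the dictionary
            continue
        # Try to confirm the slot through the sub-pattern that defines it.
        parts = split_product(text, dictionary) if meaning == "商品" else None
        if parts:
            product, name = parts
            learned[name] = ("名称", product)
        else:
            unmatched.append((text, meaning))
    if len(unmatched) > 1:                   # too many unknowns: accept nothing
        return {}
    for text, meaning in unmatched:          # a single unknown takes the slot meaning
        learned[text] = (meaning, None)
    return learned


SAMPLE = "○○電気がパソコンＡシリーズを発売した"

if __name__ == "__main__":
    print(infer(SAMPLE, WORD_DICT))
```

Storing the related word alongside each learned meaning is one possible way to keep the association between items of semantic information that the text describes for FIG. 19. How the patent actually lays out that table is not shown here.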
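[0125]
[Effects of the Invention]
According to the inventions of claims 1, 3, and 5, because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple, so semantic information can be assigned automatically to words registered in the word dictionary without manual work or techniques such as parsing. Using co-occurrence patterns that appear in a document also makes it possible to acquire automatically words not registered in the word dictionary together with their semantic information. Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Furthermore, associated co-occurrence patterns link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.
[0126]
According to the inventions of claims 2, 4, and 6, restricting the position of the common character string achieves more accurate meaning estimation.
[Brief Description of the Drawings]
[FIG. 1] A block diagram schematically showing the hardware configuration of the semantic information estimation device according to the first embodiment of the present invention.
[FIG. 2] An explanatory diagram showing the initial state of the word dictionary.
[FIG. 3] A functional block diagram of the semantic information estimation device.
[FIG. 4] A flowchart schematically showing the flow of the semantic information estimation processing.
[FIG. 5] An explanatory diagram showing the word dictionary with semantic information assigned.
[FIG. 6] A flowchart schematically showing the flow of the semantic information estimation processing in the semantic information estimation device according to the second embodiment of the present invention.
[FIG. 7] An explanatory diagram showing the initial state of the word dictionary.
[FIG. 8] An explanatory diagram showing the word dictionary with semantic information assigned.
[FIG. 9] A flowchart schematically showing the flow of the semantic information estimation processing in the semantic information estimation device according to the third embodiment of the present invention.
[FIG. 10] An explanatory diagram showing the initial state of the word dictionary.
[FIG. 11] An explanatory diagram showing the word dictionary with semantic information assigned.
[FIG. 12] A functional block diagram of the semantic information estimation device according to the fourth embodiment of the present invention.
[FIG. 13] An explanatory diagram showing the co-occurrence pattern dictionary.
[FIG. 14] A flowchart schematically showing the flow of the semantic information estimation processing.
[FIG. 15] An explanatory diagram showing the word dictionary with semantic information assigned.
[FIG. 16] An explanatory diagram showing the initial state of the word dictionary of the semantic information estimation device according to the fifth embodiment of the present invention.
[FIG. 17] An explanatory diagram showing the co-occurrence pattern dictionary.
[FIG. 18] A flowchart schematically showing the flow of the semantic information estimation processing.
[FIG. 19] An explanatory diagram showing the word dictionary with semantic information assigned.
[FIG. 20] An explanatory diagram showing another example of the word dictionary with semantic information assigned.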
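[Explanation of Reference Signs]
1 Semantic information estimation device
7 Storage medium
11 Input unit
13 Word dictionary
17 Co-occurrence pattern dictionary
[0001]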
[Technical Field of the Invention]
The present invention relates to a semantic information estimation device, a semantic information estimation method, and a program.
[0002]
[Prior art]
Word dictionaries used in natural language processing register, in addition to each word's notation, reading, and conjugation, information about the word's meaning (classification). Semantic information registered in this way is highly useful for document retrieval and document classification.
[0003]
Keyword search is the usual way to retrieve a given document from a large document database. In practice, a user may search with a specific keyword such as "○× Electric", a manufacturer that makes and sells OA equipment, or with a keyword for a superordinate concept of "○× Electric", such as "OA equipment manufacturer".
[0004]
Ordinary documents, however, contain proper nouns such as the company name "○× Electric" but usually do not state the term for its superordinate concept ("OA equipment manufacturer").
[0005]
One way to address this problem is to attach semantic information, such as the industry a company belongs to, to proper nouns such as company names. Attaching such meaning and knowledge to a word dictionary, or building such a dictionary, demands considerable labor, knowledge, and skill, so various techniques for doing so have been considered.
[0006]
As an efficient way to acquire knowledge information, Japanese Patent Laid-Open No. 6-223109 discloses analyzing natural language sentences and registering specific targets in a database.
[0007]
As a way to estimate words from syntactic information, Japanese Patent Laid-Open No. 7-85071 discloses extracting information, including relations between words, in a fixed format using syntactic information. In doing so, that technique estimates unregistered words and their meanings. For example, for the expression "○× Electric developed a sputtering device", even if "○× Electric" is an unknown word, it is recognized as a device developer from the co-occurring expression "developed a device".
[0008]
[Problems to be solved by the invention]
However, the technique disclosed in Japanese Patent Laid-Open No. 6-223109 presupposes that some semantic information is already registered in the word dictionary. The meanings of the target domain and the expressions associated with them must be defined in advance, and this setup itself takes a great deal of labor.
[0009]
The technique disclosed in Japanese Patent Laid-Open No. 7-85071 acquires information through detailed parsing with parsing rules, but parsing is itself a difficult technique, and the analysis can fail when a word is unregistered or has no semantic information. Moreover, this method yields the same result for a sentence such as "○× Electric developed a communication device", so it cannot provide detailed semantic information such as what kind of device the developer develops.
[0010]
An object of the present invention is to automatically assign semantic information of a word to a word registered in a word dictionary without using a technique such as manual operation or syntax analysis.
[0011]
An object of the present invention is to automatically acquire a word not registered in a word dictionary and its semantic information by using input document information.
[0012]
[Means for Solving the Problems]
The semantic information estimation device of the invention according to claim 1 comprises: word dictionary storage means for storing a plurality of word notations; meaning estimation means for searching the notations of words other than one word stored in the word dictionary storage means for a common character string shared with a character string forming at least part of the notation of that one word and, when the common character string is found at least a predetermined number of times, estimating the common character string to be the semantic information of the words whose notations contain it; semantic information storage means for storing the semantic information estimated by the meaning estimation means in the word dictionary storage means in association with the notations of the words containing the common character string; co-occurrence pattern dictionary storage means for storing a first co-occurrence pattern, which defines a relation between co-occurring words, in association with the semantic information of each word of the first co-occurrence pattern, and for storing a second co-occurrence pattern, which defines a hierarchical or associative relation between any word of the first co-occurrence pattern and another word, in association with the semantic information of that other word; document receiving means for receiving input of a document from an input unit; and co-occurrence information extraction means for extracting, from the document received by the document receiving means, the words that satisfy the first co-occurrence pattern stored in the co-occurrence pattern dictionary storage means. The semantic information storage means further stores, in the word dictionary storage means, the notation of each word extracted by the co-occurrence information extraction means in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means and, when a hierarchical or associative relation between any of the extracted words and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, stores the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.
[0013]
Because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple. Semantic information can thus be assigned automatically to words registered in the word dictionary without manual work or techniques such as parsing. Using co-occurrence patterns that appear in a document also makes it possible to acquire automatically words not registered in the word dictionary together with their semantic information. Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Furthermore, associated co-occurrence patterns link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.
[0014]
In the invention according to claim 2, in the semantic information estimation device according to claim 1, the meaning estimation means searches for the common character string in the trailing or leading character strings of the notations of the words other than the one word.
[0015]
Therefore, by limiting the common character string positions, more accurate meaning estimation can be performed.
[0026]
The semantic information estimation method of the invention according to claim 3 comprises: a meaning estimation step in which meaning estimation means searches the notations of words other than one word, stored in word dictionary storage means that stores a plurality of word notations, for a common character string shared with a character string forming at least part of the notation of that one word and, when the common character string is found at least a predetermined number of times, estimates the common character string to be the semantic information of the words whose notations contain it; a first semantic information storage step in which semantic information storage means stores the semantic information estimated in the meaning estimation step in the word dictionary storage means in association with the notations of the words containing the common character string; a document receiving step in which document receiving means receives input of a document from an input unit; a co-occurrence information extraction step in which co-occurrence information extraction means extracts, from the document received in the document receiving step, the words that satisfy a first co-occurrence pattern stored in co-occurrence pattern dictionary storage means, the co-occurrence pattern dictionary storage means storing the first co-occurrence pattern, which defines a relation between co-occurring words, in association with the semantic information of each word of the first co-occurrence pattern, and storing a second co-occurrence pattern, which defines a hierarchical or associative relation between any word of the first co-occurrence pattern and another word, in association with the semantic information of that other word; and a second semantic information storage step in which the semantic information storage means further stores, in the word dictionary storage means, the notation of each word extracted in the co-occurrence information extraction step in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means and, when a hierarchical or associative relation between any of the extracted words and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, stores the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.
[0027]
Because a character string shared among words is likely to express a shared concept, using it makes assigning meaning simple. Semantic information can thus be assigned automatically to words registered in the word dictionary without manual work or techniques such as parsing. Using co-occurrence patterns that appear in a document also makes it possible to acquire automatically words not registered in the word dictionary together with their semantic information. Describing co-occurrence patterns associatively or hierarchically reduces the effort of writing patterns. Because patterns are associated, semantic information can be estimated from related known information even when a word is unregistered. Furthermore, associated co-occurrence patterns link items of semantic information to one another in addition to yielding the semantic information itself, so more detailed semantic information can be estimated.
[0028]
In the invention according to claim 4, in the semantic information estimation method according to claim 3, the meaning estimation means, in the meaning estimation step, searches for the common character string in the trailing or leading character strings of the notations of the words other than the one word.
[0029]
Therefore, by limiting the common character string positions, more accurate meaning estimation can be performed.
[0040]
The program of the invention according to claim 5 is a program for causing a computer to estimate the semantic information of words. The program causes the computer to execute: a meaning estimation function of searching the notations of words other than one word, among the word notations stored in word dictionary storage means that stores a plurality of word notations, for a common character string shared with a character string forming at least part of the notation of the one word and, when the common character string is found a predetermined number of times or more, estimating the common character string as the semantic information of the words whose notations contain it; a first semantic information storage function of storing the semantic information estimated by the meaning estimation function in the word dictionary storage means in association with the notations of the words containing the common character string; a document reception function of receiving a document input from an input unit; a co-occurrence information extraction function of extracting, from the document received by the document reception function, each word that satisfies a first co-occurrence pattern stored in co-occurrence pattern dictionary storage means, the co-occurrence pattern dictionary storage means storing the first co-occurrence pattern, which defines a relationship between co-occurring words, in association with the semantic information of each word of the first co-occurrence pattern, and storing a second co-occurrence pattern, which defines a hierarchical relationship or association relationship between any word of the first co-occurrence pattern and another word, in association with the semantic information of the other word of the second co-occurrence pattern; and a second semantic information storage function of storing the notation of each word extracted by the co-occurrence information extraction function and the semantic information of that word stored in the co-occurrence pattern dictionary storage means in the word dictionary storage means in association with each other and, when a hierarchical relationship or association relationship between any one of the extracted words and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, storing the notation of the other word and the semantic information of the other word stored in the co-occurrence pattern dictionary storage means in the word dictionary storage means in association with each other.
[0041]
Therefore, since a character string shared by several words is highly likely to represent a common concept, semantic information can be assigned easily by exploiting such strings. As a result, semantic information can be given automatically to words registered in the word dictionary without manual work or techniques such as syntax analysis. Further, by using the co-occurrence patterns appearing in a document, a word not registered in the word dictionary and its semantic information can be acquired automatically. In addition, describing the co-occurrence patterns in an associated or hierarchical manner reduces the labor of pattern description. Because the patterns are associated, semantic information can be estimated from related known information even when an unregistered word is present. Furthermore, since the associated co-occurrence patterns make it possible not only to assign semantic information but also to relate pieces of semantic information to one another, more detailed semantic information can be estimated.
[0042]
The invention according to claim 6 is the program according to claim 5, wherein the meaning estimation function searches for the common character string at the end or at the beginning of the notations of the words other than the one word.
[0043]
Therefore, by limiting the common character string positions, more accurate meaning estimation can be performed.
[0056]
DETAILED DESCRIPTION OF THE INVENTION
A first embodiment of the present invention will be described with reference to FIGS.
[0057]
FIG. 1 is a block diagram schematically showing the hardware configuration of the semantic information estimation device 1. As shown in FIG. 1, the semantic information estimation device 1 includes a CPU (Central Processing Unit) 2 that centrally controls each unit of the device. A ROM (Read Only Memory) 3, a read-only memory that stores the BIOS and the like, and a RAM (Random Access Memory) 4, which stores various data in a rewritable manner, are connected to the CPU 2 by a bus 5. Also connected to the bus 5, via an I/O (not shown), are an HDD (Hard Disk Drive) 6 serving as external storage, a CD-ROM drive 8 that reads a CD (Compact Disc)-ROM 7, a communication control device 10 that controls communication between the semantic information estimation device 1 and a network 9, an input device 11 such as a keyboard and a mouse functioning as an input unit, and an output device 12 such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display).
[0058]
Since the RAM 4 has a property of storing various data in a rewritable manner, the RAM 4 functions as a work area for the CPU 2 and plays a role of, for example, an input buffer and an analysis buffer.
[0059]
The HDD 6 stores program files containing various programs and also a word dictionary 13 in which word notations and semantic information are stored. In its initial state, the word dictionary 13 of the present embodiment holds only notations, such as company names, as shown in FIG. 2.
[0060]
A CD-ROM 7 shown in FIG. 1 implements the storage medium of the present invention, and stores a predetermined program. The CPU 2 reads the program stored in the CD-ROM 7 with the CD-ROM drive 8 and installs it in the HDD 6. Thereby, the semantic information estimation apparatus 1 is in a state in which various processes as described later can be performed.
[0061]
The storage medium is not limited to the CD-ROM 7; various types of media can be used, such as optical disks including DVDs, magneto-optical disks, magnetic disks including floppy disks, and semiconductor memories. Alternatively, the program may be downloaded from the network 9, such as the Internet, via the communication control device 10 and installed on the HDD 6. In this case, the storage device that holds the program on the transmitting server is also a storage medium of the present invention. The program may run on a predetermined OS (Operating System), in which case the OS may take over part of the various processes described later, and it may be included as part of a group of program files constituting predetermined application software, such as word processing software, or an OS.
[0062]
Next, the various processes that the CPU 2 of the semantic information estimation device 1 executes based on the program will be described. In outline, the semantic information estimation device 1 of the present invention estimates the semantic information of words. As shown in FIG. 3, when the CPU 2 operates based on the program, a semantic information estimation unit 14, which estimates the semantic information of words based on the word dictionary 13, is formed in the semantic information estimation device 1.
[0063]
Next, the flow of the semantic information estimation process in the semantic information estimation unit 14 of the present embodiment will be described. As shown in FIG. 4, when an unprocessed word remains among the words registered in the word dictionary 13 (Y in step S1), the process proceeds to step S2. There, the word is compared with the other words in the word dictionary 13; if a common character string exists, the character string is stored, the counter associated with it is incremented, and the matched portion of each word in the word dictionary 13 is marked. If, at the time of comparison, an existing mark coincides exactly with the common character string, the string is not counted again, which avoids double counting. This step performs the function of the meaning estimation means.
[0064]
On the other hand, when no unprocessed word remains among the words registered in the word dictionary 13, that is, when all registered words have been processed (N in step S1), the process proceeds to step S3. There, each common character string whose count is equal to or greater than a fixed number is treated as semantic information and assigned to the words in the word dictionary 13 that contain it. The fixed number can be changed according to, for example, the number of words in the word dictionary 13. This step performs the function of the semantic information storage means.
[0065]
The semantic information estimation process will be described using a specific example. Here, description will be given below assuming that each word in the word dictionary 13 shown in FIG. 2 is processed in order.
[0066]
First, “AA Trading” is processed. The other words in the word dictionary 13 are searched for a character string in common with “AA Trading”. Since “AA Trading” shares the string “Trading” with “BBB Trading” and “ZZZ Trading”, “Trading” is stored as a common character string with a count of “2”. At this time, the “Trading” portion of “AA Trading”, “BBB Trading”, and “ZZZ Trading” in the word dictionary 13 is marked as counted.
[0067]
“BBB Trading” and “ZZZ Trading” also contain “Trading”, but the counter is not incremented because this portion has already been counted.
[0068]
Next, “XY Beer” is processed. The other words in the word dictionary 13 are searched for a character string in common with “XY Beer”. Since “XY Beer” shares the string “Beer” with “YYY Beer”, “Beer” is stored as a common character string with a count of “1”. At this time, the “Beer” portion of “XY Beer” and “YYY Beer” in the word dictionary 13 is marked as counted.
[0069]
The next word, “YYY Beer”, also contains “Beer”, but the counter is not incremented because this portion has already been counted.
[0070]
When all the words registered in the word dictionary 13 have been processed, the counts of the common character strings are checked. If, for example, strings with a count of “1” or more are adopted as semantic information, the common character strings (“Trading” and “Beer”) are assigned in association with the notation of each corresponding word in the word dictionary 13.
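The counting procedure of steps S1 to S3 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function name `estimate_semantics` and the parameter `min_count` are hypothetical, and, as a simplifying assumption suited to the English renderings used here, candidate common strings are the space-delimited elements of each notation rather than arbitrary character substrings. The word list is the one used in the example above.

```python
def estimate_semantics(words, min_count=1):
    """Count shared name elements and assign them as semantic labels.

    Assumption: a "common character string" is approximated by a
    space-delimited element of the notation (hypothetical simplification).
    """
    counts = {}      # common string -> counter
    marked = set()   # (word, element) pairs that have already been counted

    for word in words:                      # step S1: next unprocessed word
        for part in word.split():           # step S2: look for common strings
            if (word, part) in marked:
                continue                    # already counted: avoid double counting
            others = [w for w in words if w != word and part in w.split()]
            if not others:
                continue
            counts[part] = counts.get(part, 0) + len(others)
            marked.add((word, part))
            marked.update((w, part) for w in others)

    # step S3: strings reaching the threshold become semantic information
    labels = {
        w: [p for p in w.split() if counts.get(p, 0) >= min_count]
        for w in words
    }
    return counts, labels


WORDS = ["AA Trading", "BBB Trading", "ZZZ Trading", "XY Beer", "YYY Beer"]
counts, labels = estimate_semantics(WORDS, min_count=1)
```

For the five words of the example, the sketch counts “Trading” twice and “Beer” once, matching the counts described above, and with `min_count=1` each word receives the shared element as its label.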
[0071]
Here, FIG. 5 is an explanatory diagram showing an example of the word dictionary 13 to which the semantic information is given by the processing of the semantic information estimation unit 14. As shown in FIG. 5, each industry field is given as semantic information to the name of a company or the like.
[0072]
Because a character string shared by several words is highly likely to represent a common concept, semantic information can be assigned easily by exploiting it. Semantic information can therefore be given automatically to the words registered in the word dictionary 13 without manual work or techniques such as syntax analysis.
[0073]
Next, a second embodiment of the present invention will be described with reference to FIGS. Note that the same parts as those of the above-described embodiment are denoted by the same reference numerals, and description thereof is omitted. This embodiment is a modification of the semantic information estimation process in the semantic information estimation unit 14 of the semantic information estimation apparatus 1 described in the first embodiment.
[0074]
The flow of the semantic information estimation process in the semantic information estimation unit 14 of the present embodiment will be described. As shown in FIG. 6, when an unprocessed word remains among the words registered in the word dictionary 13 (Y in step S11), the process proceeds to step S12. There, the word is compared with the other words in the word dictionary 13; if a common character string is located at the end or at the beginning of the word, the common character string is stored, the counter associated with it is incremented by one, and the matched portion of each word in the word dictionary 13 is marked. If, at the time of comparison, an existing mark coincides exactly with the common character string, the string is not counted again, which avoids double counting.
[0075]
On the other hand, when no unprocessed word remains among the words registered in the word dictionary 13, that is, when all registered words have been processed (N in step S11), the process proceeds to step S13. There, each common character string whose count is equal to or greater than a fixed number is treated as semantic information and assigned to the words in the word dictionary 13 that contain it. The fixed number can be changed according to, for example, the number of words in the word dictionary 13.
[0076]
That is, the semantic information estimation process of the present embodiment differs from that of the first embodiment in that, when judging whether a character string is common, the position of the common character string within the compared words is restricted to the end or the beginning of the word.
[0077]
The semantic information estimation process will be described using a specific example. Here, description will be made below assuming that each word in the word dictionary 13 shown in FIG. 7 is processed in order.
[0078]
First, “(stock) XX” is processed. The other words in the word dictionary 13 are searched for a character string in common with “(stock) XX”. Since “(stock) XX” shares the string “(stock)” with “(stock) XX Securities” and “(stock) AA Securities”, “(stock)” is stored as a common character string with a count of “2”. At this time, the “(stock)” portion of “(stock) XX”, “(stock) XX Securities”, and “(stock) AA Securities” in the word dictionary 13 is marked as counted.
[0079]
Next, “(stock) XX Securities” is processed. The other words in the word dictionary 13 are searched for a character string in common with “(stock) XX Securities”. Since “(stock) XX Securities” shares the string “Securities” with “(stock) AA Securities”, “Securities” is stored as a common character string with a count of “1”.
[0080]
“(stock) AA Securities” also contains “(stock)” and “Securities”, but the counters are not incremented because these portions have already been counted.
[0081]
Next, “XY Beer” is processed. The other words in the word dictionary 13 are searched for a character string in common with “XY Beer”. Since “XY Beer” shares the string “Beer” with “ZZZ Beer”, “Beer” is stored as a common character string with a count of “1”. At this time, the “Beer” portion of “XY Beer” and “ZZZ Beer” in the word dictionary 13 is marked as counted.
[0082]
“ZZZ Beer” also contains “Beer”, but the counter is not incremented because this portion has already been counted.
[0083]
When all the words registered in the word dictionary 13 have been processed, the counts of the common character strings are checked. If, for example, strings with a count of “1” or more are adopted as semantic information, the common character strings (“(stock)”, “Securities”, and “Beer”) are assigned in association with the notation of each corresponding word in the word dictionary 13.
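The position restriction of steps S11 to S13 can be sketched in the same way. The following Python fragment is only an illustration under stated assumptions: the names `edges` and `estimate_edge_semantics` are hypothetical, notations are again split on spaces instead of being compared character by character, and an element counts as common only when it occupies the same position (beginning or end) in both words.

```python
def edges(word):
    """Return the leading and trailing elements of a notation with their position."""
    parts = word.split()
    return [("head", parts[0]), ("tail", parts[-1])]


def estimate_edge_semantics(words, min_count=1):
    counts = {}      # (position, string) -> counter
    marked = set()   # (word, (position, string)) pairs already counted

    for word in words:                       # step S11: next unprocessed word
        for key in edges(word):              # step S12: only head/tail candidates
            if (word, key) in marked:
                continue                     # avoid double counting
            others = [w for w in words if w != word and key in edges(w)]
            if not others:
                continue
            counts[key] = counts.get(key, 0) + len(others)
            marked.add((word, key))
            marked.update((w, key) for w in others)

    # step S13: assign strings whose count reaches the threshold
    labels = {
        w: [k for k in edges(w) if counts.get(k, 0) >= min_count]
        for w in words
    }
    return counts, labels


WORDS = ["(stock) XX", "(stock) XX Securities", "(stock) AA Securities",
         "XY Beer", "ZZZ Beer"]
counts, labels = estimate_edge_semantics(WORDS)
```

Keeping the position alongside the string lets a later step distinguish a leading element such as “(stock)” from a trailing element such as “Securities”; “XX” is not counted because it is not at the end of “(stock) XX Securities”.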
[0084]
Here, FIG. 8 is an explanatory diagram showing an example of the word dictionary 13 to which semantic information has been assigned by the processing of the semantic information estimation unit 14. As shown in FIG. 8, restricting the common character string to the end of the word makes it possible, for company names and the like, to extract only expressions denoting the superordinate class to which the entity belongs. Here, the business fields “Beer” and “Securities” are assigned as semantic information. Restricting the common character string to the beginning of the word likewise makes it possible to extract an expression that characterizes the whole word, such as “(stock)”. Here, the company form “(stock)” is assigned as semantic information.
[0085]
Here, by limiting the common character string positions, more accurate meaning estimation can be realized.
[0086]
Next, a third embodiment of the present invention will be described with reference to FIGS. Note that the same parts as those of the above-described embodiments are denoted by the same reference numerals, and description thereof is omitted. This embodiment is a modification of the semantic information estimation process in the semantic information estimation unit 14 of the semantic information estimation device 1 described in the first embodiment. Whereas the first and second embodiments assume a word dictionary 13 that holds no semantic information at all, the present embodiment covers the semantic information estimation process for the case where the word dictionary 13 contains some words without registered semantic information, or where a new word is registered.
[0087]
The flow of the semantic information estimation process in the semantic information estimation unit 14 of the present embodiment will be described. As shown in FIG. 9, when the word dictionary 13 contains a word whose semantic information is not registered (Y in step S21), the process proceeds to step S22. There, the word is compared as a character string with the other words in the word dictionary 13; if a common character string exists, it is stored and the counter associated with it is incremented by one. This step performs the function of the second meaning estimation means.
[0088]
On the other hand, when no word without registered semantic information remains in the word dictionary 13 (N in step S21), the process proceeds to step S23. There, either the common character string whose count is equal to or greater than a fixed number, or the semantic information of a word containing that common character string, is assigned to the words in the word dictionary 13 that contain the string. The fixed number can be changed according to, for example, the number of words in the word dictionary 13. This step performs the function of the second semantic information storage means.
[0089]
The semantic information estimation process will be described using a specific example. Here, a description will be given below assuming that the words in the word dictionary 13 shown in FIG. 10 are processed.
[0090]
The word dictionary 13 contains words whose semantic information is not registered (“ZZZ Trading” and “XYZ Beer”), so “ZZZ Trading” is processed first. The other words in the word dictionary 13 are searched for a character string in common with “ZZZ Trading”. Since “ZZZ Trading” shares the string “Trading” with “AA Trading” and “BBB Trading”, “Trading” is stored as a common character string with a count of “2”.
[0091]
Next, “XYZ Beer” is processed. The other words in the word dictionary 13 are searched for a character string in common with “XYZ Beer”. Since “XYZ Beer” shares the string “Beer” with “XY Beer”, “Beer” is stored as a common character string with a count of “1”.
[0092]
After the common character strings have been checked for the words whose semantic information is not registered (“ZZZ Trading” and “XYZ Beer”), the counts are checked. The common character strings (“Trading” and “Beer”) are then assigned, as the semantic information of those words, in association with their notations in the word dictionary 13.
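Steps S21 to S23 differ from the earlier flows in that only words lacking semantic information are processed. A minimal Python sketch follows; the function name `fill_missing`, the use of `None` for a missing entry, and the semantic values given to the already registered words are all assumptions made for illustration (FIG. 10 is not reproduced here), and notations are again split on spaces.

```python
def fill_missing(dictionary, min_count=1):
    """Assign a shared element to each word whose semantic information is None."""
    result = dict(dictionary)
    for word, meaning in dictionary.items():
        if meaning is not None:              # step S21: skip words that already have a meaning
            continue
        best = None                          # (common string, count)
        for part in word.split():            # step S22: count other words sharing the element
            count = sum(1 for w in dictionary
                        if w != word and part in w.split())
            if count >= min_count and (best is None or count > best[1]):
                best = (part, count)
        if best is not None:                 # step S23: assign the common string
            result[word] = best[0]
    return result


# Hypothetical dictionary state; None marks unregistered semantic information.
DICTIONARY = {
    "AA Trading": "Trading",
    "BBB Trading": "Trading",
    "XY Beer": "Beer",
    "ZZZ Trading": None,
    "XYZ Beer": None,
}
filled = fill_missing(DICTIONARY)
```

The sketch assigns the common string itself; the alternative mentioned above, inheriting the semantic information of a word that shares the string, would replace `best[0]` with that word's registered value.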
[0093]
Here, FIG. 11 is an explanatory diagram showing an example of the word dictionary 13 to which the semantic information is given by the processing of the semantic information estimation unit 14. As shown in FIG. 11, each industry field is assigned as semantic information to a word whose semantic information is not registered.
[0094]
Here, semantic information can be automatically estimated from existing dictionary information for an unregistered word having no semantic information.
[0095]
In the present embodiment, the case where there is a word in the word dictionary 13 whose semantic information is not registered has been described, but the same processing can be performed when a word itself is newly registered.
[0096]
Next, a fourth embodiment of the present invention will be described with reference to FIGS. Note that the same parts as those of the above-described embodiments are denoted by the same reference numerals, and description thereof is omitted. Whereas the first to third embodiments estimate the semantic information of words in the word dictionary 13 from the word dictionary 13 itself, the present embodiment estimates the semantic information of words from a document input via the input device 11 or the like.
[0097]
Here, FIG. 12 is a functional block diagram of the semantic information estimation device 1 of the present embodiment. As shown in FIG. 12, when the CPU 2 operates based on the program, a co-occurrence information extraction unit 15 and a semantic information estimation unit 16 are formed in the semantic information estimation device 1. In outline, the co-occurrence information extraction unit 15, which implements the co-occurrence information extraction means, uses a co-occurrence pattern dictionary 17 (see FIG. 13) to extract matching expressions from a document input via the input device 11 or the like. As shown in FIG. 13, the co-occurrence pattern dictionary 17 describes sets of co-occurring expressions (co-occurrence patterns) to be extracted from a document. In the example shown in FIG. 13, the co-occurring expressions are specified by notation or part of speech. The semantic information estimation unit 16, in outline, registers the word and its semantic information when, for a matching expression extracted by the co-occurrence information extraction unit 15, the word itself or its semantic information is not registered in the word dictionary 13.
[0098]
First, the processing in the co-occurrence information extraction unit 15 will be described. When a predetermined word in the input document is followed by a word not registered in the word dictionary 13 or by a sequence of nouns, which is in turn followed by a particle, the co-occurrence information extraction unit 15 assigns the predetermined word to that unregistered word or noun sequence as semantic information. In the example of the co-occurrence pattern dictionary 17 shown in FIG. 13, an unregistered word or noun sequence (the underlined portion in FIG. 13: the co-occurrence word) that follows the word “PC” or “maker” in the input document is given “PC” or “maker”, respectively, as semantic information. Different semantic information can also be assigned for each co-occurrence word by using a table or the like. The input document may or may not have been analyzed by morphological analysis or the like. When morphological analysis has not been performed, parts of speech are determined using the word dictionary 13.
[0099]
Next, the processing in the semantic information estimation unit 16 will be described. FIG. 14 is a flowchart showing the flow of the semantic information estimation process in the semantic information estimation unit 16 of the present embodiment. As shown in FIG. 14, the semantic information estimation unit 16 collates the co-occurrence patterns in the co-occurrence pattern dictionary 17 with the input document one at a time (step S31: document reception means). When a given co-occurrence word (underlined in FIG. 13) is not registered in the word dictionary 13, or its semantic information is not registered (Y in step S32, Y in step S33), semantic information is assigned as instructed by the co-occurrence pattern dictionary 17, or the description in the co-occurrence pattern dictionary 17 is converted into another expression and assigned (step S34). Steps S32 to S34 are performed for all the co-occurrence patterns.
[0100]
The semantic information estimation process will be described using a specific example. Consider, for example, the case where the input document contains the expression “Major maker XX Electric has released the PC A series”.
[0101]
First, based on the co-occurrence pattern of the co-occurrence pattern dictionary 17 shown in FIG. 13, the document is checked for a place where “PC” is followed by an unregistered word or noun sequence (co-occurrence word) and then by a particle. Here, “PC A series” matches. Next, based on the co-occurrence pattern dictionary 17, “A series” is extracted with the semantic information “PC”. The word dictionary 13 is searched for it and, if it is not registered, it is registered in the word dictionary 13 as shown in FIG.
[0102]
Subsequently, based on the co-occurrence pattern of the co-occurrence pattern dictionary 17 shown in FIG. 13, the document is checked for a place where “maker” is followed by an unregistered word or noun sequence (co-occurrence word) and then by a particle. Here, “maker XX Electric” matches. Next, based on the co-occurrence pattern dictionary 17, “XX Electric” is extracted with the semantic information “maker”. The word dictionary 13 is searched for it and, if it is not registered, “XX Electric” is registered in the word dictionary 13 with “maker” as its semantic information.
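The pattern "cue word, then an unregistered word or noun sequence, then a particle" can be sketched in Python as follows. This is a hedged illustration, not the embodiment's code: it assumes the document has already been split into `(surface, part_of_speech)` pairs, and the tag names `noun`, `unknown`, `particle`, and `verb`, the romanized particles `ga` and `wo`, and the function name `extract_cooccurrence` are all hypothetical.

```python
CUES = ("PC", "maker")   # cue words of the co-occurrence patterns (after FIG. 13)


def extract_cooccurrence(tokens, word_dict, cues=CUES):
    """Return {co-occurrence word: semantic information} for unregistered words."""
    found = {}
    i = 0
    while i < len(tokens):
        surface, _ = tokens[i]
        if surface not in cues:
            i += 1
            continue
        # Collect the run of nouns / unregistered words that follows the cue.
        j = i + 1
        while j < len(tokens) and tokens[j][1] in ("noun", "unknown"):
            j += 1
        run = tokens[i + 1:j]
        # The run must be non-empty and must be followed by a particle.
        if run and j < len(tokens) and tokens[j][1] == "particle":
            name = " ".join(s for s, _ in run)
            if word_dict.get(name) is None:   # word or its meaning not registered
                found[name] = surface         # assign the cue word as the meaning
        i = j
    return found


# "Major maker XX Electric has released the PC A series", pre-tokenized by hand.
TOKENS = [("major", "noun"), ("maker", "noun"), ("XX Electric", "unknown"),
          ("ga", "particle"), ("PC", "noun"), ("A", "noun"), ("series", "noun"),
          ("wo", "particle"), ("released", "verb")]
new_entries = extract_cooccurrence(TOKENS, word_dict={})
```

With an empty word dictionary, the sketch proposes “XX Electric” with the meaning “maker” and “A series” with the meaning “PC”, in line with the example above; entries that already carry semantic information are left untouched.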
[0103]
Here, by using the co-occurrence information appearing in the document, it is possible to automatically acquire a word not registered in the word dictionary 13 and its semantic information. In particular, the meaning can be more reliably acquired by using the information of the word structure.
[0104]
Next, a fifth embodiment of the present invention will be described with reference to FIGS. Note that the same parts as those of the above-described embodiment are denoted by the same reference numerals, and description thereof is omitted. This embodiment is a modification of the processing in the co-occurrence information extraction unit 15 and the semantic information estimation unit 16 of the semantic information estimation device 1 described in the fourth embodiment.
[0105]
In the word dictionary 13 of the present embodiment, as shown in FIG. 16, a word and semantic information of the word are stored.
[0106]
Next, the co-occurrence pattern dictionary 17 used in the co-occurrence information extraction unit 15 of the present embodiment will be described. As shown in FIG. 17, the co-occurrence pattern dictionary 17 according to the present embodiment describes the co-occurrence patterns to be collated together with the semantic relationships of the words in each pattern. The semantic relationship of a word in a co-occurrence pattern may be a superordinate or subordinate relationship or some other relationship. In the example shown in FIG. 17, “()” denotes a word having the stated semantic relationship, and “[]” specifies the notation of the word itself. “[X|Y]” means that either X or Y is acceptable.
[0107]
In addition, as shown in FIG. 17, the co-occurrence pattern dictionary 17 of the present embodiment allows hierarchies or associations among co-occurrence patterns to be described. In the example shown in FIG. 17, “(A):(B)(C)” indicates that the meaning “(A)” can be expressed by the whole sequence of “(B)” followed by “(C)”. “(name: noun sequence or unregistered word)” indicates that an expression meaning a name consists of a sequence of nouns or an unregistered word, and this serves as a matching condition during collation. Expressions in a co-occurrence pattern that are formed by combining several words can thus be described separately, which makes description easier, for example when a word consists of a chain of internal constituent words. The example shown in FIG. 17 indicates that a word meaning a product, shown as “(product)” in a co-occurrence pattern, can be expressed by a combination such as “(product)(name: noun sequence or unregistered word)”.
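To make the notation concrete, the following Python sketch parses entries written with “()”, “[]”, “[X|Y]”, and the “(A):(B)(C)” form into a simple data structure. The textual syntax accepted here, the function names `parse_slots` and `parse_entry`, and the sample pattern with romanized particles are assumptions for illustration; the actual layout of FIG. 17 is not reproduced in this text.

```python
import re

# "(...)" is a meaning slot, "[...]" fixes the notation; "|" separates alternatives.
SLOT = re.compile(r"\(([^)]*)\)|\[([^\]]*)\]")
# "(A):(B)(C)" defines meaning A as the sequence that follows the colon.
RULE = re.compile(r"^\(([^():]*)\)\s*:\s*(.+)$")


def parse_slots(text):
    slots = []
    for meaning, notation in SLOT.findall(text):
        if meaning:
            # "(name: condition)" carries an optional matching condition.
            label, _, condition = meaning.partition(":")
            slots.append(("meaning", label.strip(), condition.strip() or None))
        else:
            slots.append(("notation", notation.split("|"), None))
    return slots


def parse_entry(line):
    """Return the meaning an entry defines (or None) and its list of slots."""
    match = RULE.match(line.strip())
    if match:
        return {"defines": match.group(1).strip(),
                "slots": parse_slots(match.group(2))}
    return {"defines": None, "slots": parse_slots(line)}


rule = parse_entry("(product):(product)(name: noun sequence or unregistered word)")
pattern = parse_entry("(maker)[ga](product)[wo][released|sold]")  # hypothetical entry
```

Separating hierarchical entries (those with a `defines` value) from plain patterns mirrors the distinction drawn above: a plain pattern is collated against the document, while a hierarchical entry is consulted only when a “()” part of another pattern needs to be resolved.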
[0108]
Next, the processing in the semantic information estimation unit 16 will be described. FIG. 18 is a flowchart showing the flow of the semantic information estimation process in the semantic information estimation unit 16 of the present embodiment. As shown in FIG. 18, the semantic information estimation unit 16 collates the expressions in the input document in order and examines whether a co-occurrence pattern of the co-occurrence pattern dictionary 17 is present in the input document. At this time, where a pattern contains a “()” part denoting a semantic relationship, collation is first performed including the meaning given by “()” (step S42). If no co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document in this way (N in step S42), the process proceeds to step S43, where collation is performed with the “()” meaning parts excluded.
[0109]
If a co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document (Y in step S43), the process proceeds to step S44, where it is checked whether each “()” meaning part of that pattern is defined by another co-occurrence pattern. If such a definition exists, the character string corresponding to the “()” meaning part is collated against that other co-occurrence pattern.
[0110]
When the character string corresponding to the “()” meaning part has been collated with another co-occurrence pattern (Y in step S44), the co-occurrence pattern with the fewest unmatched parts is selected. If exactly one part remains unmatched (Y in step S46), that unmatched part is taken to have the meaning in question, and the matched co-occurrence pattern and the corresponding character strings (words) are stored (step S47).
[0111]
Likewise, when a co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document by the collation that includes meanings (Y in step S42), the matched co-occurrence pattern and the corresponding character strings (words) are stored (step S47).
[0112]
On the other hand, the process returns to step S41 when no co-occurrence pattern of the co-occurrence pattern dictionary 17 is found in the input document (N in step S43), when the character string corresponding to the “()” meaning part cannot be collated with another co-occurrence pattern (N in step S44), or when the number of unmatched parts is not “1” (N in step S46).
[0113]
The processes in steps S42 to S47 are performed for all the expression patterns in the input document.
[0114]
When all the expressions in the input document have been processed (N in step S41), the corresponding character strings (words) and their meanings are added to the word dictionary 13, based on the matched co-occurrence patterns and the semantic relationships described for them in the co-occurrence pattern dictionary 17 (step S48).
[0115]
The semantic information estimation process will be described using a specific example. Consider, for example, the case where the input document contains the following expressions:
“XX Electric has released the PC A series”
“XX Electric has developed a sputtering system”
[0116]
First, “XX Electric has released PC A series” is collated against the co-occurrence patterns of the co-occurrence pattern dictionary 17 shown in FIG. 17. The first co-occurrence pattern in the dictionary, “(maker) has released (product)”, is first collated as a whole, including the semantic parts “()”, and then only the part other than the semantic parts “()” is collated. Here, everything except the semantic parts (maker) and (product) matches.
[0117]
Since it is not yet known whether the parts defined as semantic parts “()” match, the co-occurrence pattern dictionary 17 is checked for other co-occurrence patterns that define each meaning. For (product), the co-occurrence pattern “(product): (product)(name: noun sequence or unregistered word)” is present in the co-occurrence pattern dictionary 17 and is collated against the corresponding part, “PC A series”. Because “PC” is registered with the meaning “product” in the word dictionary 13 shown in FIG. 16, this co-occurrence pattern applies: “PC A series” is recognized as (product)(name), and the whole part is confirmed as (product). That is, the (product) part of the first co-occurrence pattern, “(maker) has released (product)”, matches.
[0118]
For the remaining “(maker)”, on the other hand, no character string in the word dictionary 13 satisfies the condition. In this example, the first co-occurrence pattern of the co-occurrence pattern dictionary 17, “(maker) has released (product)”, is the best match, and the number of unmatched parts is “1”. The first co-occurrence pattern is therefore regarded as matched, and the unmatched part “XX Electric” is estimated to be (maker). “XX Electric”, which is not registered in the word dictionary 13, is recognized as “maker” and “A series” as “name”, and the matched co-occurrence pattern and its words are stored.
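The collation of this first expression can be sketched in Python as follows. This is only an illustrative sketch: the names `WORD_DICT`, `PATTERN`, `compile_pattern`, `split_product`, and `estimate` are hypothetical, the dictionary layout is an assumption, and the English word order stands in for the original Japanese expressions.

```python
import re

# Hypothetical stand-ins for the word dictionary 13 and for the first entry of
# the co-occurrence pattern dictionary 17; the data layout is an assumption.
WORD_DICT = {"PC": "product"}
PATTERN = "(maker) has released (product)"


def compile_pattern(pattern):
    """Return (regex, slot names) for a pattern with "(meaning)" slots."""
    # re.split with one capture group alternates literal text and slot names.
    pieces = re.split(r"\((\w+)\)", pattern)
    regex = "".join(
        "(.+)" if i % 2 else re.escape(piece) for i, piece in enumerate(pieces)
    )
    return re.compile(regex), pieces[1::2]


def split_product(text, word_dict):
    """Sub-pattern "(product): (product)(name)": registered product word, then a name."""
    for word, meaning in word_dict.items():
        if meaning == "product" and text.startswith(word + " "):
            return word, text[len(word) + 1:]
    return None


def estimate(sentence, pattern, word_dict):
    regex, slots = compile_pattern(pattern)
    m = regex.fullmatch(sentence)
    if m is None:
        return None  # the literal part of the pattern does not match
    found, unmatched = {}, []
    for slot, text in zip(slots, m.groups()):
        if word_dict.get(text) == slot:
            continue  # slot confirmed directly by the word dictionary
        if slot == "product":
            parts = split_product(text, word_dict)
            if parts is not None:
                found[parts[1]] = "name"  # slot confirmed via the sub-pattern
                continue
        unmatched.append((slot, text))
    if len(unmatched) != 1:
        return None  # accept only when exactly one slot is left unconfirmed
    slot, text = unmatched[0]
    found[text] = slot  # the unconfirmed string takes the meaning of its slot
    return found


print(estimate("XX Electric has released PC A series", PATTERN, WORD_DICT))
```

For the example expression, the sketch assigns “maker” to “XX Electric” and “name” to “A series”, and it rejects an expression in which two semantic parts remain unconfirmed.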
[0119]
Next, “XX Electric has developed a sputtering apparatus” is collated. The second co-occurrence pattern of the co-occurrence pattern dictionary 17 shown in FIG. 17, “(maker) has developed (technology)”, is first collated as a whole, including the semantic parts “()”, and then only the part other than the semantic parts “()” is collated. Here, everything except the semantic parts (maker) and (technology) matches.
[0120]
Since it is not yet known whether the parts defined as semantic parts “()” match, the co-occurrence pattern dictionary 17 is checked for other co-occurrence patterns that define each meaning. For (technology), the co-occurrence pattern “(technology): (technology) | (technology)[apparatus | system]” is present in the co-occurrence pattern dictionary 17 and is collated against the corresponding part, “sputtering apparatus”. The co-occurrence pattern applies: “sputtering apparatus” is recognized as (technology)[apparatus | system], and the part is confirmed as (technology). That is, the (technology) part of the second co-occurrence pattern, “(maker) has developed (technology)”, matches.
[0121]
For the remaining “(maker)”, on the other hand, no character string in the word dictionary 13 satisfies the condition. In this example, the second co-occurrence pattern of the co-occurrence pattern dictionary 17, “(maker) has developed (technology)”, is the best match, and the number of unmatched parts is “1”. The second co-occurrence pattern is therefore regarded as matched, and the unmatched part “XX Electric” is estimated to be (maker). “XX Electric”, which is not registered in the word dictionary 13, is recognized as “maker” and “sputtering” as “technology”, and the matched co-occurrence pattern and its words are stored.
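The sub-pattern with alternatives used for (technology) can be sketched in the same way. The function names `match_technology` and `estimate_developer` are hypothetical, the tuple `TECH_SUFFIXES` is an assumed encoding of “[apparatus | system]”, and the optional article in the regular expression exists only because the example is rendered in English.

```python
import re

# Assumed encoding of "(technology): (technology) | (technology)[apparatus | system]".
TECH_SUFFIXES = ("apparatus", "system")


def match_technology(text, word_dict):
    """Return the technology word inside `text`, or None if the sub-pattern fails."""
    if word_dict.get(text) == "technology":
        return text  # first alternative: a registered technology word
    for suffix in TECH_SUFFIXES:
        if text.endswith(" " + suffix):
            return text[: -len(suffix) - 1]  # second alternative: strip the suffix
    return None


def estimate_developer(sentence, word_dict):
    """Collate a sentence against "(maker) has developed (technology)"."""
    m = re.fullmatch(r"(.+) has developed (?:an? )?(.+)", sentence)
    if m is None:
        return None  # the literal part of the pattern does not match
    maker, obj = m.groups()
    tech = match_technology(obj, word_dict)
    if tech is None:
        return None  # the (technology) part is unconfirmed; this sketch rejects
    # Only the (maker) part is left, so it takes the meaning of its slot.
    return {maker: "maker", tech: "technology"}


print(estimate_developer("XX Electric has developed a sputtering apparatus", {}))
```

With an empty word dictionary the sketch still yields “XX Electric” as maker and “sputtering” as technology, because the trailing “apparatus” alone is enough for the sub-pattern to apply.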
[0122]
After all the co-occurrence patterns have been collated, semantic information is given to the words in the word dictionary 13 on the basis of the matched co-occurrence patterns. Here, the meaning “maker” for the word “XX Electric”, the meaning “name” for the word “A series”, and the meaning “technology” for the word “sputtering” can be registered in the word dictionary 13. Regarding the semantic relationships, the order of association can be changed according to the purpose (for example, the maker is placed on the left side in FIG. 19). In addition, as shown in FIG. 20, meanings can also be given within the word dictionary 13 itself.
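The registration step can be illustrated with a short Python sketch. The function `register`, the `relations` mapping, and the `left` parameter are hypothetical; the actual layouts of FIG. 19 and FIG. 20 are not reproduced here, so the structure below is only an assumption about how words, meanings, and associations might be held together.

```python
def register(word_dict, relations, matches, left="maker"):
    """Add estimated meanings to word_dict and link co-occurring words."""
    for found in matches:
        for word, meaning in found.items():
            word_dict.setdefault(word, meaning)  # never overwrite a registered meaning
        # Words whose meaning equals `left` become the key of the association.
        for anchor in (w for w, m in found.items() if m == left):
            linked = relations.setdefault(anchor, [])
            for word, meaning in found.items():
                if meaning != left and word not in linked:
                    linked.append(word)
    return word_dict, relations


word_dict = {"PC": "product"}
relations = {}
matches = [
    {"XX Electric": "maker", "A series": "name"},
    {"XX Electric": "maker", "sputtering": "technology"},
]
register(word_dict, relations, matches)
print(word_dict)
print(relations)
```

Changing `left` corresponds to changing the order of association according to the purpose: with the default value, “XX Electric” is the key and “A series” and “sputtering” are linked to it.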
[0123]
As described above, with co-occurrence patterns, even when a word itself or its meaning is not registered, the meaning can be estimated from the expressions with which the word co-occurs. Furthermore, because the meanings are related to one another, detailed semantic information can be estimated from the results.
[0124]
Here, the labor of describing patterns can be reduced by describing the co-occurrence patterns in association with one another or hierarchically. Because the patterns are associated, semantic information can be estimated from related known information even when an unregistered word is present. Moreover, since the associated co-occurrence patterns make it possible not only to estimate semantic information but also to associate pieces of semantic information with one another, more detailed semantic information can be estimated.
[0125]
[Effect of the invention]
According to the invention described in claims 1, 3, and 5, a character string common to several words is likely to represent a common concept, and using it makes it easy to assign meanings. Semantic information can therefore be given automatically to words registered in the word dictionary without relying on manual work or techniques such as syntax analysis. Further, by using the co-occurrence patterns appearing in a document, words not registered in the word dictionary and their semantic information can be acquired automatically. Moreover, the labor of describing patterns can be reduced by describing the co-occurrence patterns in association with one another or hierarchically. Because the patterns are associated, semantic information can be estimated from related known information even when an unregistered word is present. Furthermore, since the associated co-occurrence patterns make it possible not only to estimate semantic information but also to associate pieces of semantic information with one another, more detailed semantic information can be estimated.
[0126]
According to the invention described in claims 2, 4, and 6, more accurate meaning estimation can be achieved by restricting the position of the common character string.
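As a minimal sketch of estimation from a common character string restricted to the end of the notation, the following Python function may help. The function name `estimate_by_suffix`, the thresholds `min_count` and `min_len`, and the example company names are all hypothetical; the “predetermined number” is simply modeled as `min_count`.

```python
def estimate_by_suffix(words, min_count=2, min_len=2):
    """Assign a shared trailing string as the meaning of the words that end with it."""
    meanings = {}
    for word in words:
        # Try longer suffixes first so the most specific common string wins.
        for start in range(1, len(word) - min_len + 1):
            suffix = word[start:]
            # Count only the notations of words other than the word in question.
            others = [w for w in words if w != word and w.endswith(suffix)]
            if len(others) >= min_count:
                meanings[word] = suffix
                break
    return meanings


# Hypothetical dictionary entries (company names ending in "電気" or "銀行").
words = ["山田電気", "鈴木電気", "佐藤電気", "山田銀行"]
print(estimate_by_suffix(words))
```

Here the three names ending in “電気” receive that string as their meaning, while “山田銀行” receives none because no other entry shares its ending. Restricting the search to the beginning of the notation would be the symmetric variant using `startswith`.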
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing a hardware configuration of a semantic information estimation apparatus according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an initial state of a word dictionary.
FIG. 3 is a functional block diagram of a semantic information estimation apparatus.
FIG. 4 is a flowchart schematically showing a flow of semantic information estimation processing.
FIG. 5 is an explanatory diagram showing a word dictionary to which semantic information is assigned.
FIG. 6 is a flowchart schematically showing a flow of semantic information estimation processing in the semantic information estimation apparatus according to the second embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an initial state of a word dictionary.
FIG. 8 is an explanatory diagram showing a word dictionary to which semantic information is assigned.
FIG. 9 is a flowchart schematically showing a flow of semantic information estimation processing in the semantic information estimation apparatus according to the third embodiment of the present invention.
FIG. 10 is an explanatory diagram showing an initial state of a word dictionary.
FIG. 11 is an explanatory diagram showing a word dictionary to which semantic information is assigned.
FIG. 12 is a functional block diagram of a semantic information estimation apparatus according to a fourth embodiment of this invention.
FIG. 13 is an explanatory diagram showing a co-occurrence pattern dictionary.
FIG. 14 is a flowchart schematically showing a flow of semantic information estimation processing.
FIG. 15 is an explanatory diagram showing a word dictionary to which semantic information is assigned.
FIG. 16 is an explanatory diagram showing an initial state of a word dictionary of the semantic information estimation apparatus according to the fifth embodiment of the present invention.
FIG. 17 is an explanatory diagram showing a co-occurrence pattern dictionary.
FIG. 18 is a flowchart schematically showing a flow of semantic information estimation processing.
FIG. 19 is an explanatory diagram showing a word dictionary to which semantic information is assigned.
FIG. 20 is an explanatory diagram showing another example of a word dictionary to which semantic information is assigned.
[Explanation of symbols]
1 Semantic information estimation device
7 Storage medium
11 Input unit
13 Word dictionary
17 Co-occurrence pattern dictionary

Claims

1. A semantic information estimation device comprising:
word dictionary storage means for storing a plurality of word notations;
meaning estimation means for searching, among the notations of words other than one word stored in the word dictionary storage means, for a common character string that is common to a character string constituting at least part of the notation of the one word, and, when the common character string is found a predetermined number of times or more, estimating the common character string as semantic information of the words whose notations include the common character string;
semantic information storage means for storing the semantic information estimated by the meaning estimation means in the word dictionary storage means in association with the notations of the words including the common character string;
co-occurrence pattern dictionary storage means for storing, in association with each other, a first co-occurrence pattern that defines a relationship between co-occurring words and the semantic information of each word of the first co-occurrence pattern, and for storing, in association with each other, a second co-occurrence pattern that defines a hierarchical relationship or an association relationship between any word of the first co-occurrence pattern and another word and the semantic information of the other word of the second co-occurrence pattern;
document receiving means for receiving input of a document from an input unit; and
co-occurrence information extraction means for extracting, from the document received by the document receiving means, each word that satisfies the first co-occurrence pattern stored in the co-occurrence pattern dictionary storage means,
wherein the semantic information storage means further stores, in the word dictionary storage means, the notation of each word extracted by the co-occurrence information extraction means in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means, and, when a hierarchical relationship or an association relationship between any of the words extracted by the co-occurrence information extraction means and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, stores, in the word dictionary storage means, the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.

2. The semantic information estimation device according to claim 1, wherein the meaning estimation means searches for the common character string in the character string at the end or at the beginning of the notations of the words other than the one word.

3. A semantic information estimation method comprising:
a meaning estimation step in which meaning estimation means searches, among the notations of words other than one word stored in word dictionary storage means that stores a plurality of word notations, for a common character string that is common to a character string constituting at least part of the notation of the one word, and, when the common character string is found a predetermined number of times or more, estimates the common character string as semantic information of the words whose notations include the common character string;
a first semantic information storage step in which semantic information storage means stores the semantic information estimated in the meaning estimation step in the word dictionary storage means in association with the notations of the words including the common character string;
a document receiving step in which document receiving means receives input of a document from an input unit;
a co-occurrence information extraction step in which co-occurrence information extraction means extracts, from the document received in the document receiving step, each word that satisfies a first co-occurrence pattern stored in co-occurrence pattern dictionary storage means, the co-occurrence pattern dictionary storage means storing, in association with each other, the first co-occurrence pattern, which defines a relationship between co-occurring words, and the semantic information of each word of the first co-occurrence pattern, and storing, in association with each other, a second co-occurrence pattern, which defines a hierarchical relationship or an association relationship between any word of the first co-occurrence pattern and another word, and the semantic information of the other word of the second co-occurrence pattern; and
a second semantic information storage step in which the semantic information storage means further stores, in the word dictionary storage means, the notation of each word extracted in the co-occurrence information extraction step in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means, and, when a hierarchical relationship or an association relationship between any of the words extracted in the co-occurrence information extraction step and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, stores, in the word dictionary storage means, the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.

4. The semantic information estimation method according to claim 3, wherein, in the meaning estimation step, the meaning estimation means searches for the common character string in the character string at the end or at the beginning of the notations of the words other than the one word.

5. A program for causing a computer to estimate semantic information of words, the program causing the computer to execute:
a meaning estimation function of searching, among the notations of words other than one word stored in word dictionary storage means that stores a plurality of word notations, for a common character string that is common to a character string constituting at least part of the notation of the one word, and, when the common character string is found a predetermined number of times or more, estimating the common character string as semantic information of the words whose notations include the common character string;
a first semantic information storage function of storing the semantic information estimated by the meaning estimation function in the word dictionary storage means in association with the notations of the words including the common character string;
a document receiving function of receiving input of a document from an input unit;
a co-occurrence information extraction function of extracting, from the document received by the document receiving function, each word that satisfies a first co-occurrence pattern stored in co-occurrence pattern dictionary storage means, the co-occurrence pattern dictionary storage means storing, in association with each other, the first co-occurrence pattern, which defines a relationship between co-occurring words, and the semantic information of each word of the first co-occurrence pattern, and storing, in association with each other, a second co-occurrence pattern, which defines a hierarchical relationship or an association relationship between any word of the first co-occurrence pattern and another word, and the semantic information of the other word of the second co-occurrence pattern; and
a second semantic information storage function of storing, in the word dictionary storage means, the notation of each word extracted by the co-occurrence information extraction function in association with the semantic information of that word stored in the co-occurrence pattern dictionary storage means, and, when a hierarchical relationship or an association relationship between any of the words extracted by the co-occurrence information extraction function and another word is defined in the second co-occurrence pattern stored in the co-occurrence pattern dictionary storage means, storing, in the word dictionary storage means, the notation of the other word in association with the semantic information of the other word stored in the co-occurrence pattern dictionary storage means.

6. The program according to claim 5, wherein the meaning estimation function searches for the common character string in the character string at the end or at the beginning of the notations of the words other than the one word.