JP4114462B2 - Information search device and information search system

Info

Publication number: JP4114462B2
Application number: JP2002329974A
Authority: JP
Inventors: 宏行大沼
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2002-11-13
Filing date: 2002-11-13
Publication date: 2008-07-09
Anticipated expiration: 2022-11-13
Also published as: JP2004164332A

Description

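[0001]
[Technical Field of the Invention]
The present invention relates to an information search apparatus and an information search system in a document management system that manages documents (electronic documents) such as e-mail.
[0002]
[Prior Art]
Document management systems that manage documents such as e-mail have conventionally provided functions such as full-text search of the message body. It is also possible to extract from a document expressions such as dates, times, and places, as well as actions such as a request to attend a conference, and then to search under various conditions using the types of words in the document, for example for documents likely to contain information on "an event whose venue is the XX Dome" or "a reply request whose deadline is within this week".
[0003]
Further, to improve user convenience, Japanese Patent Laid-Open No. 10-69472 proposes, for e-mail, functions such as extracting from the dates and times in the body the deadline by which an action must be taken, and sorting messages in order of the nearest deadline.
[0004]
In addition, Sato et al., "Automatic Digest Generation of Electronic News" (Journal of Information Processing Society of Japan, Vol. 36, No. 10, pp. 2371-2379), extract conference dates, venues, paper submission deadlines, and the like from electronic news. These techniques are effective for extracting only predetermined information.
[0005]
On the other hand, Japanese Patent Laid-Open No. 9-269940 proposes a function that extracts information such as dates and times from the body of an e-mail and, using a date or the like as a search key, searches for messages whose body contains that date. Places, matters, and the like are also extracted. These functions make it possible to cut out and display the dates, times, places, matters, and so on in a message body, and to sort and display messages according to whether the date and time in the message is before or after the present. However, the search does not go as far as identifying, for example, whether a date represents an event date or a deadline.
[0006]
[Patent Document 1]
Japanese Patent Laid-Open No. 10-69472
[Patent Document 2]
Japanese Patent Laid-Open No. 9-269940
[Non-Patent Document 1]
"Automatic Digest Generation of Electronic News" (Journal of Information Processing Society of Japan, Vol. 36, No. 10, pp. 2371-2379)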
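[0007]
[Problems to Be Solved by the Invention]
In practical document search, it must be possible to search under a variety of conditions: for example, searching for "documents that contain an event date"; specifying a concrete date and searching for "documents that contain August 3, 2002 as an event date"; or, without distinguishing event dates from other dates, searching for "documents that contain August 3, 2002 as a date/time". To enable searches under such conditions, processing that identifies the type of each word is needed not only for documents in a specific field but for documents of many kinds.
[0008]
Searches must also work reasonably well even when a word type is wrong because of an extraction error. For example, if something that should have been extracted as an "event date" could only be extracted as a plain "date/time", an exact search fails; even in such a case the search must not miss the document. No combination of the conventional techniques achieves these goals effectively.
[0009]
This is because of the following problems.
[Problem 1] The method of Sato et al., "Automatic Digest Generation of Electronic News", is effective when the field is limited, for example to news about conferences. Applying it to documents in various fields requires preparing many kinds of rules, such as rules for extracting information about conferences and rules for extracting the deadlines of reply requests. It also requires, as preprocessing, a step that determines whether a received e-mail concerns a conference. However, a single e-mail may contain various kinds of information, so this selection does not always succeed.
[0010]
[Problem 2] When specific information is extracted for a limited field, as in Sato et al., the extraction results are generally put into a table. In that case, however, the information on how reliable each extraction result is gets lost. The user therefore cannot judge whether the search results can be trusted, which can make the system hard to use.
[0011]
[Problem 3] Japanese Patent Laid-Open No. 9-269940 extracts information such as dates and times from the body of an e-mail but does not distinguish whether a date is a deadline or an event date. Search conditions such as searching by event date therefore cannot be set. Conversely, if extraction goes as far as determining whether a date is a deadline or an event date, extraction errors may occur, and these can cause documents to be missed.
[0012]
The present invention has been made in view of the above problems of conventional information search apparatuses and information search systems, and its object is to provide a new and improved information search apparatus and information search system capable of solving those problems.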
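[0013]
[Means for Solving the Problems]
To solve the above problems, a first aspect of the present invention provides an information search apparatus that searches for documents. As set forth in claim 1, the information search apparatus of the invention comprises: a document storage unit (6) that stores documents; an extraction information storage unit (5) that stores, as information for searching documents, character strings in the documents by classification; a condition input unit (7) to which a character string serving as a search condition is input; and a search unit (9) that searches the document storage unit for documents on the basis of the character string input as the search condition. The search unit (9) comprises a candidate selection unit (10) that selects documents after relaxing the search condition so that documents are searched regardless of the classification of the input character string, and a fitness calculation unit (11) that calculates the fitness between each selected document and the character string serving as the search condition.
[0014]
The apparatus may further comprise: a document input unit (1) to which documents to be stored in the document storage unit are input; an extraction rule storage unit (3) that stores extraction rules for extracting character strings from an input document; an extraction unit (2) that extracts character strings from the document input through the document input unit by matching it against the extraction rules stored in the extraction rule storage unit; and a document information registration unit (4) that determines the classification of each character string extracted by the extraction unit and registers the character string in the extraction information storage unit.
[0015]
In this case, as set forth in claim 6, the classification of character strings in the extraction information storage unit (5) may consist of a major classification determined by the extraction rules stored in the extraction rule storage unit (3) and a minor classification that subdivides the major classification. Further, as set forth in claim 7, the major classification may additionally include a classification for registering nouns in the documents stored in the extraction information storage unit (5), independently of the extraction rules stored in the extraction rule storage unit (3). Even when a word cannot be extracted as an event or a place because of, for example, a deficiency in the extraction rules, it is registered as a noun in the word information storage unit, so that search omissions can be eliminated.
[0016]
Such an information search apparatus can solve [Problem 1], [Problem 2], and [Problem 3] above, as explained below.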
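[0017]
The extraction information storage unit (5) stores character strings in a document under classifications such as "event", "action", "place", and "date/time", and the apparatus has no extraction rules specific to a particular field such as news about conferences. This makes it unnecessary to determine in advance what field a document belongs to, which resolves [Problem 1]. Furthermore, because the extraction information storage unit (5) stores character strings by classification (major and minor), detailed searches are possible. For example, when searching by date/time, classifying each date/time in advance as a "deadline" or an "event date" allows the two to be distinguished in a search.
[0018]
The candidate selection unit (10) selects documents after relaxing the search condition so that documents are searched regardless of the classification of the character string input as the search condition. The purpose is to increase the number of selectable documents in view of possible extraction errors. Specifically, the search condition can be relaxed as follows.
(1) When the character string input as the search condition is a concrete value, relax the search condition so that only the concrete value has to match, and select documents.
(2) When the character string input as the search condition is a concrete value, relax the search condition so that only the concrete value has to match; when it is not a concrete value, relax the search condition so that only the major classification has to match; then select documents.
[0019]
Selecting documents under a relaxed search condition in this way prevents search omissions even when the extraction result produced by the extraction rules is wrong. This resolves [Problem 3].
[0020]
The fitness calculation unit (11) then calculates the fitness of the documents selected by the candidate selection unit as described above. Specifically, the fitness can be calculated as follows.
(1) The fitness between a selected document and the character string serving as the search condition is calculated using the classifications of the character strings in the selected document.
(2) The fitness between a selected document and the character string serving as the search condition is calculated using the positional relationships of the character strings in the selected document.
[0021]
In particular, as in (2), when information for searching documents is extracted, words representing dates, places, and the like in a document are not stored in table form; instead, the position of each word is stored. At search time, the fitness to the search condition is calculated using the type of each word and the positional relationships between words. Presenting the results to the user sorted in descending order of fitness lets the user judge whether the search results can be trusted. This resolves [Problem 2].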
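[0022]
To solve the above problems, a second aspect of the present invention provides an information search system that has a server unit (100) and a terminal unit (110) connected via a network (120) and that searches for documents.
[0023]
In the information search system of the invention, the server unit (100) comprises a communication unit (13) that transmits and receives search conditions and search results, a document storage unit (6) that stores documents, an extraction information storage unit (5) that stores, as information for searching documents, character strings in the documents by classification, and a search unit (9) that searches the document storage unit for documents on the basis of the character string input as a search condition.
[0024]
The terminal unit (110) comprises a communication unit (14) that transmits and receives search conditions and search results, and a condition input unit (7) to which a character string serving as a search condition is input. The search unit (9) comprises a candidate selection unit (10) that selects documents after relaxing the search condition so that documents are searched regardless of the classification of the input character string, and a fitness calculation unit (11) that calculates the fitness between each selected document and the character string serving as the search condition.
[0025]
Such an information search system has substantially the same effects as the information search apparatus according to the first aspect and can also be applied, for example, to a case in which e-mail is received via a network such as the Internet, the received e-mail is registered sequentially, and searches are performed at the user's request. The functions of the server unit (100) can also be provided as a so-called ASP (Application Service Provider) service, which allows commercial deployment.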
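[0026]
[Embodiments of the Invention]
Preferred embodiments of the information search apparatus and information search system according to the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and duplicate description is omitted.
[0027]
(First Embodiment)
This embodiment describes an information processing apparatus that registers the dates/times, places, events, and actions contained in documents and searches using them. To solve the three problems above, it takes the following approaches.
[0028]
For [Problem 1], the apparatus is configured without field-specific extraction rules. This eliminates the need to determine in advance what field a document belongs to.
[0029]
For [Problem 2], when words are extracted, words representing dates, places, and the like in a document are not stored in table form; the position of each word is stored instead. At search time, the fitness to the search condition is calculated using the type of each word, the positional relationships between words, and so on. Displaying the fitness to the user, or displaying the results sorted in descending order of fitness, lets the user judge how far the search results can be trusted.
[0030]
For [Problem 3], the processing that extracts word types distinguishes whether a date/time is a deadline or an event date. This lets the user, for example, search for event dates and deadlines separately. At the same time, because the extraction result may be wrong, the search condition is relaxed from the one the user set so that candidate documents are retrieved regardless of the word type the user specified, which prevents search omissions.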
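[0031]
FIG. 1 is an explanatory diagram showing the system configuration of the information search apparatus 20 according to this embodiment. As shown in FIG. 1, the information search apparatus 20 includes a document input unit 1, an extraction unit 2, an extraction rule storage unit 3, a document information registration unit 4, an extraction information storage unit 5, a document storage unit 6, a condition input unit 7, a display unit 8, and a search unit 9. The search unit 9 includes a candidate selection unit 10, a fitness calculation unit 11, and a candidate sorting unit 12.
[0032]
Of these components, the document input unit 1, extraction unit 2, extraction rule storage unit 3, document information registration unit 4, extraction information storage unit 5, and document storage unit 6 are used to register document information. The condition input unit 7, display unit 8, and search unit 9 are used to search document information. Each component is described in detail below.
[0033]
(Document input unit 1)
The document input unit 1 accepts documents to be registered.
[0034]
(Extraction unit 2)
The extraction unit 2 matches the document input from the document input unit 1 against the extraction rules stored in the extraction rule storage unit 3 and extracts the matching information. Specifically, when the input document is an e-mail, it extracts the dates/times, events, places, actions, and so on contained in the subject and body. These categories (date/time, event, place, action, and so on) are called "major classifications". The extraction unit also extracts finer-grained information, for example whether an event is a seminar or a conference. Such subdivided information is called a "minor classification".
[0035]
(Extraction rule storage unit 3)
The extraction rule storage unit 3 stores extraction rules for extracting predetermined information from the document input from the document input unit. FIG. 2 is an explanatory diagram showing an example of the extraction rules stored in the extraction rule storage unit 3. For example, rule A-1 in FIG. 2(a) is an extraction rule for "event - conference" and matches character strings such as "第９回環境フォーラム" ("The 9th Environmental Forum"). Rule A-2 is an extraction rule whose major classification is event and which specifies no minor classification; it matches character strings such as "祇園祭" ("Gion Festival") and "サマーフェスティバル" ("Summer Festival"). Rules A-3 and A-4 are similar.
[0036]
As shown in FIG. 2(b), the extraction rule storage unit 3 also stores extraction rules for heading words. A heading-word extraction rule extracts a heading word such as "日時：" ("Date/time:") in "日時：２００２年８月６日（木）" on line 8 of the document in FIG. 3(a). For example, rule B-1 is the extraction rule for "date/time - event date" heading words and matches character strings such as "日時：" ("Date/time:") and "開催日：" ("Event date:").
[0037]
As shown in FIG. 2(c), the extraction rule storage unit 3 also stores rules that classify dates/times using heading words. Rules C-1 to C-3 are such rules. For example, when a <DATE> tag is set on the date "２００２年８月６日（木）" and an <HD DATE OPEN> tag is set on the heading word "日時：", line 8 of the document in FIG. 3(a) becomes:
<HD DATE OPEN>日時：</HD><DATE>2002年8月6日(木)</DATE>
Here, "２００２年８月６日（木）" follows the heading word "日時：", so it is considered to be an event date. Rule C-1 is therefore applied and the date is classified as an event date.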
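As a rough illustration of a heading-based classification rule such as C-1, the following Python sketch classifies a `<DATE>` span by the heading tag that immediately precedes it. The regular expression, the `HEADING_TO_MINOR` table, and the `LIMIT` entry are assumptions made for this example; the actual rule format of FIG. 2(c) is not reproduced here.

```python
import re

# Hypothetical mapping from the heading-tag subtype to a minor classification.
# OPEN -> event date follows rule C-1; LIMIT -> deadline is assumed by analogy.
HEADING_TO_MINOR = {"OPEN": "event date", "LIMIT": "deadline"}

# A heading tagged <HD DATE xxx> immediately followed by a <DATE> span.
PATTERN = re.compile(r"<HD DATE (\w+)>(.*?)</HD><DATE>(.*?)</DATE>")

def classify_dates(tagged_line):
    """Return (date_text, minor_classification) pairs for one tagged line."""
    results = []
    for subtype, _heading, date_text in PATTERN.findall(tagged_line):
        results.append((date_text, HEADING_TO_MINOR.get(subtype)))
    return results

line = "<HD DATE OPEN>日時：</HD><DATE>2002年8月6日(木)</DATE>"
print(classify_dates(line))
```

A date with no preceding heading tag is simply not matched by this pattern; in the described apparatus its minor classification would instead be decided later from the surrounding verbs.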
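[0038]
The extraction rules stored in the extraction rule storage unit 3 are not limited to the example shown in FIG. 2. For example, they may take the form of the regular expressions shown in FIG. 10 of Japanese Patent Laid-Open No. 9-269940 or the format shown in FIG. 11 of that publication.
[0039]
The result of applying the rules of FIG. 2 to the document of FIG. 3(a) is shown in FIG. 3(b). In FIG. 3(b), "date/time - event date" is enclosed in <DATE OPEN> tags and "date/time - deadline" in <DATE LIMIT> tags. Likewise, "event - conference" is enclosed in <EVT CONFERENCE> tags, "action - participation request" in <ACT JOIN> tags, "action - reply request" in <ACT ASK> tags, and "place" in <POS> tags.
[0040]
(Document information registration unit 4)
When a date/time in a character string extracted by the extraction unit 2 is a relative time such as "today" or "tomorrow", the document information registration unit 4 converts it to an absolute time. It also performs morphological analysis and, for example, uses the fact that the verb of the sentence containing "第１会議室" ("Conference Room 1") on line 9 of FIG. 3(a) is "行う" ("to hold") to determine that "Conference Room 1" is the venue. It then registers these analysis results in the extraction information storage unit 5.
[0041]
(Extraction information storage unit 5)
The extraction information storage unit 5 stores the words that appear in a document, as the result of analysing the document, together with classifications such as date/time and event. FIG. 4 shows an example of the data in the extraction information storage unit 5, namely the classification of the extraction result of FIG. 3(b). Each record consists of the following fields: document number, sequence number, type, external representation, internal representation, line position, and sentence position.
[0042]
The document number is an identification number assigned to each document. The sequence number identifies each extracted word. The type is the classification of the extracted word and is divided into a major classification and a minor classification. The external representation is the expression as it appeared in the document. The internal representation holds a value when a relative time such as "本日" ("today") has been converted to an absolute time such as "2002/8/3". The line position holds the line number at which the word appeared in the document. The sentence position indicates in which sentence, counted from the start, the word appeared. When the document is an e-mail, however, the line position and sentence position of words in the subject are set to -1.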
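One possible in-memory shape for such a record is sketched below in Python. The class name, field names, and sample values are hypothetical (the data of FIG. 4 is not reproduced in the text); only the set of fields and the -1 convention for subject words come from the description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedRecord:
    doc_no: int               # document number
    seq_no: int               # sequence number of the extracted word
    major: str                # major classification, e.g. "date/time", "place"
    minor: Optional[str]      # minor classification, None when not determined
    external: str             # expression as it appears in the document
    internal: Optional[str]   # absolute form, e.g. "2002/8/3" for a relative date
    line_pos: int             # line number, -1 for words in an e-mail subject
    sentence_pos: int         # sentence number, -1 for words in an e-mail subject

def subject_record(doc_no, seq_no, major, minor, external, internal=None):
    """Build a record for a word found in the subject (positions are -1)."""
    return ExtractedRecord(doc_no, seq_no, major, minor, external, internal, -1, -1)

# Illustrative values only.
venue = ExtractedRecord(1, 5, "place", "venue", "第１会議室", None, 9, 4)
print(venue)
```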
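[0043]
(Document storage unit 6)
The document storage unit 6 stores the documents themselves, each with a document number. The document number corresponds to the document number field of the extraction information storage unit 5.
[0044]
(Condition input unit 7)
The condition input unit 7 accepts user operations such as the input of search conditions.
[0045]
(Display unit 8)
The display unit 8 displays the input screen and outputs the search results.
[0046]
(Search unit 9)
The search unit 9 performs search processing for the search conditions input from the condition input unit 7. It includes the candidate selection unit 10, the fitness calculation unit 11, and the candidate sorting unit 12.
[0047]
(Candidate selection unit 10)
The candidate selection unit 10 retrieves documents that match the user's search conditions as candidate documents. In doing so, it takes into account possible extraction errors by the extraction unit 2 and the document information registration unit 4 and modifies the search conditions so that they hit more documents than the conditions set by the user would.
[0048]
(Fitness calculation unit 11)
For the documents returned by the candidate selection unit 10, the fitness calculation unit 11 calculates the fitness to the search conditions using the type, line position, and sentence position fields of the extraction information storage unit 5. It gives a high fitness to hit records whose major and minor classifications match those of the search conditions and to hit records whose line positions or sentence positions are close to one another.
[0049]
(Candidate sorting unit 12)
The candidate sorting unit 12 sorts the candidates in descending order of fitness.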
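[0050]
The information search apparatus 20 according to this embodiment is configured as described above. Its operation is described next: first the registration of document information, then the search of document information. FIG. 5 is a flowchart showing the registration processing of the information search apparatus 20. The registration of the document shown in FIG. 3(a) is used as an example.
[0051]
<Registration processing>
(Step S100)
The extraction unit 2 performs extraction on the document input from the document input unit 1, using the extraction rules shown in FIG. 2. For the e-mail of FIG. 3(a), the result shown in FIG. 3(b) is obtained.
(Step S110)
As preprocessing, the document information registration unit 4 issues a document number.
(Step S120)
The document information registration unit 4 registers the document input from the document input unit 1 in the document storage unit 6.
(Step S130)
The document information registration unit 4 executes the processing shown in the flowchart of FIG. 6 on the document obtained in step S100. This processing is described below with reference to FIG. 6.
[0052]
(Step S1000)
Set the line number to 0. Set the sentence number to 0.
(Step S1010)
Execute steps S1020 to S1100 for every line of the document. When all lines have been processed, end the processing. If an unprocessed line remains, go to step S1020.
(Step S1020)
Take an unprocessed line as the processing target and increment the line number by 1. Increment the sentence number by 1.
(Step S1030)
Split the target line into morphemes. For example, line 9 of FIG. 3(b) becomes "<POS>", "第１会議室", "</POS>", "で", "行う (verb)", "ます", "。". A span enclosed by tags (for example, "第１会議室") is treated as a single morpheme.
(Step S1040)
Execute steps S1050 to S1100 for every morpheme. When this is done, return to step S1010.
[0053]
(Step S1050)
If the target morpheme is "。" (full stop), go to step S1060. If it is a tag other than an <HD> tag (heading tag), find its closing tag and go to step S1070. Otherwise, return to step S1040. For example, on line 9 of FIG. 3(b), when "<POS>" is found, processing advances to its closing tag "</POS>" and goes to step S1070. For "で", "行う", and "ます", which are neither tags nor full stops, nothing is done and processing returns to step S1040. For "。", processing goes to step S1060.
[0054]
(Step S1060)
Increment the sentence number by 1. Return to step S1040.
(Step S1070)
Check whether a minor classification is set on the tag. If not, go to step S1080 to determine the minor classification. If it is set, go to step S1090.
[0055]
(Step S1080)
Determine the minor classification by examining which verbal nouns (sa-hen nouns) or verbs occur among the morphemes after the tag. If the tag is a <DATE> tag and the following morphemes include "開催" ("holding"; verbal noun), "行う" ("to hold"; verb), or "開く" ("to open"; verb), the minor classification is judged to be event date. If they include "締切" ("deadline"; verbal noun), it is judged to be deadline. If the tag is a <POS> tag and the following morphemes include "開催", "行う", or "開く", the minor classification is judged to be venue. Go to step S1090. For example, on line 9 of FIG. 3(b), "行う (verb)" follows "<POS>" ... "</POS>", so the span between "<POS>" and "</POS>" is judged to be a venue.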
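The decision in step S1080 can be sketched in Python as a simple cue-word lookup. This is a minimal sketch, assuming the morphemes are given in dictionary form and that the caller decides how far after the tag to look; the function name and the cue-set names are hypothetical.

```python
# Cue words named in step S1080 (dictionary forms).
EVENT_CUES = {"開催", "行う", "開く"}
DEADLINE_CUES = {"締切"}

def decide_minor(tag, following_morphemes):
    """Return a minor classification for a <DATE> or <POS> span, or None."""
    cues = set(following_morphemes)
    if tag == "DATE":
        if cues & EVENT_CUES:
            return "event date"
        if cues & DEADLINE_CUES:
            return "deadline"
    elif tag == "POS":
        if cues & EVENT_CUES:
            return "venue"
    return None  # no cue found: the minor classification stays empty

# Line 9 of FIG. 3(b): <POS>第１会議室</POS> で 行う ます 。
print(decide_minor("POS", ["で", "行う", "ます", "。"]))
```

Returning `None` when no cue is found matches the later search steps, which treat a blank minor classification as a separate, slightly penalized case.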
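[0056]
(Step S1090)
If the tag is a <DATE> tag, calculate the absolute time, for example by the method disclosed in Japanese Patent Laid-Open No. 10-69472. Go to step S1100.
(Step S1100)
Register the data in the extraction information storage unit 5. After registration, return to step S1040.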
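The line and sentence counters of steps S1000 to S1060 can be traced with the short Python sketch below. It assumes each line has already been split into morphemes and omits tag handling; the function name is hypothetical.

```python
def number_positions(lines):
    """Return (line_no, sentence_no, morpheme) triples using the counters
    of steps S1000-S1060. `lines` is a list of morpheme lists."""
    line_no = 0        # S1000
    sentence_no = 0    # S1000
    rows = []
    for morphemes in lines:
        line_no += 1       # S1020
        sentence_no += 1   # S1020: every new line also starts a new sentence
        for m in morphemes:
            if m == "。":
                sentence_no += 1   # S1060
            else:
                rows.append((line_no, sentence_no, m))
    return rows

rows = number_positions([["第１会議室", "で", "行う", "ます", "。"],
                         ["全員", "参加", "。"]])
for row in rows:
    print(row)
```

Because the sentence number is incremented both at a full stop and at every new line, a line that ends with a full stop advances the counter twice, and a sentence broken across two lines is counted as two sentences. Only the differences between sentence numbers are used later, so this mainly affects the distance values.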
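[0057]
<Search processing>
The search processing is described next. FIG. 7 is a flowchart showing the search processing of the information search apparatus 20 according to this embodiment. The data in the extraction information storage unit 5 is assumed to be as shown in FIG. 9(a). The description below refers to this flowchart.
[0058]
In this embodiment, to allow searches under various conditions, a search condition input screen such as the one in FIG. 8 is displayed to the user. The screen in FIG. 8 has a type column and a concrete-value column. The type column is used to select the kind of word to search for. It contains the major classifications "date/time", "event", "place", and "action" and minor classifications such as "event date" and "deadline". "Unspecified" is the entry for searching without specifying a minor classification. Each minor-classification entry has a check button so that the classification can be specified as a search condition. In the description below, individual check buttons are denoted in the form "date/time - event date" or "action - unspecified".
[0059]
To search for "documents that contain an event date", check "date/time - event date". To search for "documents that contain August 3, 2002 as an event date", check "date/time - event date" and enter "2002/8/3" in the concrete-value field to its right. If no concrete value is needed as a search condition, the concrete-value field may be left blank.
[0060]
To search all dates/times without distinguishing deadlines from event dates, check "date/time - unspecified". To search for "documents that contain August 3, 2002 as a date/time", check "date/time - unspecified" and enter "2002/8/3" in the concrete-value field to its right.
[0061]
Checking multiple types performs an AND search. For example, to search for "documents that contain a seminar whose venue is the XX Dome and whose event date is August 3, 2002", check three boxes as in FIG. 8 and enter concrete values for "date/time - event date" and "place - venue".
[0062]
(Step S200)
The candidate selection unit 10 relaxes the search conditions input from the condition input unit 7 in view of possible extraction errors, so that more documents are hit.
The conditions are relaxed as follows.
(1) When the user has set a concrete value, the search uses the concrete value alone, to cope with a missed or wrong major or minor classification.
(2) When the user has not set a concrete value, the search uses the major classification alone, so that records with no minor classification are also hit, to cope with a missed minor classification.
[0063]
For example, although FIG. 8 specifies the condition "place - venue" = "XX Dome", the relaxed search hits every record in the extraction information storage unit 5 whose external representation is "XX Dome".
[0064]
The search conditions shown in FIG. 8 specify documents for which the extraction information storage unit 5 satisfies the following (Condition 1), (Condition 2), and (Condition 3).
(Condition 1) There is a record whose type is "date/time - event date" and whose internal representation is "2002/8/3".
(Condition 2) There is a record whose type is "event - seminar".
(Condition 3) There is a record whose type is "place - venue" and whose external representation is "XX Dome".
[0065]
In this embodiment, the search conditions are relaxed and the search uses the following conditions.
(Condition 1') There is a record whose internal representation is "2002/8/3".
(Condition 2') There is a record whose major classification is "event".
(Condition 3') There is a record whose external representation is "XX Dome".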
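The relaxation from (Condition 1) to (Condition 3) into their primed forms can be sketched as a single matching predicate. In this Python sketch the `Condition` and `Record` classes are hypothetical, and comparing a concrete value against both the external and the internal representation is a simplifying assumption (the text compares dates against the internal representation and places against the external one).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Condition:
    major: str
    minor: Optional[str] = None
    value: Optional[str] = None   # concrete value, if the user entered one

@dataclass(frozen=True)
class Record:
    doc_no: int
    seq_no: int
    major: str
    minor: Optional[str]
    external: str
    internal: Optional[str] = None

def matches_relaxed(cond, rec):
    """Relaxed match of step S200."""
    if cond.value is not None:
        # Rule (1): a concrete value is given, so ignore the classification.
        return cond.value in (rec.external, rec.internal)
    # Rule (2): no concrete value, so compare the major classification only.
    return rec.major == cond.major

conds = [Condition("date/time", "event date", "2002/8/3"),
         Condition("event", "seminar"),
         Condition("place", "venue", "XX Dome")]
rec = Record(2, 10, "date/time", None, "8月3日", "2002/8/3")
print([matches_relaxed(c, rec) for c in conds])
```

The record above has no minor classification, so it would miss the original (Condition 1) but is still hit by the relaxed (Condition 1').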
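[0066]
(Step S210)
The candidate selection unit 10 searches the extraction information storage unit 5. For example, in FIG. 9(a), the shaded records (sequence numbers 1, 2, 6, 8, 9, 10, and 11) are hit.
[0067]
(Step S220)
The candidate selection unit 10 groups the hit records by document number and takes as processing targets the document numbers that satisfy all of the conditions. For example, in FIG. 9, the data of document number 1 does not satisfy (Condition 3') and is therefore excluded. The data of document number 2 satisfies all of the conditions and is therefore processed.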
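Step S220 amounts to grouping hits by document and keeping the documents for which every condition has at least one hit. A minimal Python sketch follows; the function name and the hit list are illustrative and only loosely modelled on FIG. 9, whose data is not reproduced in the text.

```python
from collections import defaultdict

def documents_satisfying_all(hits, n_conditions):
    """hits: iterable of (doc_no, condition_index) pairs from the relaxed
    search, with condition_index in range(n_conditions)."""
    satisfied = defaultdict(set)
    for doc_no, cond_idx in hits:
        satisfied[doc_no].add(cond_idx)
    return sorted(d for d, conds in satisfied.items()
                  if len(conds) == n_conditions)

# Document 1 has no hit for the third condition; document 2 has hits for all.
hits = [(1, 0), (1, 1), (1, 1), (2, 1), (2, 1), (2, 0), (2, 2)]
print(documents_satisfying_all(hits, 3))
```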
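[0068]
(Step S230)
The fitness calculation unit 11 calculates the fitness to the search conditions for the documents selected in step S220. First, the fitness between each record and the search condition is calculated. This value is set higher the more closely the record's major and minor classifications match the search condition. FIG. 10 shows the fitness values. The calculation differs depending on whether a concrete value is set in the search condition. When a concrete value is set: if the major and minor classifications of the retrieved record both match the search condition, the fitness is 1. If only the major classification matches and the minor classification of the record is blank, the fitness is 0.9. If only the major classification matches (the minor classification differs), the fitness is 0.5. If the major classification of the record differs from that of the search condition, the fitness is 0.1.
[0069]
When no concrete value is set in the search condition: if the major and minor classifications of the retrieved record both match the search condition, the fitness is 1. If only the major classification matches and the minor classification of the record is blank, the fitness is 0.85. If only the major classification matches (the minor classification differs), the fitness is 0.5.
[0070]
For example, sequence number 8 in FIG. 9 matches on (Condition 2'), which is a case with no concrete value. Only the major classification matches the search condition and the minor classification differs, so the fitness is 0.5. Sequence number 10 matches on (Condition 1'), which is a case with a concrete value. The major classification matches and the minor classification is blank, so the fitness is 0.9. Sequence number 11 matches on (Condition 3'), which is a case with a concrete value. The major classification matches and the minor classification is blank, so the fitness is 0.9. These results are shown in the "Fitness from S230" column of FIG. 9(b).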
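The values described for FIG. 10 can be written as a small lookup function. This Python sketch covers only conditions that specify a minor classification; how an "unspecified" minor classification is scored is not stated in the text and is left out. The function name and the minor classification assumed for sequence number 8 are hypothetical.

```python
def type_fitness(cond_major, cond_minor, rec_major, rec_minor, has_value):
    """Fitness of one hit record against one condition (step S230)."""
    if rec_major != cond_major:
        if has_value:
            return 0.1
        # Without a concrete value the relaxed search matches on the major
        # classification, so this case should not arise.
        raise ValueError("major classification mismatch without a concrete value")
    if rec_minor == cond_minor:
        return 1.0
    if rec_minor is None:               # minor classification left blank
        return 0.9 if has_value else 0.85
    return 0.5                          # minor classification set but different

# Sequence numbers 8, 10, and 11 of the worked example.
print(type_fitness("event", "seminar", "event", "conference", False))
print(type_fitness("date/time", "event date", "date/time", None, True))
print(type_fitness("place", "venue", "place", None, True))
```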
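[0071]
(Step S240)
A fitness is also calculated from the relationship between the positions at which the words appear. When there are multiple search conditions, matching records whose sentence positions are closer to one another receive higher scores. If the records have the same sentence position, the fitness is 1. If the sentence positions differ among the records, then for each matching record the minimum difference in sentence position (the minimum distance) α is calculated, and the fitness is calculated as
Fitness = 1 - 0.1α
In FIG. 9, sequence number 9 has its minimum sentence-position difference, 1 (= 3 - 2), with sequence number 10, so its fitness is 0.9 (= 1 - 0.1 × (3 - 2)). These results are shown in the "Fitness from S240" column of FIG. 9(b).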
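The positional fitness can be sketched as follows in Python. Which other records are compared (all hit records, or only those hit by other conditions) is not spelled out in the text, so the list passed as `other_positions` is left to the caller; the function name is hypothetical.

```python
def position_fitness(sentence_pos, other_positions):
    """Step S240: 1 - 0.1 * alpha, where alpha is the smallest sentence-position
    difference to the other matching records."""
    alpha = min(abs(sentence_pos - p) for p in other_positions)
    return 1 - 0.1 * alpha

# A record in sentence 2 whose nearest other match is in sentence 3 (alpha = 1).
print(round(position_fitness(2, [3, 5]), 2))
```

When two records share a sentence, α is 0 and the formula gives 1, which agrees with the special case stated in the step.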
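[0072]
(Step S250)
For the two fitness values calculated in steps S230 and S240, the following is calculated for each record:
β × (fitness from step S230) + (1 - β) × (fitness from step S240)
and the total is then calculated. In FIG. 9, with β = 0.5:
Sequence number 8: 0.5 × 0.5 + 0.5 × 0.7 = 0.6
Sequence number 9: 0.5 × 0.5 + 0.5 × 0.9 = 0.7
Sequence number 10: 0.5 × 0.9 + 0.5 × 0.9 = 0.9
Sequence number 11: 0.5 × 0.9 + 0.5 × 0.8 = 0.85
Sequence numbers 8 and 9 match the same (Condition 2'), so only sequence number 9, which has the higher fitness, is included in the total. As a result, the fitness of document number 2 is 2.45 (= 0.7 + 0.9 + 0.85). With the search conditions of FIG. 8 there are three conditions, so the maximum fitness is 3; the document therefore matches the conditions relatively well.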
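The arithmetic of this step can be checked with a few lines of Python. The function names and the condition labels are hypothetical; the fitness values are the ones quoted above for sequence numbers 8 to 11.

```python
def combine(type_fit, pos_fit, beta=0.5):
    """Weighted sum of the S230 and S240 fitness values for one record."""
    return beta * type_fit + (1 - beta) * pos_fit

def document_fitness(records, beta=0.5):
    """records: (condition_label, S230 fitness, S240 fitness) triples.
    Only the best record per condition contributes to the total."""
    best = {}
    for cond, type_fit, pos_fit in records:
        score = combine(type_fit, pos_fit, beta)
        best[cond] = max(best.get(cond, 0.0), score)
    return sum(best.values())

records = [("2'", 0.5, 0.7),   # sequence number 8
           ("2'", 0.5, 0.9),   # sequence number 9
           ("1'", 0.9, 0.9),   # sequence number 10
           ("3'", 0.9, 0.8)]   # sequence number 11
print(round(document_fitness(records), 2))  # 2.45
```

Keeping only the best record per condition is what bounds the total by the number of conditions (3 in this example).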
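[0073]
(Step S260)
The candidate sorting unit 12 sorts the documents using the per-document fitness calculated in step S250.
[0074]
(Step S270)
The display unit 8 outputs the search results. An output example is shown in FIG. 11.
[0075]
(Effects of the first embodiment)
As described above, according to this embodiment, the apparatus is configured without field-specific extraction rules, which eliminates the need to determine in advance what field a document belongs to.
[0076]
In addition, at extraction time, words representing dates, places, and the like in a document are not stored in table form; the position of each word is stored instead. At search time, the fitness to the search condition is calculated using the type of each word, the positional relationships between words, and so on. Presenting the results sorted in descending order of fitness lets the user judge whether the search results can be trusted.
[0077]
Further, the processing that extracts word types distinguishes whether a date/time is a deadline or an event date. This lets the user, for example, search for event dates and deadlines separately. At the same time, relaxing the search condition from the one the user set, so that candidate documents are retrieved regardless of the word type the user specified, prevents search omissions even when the extraction result is wrong.
[0078]
In this embodiment, the processing that determines word types from a document and registers them in the extraction information storage unit 5 (registration processing) is performed by the document input unit 1, extraction unit 2, extraction rule storage unit 3, and document information registration unit 4. The processing that searches for documents (search processing) is performed by the condition input unit 7, display unit 8, and search unit 9. When the extraction information storage unit 5 and the document storage unit 6 have been created in advance, the configuration of the information search apparatus 30 shown in FIG. 12 may be adopted. That is, the apparatus may consist only of the components that perform the search processing (the condition input unit 7, display unit 8, and search unit 9) together with the extraction information storage unit 5 and the document storage unit 6.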
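[0079]
(Second Embodiment)
The first embodiment described the search of general documents. This embodiment describes a case in which e-mail is received via a network such as the Internet, the received e-mail is registered sequentially, and searches are performed at the user's request.
[0080]
FIG. 13 is an explanatory diagram showing the system configuration of the information search system 40 according to this embodiment. Compared with the configuration of the first embodiment, a communication unit 13, a communication unit 14, a server 100, a client 110, and a network 120 are added. The server 100 and the client 110 are connected to the network 120.
[0081]
(Server 100)
As shown in FIG. 13, the server 100 includes the document input unit 1, extraction unit 2, extraction rule storage unit 3, document information registration unit 4, extraction information storage unit 5, document storage unit 6, search unit 9, and communication unit 13. The search unit 9 includes the candidate selection unit 10, fitness calculation unit 11, and candidate sorting unit 12. The server 100 registers mail information for e-mail received through the network 120. It also receives search requests from users sent by the client 110 and sends back the search results. The communication unit 13 sends and receives this information through the network 120. The components other than the communication unit 13 are substantially the same as in the first embodiment, and duplicate description is omitted.
[0082]
(Client 110)
As shown in FIG. 13, the client 110 includes the condition input unit 7, display unit 8, and communication unit 14. The client 110 is a computer such as a personal computer, or a portable terminal such as a mobile phone. It accepts the user's search conditions, connects to the network 120 through the communication unit 14, queries the server 100, and outputs the search results. The components other than the communication unit 14 are substantially the same as in the first embodiment, and duplicate description is omitted.
[0083]
(Network 120)
The network 120 may use any protocol, topology, and type of transmission medium. A typical example of the network 120 is the Internet.
[0084]
The registration of document information is executed when the communication unit 13 receives an e-mail. As in the first embodiment, the processing from step S100 onward is executed on the received e-mail, and the information of the e-mail is registered in the extraction information storage unit 5. The search of document information is executed when the user enters input on, for example, the input screen of FIG. 8. The input screen of FIG. 8 is displayed using, for example, CGI (Common Gateway Interface).
[0085]
FIG. 14 is a flowchart showing the operation of the information search system 40 according to this embodiment. The description below refers to FIG. 14.
(Step S300)
In the client 110, the condition input unit 7 passes the entered search conditions to the communication unit 14.
[0086]
(Step S310)
The communication unit 14 sends the entered search conditions to the server 100 via the network 120.
[0087]
(Step S320)
The server 100 receives the search conditions through the communication unit 13 and executes the processing from step S200 onward of the first embodiment.
[0088]
(Step S330)
In the server 100, the communication unit 13 sends the search results to the client 110 via the network 120.
[0089]
(Step S340)
The display unit 8 outputs the search results.
[0090]
(Effects of the second embodiment)
As described above, this embodiment can also be applied to a case in which e-mail is received via a network such as the Internet, the received e-mail is registered sequentially, and searches are performed at the user's request.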
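[0091]
(Third Embodiment)
In the first embodiment, only words enclosed in tags are registered in the extraction information storage unit 5. Words that should be registered as an event or a place may therefore go unregistered because of, for example, deficiencies in the extraction rules. To deal with such extraction omissions, this embodiment also registers words to which no specific major or minor classification can be assigned, such as nouns appearing in the document, in addition to the words enclosed in tags.
[0092]
As shown in FIG. 15, the information search apparatus 50 according to this embodiment is configured by replacing the extraction information storage unit 5 of the information search apparatus 20 according to the first embodiment with a word information storage unit 15. In addition to the words enclosed in tags, the word information storage unit 15 also registers words to which no specific major or minor classification can be assigned, such as nouns appearing in the document. The other components are substantially the same as those of the information search apparatus 20, and duplicate description is omitted.
[0093]
The operation of the information search apparatus 50 according to this embodiment is described next.
To register the nouns that appear in a document in the word information storage unit 15, the result of the morphological analysis in step S1030 is used. The changes relative to FIG. 6 are shown in FIG. 16.
[0094]
First, the processing of step S1050 of the first embodiment is changed as follows.
(Step S1200)
If the target morpheme is "。" (full stop), go to step S1060. If it is a tag other than an <HD> tag (heading tag), find its closing tag and go to step S1070. If the morpheme is a noun, go to step S1210. Otherwise, return to step S1040. For example, "山田 (noun)" on line 3 of FIG. 3(b) and "以下 (noun)" and "日程 (noun)" on line 5 are registered.
[0095]
Step S1210 is as follows.
(Step S1210)
Register the data in the word information storage unit 15 with the major classification "word". After registration, return to step S1040.
[0096]
In addition, the registration destination in step S1100 of the first embodiment is changed to the word information storage unit 15. As a result, the data in the word information storage unit 15 for the document of FIG. 3(b) is as shown in FIG. 17. As FIG. 17 shows, "word" is added as a major classification, and the external representations "皆様" (everyone), "山田" (Yamada), "以下" (below), "日程" (schedule), "全員" (all members), "参加" (participation), and "方" (person) are registered in the word information storage unit 15 as nouns.
[0097]
(Effects of the third embodiment)
As described above, according to this embodiment, even when a word cannot be extracted as an event or a place because of, for example, a deficiency in the extraction rules, it is registered as a noun in the word information storage unit 15, so that search omissions can be eliminated.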
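[0098]
Preferred embodiments of the information search apparatus and information search system according to the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes and modifications within the scope of the technical idea set forth in the claims, and these are naturally understood to fall within the technical scope of the invention.
[0099]
For example, the following applications are conceivable.
(1) The fitness values of FIG. 10 may be set according to the major classification. The calculation methods of steps S240 and S250 are also not limited to those described.
(2) In step S210 of FIG. 7, the search may use a partial match on the concrete value. A fitness value for a partially matching concrete value may also be added to the "search condition has a concrete value" part of the table in FIG. 10.
(3) The major and minor classifications are not limited to those given in the embodiments. For example, major classifications such as person names and organization names are also possible.
(4) The candidate sorting unit 12 does not necessarily have to sort by fitness. The results may be sorted by date, starting from the date closest to the present, with the fitness shown in the list.
(5) Voice input may be used instead of an input screen such as FIG. 8.
(6) In step S1020 of FIG. 6, the sentence number is incremented by 1 for each line, so the sentence number is not calculated correctly when a line break falls in the middle of a sentence. As preprocessing before step S100, the text may therefore be processed so that no line break falls within a sentence.
(7) For the date/time condition of FIG. 8, the search for candidate documents may be widened to a range of one week before and after the specified date/time. The input method of FIG. 8 may also allow a period to be entered.
(8) The document storage unit 6 does not have to store the documents themselves; it may store, for example, information indicating where a document is kept (for example, a URL).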
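As an illustration of variation (7), the following Python sketch tests whether a stored date falls within one week before or after the date in the search condition. The "YYYY/M/D" string format follows the internal representation used in the examples; the function names and the inclusive boundary are assumptions.

```python
from datetime import date

def parse_ymd(text):
    """Parse an internal representation such as "2002/8/3"."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)

def in_window(candidate, target, days=7):
    """True when candidate lies within `days` days before or after target."""
    return abs((candidate - target).days) <= days

target = parse_ymd("2002/8/3")
print(in_window(parse_ymd("2002/8/9"), target))    # True (6 days after)
print(in_window(parse_ymd("2002/8/12"), target))   # False (9 days after)
```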
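[0100]
[Effects of the Invention]
As described above, according to the present invention, the apparatus is configured without field-specific extraction rules, which eliminates the need to determine in advance what field a document belongs to.
[0101]
In addition, at extraction time, words representing dates, places, and the like in a document are not stored in table form; the position of each word is stored instead. At search time, the fitness to the search condition is calculated using the type of each word, the positional relationships between words, and so on. Presenting the results sorted in descending order of fitness lets the user judge whether the search results can be trusted.
[0102]
Further, the processing that extracts word types distinguishes whether a date/time is a deadline or an event date. This lets the user, for example, search for event dates and deadlines separately. At the same time, relaxing the search condition from the one the user set, so that candidate documents are retrieved regardless of the word type the user specified, prevents search omissions even when the extraction result is wrong.
[0103]
The invention can also be applied to a case in which e-mail is received via a network such as the Internet, the received e-mail is registered sequentially, and searches are performed at the user's request.
[0104]
Further, even when a word cannot be extracted as an event or a place because of, for example, a deficiency in the extraction rules, it is registered as a noun in the word information storage unit, so that search omissions can be eliminated.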
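[Brief Description of the Drawings]
FIG. 1 is an explanatory diagram showing the configuration of the information search apparatus according to the first embodiment.
FIG. 2 is an explanatory diagram showing the data in the extraction rule storage unit.
FIG. 3 is an explanatory diagram showing an example of the extraction result for a document.
FIG. 4 is an explanatory diagram showing the data in the extraction information storage unit.
FIG. 5 is a flowchart showing the registration processing in the first embodiment.
FIG. 6 is a flowchart showing the details of step S130 in FIG. 5.
FIG. 7 is a flowchart showing the search processing in the first embodiment.
FIG. 8 is an explanatory diagram showing an example of the search condition input screen.
FIG. 9 is an explanatory diagram showing the data in the extraction information storage unit.
FIG. 10 is an explanatory diagram showing the fitness values.
FIG. 11 is an explanatory diagram showing an example of the search result output.
FIG. 12 is an explanatory diagram showing another configuration of the information search apparatus according to the first embodiment.
FIG. 13 is an explanatory diagram showing the configuration of the information search system according to the second embodiment.
FIG. 14 is a flowchart showing the operation of the information search system in the second embodiment.
FIG. 15 is an explanatory diagram showing the configuration of the information search apparatus according to the third embodiment.
FIG. 16 is a flowchart showing the details of step S130 of FIG. 5 in the third embodiment.
FIG. 17 is an explanatory diagram showing the data in the word information storage unit.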
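[Explanation of Reference Numerals]
1 Document input unit
2 Extraction unit
3 Extraction rule storage unit
4 Document information registration unit
5 Extraction information storage unit
6 Document storage unit
7 Condition input unit
8 Display unit
9 Search unit
10 Candidate selection unit
11 Fitness calculation unit
12 Candidate sorting unit
13 Communication unit
14 Communication unit
15 Word information storage unit
20 Information search apparatus
30 Information search apparatus
40 Information search system
50 Information search apparatus
100 Server
110 Client
120 Network
[0001]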
[Technical Field of the Invention]
The present invention relates to an information search apparatus and an information search system in a document management system that manages documents (electronic documents) such as e-mail.
[0002]
[Prior Art]
Document management systems that manage documents such as e-mail have conventionally provided functions such as full-text search of the message body. It is also possible to extract from a document expressions such as dates, times, and places, as well as actions such as a request to attend a conference, and then to search under various conditions using the types of words in the document, for example for documents likely to contain information on "an event whose venue is the XX Dome" or "a reply request whose deadline is within this week".
[0003]
Further, to improve user convenience, Japanese Patent Laid-Open No. 10-69472 proposes, for e-mail, functions such as extracting from the dates and times in the body the deadline by which an action must be taken, and sorting messages in order of the nearest deadline.
[0004]
In addition, Sato et al., "Automatic Digest Generation of Electronic News" (Journal of Information Processing Society of Japan, Vol. 36, No. 10, pp. 2371-2379), extract conference dates, venues, paper submission deadlines, and the like from electronic news. These techniques are effective for extracting only predetermined information.
[0005]
On the other hand, Japanese Patent Laid-Open No. 9-269940 proposes a function that extracts information such as dates and times from the body of an e-mail and, using a date or the like as a search key, searches for messages whose body contains that date. Places, matters, and the like are also extracted. These functions make it possible to cut out and display the dates, times, places, matters, and so on in a message body, and to sort and display messages according to whether the date and time in the message is before or after the present. However, the search does not go as far as identifying, for example, whether a date represents an event date or a deadline.
[0006]
[Patent Document 1]
Japanese Patent Laid-Open No. 10-69472
[Patent Document 2]
Japanese Patent Laid-Open No. 9-269940
[Non-Patent Document 1]
“Automatic digest generation of electronic news” (Journal of Information Processing Society of Japan, Vol.36, No.10 pp.2371-2379)
[0007]
[Problems to be solved by the invention]
In practical document search, it must be possible to search under a variety of conditions: for example, searching for "documents that contain an event date"; specifying a concrete date and searching for "documents that contain August 3, 2002 as an event date"; or, without distinguishing event dates from other dates, searching for "documents that contain August 3, 2002 as a date/time". To enable searches under such conditions, processing that identifies the type of each word is needed not only for documents in a specific field but for documents of many kinds.
[0008]
Searches must also work reasonably well even when a word type is wrong because of an extraction error. For example, if something that should have been extracted as an "event date" could only be extracted as a plain "date/time", an exact search fails; even in such a case the search must not miss the document. No combination of the conventional techniques achieves these goals effectively.
[0009]
This is because of the following problems.
[Problem 1] The method of Sato et al., "Automatic Digest Generation of Electronic News", is effective when the field is limited, for example to news about conferences. Applying it to documents in various fields requires preparing many kinds of rules, such as rules for extracting information about conferences and rules for extracting the deadlines of reply requests. It also requires, as preprocessing, a step that determines whether a received e-mail concerns a conference. However, a single e-mail may contain various kinds of information, so this selection does not always succeed.
[0010]
[Problem 2] When specific information is extracted for a limited field, as in Sato et al., the extraction results are generally put into a table. In that case, however, the information on how reliable each extraction result is gets lost. The user therefore cannot judge whether the search results can be trusted, which can make the system hard to use.
[0011]
[Problem 3] Japanese Patent Laid-Open No. 9-269940 extracts information such as dates and times from the body of an e-mail but does not distinguish whether a date is a deadline or an event date. Search conditions such as searching by event date therefore cannot be set. Conversely, if extraction goes as far as determining whether a date is a deadline or an event date, extraction errors may occur, and these can cause documents to be missed.
[0012]
The present invention has been made in view of the above problems of conventional information search apparatuses and information search systems, and its object is to provide a new and improved information search apparatus and information search system capable of solving those problems.
[0013]
[Means for Solving the Problems]
To solve the above problems, a first aspect of the present invention provides an information search apparatus that searches for documents. As set forth in claim 1, the information search apparatus of the invention comprises: a document storage unit (6) that stores documents; an extraction information storage unit (5) that stores, as information for searching documents, character strings in the documents by classification; a condition input unit (7) to which a character string serving as a search condition is input; and a search unit (9) that searches the document storage unit for documents on the basis of the character string input as the search condition. The search unit (9) comprises a candidate selection unit (10) that selects documents after relaxing the search condition so that documents are searched regardless of the classification of the input character string, and a fitness calculation unit (11) that calculates the fitness between each selected document and the character string serving as the search condition.
[0014]
The apparatus may further comprise: a document input unit (1) to which documents to be stored in the document storage unit are input; an extraction rule storage unit (3) that stores extraction rules for extracting character strings from an input document; an extraction unit (2) that extracts character strings from the document input through the document input unit by matching it against the extraction rules stored in the extraction rule storage unit; and a document information registration unit (4) that determines the classification of each character string extracted by the extraction unit and registers the character string in the extraction information storage unit.
[0015]
In this case, as set forth in claim 6, the classification of character strings in the extraction information storage unit (5) may consist of a major classification determined by the extraction rules stored in the extraction rule storage unit (3) and a minor classification that subdivides the major classification. Further, as set forth in claim 7, the major classification may additionally include a classification for registering nouns in the documents stored in the extraction information storage unit (5), independently of the extraction rules stored in the extraction rule storage unit (3). Even when a word cannot be extracted as an event or a place because of, for example, a deficiency in the extraction rules, it is registered as a noun in the word information storage unit, so that search omissions can be eliminated.
[0016]
Such an information search apparatus can solve [Problem 1], [Problem 2], and [Problem 3] above, as explained below.
[0017]
The extraction information storage unit (5) stores character strings in a document under classifications such as "event", "action", "place", and "date/time", and the apparatus has no extraction rules specific to a particular field such as news about conferences. This makes it unnecessary to determine in advance what field a document belongs to, which resolves [Problem 1]. Furthermore, because the extraction information storage unit (5) stores character strings by classification (major and minor), detailed searches are possible. For example, when searching by date/time, classifying each date/time in advance as a "deadline" or an "event date" allows the two to be distinguished in a search.
[0018]
The candidate selection unit (10) selects documents after relaxing the search condition so that documents are searched regardless of the classification of the character string input as the search condition. The purpose is to increase the number of selectable documents in view of possible extraction errors. Specifically, the search condition can be relaxed as follows.
(1) When the character string input as the search condition is a concrete value, relax the search condition so that only the concrete value has to match, and select documents.
(2) When the character string input as the search condition is a concrete value, relax the search condition so that only the concrete value has to match; when it is not a concrete value, relax the search condition so that only the major classification has to match; then select documents.
[0019]
Selecting documents under a relaxed search condition in this way prevents search omissions even when the extraction result produced by the extraction rules is wrong. This resolves [Problem 3].
[0020]
The fitness calculation unit (11) then calculates the fitness of the documents selected by the candidate selection unit as described above. Specifically, the fitness can be calculated as follows.
(1) The fitness between a selected document and the character string serving as the search condition is calculated using the classifications of the character strings in the selected document.
(2) The fitness between a selected document and the character string serving as the search condition is calculated using the positional relationships of the character strings in the selected document.
[0021]
In particular, as in (2), when information for searching documents is extracted, words representing dates, places, and the like in a document are not stored in table form; instead, the position of each word is stored. At search time, the fitness to the search condition is calculated using the type of each word and the positional relationships between words. Presenting the results to the user sorted in descending order of fitness lets the user judge whether the search results can be trusted. This resolves [Problem 2].
[0022]
  In order to solve the above-mentioned problem, according to the second aspect of the present invention,,An information search system is provided that includes a server unit (100) and a terminal unit (110) connected via a network (120) and searches for a document.
[0023]
In the information search system of the present invention, the server unit (100) includes a communication unit (13) for transmitting / receiving search conditions and search results, a document storage unit (6) for storing documents, and information for searching for documents. And an extraction information storage unit (5) for storing character strings in the document for each classification, and a search unit (9) for searching the document from the document storage unit based on the character string input as a search condition. It is characterized by that.
[0024]
In the information search system of the present invention, the terminal unit (110) includes a communication unit (14) for transmitting and receiving search conditions and search results, and a condition input unit (7) for inputting a character string as a search condition. The search unit (9) includes a candidate selection unit (10) that selects documents by relaxing the search condition so that documents are retrieved regardless of the classification of the character string input as the search condition, and a fitness level calculation unit (11) that calculates the fitness level between each selected document and the character string given as the search condition.
[0025]
According to such an information search system, substantially the same effects as those of the information search device according to the first aspect of the present invention are obtained. In addition, the present invention can be applied, for example, to a case where e-mails are received via a network such as the Internet, the received e-mails are registered one after another, and they are searched in response to user requests. Further, the function of the server unit (100) of the present invention can be provided as a so-called ASP (Application Service Provider) service and developed as a business.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
Exemplary embodiments of an information search apparatus and an information search system according to the present invention will be described below in detail with reference to the accompanying drawings. In the present specification and drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.
[0027]
(First embodiment)
In the present embodiment, an information processing apparatus capable of registering a date and time, a place, an event, and an action included in a document and searching using the registered information will be described. In particular, the following approaches are taken to solve the above three problems.
[0028]
For [Problem 1], the configuration does not prepare extraction rules for each field in advance. This eliminates the need to determine beforehand which field a document belongs to.
[0029]
For [Problem 2], at the time of word extraction, words representing the date and time, place, etc. in the document are not stored in a table format; instead, each word is stored together with its position information. At retrieval time, the degree of matching with the search condition is calculated using the type of each word and the positional relationship between the words. By displaying the fitness to the user, or by presenting the results sorted by fitness, the user can judge how far each search result can be trusted.
[0030]
For [Problem 3], the process of extracting word types distinguishes whether a date and time is a deadline or the date of an event. This allows the user, for example, to search while distinguishing the event date from the deadline. On the other hand, because the extraction result may be wrong, candidate documents are searched with conditions relaxed from those set by the user, regardless of the word type the user specified as a search condition, so that no documents are missed.
[0031]
FIG. 1 is an explanatory diagram showing the system configuration of the information search device 20 according to the present embodiment. As shown in FIG. 1, the information search device 20 includes a document input unit 1, an extraction unit 2, an extraction rule storage unit 3, a document information registration unit 4, an extraction information storage unit 5, a document storage unit 6, a condition input unit 7, a display unit 8, and a search unit 9. The search unit 9 includes a candidate selection unit 10, a fitness calculation unit 11, and a candidate alignment unit 12.
[0032]
Among the constituent elements of the information search device 20, the document input unit 1, the extraction unit 2, the extraction rule storage unit 3, the document information registration unit 4, the extraction information storage unit 5, and the document storage unit 6 are the components for registering document information. The condition input unit 7, the display unit 8, and the search unit 9 are the components for searching for document information. Each component is described in detail below.
[0033]
(Document input unit 1)
The document input unit 1 receives a document to be registered.
[0034]
(Extraction unit 2)
The extraction unit 2 extracts information from the document input from the document input unit 1 by matching it against the extraction rules stored in the extraction rule storage unit 3. Specifically, when the input document is an e-mail, the date and time, event, place, action, etc. included in the subject or body are extracted. These categories (date and time, event, place, action, etc.) are referred to as the “major classification”. In addition, for an event, for example, information that subdivides it, such as whether the event is a seminar or a conference, is also extracted. This subdivided information is referred to as the “minor classification”.
[0035]
(Extraction rule storage unit 3)
The extraction rule storage unit 3 stores an extraction rule for extracting predetermined information from the document input from the document input unit. FIG. 2 is an explanatory diagram showing an example of the extraction rules stored in the extraction rule storage unit 3. For example, rule A-1 in FIG. 2A is an extraction rule for “event-conference”, and matches a character string such as “9th Environmental Forum”. Rule A-2 is an extraction rule in which the major classification is an event and the minor classification is not specified, and character strings such as “Gion Festival” and “Summer Festival” match. The same applies to the rules A-3 and A-4.
[0036]
Further, as shown in FIG. 2B, the extraction rule storage unit 3 also stores extraction rules for headwords. An extraction rule for headwords is a rule that extracts, for example, the headword “Date:” from “Date: Thursday, August 6, 2002” on the 8th line of the document in FIG. For example, rule B-1 is a rule for extracting a headword of “date and time-date of event”, and character strings such as “date-time:” and “date-date:” match it.
[0037]
The extraction rule storage unit 3 also stores date and time classification rules using headwords, as shown in FIG. Rules C-1 to C-3 are date and time classification rules using headwords. For example, if, on the 8th line of the document in FIG. 3A, a <DATE> tag is set for the date “August 6, 2002 (Thursday)” and an <HD DATE OPEN> tag is set for the headword “Date:”, the line becomes as follows.
<HD DATE OPEN>Date:</HD><DATE>August 6, 2002 (Thursday)</DATE>
Here, since “August 6, 2002 (Thursday)” follows the headword “Date:”, it is considered to be the date of the event. Rule C-1 is therefore applied, and this date is classified as the date of the event.
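The headword-based classification just described can be illustrated with a minimal Python sketch. It is not the rule format of FIG. 2: the regular expression, the function name `classify_date_by_headword`, and the `<HD DATE LIMIT>` headword tag for deadlines are assumptions made only for this illustration.

```python
import re

# Tolerates the stray spaces that can appear inside tags in the example line.
TAGGED_DATE = re.compile(
    r"<\s*HD DATE (?P<kind>OPEN|LIMIT)\s*>(?P<headword>.*?)<\s*/\s*HD\s*>"
    r"\s*<\s*DATE\s*>(?P<date>.*?)<\s*/\s*DATE\s*>"
)

# Hypothetical mapping from headword tag kind to minor classification.
MINOR_BY_KIND = {"OPEN": "date of event", "LIMIT": "deadline"}


def classify_date_by_headword(line):
    """Return (date text, minor classification), or None if no headword precedes a date."""
    m = TAGGED_DATE.search(line)
    if m is None:
        return None
    return m.group("date").strip(), MINOR_BY_KIND[m.group("kind")]


example = "<HD DATE OPEN>Date:</HD><DATE>August 6, 2002 (Thursday)</DATE>"
result = classify_date_by_headword(example)
```

For the example line, the sketch classifies the date as the date of the event, which corresponds to the outcome of applying rule C-1.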
[0038]
The extraction rules stored in the extraction rule storage unit 3 described above are not limited to the example shown in FIG. For example, the regular expression described in FIG. 10 of Japanese Patent Laid-Open No. 09-269940 and the format described in FIG.
[0039]
The result of extracting the document of FIG. 3A using the rules of FIG. 2 is as shown in FIG. In FIG. 3B, “date and time-date of event” is enclosed in <DATE OPEN> tags and “date and time-deadline” in <DATE LIMIT> tags. Likewise, “event-conference” is enclosed in <EVT CONFERENCE> tags, “action-participation request” in <ACT JOIN> tags, “action-response request” in <ACT ASK> tags, and “place” in <POS> tags.
[0040]
(Document Information Registration Unit 4)
When a date and time in a character string extracted by the extraction unit 2 is a relative time such as “today” or “tomorrow”, the document information registration unit 4 converts it to an absolute time. It also performs morphological analysis. For example, based on the first verb of the sentence that includes “first conference room”, it determines that the place on line 9 in FIG. is a venue. The analysis results for the document are then registered in the extracted information storage unit 5.
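As a rough illustration of the relative-to-absolute conversion, the following Python sketch resolves “today” and “tomorrow” against a reference date. It is a simple stand-in, not the method of JP-A-10-69472 that the embodiment uses. The vocabulary table, the function name `to_absolute`, and the choice of reference date (for example, the date the mail was received) are assumptions.

```python
from datetime import date, timedelta

# Hypothetical vocabulary of relative expressions and their day offsets.
RELATIVE_OFFSETS = {"today": 0, "tomorrow": 1}


def to_absolute(expression, reference):
    """Convert a relative expression to a 'YYYY/M/D' string, or None if unknown."""
    offset = RELATIVE_OFFSETS.get(expression.lower())
    if offset is None:
        return None
    d = reference + timedelta(days=offset)
    return f"{d.year}/{d.month}/{d.day}"


internal = to_absolute("today", date(2002, 8, 3))
```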
[0041]
(Extracted information storage unit 5)
The extracted information storage unit 5 stores words appearing in the document as a result of document analysis, together with classifications such as date and time. FIG. 4 shows an example of data in the extracted information storage unit 5. FIG. 4 shows an example of classification of the extraction results of FIG. The extracted information storage unit 5 includes document number items, serial number items, type items, external representation items, internal representation items, line position items, and sentence position items.
[0042]
The document number item is an identification number assigned to each document. The serial number item is a number for identifying each extracted word. The type item is the classification of the extracted word and is divided into a major classification and a minor classification. The external representation item holds the expression of the extracted word as it appears in the document. The internal representation item holds the converted value when a relative time such as “today” has been converted into an absolute time such as “2002/8/3”. The line position item holds the number of the line on which the word appears in the document. The sentence position item holds the number of the sentence in which the word appears. However, when the target document is mail, −1 is set as the line position and sentence position of a word in the subject.
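The items of the extracted information storage unit 5 could be modeled as one record type, as in the Python sketch below. The class name, the field names, and the sample values are hypothetical and are not taken from FIG. 4. Only the list of items and the use of −1 for subject words come from the description.

```python
from dataclasses import dataclass
from typing import Optional

SUBJECT = -1  # line and sentence position used for words in a mail subject


@dataclass
class ExtractedRecord:
    document_number: int         # identifies the document
    serial_number: int           # identifies the extracted word
    major: str                   # major classification, e.g. "date and time"
    minor: Optional[str]         # minor classification, None when blank
    external: str                # expression as it appears in the document
    internal: Optional[str] = None    # converted value, e.g. "2002/8/3"
    line_position: int = SUBJECT
    sentence_position: int = SUBJECT


# Hypothetical sample: a relative date in the body and an event name in the subject.
body_word = ExtractedRecord(2, 10, "date and time", None, "today", "2002/8/3", 4, 3)
subject_word = ExtractedRecord(2, 7, "event", "seminar", "XX seminar")
```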
[0043]
(Document storage unit 6)
The document storage unit 6 stores the document itself with a document number. The document number is associated with the document number item in the extraction information storage unit 5.
[0044]
(Condition input unit 7)
The condition input unit 7 receives operations such as input of search conditions from the user.
[0045]
(Display unit 8)
The display unit 8 displays an input screen and outputs search results.
[0046]
(Search unit 9)
The search unit 9 performs a search process on the search condition input from the condition input unit 7. The search unit 9 includes a candidate selection unit 10, a fitness calculation unit 11, and a candidate alignment unit 12.
[0047]
(Candidate selection unit 10)
The candidate selection unit 10 searches for documents that match the user's search conditions and treats them as candidate documents. At this time, in consideration of errors in the extraction results of the extraction unit 2 and the document information registration unit 4, the search condition is relaxed so that more documents are hit than with the search condition set by the user.
[0048]
(Fitness calculation unit 11)
For the documents listed by the candidate selection unit 10, the fitness calculation unit 11 calculates the fitness with the search condition using the type item, line position item, and sentence position item of the extracted information storage unit 5. A high fitness is given to hit records whose major and minor classifications match those of the search condition, and to hit records whose line positions or sentence positions are close to one another.
[0049]
(Candidate alignment unit 12)
The candidate alignment unit 12 sorts the candidate documents in descending order of fitness.
[0050]
The information search device 20 according to the present embodiment is configured as described above. Next, the operation of the information search device 20 will be described. First, document information registration processing will be described, and then document information search processing will be described. FIG. 5 is a flowchart showing the registration processing operation of the information search apparatus 20. Here, as an example, the document registration process shown in FIG.
[0051]
<Registration process>
(Step S100)
The extraction unit 2 performs an extraction process on the document input from the document input unit 1. The extraction rule shown in FIG. 2 is used. The result shown in FIG. 3B is obtained for the mail in FIG.
(Step S110)
The document information registration unit 4 issues a document number as preprocessing.
(Step S120)
The document information registration unit 4 registers the document input from the document input unit 1 in the document storage unit 6.
(Step S130)
The document information registration unit 4 performs the process shown in the flowchart of FIG. 6 on the document obtained in step S100. Hereinafter, a description will be given with reference to the flowchart of FIG.
[0052]
(Step S1000)
Set the line number to 0. Set the sentence number to 0.
(Step S1010)
The processing from step S1020 to step S1100 is executed for all the lines of the document. When all the lines have been processed, the process ends. If there is an unprocessed line, the process goes to step S1020.
(Step S1020)
Unprocessed lines are processed and the line number is incremented by one. Increase the sentence number by one.
(Step S1030)
Divide the line to be processed into morphemes. For example, the ninth line in FIG. 3B is divided into “<POS>”, “first conference room”, “</POS>”, “de”, “do (verb)”, “mas”, and “.”. However, a range enclosed in tags (for example, “first conference room”) is treated as a single morpheme.
(Step S1040)
The processing from step S1050 to step S1100 is executed for all morphemes. When all morphemes have been processed, the process returns to step S1010.
[0053]
(Step S1050)
If the morpheme to be processed is “.” (a full stop), the process goes to step S1060. If it is a tag other than an <HD> tag (headword tag), the corresponding end tag is located and the process goes to step S1070. Otherwise, the process returns to step S1040. For example, on the ninth line of FIG. 3B, when “<POS>” is found, the end tag “</POS>” is located and the process goes to step S1070. For “de”, “do”, and “mas”, which are neither tags nor punctuation marks, nothing is done and the process returns to step S1040. For “.”, the process goes to step S1060.
[0054]
(Step S1060)
Increase the sentence number by one. The process returns to step S1040.
(Step S1070)
Check whether a minor classification is set for the tag. If not, go to step S1080 to determine the minor classification. If it is set, go to step S1090.
[0055]
(Step S1080)
Determine the minor classification by checking which sa-variable nouns (verbal nouns) or verbs appear among the morphemes following the tag. If the tag is a <DATE> tag and the subsequent morphemes include “hold (sa-variable noun)”, “do (verb)”, or “open (verb)”, the minor classification is determined to be the date of the event. If they include “deadline”, it is determined to be the deadline. If the tag is a <POS> tag and the subsequent morphemes include “hold (sa-variable noun)”, “do (verb)”, or “open (verb)”, the minor classification is determined to be the venue. The process then goes to step S1090. For example, on the 9th line in FIG. 3B, “do (verb)” follows “<POS>” to “</POS>”, so the range from “<POS>” to “</POS>” is determined to be the venue.
[0056]
(Step S1090)
If the tag is a <DATE> tag, calculate the absolute time. As a method, for example, the method disclosed in JP-A-10-69472 is used. Go to step S1100.
(Step S1100)
The data is registered in the extracted information storage unit 5. After registration, the process returns to step S1040.
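Steps S1000 to S1100 can be sketched in Python as follows, assuming that morphological analysis has already been done and that each line arrives as a list of morphemes. The data layout (a tagged range as a `(tag, minor, text)` tuple), the English cue words, and the function names are hypothetical. The absolute-time conversion of step S1090 is omitted, and every tuple is treated as a tag other than an <HD> tag.

```python
EVENT_CUES = {"hold", "do", "open"}  # English glosses of the cue morphemes in step S1080
DEADLINE_CUES = {"deadline"}


def decide_minor(tag, following):
    """Step S1080: choose a minor classification from the morphemes after the tag."""
    words = set(following)
    if tag == "DATE" and words & EVENT_CUES:
        return "date of event"
    if tag == "DATE" and words & DEADLINE_CUES:
        return "deadline"
    if tag == "POS" and words & EVENT_CUES:
        return "venue"
    return None


def register(lines):
    """lines: list of morpheme lists; a tagged range is a (tag, minor, text) tuple."""
    records = []
    line_no = sentence_no = 0                        # S1000
    for morphemes in lines:                          # S1010
        line_no += 1                                 # S1020
        sentence_no += 1
        for i, m in enumerate(morphemes):            # S1040
            if m == ".":                             # S1050 -> S1060
                sentence_no += 1
            elif isinstance(m, tuple):               # S1050 -> S1070
                tag, minor, text = m
                if minor is None:                    # S1080
                    rest = [x for x in morphemes[i + 1:] if isinstance(x, str)]
                    minor = decide_minor(tag, rest)
                records.append({"tag": tag, "minor": minor, "external": text,
                                "line": line_no, "sentence": sentence_no})  # S1100
    return records


sample = [
    [("POS", None, "first conference room"), "de", "do", "mas", "."],
    [("DATE", "deadline", "August 1")],
]
registered = register(sample)
```

In this sample, the first line mirrors the ninth line of FIG. 3B, and the place is classified as a venue because “do” follows the tagged range. The sentence number follows the flowchart as described: it increases once per line and once per full stop.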
[0057]
<Search process>
Next, the search process will be described. FIG. 7 is a flowchart showing the search processing operation of the information search apparatus 20 according to this embodiment. The data in the extracted information storage unit 5 is assumed to be as shown in FIG. Hereinafter, description will be given with reference to this flowchart.
[0058]
In the present embodiment, in order to allow searching under various conditions, a search condition input screen such as that shown in FIG. 8 is displayed to the user. The screen of FIG. 8 includes a type column and a specific value column. The type column is for selecting the type of word to be searched for. It includes major classifications such as “date and time”, “event”, “place”, and “action”, and minor classifications such as “date of event” and “deadline”. “Not specified” is for searching without specifying a minor classification. Each minor classification has a check button so that it can be designated as a search condition. In the following description, each check button is indicated by a notation such as “date and time-date of event” or “action-not specified”.
[0059]
To search for “documents including the date of the event”, check “date and time-date of event”. To search for “documents that include August 3, 2002 as the date of the event”, check “date and time-date of event” and enter “2002/8/3” in the specific value column to its right. The specific value column may be left blank if there is no particular need to include a specific value in the search condition.
[0060]
To search all dates and times without distinguishing between the deadline and the date of the event, check “date and time-not specified”. To search for “documents including August 3, 2002 as date and time”, check “date and time-not specified” and enter “2002/8/3” in the specific value column to its right.
[0061]
An AND search is performed by checking multiple types. For example, when searching for “documents including seminars where the venue is XX dome and the date is August 3, 2002”, check three places as shown in FIG. Enter a specific value for “Location-Venue”.
[0062]
(Step S200)
The candidate selection unit 10 relaxes the search conditions input from the condition input unit 7 in consideration of extraction errors, thereby increasing the number of documents hit by the search.
The conditions are relaxed as follows:
(1) When the user has set a specific value, the search uses only the specific value, to cope with cases where the major or minor classification was not extracted.
(2) When the user has not set a specific value, the search uses only the major classification so that records without a minor classification are also hit, to cope with cases where the minor classification was not extracted.
[0063]
For example, in FIG. 8, the condition “place-venue” = “XX dome” is specified, but all records in the extracted information storage unit 5 whose external representation is “XX dome” are made to hit.
[0064]
The search condition shown in FIG. 8 specifies documents for which the extracted information storage unit 5 satisfies all of the following (Condition 1), (Condition 2), and (Condition 3).
(Condition 1) There is a record whose type item is “date and time-date of event” and whose internal representation is “2002/8/3”.
(Condition 2) There is a record whose type item is “event-seminar”.
(Condition 3) There is a record whose type item is “place-venue” and whose external representation is “XX dome”.
[0065]
In this embodiment, the search condition is relaxed and the search is performed using the following search condition.
(Condition 1 ') There is a record whose internal representation is "2002/8/3".
(Condition 2 ') There is a record whose major classification item is "event".
(Condition 3 ') There is a record whose external representation is "XX dome".
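The relaxation from Conditions 1 to 3 to Conditions 1' to 3' can be written as a small Python sketch. The dictionary layout and the function names `relax` and `matches` are hypothetical. Comparing a specific value against both the internal and the external representation is also an assumption made here to cover Conditions 1' and 3' with one rule.

```python
def relax(condition):
    """Apply relaxation rule (1) or (2) of step S200 to one search condition."""
    if condition.get("value") is not None:
        return {"value": condition["value"]}      # rule (1): specific value only
    return {"major": condition["major"]}          # rule (2): major classification only


def matches(record, relaxed):
    """Return True if a stored record is hit by a relaxed condition."""
    if "value" in relaxed:
        return relaxed["value"] in (record.get("internal"), record.get("external"))
    return record.get("major") == relaxed["major"]


conditions = [
    {"major": "date and time", "minor": "date of event", "value": "2002/8/3"},
    {"major": "event", "minor": "seminar", "value": None},
    {"major": "place", "minor": "venue", "value": "XX dome"},
]
relaxed = [relax(c) for c in conditions]
```

A record for “XX dome” whose minor classification was never extracted is still hit by the third relaxed condition, which is the point of the relaxation.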
[0066]
(Step S210)
The candidate selection unit 10 searches the extracted information storage unit 5. For example, in FIG. 9A, the records (serial numbers 1, 2, 6, 8, 9, 10, 11) indicated by the shaded pattern are hit.
[0067]
(Step S220)
The candidate selection unit 10 collects the hit records with the same document numbers and sets the document numbers satisfying all the conditions as the processing target. For example, in FIG. 9, the data of document number 1 does not satisfy (Condition 3 ') and is not processed. The data of document number 2 is a processing target because it satisfies all the conditions.
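Step S220 amounts to grouping the hit records by document number and keeping the documents for which every relaxed condition has at least one hit. A minimal Python sketch follows, assuming that each hit is given as a (document number, condition index) pair. The pairs below are illustrative and are not the data of FIG. 9.

```python
from collections import defaultdict


def select_documents(hits, n_conditions):
    """Return the document numbers for which all n_conditions conditions were hit."""
    satisfied = defaultdict(set)
    for document_number, condition_index in hits:
        satisfied[document_number].add(condition_index)
    return sorted(doc for doc, conds in satisfied.items() if len(conds) == n_conditions)


# Hypothetical hits: document 1 never hits condition 2, document 2 hits all three.
hits = [(1, 0), (1, 1), (1, 1), (2, 1), (2, 1), (2, 0), (2, 2)]
targets = select_documents(hits, 3)
```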
[0068]
(Step S230)
The fitness calculation unit 11 calculates the fitness with the search condition for each document made a processing target in step S220. First, the fitness between each record and the search condition is calculated. The value is set higher the more closely the major and minor classifications of the record match the search condition. FIG. 10 shows the fitness values. The calculation differs depending on whether or not a specific value is set in the search condition. When a specific value is set, the fitness is 1 if both the major and minor classifications of the retrieved record match the search condition. If the major classification matches and the minor classification of the retrieved record is blank, the fitness is 0.9. If the major classification matches but the minor classification differs from the search condition, the fitness is 0.5. If the major classification of the retrieved record differs from that of the search condition, the fitness is 0.1.
[0069]
On the other hand, when no specific value is set in the search condition, the fitness is 1 if both the major and minor classifications of the retrieved record match the search condition. If the major classification matches and the minor classification of the retrieved record is blank, the fitness is 0.85. If the major classification matches but the minor classification differs from the search condition, the fitness is 0.5.
[0070]
For example, serial number 8 in FIG. 9 matches (Condition 2 '), for which no specific value is set. Since only the major classification matches the search condition and the minor classification differs from it, the fitness is 0.5. Serial number 10 matches (Condition 1 '), for which a specific value is set. Since the major classification matches the search condition and the minor classification is blank, the fitness is 0.9. Serial number 11 matches (Condition 3 '), for which a specific value is set. Since the major classification matches the search condition and the minor classification is blank, the fitness is 0.9. The calculation results are shown in the “S230 fitness” column in FIG. 9B.
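The values described for step S230 can be collected into one function, as in the Python sketch below. The function name and signature are hypothetical, and the sketch assumes that the search condition names a minor classification. The description gives no value for a differing major classification when no specific value is set, so the sketch returns None in that case.

```python
def classification_fitness(cond_major, cond_minor, rec_major, rec_minor, has_value):
    """Fitness of one record against one search condition, per the step S230 values."""
    if rec_major != cond_major:
        # 0.1 is given only for conditions with a specific value.
        return 0.1 if has_value else None
    if rec_minor == cond_minor:
        return 1.0
    if rec_minor is None:                 # minor classification is blank
        return 0.9 if has_value else 0.85
    return 0.5                            # minor classification differs


# Serial numbers 8, 10, and 11 as described in the text; the minor classification
# "conference" of serial number 8 is a hypothetical stand-in for "differs from seminar".
s8 = classification_fitness("event", "seminar", "event", "conference", has_value=False)
s10 = classification_fitness("date and time", "date of event", "date and time", None, has_value=True)
s11 = classification_fitness("place", "venue", "place", None, has_value=True)
```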
[0071]
(Step S240)
Next, the fitness is calculated from the relationship between the positions at which words appear. When there are multiple search conditions, a higher score is given the closer the sentence positions of the matched records are. If the sentence positions of the records are the same, the fitness is 1. If the sentence positions differ among the records, the smallest sentence position difference (minimum distance) α is obtained for each matched record, and the fitness is calculated as
fitness = 1 − 0.1α
In FIG. 9, the smallest sentence position difference for serial number 9 is 1 (= 3 − 2), its distance from serial number 10, so its fitness is 0.9 (= 1 − 0.1 × (3 − 2)). The calculation results are shown in the “S240 fitness” column in FIG. 9B.
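The positional fitness of step S240 can be sketched in Python as below. The function name is hypothetical, and which records count as "the other matched records" is left to the caller. The sample positions are illustrative, except that the distance of 1 between serial numbers 9 and 10 comes from the text.

```python
def position_fitness(sentence_position, other_positions):
    """Return 1 - 0.1 * alpha, where alpha is the minimum sentence position difference."""
    alpha = min(abs(sentence_position - p) for p in other_positions)
    return 1 - 0.1 * alpha


# A record in sentence 2 whose nearest other matched record is in sentence 3.
s240_serial9 = position_fitness(2, [3, 5])
```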
[0072]
(Step S250)
For each record, the two fitness values calculated in steps S230 and S240 are combined as
β × (fitness obtained in step S230) + (1 − β) × (fitness obtained in step S240)
and the results are summed. In FIG. 9, when β = 0.5,
Serial number 8: 0.5 × 0.5 + 0.5 × 0.7 = 0.6
Serial number 9: 0.5 × 0.5 + 0.5 × 0.9 = 0.7
Serial number 10: 0.5 × 0.9 + 0.5 × 0.9 = 0.9
Serial number 11: 0.5 × 0.9 + 0.5 × 0.8 = 0.85
Since serial numbers 8 and 9 match the same condition (Condition 2 '), only serial number 9, which has the higher fitness, is included in the total. As a result, the fitness of document number 2 is 2.45 (= 0.7 + 0.9 + 0.85). The search condition of FIG. 8 consists of three conditions, so the maximum fitness is 3, and the document can be said to fit the condition relatively well.
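The arithmetic of step S250 can be checked with a short Python sketch. The function names and the tuple layout are hypothetical. The numeric values are the S230 and S240 fitness values quoted in the text for serial numbers 8 to 11.

```python
def combined(s230, s240, beta=0.5):
    """Weighted combination of the two fitness values of one record."""
    return beta * s230 + (1 - beta) * s240


def document_fitness(records, beta=0.5):
    """records: (condition, s230, s240) triples; keep the best record per condition and sum."""
    best = {}
    for condition, s230, s240 in records:
        score = combined(s230, s240, beta)
        best[condition] = max(best.get(condition, score), score)
    return sum(best.values())


document_2 = [
    ("2'", 0.5, 0.7),   # serial number 8
    ("2'", 0.5, 0.9),   # serial number 9
    ("1'", 0.9, 0.9),   # serial number 10
    ("3'", 0.9, 0.8),   # serial number 11
]
total = document_fitness(document_2)
```

With β = 0.5 the total agrees with the 2.45 given for document number 2, since serial number 8 is dropped in favor of serial number 9.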
[0073]
(Step S260)
The candidate alignment unit 12 sorts the documents using the fitness for each document calculated in step S250.
[0074]
(Step S270)
The display unit 8 outputs the search result. An output example is shown in FIG.
[0075]
(Effects of the first embodiment)
As described above, according to the present embodiment, extraction rules are not prepared for each field in advance, so the process of determining beforehand which field a document belongs to can be eliminated.
[0076]
Further, at the time of extraction, words representing the date and time, place, etc. in the document are not stored in a table format; instead, each word is stored together with its position information. At retrieval time, the degree of matching with the search condition is calculated using the type of each word and the positional relationship between the words. Because the results are presented to the user in descending order of fitness, the user can judge how far each search result can be trusted.
[0077]
Also, the process of extracting word types distinguishes whether a date and time is a deadline or the date of an event. This allows the user, for example, to search while distinguishing the event date from the deadline. On the other hand, candidate documents are searched with conditions relaxed from those set by the user, regardless of the word type the user specified as a search condition, so that no documents are missed.
[0078]
In the present embodiment, the process of determining word types from a document and registering them in the extracted information storage unit 5 (registration process) is performed by the document input unit 1, the extraction unit 2, the extraction rule storage unit 3, and the document information registration unit 4. The process of searching for a document (search process) is performed by the condition input unit 7, the display unit 8, and the search unit 9. When the extracted information storage unit 5 and the document storage unit 6 have been created in advance, the configuration of the information search device 30 shown in FIG. 12 may be adopted. That is, the device may consist of the condition input unit 7, the display unit 8, and the search unit 9, which perform the search process, together with the extracted information storage unit 5 and the document storage unit 6.
[0079]
(Second Embodiment)
In the first embodiment, a general document search has been described. In the present embodiment, a case will be described in which electronic mail is received via a network such as the Internet, the received electronic mail is sequentially registered, and searched according to a user request.
[0080]
FIG. 13 is an explanatory diagram showing a system configuration of the information search system 200 according to the present embodiment. Compared to the configuration of the first embodiment, a communication unit 13, a communication unit 14, a server 100, a client 110, and a network 120 are added. Server 100 and client 110 are connected to network 120.
[0081]
(Server 100)
As shown in FIG. 13, the server 100 includes a document input unit 1, an extraction unit 2, an extraction rule storage unit 3, a document information registration unit 4, an extraction information storage unit 5, a document storage unit 6, a search unit 9, and a communication unit 13. The search unit 9 includes a candidate selection unit 10, a fitness calculation unit 11, and a candidate alignment unit 12. The server 100 registers mail information for e-mails received through the network 120. It also receives search requests sent by the user from the client 110 and returns the search results. The communication unit 13 receives and transmits this information through the network 120. Since each component other than the communication unit 13 is substantially the same as in the first embodiment, a duplicate description is omitted.
[0082]
(Client 110)
As illustrated in FIG. 13, the client 110 includes a condition input unit 7, a display unit 8, and a communication unit 14. The client 110 is a computer such as a personal computer or a mobile terminal such as a mobile phone. The client 110 accepts user search conditions, connects to the network 120 from the communication unit 14, inquires the server 100, and outputs search results. Note that the components other than the communication unit 14 are substantially the same as those in the first embodiment, and a duplicate description thereof is omitted.
[0083]
(Network 120)
The network 120 may be of any protocol, topology, type of transmission medium, etc. A typical example of the network 120 is the Internet.
[0084]
The document information registration process is executed when the communication unit 13 receives an e-mail. As in the first embodiment, the processing from step S100 onward is executed for the received e-mail, so that the mail information is registered in the extracted information storage unit 5. The document information search process is executed when the user enters search conditions on, for example, the input screen shown in FIG. The input screen of FIG. 8 is displayed using CGI (Common Gateway Interface) or the like.
[0085]
FIG. 14 is a flowchart showing the operation of the information search system 40 according to this embodiment. This will be described below with reference to FIG.
(Step S300)
In the client 110, the condition input unit 7 transmits the input search condition to the communication unit 14.
[0086]
(Step S310)
The communication unit 14 transmits the input search condition to the server 100 via the network 120.
[0087]
(Step S320)
The server 100 receives the search condition by the communication unit 13 and executes the processes after step S200 in the first embodiment.
[0088]
(Step S330)
In the server 100, the communication unit 13 transmits the search result to the client 110 via the network 120.
[0089]
(Step S340)
The display unit 8 outputs the search result.
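The exchange in steps S300 to S340 can be pictured with a small in-process Python sketch. The JSON message layout and all function names are assumptions, since the description does not specify a message format. The network 120 is replaced by passing strings between functions, and the search itself is supplied by the caller.

```python
import json


def client_build_request(conditions):
    """S300/S310: the client serializes the search conditions for transmission."""
    return json.dumps({"conditions": conditions})


def server_handle(request_text, search):
    """S320/S330: the server runs the search and serializes the result."""
    conditions = json.loads(request_text)["conditions"]
    return json.dumps({"results": search(conditions)})


def client_show(response_text):
    """S340: the client decodes the result for the display unit."""
    return json.loads(response_text)["results"]


def fake_search(conditions):
    # Stand-in for steps S200 onward of the first embodiment.
    return [{"document": 2, "fitness": 2.45}]


request = client_build_request([{"major": "event", "minor": "seminar", "value": None}])
shown = client_show(server_handle(request, fake_search))
```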
[0090]
(Effect of the second embodiment)
As described above, according to the present embodiment, the present invention can also be applied to a case where e-mails are received via a network such as the Internet, the received e-mails are registered one after another, and they are searched in response to user requests.
[0091]
(Third embodiment)
In the first embodiment, only the words enclosed in tags are registered in the extracted information storage unit 5. Therefore, a word that should have been registered as an event or a place may not be registered because the extraction rules are incomplete. To deal with such extraction omissions, in this embodiment, words such as nouns that appear in the document but cannot be assigned a particular major or minor classification are also registered, in addition to the words enclosed in tags.
[0092]
As shown in FIG. 15, the information search device 50 according to the present embodiment is configured by replacing the extracted information storage unit 5 of the information search device 20 according to the first embodiment with a word information storage unit 15. In addition to the words enclosed in tags, the word information storage unit 15 registers words such as nouns that appear in the document and have no major or minor classification. The other constituent elements are substantially the same as those of the information search device 20, and a duplicate description is omitted.
[0093]
Next, the operation of the information search device 50 according to this embodiment will be described.
In order to register the noun appearing in the document in the word information storage unit 15, the result of the morphological analysis in step S1030 is used. Changes to FIG. 6 are shown in FIG.
[0094]
First, the process of step S1050 of the first embodiment is changed as follows.
(Step S1200)
If the morpheme to be processed is “.” (a full stop), the process goes to step S1060. If it is a tag other than an <HD> tag (headword tag), the corresponding end tag is located and the process goes to step S1070. If the morpheme is a noun, the process goes to step S1210. Otherwise, the process returns to step S1040. For example, “Yamada (noun)” is registered from the third line of FIG. 3B, and “following (noun)” and “schedule (noun)” are registered from the fifth line.
[0095]
Step S1210 is performed as follows.
(Step S1210)
The data is registered in the word information storage unit 15 with the major classification set to “word”. After registration, the process returns to step S1040.
[0096]
In addition, the registration destination in step S1100 of the first embodiment is changed to the word information storage unit 15. As a result, the data in the word information storage unit 15 for the document of FIG. 3B is as shown in FIG. 17. As FIG. 17 shows, “word” is added as a major classification, and “everyone”, “Yamada”, “following”, “schedule”, “everyone”, “participation”, and “how” are registered in the word information storage unit 15 as nouns.
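The noun registration of steps S1200 and S1210 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `Morpheme`, `WordRecord`, and `register_nouns` are hypothetical, a plain list stands in for the word information storage unit 15, the sentence number is assumed to start at 1 and to advance once per line, and the tag handling of step S1070 is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Morpheme:
    surface: str  # the word as it appears in the document
    pos: str      # part of speech produced by morphological analysis


@dataclass(frozen=True)
class WordRecord:
    sentence_no: int      # line-based sentence number (assumed to start at 1)
    major: str            # major classification
    minor: Optional[str]  # minor classification; None when not assigned
    value: str            # the registered word


def register_nouns(lines: List[List[Morpheme]]) -> List[WordRecord]:
    """Register every noun with the major classification "word".

    Hypothetical stand-in for steps S1200/S1210; the returned list plays
    the role of the word information storage unit.
    """
    store: List[WordRecord] = []
    for sentence_no, morphemes in enumerate(lines, start=1):
        for m in morphemes:
            if m.pos == "noun":  # branch corresponding to step S1210
                store.append(WordRecord(sentence_no, "word", None, m.surface))
            # any other morpheme: move on to the next one (step S1040)
    return store


# Illustrative input: nouns on the third and fifth lines, as in the text.
# The non-noun morphemes are made-up placeholders.
lines = [
    [],
    [],
    [Morpheme("Yamada", "noun"), Morpheme("desu", "auxiliary")],
    [],
    [Morpheme("following", "noun"), Morpheme("no", "particle"),
     Morpheme("schedule", "noun")],
]
records = register_nouns(lines)
for r in records:
    print(r.sentence_no, r.major, r.value)
```

Because the nouns carry no minor classification, a search that specifies one cannot match them exactly; they serve only as a fallback so that a document is still found when the extraction rules missed the word.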
[0097]
(Effect of the third embodiment)
As described above, according to the present embodiment, even a word that could not be extracted as an event or a place because of incomplete extraction rules is registered in the word information storage unit 15 as a noun, so that search omissions can be eliminated.
[0098]
The preferred embodiments of the information retrieval apparatus and information retrieval system according to the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is obvious that those skilled in the art can conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally belong to the technical scope of the present invention.
[0099]
For example, the following application examples can be considered.
(1) The fitness shown in FIG. 10 may be set separately for each major classification. The calculation method of steps S240 and S250 is also not limited to the one described.
(2) In step S210 of FIG. 7, the search may be performed by partial matching of specific values. In that case, a fitness for a partial match of the specific value may be added to the “when there is a specific value in the search condition” entries of the table in FIG. 10.
(3) The major and minor classifications are not limited to those mentioned in the above embodiments. For example, major classifications such as person name or organization name are possible.
(4) The candidate sorting unit 12 does not necessarily need to sort in order of fitness. The candidates may be sorted starting from the date closest to the current date, with the fitness shown when the list is displayed.
(5) Voice input may be used instead of an input screen such as that shown in FIG. 8.
(6) Since the sentence number is incremented by 1 for each line in step S1020 of FIG. 6, the sentence number cannot be calculated correctly when a line break occurs in the middle of a sentence. Therefore, as pre-processing in step S100, the document may be processed so that no line break occurs in the middle of a sentence.
(7) For the date and time condition shown in FIG. 8, candidate documents may be searched with the condition expanded to a range of one week before and after the specified date and time. The input screen of FIG. 8 may also be configured to allow a period to be entered.
(8) The document storage unit 6 need not store the document itself; it may instead store, for example, information indicating the storage location of the document (for example, a URL).
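Application example (7) can be illustrated with a short sketch. The function names `expand_date_condition` and `in_window` and the `margin_days` parameter are hypothetical; the embodiment only states that the condition may be widened to one week before and after the specified date and time.

```python
from datetime import date, timedelta
from typing import Tuple


def expand_date_condition(target: date, margin_days: int = 7) -> Tuple[date, date]:
    """Widen a single-day condition to [target - margin, target + margin]."""
    margin = timedelta(days=margin_days)
    return target - margin, target + margin


def in_window(extracted: date, target: date, margin_days: int = 7) -> bool:
    """True if a date extracted from a document falls inside the widened range."""
    start, end = expand_date_condition(target, margin_days)
    return start <= extracted <= end


# A search for 3 August 2002 widened by one week on each side.
start, end = expand_date_condition(date(2002, 8, 3))
print(start, end)  # 2002-07-27 2002-08-10
```

Widening the range increases the number of candidate documents, so it is natural to combine it with the fitness ordering, for instance by giving an exact date match a higher fitness than a date that only falls inside the widened range.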
[0100]
[Effects of the invention]
As described above, according to the present invention, the configuration does not hold separate extraction rules for each field, which makes it possible to eliminate the process of determining in advance which field a document belongs to.
[0101]
Further, at the time of extraction, words representing the date and time, place, and so on in the document are not stored in a table format; instead, the position information of each word is stored. At the time of retrieval, the fitness with respect to the search condition is calculated using the type of each word and the positional relationship between the words. Because the results are presented to the user in descending order of fitness, the user can judge how far each search result can be trusted.
[0102]
Also, in the process of extracting the type of a word, a date and time is distinguished as either a deadline date or an event date. Thus, for example, the user can search while distinguishing the event date from the deadline date. On the other hand, regardless of the word type the user specifies as a search condition, candidate documents are searched for under a condition more relaxed than the one the user set, so that search omissions can be avoided.
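The combination of relaxed candidate selection and fitness ranking described above can be sketched as follows. This is a simplified illustration under stated assumptions: the fitness values are hypothetical placeholders (the actual settings are those of FIG. 10, which are not reproduced here), the names `Extracted` and `search` are invented for the example, and the positional relationship between words, which the invention also uses when calculating fitness, is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Extracted:
    doc_id: str
    major: str            # major classification, e.g. "date"
    minor: Optional[str]  # minor classification, e.g. "event date"
    value: str            # the extracted character string


# Hypothetical fitness values; only their ordering matters in this sketch.
FIT_MAJOR_AND_MINOR = 1.0  # major and minor classification both match
FIT_MAJOR_ONLY = 0.5       # only the major classification matches
FIT_VALUE_ONLY = 0.2       # only the character string matches


def search(store: List[Extracted], value: str, major: str,
           minor: Optional[str]) -> List[Tuple[str, float]]:
    """Select candidates by value alone, then rank them by fitness."""
    best: Dict[str, float] = {}
    for rec in store:
        if rec.value != value:
            continue  # relaxed selection: the classification is not a filter
        if rec.major == major and rec.minor == minor:
            fit = FIT_MAJOR_AND_MINOR
        elif rec.major == major:
            fit = FIT_MAJOR_ONLY
        else:
            fit = FIT_VALUE_ONLY
        best[rec.doc_id] = max(fit, best.get(rec.doc_id, 0.0))
    # Descending order of fitness; ties keep registration order.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


store = [
    Extracted("mail-1", "date", "event date", "2002-08-03"),
    Extracted("mail-2", "date", "deadline", "2002-08-03"),
    Extracted("mail-3", "date", None, "2002-08-03"),  # minor not identified
    Extracted("mail-4", "date", "event date", "2002-09-01"),
]
results = search(store, "2002-08-03", "date", "event date")
print(results)
```

In this sketch a document whose date was extracted only as a generic date, or was misclassified as a deadline, still appears in the results, but below the document whose classification matches the condition exactly.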
[0103]
The present invention can also be applied to a case where electronic mail is received via a network such as the Internet, the received electronic mail is sequentially registered, and searched according to a user request.
[0104]
In addition, even if it cannot be extracted as an event or place due to incomplete extraction rules, it is possible to eliminate omissions by registering it as a noun in the word information storage unit.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram illustrating a configuration of an information search device according to a first embodiment;
FIG. 2 is an explanatory diagram showing data in an extraction rule storage unit;
FIG. 3 is an explanatory diagram illustrating an example of a document extraction result;
FIG. 4 is an explanatory diagram showing data in an extraction information storage unit;
FIG. 5 is a flowchart showing an operation of a registration process in the first embodiment.
FIG. 6 is a flowchart showing details of step S130 in FIG. 5.
FIG. 7 is a flowchart showing an operation of search processing in the first embodiment.
FIG. 8 is an explanatory diagram showing an example of a search condition input screen.
FIG. 9 is an explanatory diagram showing data in an extraction operation storage unit;
FIG. 10 is an explanatory diagram showing a set value of fitness.
FIG. 11 is an explanatory diagram illustrating an example of search result output;
FIG. 12 is an explanatory diagram illustrating another configuration of the information search device according to the first embodiment;
FIG. 13 is an explanatory diagram showing a configuration of an information search system according to a second embodiment.
FIG. 14 is a flowchart showing the operation of the information search system in the second embodiment.
FIG. 15 is an explanatory diagram illustrating a configuration of an information search device according to a third embodiment;
FIG. 16 is a flowchart showing details of step S130 in FIG. 5;
FIG. 17 is an explanatory diagram showing data in a word information storage unit;
[Explanation of symbols]
1 Document input unit
2 Extraction unit
3 Extraction rule storage unit
4 Document information registration unit
5 Extracted information storage unit
6 Document storage unit
7 Condition input unit
8 Display unit
9 Search unit
10 Candidate selection unit
11 Fitness calculation unit
12 Candidate sorting unit
13 Communication unit
14 Communication unit
15 Word information storage unit
20 Information search device
30 Information search device
40 Information search system
50 Information search device
100 Server
110 Client
120 Network

Claims

1. An information retrieval device for retrieving a document, comprising:
a document storage unit that stores documents;
an extracted information storage unit that stores, as information for searching the documents, character strings in the documents for each classification;
a condition input unit to which a character string and the classification can be input; and
a search unit including a candidate selection unit that selects candidate documents by searching the document storage unit for documents using the character string input to the condition input unit as a search condition, and a fitness calculation unit that calculates a fitness using the selected candidate documents and the classification input to the condition input unit,
wherein the fitness calculation unit assigns a first fitness to a candidate document, among the candidate documents, whose classification matches, and assigns a fitness lower than the first fitness to a candidate document whose classification does not match.

2. The information retrieval device according to claim 1, wherein, when no character string is input to the condition input unit and only the classification is input, the candidate selection unit selects the candidate documents using the classification as the search condition, and
in that case the fitness calculation unit assigns, to the candidate documents selected by the candidate selection unit, a fitness lower than the first fitness.

3. The information retrieval device according to claim 1, wherein the classification is divided into at least a major classification and a minor classification, and
the fitness calculation unit, using the selected candidate documents and the major classification and minor classification input to the condition input unit, assigns a second fitness to a candidate document, among the candidate documents, in which both the major classification and the minor classification match, and assigns a fitness lower than the second fitness to a candidate document in which only the major classification matches.

4. The information retrieval device according to claim 1, wherein the classification is divided into at least a major classification and a minor classification,
the candidate selection unit selects the candidate documents using the major classification as the search condition when no character string is input to the condition input unit and the major classification and minor classification are input, and
the fitness calculation unit, using the selected candidate documents and the minor classification input to the condition input unit, assigns a third fitness to a candidate document, among the candidate documents, in which the minor classification matches, and assigns a fitness lower than the third fitness to a candidate document in which the minor classification does not match.

5. The information retrieval device according to claim 1, further comprising an extraction unit that extracts a predetermined character string included in the document,
wherein, when date information is included in the document, the extraction unit classifies the date information as either an event date or a deadline date based on the positional relationship between the positions at which the date information and the predetermined character string appear, and stores the date information in the extracted information storage unit.

6. The information retrieval device according to any one of claims 1 to 5, wherein the classification of the character strings in the extracted information storage unit consists of a major classification determined according to extraction rules stored in an extraction rule storage unit and a minor classification obtained by subdividing the major classification.

7. The information retrieval device according to any one of claims 3, 4 and 6, wherein the major classification further includes a classification for registering nouns in the document, which are stored in the extracted information storage unit regardless of the extraction rules stored in the extraction rule storage unit.

8. An information retrieval system for retrieving a document, comprising
a server unit and a terminal unit connected via a network,
the server unit comprising:
a communication unit that exchanges, with the terminal unit, the character string and classification input to a condition input unit of the terminal unit and a search result;
a document storage unit that stores documents;
an extracted information storage unit that stores, as information for searching the documents, character strings in the documents for each classification; and
a search unit including a candidate selection unit that selects candidate documents by searching the document storage unit for documents using the character string input to the condition input unit of the terminal unit as a search condition, and a fitness calculation unit that calculates a fitness using the selected candidate documents and the classification,
the terminal unit comprising:
the condition input unit, to which a character string and the classification can be input; and
a communication unit that transmits and receives the character string and classification input to the condition input unit, and the search result,
wherein the fitness calculation unit assigns a first fitness to a candidate document, among the candidate documents, whose classification matches, and assigns a fitness lower than the first fitness to a candidate document whose classification does not match.

9. The information retrieval system according to claim 8, wherein the document is an electronic mail exchanged via the network.

10. The information retrieval system according to claim 8 or 9, wherein, when no character string is input to the condition input unit and only the classification is input, the candidate selection unit selects the candidate documents using the classification as the search condition, and
in that case the fitness calculation unit assigns, to the candidate documents selected by the candidate selection unit, a fitness lower than the first fitness.

11. The information retrieval system according to claim 8 or 9, wherein the classification is divided into at least a major classification and a minor classification, and
the fitness calculation unit, using the selected candidate documents and the major classification and minor classification input to the condition input unit, assigns a second fitness to a candidate document, among the candidate documents, in which both the major classification and the minor classification match, and assigns a fitness lower than the second fitness to a candidate document in which only the major classification matches.

12. The information retrieval system according to claim 8, wherein the classification is divided into at least a major classification and a minor classification,
the candidate selection unit selects the candidate documents using the major classification as the search condition when no character string is input to the condition input unit and the major classification and minor classification are input, and
the fitness calculation unit, using the selected candidate documents and the minor classification input to the condition input unit, assigns a third fitness to a candidate document, among the candidate documents, in which the minor classification matches, and assigns a fitness lower than the third fitness to a candidate document in which the minor classification does not match.

13. The information retrieval system according to claim 8, further comprising an extraction unit that extracts a predetermined character string included in the document,
wherein, when date information is included in the document, the extraction unit classifies the date information as either an event date or a deadline date based on the positional relationship between the positions at which the date information and the predetermined character string appear, and stores the date information in the extracted information storage unit.

14. The information retrieval system according to any one of claims 8 to 13, wherein the classification of the character strings in the extracted information storage unit consists of a major classification determined according to extraction rules stored in an extraction rule storage unit and a minor classification obtained by subdividing the major classification.

15. The information retrieval system according to any one of claims 11, 12, and 14, wherein the major classification further includes a classification for registering nouns in the document, which are stored in the extracted information storage unit regardless of the extraction rules stored in the extraction rule storage unit.