JP2005018555A - Document data retrieval method and program

Info

Publication number: JP2005018555A
Application number: JP2003184379A
Authority: JP
Inventors: Hiroko Ao; 裕子青; Toshihisa Takagi; 利久高木
Original assignee: Kanebo Ltd
Current assignee: Kanebo Ltd
Priority date: 2003-06-27
Filing date: 2003-06-27
Publication date: 2005-01-20
Anticipated expiration: 2023-06-27
Also published as: JP4221249B2

Abstract

PROBLEM TO BE SOLVED: To provide a document data retrieval method that sharply improves the accuracy rate and the coverage rate.

SOLUTION: A computer for retrieving document data associated with a keyword is made to execute: a step 12 in which an alias dictionary preparing means prepares an alias dictionary by collecting aliases corresponding to the keyword from various public databases; a step 13 in which a document retrieval means retrieves and extracts the document data in which terms listed in the alias dictionary are used; a step 14 in which an abbreviation dictionary preparing means prepares an abbreviation dictionary by extracting abbreviation information from the document data; a step 15 in which a validity determination means checks the terms used in the retrieved document data against the alias dictionary and the abbreviation dictionary and determines whether the document data is validly associated with the keyword; and a step 16 in which an output means outputs the document data determined to be valid as the retrieval result.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

[0001]
[Technical field of the invention]
The present invention relates to a document data search method and program for searching desired document data from a large number of document data using a keyword as a clue.
[0002]
[Prior art]
For example, when a biomedical researcher wants to collect information needed for research, it is common practice to access an online document source such as MEDLINE (a bibliographic database of biomedical articles) and retrieve the necessary information. To make the search results as accurate as possible, the researcher typically enters several keywords.
[0003]
[Problems to be solved by the invention]
However, the number of papers registered in databases such as MEDLINE has increased rapidly in recent years, making it difficult to obtain the documents needed for research accurately and efficiently. The main cause is that genes and proteins are referred to by aliases. When a researcher searches for documents using a gene or protein name as a keyword, two problems arise: because names are ambiguous (one name can have several meanings), many unrelated documents are included in the results, and because of synonymy (one entity can have several names), necessary documents are missing from the results. In addition, because the choice of keywords greatly affects the accuracy of the results, appropriate keywords must be set, but merely choosing appropriate keywords can require advanced prior knowledge of the search target, which takes time and effort. For these reasons, when searching biomedical academic papers for documents needed for research, it is difficult to retrieve comprehensively and efficiently only the documents that contain the desired information, and the search results include many unrelated documents. Researchers ultimately have to sort through the documents while checking their contents, so considerable labor is required just to obtain them. Moreover, even if a desired document is missing from the search results, the researcher has no way of noticing it.
The present invention is not limited to searches that use gene names or protein names as keywords; the search keyword may be of any kind.
[0004]
An object of the present invention is to provide a document data search method and program capable of efficiently and comprehensively narrowing down required documents from a large number of documents using keywords.
[0005]
[Means for Solving the Problems]
In order to solve the above problem, the present invention proposes a method for searching, by means of a computer, only the document data related to a given keyword among a large number of document data, the method comprising: a step in which an alias dictionary creating means collects aliases corresponding to the keyword from various public databases and creates an alias dictionary for the keyword; a step in which a document search means searches for and retrieves document data in which terms listed in the alias dictionary are used; a step in which an abbreviation dictionary creating means extracts abbreviation information from the document data retrieved by the document search means and creates an abbreviation dictionary; a step in which a validity determination means uses the alias dictionary and the abbreviation dictionary to check the terms used in the document data retrieved by the document search means and determines the validity of the document data, that is, whether or not it is document data related to the keyword; and a step in which an output means outputs the document data determined to be valid by the validity determination means as a search output result.
[0006]
Also proposed is a document data search program for causing a computer, in order to search for document data related to a given keyword, to function as: an alias dictionary creating means that collects aliases corresponding to the keyword from various public databases and creates an alias dictionary for the keyword; a document search means that searches for and retrieves document data in which terms listed in the alias dictionary are used; an abbreviation dictionary creating means that extracts abbreviation information from the document data retrieved by the document search means and creates an abbreviation dictionary; a validity determination means that uses the alias dictionary and the abbreviation dictionary to check the terms used in the document data retrieved by the document search means and determines the validity of the document data, that is, whether or not it is document data related to the keyword; and an output means that outputs the document data determined to be valid by the validity determination means as a search output result.
[0007]
[Embodiments of the invention]
Hereinafter, an example of an embodiment of the present invention will be described in detail with reference to the drawings.
[0008]
FIG. 1 is a system configuration diagram of a document search apparatus according to the present invention, and FIG. 2 is a flowchart of a document search process executed by the document search apparatus.
[0009]
The document search apparatus 1 is built on a computer device: 2 is a central processing unit (CPU), 3 is a document database in which the document data to be searched is stored, 4 is an external memory, 5 is an internal memory in which a processing program for document search is stored, and 6 is an input/output device, all of which are connected to a bus 7. A keyboard 8, a display device 9, and a printing device 10 are connected to the input/output device 6. The document search apparatus 1 includes a communication control unit 7A so that it can access various external databases via a communication network N. Through a public line or the like, the document search apparatus 1 can access a document database 3A and a public database 3B connected to the communication network N, fetch the necessary documents, and store them in the external memory 4. Although FIG. 1 illustrates only two databases, the document database 3A and the public database 3B, any number of document databases may be used.
[0010]
FIG. 2 is a diagram showing a processing flow of a document data search process performed by the processing program stored in the internal memory 5 being executed in the central processing unit 2.
[0011]
First, in step 11, the user enters a keyword for the document search from the keyboard 8. In step 12, the communication control unit 7A accesses the document database 3A and the public database 3B via the communication network N, finds the aliases used for the given keyword in the various public databases, and creates an alias dictionary. This alias dictionary is stored in the external memory 4. In addition, a file I for looking up the given keyword in reverse from each alias is created, and it is also stored in the external memory 4. The alias dictionary, which records what aliases are used for the given keyword, can be created from the public databases by any appropriate means.
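As a rough illustration of step 12, the alias dictionary and file I can be pictured as a forward map and its inverse. The sketch below is only an assumption about one possible data layout: the function names, the toy databases `source_one` and `source_two`, and the invented keywords `geneX` and `geneY` do not come from the specification, which leaves the collection method open.

```python
from collections import defaultdict


def build_alias_dictionary(keyword, sources):
    """Step 12: gather every alias recorded for `keyword` across the sources."""
    aliases = {keyword}  # assumption: the keyword itself is also a search term
    for source in sources:
        aliases.update(source.get(keyword, ()))
    return aliases


def build_file_i(alias_dictionaries):
    """File I: map each alias back to the given keyword(s) it belongs to."""
    file_i = defaultdict(set)
    for keyword, aliases in alias_dictionaries.items():
        for alias in aliases:
            file_i[alias].add(keyword)
    return dict(file_i)


# Toy stand-ins for public databases; all names are invented.
source_one = {"geneX": ["GX", "x factor"], "geneY": ["y kinase"]}
source_two = {"geneX": ["XF1"], "geneY": ["GX"]}

alias_dictionaries = {
    kw: build_alias_dictionary(kw, [source_one, source_two])
    for kw in ("geneX", "geneY")
}
file_i = build_file_i(alias_dictionaries)
print(sorted(file_i["GX"]))  # the ambiguous alias points back to both keywords
```

Storing a set of keywords per alias reflects the case, described later for step 24, in which one search keyword corresponds to several given keywords.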
[0012]
In the next step 13, the communication control unit 7A accesses the document database 3 via the communication network N and, from the document data stored in, for example, the relevant biomedical document database, searches for and downloads (collects) all document data in which terms listed in the created alias dictionary are used. The collected document data is stored in the external memory 4. Because the many aliases of the given keyword are also used as search keywords, the coverage rate of the document search is expected to improve. The group of search keywords (possibly a single keyword) that caused a given document to be collected is recorded as a search keyword list (A) associated with the document data ID, and list (A) is stored in the external memory 4.
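Step 13 can be sketched as a scan that records, for every collected document, which search keywords caused it to be collected. This is a minimal sketch under stated assumptions: an in-memory dictionary stands in for the document database, matching is a case-insensitive substring test, and `collect_documents` and the sample documents are hypothetical.

```python
def collect_documents(documents, search_keywords):
    """Step 13: return {doc_id: list (A)}, the search keywords found in each document."""
    list_a = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        hits = [kw for kw in search_keywords if kw.lower() in lowered]
        if hits:  # documents with no hit are not collected at all
            list_a[doc_id] = hits
    return list_a


# Invented sample documents.
documents = {
    "doc1": "The x factor (XF1) regulates growth.",
    "doc2": "GX expression was measured in liver.",
    "doc3": "Unrelated study of soil bacteria.",
}
list_a = collect_documents(documents, ["geneX", "GX", "x factor", "XF1"])
print(list_a)
```

Substring matching is deliberately generous here: it favors coverage, and the false hits it produces are what the validity determination in step 15 is meant to filter out.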
[0013]
In step 14, abbreviation information is extracted from each document collected by the search in step 13. The extracted abbreviation information is stored in the external memory 4 as an abbreviation dictionary associated with the document data ID. Here, abbreviation information consists of pairs of an abbreviation appearing in the document data and its corresponding formal name (definition); any known, suitable apparatus or method can be used for the abbreviation extraction.
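The specification leaves the extraction method open, so the following is only one simple heuristic, not the method of the invention. It pairs the text inside each pair of parentheses with the words just before it; the rule for how many preceding words to take (one word if the parenthesized text has several words, otherwise as many words as the abbreviation has characters) is an assumption made for illustration.

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")


def extract_abbreviations(text):
    """Return (inside-parentheses, before-parentheses) pairs found in `text`."""
    pairs = []
    for match in PAREN.finditer(text):
        inner = match.group(1).strip()
        preceding = text[:match.start()].split()
        if not inner or not preceding:
            continue
        # Heuristic (assumption): a multi-word parenthetical is a definition,
        # so the abbreviation is the single word before it; otherwise take one
        # preceding word per character of the abbreviation.
        n_words = 1 if " " in inner else len(inner)
        outer = " ".join(preceding[-n_words:])
        pairs.append((inner, outer))
    return pairs


print(extract_abbreviations("The x factor binding protein (XFBP) was purified."))
print(extract_abbreviations("XFBP (x factor binding protein) was purified."))
```

Both orderings are handled because a document may put either the abbreviation or the definition inside the parentheses.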
[0014]
In step 15, the validity of each document retrieved in step 13 is determined using the alias dictionary, file I, list (A), and the abbreviation dictionary stored in the external memory 4. The document set retrieved in step 13 may contain document data that has nothing to do with the given keyword, collected because of the ambiguity of terms. Therefore, to improve search accuracy, each retrieved document is judged on whether it can be said to relate to the given keyword, and documents unrelated to the given keyword are excluded as far as possible. This validity determination is performed under the following conditions.
[0015]
When an abbreviation dictionary has been created from the document data under evaluation by the abbreviation extraction process:
1. The search keyword that caused the document data to be collected appears in the abbreviation part of the abbreviation dictionary, and the definition part contains at least one word that satisfies a certain criterion.
Or:
The search keyword that caused the document data to be collected appears in the definition part of the abbreviation dictionary, and the abbreviation part contains at least one word that satisfies a certain criterion.
Here, the words that satisfy a certain criterion are selected from among all the words listed in the alias dictionary of the given keyword.
2. The search keyword appears in the title or abstract (summary) of the document data.
3. Neither the word before nor the word after the search keyword found in the title or abstract is a predetermined disallowed word.
Only when all three conditions are met is the document data determined to be valid as desired document data, and the process proceeds to step 16.
[0016]
When no abbreviation dictionary has been created from the document data under evaluation, or when one has been created but neither its abbreviations nor its definitions contain the search keyword that caused the document data to be collected:
1. The title or abstract of the document data contains at least one word that satisfies a certain criterion.
However, when the search keyword consists of a single word and is itself a word that satisfies the criterion, the title or abstract must contain at least one criterion-satisfying word other than the search keyword.
Here, the words that satisfy a certain criterion are selected from among all the words listed in the alias dictionary of the given keyword.
Or:
The search keyword consists of six or more characters.
This character count is not fixed, however, and can be adjusted as appropriate for the search target.
2. The search keyword appears in the title or abstract.
3. Neither the word before nor the word after the search keyword found in the title or abstract is a predetermined disallowed word.
Only when all three conditions are met is the document data determined to be valid as desired document data, and the process proceeds to step 16.
[0017]
In this way, a validity determination result is obtained for each document, and in step 16 only the document data determined to be valid is output as the search result. Document data not determined to be valid is excluded from the search result. The output format can freely combine the ID (number), title, abstract, authors, and other fields of the document data according to the user's request. The result can be shown on the display device 9, printed by the printing device 10, or both. The search result can also be stored in the internal memory 5 so that it can be retrieved, displayed, and printed at any time.
[0018]
Although the title or abstract is used here as the information on which document data is judged, these are merely examples; the information can be selected as appropriate from all the information attached to the document data.
[0019]
FIGS. 3 to 5 show a more specific embodiment of the validity determination process executed in step 15. The validity determination program 20 first determines whether each search keyword that caused the document data to be collected is validly related to the given keyword, and creates a list (B) that retains only the valid search keywords from the search keyword list (A).
[0020]
In step 21, first, it is checked whether or not there is undetermined data in the document data to be validated. If it exists, the determination result in step 21 is YES and step 22 is entered. Thereafter, steps 22 to 35 are repeatedly executed for each document data until the document data becomes empty, and the validity of the document data is determined.
[0021]
In step 22, one document data is fetched from the external memory 4 into the internal memory 5. Subsequently, the abbreviation dictionary and list (A) corresponding to the document ID are fetched from the external memory 4 into the internal memory 5 respectively.
[0022]
In step 23, it is checked whether or not there is an undetermined search keyword in the list (A). If it exists, the determination result in step 23 is YES, and step 24 is entered. Thereafter, steps 24 to 32 are repeatedly executed for each search keyword until the list (A) becomes empty, and the validity of the search keyword is determined.
[0023]
In step 24, one search keyword is read from list (A). Next, the file I corresponding to the search keyword is loaded from the external memory 4 into the internal memory 5. The given keyword corresponding to the search keyword is then read from file I, and the alias dictionary corresponding to that given keyword is loaded from the external memory 4 into the internal memory 5. The abbreviation dictionary for the loaded document data records pairs consisting of the data inside a pair of parentheses in the document (for example, an abbreviation) and the data immediately before the parentheses (for example, the definition of that abbreviation). In some cases the definition is inside the parentheses and the abbreviation immediately precedes them. In either case, in the following description the data inside the parentheses is called the inner and the data immediately before the parentheses is called the outer.
File I may contain more than one given keyword corresponding to the search keyword. This happens because, owing to ambiguity, exactly the same search keyword can be collected for entirely different given keywords. In this case, steps 24 (from the point where the alias dictionary corresponding to the given keyword is loaded from the external memory 4 into the internal memory 5) to 32 are performed for every given keyword, and the process then returns to step 23.
[0024]
After the information needed for the determination has been loaded in step 24, step 25 is entered. In step 25, it is determined whether any inner in the abbreviation dictionary matches the search keyword. If a matching inner exists, the determination result in step 25 is YES and step 26 is entered.
[0025]
In step 26, it is checked whether or not an alias matching the outer corresponding to the inner in the abbreviation dictionary exists in the alias dictionary corresponding to the given keyword, and the validity of the search keyword is determined. If it is determined that there is validity, the determination result in step 26 is YES, the search keyword is stored in the list (B), and the process returns to step 23.
[0026]
If the determination result in step 26 is NO, step 27 is entered. In step 27, it is checked whether or not at least one word that matches the outer corresponding to the inner in the abbreviation dictionary and satisfies a certain criterion is included, and the validity of the search keyword is determined. If it is determined that there is validity, the determination result in step 27 is YES, the search keyword is stored in the list (B), and the process returns to step 23.
Here, a word satisfies the criterion if it is listed in the alias dictionary of the given keyword, consists of four or more characters, and is not one of the disallowed words prepared in advance. Only words that meet all of these requirements count as words that satisfy the criterion. The criterion is not limited to this setting, however, and can be changed as appropriate for the search target. If the determination result in step 27 is NO, the search keyword is judged to be unrelated to the given keyword, and the process returns directly to step 23.
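The criterion used in step 27 lends itself to a small filter. The sketch below assumes that aliases are split into words on whitespace and compared in lower case, neither of which is stated in the specification; `qualifying_words` and the sample data are hypothetical.

```python
def qualifying_words(alias_dictionary, disallowed_words, min_length=4):
    """Words from the alias dictionary that are long enough and not disallowed."""
    words = set()
    for alias in alias_dictionary:
        for word in alias.lower().split():
            if len(word) >= min_length and word not in disallowed_words:
                words.add(word)
    return words


# Invented alias dictionary and disallowed-word list.
aliases = {"XFBP", "x factor binding protein", "XF1 type regulator"}
disallowed = {"protein", "factor"}
print(sorted(qualifying_words(aliases, disallowed)))
```

`min_length` is a parameter because the text notes that the four-character threshold may be adjusted to suit the search target.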
[0027]
If the determination result in step 25 is NO, step 28 is entered. In step 28, it is determined whether or not there is a match with the search keyword in the outer part of the abbreviation dictionary. If there is a matching outer, the determination result in step 28 is YES and step 29 is entered.
[0028]
In step 29, it is checked whether or not an alias matching the inner corresponding to the outer of the abbreviation dictionary exists in the alias dictionary corresponding to the given keyword, and the validity of the search keyword is determined. If it is determined that there is validity, the determination result in step 29 is YES, the search keyword is stored in the list (B), and the process returns to step 23.
[0029]
If the determination result in step 29 is NO, step 30 is entered. In step 30, it is checked whether or not at least one word that matches the inner corresponding to the outer of the abbreviation dictionary and satisfies a certain criterion is included, and the validity of the search keyword is determined. If it is determined that there is validity, the determination result in step 30 is YES, the search keyword is stored in the list (B), and the process returns to step 23.
Here, the certain standard is the same as the standard used in step 27. If the determination result in step 30 is NO, it is determined that the search keyword is unrelated to the given keyword, and the process returns to step 23 as it is.
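Steps 25 to 30 can be read as one symmetric check: find the abbreviation entry in which the search keyword appears as inner (or, failing that, as outer) and test the other half of the pair. The sketch below is an interpretation, not the specification's own code. It treats the criterion test of steps 27 and 30 as "the counterpart contains a qualifying word", in line with the summary conditions of paragraph [0015], and it returns `None` when the abbreviation dictionary cannot decide, so that a caller can fall through to step 31. All names are hypothetical.

```python
def check_with_abbreviations(search_keyword, abbreviations, alias_dictionary, criterion_words):
    """Steps 25-30: True/False when decided, None when step 31 must take over."""
    kw = search_keyword.lower()
    aliases = {alias.lower() for alias in alias_dictionary}
    # Pass (0, 1) matches the keyword against inners (step 25),
    # pass (1, 0) against outers (step 28).
    for own, other in ((0, 1), (1, 0)):
        for pair in abbreviations:
            if pair[own].lower() != kw:
                continue
            counterpart = pair[other].lower()
            if counterpart in aliases:  # steps 26 / 29: counterpart is a known alias
                return True
            # steps 27 / 30: counterpart contains a qualifying word
            return any(word in criterion_words for word in counterpart.split())
    return None


aliases = {"XFBP", "x factor binding protein"}
criterion = {"binding"}
print(check_with_abbreviations("XFBP", [("XFBP", "x factor binding protein")], aliases, criterion))
print(check_with_abbreviations("XFBP", [("XFBP", "xenon flash bulb pulse")], aliases, criterion))
print(check_with_abbreviations("XFBP", [], aliases, criterion))
```

The second call shows the intended effect: the same abbreviation defined differently in the document is rejected as unrelated to the given keyword.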
[0030]
If the determination result in step 28 is NO, that is, if reference to the abbreviation dictionary shows that the search keyword is contained in neither an inner nor an outer, or if no abbreviation dictionary exists for the document data at all, step 31 is entered. In step 31, it is checked whether the title or abstract of the document data contains at least one word that satisfies a certain criterion, and the validity of the search keyword is determined. If it is determined that there is validity, the determination result in step 31 is YES, the search keyword is stored in the list (B), and the process returns to step 23.
Here, the certain standard is the same as the standard used in step 27.
[0031]
If the determination result in step 31 is NO, the process enters step 32. In step 32, it is checked whether or not the search keyword is composed of 6 characters or more, and the validity of the search keyword is determined. If it is determined that there is validity, the determination result in step 32 is YES, the search keyword is stored in the list (B), and the process returns to step 23. If the determination result in step 32 is NO, it is determined that the search keyword is unrelated to the given keyword, and the process returns to step 23 as it is.
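Steps 31 and 32 form the fallback used when the abbreviation dictionary gives no answer. The sketch below also applies the proviso from paragraph [0016] that a single-word search keyword cannot serve as its own supporting evidence. The tokenization (lower-cased runs of letters and digits) and the name `fallback_check` are assumptions made for illustration.

```python
import re


def fallback_check(search_keyword, title, abstract, criterion_words, min_keyword_length=6):
    """Steps 31-32: supporting word in title/abstract, else keyword length test."""
    words = set(re.findall(r"[a-z0-9]+", f"{title} {abstract}".lower()))
    kw = search_keyword.lower()
    candidates = set(criterion_words)
    if " " not in kw:
        candidates.discard(kw)  # the keyword itself does not count as evidence
    if words & candidates:  # step 31
        return True
    return len(search_keyword) >= min_keyword_length  # step 32


print(fallback_check("XF1", "Binding of XF1 to DNA", "", {"binding"}))
print(fallback_check("XF1", "XF1 levels in serum", "", {"binding"}))
```

Short keywords with no supporting word are rejected because they are the ones most likely to collide with unrelated terms; a keyword of six or more characters is accepted on length alone.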
[0032]
If the decision result in the step 23 is NO, that is, if there is no undetermined search keyword in the list (A), the process enters the step 33. In step 33, it is checked whether or not there is an undetermined search keyword in the list (B). If it exists, the determination result in step 33 is YES, and step 34 is entered. Thereafter, steps 34 to 35 are repeatedly executed for each search keyword until the list (B) becomes empty, and the validity of the search keyword is determined.
[0033]
In step 34, one search keyword is read from the list (B). Next, it is determined whether or not the search keyword exists in the title or abstract of the document data. If the search keyword exists in the document data, the determination result in step 34 is YES and the process enters step 35. If the determination result in step 34 is NO, the process returns to step 33.
[0034]
In step 35, the words immediately before and after the search keyword found in the title or abstract of the document data are checked against the predetermined disallowed words. If neither word is a disallowed word, the determination result in step 35 is YES; the combination of the document ID and the given keyword is saved in list II as a search result, and the process returns to step 33.
However, the words before and after the search keyword are limited to the same sentence, and when the search keyword is used at the beginning or end of a sentence, only the word immediately after or immediately before is targeted.
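The neighbor-word test of steps 34 and 35 might look like the sketch below. It assumes sentences end at `.`, `!`, or `?` followed by whitespace, that matching is case-insensitive on whole tokens, and that a single occurrence with a disallowed neighbor is enough to reject the keyword; the specification does not say how multiple occurrences are handled. `passes_context_check` and the disallowed word `scan` are hypothetical.

```python
import re


def passes_context_check(search_keyword, text, disallowed_words):
    """Steps 34-35: keyword must occur, with no disallowed word right next to it."""
    kw = search_keyword.lower().split()
    found = False
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
        for i in range(len(tokens) - len(kw) + 1):
            if tokens[i:i + len(kw)] != kw:
                continue
            found = True
            # Neighbors are taken from the same sentence only; at a sentence
            # boundary the missing neighbor is simply skipped.
            before = tokens[i - 1] if i > 0 else None
            end = i + len(kw)
            after = tokens[end] if end < len(tokens) else None
            if before in disallowed_words or after in disallowed_words:
                return False
    return found


print(passes_context_check("GX", "The GX gene was cloned.", {"scan"}))
print(passes_context_check("GX", "A GX scan was taken.", {"scan"}))
```

Splitting into sentences first is what keeps a word from the next sentence from being treated as a neighbor of a keyword that ends the current one.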
[0035]
If the determination result in step 35 is NO, that is, if either the word before or the word after the search keyword is a disallowed word, the process returns directly to step 33.
[0036]
If the determination result in step 33 is NO, that is, if there is no undetermined search keyword in the list (B), the process returns to step 21.
[0037]
When the processing of all the prepared document data has been completed in this way, the determination result in step 21 becomes NO and the program ends. The search results are presented to the user after the information needed for output has been extracted from list II.
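The overall control flow of steps 21 to 35 can be summarized as two nested loops that first build list (B) and then fill list II. In this sketch the two decision stages are passed in as functions so that the loop structure stands on its own; `validate_documents`, `keyword_is_valid`, and `context_ok` are hypothetical names, and skipping duplicate entries in list II is an assumption.

```python
def validate_documents(list_a_by_doc, file_i, keyword_is_valid, context_ok):
    """Steps 21-35: return list II as (document ID, given keyword) pairs."""
    list_ii = []
    for doc_id, list_a in list_a_by_doc.items():  # steps 21-22: next document
        list_b = []
        for search_keyword in list_a:  # steps 23-24: next search keyword
            # One search keyword may map to several given keywords in file I.
            for given_keyword in sorted(file_i.get(search_keyword, ())):
                if keyword_is_valid(doc_id, search_keyword, given_keyword):  # steps 25-32
                    list_b.append((search_keyword, given_keyword))
        for search_keyword, given_keyword in list_b:  # steps 33-35
            entry = (doc_id, given_keyword)
            if context_ok(doc_id, search_keyword) and entry not in list_ii:
                list_ii.append(entry)
    return list_ii


# Stub decision functions, for illustration only.
result = validate_documents(
    {"doc1": ["GX"], "doc2": ["GX"]},
    {"GX": {"geneX", "geneY"}},
    keyword_is_valid=lambda doc, sk, gk: gk == "geneX",
    context_ok=lambda doc, sk: doc == "doc1",
)
print(result)
```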
[0038]
[Effect of the invention]
According to the present invention, it is possible to dramatically improve the search accuracy, that is, the accuracy rate and the coverage rate, and it is possible to greatly reduce the researcher's information collection time and labor.
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing an example of an embodiment of the present invention.
FIG. 2 is a flowchart showing a processing program executed in the search processing system.
FIG. 3 is a diagram showing a part of a detailed flowchart of validity determination processing in FIG. 2.
FIG. 4 is a diagram showing a part of a detailed flowchart of validity determination processing in FIG. 2.
FIG. 5 is a diagram showing a part of a detailed flowchart of validity determination processing in FIG. 2.
[Explanation of symbols]
1 Document search apparatus
2 Central processing unit
3 Document database
4 External memory
5 Internal memory
6 Input/output device
7 Bus
8 Keyboard
9 Display device
10 Printing device

Claims

A method for searching, by means of a computer, only the document data related to a given keyword among a large number of document data, the method comprising:
a step in which an alias dictionary creating means collects aliases corresponding to the keyword from various public databases and creates an alias dictionary for the keyword;
a step in which a document search means searches for and retrieves document data in which terms listed in the alias dictionary are used;
a step in which an abbreviation dictionary creating means extracts abbreviation information from the document data retrieved by the document search means and creates an abbreviation dictionary;
a step in which a validity determination means uses the alias dictionary and the abbreviation dictionary to check the terms used in the document data retrieved by the document search means and determines the validity of the document data, that is, whether or not it is document data related to the keyword; and
a step in which an output means outputs the document data determined to be valid by the validity determination means as a search output result.

A document data search program for causing a computer, in order to search for document data related to a given keyword, to function as:
an alias dictionary creating means that collects aliases corresponding to the keyword from various public databases and creates an alias dictionary for the keyword;
a document search means that searches for and retrieves document data in which terms listed in the alias dictionary are used;
an abbreviation dictionary creating means that extracts abbreviation information from the document data retrieved by the document search means and creates an abbreviation dictionary;
a validity determination means that uses the alias dictionary and the abbreviation dictionary to check the terms used in the document data retrieved by the document search means and determines the validity of the document data, that is, whether or not it is document data related to the keyword; and
an output means that outputs the document data determined to be valid by the validity determination means as a search output result.