JP2005004545A - Full-text searching device, program for processing document data, and recording medium

Info

Publication number: JP2005004545A
Application number: JP2003168375A
Authority: JP
Inventors: Kensaku Yamamoto; 研策山本; Yuichi Kojima; 裕一小島; Hiroko Ida; 裕子井田; Yukiko Hiraoka; 優希子平岡; Yasutsugu Ogawa; 泰嗣小川; Masayuki Kameda; 雅之亀田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-06-12
Filing date: 2003-06-12
Publication date: 2005-01-06
Anticipated expiration: 2023-06-12
Also published as: JP4262529B2

Abstract

PROBLEM TO BE SOLVED: To provide a full-text search device that supports document data in a plurality of languages.

SOLUTION: A registration processing means 7 registers document data input from an input means 1, together with its language information, in a document data storage part 3 along with an identifier (document identifier) that represents the document data. It also increments the number of document data for that language and registers the count in a language-based number storage part 4. Using a plurality of division processing means 6 provided for different languages, the registration processing means 7 divides the document data into index units and normalizes them. Index units whose notations differ after normalization are registered in a full-text index storage part 5, together with information on their appearance positions within the document data and the document identifier of the document data.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の文書データから指定された文字列を含む文書データを検索する全文検索装置、文書データの処理プログラム及び記録媒体に関し、特に、文書管理システム、電子図書館システム、特許公報検索システム、といった大量の文書データを管理する際に極めて好適な全文検索装置、文書データの処理プログラム及び記録媒体に関する。
【０００２】
【従来の技術】
近年の情報通信技術の発達により、大量の電子化文書にアクセスできる環境が整いつつある。このような状況下において、ユーザが所望の文書を精度よく、さらには高速に検索する文書検索装置が提案されている。この文書検索装置には、キーワード検索手法や全文検索手法が用いられている。
【０００３】
全文検索手法は、任意の検索文字列と検索対象の全ての文書との間で照合を行い、検索文字列を含む文書を漏れなく抽出する方法である。これにより、キーワード検索手法のように検索対象となる全ての文書に対してキーワードを予め付与する必要がない。また、ユーザが取得したいキーワードをインターネット上から検索漏れのないように取得することが可能となる。
【０００４】
なお、全文検索において、全文索引を作成するためには、文書データを索引単位に分割する分割処理手段が必要である。分割処理手段としては、Ｎ文字のつながりを単位として分割する方法や、形態素解析を用いて単語単位で分割する方法等がある。
【０００５】
Ｎ文字のつながりを単位として分割する方法は、文書データをＮ文字単位に分割し、各文字列が含まれた文書データの情報と、文字列の位置情報とによるインデックスを検索し、文字列間の位置関係を調べ、該当する文書データを抽出する方法である。
【０００６】
また、形態素解析方法は、日本語を解析するための辞書を使用して、入力された文書データを単語単位に分解し、該分解した単語単位の中から名詞などのキーワードとなる用語を抽出する方法である。
【０００７】
しかしながら、最も中核となる大規模インデックス作成部分の技術である全文検索技術は、様々な意味でのトレードオフが避けられない。
【０００８】
そこで、複数の言語の文を含む文書データに対して、検索の際に用いるインデックスを作成して文書データの検索を行う新規な全文検索装置がある（特許文献１参照）。
【０００９】
この特許文献１は、複数の言語の文を含む文書データを格納し、該格納した文書データと異なる言語の文書データに対応してそれぞれに形態素解析を行う。そして、文書データのキーワードを抽出し、該抽出したキーワードに対応する文書データの識別子と共にインデックスとして登録する。そして、入力された検索条件から単語を切り出し、該切り出した単語とインデックスとのキーワードを照合して、その照合結果により検索条件に適合する文書データを読み出す。これにより、ユーザの所望に応じた多言語文書データの登録処理や検索処理が可能となる。
【００１０】
また、多言語文書データに関する索引を言語ごとに分けて格納することで文書データを管理し、該管理した文書データを用いて文書データの検索処理を行う新規な全文検索装置がある（特許文献２参照）。
【００１１】
この特許文献２は、複数の言語の文字を含む多言語文書データの言語を識別し、該識別した多言語データに関する索引を言語別に作成し、言語ごとに格納する。そして、言語ごとの索引を使用して多言語文書データの検索を行うことで、言語ごとに区別して管理することが可能となる。
【００１２】
また、本出願人による従来技術として、複数の文書データをまとめて１つの文書データとして管理、または、検索できる全文検索装置がある（特許文献３参照）。
【００１３】
この特許文献３は、部分文書に分割された文書データを１つの文書データとして登録し、該登録された複数の文書データの中から指定された文字列を含む部分文書データを検索することで、部分文書単位での取り出しや、検索語の出現箇所表示を行うことが可能となる。
【００１４】
【特許文献１】
特開平９−５０４４２号公報
【特許文献２】
特開２００１−６７３６８号公報
【特許文献３】
特開２００１−２４９９４３号公報
【００１５】
【発明が解決しようとする課題】
しかしながら、特許文献１は、１つの文書データが複数の言語から構成されているため制限が少ないが、分割処理手段（多言語キーワード抽出部）において、１つの文書データに対して複数の形態素解析処理を行うため、処理速度は低下すると考えられる。また、キーワードしか抽出していないため、任意の文字列での全文検索が不可能である。
【００１６】
また、特許文献２では、１つの文書データが複数の言語からなっているため、制限が少ないが、言語別に文書データ及び索引を格納するので、登録処理や検索処理の処理速度が低下すると考えられる。
【００１７】
また、特許文献３では、複数の文書データをまとめて１つの文書データとして管理、検索できるが、言語情報に関する記述は何ら言及されていない。
【００１８】
また、Ｎ文字のつながりを単位とする分割処理手段では、言語情報への依存性はなくなるが、分割数が増え索引単位が多くなるので、全文索引記憶部に要する領域が大きくなる、また、登録処理、検索処理、で処理すべきデータ量が多くなるため処理速度が低下する等の問題点が考えられる。
【００１９】
一方、形態素解析を用いる分割処理手段は、単語単位に分割するので索引単位の数は少なくなるが、その実装が言語情報に依存するという問題がある。例えば、欧米語のようにワード間に空白がある言語の場合、分割処理手段はスペースなどの文字の切れ目を検出し、文字の連続部分を単語とすればよく実装が簡単である。しかし、日本語のように文書中に空白などの区切りのない言語の場合、ワード自体の識別が困難である。従って、分割処理手段は、構文解析技法を用いて単語を検出する必要があり実装は難しいものになる。これは、従来の全文検索装置では、全文検索の対象とする言語を１つとし、その言語に最適な１つの分割処理手段しか有していないことになる。また、日本語の表記の多様性（複合語、カタカナ異表記）から、単語のインデックスを作成しても検索漏れが生じていた。
【００２０】
本発明は上記事情に鑑みてなされたものであり、複数の言語の文書データに対応した全文検索装置、文書データの処理プログラム及び記録媒体を提供することを目的とする。
【００２１】
【課題を解決するための手段】
かかる目的を達成するために本発明は以下のような特徴を有する。
請求項１記載の発明は、それぞれ異なる言語で記述された複数の文書データから、指定された文字列を含む文書データを検索する全文検索装置であって、文書データと該文書データの言語情報とを入力する入力手段と、文書データと該文書データの言語情報とを記憶する文書データ記憶手段と、言語情報ごとに文書データの件数を記憶する言語別件数記憶手段と、文書データに対して該文書データの言語情報の単語分割処理と表記正規化処理とを行い索引単位に分割する複数の分割処理手段と、検索の際に用いる索引情報を言語情報によらず記憶する全文索引記憶手段と、入力手段から入力された文書データと言語情報とを、その文書データを示す文書識別情報と共に文書データ記憶手段に登録し、その言語情報の文書データの件数を増加させ言語別件数記憶手段に登録すると共に、その言語情報の分割処理手段を用いて文書データを索引単位に分割し、表記正規化処理後の表記が異なる索引単位をその文書データ中での出現位置情報と文書データの文書識別情報と共に索引情報として全文索引記憶手段に登録する登録処理手段と、を有することを特徴とする。
【００２２】
請求項２記載の発明は、請求項１記載の全文検索装置において、入力手段を用いて入力される検索文字列を、言語情報ごとの複数の分割処理手段を用いて分割し、該分割して得られた複数の正規化検索文字列を用いて検索式を生成し、該生成した検索式を用いて全文索引記憶手段から検索する検索処理手段を有することを特徴とする。
【００２３】
請求項３記載の発明は、請求項２記載の全文検索装置において、検索処理手段は、言語別件数記憶手段に記憶されている言語情報を調べ、それらの言語情報の分割処理手段のみを用いて分割して得られた複数の正規化検索文字列を用いて、検索式を生成することを特徴とする。
【００２４】
請求項４記載の発明は、文書データと該文書データの言語情報とを入力する入力手段と、文書データと該文書データの言語情報とを記憶する文書データ記憶手段と、言語情報ごとに文書データの件数を記憶する言語別件数記憶手段と、文書データに対して該文書データの言語情報の単語分割処理と表記正規化処理とを行い索引単位に分割する複数の分割処理手段と、検索の際に用いる索引情報を言語情報によらず記憶する全文索引記憶手段と、文書データの登録、検索の処理を行う処理手段と、を有し、それぞれ異なる言語で記述された複数の文書データから、指定された文字列を含む文書データを検索する全文検索装置において、実行される文書データの処理プログラムであって、入力手段から入力された文書データと言語情報とを、その文書データを示す文書識別情報と共に文書データ記憶手段に登録し、その言語情報の文書データの件数を増加させ言語別件数記憶手段に登録すると共に、その言語情報の分割処理手段を用いて文書データを索引単位に分割し、表記正規化処理後の表記が異なる索引単位をその文書データ中での出現位置情報と文書データの文書識別情報と共に索引情報として全文索引記憶手段に登録する登録処理を、処理手段に実行させることを特徴とする。
【００２５】
請求項５記載の発明は、請求項４記載の文書データの処理プログラムにおいて、入力手段を用いて入力される検索文字列を、言語情報ごとの複数の分割処理手段を用いて分割し、該分割して得られた複数の正規化検索文字列を用いて検索式を生成し、該生成した検索式を用いて全文索引記憶手段から検索する検索処理を、処理手段に実行させることを特徴とする。
【００２６】
請求項６記載の発明は、請求項５記載の文書データの処理プログラムにおいて、検索処理は、言語別件数記憶手段に記憶されている言語情報を調べ、それらの言語情報の分割処理手段のみを用いて分割して得られた複数の正規化検索文字列を用いて、検索式を生成することを特徴とする。
【００２７】
請求項７記載の発明は、請求項４から請求項６の何れか１項に記載の文書データの処理プログラムをコンピュータ読み取り可能な記録媒体に記録したことを特徴とする。
【００２８】
【発明の実施の形態】
以下、添付図面を参照しながら本発明にかかる実施の形態について詳細に説明する。
【００２９】
まず、図１を参照しながら、本発明にかかる全文検索装置の構成について説明する。
【００３０】
本発明にかかる全文検索装置は、入力手段１と、出力手段２と、文書データ記憶部３と、言語別件数記憶部４と、全文索引記憶部５と、複数の分割処理手段６と、登録処理手段７と、検索処理手段８と、を有して構成されている。以下、全文検索装置を構成する各部の機能について説明する。
【００３１】
入力手段１は、文書データとその文書データの言語情報、または、検索条件を入力する処理部である。
【００３２】
出力手段２は、検索処理手段８により検索された検索結果を出力する処理部である。
【００３３】
文書データ記憶部３は、文書データとその文書データの言語情報とを記憶する記憶部である。
【００３４】
言語別件数記憶部４は、言語情報ごとの文書データの件数を記憶する記憶部である。
【００３５】
全文索引記憶部５は、検索処理の際に用いる索引情報を言語情報によらず記憶する記憶部である。
【００３６】
分割処理手段６は、文書データにその文書データの言語情報の単語分割処理と表記正規化処理とを行い、索引単位に分割する処理部である。
【００３７】
登録処理手段７は、入力手段１から入力された文書データとその言語情報とを、その文書データを示す識別子（文書識別子）と共に、文書データ記憶部３に登録する。また、その言語の文書データ件数を増加し、言語別件数記憶部４に登録すると共に、言語ごとの複数の分割処理手段６を用いて文書データを索引単位に分割、正規化し、その正規化した後の表記が異なる索引単位を、その文書データの中での出現位置情報と文書データの文書識別子と共に全文索引記憶部５に登録する処理部である。
【００３８】
検索処理手段８は、入力手段１を用いて入力される検索条件を、言語ごとの複数の分割処理手段６を用いて分割し、その分割して得られた複数の正規化検索文字列を用いて検索式を生成し、該生成した検索式を用いて全文索引記憶部５から検索する処理部である。
【００３９】
なお、全文検索装置において文書データの登録時に索引単位の表記を正規化して全文索引記憶部５に格納し、また、検索処理時においても検索語の表記を正規化して検索処理を行うことにより、クライアントの表記によらない検索結果を求めることが可能となる。このとき行われる表記の正規化も言語に依存する処理である。
【００４０】
特に、欧米語を対象とするステミングは、言語に依存する処理である。ステミングとは、語の語幹を基に検索を行うことで、例えば、英語の「ｗａｌｋｅｄ」、「ｗａｌｋｅｒ」、「ｗａｌｋｉｎｇ」はステミングにより「ｗａｌｋ」となり、これらの語を同時に検索結果とすることができるというものである。
【００４１】
例えば、英語では単数形・複数形の正規化、活用形の正規化、音標記号の表記の正規化、大文字・小文字の正規化などの処理が正規化処理になる。なお、これらも言語に依存する処理である。日本語の場合には、漢字の正字・異体字の正規化、カタカナの表記の正規化（「インターフェース」、「インターフェイス」、「インタフェイス」を同じ物とするなど）、記号の表記の正規化（括弧類、引用符、中黒のあるなしなど）、全角・半角の正規化などの処理が正規化処理になる。
【００４２】
次に、図２を参照しながら本実施の形態における全文検索装置のハードウェア構成について説明する。
【００４３】
本実施の形態における全文検索装置は、入力装置２０と、出力装置２１と、主制御装置２２と、記憶装置２３と、入出力制御装置２４と、を有して構成される。
【００４４】
上記構成からなる全文検索装置において、入力装置２０は、入力手段１で実現される。また、出力装置２１は、出力手段２で実現される。また、主制御装置２２は、分割処理手段６と、登録処理手段７と、検索処理手段８と、の各種処理手段で実現される。なお、記憶装置２３は、文書データ記憶部３と、言語別件数記憶部４と、全文索引記憶部５と、の各種記憶部に相当する。例えば、１つの限られた記憶装置を用いて本発明にかかる全文検索を行う場合、検索処理をメインに行うか、または、登録処理をメインに行うかで、その使用する領域を割り当てる。これにより処理を効率的に行うことが可能となる。なお、主制御装置２２にはＣＰＵ、メモリ等が用いられる。また、入出力制御装置２４は、主制御装置２２の制御信号に従って入力装置２０と出力装置２１とを制御する。
【００４５】
次に、図３を参照しながら本実施の形態におけるクライアント／サーバでのハードウェア構成について説明する。
【００４６】
クライアント側２５は、入力装置２０と、出力装置２１と、主制御装置２２と、ネットワーク制御装置２６と、入出力制御装置２４と、を有して構成されている。
サーバ側２７は、ネットワーク制御装置２６と、主制御装置２２と、記憶装置２３と、を有して構成されている。なお、クライアント２５とサーバ２７とはネットワーク２８を介して接続されている。
【００４７】
上記構成からなるクライアント／サーバでのハードウェア構成において、入力装置２０は、入力手段１で実現される。また、出力装置２１は出力手段２で実現される。また、主制御装置２２は、分割処理手段６と、登録処理手段７と、検索処理手段８と、の各処理手段で実現される。なお、記憶装置２３は、文書データ記憶部３と、言語別件数記憶部４と、全文索引記憶部５と、の各記憶部に相当する。また、ネットワーク制御装置２６は、ネットワーク２８を介してクライアント２５とサーバ２７との間のデータ伝送等の制御を行う。さらに、クライアント２５の入出力制御装置２４は、主制御装置２２の制御信号に従って入力装置２０と出力装置２１とを制御する。
【００４８】
ここで、本発明にかかる全文検索装置における処理動作の概要を説明する。
まず、第１段階として、検索データを収集する。この検索データは、ローカルディスク上のデータであることもあり、イントラネット内部にロボットを走らせて収集したデータであることもある。また、インターネット全体にロボットを走らせることもある。
【００４９】
第２段階として、第１段階で収集した検索データを文書ファイルに通し、インデクサにおいてインデクシングする。
【００５０】
次に、実際の運用段階である第３段階として、クライアントからの検索要求に対して検索エンジン部分のインデックスを用いて検索し、検索結果をクライアントに送信する。
【００５１】
ここで、上記第３段階において用いるインデックスを作成するために欠かすことの出来ない分割処理手段の一例として形態素解析について説明する。
【００５２】
日本語などの文書では単語間に空白がなく区切られていないので、そのままの文書では基本とする検索インデックスを作成することが出来ない。そこで、形態素辞書と形態素に関する文法の知識を用いて、文書を単語単位に「分かち書き」し、それぞれにおける語の構文上の役割を決める形態素解析を用いる。「分かち書き」とは、文書を書く時に語と語、または、文節と文節の間に空白を置く書き方である。
【００５３】
なお、形態素とは、それ以上分割出来ない語の単位であり、一般的には、意味を持つ最小の要素のことをいい、文はこの形態素で構成される。例えば、「文書を検索する」という文を、「文書」、「を」、「検索」、「する」、という形態素に分割することで、それぞれの形態素に意味を与え、構文解析や意味解析、文脈理解などの自然言語処理に活用する。
【００５４】
形態素解析を行うための重要な処理は、「分かち書き」と「単語の品詞の同定」と「辞書にない語の処理」との３つであるが、形態素に分解する際、その切り出しには多様な組み合わせがあり、さらに切り出された方法には品詞上の多様な方法が発生する。
【００５５】
なお、以下に主な解析方法について説明する。
【００５６】
最長一致法は、与えられた文字列を右から順に走査し、辞書に登録されている単語のうち最も長く一致するものを選択する方法である。
【００５７】
字種切り法は、区切り符号（句読点など）、漢字、カタカナ、英字、数字、平仮名など、字種の切れ目を利用して、優先度でまとめて切る方法である。
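As an illustration of the character-type splitting method (字種切り法) described above, the following minimal Python sketch cuts text wherever the character class changes. The function names `char_class` and `split_by_char_class` and the code-point ranges are assumptions made for this example and do not come from the patent. The priority-based grouping mentioned in the text is not modeled.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify one character by script (rough, illustrative ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"  # any other alphabetic character, e.g. Latin
    return "other"       # punctuation, spaces, symbols


def split_by_char_class(text: str) -> list:
    """Split text into runs of characters that share the same class."""
    return ["".join(run) for _, run in groupby(text, key=char_class)]


print(split_by_char_class("文書を検索する"))
# ['文書', 'を', '検索', 'する']
```

For this short sentence the runs happen to coincide with the morphemes given in the text, but in general a change of character class is only a heuristic for a word boundary.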
【００５８】
文節数最小法は、文書を文節数が最小になるように切る方法である。なお、文節とは、「名詞＋助詞」、「動詞」などのまとまりを言う。
【００５９】
接続規則法は、単語Ａの次に単語Ｂが接続可能かどうかを記載した接続表を用いる方法である。単語では組み合わせが膨大になるため、品詞の接続、字種の接続、熟語を形成する接続などの規則を適用する。
【００６０】
このように形態解析方式の全文検索では、形態素とその形態素に与えられた意味情報を用いて、キーワードの自動抽出や索引（検索インデックス）作成、あるいは、自然言語インターフェースを実現している。
【００６１】
次に、本発明にかかる全文検索装置における処理動作について説明する。
【００６２】
（文書データの登録処理）
まず、図４を参照しながら、全文検索装置における文書データの登録処理について説明する。
【００６３】
文書データの登録処理を実行するには、まず、クライアントが文書データを作成し（ステップＳ１）、入力手段１からその作成した文書データとその文書データの言語情報とを登録処理手段７に登録する（ステップＳ２）。次に、登録処理手段７は、入力手段１から入力された文書データとその文書データの言語情報とを文書データ記憶部３に登録し、同時にその文書データを示す識別子（文書識別子）を設定する（ステップＳ３）。なお、文書データ記憶部３の概念図を図５に示す。例えば、図５には、「・・・検索処理手段は、・・・ｓｅａｒｃｈ命令にしたがい・・・」という文書データは、文書識別子が１００で設定されており、言語情報が言語Ａで設定されている。また、「・・・ＳｉｅｓｕｃｈｔｅｎｄａｓＨｏｌｚｎａｃｈｄｅｍｆｅｈｌｅｎｄｅｎｋｉｎｄ．・・・」という文書データは、文書識別子が１０５で設定されており、言語情報が言語Ｂで設定されている。
【００６４】
次に、登録処理手段７は、その言語の文書データ件数を増加させ、言語別件数記憶部４に登録する（ステップＳ４）。なお、言語別件数記憶部４の概念図を図６に示す。言語別件数記憶部４は、言語情報と、文書データ件数と、で構成されており、各言語情報毎に文書データ件数が記憶されている。
【００６５】
次に、登録処理手段７は、その言語情報の分割処理手段６を用いて文書データを索引単位に分割、正規化する（ステップＳ５）。そして、登録処理手段７は、正規化後の表記が異なる索引単位を、その文書データ中での出現位置情報と文書識別子と共に全文索引記憶部５に登録する（ステップＳ６）。
【００６６】
なお、全文索引記憶部５の概念図を図７に示す。全文索引記憶部５は、索引単位と転置リスト｛文書識別子，出現回数，出現位置｝とで構成されており、例えば、索引単位：「検索」に対する転置リスト：｛１００，２，（２０，６０）｝において、「１００」は文書識別子（文書ＩＤ）を示し、「２」は出現回数を示し、「（２０、６０）」は出現位置を示す。
【００６７】
従って、「検索」に対する転置リスト｛１００，２，（２０，６０）｝は、「文書１００には「検索」は２回出現し、その出現位置は２０，６０文字目である」という情報を示す。また、「ｓｅａｒｃｈ」に対する転置リスト｛１００，３，（２６０，２８０，３２０）｝は、「文書１００には「ｓｅａｒｃｈ」は３回出現し、その出現位置は２６０，２８０，３２０文字目である」という情報を示す。同様に、「ｓｕｃｈｅ」に対する転置リスト｛１０５，２，（１０，５０）｝は、「文書１０５には「ｓｕｃｈｅ」は２回出現し、その出現位置は１０，５０文字目である」という情報を示すこととなる。
【００６８】
このように、全文検索装置において、登録処理手段７は、入力手段１から入力された文書データとその言語情報とを、その文書データを示す識別子（文書識別子）と共に、文書データ記憶部３に登録する。また、その言語の文書データ件数を増加し、言語別件数記憶部４に登録すると共に、登録処理手段７は、言語ごとの複数の分割処理手段６を用いて文書データを索引単位に分割、正規化し、その正規化後の表記が異なる索引単位を、その文書データの中での出現位置情報と文書データの文書識別子と共に全文索引記憶部５に登録することで、検索処理の効率化を図ることが可能となる。
【００６９】
（文書データの第１の検索処理）
次に、図８を参照しながら、全文検索装置における文書データの第１の検索処理について説明する。
【００７０】
文書データの検索処理を実行するには、まず、クライアントが入力手段１から検索文字列を検索処理手段８に入力する（ステップＳ１０）。そして、検索処理手段８は言語ごとの複数の分割処理手段６を用いて、入力手段１から入力された検索文字列を索引単位に分割、正規化する（ステップＳ１１）。次に、検索処理手段８は、分割して得られた複数の索引単位の論理和を検索条件とし、全文索引を用いて索引単位（検索文字列）を含む文書データの文書識別子の集合（Ｒｓ）を全文索引記憶部５から取得する（ステップＳ１２）。そして、検索処理手段８は、全文索引記憶部５から取得した索引単位（検索文字列）を含む文書データの文書識別子の集合（Ｒｓ）を、出力手段２を通じてクライアントに出力する（ステップＳ１３）。
【００７１】
このように、全文検索装置において、検索処理手段８は、入力手段１から入力される検索文字列を、言語ごとの複数の分割処理手段６を用いて分割し、その分割して得られた複数の正規化した検索文字列を用いて検索式を生成し、全文索引記憶部５から検索することで、検索文字列の言語情報を付加することなく検索処理を実行できるため、クライアントの負荷を軽減することが可能となる。
【００７２】
（文書データの第２の検索処理）
次に、図９を参照しながら、全文検索装置における文書データの第２の検索処理について説明する。
【００７３】
文書データの検索処理を実行するには、まず、クライアントが入力手段１から検索文字列を検索処理手段８に入力する（ステップＳ２０）。次に、検索処理手段８は言語別件数記憶部４に記憶されている言語情報を調べ、それらの言語情報の分割処理手段６を用いて、入力手段１により入力された検索文字列を索引単位に分割、正規化する（ステップＳ２１）。そして、検索処理手段８は、得られた複数の索引単位の論理和を検索条件とし、全文索引を用いて索引単位（検索文字列）を含む文書データの文書識別子の集合（Ｒｓ）を全文索引記憶部５から取得する（ステップＳ２２）。そして、検索処理手段８は、全文索引記憶部５から取得した索引単位（検索文字列）を含む文書データの文書識別子の集合（Ｒｓ）を、出力手段２を通じてクライアントに出力する（ステップＳ２３）。
【００７４】
このように、全文検索装置において、検索処理手段８は、言語別件数記憶部４に記憶されている言語情報を調べ、それらの言語情報の分割処理手段６のみを用いて、入力手段１により入力された検索文字列を索引単位に分割し、その分割して得られた複数の正規化検索文字列を用いて検索式を生成することで、検索文字列の展開数を最小にするので、検索処理を効率よく行うことが可能となり、クライアントが検索結果を取得するまでの待ち時間を短くすることが可能となる。
【００７５】
なお、上述する実施の形態は、本発明の好適な実施の形態であり、本発明の要旨を逸脱しない範囲内において種々変更実施が可能である。例えば、上記の実施の形態における全文検索装置における処理動作をプログラムとして、コンピュータ等の情報処理装置において実行させることでも、本実施の形態における全文検索装置を構築することは可能である。また、そのプログラムをコンピュータ読み取り可能な記録媒体に記録して、その記録媒体を情報処理装置に搭載させることでも、本実施の形態における全文検索装置を構築することは可能である。
【００７６】
【発明の効果】
以上の説明より明らかなように本発明は以下のような効果を奏する。
本発明にかかる全文検索装置、文書データの処理プログラム及び記録媒体は、文書データの言語情報を付与することなく、文書データの処理を行うことが可能となるため、クライアントの負荷を軽減することができる。
【図面の簡単な説明】
【図１】本発明にかかる全文検索装置の構成を示すブロック図である。
【図２】スタンドアロン形態でのハードウェアの構成を示すブロック図である。
【図３】クライアント／サーバ形態でのハードウェアの構成を示すブロック図である。
【図４】本発明にかかる全文検索装置における文書データの登録処理の手順を示すフローチャートである。
【図５】本発明にかかる全文検索装置を構成する文書データ記憶部の概念図である。
【図６】本発明にかかる全文検索装置を構成する言語別件数記憶部の概念図である。
【図７】本発明にかかる全文検索装置を構成する全文索引記憶部の概念図である。
【図８】本発明にかかる全文検索装置における文書データの第１の検索処理の手順を示すフローチャートである。
【図９】本発明にかかる全文検索装置における文書データの第２の検索処理の手順を示すフローチャートである。
【符号の説明】
１入力手段
２出力手段
３文書データ記憶部
４言語別件数記憶部
５全文索引記憶部
６分割処理手段
７登録処理手段
８検索処理手段
２０入力装置
２１出力装置
２２主制御装置
２３記憶装置
２４入出力制御装置
２５クライアント
２６ネットワーク制御装置
２７サーバ
２８ネットワーク

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a full-text search device, a document data processing program, and a recording medium for retrieving, from a plurality of document data, the document data that contain a specified character string. In particular, it relates to a full-text search device, a document data processing program, and a recording medium that are well suited to managing large volumes of document data, as in document management systems, electronic library systems, and patent publication search systems.
[0002]
[Prior art]
With the recent development of information and communication technology, an environment in which large volumes of electronic documents can be accessed is taking shape. Against this background, document search devices have been proposed that let a user retrieve a desired document accurately and quickly. Such document search devices use either a keyword search method or a full-text search method.
[0003]
The full-text search method collates an arbitrary search character string against every document to be searched and extracts, without omission, the documents that contain the search character string. Unlike the keyword search method, it does not require keywords to be assigned in advance to every document to be searched. It also lets the user retrieve the desired keywords from the Internet without search omissions.
[0004]
To create a full-text index for full-text search, a division processing means that divides document data into index units is required. Division methods include dividing the text into runs of N consecutive characters and dividing it into words by means of morphological analysis.
[0005]
The method that divides text into runs of N characters splits the document data into N-character units, searches an index built from the information on which document data contain each character string and the position information of that string, examines the positional relationships between the character strings, and extracts the matching document data.
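The N-character division described above can be sketched in Python as follows. This is a minimal illustration rather than the method of any cited document: the helper names `ngram_units`, `build_index`, and `search` are hypothetical, and the use of overlapping units (a window that advances one character at a time) is an assumption, since the text does not say whether the units overlap.

```python
from collections import defaultdict


def ngram_units(text, n=2):
    """Return (unit, start position) pairs for every overlapping n-character run."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each n-character unit to the (document id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for unit, pos in ngram_units(text, n):
            index[unit].append((doc_id, pos))
    return index


def search(index, query, n=2):
    """Return ids of documents containing query (length >= n) as a contiguous string.

    The positional relationship between units is checked: the k-th unit of the
    query must occur exactly k characters after the first unit.
    """
    units = ngram_units(query, n)
    if not units:
        return set()
    candidates = set(index.get(units[0][0], []))
    for unit, offset in units[1:]:
        postings = set(index.get(unit, []))
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return {d for d, _ in candidates}


docs = {1: "全文検索装置", 2: "検索全文"}
index = build_index(docs)
print(search(index, "文検索"))  # {1}
```

Document 2 contains both "検索" and "全文" but not in the order the query requires, so the position check excludes it.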
[0006]
The morphological analysis method uses a dictionary for analyzing Japanese to break the input document data down into word units and then extracts, from those word units, the terms that serve as keywords, such as nouns.
[0007]
However, full-text search technology, whose core is the creation of a large-scale index, cannot avoid trade-offs of various kinds.
[0008]
To address this, there is a novel full-text search device that creates a search index for document data containing sentences in a plurality of languages and uses it to search the document data (see Patent Document 1).
[0009]
The device of Patent Document 1 stores document data containing sentences in a plurality of languages and performs morphological analysis separately for each of the different languages in the stored document data. It then extracts keywords from the document data and registers each extracted keyword as an index together with the identifier of the corresponding document data. Words are cut out of the input search condition and collated with the keywords in the index, and the document data that match the search condition are read out based on the collation result. This enables registration and search processing of multilingual document data in line with the user's wishes.
[0010]
There is also a novel full-text search device that manages document data by storing the indexes for multilingual document data separately for each language, and that searches the document data managed in this way (see Patent Document 2).
[0011]
The device of Patent Document 2 identifies the languages of multilingual document data containing characters of a plurality of languages, creates an index of the identified multilingual data for each language, and stores the indexes language by language. Because the multilingual document data are searched using the per-language indexes, the data can be managed separately for each language.
[0012]
Further, as a prior art by the present applicant, there is a full-text search device that can manage or search a plurality of document data as a single document data (see Patent Document 3).
[0013]
The device of Patent Document 3 registers document data that have been divided into partial documents as a single document data and searches the registered document data for partial documents containing a designated character string. This makes it possible to retrieve results in units of partial documents and to display where the search terms appear.
[0014]
[Patent Document 1]
JP H09-50442 A
[Patent Document 2]
JP 2001-67368 A
[Patent Document 3]
JP 2001-249943 A
[0015]
[Problems to be solved by the invention]
Patent Document 1 imposes few restrictions because a single document data may be composed of a plurality of languages. However, its division processing means (the multilingual keyword extraction unit) runs several morphological analysis processes on each document data, so processing speed is likely to drop. Moreover, because only keywords are extracted, full-text search with an arbitrary character string is impossible.
[0016]
Patent Document 2 likewise imposes few restrictions because a single document data may consist of a plurality of languages. However, because document data and indexes are stored separately for each language, the speed of registration and search processing is likely to drop.
[0017]
Patent Document 3 can manage and search a plurality of document data as a single document data, but it makes no mention of language information.
[0018]
A division processing means that uses runs of N characters as its unit does not depend on language information. However, the number of divisions, and hence the number of index units, increases, so the full-text index storage unit needs more space. In addition, the amount of data handled during registration and search grows, so processing speed is likely to drop.
[0019]
By contrast, a division processing means that uses morphological analysis divides text into words, so the number of index units is smaller, but its implementation depends on the language. For a language with spaces between words, such as Western languages, the division processing means only has to detect breaks such as spaces and treat each continuous run of characters as a word, so implementation is simple. For a language such as Japanese, in which the text contains no delimiters such as spaces, identifying the words themselves is difficult. The division processing means must therefore detect words using syntactic analysis techniques, which makes it hard to implement. As a result, a conventional full-text search device targets a single language and has only one division processing means, the one best suited to that language. In addition, because of the variety of Japanese notation (compound words, variant katakana spellings), search omissions occurred even when a word index was created.
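The contrast drawn above can be seen in a few lines of Python. This is a minimal sketch in which the function name `split_on_whitespace` is hypothetical and the regular expression `\S+` stands in for "a continuous run of characters between spaces".

```python
import re


def split_on_whitespace(text):
    """Return (word, start position) pairs for each run of non-space characters."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


print(split_on_whitespace("search the index"))
# [('search', 0), ('the', 7), ('index', 11)]

print(split_on_whitespace("文書を検索する"))
# [('文書を検索する', 0)]
```

The English sentence splits into words directly, while the Japanese sentence comes back as a single unit, which is why a language-specific division step is needed.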
[0020]
The present invention has been made in view of the above circumstances, and its object is to provide a full-text search device, a document data processing program, and a recording medium that support document data in a plurality of languages.
[0021]
[Means for Solving the Problems]
In order to achieve this object, the present invention has the following features.
The invention according to claim 1 is a full-text search device that searches a plurality of document data, each written in a different language, for document data containing a designated character string, the device comprising: input means for inputting document data and language information of the document data; document data storage means for storing the document data and the language information of the document data; per-language count storage means for storing the number of document data for each language information; a plurality of division processing means, each of which applies to document data the word division processing and notation normalization processing for the language of that document data and divides it into index units; full-text index storage means for storing index information used in searching, irrespective of language information; and registration processing means that registers the document data and language information input from the input means in the document data storage means together with document identification information indicating the document data, increments the number of document data for that language information and registers the count in the per-language count storage means, divides the document data into index units using the division processing means for that language information, and registers the index units whose notation differs after notation normalization in the full-text index storage means as index information, together with their appearance position information in the document data and the document identification information of the document data.
[0022]
The invention according to claim 2 is the full-text search device according to claim 1, further comprising search processing means that divides a search character string input through the input means using the plurality of division processing means for the respective language information, generates a search expression from the plurality of normalized search character strings obtained by the division, and searches the full-text index storage means using the generated search expression.
[0023]
The invention according to claim 3 is the full-text search device according to claim 2, wherein the search processing means checks the language information stored in the per-language count storage means and generates the search expression from the plurality of normalized search character strings obtained by dividing the search character string using only the division processing means for those languages.
[0024]
The invention according to claim 4 is a document data processing program executed in a full-text search device that searches a plurality of document data, each written in a different language, for document data containing a designated character string, the device having: input means for inputting document data and language information of the document data; document data storage means for storing the document data and the language information of the document data; per-language count storage means for storing the number of document data for each language information; a plurality of division processing means, each of which applies to document data the word division processing and notation normalization processing for the language of that document data and divides it into index units; full-text index storage means for storing index information used in searching, irrespective of language information; and processing means for performing registration and search of document data. The program causes the processing means to execute registration processing that registers the document data and language information input from the input means in the document data storage means together with document identification information indicating the document data, increments the number of document data for that language information and registers the count in the per-language count storage means, divides the document data into index units using the division processing means for that language information, and registers the index units whose notation differs after notation normalization in the full-text index storage means as index information, together with their appearance position information in the document data and the document identification information of the document data.
[0025]
The invention according to claim 5 is the document data processing program according to claim 4, which causes the processing means to execute search processing that divides a search character string input through the input means using the plurality of division processing means for the respective language information, generates a search expression from the plurality of normalized search character strings obtained by the division, and searches the full-text index storage means using the generated search expression.
[0026]
The invention according to claim 6 is the document data processing program according to claim 5, wherein the search processing checks the language information stored in the per-language count storage means and generates the search expression from the plurality of normalized search character strings obtained by dividing the search character string using only the division processing means for those languages.
[0027]
The invention according to claim 7 is characterized in that the document data processing program according to any one of claims 4 to 6 is recorded on a computer-readable recording medium.
[0028]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0029]
First, the configuration of the full-text search device according to the present invention will be described with reference to FIG. 1.
[0030]
The full-text search device according to the present invention comprises an input means 1, an output means 2, a document data storage unit 3, a per-language count storage unit 4, a full-text index storage unit 5, a plurality of division processing means 6, a registration processing means 7, and a search processing means 8. The function of each component of the full-text search device is described below.
[0031]
The input means 1 is a processing unit for inputting either document data together with its language information, or search conditions.
[0032]
The output means 2 is a processing unit that outputs the results retrieved by the search processing means 8.
[0033]
The document data storage unit 3 is a storage unit that stores document data and language information of the document data.
[0034]
The per-language count storage unit 4 is a storage unit that stores the number of document data for each language information.
[0035]
The full-text index storage unit 5 is a storage unit that stores index information used in search processing regardless of language information.
[0036]
The division processing means 6 is a processing unit that applies to document data the word division processing and notation normalization processing for the language of that document data, and divides the document data into index units.
[0037]
The registration processing means 7 registers the document data input from the input means 1 and its language information in the document data storage unit 3, together with an identifier (document identifier) indicating the document data. It also increments the number of document data for that language and registers the count in the per-language count storage unit 4. Using the plurality of division processing means 6 provided for the respective languages, it divides the document data into index units and normalizes them, and it registers the index units whose notation differs after normalization in the full-text index storage unit 5, together with their appearance position information in the document data and the document identifier of the document data.
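The registration flow described above can be sketched as a small in-memory Python class. This is a minimal sketch under stated assumptions, not the patented implementation: the class `FullTextStore`, the toy divider `divide_english`, and the choice of dictionaries as storage are all hypothetical. The three attributes mirror the document data storage unit 3, the per-language count storage unit 4, and the full-text index storage unit 5.

```python
import re
from collections import Counter, defaultdict


def divide_english(text):
    """Toy divider (assumption): lowercase alphabetic runs with start positions."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z]+", text)]


class FullTextStore:
    def __init__(self, dividers):
        self.dividers = dividers        # language -> divider function
        self.documents = {}             # doc id -> (text, language)
        self.counts = Counter()         # language -> number of documents
        self.index = defaultdict(dict)  # unit -> {doc id: [positions]}, shared by all languages
        self._next_id = 1

    def register(self, text, language):
        """Store the document, bump its language count, and index its units."""
        divide = self.dividers[language]  # raises KeyError for an unsupported language
        doc_id = self._next_id
        self._next_id += 1
        self.documents[doc_id] = (text, language)
        self.counts[language] += 1
        for unit, pos in divide(text):
            # Each distinct normalized unit gets one entry per document,
            # carrying every position where it appears.
            self.index[unit].setdefault(doc_id, []).append(pos)
        return doc_id


store = FullTextStore({"en": divide_english})
doc_id = store.register("Search the search index", "en")
print(store.index["search"])  # {1: [0, 11]}
```

Because the index is keyed only by the normalized unit, entries from every language share one structure, which matches the statement that index information is stored irrespective of language information.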
[0038]
The search processing means 8 is a processing unit that divides a search condition input through the input means 1 using the plurality of division processing means 6 provided for the respective languages, generates a search expression from the plurality of normalized search character strings obtained by the division, and searches the full-text index storage unit 5 using the generated search expression.
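The search flow can be sketched in Python as follows. This is a minimal sketch under assumptions: the dividers `divide_english` and `divide_german` are toys, the sample index entries are illustrative, and the search expression is taken to be the logical OR of the units produced by every divider. The optional `counts` argument models the variant of claim 3, in which only the dividers for languages that actually have registered documents are used.

```python
def divide_english(query):
    """Toy English divider (assumption): lowercase and strip a trailing 'ing'."""
    word = query.lower()
    return [word[:-3]] if word.endswith("ing") and len(word) > 5 else [word]


def divide_german(query):
    """Toy German divider (assumption): lowercase only."""
    return [query.lower()]


DIVIDERS = {"en": divide_english, "de": divide_german}

# index unit -> {document identifier: [appearance positions]}
INDEX = {
    "search": {100: [260, 280, 320]},
    "suche": {105: [10, 50]},
}


def search(query, dividers, index, counts=None):
    """Return ids of documents matching any unit produced by any divider."""
    if counts is not None:
        # Claim 3 variant: skip languages with no registered documents.
        dividers = {lang: d for lang, d in dividers.items() if counts.get(lang, 0) > 0}
    units = set()
    for divide in dividers.values():
        units.update(divide(query))
    doc_ids = set()
    for unit in units:  # logical OR over all normalized units
        doc_ids.update(index.get(unit, {}))
    return doc_ids


print(search("Searching", DIVIDERS, INDEX))  # {100}
```

The caller never states the language of the query. Restricting the dividers with `counts` shrinks the number of expanded units, at the cost of consulting the per-language counts first.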
[0039]
In the full-text search device, the notation of index units is normalized before they are stored in the full-text index storage unit 5 when document data are registered, and the notation of search terms is likewise normalized during search processing. This makes it possible to obtain search results that do not depend on the notation used by the client. The notation normalization performed here is also a language-dependent process.
[0040]
In particular, stemming, which targets Western languages, is a language-dependent process. Stemming means searching on the basis of the stem of a word. For example, the English words “walked”, “walker”, and “walking” all become “walk” after stemming, so all of these words can be returned as search results at once.
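The stemming example above can be reproduced with a deliberately naive suffix stripper. The function `naive_stem` and its suffix list are assumptions made for illustration. Production stemmers use far more elaborate rules, and this sketch will mis-stem many real words.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in ("ing", "ed", "er", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word in ("walked", "walker", "walking"):
    print(word, "->", naive_stem(word))
# walked -> walk
# walker -> walk
# walking -> walk
```

Applying the same function to both index units and search terms is what lets a query for one form match documents containing the others.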
[0041]
For example, in English the normalization processing includes normalization of singular and plural forms, normalization of inflected forms, normalization of phonetic symbol notation, and normalization of uppercase and lowercase letters. These are also language-dependent processes. In Japanese, the normalization processing includes normalization of standard and variant kanji forms, normalization of katakana spellings (for example, treating 「インターフェース」, 「インターフェイス」, and 「インタフェイス」 as the same word), normalization of symbol notation (brackets, quotation marks, the presence or absence of the middle dot, and so on), and normalization of full-width and half-width characters.
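Two of the normalizations listed above, full-width/half-width and uppercase/lowercase, can be approximated with Python's standard `unicodedata` module, as in the sketch below. Using Unicode NFKC normalization for width folding is an assumption of this example, not something the patent specifies. The table `KATAKANA_VARIANTS` is a hypothetical, hand-written mapping that covers only the spelling variants quoted in the text.

```python
import unicodedata

# Hypothetical variant table: maps alternative katakana spellings to one canonical form.
KATAKANA_VARIANTS = {
    "インターフェイス": "インターフェース",
    "インタフェイス": "インターフェース",
}


def normalize_unit(unit):
    """Fold width and case, then map known katakana variants to one spelling."""
    folded = unicodedata.normalize("NFKC", unit).casefold()
    return KATAKANA_VARIANTS.get(folded, folded)


print(normalize_unit("ＳＥＡＲＣＨ"))        # search
print(normalize_unit("インタフェイス"))    # インターフェース
```

Kanji variant forms and symbol normalization would need additional tables of the same kind and are left out of this sketch.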
[0042]
Next, the hardware configuration of the full-text search device in the present embodiment will be described with reference to FIG. 2.
[0043]
The full-text search device according to the present embodiment includes an input device 20, an output device 21, a main control device 22, a storage device 23, and an input / output control device 24.
[0044]
In the full-text search device configured as above, the input device 20 is realized by the input means 1, and the output device 21 is realized by the output means 2. The main control device 22 is realized by the processing means, namely the division processing means 6, the registration processing means 7, and the search processing means 8. The storage device 23 corresponds to the storage units, namely the document data storage unit 3, the per-language count storage unit 4, and the full-text index storage unit 5. For example, when the full-text search according to the present invention is performed with a single storage device of limited capacity, the areas it uses are allocated according to whether search processing or registration processing is the main workload, which allows processing to be carried out efficiently. A CPU, memory, and the like are used for the main control device 22. The input/output control device 24 controls the input device 20 and the output device 21 in accordance with control signals from the main control device 22.
[0045]
Next, the hardware configuration of the client / server in this embodiment will be described with reference to FIG.
[0046]
The client side 25 includes an input device 20, an output device 21, a main control device 22, a network control device 26, and an input / output control device 24.
The server side 27 has a network control device 26, a main control device 22, and a storage device 23. The client 25 and the server 27 are connected via a network 28.
[0047]
In the hardware configuration of the client / server configured as described above, the input device 20 corresponds to the input means 1, and the output device 21 corresponds to the output means 2. Further, the main control device 22 corresponds to the processing means, namely the division processing means 6, the registration processing means 7, and the search processing means 8. The storage device 23 corresponds to the storage units, namely the document data storage unit 3, the number-of-languages storage unit 4, and the full-text index storage unit 5. The network control device 26 controls data transmission and the like between the client 25 and the server 27 via the network 28. Further, the input / output control device 24 of the client 25 controls the input device 20 and the output device 21 in accordance with a control signal from the main control device 22.
[0048]
Here, the outline of the processing operation in the full-text search apparatus according to the present invention will be described.
First, as a first stage, search data is collected. This search data may be data on a local disk, or may be data collected by running a robot inside the intranet. In some cases, robots run across the Internet.
[0049]
As the second stage, the search data collected in the first stage is passed through the document file and indexed by the indexer.
[0050]
Next, in the third stage, which is the actual operation stage, the search engine part processes a search request from the client using the index and transmits the search result to the client.
[0051]
Here, morphological analysis will be described as an example of a division processing unit that is indispensable for creating an index used in the third stage.
[0052]
In documents written in languages such as Japanese, words are not separated by spaces, so a basic search index cannot be created from a document as it is. Therefore, using a morpheme dictionary and grammatical knowledge about morphemes, the document is rewritten with word separation (“wakachi-gaki”) in units of words, and morphological analysis is used to determine the syntactic role of each word. “Wakachi-gaki” is a way of writing a document that leaves a space between words or between clauses.
[0053]
A morpheme is a unit of a word that cannot be divided further; it generally refers to the smallest element that has meaning, and a sentence is composed of morphemes. For example, by dividing the sentence “search a document” into the morphemes “document”, an object-marking particle, “search”, and “do”, each morpheme is given meaning and is used for natural language processing such as syntactic analysis, semantic analysis, and context understanding.
[0054]
There are three important processes in morphological analysis: “separation”, “identification of the part of speech of each word”, and “processing of words not in the dictionary”. A character string can be cut out in many possible combinations, and the assignment of parts of speech varies with the way the words are cut out.
[0055]
The main analysis methods will be described below.
[0056]
The longest match method is a method of scanning a given character string in order from the right and selecting the longest match among the words registered in the dictionary.
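The longest match method can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis used in the embodiment: the function name `longest_match_from_right`, the toy dictionary in the usage note, and the rule that a character with no dictionary match becomes a one-character word are all hypothetical. The scan proceeds from the right end of the string, following the description above.

```python
def longest_match_from_right(text: str, dictionary: set[str]) -> list[str]:
    """Cut `text` into words, scanning from the right (illustrative sketch)."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Assumed fallback: an unmatched single character is kept as a word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected right to left
    return words
```

For example, with the toy dictionary `{"full", "text", "fulltext", "search"}`, the string `"fulltextsearch"` is cut into `"fulltext"` and `"search"`, because the longer entry `"fulltext"` is preferred over `"text"`.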
[0057]
The character type cutting method cuts the text, in order of priority, at boundaries between character types such as delimiters (punctuation marks and the like), kanji, katakana, letters, numbers, and hiragana.
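A minimal sketch of cutting at character type boundaries is shown below. The names `char_type` and `cut_by_char_type` and the Unicode ranges used for hiragana, katakana, and kanji are assumptions of this sketch; the priority ordering among character types mentioned above is not modeled, and delimiters are simply discarded.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify one character (assumed, simplified Unicode ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "number"
    if ch.isalpha():
        return "letter"
    return "delimiter"  # punctuation, spaces, and anything else


def cut_by_char_type(text: str) -> list[tuple[str, str]]:
    """Cut `text` wherever the character type changes, dropping delimiters."""
    return [
        (kind, "".join(run))
        for kind, run in groupby(text, key=char_type)
        if kind != "delimiter"
    ]
```

Applied to the Japanese sentence “文書を検索する。”, this sketch yields the runs “文書” (kanji), “を” (hiragana), “検索” (kanji), and “する” (hiragana).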
[0058]
The phrase number minimum method is a method of cutting a document so that the number of phrases is minimized. The phrase means a group of “noun + particle”, “verb”, and the like.
[0059]
The connection rule method uses a connection table that describes whether or not the word B can be connected after the word A. Since the combinations of words are enormous, rules such as part-of-speech connections, character type connections, and connections that form phrases are applied.
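The idea of a connection table can be illustrated with a small sketch that works at the level of parts of speech rather than individual words. The table contents and the names `CONNECTION_TABLE` and `is_connectable` are hypothetical and chosen only for this example.

```python
# Hypothetical connection table: (part of speech A, part of speech B) pairs
# for which B is allowed to follow A.
CONNECTION_TABLE = {
    ("noun", "particle"),
    ("particle", "noun"),
    ("particle", "verb"),
    ("verb", "auxiliary"),
}


def is_connectable(tags: list[str], table=CONNECTION_TABLE) -> bool:
    """Return True if every adjacent pair of parts of speech is in the table."""
    return all((a, b) in table for a, b in zip(tags, tags[1:]))
```

A candidate segmentation whose part-of-speech sequence contains a pair missing from the table, such as two particles in a row in this toy table, would be rejected.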
[0060]
As described above, in full-text search based on morphological analysis, morphemes and the semantic information given to them are used to automatically extract keywords, create an index (search index), or implement a natural language interface.
[0061]
Next, the processing operation in the full-text search apparatus according to the present invention will be described.
[0062]
(Document data registration process)
First, document data registration processing in the full-text search apparatus will be described with reference to FIG.
[0063]
To execute document data registration processing, the client first creates document data (step S1) and inputs the created document data and the language information of the document data from the input means 1 to the registration processing means 7 (step S2). Next, the registration processing means 7 registers the document data input from the input means 1 and the language information of the document data in the document data storage unit 3, and at the same time sets an identifier (document identifier) indicating the document data (step S3). A conceptual diagram of the document data storage unit 3 is shown in FIG. For example, in FIG. 5, the document data “... search processing means ... according to the search command ...” has the document identifier set to 100 and the language information set to language A. In addition, the document data “... Sie suchen das Holz nach dem fehlenden kind...” has the document identifier 105 and the language information B.
[0064]
Next, the registration processing means 7 increments the number of document data for that language and registers it in the number-of-languages storage unit 4 (step S4). A conceptual diagram of the number-of-languages storage unit 4 is shown in FIG. The number-of-languages storage unit 4 holds language information and the number of document data, and the number of document data is stored for each language information.
[0065]
Next, the registration processing means 7 uses the division processing means 6 for that language information to divide the document data into index units and normalize them (step S5). Then, the registration processing means 7 registers the index units having different notations after normalization in the full-text index storage unit 5, together with their appearance position information in the document data and the document identifier of the document data (step S6).
[0066]
A conceptual diagram of the full-text index storage unit 5 is shown in FIG. The full-text index storage unit 5 holds, for each index unit, an inverted list of the form {document identifier, appearance count, appearance position}. For example, in the inverted list {100, 2, (20, 60)} for the index unit “search”, “100” indicates the document identifier (document ID), “2” indicates the number of appearances, and “(20, 60)” indicates the appearance positions.
[0067]
Therefore, the inverted list {100, 2, (20, 60)} for “search” indicates that “search” appears twice in the document 100, at the 20th and 60th characters. Likewise, the inverted list {100, 3, (260, 280, 320)} for “search” indicates that “search” appears three times in the document 100, at the 260th, 280th, and 320th characters. Similarly, the inverted list {105, 2, (10, 50)} for “suche” indicates that “suche” appears twice in the document 105, at the 10th and 50th characters.
[0068]
As described above, in the full-text search apparatus, the registration processing means 7 registers the document data input from the input means 1 and its language information in the document data storage unit 3 together with an identifier (document identifier) indicating the document data. In addition, the registration processing means 7 increments the number of document data for that language and registers it in the number-of-languages storage unit 4, and divides the document data into index units by using the division processing means 6 for that language from among the plurality of division processing means 6. The index units having different notations after normalization are registered in the full-text index storage unit 5 together with their appearance position information in the document data and the document identifier of the document data, which makes it possible to improve the efficiency of the search process.
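Steps S3 to S6 of the registration process can be sketched as follows. This is a simplified illustration under stated assumptions: the class `FullTextStore`, the function `simple_divider`, the starting identifier 100, and the use of zero-based character offsets as appearance positions are all hypothetical, and a real division processing means would perform language-specific word division and normalization.

```python
import re
from collections import defaultdict


def simple_divider(text):
    """Hypothetical division processing: cut on non-word characters, lowercase."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]


class FullTextStore:
    def __init__(self, dividers, first_id=100):
        self.dividers = dividers          # language information -> division function
        self.documents = {}               # stands in for document data storage unit 3
        self.counts = defaultdict(int)    # stands in for number-of-languages storage unit 4
        self.index = defaultdict(dict)    # stands in for full-text index storage unit 5
        self.next_id = first_id

    def register(self, text, language):
        doc_id = self.next_id
        self.next_id += 1
        self.documents[doc_id] = (text, language)            # step S3
        self.counts[language] += 1                           # step S4
        for unit, position in self.dividers[language](text): # step S5
            # step S6: record the appearance position under the document identifier
            self.index[unit].setdefault(doc_id, []).append(position)
        return doc_id

    def inverted_list(self, unit):
        """Return entries of the form (document identifier, count, positions)."""
        postings = self.index.get(unit, {})
        return [(d, len(p), tuple(p)) for d, p in sorted(postings.items())]
```

For instance, after registering the text `"search means and search command"` as language `"A"`, the inverted list for `"search"` in this sketch contains one entry with the document identifier, an appearance count of 2, and the two character offsets.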
[0069]
(First search processing of document data)
Next, the first search process of document data in the full-text search apparatus will be described with reference to FIG.
[0070]
To execute document data search processing, the client first inputs a search character string from the input means 1 to the search processing means 8 (step S10). Then, the search processing means 8 uses the plurality of division processing means 6 for each language to divide the search character string input from the input means 1 into index units and normalize them (step S11). Next, the search processing means 8 uses the logical sum (OR) of the plurality of index units obtained by the division as a search condition and, using the full-text index, acquires from the full-text index storage unit 5 a set (Rs) of document identifiers of document data including the index units (search character string) (step S12). Then, the search processing means 8 outputs the set (Rs) of document identifiers acquired from the full-text index storage unit 5 to the client through the output means 2 (step S13).
[0071]
As described above, in the full-text search apparatus, the search processing means 8 divides the search character string input from the input means 1 by using the plurality of division processing means 6 for each language, generates a search expression using the plurality of normalized search character strings obtained by the division, and searches the full-text index storage unit 5. Search processing can therefore be executed without attaching language information to the search character string, which reduces the load on the client.
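The first search process (steps S11 and S12) can be sketched as follows. The two division functions are hypothetical stand-ins for language-specific division processing means, with deliberately crude normalization rules, and the index is represented as a plain mapping from index unit to a set of document identifiers; none of these details are taken from the embodiment.

```python
def divide_language_a(text):
    # Hypothetical rule for language A: lowercase and strip a trailing "es".
    return [word.lower().removesuffix("es") for word in text.split()]


def divide_language_b(text):
    # Hypothetical rule for language B: lowercase and strip a trailing "n".
    return [word.lower().removesuffix("n") for word in text.split()]


def search_all_languages(query, dividers, index):
    """Expand the query with every division function and OR the results."""
    units = set()
    for divide in dividers.values():      # step S11: all division processing means
        units.update(divide(query))
    hits = set()
    for unit in units:                    # step S12: logical sum (OR) of index units
        hits |= index.get(unit, set())
    return hits


DIVIDERS = {"A": divide_language_a, "B": divide_language_b}
INDEX = {"search": {100}, "suche": {105}}
```

Because every division function is applied, the caller does not need to state the language of the query: in this toy setup, `search_all_languages("suchen", DIVIDERS, INDEX)` finds document 105 through the language B rule even though the language A rule produces no match.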
[0072]
(Second search process of document data)
Next, the second search processing of document data in the full-text search device will be described with reference to FIG.
[0073]
To execute document data search processing, the client first inputs a search character string from the input means 1 to the search processing means 8 (step S20). Next, the search processing means 8 examines the language information stored in the number-of-languages storage unit 4 and, using only the division processing means 6 for that language information, divides the search character string input by the input means 1 into index units and normalizes them (step S21). Then, the search processing means 8 uses the logical sum (OR) of the plurality of index units obtained as a search condition and, using the full-text index, acquires from the full-text index storage unit 5 a set (Rs) of document identifiers of document data including the index units (search character string) (step S22). Then, the search processing means 8 outputs the set (Rs) of document identifiers acquired from the full-text index storage unit 5 to the client through the output means 2 (step S23).
[0074]
In this way, in the full-text search device, the search processing means 8 examines the language information stored in the number-of-languages storage unit 4, divides the search character string input by the input means 1 into index units using only the division processing means 6 for that language information, and generates a search expression using the plurality of normalized search character strings obtained by the division. This minimizes the number of expansions of the search character string, so that the processing can be performed efficiently and the waiting time until the client obtains the search result can be shortened.
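The second search process differs from the first only in which division functions are applied. A minimal sketch, assuming that a language counts as registered when its document count is greater than zero, is shown below; the function name `search_registered_languages` and the data layout are hypothetical.

```python
def search_registered_languages(query, dividers, counts, index):
    """Expand the query only for languages that have registered documents."""
    units = set()
    for language, divide in dividers.items():
        # Step S21: skip languages whose document count is zero or missing.
        if counts.get(language, 0) > 0:
            units.update(divide(query))
    hits = set()
    for unit in units:                    # step S22: logical sum (OR) of index units
        hits |= index.get(unit, set())
    return hits, units
```

Returning the expanded index units alongside the result makes the saving visible: with three division functions but documents registered in only one language, the query is expanded once instead of three times.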
[0075]
The above-described embodiment is a preferred embodiment of the present invention, and various modifications can be made without departing from the gist of the present invention. For example, it is possible to construct the full-text search device in the present embodiment by causing the information processing device such as a computer to execute the processing operation in the full-text search device in the above-described embodiment as a program. In addition, the full-text search apparatus according to the present embodiment can be constructed by recording the program on a computer-readable recording medium and mounting the recording medium on an information processing apparatus.
[0076]
[Effects of the invention]
As apparent from the above description, the present invention has the following effects.
The full-text search device, the document data processing program, and the recording medium according to the present invention can process document data without adding language information of the document data, thereby reducing the load on the client.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a full-text search apparatus according to the present invention.
FIG. 2 is a block diagram showing a hardware configuration in a stand-alone form.
FIG. 3 is a block diagram showing a hardware configuration in a client / server configuration.
FIG. 4 is a flowchart showing a procedure of document data registration processing in the full-text search apparatus according to the present invention.
FIG. 5 is a conceptual diagram of a document data storage unit constituting the full-text search device according to the present invention.
FIG. 6 is a conceptual diagram of the number-of-languages storage unit constituting the full-text search apparatus according to the present invention.
FIG. 7 is a conceptual diagram of a full-text index storage unit constituting the full-text search device according to the present invention.
FIG. 8 is a flowchart showing a procedure of first search processing of document data in the full-text search device according to the present invention.
FIG. 9 is a flowchart showing a procedure of second search processing of document data in the full-text search device according to the present invention.
[Explanation of symbols]
1 Input means
2 Output means
3 Document data storage unit
4 Number-of-languages storage unit
5 Full-text index storage unit
6 Division processing means
7 Registration processing means
8 Search processing means
20 Input device
21 Output device
22 Main control device
23 Storage device
24 Input / output control device
25 Client
26 Network control device
27 Server
28 Network

Claims

A full-text search device for searching document data including a designated character string from a plurality of document data described in different languages,
Input means for inputting document data and language information of the document data;
Document data storage means for storing document data and language information of the document data;
Number-of-languages storage means for storing the number of document data for each language information;
A plurality of division processing means for performing word division processing and notation normalization processing of language information of the document data and dividing the document data into index units;
Full-text index storage means for storing index information used in the search regardless of language information;
Registration processing means for registering the document data and language information input from the input means in the document data storage means together with document identification information indicating the document data, incrementing the number of document data for that language information and registering it in the number-of-languages storage means, dividing the document data into index units by using the division processing means for that language information, and registering the index units having different notations after the notation normalization processing in the full-text index storage means as index information together with their appearance position information in the document data and the document identification information of the document data;
A full-text search device characterized by comprising:

The full-text search apparatus according to claim 1, further comprising search processing means that divides the search character string input using the input means by using the plurality of division processing means for each of the language information, generates a search expression using the plurality of normalized search character strings obtained by the division, and searches the full-text index storage means using the generated search expression.

The full-text search device according to claim 2, wherein the search processing means examines the language information stored in the number-of-languages storage means and generates the search expression using the plurality of normalized search character strings obtained by dividing the search character string using only the division processing means for that language information.

A document data processing program to be executed in a full-text search device that searches a plurality of document data described in different languages for document data including a designated character string, the full-text search device having input means for inputting document data and language information of the document data, document data storage means for storing the document data and the language information of the document data, number-of-languages storage means for storing the number of document data for each language information, a plurality of division processing means for performing, on the document data, word division processing and notation normalization processing for the language information of the document data and dividing the document data into index units, full-text index storage means for storing index information used in the search regardless of language information, and processing means for performing document data registration and search processing,
The document data processing program causing the processing means to execute registration processing of registering the document data and language information input from the input means in the document data storage means together with document identification information indicating the document data, incrementing the number of document data for that language information and registering it in the number-of-languages storage means, dividing the document data into index units by using the division processing means for that language information, and registering the index units having different notations after the notation normalization processing in the full-text index storage means as index information together with their appearance position information in the document data and the document identification information of the document data.

The document data processing program according to claim 4, further causing the processing means to execute search processing of dividing the search character string input using the input means by using the plurality of division processing means for each of the language information, generating a search expression using the plurality of normalized search character strings obtained by the division, and searching the full-text index storage means using the generated search expression.

The document data processing program according to claim 5, wherein the search processing examines the language information stored in the number-of-languages storage means and generates the search expression using the plurality of normalized search character strings obtained by dividing the search character string using only the division processing means for that language information.

A computer-readable recording medium on which the document data processing program according to claim 4 is recorded.