JP3698400B2 - Multilingual document processing apparatus, multilingual document processing method, and recording medium

Info

Publication number: JP3698400B2
Application number: JP24056599A
Authority: JP
Inventors: 修片山; 隆正小山
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1999-08-26
Filing date: 1999-08-26
Publication date: 2005-09-21
Anticipated expiration: 2019-08-26
Also published as: JP2001067368A

Description

【０００１】
【発明の属する技術分野】
本発明は、情報処理分野における多言語文書の登録検索に利用される多言語文書処理装置、多言語文書処理方法及びその多言語文書処理方法を実行するプログラムを記録した記録媒体に関する。
【０００２】
【従来の技術】
近年のコンピュータやワードプロセッサの普及により、電子化された大量の文書データが蓄積され、必要に応じて文書データを検索する文書データベースの実用化が進んでいる。文書データベースにおいては、通信ネットワークの発達や国際化に伴い、複数の言語が混在した多言語文書のデータを扱う機会が増加しつつある。
【０００３】
多言語文書を蓄積して管理する文書データベースにおける従来の多言語文書処理方法を図１６及び図１７に基づいて説明する。
多言語文書を登録する際には、入力された登録すべき多言語文書データに基づいて、多言語索引作成部５０１において検索用の多言語の索引を作成し、多言語索引格納部５０２に格納する。また、多言語文書データの実体を実体格納部５０３に格納する。検索を行う際には、入力された検索条件を示す検索文字列を基に、多言語索引照合部５０４によってその検索文字列と多言語索引格納部５０２に格納されている多言語索引とを照合し、検索条件に合致した文書の情報を検索結果として出力する。そして、この検索結果に基づき、実体抽出部５０５によって対応する多言語文書データの実体を実体格納部５０３から抽出し、多言語文書として出力する。
【０００４】
このような多言語文書データの索引や実体を格納する場合、図１７に示すように、カラムとレコードからなる表形式のデータベース構造を用いて、そのデータベースにおける複数のカラム５１１，５１２，５１３…にそれぞれ多言語文書データを格納するような方法が一般に採られている。カラム５１１〜５１３には、アクセスする単位としてカラムごとに属性（文書名など）が定義され、それぞれのカラムは対応する属性によってのみアクセスが可能となっている。このとき、カラム５１１に多言語文書データ全体をそのまま格納するか、カラム５１１に多言語文書データの任意の部分を格納し、カラム５１２，５１３に多言語文書データのその他の部分を格納する。このように従来では、複数言語を含む多言語文書データを、そのまま文書の構成に従って単数又は複数のカラムに格納し、検索等を行うようになっていた。
【０００５】
多言語の情報を処理する装置としては、特開平１−２１３７４４号公報、特開平１１−３３３８号公報などに開示されているものがあり、特に多言語文書の登録検索に関するものとして、特開平９−５０４４２号公報には複数の言語の文を含む文書に対して検索に用いるインデックスを作成して登録し、該インデックスにより文書の検索を行う多言語文書登録検索装置が開示されている。
【０００６】
また、複数のカラムに対するアクセスに関する方法としては、特開平６−６８１５１号公報のように該当するカラムと別テーブルのカラムをリンクさせる方法、特開平６−２２３１１８号公報のようにデータ構造の論理定義情報に結合情報などを含める方法、特開平８−１３７７３５号公報のように仮想的エンティティを記述するテーブルを設ける方法などが開示されている。
【０００７】
【発明が解決しようとする課題】
上述したような従来の多言語文書処理装置及び方法では、多言語文書データを格納して管理する場合に、格納領域としては特に言語を意識することなく複数言語が混在した状態で格納するようになっていた。このため、多言語文書データの管理に手間がかかったり、検索等を行う際のアクセスに時間がかかるなどの問題点が生じていた。また、多言語文書データが格納された複数のカラムにアクセスする場合に、従来ではアクセス手順が複雑化し、高速検索が困難であるなどの問題点があった。
【０００８】
本発明は、上記事情に鑑みてなされたもので、多言語文書に関する情報を言語ごとに区別して管理することができ、各情報に素早くアクセスして検索等の処理を容易かつ高速に行うことが可能な多言語文書処理装置、多言語文書処理方法及び記録媒体を提供することを目的とする。
【０００９】
【課題を解決するための手段】
本発明による多言語文書処理装置は、複数の言語の文字を含む多言語文書データの言語を識別する言語識別手段と、前記多言語文書データに関する索引を言語別に作成する索引作成手段と、前記索引を言語ごとに格納する索引格納手段と、前記言語ごとの索引を使用して多言語文書データの検索を行う検索手段と、を備えたものである。
【００１０】
また、好ましくは、前記索引格納手段は、データベースにおける一つのカラムを分割してそれぞれに言語種別を設定した複数の格納領域を備えており、この複数の格納領域から言語種別に対応する格納領域を選択して索引を格納することとする。
【００１１】
また、好ましくは、前記索引格納手段は、データベースにおけるそれぞれのカラムに言語種別を設定した複数のカラムを備えており、この複数のカラムから言語種別に対応するカラムを選択して索引を格納することとする。
【００１２】
また、好ましくは、前記索引格納手段は、データベースにおける複数のカラム又は一つのカラムを分割した複数の格納領域のそれぞれにデータ格納時の格納言語種別とデータ検索時の検索言語種別とを設定した複数の格納部を備え、この複数の格納部から格納言語種別に対応する格納部を選択して索引を格納するものであり、前記検索手段は、データ検索時に指定された言語種別を含む検索言語種別に対応する格納部を参照し、その格納部の索引により検索を行うこととする。
【００１３】
さらに、前記格納言語種別は、前記格納部を構成する一つのカラム又は格納領域に対して唯一の言語種別がそれぞれ設定されることが好ましい。
【００１４】
また、前記索引格納手段は、前記複数の格納部としてデータベースにおける一つのカラムを分割してそれぞれに格納言語種別と検索言語種別とを設定した複数の格納領域を備えており、前記格納言語種別は、前記一つの格納領域に対して唯一の言語種別がそれぞれ設定され、これらの格納言語種別のうちの一つが前記カラムの言語種別として設定されることが好ましい。
【００１５】
また、前記検索言語種別は、少なくとも一つの言語種別を含む言語種別の組からなり、この言語種別の組がそれぞれの格納部を構成する一つのカラム又は格納領域に対して設定され、この検索言語種別における一つの言語種別は当該格納部に設定された格納言語種別であることが好ましい。
【００１６】
また、好ましくは、前記多言語文書データを言語別でかつ所定文字数以内の複数のページに分割するページ分割手段を備え、前記索引作成手段は、前記言語別のページごとに索引を作成することとする。
【００１７】
また、好ましくは、前記多言語文書データの実体をデータベースにおける一つのカラムに又は複数のカラムに別々に格納する実体格納手段を備え、この多言語文書データの実体と前記多言語文書データの索引とは別々の格納手段に格納することとする。
【００１８】
また、好ましくは、前記言語識別手段は、多言語文書データに含まれる言語識別情報により言語を識別するものであり、前記索引作成手段は、前記言語識別情報を所定の特殊文字に変換し、該特殊文字を含む全ての文字の文字連鎖を言語別に作成することとする。
【００１９】
また、好ましくは、前記索引作成手段は、多言語文書データの単語又は２文字の連語を所定の対応文字に変換し、該対応文字を含む全ての文字の文字連鎖を言語別に作成することとする。
【００２０】
さらに、前記索引作成手段は、多言語文書データが１文字からなる場合は、その文字に全ての文字と結合可能でかつ検索対象とならない所定の特殊文字を付加して文字連鎖を作成することが好ましい。
【００２１】
また、前記索引は、該当する多言語文書データの文書を識別する文書情報と、該文書を所定単位ごとに分割したページを示すページ情報と、該文書内又はページ内における文字の相対的な出現順位或いは絶対的な出現位置の情報とを含むことが好ましい。
【００２２】
また、本発明による多言語文書処理装置は、複数の言語の文字を含む多言語文書データの言語を識別する言語識別手段と、前記多言語文書データを言語別でかつ所定単位ごとの複数のページに分割するページ分割手段と、前記多言語文書データに関する索引を前記言語別のページごとに作成する索引作成手段と、前記索引を言語ごとに格納する索引格納手段と、を備えたものである。
【００２３】
また、好ましくは、前記多言語文書データの実体を格納する実体格納手段を備えることとする。
【００２４】
また、好ましくは、前記言語ごとの索引を使用して検索文字列に該索引が含まれるか否かを判定して多言語文書データの検索を行う検索手段を備えることとする。
【００２５】
また、好ましくは、前記言語識別手段は、多言語文書データに含まれる言語識別情報により言語を識別するものであり、前記ページ分割手段は、前記言語識別情報から次の言語識別情報までの文字列を１つのページ又は所定単位ごとに分割した連続するページとして、複数のページに分割してページに格納することとする。
【００２６】
また、好ましくは、前記索引作成手段は、該当する多言語文書データの文書を識別する文書番号と、該文書におけるページを示すページ情報と、該文書内又はページ内における文字の相対的な出現順位或いは絶対的な出現位置の情報とを含むものを索引とすることとする。
【００２７】
また、好ましくは、前記検索手段による検索結果に基づいて検索文字列を含む多言語文書データの文書情報を取得し、この文書情報に該当する文書の多言語文書データの実体を抽出する実体抽出手段を備えることとする。
【００２８】
本発明による多言語文書処理方法は、複数の言語の文字を含む多言語文書データの言語を識別する言語識別ステップと、前記多言語文書データに関する索引を言語別に作成する索引作成ステップと、前記索引を言語ごとに格納する索引格納ステップと、を有するものである。
【００２９】
また、好ましくは、前記索引格納ステップにおいて、データベースにおける一つのカラムを分割してそれぞれに言語種別を設定した複数の格納領域を設け、この複数の格納領域から言語種別に対応する格納領域を選択して索引を格納することとする。
【００３０】
また、好ましくは、前記索引格納ステップにおいて、データベースにおけるそれぞれのカラムに言語種別を設定した複数のカラムを設け、この複数のカラムから言語種別に対応するカラムを選択して索引を格納することとする。
【００３１】
また、好ましくは、前記言語ごとの索引を使用して多言語文書データの検索を行う検索ステップを有し、前記索引格納ステップにおいて、データベースにおける複数のカラム又は一つのカラムを分割した複数の格納領域のそれぞれにデータ格納時の格納言語種別とデータ検索時の検索言語種別とを設定した複数の格納部を設け、この複数の格納部から格納言語種別に対応する格納部を選択して索引を格納し、前記検索ステップにおいて、データ検索時に指定された言語種別を含む検索言語種別に対応する格納部を参照し、その格納部の索引により検索を行うこととする。
【００３２】
また、好ましくは、前記多言語文書データを言語別でかつ所定文字数以内の複数のページに分割するページ分割ステップを有し、前記索引作成ステップにおいて、前記言語別のページごとに索引を作成することとする。
【００３３】
また、本発明による多言語文書処理方法は、複数の言語の文字を含む多言語文書データの言語を識別する言語識別ステップと、前記多言語文書データを言語別でかつ所定単位ごとの複数のページに分割するページ分割ステップと、前記多言語文書データに関する索引を前記言語別のページごとに作成する索引作成ステップと、前記索引を言語ごとに格納する索引格納ステップと、を有するものである。
【００３４】
本発明による記録媒体は、本発明に係る多言語文書処理方法を実行するためのプログラムとして記録したコンピュータにより読み取り可能なものである。
【００３５】
本発明では、多言語文書処理における文書管理において、複数の言語の文字を含む多言語文書データの言語を識別し、多言語文書データに関する索引を言語別に作成して、この索引を言語ごとに格納する。この際、データベースにおける１つのカラムに複数の言語の格納領域を備え、言語別に１つの格納領域又は複数の格納領域にデータを格納するか、又は、１つのカラムに格納するデータの言語を設定してデータの格納時に複数のカラムの中から該当する言語のカラムを識別して格納する。これにより、多言語文書データを言語別に処理し言語別に格納することが可能となる。或いは、１つの多言語文書データに対して所定単位ごとの複数のページに分割し、言語種別ごとでページごとに索引を作成して言語別に格納する。これにより、検索文字列指定時に言語種別及びページごとに索引にアクセスして検索することが可能となる。
【００３６】
上記作用により、複数の異なる種類の言語に関するデータを各々別々に又は種類別に取り扱うことが可能となり、データ管理上の手順が簡略化される。また、登録時のデータ格納や検索時のデータ照合などのためにカラム又はその中の格納領域にアクセスする際に、言語種別によって対応する格納領域のみにアクセスすることが可能であるため、容易かつ素早いアクセスによって多言語文書データの高速な登録や検索が可能となる。
【００３７】
【発明の実施の形態】
以下、図面を参照して本発明の実施の形態を説明する。
本実施形態では、多言語文書処理装置及び方法として、多言語文書を管理するにあたり、検索のための索引の作成及び格納処理、その索引を用いた検索処理について説明する。なお、それぞれの実施形態の説明では、本発明に係る多言語文書処理装置及び方法について詳述するが、本発明に係る記録媒体については、多言語文書処理方法を実行させるためのプログラムを記録した記録媒体であることから、その説明は以下の多言語文書処理方法の説明に含まれるものである。
【００３８】
［第１実施形態］
図１は本発明の第１実施形態に係る多言語文書処理装置の機能的概略構成を示すブロック図、図２は多言語文書データを格納及び参照する部分の機能的構成を示すブロック図である。
【００３９】
図１に示すように、本実施形態の多言語文書処理装置は、多言語文書データに関する索引等を言語ごとに分けて格納し管理する構成となっており、入力された登録すべき多言語文書データを各言語別に識別する言語識別手段に該当する登録文字列言語識別部１１、多言語文書データの言語別索引を作成する索引作成手段に該当する言語別索引作成部１２、作成した索引データを言語別に設けられた格納領域に格納する索引格納手段に該当する言語別索引格納部１３、登録する多言語文書の実体データを格納する実体格納手段に該当する実体格納部１４、検索時に入力された検索文字列を各言語別に識別する検索文字列言語識別部１５、検索文字列の言語別索引を作成する検索文字列言語別索引作成部１６、検索文字列の言語別索引と登録された多言語文書の言語別索引とを照合して検索を行う検索手段に該当する言語別索引照合部１７、言語別索引の照合に基づく検索結果により多言語文書の実体データを抽出する実体抽出部１８を有している。
【００４０】
図２は、第１実施形態における多言語文書処理装置の主要部の機能的構成として、多言語文書データを言語別に格納し参照する機能部分を示したものである。第１実施形態では、言語種別により多言語文書データ（索引データ又は実体データ）の格納先を切り替える入出力切替部２１、多言語文書データの格納先を識別する言語種別に関する情報を記憶する言語種別記憶部２２、言語種別が言語α、言語β、言語γのデータをそれぞれ格納するデータ格納部２３，２４，２５を有している。この図２に示す部分は、図１に示す多言語文書処理装置において主に言語別索引格納部１３に対応する。
【００４１】
入出力切替部２１は、言語種別記憶部２２に記憶されている言語種別の情報を参照して入出力を切り替え、格納や参照のためにアクセスする多言語文書データの言語種別が言語αの場合は言語αデータ格納部２３に、言語βの場合は言語βデータ格納部２４に、言語γの場合は言語γデータ格納部２５にそれぞれアクセスできるように、データの入出力を行う。なお、ここでは、説明のため言語種別のとる値の範囲を言語αから言語γの３つとしているが、この言語種別の値の範囲は制限がなく、言語種別に対応するデータ格納部は２つ以上でいくつあってもよい。
【００４２】
図３は第１実施形態における多言語文書データの格納に関する多言語文書処理方法を概念的に示したものである。第１実施形態では、データベース構造における一つのカラムを複数の格納領域に分割し、各格納領域に言語種別ごとに分けた多言語文書データをそれぞれ格納する。
【００４３】
図３（Ａ）に示すように、カラム３１は、文書名などのアクセスする単位を表す属性（カラム名）３２が定義され、この属性３２によって対応するカラムにアクセスして多言語文書データの格納や参照が可能となっている。このカラム３１は、データ格納部２３，２４，２５に対応するように、言語α，言語β，言語γの言語種別ごとに設けられた複数の格納領域３３Ａ，３３Ｂ，３３Ｃに分割された構成となっている。また、図３（Ｂ）に示すように、言語種別記憶部２２に対応して、カラム３１内の各格納領域に割り当てた言語種別を示す言語種別情報３６が設定され、カラム３１の外部又は内部の所定箇所に記憶されている。
【００４４】
このような構成のカラム３１にアクセスする場合、複数の格納領域３３Ａ，３３Ｂ，３３Ｃの中から、言語種別情報３６に基づいていずれかの格納領域を選択し、対応する言語種別の格納領域にアクセスする。このとき、属性を指定することによって該当するカラムへのアクセスを指示すると、アクセス対象となる多言語文書データの言語種別に応じて、カラム内の対応する言語種別の格納領域にのみアクセスが可能となる。多言語文書データを属性３２のカラム３１に格納する際、言語種別情報３６を参照して、格納するデータの言語種別が言語αの場合は格納領域３３Ａが、言語βの場合は格納領域３３Ｂが、言語γの場合は格納領域３３Ｃが、それぞれ選択され、選択された格納領域にデータが格納される。なお、ここでは、カラム３１には３つの格納領域がある場合を示しているが、格納領域の数は多言語文書データの言語種別の数に応じていくつでも構わない。
【００４５】
また、多言語文書の実体データは、１つのカラムにまとめて或いは複数のカラムに別々に格納し、索引データと実体データとを別々の格納手段（カラム、ファイル、ディレクトリ、ディスク等の記録媒体など）に格納するようにする。
【００４６】
このように、データベースのカラムにおいて複数のデータ格納領域を設定し、言語種別による格納領域の選択を行う機能を設けることにより、１つの属性に対応するカラムに対して複数の言語別に格納領域を選択してデータを格納することが可能となる。
【００４７】
次に、上記のような多言語文書処理装置及び方法において、複数の言語の文章からなる多言語文書として日本語と英語が混在した文書データを対象とし、一つのカラムに日本語と英語の索引をそれぞれの言語別の格納領域に格納し検索する場合の動作手順について説明する。
【００４８】
図４は日本語の索引と英語の索引をそれぞれの格納領域に格納した状態を示す説明図、図５は言語種別を識別するための言語識別情報や英単語又は英連語を置き換える特殊文字を示す説明図、図６は登録多言語文書データの索引及び検索文字列の索引を作成する手順を示す説明図である。
【００４９】
ここでは、図４に示すように、属性が「本文」のカラム４１に言語種別が英語の格納領域４２Ａと言語種別が日本語の格納領域４２Ｂとを設け、それぞれの言語の索引を格納する場合を例示する。本実施形態では、多言語文書データにおいて、言語識別情報として、以下の文字列が日本語であることを表す＜日本語＞と、英語であることを表す＜英語＞とがそれぞれ設けられているものとする。また、日本語の文字は２バイト、英語の文字は１バイトで、それぞれが分かち書き文となっているとする。なお、言語識別情報は、上記のように言語が切り換わる位置で文字列ごとに設けるもの（タグなど）に限らず、個々の文字ごとに設けても良い。言語識別情報としては、構造化文書のタグ、文字のフォントを切り換えるためのフォント情報を含む識別コードや制御コード、JIS X 0202(ISO 2022)拡張符号化方式のエスケープシーケンスなどを用いることができるし、文字コードによっては言語識別情報が無くても言語種別が判別可能な場合は特に言語識別情報を設けていない多言語文書データであっても以下と同様にして言語別に索引を作成して格納することが可能である。
【００５０】
図５（Ａ）は言語識別情報と置き換える特殊文字との対応を示したものであり、索引を作成するときには、「＜日本語＞」は「^V」の特殊文字に、「＜英語＞」は「^W」の特殊文字にそれぞれ置き換える。また、英語の索引を作成する場合は、図５（Ｂ）に示すように英単語を表す文字列をまとめて１文字の特殊文字に置き換えたり、図５（Ｃ）に示すように英連語（英語のアルファベット文字列）をまとめて（ここでは２文字ごと）１文字の特殊文字に置き換える。ここでは、図５（Ｂ）のように英語文字列の単語「This」を「0x1」、「is」を「0x2」（0xは１６進数を示す）のそれぞれの対応文字に変換するようにする。
【００５１】
なお、索引を登録する多言語文書データが１文字からなる場合は、全ての文字と結合可能でかつ検索対象とならない特殊文字（使用されていない制御コードに対応する文字コードなどのフォントが割り当てられていない文字）をその文字に付加して文字連鎖を作成する。
【００５２】
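FIG. 6(A) shows the procedure for creating a registration index from registered multilingual document data 43 representing 「これは This is文書です」. The registered multilingual document data 43 contains language identification information 44a, 44b, and 44c, which distinguishes the Japanese character strings 「これは」 and 「文章です」 from the English character string "This is". The language identification information is replaced with special characters according to the correspondence table of FIG. 5(A): 「＜日本語＞」 becomes "^V" and 「＜英語＞」 becomes "^W". The words of the English character string are converted into corresponding characters according to the correspondence table of FIG. 5(B): "This" becomes "0x1" and "is" becomes "0x2". As a result, each piece of language identification information is replaced with a special character that is shared by the adjoining ends of the character strings of the two languages.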
【００５３】
そして、日本語文字列「これは」については、「これは^W」として索引４５ａ，４５ｂ，４５ｃを作成し、英語文字列「This is」については、「^W 0x1 0x2 ^V」として索引４５ｄ，４５ｅ，４５ｆを作成し、また日本語文字列「文章です」については、「^V文章です」として索引４５ｇ，４５ｈ，４５ｉ，４５ｊを作成する。なお、この例では簡単にするために登録多言語文書データ４３の先頭の言語識別情報４４ａに対応する特殊文字を省略しているが、文字列先頭に特殊文字「^V」を付加して「^Vこれは^W」の索引を作成するようにしても良い。このように作成した索引は、２文字連鎖のものであり、図示しないが各文字連鎖ごとの文書内における相対的な出現順位又は絶対的な出現位置の情報を含む索引データとして格納される。
【００５４】
このとき、入力された登録多言語文書データ４３において言語識別情報４４ａ，４４ｂ，４４ｃによって言語種別を識別して、それぞれの言語の文字列に対応する索引を作成し、図４に示すようにカラム４１の各格納領域４２Ａ，４２Ｂに言語別に格納する。ここでは、まず日本語文字列「これは」に対応する索引４５ａ〜４５ｃを作成して日本語の格納領域４２Ｂに格納し、次いで英語文字列「This is」に対応する索引４５ｄ〜４５ｆを作成して英語の格納領域４２Ａに格納し、さらに、日本語文字列「文章です」に対応する索引４５ｇ〜４５ｊを作成して日本語の格納領域４２Ｂに格納する。これにより、日本語の索引は格納領域４２Ｂに、英語の索引は格納領域４２Ａにそれぞれ分離されて格納される。
【００５５】
登録した多言語文書データに対して検索を行う場合は、入力された検索文字列について同様に索引を作成し、格納されている多言語文書データの索引と照合して一致しているか否かを判断する。この索引の照合結果によって、検索文字列にヒットした多言語文書データ内の文字列があるかどうかが検出される。そして、索引データが格納されているカラムの属性などから、文書名などの多言語文書データに関する情報を得て検索結果として出力する。また、使用者の指示などに応じて多言語文書データの実体データを抽出して出力する。
【００５６】
図６（Ｂ）は、「これは This is文書」を表す検索文字列４６から検索用の索引を作成する手順を示したものである。検索文字列４６は言語識別情報４７ａ，４７ｂ，４７ｃを含んでおり、日本語文字列「これは」、「文章です」と英語文字列「This is」とが区別されている。上述した登録多言語文書データ４３の場合と同様にして、日本語と英語の言語別に検索文字列の索引４８ａ〜４８ｈが作成される。このとき、先頭の言語識別情報４７ａにより言語種別を日本語に設定し、文字列「これは^W」の索引４８ａ，４８ｂ，４８ｃを作成し、日本語の格納領域４２Ｂの索引４５ａ，４５ｂ，４５ｃに対して、索引の各文字の出現順位の順に、すなわち索引４８ａは索引４５ａと、索引４８ｂは索引４５ｂと、索引４８ｃは索引４５ｃと照合する。
【００５７】
次いで、多言語文書データの索引４５ｃの特殊文字「^W」により索引文字列終端の言語が英語に切り替わることを検出し、検索文字列の言語識別情報４７ｂにより言語種別を英語に設定し、文字列「^W 0x1 0x2 ^V」の索引４８ｄ，４８ｅ，４８ｆを作成し、英語の格納領域４２Ａの索引４５ｄ，４５ｅ，４５ｆに対して、索引の各文字の出現順位の順に、すなわち索引４８ｄは索引４５ｄと、索引４８ｅは索引４５ｅと、索引４８ｆは索引４５ｆと照合する。このとき、多言語文書データの索引４５ｃと索引４５ｄの検出により「This is」が「これは」に連続していることを検出し、さらに、索引４５ｆの特殊文字「^V」により索引文字列終端の言語が日本語に切り替わることを検出する。そして、検索文字列の言語識別情報４７ｃにより言語種別を日本語に設定し、文字列「^V文章」の索引４８ｇ，４８ｈを作成し、日本語の格納領域４２Ｂの索引４５ｇ，４５ｈに対して、索引の各文字の出現順位の順に、すなわち索引４８ｇは索引４５ｇと、索引４８ｈは索引４５ｈと照合する。このとき、多言語文書データの索引４５ｆと索引４５ｇの検出により「文章」が「This is」に連続していることが検出される。
【００５８】
以上の照合によって、検索文字列の索引４８ａ〜４８ｈと多言語文書データの索引４５ａ〜４５ｈとが一致した場合は、これらの索引の文字連鎖に対応した文字列、すなわち検索文字列４６が登録多言語文書データ４３において含まれることが検出されたことになる。
【００５９】
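The example above registers and searches a character string in which two different languages, Japanese and English, occur consecutively. The indexes stored separately for each language can also be used separately to search by language. For example, to search the registered multilingual document data 43 for "This" as an English search, it is sufficient to set the language type to English and collate only against the index stored in the storage area 42A.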
【００６０】
本実施形態では、多言語文書データの索引データなどを格納する一つのカラムを複数の格納領域に分割し、言語ごとに分離してそれぞれの格納領域にデータを格納するようにしている。これにより、多言語文書データを管理する場合に、複数の異なる種類の言語に関するデータを言語別に取り扱うことができ、データ管理上の手順を簡略化できる。また、登録時のデータ格納や検索時のデータ照合などのためにカラムにアクセスする際に、言語種別によって対応する格納領域のみにアクセスすることができ、容易かつ素早いアクセスによって高速な登録や検索が可能となる。
【００６１】
［第２実施形態］
図７は第２実施形態に係る多言語文書データを登録及び検索する部分の機能的構成を示すブロック図である。
【００６２】
第２実施形態では、多言語文書処理装置の主要部の機能的構成として、多言語文書データを言語別に格納及び参照可能なように、データベースの各カラムの属性と言語種別を定義するデータ定義部５１、入力される多言語文書データに対し言語別の索引等の登録処理を行う言語別登録部５２、多言語文書データを指定カラムに格納するデータ格納部５３、言語種別に従って指定カラムに対して言語別の検索処理を行う言語別検索部５４を有している。
【００６３】
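The language-specific registration unit 52 and the language-specific search unit 54 perform registration processing and search processing, respectively, for the corresponding language on the designated column, according to the language type defined by the data definition unit 51. This makes it possible to set a language type for each of a plurality of columns and to register and search data in a plurality of different languages simultaneously, each against its designated column. The data definition unit 51 may define any number of attributes.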
【００６４】
図８は第２実施形態における多言語文書データの格納に関する多言語文書処理方法を概念的に示したものである。第２実施形態では、データベース構造における複数のカラムのそれぞれに対して言語種別を割り当てて定義し、各カラムに言語種別ごとに分けた多言語文書データをそれぞれ格納する。
【００６５】
図８（Ａ）に示すように、属性Ａ，属性Ｂ，属性Ｃがそれぞれ定義されたカラム６１，６２，６３を有し、これらのカラムの属性に対して図８（Ｂ）に示すように言語種別として言語α，言語β，言語γのデータ定義情報６４が定義される。複数のカラム６１，６２，６３に対してデータを格納する際には、データ定義情報６４を参照して言語種別に対応する属性のカラムを判別し、そのカラム（指定カラム）に対してアクセスする。これにより、多言語文書データの実体及び索引を登録する場合に、言語種別ごとに索引作成等の言語処理を行って対応するカラムに登録すべきデータを格納することができる。また、複数のカラム６１，６２，６３に格納されたデータを検索する場合は、データ定義情報６４を参照して言語種別に対応する属性のカラムを判別し、そのカラム（指定カラム）に対してアクセスすることにより、言語種別ごとに検索文字列照合等の言語処理を行って検索することができる。
【００６６】
次に、第２実施形態の多言語文書処理装置及び方法において、複数の言語の文章からなる多言語文書として日本語と英語が混在した文書データを対象とし、複数のカラムに対して日本語と英語の索引をそれぞれの対応するカラムに言語別に格納し検索する場合の動作手順について説明する。
【００６７】
図９は日本語の索引と英語の索引をそれぞれのカラムに格納した状態を示す説明図である。ここでは、図９（Ｂ）に示すようにデータ定義情報７３を設定し、図９（Ａ）に示すように属性が「本文Ａ」で言語種別が「日本語」のカラム７１と、属性が「本文Ｂ」で言語種別が「英語」のカラム７２とを設け、それぞれの言語別の索引を対応するカラムに格納する。登録多言語文書データ及び検索文字列は図６に示したものと同様の場合を例示する。
【００６８】
登録多言語文書データの索引を作成して格納する場合、日本語文字列の索引４５ａ〜４５ｃ，４５ｇ〜４５ｊは対応する属性「本文Ａ」を指定してカラム７１に格納し、英語文字列の索引４５ｄ〜４５ｆは対応する属性「本文Ｂ」を指定してカラム７２に格納する。
【００６９】
検索文字列４６によって検索する場合、まず日本語文字列の索引４８ａ〜４８ｃを索引の各文字の出現順位の順に属性「本文Ａ」のカラム７１に格納された索引４５ａ〜４５ｃと照合する。次いで、英語文字列の索引４８ｄ〜４８ｆを索引の各文字の出現順位の順に属性「本文Ｂ」のカラム７２に格納された索引４５ｄ〜４５ｆと照合する。このとき、多言語文書データの索引４５ｃと索引４５ｄの検出により「This is」が「これは」に連続していることが検出される。そして、日本語文字列の索引４８ｇ，４８ｈを索引の各文字の出現順位の順に属性「本文Ａ」のカラム７１に格納された索引４５ｇ，４５ｈと照合する。このとき、多言語文書データの索引４５ｆと索引４５ｇの検出により「文章」が「This is」に連続していることが検出される。
【００７０】
以上の照合によって、検索文字列の索引４８ａ〜４８ｈと多言語文書データの索引４５ａ〜４５ｈとが一致した場合は、これらの索引の文字連鎖に対応した文字列、すなわち検索文字列４６が登録多言語文書データ４３において含まれることが検出されたことになる。
【００７１】
第２実施形態では、多言語文書データの索引データなどを格納する複数のカラムを言語種別ごとに定義し、言語ごとにカラムを区別してそれぞれのカラムにデータを格納するようにしている。これにより、第１実施形態と同様に、多言語文書データを管理する場合に複数の異なる種類の言語に関するデータを言語別に取り扱うことができ、日本語と英語など複数言語が連続する文書データの登録及び検索が容易かつ高速に実行可能となる。
【００７２】
この第２実施形態は、それぞれの言語に関する索引等のデータを１つの専用のカラムに格納して言語を別々に検索する方法により多言語文書データを管理する場合に特に効果的である。また、第２実施形態の多言語文書処理装置及び方法では、一度カラムの属性を言語別に定義してしまえば、言語種別を意識することなく言語別にカラムにアクセスして検索することができる。例えば、属性として「本文Ａ」を指定すると言語種別が日本語となり、日本語文字列の登録及び検索が行われ、同様に「本文Ｂ」を指定すると英語文字列の登録及び検索を行うことができる。
【００７３】
［第３実施形態］
図１０は第３実施形態に係る多言語文書データを格納及び検索する部分の機能的構成を示すブロック図である。
【００７４】
第３実施形態では、多言語文書処理装置の主要部の機能的構成として、多言語文書データを言語別に格納及び参照可能なように、格納時に多言語文書データの格納先を選択する格納領域選択部８１、格納時及び検索時の言語種別を記憶する言語種別記憶部８２、言語種別が言語α、言語β、言語γのデータをそれぞれ格納するデータ格納部８３，８４，８５、各言語のデータ格納部８３，８４，８５に格納する言語種別を記憶する格納言語種別記憶部８６、検索時にデータ格納部８３，８４，８５を選択する検索領域選択部８７、各言語のデータ格納部８３，８４，８５における検索言語種別の組を記憶する検索言語種別記憶部８８を有している。なお、ここでは、説明のため言語種別を３つとしているが、言語種別及び対応するデータ格納部は２つ以上のいくつでも良い。
【００７５】
格納領域選択部８１は、多言語文書データの索引データ等を格納する場合に、言語種別記憶部８２に入力された格納時の言語種別がいずれであるかを格納言語種別記憶部８６にある格納言語種別情報を参照して識別し、データ格納部８３，８４，８５のうちの対応する格納言語種別のデータ格納部を選択し、データの格納を行う。また、検索領域選択部８７は、多言語文書データを検索する場合に、言語種別記憶部８２に入力された検索時の言語種別がいずれであるかを検索言語種別記憶部８８にある検索言語種別の組の情報を参照して識別し、データ格納部８３，８４，８５のうちの対応する検索言語種別の組のデータ格納部を選択し、データの検索を行う。
【００７６】
図１１は第３実施形態における多言語文書データの格納に関する多言語文書処理方法を概念的に示したものである。第３実施形態では、データベース構造における一つのカラムを複数の格納領域に分割し、各格納領域に格納言語種別と検索言語種別の組とを設定して、言語種別ごとに分けた多言語文書データをそれぞれ対応する格納領域に格納するとともに、検索文字列の言語種別に応じて対応する格納領域にアクセスして検索を行う。
【００７７】
図１１（Ａ）に示すように、カラム９１は、文書名などのアクセスする単位を表す属性９２が定義されるとともに、データ格納部８３，８４，８５に対応するように、言語α，言語β，言語γの言語種別ごとに設けられた複数の格納領域９３Ａ，９３Ｂ，９３Ｃに分割された構成となっている。なお、属性９２には多言語文書データの主となる言語種別の情報も含まれるものとする。また、この例では、第１実施形態と同様に１つのカラムを複数の格納領域に分割して言語別にデータを格納する場合を示したが、第２実施形態と同様に複数のカラムのそれぞれに格納言語種別及び検索言語種別を定義して言語別にデータを格納するようにしても同様な作用効果が得られる。
【００７８】
また、図１１（Ｂ）に示すように、格納言語種別記憶部８６及び検索言語種別記憶部８８に対応して、カラム９１内の各格納領域に割り当てた格納言語種別及び検索言語種別を示す言語種別情報９６が設定され、カラム９１の外部又は内部の所定箇所に記憶されている。検索言語種別は、多言語文書データにおいて用いられる格納言語種別を含む言語種別の組を示したものである。例えば、検索言語種別Ｅは言語α、検索言語種別Ｆは言語α及び言語β、検索言語種別Ｇは言語α及び言語γとする。ここで、格納言語種別は各カラム又は格納領域において唯一の言語種別が設定される。また、検索言語種別は１つ以上の言語種別の組からなり、その中の１つの言語種別が格納言語種別となるように設定される。
【００７９】
入力された多言語文書データをカラム９１に格納する場合、言語種別情報９６の格納言語種別に基づいて、複数の格納領域９３Ａ，９３Ｂ，９３Ｃの中からいずれかの格納領域を選択し、対応する言語種別の格納領域にアクセスして格納する。すなわち、格納言語種別が言語αの場合は格納領域９３Ａが、言語βの場合は格納領域９３Ｂが、言語γの場合は格納領域９３Ｃが選択される。また、カラム９１に格納された多言語文書データの検索を行う場合は、言語種別情報９６の検索言語種別に基づいていずれかの格納領域を選択し、対応する言語種別の格納領域にアクセスしてデータを参照する。この場合、検索言語種別Ｅの場合は格納領域９３Ａが、検索言語種別Ｆの場合は格納領域９３Ｂが、検索言語種別Ｇの場合は格納領域９３Ｃが選択される。すなわち、言語αの場合は格納領域９３Ａ，９３Ｂ，９３Ｃの全格納領域が、言語βの場合は格納領域９３Ｂが、言語γの場合は格納領域９３Ｃが選択されることになる。なお、ここでは、カラム９１には３つの格納領域が多重化された場合を示しているが、この多重化した格納領域の数はいくつでも構わない。
【００８０】
このように、データベースのカラムにおいて複数のデータ格納領域を設定し、格納する言語種別と検索する言語種別の組とによりそれぞれの格納領域の選択を行う機能を設けることにより、１つの属性に対応するカラムに対して複数の言語別に格納領域を選択してデータを格納及び検索することが可能となる。
【００８１】
次に、第３実施形態の多言語文書処理装置及び方法において、複数の言語の文章からなる多言語文書として日本語と英語が混在した文書データを対象とし、複数のカラムに対して日本語と英語の索引をそれぞれの対応するカラムに言語別に格納し検索する場合の動作手順について説明する。
【００８２】
図１２は日本語の索引と英語の索引をそれぞれの格納領域に格納した状態を示す説明図、図１３は登録多言語文書データの索引及び検索文字列の索引を作成する手順を示す説明図である。
【００８３】
ここでは、図１２に示すように、属性が「本文（日本語）」のカラム１０１に格納言語種別が日本語で検索言語種別が日本語である格納領域１０２Ａと、格納言語種別が英語で検索言語種別が日本語及び英語である格納領域１０２Ｂとを設け、それぞれの言語の索引を格納する場合を例示する。
【００８４】
この場合、日本語の索引は格納領域１０２Ａに、英語の索引は格納領域１０２Ｂにそれぞれ分割されて格納される。検索を行う際には、日本語が指定された場合は格納領域１０２Ａ及び１０２Ｂにアクセス可能となり、英語が指定された場合は格納領域１０２Ｂのみにアクセス可能となって、検索文字列によって検索が実行される。主となる言語種別である日本語を指定して検索を行う場合は、英語の文字は日本語の中に埋め込まれたものと判断し、日本語と同じ方法で索引を作成し検索する。
【００８５】
図１３（Ａ）は、「これは This is文書です」を表す登録多言語文書データ１０３から登録用の索引を作成する手順を示したものである。登録多言語文書データ１０３には第１実施形態と同様に言語識別情報が含まれており、「これは」と「文章です」の日本語文字列１０４ａ，１０４ｃと、「This is」の英語文字列１０４ｂとが区別されている。まず、言語識別情報を省略して文字列１０４ａ，１０４ｂ，１０４ｃを連結し、英語文字列「This is」を対応文字「0x1 0x2」に変換した連結文字データ１０５とする。
【００８６】
そして、カラム１０１に定義された主となる言語種別（ここでは日本語）により、日本語文字列として索引１０６ａ〜１０６ｈを作成する。この場合、日本語文字列「これは」に関する索引である、索引１０６ａから日本語文字列１０４ａと英語文字列１０４ｂとの連結を示す索引１０６ｃまでを格納領域１０２Ａに格納し、英語文字列「This is」に関する索引である、索引１０６ｄから英語文字列１０４ｂと日本語文字列１０４ｃとの連結を示す索引１０６ｅまでを格納領域１０２Ｂに格納し、日本語文字列「文章です」に関する索引である索引１０６ｆから索引１０６ｈまでを格納領域１０２Ａに格納する。
【００８７】
このように登録された多言語文書データに対して検索を行う場合は、入力された検索文字列について同様に索引を作成し、格納されている多言語文書データの索引と照合して一致しているか否かを判断する。図１３（Ｂ）は、「これは This is文書」を表す検索文字列１０７から検索用の索引を作成する手順を示したものである。検索文字列１０７は日本語文字列１０８ａ，１０８ｃと英語文字列１０８ｂとを含んでいるため、２つの格納領域１０２Ａ，１０２Ｂの検索言語種別の組の両方に含まれる日本語を指定する。これにより、検索文字列１０７について格納領域１０２Ａと１０２Ｂの両方にアクセスして検索することができる。
【００８８】
このとき、上述した登録多言語文書データ１０３の場合と同様にして、検索文字列１０７の連結文字データ１０９から索引１１０ａ〜１１０ｆを作成し、格納されている索引１０６ａ〜１０６ｆと各文字の出現順位の順に照合する。すなわち、索引１１０ａ，１１０ｂ，１１０ｃを日本語の格納領域１０２Ａの索引１０６ａ，１０６ｂ，１０６ｃと出現順位に従って照合し、索引１１０ｄ，１１０ｅを日本語及び英語の格納領域１０２Ｂの索引１０６ｄ，１０６ｅと出現順位に従って照合し、索引１１０ｆを日本語の格納領域１０２Ａの索引１０６ｆと照合する。
【００８９】
以上の照合によって、検索文字列の索引１１０ａ〜１１０ｆと多言語文書データの索引１０６ａ〜１０６ｆとが一致した場合は、これらの索引の文字連鎖に対応した文字列、すなわち検索文字列１０７が登録多言語文書データ１０３において含まれることが検出されたことになる。
【００９０】
第３実施形態では、多言語文書データの索引データなどを格納する一つのカラムを複数の格納領域に分割し、各格納領域に格納言語種別と検索言語種別との組を定義して、それぞれの格納言語種別に対応する格納領域にデータを格納するとともに、対応する検索言語種別の格納領域にアクセスして検索するようにしている。これにより、第１実施形態と同様に、多言語文書データを管理する場合に複数の異なる種類の言語に関するデータを言語別に取り扱うことができ、日本語と英語など複数言語が連続する文書データの登録及び検索が容易かつ高速に実行可能となる。
【００９１】
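The third embodiment is particularly effective when multilingual document data consisting of a plurality of languages is registered and managed, and multilingual registration and search is performed by treating its index as an index of a single language. For example, a search string in the main language type (Japanese in the example above) can be searched by accessing all storage areas without regard to language type, while a search string in another language (English in the example above) accesses only some of the storage areas, so a high-speed search is possible.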
【００９２】
［第４実施形態］
図１４は第４実施形態に係る多言語文書データを格納及び検索する部分の機能的構成を示すブロック図である。
【００９３】
第４実施形態では、多言語文書処理装置の主要部の機能的構成として、多言語文書データを言語別にページごとに格納可能なように、多言語で構成された文書データを読み取って文書ごとに識別するための文書情報（文書番号）を付与する多言語文書データ入力部１２１、入力された多言語文書データからタグなどの言語識別情報を検出して言語種別を判定する言語識別手段に該当する言語種別判定部１２２、判定された言語種別に基づいて多言語文書データに対して文書番号単位で言語別にページ番号の割り付けを行うページ分割手段に該当するページ分割部１２３、文書番号、ページ番号、言語種別を取得して各ページに含まれる文書データに対して言語別に索引を作成する索引作成手段に該当する言語別索引作成部１２４、作成された索引を言語別にカラムに格納する索引格納手段に該当する言語別索引格納部１２５、文書番号と多言語文書データそのものの実体を格納する実体格納手段に該当する実体格納部１２６を有している。
【００９４】
ページ分割部１２３は、多言語文書データを文書番号単位で言語別に分割してページ番号の割り付けを行い、その言語種別の文書データの長さが予め設定した１ページの長さを超えた場合には複数ページにさらに分割して言語種別ごとにページ番号を割り付ける。言語別索引作成部１２４は、各ページに含まれる文書データに対して各文字の出現順位又は出現位置を計算し、文書番号、ページ番号、文字の出現順位又は出現位置を含む索引データを言語種別ごとに分割して作成する。言語別索引格納部１２５は、作成された索引データを例えば言語種別ごとに索引ファイルとして格納する。
【００９５】
また、多言語文書データを高速検索可能なように、検索文字列と指定された検索言語種別を読み取る検索文字列入力部１２７、言語別索引格納部１２５に格納された検索言語種別に対応する索引と検索文字列とを照合して検索を行う検索手段に該当する文字列検索部１２８、文字列検索部１２８の検索結果に基づいて該当する文書番号の多言語文書データの実体を実体格納部１２６から抽出し出力する実体抽出手段に該当する実体抽出部１２９を有している。
【００９６】
文字列検索部１２８は、指定された検索言語種別に対応する索引ファイルを言語別索引格納部１２５から読み取り、検索文字列を含む索引ファイルを検出して索引データの文字列と検索文字列とが一致するかを判定し、一致した索引データに該当する文書番号を出力する。実体抽出部１２９は、文字列検索部１２８により取得された文書番号に対応する文書データの実体を読み出して検索結果として出力する。
【００９７】
図１５は第４実施形態における多言語文書データの格納及び検索に関する多言語文書処理方法を概念的に示したものである。図１５において、（Ａ）は多言語文書データの登録（索引格納）に関する動作を、（Ｂ）は多言語文書データの検索に関する動作を示している。
【００９８】
多言語文書データの索引を登録する場合は、図１５（Ａ）に示すように、登録多言語文書データ１３１を言語種別ごと及びページごとに分割して索引を作成し格納する。この登録多言語文書データ１３１は、＜日本語＞、＜英語＞のタグにより日本語と英語の言語種別が区別されている。なお、これらの言語の他に、中国語、韓国語など多数の言語をタグで示して区別することも可能である。
【００９９】
まず、入力した登録多言語文書データ１３１に文書番号として「本文Ｘ」を付与する。なお、文書番号は「文書１」などの連続番号とか、任意の番号や符号でも良い。また、この登録多言語文書データ１３１の実体は実体データ１３９として格納される。次いで、登録多言語文書データ１３１における文字列の言語種別をタグにより判定し、言語種別ごとに複数ページに分割してページ番号を付与する。図１５の例は、言語種別が日本語でページ番号Ｐ１が割り付けられた文書レコード１３２ａ、言語種別が英語でページ番号Ｐ２が割り付けられた文書レコード１３２ｂ、言語種別が日本語で複数ページに分割されてページ番号Ｐ３〜Ｐ７が割り付けられた文書レコード１３２ｃ〜１３２ｇを示している。
【０１００】
そして、複数のページごとに分割された文書レコード１３２ａ〜１３２ｇに対して、それぞれ上述した実施形態と同様に索引を作成する。本実施形態では、文書番号「本文Ｘ」、ページ番号「Ｐ１」〜「Ｐ７」、文字連鎖の情報を含む索引データを作成し、索引ファイルとして言語種別ごとにカラムに格納する。すなわち、日本語の文書レコード１３２ａ，１３２ｃ〜１３２ｇに関する索引データは索引ファイル１３３ａ〜１３３ｆとして日本語の格納領域に格納され、英語の文書レコード１３２ｂに関する索引データは索引ファイル１３４ａとして英語の格納領域に格納される。なお、索引データとしては、文字連鎖だけでなく、各文字の出現順位や出現位置も合わせて格納しても良い。
【０１０１】
上記のように格納された多言語文書データに対する検索の第１例を図１５（Ｂ）に示す。この第１例は、多言語検索文字列データ１３５として、検索文字列が「文書」で、検索言語種別として「日本語」が指定された場合の動作である。このとき、入力された多言語検索文字列データ１３５に基づいて検索言語種別を判断し、日本語の索引ファイルを指定する。そして、日本語の索引ファイルの中に検索文字列「文書」の文字連鎖が含まれるかどうかを判定し、この「文書」が含まれる索引ファイル１３３ｃを検出する。さらに、この索引ファイル１３３ｃに格納されている索引データ１３６として対応する文書番号「本文Ｘ」を取得する。次いで、「本文Ｘ」に該当する実体データ１３９を読み出して検索結果として出力する。なお、検索結果としては、第１段階として文書番号を基にした文書データの識別情報のみを出力し、その後ユーザの指示に応じて実体データを出力するようにしても良い。
【０１０２】
また、図１５（Ｃ）は多言語文書データに対する検索の第２例である。この第２例のように、検索文字列と検索言語種別に加えてページ間隔を指定した多言語検索文字列データ１３７を用いて検索することも可能である。このページ間隔は、検索文字列が所定の範囲内にまとまって存在するか又はバラバラに存在するかを判定するいわゆる近傍検索に用いられるもので、一致した文字列の出現位置の間隔の指定範囲（同一の検索文字列の出現範囲指定値）に対応するものである。ここでは、検索文字列が「文」、検索言語種別が「日本語」、ページ間隔として「５ページ以内」が指定された場合の動作を示す。
【０１０３】
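In this case, the search language type is determined from the input multilingual search string data 137, the Japanese index files are checked for the character chain of the search string 「文」, and index files 133c and 133e, which contain 「文」, are detected. Then, as the index data 138 stored in these index files 133c and 133e, the data 「本文Ｘ, P3, 文書」 and 「本文Ｘ, P7, 文章」, which include the page numbers P3 and P7, are obtained. Next, the page interval is calculated as 7 − 3 + 1 = 5 pages, and it is determined whether this falls within the specified page interval of "within 5 pages". Since the interval is within 5 pages in this case, the document number 「本文Ｘ」 of the index data corresponding to index files 133c and 133e is obtained, and the entity data 139 corresponding to 「本文Ｘ」 is read out and output as the search result.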
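The page-interval check in this second search example can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function names and the `(document, page)` hit format are assumptions, while the inclusive interval formula (last page − first page + 1) follows the 7 − 3 + 1 = 5 calculation in the text.

```python
def page_span(pages):
    """Inclusive page interval covered by the hits, e.g. P3..P7 -> 5 pages."""
    return max(pages) - min(pages) + 1


def documents_within_interval(hits, max_pages):
    """Return documents whose hit pages all fall within max_pages.

    hits: iterable of (document_number, page_number) pairs (assumed format).
    """
    pages_by_document = {}
    for document, page in hits:
        pages_by_document.setdefault(document, []).append(page)
    return [
        document
        for document, pages in pages_by_document.items()
        if page_span(pages) <= max_pages
    ]


# Hits for the search string in the Japanese index: pages P3 and P7 of one document.
hits = [("X", 3), ("X", 7)]
matches = documents_within_interval(hits, max_pages=5)
```

With the two hits above, the span is 5 pages, so the document satisfies a "within 5 pages" condition but would be rejected by a "within 4 pages" condition.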
【０１０４】
以上の手順により、多言語文書データのページ別の登録とともに、格納された多言語文書データに対する検索が行われ、検索文字列に一致した文書データが抽出される。
【０１０５】
第４実施形態では、多言語文書データを言語種別ごとかつ所定文字数ごとに複数ページに分割して、格納及び検索を行うようにしている。これにより、多言語文書データを管理する場合に、複数の異なる種類の言語に関するデータをページ別に取り扱うことができるため、言語別の管理がさらにしやすくなり、日本語と英語など複数言語が連続する文書データの登録及び検索が容易かつ高速に実行可能となる。
【０１０６】
以上説明したように、本実施形態によれば、多言語文書処理における文書管理において、１つのカラムに複数の言語の格納領域を備え、言語別に１つの格納領域又は複数の格納領域にデータを格納するか、又は、１つのカラムに格納するデータの言語を設定してデータの格納時に複数のカラムの中から該当する言語のカラムを自動的に識別することにより、多言語文書データを言語別に処理し言語別に格納することが可能となる。また、１つの文書データに対して複数のページに分割し、かつ言語種別ごとにページと言語種別を組にした索引ファイルを作成して言語別のカラムに格納することにより、検索文字列指定時に言語種別及びページごとにカラムにアクセスして検索することが可能となる。
【０１０７】
このとき、データベースのカラムにおいて、１つのカラムに複数の格納領域を多重化し、これらの格納領域の中の言語種別に対応する１つの格納領域にアクセスしたり、複数のカラムのそれぞれに言語種別を定義して該当する言語種別のカラムにアクセスすることが容易に実行可能である。
【０１０８】
上記作用により、複数の異なる種類の言語データを各々別々に又は種類別に扱うことができ、その結果、多言語文書検索において言語別の検索を行う場合に、指定した言語の索引を直ちにアクセスして探索できるので、多言語文書を高速に検索することができる。また、特定の言語だけの索引を削除することも可能であり、１つの言語しかなかった索引を多言語に拡張することも容易に行うことができるため、規模の縮小や拡大などのスケーラビリティが高いデータベースを構築できるなど、多大な効果が得られる。
【０１０９】
【発明の効果】
以上説明したように本発明によれば、多言語文書に関する情報を言語ごとに区別して管理することができ、各情報に素早くアクセスして検索等の処理を容易かつ高速に行うことが可能となる効果が得られる。
【図面の簡単な説明】
【図１】本発明の第１実施形態に係る多言語文書処理装置の機能的概略構成を示すブロック図である。
【図２】第１実施形態に係る多言語文書データを格納及び参照する部分の機能的構成を示すブロック図である。
【図３】第１実施形態における多言語文書データの格納に関する多言語文書処理方法を概念的に示した説明図である。
【図４】第１実施形態において日本語の索引と英語の索引をそれぞれの格納領域に格納した状態を示す説明図である。
【図５】言語種別を識別するための言語識別情報や英単語又は英連語を置き換える特殊文字を示す説明図である。
【図６】第１実施形態において登録多言語文書データの索引及び検索文字列の索引を作成する手順を示す説明図である。
【図７】第２実施形態に係る多言語文書データを登録及び検索する部分の機能的構成を示すブロック図である。
【図８】第２実施形態における多言語文書データの格納に関する多言語文書処理方法を概念的に示した説明図である。
【図９】第２実施形態において日本語の索引と英語の索引をそれぞれのカラムに格納した状態を示す説明図である。
【図１０】第３実施形態に係る多言語文書データを格納及び検索する部分の機能的構成を示すブロック図である。
【図１１】第３実施形態における多言語文書データの格納に関する多言語文書処理方法を概念的に示した説明図である。
【図１２】第３実施形態において日本語の索引と英語の索引をそれぞれの格納領域に格納した状態を示す説明図である。
【図１３】第３実施形態において登録多言語文書データの索引及び検索文字列の索引を作成する手順を示す説明図である。
【図１４】第４実施形態に係る多言語文書データを格納及び検索する部分の機能的構成を示すブロック図である。
【図１５】第４実施形態における多言語文書データの格納及び検索に関する多言語文書処理方法を概念的に示した説明図である。
【図１６】従来の多言語文書処理装置の機能的概略構成を示すブロック図である。
【図１７】従来の多言語文書データの格納方法を概念的に示した説明図である。
【符号の説明】
１１登録文字列言語識別部
１２言語別索引作成部
１３言語別索引格納部
１４実体格納部
１５検索文字列言語識別部
１６検索文字列言語別索引作成部
１７言語別索引照合部
１８実体抽出部
２１入出力切替部
２２言語種別記憶部
２３，２４，２５データ格納部
３１カラム
３２属性
３３Ａ，３３Ｂ，３３Ｃ格納領域
３６言語種別情報

[0001]
[Technical field to which the invention belongs]
The present invention relates to a multilingual document processing apparatus and a multilingual document processing method used for registering and searching multilingual documents in the information processing field, and to a recording medium on which a program for executing the multilingual document processing method is recorded.
[0002]
[Prior art]
With the spread of computers and word processors in recent years, a large amount of digitized document data is accumulated, and a document database for searching for document data as necessary is being put to practical use. In document databases, with the development and internationalization of communication networks, opportunities to handle multilingual document data in which a plurality of languages are mixed are increasing.
[0003]
A conventional multilingual document processing method in a document database that stores and manages multilingual documents will be described with reference to FIGS. 16 and 17.
When a multilingual document is registered, the multilingual index creation unit 501 creates a multilingual index for search from the input multilingual document data to be registered, and stores the index in the multilingual index storage unit 502. The entity of the multilingual document data is stored in the entity storage unit 503. When a search is performed, the multilingual index collating unit 504 collates the input search character string, which indicates the search condition, with the multilingual index stored in the multilingual index storage unit 502, and outputs information on the documents that match the search condition as the search result. Based on this search result, the entity extraction unit 505 extracts the entity of the corresponding multilingual document data from the entity storage unit 503 and outputs it as a multilingual document.
[0004]
When the index or entity of such multilingual document data is stored, a method is generally adopted in which, as shown in FIG. 17, a tabular database structure consisting of columns and records is used and the multilingual document data is stored in a plurality of columns 511, 512, 513, ... of the database. In the columns 511 to 513, an attribute (a document name or the like) is defined for each column as the unit of access, and each column can be accessed only through its attribute. In this case, either the entire multilingual document data is stored as it is in the column 511, or an arbitrary part of the multilingual document data is stored in the column 511 and the other parts are stored in the columns 512 and 513. Thus, conventionally, multilingual document data containing a plurality of languages has been stored as it is in one or more columns according to the structure of the document, and searched in that form.
[0005]
Devices for processing multilingual information are disclosed in JP-A-1-213744, JP-A-11-3338, and elsewhere. In particular, with regard to the registration and retrieval of multilingual documents, JP-A-9-50442 discloses a multilingual document registration and retrieval device that creates and registers an index used for searching a document containing sentences in a plurality of languages, and searches for documents using that index.
[0006]
As methods for accessing a plurality of columns, the following have been disclosed: a method of linking the relevant column with a column of another table, as in JP-A-6-68151; a method of including join information and the like in the logical definition information of the data structure, as in JP-A-6-223118; and a method of providing a table that describes virtual entities, as in JP-A-8-137735.
[0007]
[Problems to be solved by the invention]
In the conventional multilingual document processing apparatus and method described above, when multilingual document data is stored and managed, it is stored with a plurality of languages mixed together, without the storage area taking language into account. As a result, managing the multilingual document data is laborious, and access during searches and similar operations takes time. In addition, when a plurality of columns storing multilingual document data are accessed, the access procedure becomes complicated and high-speed search is difficult.
[0008]
The present invention has been made in view of the above circumstances. Its object is to provide a multilingual document processing apparatus, a multilingual document processing method, and a recording medium that can manage information on multilingual documents separately for each language, and that can access each piece of information quickly so that processing such as search is performed easily and at high speed.
[0009]
[Means for Solving the Problems]
A multilingual document processing apparatus according to the present invention comprises: language identification means for identifying the languages of multilingual document data that includes characters of a plurality of languages; index creation means for creating an index of the multilingual document data for each language; index storage means for storing the index for each language; and search means for searching the multilingual document data using the index for each language.
[0010]
Preferably, the index storage means includes a plurality of storage areas obtained by dividing one column in a database, with a language type set for each area, and stores the index by selecting, from the plurality of storage areas, the storage area that corresponds to the language type.
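A minimal sketch of this arrangement, assuming Python and an in-memory structure, is shown below. The class name, method names, and language codes are hypothetical; the sketch only illustrates that one attribute (column) holds several storage areas and that the language type selects which area is accessed.

```python
class PartitionedColumn:
    """One column divided into per-language storage areas (illustrative only)."""

    def __init__(self, attribute, languages):
        self.attribute = attribute
        # One storage area (a list of index entries) per configured language type.
        self.areas = {language: [] for language in languages}

    def store(self, language, entry):
        # Select the storage area that matches the language type, then store.
        if language not in self.areas:
            raise KeyError(f"no storage area for language {language!r}")
        self.areas[language].append(entry)

    def area(self, language):
        # Only the area for the requested language type is accessed.
        return self.areas[language]


body = PartitionedColumn("body", ["ja", "en"])
body.store("ja", "これ")
body.store("en", "This")
body.store("ja", "れは")
```

After these calls, the Japanese entries and the English entry sit in separate areas of the same column, so an English lookup never scans Japanese entries.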
[0011]
Preferably, the index storage means includes a plurality of columns in a database, with a language type set for each column, and stores the index by selecting, from the plurality of columns, the column that corresponds to the language type.
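The per-column variant can be sketched in the same spirit. In this hypothetical Python fragment, a data definition table maps each attribute (column name) to a single language type, and storing an entry first resolves the column for that language. The attribute names `body_A` and `body_B` and the language codes are assumptions chosen for illustration.

```python
# Data definition information: attribute (column name) -> language type.
# Attribute names and language codes are illustrative assumptions.
DATA_DEFINITION = {"body_A": "ja", "body_B": "en"}

# One list of index entries per column.
columns = {attribute: [] for attribute in DATA_DEFINITION}


def column_for(language):
    """Return the attribute of the column defined for the given language type."""
    for attribute, defined_language in DATA_DEFINITION.items():
        if defined_language == language:
            return attribute
    raise LookupError(f"no column defined for language {language!r}")


def store(language, entry):
    """Store an index entry in the column designated for its language type."""
    columns[column_for(language)].append(entry)


store("ja", "これ")
store("en", "This")
```

Once the definition table is fixed, callers can address a column by attribute alone and the language type follows from the definition.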
[0012]
Preferably, the index storage means includes a plurality of storage units in which a storage language type, used when storing data, and a search language type, used when searching data, are set for each of a plurality of columns in a database or for each of a plurality of storage areas obtained by dividing one column. The index storage means stores the index by selecting, from the plurality of storage units, the storage unit that corresponds to the storage language type. At search time, the search means refers to the storage units whose search language type includes the specified language type, and performs the search using the indexes of those storage units.
[0013]
Furthermore, it is preferable that exactly one language type is set as the storage language type for each column or storage area constituting a storage unit.
[0014]
It is also preferable that the index storage means includes, as the plurality of storage units, a plurality of storage areas obtained by dividing one column in the database, with a storage language type and a search language type set for each area; that exactly one language type is set as the storage language type for each storage area; and that one of these storage language types is set as the language type of the column.
[0015]
It is also preferable that the search language type consists of a set of language types containing at least one language type, that this set is assigned to each column or storage area constituting a storage unit, and that one language type in the set is the storage language type set for that storage unit.
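The relationship between storage language types and search language types can be illustrated with a short Python sketch. The class and function names are hypothetical. The two sample units mirror the example in the description of the third embodiment, where one area stores Japanese and is searchable as Japanese, and the other stores English and is searchable as Japanese or English.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    name: str
    storage_language: str        # exactly one language type per unit
    search_languages: frozenset  # set of language types; must include storage_language
    entries: list = field(default_factory=list)

    def __post_init__(self):
        if self.storage_language not in self.search_languages:
            raise ValueError("search language set must include the storage language")


def select_for_store(units, language):
    """Pick the single unit whose storage language type matches."""
    return next(unit for unit in units if unit.storage_language == language)


def select_for_search(units, language):
    """Pick every unit whose search language set contains the specified language."""
    return [unit for unit in units if language in unit.search_languages]


units = [
    StorageUnit("102A", "ja", frozenset({"ja"})),
    StorageUnit("102B", "en", frozenset({"ja", "en"})),
]
japanese_targets = [unit.name for unit in select_for_search(units, "ja")]
english_targets = [unit.name for unit in select_for_search(units, "en")]
```

Specifying Japanese at search time reaches both units, while specifying English reaches only the unit that stores English, which is the narrowing effect the text describes.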
[0016]
Preferably, the apparatus further comprises page dividing means for dividing the multilingual document data, for each language, into a plurality of pages each within a predetermined number of characters, and the index creating means creates an index for each page for each language.
[0017]
Preferably, the apparatus further comprises entity storage means for storing the entity of the multilingual document data in one column or separately in a plurality of columns in the database, and the entity of the multilingual document data and the index of the multilingual document data are stored in separate storage means.
[0018]
Preferably, the language identifying means identifies a language based on language identification information included in the multilingual document data, and the index creating means converts the language identification information into a predetermined special character and creates, for each language, a character chain of all characters including the special character.
[0019]
Preferably, the index creating means converts words or two-letter collocations in the multilingual document data into predetermined corresponding characters, and creates, for each language, a character chain of all characters including the corresponding characters.
[0020]
Further, when the multilingual document data consists of a single character, the index creating means preferably creates a character chain by adding a predetermined special character that can be combined with any character and is not a search target.
[0021]
Preferably, the index includes document information for identifying the document of the corresponding multilingual document data, page information indicating a page obtained by dividing the document into predetermined units, and information on the relative appearance rank or absolute appearance position of characters in the document or page.
[0022]
The multilingual document processing apparatus according to the present invention comprises language identifying means for identifying the language of multilingual document data including characters in a plurality of languages, page dividing means for dividing the multilingual document data into a plurality of pages for each language and for each predetermined unit, index creating means for creating an index relating to the multilingual document data for each page for each language, and index storage means for storing the index for each language.
[0023]
Preferably, an entity storage means for storing the entity of the multilingual document data is provided.
[0024]
Preferably, the apparatus further comprises search means for searching the multilingual document data by using the index for each language to determine whether the search character string is contained in it.
[0025]
Preferably, the language identifying means identifies a language based on language identification information included in the multilingual document data, and the page dividing means stores the character string from one item of language identification information to the next either as a single page or, divided at every predetermined unit, as consecutive pages.
[0026]
Preferably, the index creating means creates an index that includes a document number for identifying the document of the corresponding multilingual document data, page information indicating a page in the document, and information on the relative appearance order or absolute appearance position of characters in the document or page.
[0027]
Preferably, the apparatus further comprises entity extracting means for obtaining, based on the search result of the search means, the document information of the multilingual document data that includes the search character string, and for extracting the entity of the multilingual document data of the document corresponding to that document information.
[0028]
A multilingual document processing method according to the present invention comprises a language identifying step for identifying the language of multilingual document data including characters in a plurality of languages, an index creating step for creating an index relating to the multilingual document data for each language, and an index storing step for storing the index for each language.
[0029]
Preferably, in the index storing step, a plurality of storage areas are provided by dividing one column in the database, with a language type set for each storage area, and the index is stored by selecting the storage area corresponding to the language type from the plurality of storage areas.
[0030]
Preferably, in the index storing step, a plurality of columns with a language type set for each column are provided in the database, and the index is stored by selecting the column corresponding to the language type from the plurality of columns.
[0031]
Preferably, the method further comprises a search step for searching for multilingual document data using the index for each language. In the index storing step, a plurality of storage units are provided, consisting of a plurality of columns or of a plurality of storage areas obtained by dividing one column in the database, in each of which a storage language type used at the time of data storage and a search language type used at the time of data search are set, and the index is stored by selecting the storage unit corresponding to the storage language type from the plurality of storage units. In the search step, the storage unit corresponding to a search language type that includes the language type specified at the time of data search is referred to, and the search is performed using the index of that storage unit.
[0032]
Preferably, the method further comprises a page dividing step for dividing the multilingual document data, for each language, into a plurality of pages each within a predetermined number of characters, and in the index creating step an index is created for each page for each language.
[0033]
Also, the multilingual document processing method according to the present invention comprises a language identifying step for identifying the language of multilingual document data including characters in a plurality of languages, a page dividing step for dividing the multilingual document data into a plurality of pages for each language and for each predetermined unit, an index creating step for creating an index relating to the multilingual document data for each page for each language, and an index storing step for storing the index for each language.
[0034]
The recording medium according to the present invention is a computer-readable recording medium on which a program for executing the multilingual document processing method according to the present invention is recorded.
[0035]
In the present invention, in document management for multilingual document processing, the language of multilingual document data including characters in a plurality of languages is identified, an index relating to the multilingual document data is created for each language, and the index is stored for each language. For this purpose, either a single column in the database is provided with storage areas for a plurality of languages and the data is stored in one or more storage areas for each language, or the language of the data to be stored is set for each column and, when data is stored, the column of the corresponding language is identified from among the plurality of columns and the data is stored there. Thereby, multilingual document data can be processed for each language and stored for each language. Alternatively, one item of multilingual document data is divided into a plurality of pages for each predetermined unit, and an index is created for each page for each language type and stored for each language. This makes it possible, when a search character string is specified, to search by accessing the index for each language type and page.
[0036]
By the above operation, it becomes possible to handle data related to a plurality of different types of languages separately or by type, and the procedure for data management is simplified. In addition, when accessing a column or a storage area in the column for data storage at the time of registration or data collation at the time of search, it is possible to access only the corresponding storage area depending on the language type. Fast access enables high-speed registration and retrieval of multilingual document data.
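The page division mentioned above can be illustrated with a minimal sketch. It assumes that the "predetermined unit" is a maximum number of characters per page and that the document has already been split into (language, text) segments; the function name `divide_into_pages` and the sample data are hypothetical and are not taken from the specification.

```python
def divide_into_pages(segments, max_chars):
    """Split each language segment into pages of at most max_chars characters.

    segments: [(language, text), ...] in document order.
    Returns [(page_number, language, page_text), ...]; a page never mixes languages.
    """
    pages = []
    for language, text in segments:
        for start in range(0, len(text), max_chars):
            page_number = len(pages) + 1
            pages.append((page_number, language, text[start:start + max_chars]))
    return pages


pages = divide_into_pages([("Japanese", "abcde"), ("English", "xy")], max_chars=2)
```

Because every page holds text of exactly one language, an index created per page can be stored in, and later read from, the storage area of that language only.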
[0037]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
In the present embodiment, the index creation and storage processing for searching, and the search processing using the index, that are performed when multilingual documents are managed are described as the multilingual document processing apparatus and method. The multilingual document processing apparatus and method according to the present invention are described in detail in the description of each embodiment. Since the recording medium according to the present invention is a recording medium on which a program for executing the multilingual document processing method is recorded, its description is included in the following description of the multilingual document processing method.
[0038]
[First Embodiment]
FIG. 1 is a block diagram showing the schematic functional configuration of a multilingual document processing apparatus according to the first embodiment of the present invention, and FIG. 2 is a block diagram showing the functional configuration of the part that stores and references multilingual document data.
[0039]
As shown in FIG. 1, the multilingual document processing apparatus according to the present embodiment is configured to store and manage indexes and other data related to multilingual document data for each language. It has: a registered character string language identification unit 11, corresponding to the language identifying means, that identifies the data to be registered for each language; a language-specific index creation unit 12, corresponding to the index creating means, that creates a language-specific index for the multilingual document data; a language-specific index storage unit 13, corresponding to the index storage means, that stores the created index data in storage areas provided for each language; an entity storage unit 14, corresponding to the entity storage means, that stores the entity data of the multilingual document to be registered; a search character string language identification unit 15 that identifies, for each language, the search character string input at the time of retrieval; a search character string language-specific index creation unit 16 that creates a language-specific index for the search character string; a language-specific index matching unit 17, corresponding to the search means, that performs a search by matching the language-specific index of the search character string against the language-specific indexes of the registered multilingual documents; and an entity extraction unit 18 that extracts the entity data of a multilingual document based on the result of the language-specific index matching.
[0040]
FIG. 2 shows, as the functional configuration of the main part of the multilingual document processing apparatus according to the first embodiment, the part that stores and refers to multilingual document data by language. The first embodiment has an input/output switching unit 21 that switches the storage destination of multilingual document data (index data or entity data) according to the language type, a language type storage unit 22 that stores information on the language type identifying the storage destination of the multilingual document data, and data storage units 23, 24, and 25 that store data of language α, language β, and language γ, respectively. The portion shown in FIG. 2 mainly corresponds to the language-specific index storage unit 13 in the multilingual document processing apparatus shown in FIG. 1.
[0041]
The input/output switching unit 21 switches input and output by referring to the language type information stored in the language type storage unit 22, so that when the language type of the multilingual document data accessed for storage or reference is language α, the language α data storage unit 23 is accessed; when it is language β, the language β data storage unit 24 is accessed; and when it is language γ, the language γ data storage unit 25 is accessed. Here, for the sake of explanation, the language type takes three values, language α to language γ. However, the range of values of the language type is not limited to this, and the number of data storage units corresponding to the language types may be any number of two or more.
[0042]
FIG. 3 conceptually shows the multilingual document processing method related to the storage of multilingual document data in the first embodiment. In the first embodiment, one column in the database structure is divided into a plurality of storage areas, and multilingual document data divided for each language type is stored in each storage area.
[0043]
As shown in FIG. 3A, an attribute (column name) 32 representing the unit of access, such as a document name, is defined for the column 31, and multilingual document data can be stored and referenced by accessing the column through this attribute 32. The column 31 is divided into a plurality of storage areas 33A, 33B, and 33C provided for the language types of language α, language β, and language γ, corresponding to the data storage units 23, 24, and 25. Further, as shown in FIG. 3B, language type information 36 indicating the language type assigned to each storage area in the column 31 is set, corresponding to the language type storage unit 22, and is stored at a predetermined location outside or inside the column 31.
[0044]
When the column 31 configured in this way is accessed, one of the storage areas 33A, 33B, and 33C is selected based on the language type information 36, and the storage area of the corresponding language type is accessed. Thus, when access to the column is instructed by specifying the attribute, only the storage area of the corresponding language type in the column is accessed, according to the language type of the multilingual document data concerned. When multilingual document data is stored in the column 31 of the attribute 32, the language type information 36 is referred to, and the storage area 33A is selected when the language type of the data to be stored is language α, the storage area 33B when it is language β, and the storage area 33C when it is language γ; the data is then stored in the selected storage area. Here, a case with three storage areas in the column 31 is shown, but the number of storage areas may be any number according to the number of language types of the multilingual document data.
[0045]
In addition, the entity data of multilingual documents is stored in one column or separately in a plurality of columns, and index data and entity data are stored in different storage means (columns, files, directories, recording media such as disks, etc.).
[0046]
In this way, by setting a plurality of data storage areas in the database column and providing a function for selecting a storage area by language type, a storage area can be selected for each column corresponding to one attribute by a plurality of languages. Thus, data can be stored.
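The column structure described above can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the class name `DividedColumn`, its methods, and the sample entries are assumptions, with the dictionary keys standing in for the language type information 36 and the lists standing in for the storage areas 33A to 33C.

```python
class DividedColumn:
    """One database column (one attribute) divided into per-language storage areas."""

    def __init__(self, attribute, language_types):
        self.attribute = attribute
        # Language type information: one storage area per language type.
        self.areas = {language: [] for language in language_types}

    def store(self, language_type, data):
        # Select the storage area from the language type, then store the data there.
        self.areas[language_type].append(data)

    def read(self, language_type):
        # Only the storage area of the requested language type is accessed.
        return list(self.areas[language_type])


column = DividedColumn("document name", ["alpha", "beta", "gamma"])
column.store("alpha", "first alpha entry")
column.store("beta", "first beta entry")
column.store("alpha", "second alpha entry")
```

A caller addresses the column by its single attribute and supplies the language type; the other storage areas are never touched, which is the property the embodiment relies on for fast access.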
[0047]
Next, for the multilingual document processing apparatus and method described above, the operation procedure is described for the case where document data in which Japanese and English are mixed is handled as a multilingual document composed of sentences in a plurality of languages, and the Japanese and English indexes are stored in, and searched from, the storage areas for the respective languages in one column.
[0048]
FIG. 4 is an explanatory diagram showing a state in which a Japanese index and an English index are stored in their respective storage areas, FIG. 5 is an explanatory diagram showing the language identification information for identifying a language type and the special characters that replace English words or English collocations, and FIG. 6 is an explanatory diagram showing the procedure for creating an index of registered multilingual document data and an index of a search character string.
[0049]
In this example, as shown in FIG. 4, a storage area 42A whose language type is English and a storage area 42B whose language type is Japanese are provided in the column 41 whose attribute is “text”, and an index is stored for each language. In the present embodiment, the multilingual document data is assumed to contain, as language identification information, <Japanese>, indicating that the following character string is Japanese, and <English>, indicating that it is English. In addition, it is assumed that Japanese characters are 2 bytes and English characters are 1 byte, each of which is a divided sentence. The language identification information is not limited to information provided for each character string (such as a tag) at the position where the language switches, as described above, but may be provided for each character. As language identification information, tags of structured documents, identification codes and control codes including font information for switching character fonts, escape sequences of the JIS X 0202 (ISO 2022) code extension system, and the like can be used. If the language type can be identified from the character codes alone, language identification information is not required, and even multilingual document data without language identification information can be indexed and stored for each language in the same manner as described below.
[0050]
FIG. 5A shows the correspondence between the language identification information and the special characters that replace it. When an index is created, “<Japanese>” is replaced with the special character “^V” and “<English>” with the special character “^W”. Also, when an English index is created, a character string representing an English word is replaced as a whole with one special character, as shown in FIG. 5B, or an English collocation (a string of English letters) is replaced as a unit (here, every two characters) with one special character, as shown in FIG. 5C. Here, as shown in FIG. 5B, the word “This” in the English character string is converted to the corresponding character “0x1” and “is” to “0x2” (0x indicates a hexadecimal number).
[0051]
If the multilingual document data to be indexed consists of a single character, a special character that can be combined with any character and is not a search target (for example, a character to which a character code corresponding to an unused control code is assigned) is appended to the character to create a character chain.
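A minimal sketch of this single-character case follows. The padding character `PAD` and the function name `character_chains` are hypothetical; the specification only requires a special character that is not a search target, and the control code chosen here is an assumption.

```python
PAD = "\x1f"  # assumed: an otherwise unused control code that is never searched for


def character_chains(text):
    """Return the two-character chains of text, padding a single character with PAD."""
    if len(text) == 1:
        text += PAD
    return [text[i:i + 2] for i in range(len(text) - 1)]


single = character_chains("a")
normal = character_chains("abc")
```

Without the padding, a one-character document would yield no two-character chain at all and could not be registered in the index.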
[0052]
FIG. 6A shows the procedure for creating an index for registration from the registered multilingual document data 43, which reads “This is”, “This is”, “sentence” with the first and last parts in Japanese and the middle part in English (the Japanese character strings are shown here in translation). The registered multilingual document data 43 includes language identification information 44a, 44b, and 44c, which distinguish the Japanese character strings “This is” and “sentence” from the English character string “This is”. At this time, according to the correspondence table in FIG. 5A, the language identification information “<Japanese>” is replaced with the special character “^V” and “<English>” with “^W”. According to the correspondence table in FIG. 5B, “This” is converted to “0x1” and “is” to “0x2”. Thereby, the language identification information is replaced with special characters, which are shared by the adjoining ends of the character strings of each language.
[0053]
Then, for the Japanese character string “This is”, indexes 45a, 45b, and 45c are created from “This is ^W”; for the English character string “This is”, indexes 45d, 45e, and 45f are created from “^W 0x1 0x2 ^V”; and for the Japanese character string “sentence”, indexes 45g, 45h, 45i, and 45j are created from “^V sentence”. In this example, for simplicity, the special character corresponding to the language identification information 44a at the beginning of the registered multilingual document data 43 is omitted, but the special character “^V” may be added to the beginning of the character string so that the index is created from “^V This is ^W”. The indexes created in this way are two-character chains and, although not shown, are stored as index data that includes, for each character chain, information on its relative appearance rank or absolute appearance position in the document.
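The index creation just described can be sketched as below. This is an illustration under stated assumptions rather than the method of FIG. 6: the function names, the word table `WORD_TO_CODE`, the use of a running chain number as the appearance rank, and the sample text (a three-character and a four-character Japanese string chosen to give the same chain counts as indexes 45a to 45j) are all hypothetical.

```python
import re

# Special characters that replace the language identification information.
TAG_TO_SPECIAL = {"<Japanese>": "^V", "<English>": "^W"}
TAG_TO_LANGUAGE = {"<Japanese>": "Japanese", "<English>": "English"}
# Assumed word table: one corresponding character (shown as a code) per English word.
WORD_TO_CODE = {"This": "0x1", "is": "0x2"}


def split_by_language(text):
    """Return [(tag, body), ...] for text such as '<Japanese>...<English>...'."""
    parts = re.split(r"(<Japanese>|<English>)", text)
    # parts[0] precedes the first tag; after that, tags and bodies alternate.
    return [(parts[i], parts[i + 1]) for i in range(1, len(parts) - 1, 2)]


def tokens_for(tag, body):
    if TAG_TO_LANGUAGE[tag] == "English":
        return [WORD_TO_CODE[word] for word in body.split()]
    return list(body)  # Japanese: one token per character


def build_index(text, omit_leading_special=True):
    """Create two-token chains per language, each with its appearance rank."""
    segments = split_by_language(text)
    index = {"Japanese": [], "English": []}
    rank = 0
    for n, (tag, body) in enumerate(segments):
        tokens = tokens_for(tag, body)
        if n > 0 or not omit_leading_special:
            tokens.insert(0, TAG_TO_SPECIAL[tag])
        if n + 1 < len(segments):
            # The next segment's special character is shared as this segment's end.
            tokens.append(TAG_TO_SPECIAL[segments[n + 1][0]])
        for chain in zip(tokens, tokens[1:]):
            index[TAG_TO_LANGUAGE[tag]].append((chain, rank))
            rank += 1
    return index


registered = "<Japanese>これは<English>This is<Japanese>文章です"
index = build_index(registered)
```

With this sample, `index["Japanese"]` holds seven chains and `index["English"]` holds three, and the special characters appear at both ends of the English chains, so the boundary between languages is itself searchable.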
[0054]
At this time, the language type is identified from the language identification information 44a, 44b, and 44c in the input registered multilingual document data 43, an index corresponding to the character string of each language is created, and, as shown in FIG. 4, the indexes are stored in the storage areas 42A and 42B for each language. Here, first, the indexes 45a to 45c corresponding to the Japanese character string “This is” are created and stored in the Japanese storage area 42B; next, the indexes 45d to 45f corresponding to the English character string “This is” are created and stored in the English storage area 42A; and then the indexes 45g to 45j corresponding to the Japanese character string “sentence” are created and stored in the Japanese storage area 42B. As a result, the Japanese index is stored separately in the storage area 42B, and the English index is stored separately in the storage area 42A.
[0055]
When registered multilingual document data is searched, an index is created in the same way for the input search character string, and it is collated against the indexes of the stored multilingual document data to determine whether they match. Whether the multilingual document data contains a character string that hits the search character string is detected from the result of this index collation. Then, information related to the multilingual document data, such as the document name, is obtained from the attribute of the column storing the index data and output as a search result. In addition, the entity data of the multilingual document data is extracted and output in accordance with a user instruction.
[0056]
FIG. 6B shows the procedure for creating a search index from the search character string 46, which reads “This is”, “This is”, “sentence” with the first and last parts in Japanese and the middle part in English (the Japanese character strings are shown here in translation). The search character string 46 includes language identification information 47a, 47b, and 47c, which distinguish the Japanese character strings “This is” and “sentence” from the English character string “This is”. As with the registered multilingual document data 43 described above, the search character string indexes 48a to 48h are created for each of Japanese and English. First, the language type is set to Japanese by the first language identification information 47a, the indexes 48a, 48b, and 48c of the character string “This is ^W” are created, and they are collated against the indexes 45a, 45b, and 45c in the Japanese storage area 42B in order of appearance: the index 48a against the index 45a, the index 48b against the index 45b, and the index 48c against the index 45c.
[0057]
Next, the special character “^W” in the index 45c of the multilingual document data shows that the language switches to English at the end of the indexed character string, and the language type is set to English by the language identification information 47b of the search character string. The indexes 48d, 48e, and 48f of the character string “^W 0x1 0x2 ^V” are created and collated against the indexes 45d, 45e, and 45f in the English storage area 42A in the order of appearance of each character in the index, that is, the index 48d against the index 45d, the index 48e against the index 45e, and the index 48f against the index 45f. At this time, detecting the indexes 45c and 45d of the multilingual document data shows that the English “This is” follows the Japanese “This is”, and the special character “^V” in the index 45f shows that the language switches to Japanese at the end of the indexed character string. Then, the language type is set to Japanese by the language identification information 47c of the search character string, and the indexes 48g and 48h of the character string “^V sentence” are created and collated against the indexes 45g and 45h in the Japanese storage area 42B in the order of appearance of each character in the index, that is, the index 48g against the index 45g and the index 48h against the index 45h. At this time, detecting the index 45f and the index 45g of the multilingual document data shows that “sentence” follows the English “This is”.
[0058]
If the above collation shows that the indexes 48a to 48h of the search character string match the indexes 45a to 45h of the multilingual document data, it is detected that the character string corresponding to the character chains of these indexes, that is, the search character string 46, is included in the registered multilingual document data 43.
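The collation across language storage areas can be sketched as follows. The data layout (a dictionary per language mapping each chain to its appearance ranks), the function name `contains`, and the sample chains are assumptions made for illustration; the specification does not prescribe this representation.

```python
# Stored index, one dictionary per language storage area: chain -> appearance ranks.
stored = {
    "Japanese": {
        ("こ", "れ"): [0], ("れ", "は"): [1], ("は", "^W"): [2],
        ("^V", "文"): [6], ("文", "章"): [7], ("章", "で"): [8], ("で", "す"): [9],
    },
    "English": {("^W", "0x1"): [3], ("0x1", "0x2"): [4], ("0x2", "^V"): [5]},
}


def contains(stored, query):
    """query: [(language, chain), ...] in order of appearance in the search string.

    The search string is contained if every chain is found in the storage area of
    its own language at consecutive appearance ranks.
    """
    if not query:
        return False
    first_language, first_chain = query[0]
    for start in stored[first_language].get(first_chain, []):
        if all(start + offset in stored[language].get(chain, [])
               for offset, (language, chain) in enumerate(query)):
            return True
    return False


query = [
    ("Japanese", ("こ", "れ")), ("Japanese", ("れ", "は")), ("Japanese", ("は", "^W")),
    ("English", ("^W", "0x1")), ("English", ("0x1", "0x2")), ("English", ("0x2", "^V")),
    ("Japanese", ("^V", "文")), ("Japanese", ("文", "章")),
]
hit = contains(stored, query)
miss = contains(stored, [("English", ("0x1", "0x2")), ("Japanese", ("^V", "文"))])
```

Each lookup touches only the storage area of the chain's own language, while the shared special characters and consecutive ranks confirm that the strings of different languages are adjacent in the document.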
[0059]
In the above example, character strings in two different languages, Japanese and English, are registered and searched as one continuous character string. However, the indexes stored separately for each language can also be used individually to search by language. For example, when the registered multilingual document data 43 is searched for “This” by an English search, the language type is set to English and only the index stored in the storage area 42A is collated.
[0060]
In the present embodiment, one column for storing index data of multilingual document data is divided into a plurality of storage areas, and data is stored in each storage area separately for each language. Thus, when multilingual document data is managed, data related to a plurality of different types of languages can be handled for each language, and the procedure for data management can be simplified. In addition, when a column is accessed for data storage at the time of registration or for data collation at the time of search, only the storage area corresponding to the language type needs to be accessed, so that registration and search can be performed easily and at high speed.
[0061]
[Second Embodiment]
FIG. 7 is a block diagram showing a functional configuration of a part for registering and searching multilingual document data according to the second embodiment.
[0062]
In the second embodiment, the functional configuration of the main part of the multilingual document processing apparatus has a data definition unit 51 that defines the attribute and language type of each column of the database so that multilingual document data can be stored and referenced by language, a language-specific registration unit 52 that performs registration processing, such as language-specific index creation, on input multilingual document data, a data storage unit 53 that stores the multilingual document data in the designated column, and a language-specific search unit 54 that performs search processing on the designated column for each language according to the language type.
[0063]
The language-specific registration unit 52 and the language-specific search unit 54 perform a corresponding language registration process and search process for the specified column, respectively, according to the language type defined by the data definition unit 51. As a result, it is possible to set a language type for each column in a plurality of columns, and simultaneously register and search a plurality of different language data in the corresponding designated columns. The number of attributes defined by the data definition unit 51 may be any number.
[0064]
FIG. 8 conceptually shows a multilingual document processing method related to storage of multilingual document data in the second embodiment. In the second embodiment, a language type is assigned and defined for each of a plurality of columns in the database structure, and multilingual document data divided for each language type is stored in each column.
[0065]
As shown in FIG. 8(A), there are columns 61, 62, and 63 for which attribute A, attribute B, and attribute C are defined, respectively, and, as shown in FIG. 8(B), data definition information 64 defines language α, language β, and language γ as the language types for the attributes of these columns. When data is stored in the plurality of columns 61, 62, and 63, the column of the attribute corresponding to the language type is determined by referring to the data definition information 64, and that column (the designated column) is accessed. Thereby, when the entity and index of multilingual document data are registered, language processing such as index creation can be performed for each language type and the data to be registered can be stored in the corresponding column. When data stored in the plurality of columns 61, 62, and 63 is searched, the column of the attribute corresponding to the language type is determined by referring to the data definition information 64, and that column (the designated column) is accessed, so that the search can be performed with language processing such as search character string matching carried out for each language type.
[0066]
Next, for the multilingual document processing apparatus and method according to the second embodiment, the operation procedure is described for the case where document data in which Japanese and English are mixed is handled as a multilingual document composed of sentences in a plurality of languages, and the Japanese and English indexes are stored in, and searched from, the column corresponding to each language.
[0067]
FIG. 9 is an explanatory diagram showing a state in which a Japanese index and an English index are stored in their respective columns. Here, the data definition information 73 is set as shown in FIG. 9B, and, as shown in FIG. 9A, a column 71 with the attribute “text A” and the language type “Japanese” and a column 72 with the attribute “text B” and the language type “English” are provided, with the index for each language stored in the corresponding column. The registered multilingual document data and the search character string are the same as in the example shown in FIG. 6.
[0068]
When an index of registered multilingual document data is created and stored, the Japanese character string indexes 45a to 45c and 45g to 45j are stored in the column 71 by designating the corresponding attribute “text A”, and the English character string indexes 45d to 45f are stored in the column 72 by designating the corresponding attribute “text B”.
[0069]
When searching with the search character string 46, first, the indexes 48a to 48c of the Japanese character string are collated against the indexes 45a to 45c stored in the column 71 of the attribute “text A” in the order of appearance of each character in the index. Next, the indexes 48d to 48f of the English character string are collated against the indexes 45d to 45f stored in the column 72 of the attribute “text B” in the order of appearance of each character in the index. At this time, detecting the indexes 45c and 45d of the multilingual document data shows that the English “This is” follows the Japanese “This is”. Then, the indexes 48g and 48h of the Japanese character string are collated against the indexes 45g and 45h stored in the column 71 of the attribute “text A” in the order of appearance of each character in the index. At this time, detecting the index 45f and the index 45g of the multilingual document data shows that “sentence” follows the English “This is”.
[0070]
If the above collation shows that the indexes 48a to 48h of the search character string match the indexes 45a to 45h of the multilingual document data, it is detected that the character string corresponding to the character chains of these indexes, that is, the search character string 46, is included in the registered multilingual document data 43.
[0071]
In the second embodiment, a plurality of columns for storing index data and other data of multilingual document data are defined for each language type, and the data is stored in each column with the columns distinguished by language. Thus, as in the first embodiment, when multilingual document data is managed, data related to a plurality of different types of languages can be handled by language, and the registration and search of document data in which a plurality of languages such as Japanese and English are continuous can be executed easily and at high speed.
[0072]
This second embodiment is particularly effective when multilingual document data is managed by a method in which data such as an index related to each language is stored in one dedicated column and languages are searched separately. Further, in the multilingual document processing apparatus and method of the second embodiment, once the column attributes are defined for each language, the columns for each language can be accessed and searched without being aware of the language type. For example, if “text A” is specified as an attribute, the language type is Japanese, and a Japanese character string can be registered and searched. Similarly, if “text B” is specified, an English character string can be registered and searched.
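The attribute-based access described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `LanguageColumns`, its methods, and the placeholder index entries are hypothetical, and an index entry is reduced to a plain string.

```python
class LanguageColumns:
    """Hypothetical model of columns whose language type is fixed by attribute."""

    def __init__(self, attribute_languages):
        # Example mapping: {"text A": "Japanese", "text B": "English"}
        self.language_of = dict(attribute_languages)
        self.columns = {attr: [] for attr in attribute_languages}

    def register(self, attribute, index_entry):
        # The caller names only the attribute; the language follows from it.
        self.columns[attribute].append(index_entry)

    def search(self, attribute, index_entry):
        # Only the column for the given attribute is examined.
        return index_entry in self.columns[attribute]


cols = LanguageColumns({"text A": "Japanese", "text B": "English"})
cols.register("text A", "ja-index-1")  # placeholder index entries
cols.register("text B", "en-index-1")
```

Because the language is bound to the attribute when the column is defined, a caller that searches “text B” never touches the Japanese column.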
[0073]
[Third Embodiment]
FIG. 10 is a block diagram showing a functional configuration of a part for storing and retrieving multilingual document data according to the third embodiment.
[0074]
In the third embodiment, the main part of the multilingual document processing apparatus is configured so that multilingual document data can be stored and referred to by language. It includes a storage area selection unit 81 that selects the storage destination of multilingual document data at the time of storage; a language type storage unit 82 that stores the language type at the time of storage and search; data storage units 83, 84, and 85 that store data of language α, language β, and language γ, respectively; a storage language type storage unit 86 that stores the language type stored in each of the data storage units 83, 84, and 85; a search area selection unit 87 that selects among the data storage units 83, 84, and 85 at the time of search; and a search language type storage unit 88 that stores the set of search language types for each of the data storage units 83, 84, and 85. Three language types are used here for explanation, but the number of language types and corresponding data storage units may be any number of two or more.
[0075]
When storing index data of multilingual document data or the like, the storage area selection unit 81 refers to the language type information stored in the storage language type storage unit 86 to identify which storage language type corresponds to the language type at the time of storage input to the language type storage unit 82, selects the data storage unit corresponding to that storage language type from the data storage units 83, 84, and 85, and stores the data. Likewise, when searching for multilingual document data, the search area selection unit 87 refers to the set information stored in the search language type storage unit 88 to identify which search language type sets include the language type at the time of search input to the language type storage unit 82, selects the data storage unit or units corresponding to those search language types from the data storage units 83, 84, and 85, and searches the data.
[0076]
FIG. 11 conceptually shows a multilingual document processing method related to storage of multilingual document data in the third embodiment. In the third embodiment, one column in the database structure is divided into a plurality of storage areas, a set of storage language type and search language type is set in each storage area, multilingual document data divided for each language type is stored in the corresponding storage areas, and the corresponding storage areas are accessed according to the language type of the search character string to perform a search.
[0077]
As shown in FIG. 11A, the column 91 defines an attribute 92 representing a unit to be accessed, such as a document name, and is divided into a plurality of storage areas 93A, 93B, and 93C provided for the language types α, β, and γ so as to correspond to the data storage units 83, 84, and 85. Note that the attribute 92 includes information on the main language type of the multilingual document data. This example shows, as in the first embodiment, a case where one column is divided into a plurality of storage areas and data is stored by language. However, the same effect can be obtained when, as in the second embodiment, a storage language type and a search language type are defined for each of a plurality of columns and data is stored by language.
[0078]
Further, as shown in FIG. 11B, language type information 96 indicating the storage language type and the search language type assigned to each storage area in the column 91 is set in correspondence with the storage language type storage unit 86 and the search language type storage unit 88, and is stored in a predetermined location outside or inside the column 91. The search language type indicates a set of language types, including the storage language type, used in the multilingual document data. For example, the search language type E is language α, the search language type F is language α and language β, and the search language type G is language α and language γ. Here, the storage language type is set to a single language type for each column or storage area. The search language type is a set of one or more language types, one of which is the storage language type.
[0079]
When the input multilingual document data is stored in the column 91, one of the storage areas 93A, 93B, and 93C is selected on the basis of the storage language type in the language type information 96, and the data is stored by accessing the storage area of the corresponding language type. That is, the storage area 93A is selected when the storage language type is language α, the storage area 93B is selected when the storage language type is language β, and the storage area 93C is selected when the storage language type is language γ. When multilingual document data stored in the column 91 is searched, the storage area or areas are selected on the basis of the search language type in the language type information 96, and the data is referred to by accessing the storage areas of the corresponding language type. Here, the storage area 93A corresponds to the search language type E, the storage area 93B to the search language type F, and the storage area 93C to the search language type G. That is, when language α is specified, all the storage areas 93A, 93B, and 93C are selected because language α belongs to the search language types E, F, and G; when language β is specified, the storage area 93B is selected; and when language γ is specified, the storage area 93C is selected. Although the column 91 is shown here with three multiplexed storage areas, the number of multiplexed storage areas is not limited.
[0080]
In this way, by setting a plurality of data storage areas in a database column and providing a function for selecting each storage area based on the combination of the language type to be stored and the language type to be searched, data can be stored and retrieved by selecting storage areas for a plurality of languages within a column corresponding to one attribute.
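The selection rule for the column 91 can be sketched as follows. The storage and search language types are taken from the example in the text (search language types E, F, and G); the table layout, the function names, and the identifiers `alpha`, `beta`, and `gamma` standing in for the languages α, β, and γ are assumptions made for illustration.

```python
# Language type information 96, modelled as a dictionary (assumed layout).
LANGUAGE_TYPE_INFO = {
    "93A": {"storage": "alpha", "search": {"alpha"}},           # search type E
    "93B": {"storage": "beta", "search": {"alpha", "beta"}},    # search type F
    "93C": {"storage": "gamma", "search": {"alpha", "gamma"}},  # search type G
}


def select_for_storage(language):
    """Return the single storage area whose storage language type matches."""
    matches = [area for area, info in LANGUAGE_TYPE_INFO.items()
               if info["storage"] == language]
    if len(matches) != 1:
        raise ValueError(f"no unique storage area for {language!r}")
    return matches[0]


def select_for_search(language):
    """Return every storage area whose search language set contains the language."""
    return sorted(area for area, info in LANGUAGE_TYPE_INFO.items()
                  if language in info["search"])
```

Storage always resolves to exactly one area, whereas a search may fan out to several: `select_for_search("alpha")` yields all three areas, while `select_for_search("beta")` yields only 93B.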
[0081]
Next, the operation procedure of the multilingual document processing apparatus and method according to the third embodiment will be described for a case where document data in which Japanese and English are mixed is used as a multilingual document composed of sentences in a plurality of languages, and Japanese and English indexes are stored in the corresponding column and searched by language.
[0082]
FIG. 12 is an explanatory diagram illustrating a state in which a Japanese index and an English index are stored in respective storage areas, and FIG. 13 is an explanatory diagram illustrating a procedure for creating an index of registered multilingual document data and an index of a search character string.
[0083]
Here, as shown in FIG. 12, a case will be exemplified in which the column 101 having the attribute “text (Japanese)” is provided with a storage area 102A whose storage language type is Japanese and whose search language type is Japanese, and a storage area 102B whose storage language type is English and whose search language type is Japanese and English, and an index of each language is stored.
[0084]
In this case, the Japanese index is stored in the storage area 102A, and the English index is stored in the storage area 102B. When performing a search, the storage areas 102A and 102B are accessed when Japanese is specified, and only the storage area 102B is accessed when English is specified; the search is then performed using the search character string. When a search is performed with Japanese specified as the main language type, English characters are regarded as embedded in the Japanese text, and an index is created and searched in the same manner as for Japanese.
[0085]
FIG. 13A shows a procedure for creating a registration index from the registered multilingual document data 103 representing “This is this document”. As in the first embodiment, in the registered multilingual document data 103, the Japanese character strings 104a and 104c of “This is” and “Sentence” and the English character string 104b of “This is” are distinguished by language identification information. First, the character strings 104a, 104b, and 104c are concatenated with the language identification information omitted, and the concatenated character data 105 is obtained by converting the English character string “This is” into the corresponding characters “0x1 0x2”.
[0086]
Then, indexes 106a to 106h are created as Japanese character strings according to the main language type (in this case, Japanese) defined in the column 101. In this case, the indexes 106a to 106c, which relate to the Japanese character string “This is” and end with the index 106c indicating the concatenation of the Japanese character string 104a and the English character string 104b, are stored in the storage area 102A. The indexes 106d to 106e, which relate to the English character string “This is” and end with the index 106e indicating the concatenation of the English character string 104b and the Japanese character string 104c, are stored in the storage area 102B. The indexes 106f to 106h, which relate to the Japanese character string “sentence”, are stored in the storage area 102A.
[0087]
When searching for multilingual document data registered in this way, an index is created in the same manner for the input search character string, and it is determined whether that index matches the stored index of the multilingual document data. FIG. 13B shows a procedure for creating a search index from the search character string 107 representing “This is this is document”. Since the search character string 107 includes Japanese character strings 108a and 108c and an English character string 108b, Japanese, which is included in the search language type sets of both storage areas 102A and 102B, is designated. As a result, both the storage areas 102A and 102B are accessed and the search character string 107 can be searched.
[0088]
At this time, as in the case of the registered multilingual document data 103 described above, indexes 110a to 110f are created from the concatenated character data 109 of the search character string 107 and are collated with the stored indexes 106a to 106f in the order of appearance of each character. That is, the indexes 110a, 110b, and 110c are collated with the indexes 106a, 106b, and 106c in the Japanese storage area 102A in order of appearance, the indexes 110d and 110e are collated with the indexes 106d and 106e in the Japanese and English storage area 102B, and the index 110f is collated with the index 106f in the Japanese storage area 102A.
[0089]
If the indexes 110a to 110f of the search character string match the indexes 106a to 106f of the multilingual document data as a result of the above collation, it is detected that the character string corresponding to the character chain of these indexes, that is, the search character string 107, is included in the registered multilingual document data 103.
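The registration and collation procedure of FIG. 13 can be sketched as follows, under several assumptions that are not stated in this passage: a character chain is taken to be a pair of adjacent characters, each chain is filed in the storage area of its first character's language (which matches the placement of the indexes 106c and 106e above), and the Japanese strings are stand-ins, because the translated text does not preserve the original example. All function names and the `SPECIAL` table are hypothetical.

```python
# Hypothetical replacement table: each English word maps to one special character.
SPECIAL = {"This": "\x01", "is": "\x02"}

# Storage areas of column 101: Japanese chains in 102A, English chains in 102B.
AREA_OF = {"ja": "102A", "en": "102B"}


def concatenate(segments):
    """Flatten (language, text) segments into a list of (character, language)."""
    chars = []
    for lang, text in segments:
        if lang == "en":
            chars.extend((SPECIAL[word], lang) for word in text.split())
        else:
            chars.extend((ch, lang) for ch in text)
    return chars


def make_indexes(segments):
    """Create two-character chains, each filed under its first character's area."""
    chars = concatenate(segments)
    return [(AREA_OF[chars[i][1]], chars[i][0] + chars[i + 1][0])
            for i in range(len(chars) - 1)]


def register(segments):
    """Store each chain in its storage area, keyed by appearance position."""
    areas = {"102A": {}, "102B": {}}
    for pos, (area, chain) in enumerate(make_indexes(segments)):
        areas[area][pos] = chain
    return areas


def contains(areas, segments):
    """Collate the search chains with the stored chains in appearance order."""
    query = make_indexes(segments)
    if not query:
        return False
    first_area, first_chain = query[0]
    starts = [p for p, c in areas[first_area].items() if c == first_chain]
    return any(
        all(areas[area].get(start + k) == chain
            for k, (area, chain) in enumerate(query))
        for start in starts
    )


# Stand-in strings (assumed), chosen to mirror the structure of data 103 and 107.
registered = register([("ja", "これは"), ("en", "This is"), ("ja", "文章です")])
query = [("ja", "これは"), ("en", "This is"), ("ja", "文章")]
found = contains(registered, query)
```

With these stand-in strings the registered data yields eight chains and the search string six, the same counts as the indexes 106a to 106h and 110a to 110f, and the two boundary chains are what allow a match to run across the language switch.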
[0090]
In the third embodiment, one column for storing index data of multilingual document data is divided into a plurality of storage areas, and a set of storage language type and search language type is defined in each storage area. Data is stored in the storage area corresponding to the storage language type, and the storage area corresponding to the search language type is accessed and searched. Thus, as in the first embodiment, when managing multilingual document data, data related to a plurality of different types of languages can be handled by language, and registration and search of document data in which a plurality of languages such as Japanese and English are continuous can be executed easily and at high speed.
[0091]
The third embodiment is particularly effective when performing multilingual registration and search in which multilingual document data composed of a plurality of languages is registered and managed and its index is handled as an index of one language. For example, a search character string of the main language type (Japanese in the above example) can access and search the entire storage area without being particularly conscious of the language type, while a search character string in another language (English in the above example) accesses only a part of the storage area, so a high-speed search is possible.
[0092]
[Fourth Embodiment]
FIG. 14 is a block diagram showing a functional configuration of a part for storing and retrieving multilingual document data according to the fourth embodiment.
[0093]
In the fourth embodiment, the main part of the multilingual document processing apparatus has the following functional configuration so that multilingual document data can be stored page by page: a multilingual document data input unit 121 that reads multilingual document data and assigns document information (a document number) for identifying each document; a language type determination unit 122, corresponding to language identification means, that detects language identification information such as a tag from the input multilingual document data and determines the language type; a page dividing unit 123, corresponding to page dividing means, that assigns page numbers by language to the multilingual document data in units of document numbers on the basis of the determined language type; a language-specific index creation unit 124, corresponding to index creation means, that obtains the document number, page number, and language type and creates an index by language for the document data included in each page; a language-specific index storage unit 125, corresponding to index storage means, that stores the created index in a column by language; and an entity storage unit 126, corresponding to entity storage means, that stores the document number and the entity of the multilingual document data itself.
[0094]
The page dividing unit 123 divides multilingual document data by language in units of document numbers and assigns page numbers. When the length of the document data of a language type exceeds a preset length of one page, the data is further divided into a plurality of pages and a page number is assigned to each page. The language-specific index creation unit 124 calculates the appearance rank or appearance position of each character for the document data included in each page, and creates index data including the document number, page number, and character appearance rank or appearance position separately for each language type. The language-specific index storage unit 125 stores the created index data, for example, as an index file for each language type.
[0095]
In addition, so that multilingual document data can be searched at high speed, the apparatus has a search character string input unit 127 that reads the search character string and the specified search language type; a character string search unit 128, corresponding to search means, that performs a search by collating the index corresponding to the search language type stored in the language-specific index storage unit 125 with the search character string; and an entity extraction unit 129, corresponding to entity extraction means, that extracts the entity of the multilingual document data of the corresponding document number from the entity storage unit 126 on the basis of the search result of the character string search unit 128 and outputs it.
[0096]
The character string search unit 128 reads an index file corresponding to the specified search language type from the language-specific index storage unit 125, detects an index file including the search character string, determines whether the character string of the index data matches the search character string, and outputs the document number corresponding to the matched index data. The entity extraction unit 129 reads the document data entity corresponding to the document number acquired by the character string search unit 128 and outputs it as a search result.
[0097]
FIG. 15 conceptually shows a multilingual document processing method related to storage and retrieval of multilingual document data in the fourth embodiment. FIG. 15A shows an operation related to registration (index storage) of multilingual document data, and FIG. 15B shows an operation related to search of multilingual document data.
[0098]
When registering an index of multilingual document data, as shown in FIG. 15A, the registered multilingual document data 131 is divided for each language type and for each page, and an index is created and stored. In the registered multilingual document data 131, Japanese and English language types are distinguished by tags of <Japanese> and <English>. In addition to these languages, it is also possible to distinguish many languages such as Chinese and Korean with tags.
[0099]
First, “text X” is assigned to the input registered multilingual document data 131 as a document number. The document number may be a continuous number such as “Document 1” or an arbitrary number or code. The entity of the registered multilingual document data 131 is stored as entity data 139. Next, the language type of each character string in the registered multilingual document data 131 is determined from its tag, and the data is divided into a plurality of pages by language type, with a page number assigned to each page. The example of FIG. 15 shows a document record 132a in which the language type is Japanese and the page number P1 is assigned, a document record 132b in which the language type is English and the page number P2 is assigned, and document records 132c to 132g in which the language type is Japanese, the data is divided into a plurality of pages, and the page numbers P3 to P7 are assigned.
[0100]
Then, an index is created for each of the document records 132a to 132g divided into a plurality of pages, as in the above-described embodiments. In this embodiment, index data including the document number “text X”, the page numbers “P1” to “P7”, and character chain information is created and stored as an index file in a column for each language type. That is, the index data relating to the Japanese document records 132a and 132c to 132g is stored in the Japanese storage area as index files 133a to 133f, and the index data relating to the English document record 132b is stored in the English storage area as the index file 134a. As index data, not only the character chain but also the appearance rank and appearance position of each character may be stored together.
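The page division and per-language grouping described above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the page length is an arbitrary parameter, and the page text itself stands in for the character chain information that the embodiment stores as index data.

```python
def divide_into_pages(segments, page_length):
    """Split (language, text) segments into pages of at most page_length characters.

    Page numbers run consecutively through the document in order of appearance,
    so a language switch always starts a new page.
    """
    pages = []
    for lang, text in segments:
        for start in range(0, len(text), page_length):
            page_no = f"P{len(pages) + 1}"
            pages.append((page_no, lang, text[start:start + page_length]))
    return pages


def build_index_files(doc_no, segments, page_length):
    """Group index data (document number, page number, text) by language type."""
    index_files = {}
    for page_no, lang, text in divide_into_pages(segments, page_length):
        index_files.setdefault(lang, []).append((doc_no, page_no, text))
    return index_files


# Assumed sample: a short Japanese run, an English run, then a long Japanese run.
sample = [("Japanese", "a" * 4), ("English", "b" * 6), ("Japanese", "c" * 45)]
pages = divide_into_pages(sample, page_length=10)
index_files = build_index_files("text X", sample, page_length=10)
```

With a page length of 10, the 45-character Japanese run is split into five pages, so the sample produces pages P1 to P7 with the same language pattern as the document records 132a to 132g.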
[0101]
FIG. 15B shows a first example of search for multilingual document data stored as described above. The first example is an operation when the search character string is “document” and the search language type is “Japanese” as the multilingual search character string data 135. At this time, the search language type is determined based on the input multilingual search character string data 135, and a Japanese index file is designated. Then, it is determined whether or not the character string of the search character string “document” is included in the Japanese index file, and the index file 133c including this “document” is detected. Further, the document number “text X” corresponding to the index data 136 stored in the index file 133c is acquired. Next, the entity data 139 corresponding to “text X” is read and output as a search result. As a search result, only the identification information of the document data based on the document number may be output as the first step, and then the entity data may be output in accordance with a user instruction.
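The first search example can be sketched as follows. The data below is hypothetical, plain substring matching on page text replaces the collation of character chains, and the function name `search_documents` is an assumption made for illustration.

```python
def search_documents(index_files, entities, search_string, search_language):
    """Return {document number: entity} for matching pages of one language.

    Only the index file of the specified search language type is read.
    """
    hits = {}
    for doc_no, _page_no, text in index_files.get(search_language, []):
        if search_string in text:
            hits[doc_no] = entities[doc_no]
    return hits


# Hypothetical index files and entity store; page text stands in for chains.
index_files = {
    "Japanese": [("text X", "P1", "ja page one"), ("text X", "P3", "ja document")],
    "English": [("text X", "P2", "This is")],
}
entities = {"text X": "<entity data 139>"}

result = search_documents(index_files, entities, "document", "Japanese")
```

Returning only the document numbers first and fetching the entity on request, as the text allows, would amount to returning `hits.keys()` and deferring the lookup in `entities`.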
[0102]
FIG. 15C is a second example of search for multilingual document data. As in the second example, it is also possible to perform a search using the multilingual search character string data 137 in which a page interval is specified in addition to the search character string and the search language type. This page interval is used for so-called neighborhood search, which determines whether search character strings lie within a predetermined range of each other, and specifies the permitted range of the interval between the appearance positions of matched character strings (an appearance range specification value for the same search character string). Here, the operation when the search character string is “sentence”, the search language type is “Japanese”, and “within 5 pages” is specified as the page interval is shown.
[0103]
In this case, the search language type is determined based on the input multilingual search character string data 137, it is determined whether the character chain of the search character string “sentence” is included in the Japanese index file, and the index files 133c and 133e including this “sentence” are detected. Then, as the index data 138 stored in these index files 133c and 133e, the data “text X, P3, document” and “text X, P7, text”, which include the page numbers “P3” and “P7”, are acquired. Next, the page interval is calculated as 7 - 3 + 1 = 5 pages, and it is determined whether this satisfies the designated page interval of “within 5 pages”. In this case, since the interval does not exceed five pages, the document number “text X” of the index data corresponding to the index files 133c and 133e is acquired, and the entity data 139 corresponding to “text X” is read and output as a search result.
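The page interval test can be sketched as follows. The inclusive count (7 - 3 + 1 = 5) follows the calculation in the text; the function names and the grouping of hits by document number are assumptions made for illustration.

```python
def page_interval(page_numbers):
    """Number of pages spanned, counting both end pages (e.g. 7 - 3 + 1 = 5)."""
    return max(page_numbers) - min(page_numbers) + 1


def within_page_interval(hits, limit):
    """hits: (document number, page number) pairs matched for one search string.

    Return the document numbers whose matches all fall within `limit` pages.
    """
    pages_by_doc = {}
    for doc_no, page_no in hits:
        pages_by_doc.setdefault(doc_no, []).append(page_no)
    return [doc for doc, pages in pages_by_doc.items()
            if page_interval(pages) <= limit]


# Matches on pages P3 and P7 of "text X", as in the second search example.
hits = [("text X", 3), ("text X", 7)]
accepted = within_page_interval(hits, limit=5)
```

Tightening the limit to four pages would reject the same document, since the two matches span five pages.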
[0104]
Through the above procedure, the multilingual document data is registered for each page and the stored multilingual document data is searched, and the document data matching the search character string is extracted.
[0105]
In the fourth embodiment, multilingual document data is divided into a plurality of pages for each language type and for each predetermined number of characters for storage and retrieval. As a result, when managing multilingual document data, data related to multiple different types of languages can be handled on a page-by-page basis, making it easier to manage by language, and multiple languages such as Japanese and English are consecutive. Registration and retrieval of document data can be performed easily and at high speed.
[0106]
As described above, according to the present embodiments, in document management in multilingual document processing, multilingual document data can be processed and stored by language either by providing storage areas for a plurality of languages in one column and storing data of each language in one or more of those storage areas, or by setting the language of the data stored in each column and automatically identifying the corresponding language column from among a plurality of columns when storing data. In addition, by dividing a single piece of document data into a plurality of pages, creating for each language type an index file that combines page and language type, and storing it in a language-specific column, a search can be performed for the specified search character string by accessing the column for each language type and page.
[0107]
At this time, in the database, it is easy either to multiplex a plurality of storage areas in one column and access the one storage area corresponding to the language type, or to define a language type for each of a plurality of columns and access the column of the corresponding language type.
[0108]
With the above operation, data of a plurality of different types of languages can be handled separately by type. As a result, when searching by language in multilingual document search, the index of the specified language can be accessed and searched immediately, so multilingual documents can be searched at high speed. In addition, an index for only a specific language can be deleted, and an index that covers only one language can easily be extended to multiple languages, so a database with high scalability in terms of reduction or expansion of scale can be constructed.
[0109]
[Effects of the Invention]
As described above, according to the present invention, information related to multilingual documents can be managed separately for each language, and each piece of information can be accessed quickly, so that processing such as search can be performed easily and at high speed.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional schematic configuration of a multilingual document processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing a functional configuration of a part for storing and referring to multilingual document data according to the first embodiment.
FIG. 3 is an explanatory diagram conceptually showing a multilingual document processing method related to storage of multilingual document data in the first embodiment.
FIG. 4 is an explanatory diagram showing a state in which a Japanese index and an English index are stored in respective storage areas in the first embodiment.
FIG. 5 is an explanatory diagram showing language identification information for identifying a language type and special characters for replacing English words or English collocations.
FIG. 6 is an explanatory diagram showing a procedure for creating an index of registered multilingual document data and an index of search character strings in the first embodiment.
FIG. 7 is a block diagram showing a functional configuration of a part for registering and searching multilingual document data according to the second embodiment.
FIG. 8 is an explanatory view conceptually showing a multilingual document processing method related to storage of multilingual document data in the second embodiment.
FIG. 9 is an explanatory diagram showing a state in which a Japanese index and an English index are stored in respective columns in the second embodiment.
FIG. 10 is a block diagram showing a functional configuration of a part that stores and retrieves multilingual document data according to the third embodiment.
FIG. 11 is an explanatory diagram conceptually showing a multilingual document processing method related to storage of multilingual document data in the third embodiment.
FIG. 12 is an explanatory diagram showing a state in which a Japanese index and an English index are stored in respective storage areas in the third embodiment.
FIG. 13 is an explanatory diagram showing a procedure for creating an index of registered multilingual document data and an index of search character strings in the third embodiment.
FIG. 14 is a block diagram showing a functional configuration of a part for storing and retrieving multilingual document data according to the fourth embodiment.
FIG. 15 is an explanatory diagram conceptually showing a multilingual document processing method related to storage and retrieval of multilingual document data in the fourth embodiment.
FIG. 16 is a block diagram showing a schematic functional configuration of a conventional multilingual document processing apparatus.
FIG. 17 is an explanatory diagram conceptually showing a conventional method for storing multilingual document data.
[Explanation of symbols]
11 Registered character string language identifier
12 Language indexing section
13 Language-specific index storage
14 Entity storage
15 Search string language identifier
16 Search string language index creation part
17 Language-specific index matching section
18 Entity extraction unit
21 I/O switching part
22 Language type storage
23, 24, 25 Data storage
31 Column
32 Attribute
33A, 33B, 33C Storage area
36 Language type information

Claims

A multilingual document processing apparatus comprising:
language identification means for identifying the language of multilingual document data that includes characters of a plurality of languages and in which different languages are continuous;
index creation means for creating an index for the multilingual document data for each language;
index storage means for storing the index for each language; and search means for searching multilingual document data using the index for each language,
wherein the language identification means identifies a language by language identification information included in the multilingual document data, and the index creation means converts the language identification information into a predetermined special character and creates a character chain of all characters including the special character for each language.

A multilingual document processing apparatus comprising:
language identification means for identifying the language of multilingual document data that includes characters of a plurality of languages and in which different languages are continuous;
index creation means for creating an index for the multilingual document data for each language;
index storage means for storing the index for each language; and search means for searching multilingual document data using the index for each language,
wherein the index creation means converts words or two-letter collocations of the multilingual document data into predetermined corresponding characters, and creates a character chain of all characters including the corresponding characters for each language.

The multilingual document processing apparatus according to claim 1 or 2, wherein the index storage means includes a plurality of storage areas obtained by dividing one column in the database, a language type being set for each storage area, and selects the storage area corresponding to the language type from the plurality of storage areas to store the index.

The multilingual document processing apparatus according to claim 1 or 2, wherein the index storage means includes a plurality of columns in which a language type is set for each column in the database, and selects a column corresponding to the language type from the plurality of columns to store the index.

The multilingual document processing apparatus according to claim 1 or 2, wherein the index storage means includes a plurality of storage units in which a storage language type at the time of data storage and a search language type at the time of data search are set, each storage unit being one of a plurality of columns in the database or one of a plurality of storage areas obtained by dividing one column, and selects the storage unit corresponding to the storage language type from the plurality of storage units to store the index, and
the search means refers to the storage unit corresponding to a search language type that includes the language type specified at the time of data search, and searches the index in that storage unit.

The multilingual document processing apparatus according to claim 5, wherein the storage language type is set to a single language type for each column or storage area constituting the storage unit.

The multilingual document processing apparatus according to claim 5, wherein the index storage means includes, as the plurality of storage units, a plurality of storage areas obtained by dividing one column in the database, in each of which a storage language type and a search language type are set, the storage language type is set to a single language type for each storage area, and one of these storage language types is set as the language type of the column.

The multilingual document processing apparatus according to claim 5, wherein the search language type is composed of a set of language types including at least one language type and is set for each column or storage area constituting a storage unit, and one language type in the set is the storage language type set in that storage unit.

The multilingual document processing apparatus according to claim 1 or 2, further comprising a page dividing unit that divides the multilingual document data into a plurality of pages by language and within a predetermined number of characters, wherein the index creating unit creates an index for each page by language.

The multilingual document processing apparatus according to any one of claims 1 to 5 9, further comprising entity storage means for storing the entity of the multilingual document data in one column or a plurality of columns in the database, wherein the entity of the multilingual document data and the index of the multilingual document data are stored separately in the storage means.