JPH0950442A

JPH0950442A - Multilanguage document registration retrieval device

Info

Publication number: JPH0950442A
Application number: JP7221149A
Authority: JP
Inventors: Makoto Ando; 誠安藤; Akio Yamashita; 明男山下; Kazuo Aihara; 一雄相原; Tatsuomi Kita; 辰臣喜多; Naomi Hiraoka; 直美平岡; Hiroko Matsuo; 裕子松尾; Hiroshi Yamaguchi; 浩山口; Shinji Kawamoto; 真司川本
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-08-08
Filing date: 1995-08-08
Publication date: 1997-02-18
Anticipated expiration: 2015-08-08
Also published as: JP3666066B2

Abstract

PROBLEM TO BE SOLVED: To keep unregistered words to a minimum by registering extracted keywords together with document identifiers as an index, segmenting words from the input retrieval conditions, collating them with the keywords of the index, and reading out the documents that match the retrieval conditions. SOLUTION: A text database part 3 stores documents containing sentences in plural languages, and a multilanguage keyword extraction part 2 performs morphological analysis on each document according to the language of each sentence and extracts the keywords of the document. An index registration part 4 registers each extracted keyword as an index together with the identifier of the corresponding document. A retrieval condition input part 11 inputs the retrieval conditions, an index collation part segments words from the retrieval conditions and collates them with the keywords of the index, and a text extraction part 14 reads out documents according to the collation result. The index prepared at registration time is thereby also kept compact.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a multilingual document registration/retrieval apparatus that registers a retrieval index for documents containing sentences written in a plurality of languages and retrieves those documents. More specifically, it relates to a multilingual document registration/retrieval apparatus that extracts keywords, language by language, from a text database of documents written in a plurality of languages, registers them as an index, and searches the multilingual text database using the registered index.

[0002]

[Description of the Related Art] As a document retrieval apparatus for multilingual documents containing sentences written in a plurality of languages, the "document retrieval apparatus" described in Japanese Patent Laid-Open No. H4-21180 is known. That apparatus has a database whose keywords are created in the native language and searches the database with an input keyword. It holds dictionaries for the native language and for other languages, converts a keyword input in another language into the native language, and searches the database with the native-language keyword. It then converts the names of the retrieved documents and the full text of the selected document into the other language and displays them on the screen.

[0003]

[Problems to Be Solved by the Invention] When document retrieval uses an index, the documents to be searched are often not written in a single language such as the native language (for example, Japanese) alone. Passages quoted from cited references frequently contain other languages (for example, English). Unless the index of the document retrieval apparatus supports a plurality of languages, documents cannot be retrieved adequately.

[0004] Conventionally, a document retrieval apparatus that uses an index creates that index by performing morphological analysis to cut out words and registering the words as keywords. Because the morphological analysis usually targets a single language (for example, Japanese only), words in other languages can only be cut out as proper nouns. Such words cannot be converted to a standard notation or restored to their base form before being registered as keywords. When the retrieval index is created, they can therefore only be extracted as they are, as unregistered words (as opposed to already registered words), and registered in the index.

[0005] Consequently, when the target document contains a large amount of text in other languages, the number of keywords (proper nouns) registered in the retrieval index grows, and the index becomes larger than necessary.

[0006] In addition, such a document retrieval apparatus can search only with a character string that has exactly the same pattern as an index entry registered as a keyword, for example as a proper noun, so a search of appropriate scope cannot be performed adequately. In other words, a search hits only when the query string is identical to the word (proper noun) cut out by morphological analysis, and the desired document may not be found.

[0007] As in the "document retrieval apparatus" of Japanese Patent Laid-Open No. H4-21180 mentioned above, when keywords are registered in one language and a search request arrives in another language, the apparatus can be configured to search documents containing other languages by, for example, translating the request into an already registered word of the same meaning. Even then, however, the retrieval index consists only of native-language keywords, and no consideration is given to registering an index for documents composed of multiple languages.

[0008] The present invention has been made to solve these problems. A first object of the invention is to provide a multilingual document registration/retrieval apparatus that, for a document containing sentences written in a plurality of languages, performs morphological analysis corresponding to each of those languages as far as possible, cuts out words, extracts keywords, and registers them, thereby keeping unregistered words to a minimum even for documents containing sentences written in multiple languages.

[0009] A second object of the invention is to provide a multilingual document registration/retrieval apparatus that, when cutting out words by morphological analysis for each of a plurality of languages and extracting and registering keywords from a document containing sentences written in those languages, avoids overlapping analysis ranges, keeps the index size to a minimum, and improves retrieval accuracy.

[0010] A third object of the invention is to provide a multilingual document registration/retrieval apparatus that, for a document containing sentences written in a plurality of languages, can handle inflection and spelling variation of words in languages other than the native language, improves retrieval accuracy, minimizes the extraction of unnecessary unregistered words, and keeps the index size to a minimum.

[0011] A fourth object of the invention is to provide a multilingual document registration/retrieval apparatus that, when cutting out words by morphological analysis for each of a plurality of languages from a document containing sentences written in those languages, avoids analyzing the same target text redundantly through the combination of language-specific morphological analyses, performs the morphological analysis as efficiently and optimally as possible, and keeps unregistered words to a minimum for documents containing sentences written in multiple languages.

[0012]

[Means for Solving the Problems] To achieve the above objects, a multilingual document registration/retrieval apparatus according to a first feature of the present invention creates and registers an index used for retrieval for documents containing sentences in a plurality of languages and retrieves documents by means of the index. The apparatus comprises: multilingual document storage means (3) for storing documents containing sentences in a plurality of languages; keyword extraction means (2) for performing morphological analysis on a document separately for the sentences of each language and extracting the keywords of the document; index registration means (4) for registering the keywords extracted by the keyword extraction means as an index together with the identifier of the corresponding document; search condition input means (11) for inputting search conditions; index collating means (12) for cutting out words from the search conditions input by the search condition input means and collating the cut-out words with the keywords of the index; and reading means (14) for reading out the documents that match the search conditions according to the result of collating the keywords and the words.
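The register-then-collate flow of the first feature can be illustrated with a small inverted index. The following Python sketch is illustrative only: the class name `MultilingualIndex`, the whitespace tokenizer standing in for the keyword extraction means (2), and the rule that every query word must match are assumptions that do not come from the patent text.

```python
from collections import defaultdict


def extract_keywords(text):
    # Hypothetical stand-in for the keyword extraction means (2):
    # real morphological analysis is replaced by whitespace splitting.
    return {token.lower() for token in text.split()}


class MultilingualIndex:
    def __init__(self):
        self.documents = {}               # plays the role of the storage means (3)
        self.postings = defaultdict(set)  # keyword -> identifiers of documents

    def register(self, doc_id, text):
        # Index registration means (4): keyword stored with the document identifier.
        self.documents[doc_id] = text
        for keyword in extract_keywords(text):
            self.postings[keyword].add(doc_id)

    def search(self, condition):
        # Index collating means (12): cut words out of the search condition
        # and collate them with the index keywords.
        words = extract_keywords(condition)
        if not words:
            return []
        # Assumption: a document matches only if it contains every query word.
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        # Reading means (14): read out the matching documents.
        return [self.documents[doc_id] for doc_id in sorted(hits)]
```

Registering two documents and searching for `"index"` would return both texts, while a word that appears in neither returns an empty list.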

[0013] A multilingual document registration/retrieval apparatus according to a second feature of the present invention creates an index used when retrieving documents containing sentences in a plurality of languages. The multilingual document registration apparatus that creates the index comprises: a plurality of word cutout means (21a, 22a, 23a), each targeting a different language; setting means (27) for setting the processing priority of the plurality of word cutout means; keyword extraction control means (26) for controlling the plurality of word cutout means according to the processing priority, cutting out words from a document, and extracting keywords; and index registration means (28) for registering each extracted keyword in the index in association with the identifier of the document from which the word of that keyword was cut out.

[0014] In a multilingual document registration/retrieval apparatus according to a third feature of the present invention, the keyword extraction control means passes a word that the word cutout means of one processing priority could not identify to the word cutout means of the next processing priority. For a word that is successfully cut out, the identifier of that word is used as the keyword. For a word that none of the word cutout means can identify, the word itself is used as the keyword.
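As a rough illustration of this priority cascade, the Python sketch below models each word cutout means as a plain dictionary that maps a surface form to a word identifier. The function name, the `"JA:"` and `"EN:"` identifier scheme, and the use of pre-split tokens are all hypothetical simplifications; the patent does not specify how word identifiers are formed.

```python
def extract_keywords(tokens, analyzers):
    """Run tokens through analyzers in priority order.

    `analyzers` is a list of dicts (hypothetical stand-ins for the word
    cutout means), each mapping a surface form to a word identifier.
    """
    keywords = set()
    pending = list(tokens)
    for dictionary in analyzers:
        still_unknown = []
        for token in pending:
            if token in dictionary:
                # Identified word: its identifier becomes the keyword.
                keywords.add(dictionary[token])
            else:
                # Not identified at this priority: hand to the next analyzer.
                still_unknown.append(token)
        pending = still_unknown
    # Words no analyzer could identify are used as keywords as they are.
    keywords.update(pending)
    return keywords
```

Because inflected forms such as `"indexes"` and `"index"` can map to one identifier, they collapse into a single index entry, which is one way the index can stay small.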

[0015] A multilingual document registration/retrieval apparatus according to a fourth feature of the present invention creates and registers an index used for retrieval for documents containing sentences in a plurality of languages and retrieves documents by means of the index. Its document registration apparatus comprises: input means (1) for inputting a document to be registered and instructing keyword extraction; keyword extraction means (2) that has dictionaries used for cutting out words and extracts the keywords of the document by morphological analysis; registration means (4) for registering the keywords extracted by the keyword extraction means in the index together with the identifier of the corresponding document; and holding means (3, 5) for holding the documents to be registered, the index, and the words not registered in the dictionary files.

[0016] In a multilingual document registration/retrieval apparatus according to a fifth feature of the present invention, the keyword extraction means comprises: a plurality of word cutout means (21a, 22a, 23a), each of which cuts out words by morphological analysis from the sentences of its own language in a document composed of sentences in a plurality of languages; a plurality of dictionary files (21b, 22b, 23b) storing the dictionaries, one per language, to which the respective word cutout means refer; order setting means (27) for setting the order in which the plurality of word cutout means are applied; and control means (26) for controlling the plurality of word cutout means in the order set by the order setting means so as to cut out the words of the sentences of each corresponding language from the document.

[0017] A multilingual document registration/retrieval apparatus according to a sixth feature of the present invention comprises unregistered keyword candidate holding means (25) for temporarily holding, as unregistered keyword candidates, words that a word cutout means has judged to be unregistered words, and keyword candidate holding means (24) for temporarily holding, as keyword candidates, the other words, namely those extracted from a dictionary. The control means (26, 27) controls the first-stage word cutout means so that it takes a document containing sentences in a plurality of languages as input and cuts out words by morphological analysis. Words judged to be unregistered are temporarily held in the unregistered keyword candidate holding means as unregistered keyword candidates, and words extracted from the dictionary are held in the keyword candidate holding means as keyword candidates. The control means then controls each remaining word cutout means in turn so that it takes as input the unregistered word candidates left in the unregistered keyword candidate holding means by the preceding stage and cuts out words by morphological analysis. Words judged to be unregistered remain in the unregistered keyword candidate holding means, while words extracted from the dictionary are deleted from the unregistered keyword candidate holding means and added to the keyword candidate holding means. Finally, the keyword candidates held in the keyword candidate holding means are registered in the index as keywords, and the unregistered keyword candidates held in the unregistered keyword candidate holding means are registered as unregistered keywords, each together with the identifier of the corresponding document.
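The movement of words between the two holding means can be traced with a short Python sketch. It is a simplification under stated assumptions: each stage is modeled as a set of known words rather than a morphological analyzer, the document is assumed to be pre-split into tokens, and the `("keyword", word)` / `("unregistered", word)` index keys are a hypothetical way of marking the two kinds of entry.

```python
def extract_candidates(tokens, stages):
    """Return (keyword candidates, unregistered candidates) after all stages.

    `stages` is a non-empty list of sets, one hypothetical lexicon per
    word cutout means, in the order they are applied.
    """
    first, *rest = stages
    # Stage 1 reads the document itself.
    keyword_candidates = [t for t in tokens if t in first]   # holding means (24)
    unregistered = [t for t in tokens if t not in first]     # holding means (25)
    # Later stages read only what the preceding stage left unregistered.
    for lexicon in rest:
        found = [t for t in unregistered if t in lexicon]
        unregistered = [t for t in unregistered if t not in lexicon]  # deleted from (25)
        keyword_candidates.extend(found)                              # added to (24)
    return keyword_candidates, unregistered


def register(index, doc_id, tokens, stages):
    # Both kinds of entry are registered with the document identifier.
    keywords, unregistered = extract_candidates(tokens, stages)
    for word in keywords:
        index.setdefault(("keyword", word), set()).add(doc_id)
    for word in unregistered:
        index.setdefault(("unregistered", word), set()).add(doc_id)
```

Since each later stage sees only the leftover candidates, no token is analyzed again once a lexicon has recognized it.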

[0018] A multilingual document registration/retrieval apparatus according to a seventh feature of the present invention creates and registers an index used for retrieval for documents containing sentences in a plurality of languages and retrieves documents by means of the index. Its document retrieval apparatus comprises: search condition input means (11) for inputting search conditions; multilingual index collating means (12) for cutting out words from the search conditions input by the search condition input means and collating them with the index; and extraction means (14) for extracting the corresponding documents from the text database according to the collation result of the index collating means.

[0019] In a multilingual document registration/retrieval apparatus according to an eighth feature of the present invention, the index collating means comprises: a plurality of word cutout means (131a, 132a, 133a), each of which performs morphological analysis on the sentences of its own language in a document composed of a plurality of languages and cuts out words; order setting means (137) for combining one or more of the word cutout means and setting the order in which they are applied; and control means (136) for performing control so that the words of the search conditions input by the search condition input means are cut out in the order set by the order setting means.

[0020] A multilingual document registration/retrieval apparatus according to a ninth feature of the present invention comprises unregistered search word candidate holding means (134) for temporarily holding, as unregistered search word candidates, words that a word cutout means has judged to be unregistered words, and search word candidate holding means (135) for temporarily holding, as search word candidates, the other words, namely those extracted from a dictionary. The control means (136, 137) controls the first-stage word cutout means so that it takes a document containing sentences in a plurality of languages as input and cuts out words by morphological analysis. Words judged to be unregistered are temporarily held in the unregistered search word candidate holding means as unregistered search word candidates, and words extracted from the dictionary are held in the search word candidate holding means as search word candidates. The control means then controls each remaining word cutout means in turn so that it takes as input the unregistered word candidates left in the unregistered search word candidate holding means by the preceding stage and cuts out words by morphological analysis. Words judged to be unregistered remain in the unregistered search word candidate holding means, while words extracted from the dictionary are deleted from the unregistered search word candidate holding means and added to the search word candidate holding means. Finally, the search word candidates held in the search word candidate holding means are used as search words and the unregistered search word candidates held in the unregistered search word candidate holding means are used as unregistered search words for index collation, the corresponding documents are extracted from the text database part, and the result information is output.
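The retrieval side mirrors the registration cascade, and a minimal Python sketch can show the whole path from query words to extracted documents. The names `split_query` and `retrieve`, the set-based lexicons, the plain `dict` index, and the choice to return a document when any query word matches are assumptions made for illustration, since the patent text does not state how multiple query words are combined.

```python
def split_query(tokens, stages):
    """Split query tokens into (search words, unregistered search words).

    `stages` is a list of sets, one hypothetical lexicon per word cutout
    means (131a, 132a, 133a), in the order they are applied.
    """
    search_words, unregistered = [], list(tokens)
    for lexicon in stages:
        # Words found at this stage move to the search word candidates (135);
        # the rest stay in the unregistered candidates (134).
        search_words += [t for t in unregistered if t in lexicon]
        unregistered = [t for t in unregistered if t not in lexicon]
    return search_words, unregistered


def retrieve(query_tokens, stages, index, text_db):
    """Collate both kinds of search word with the index and read documents."""
    search_words, unregistered = split_query(query_tokens, stages)
    doc_ids = set()
    # Assumption: a document is a hit if any query word is in the index.
    for word in search_words + unregistered:
        doc_ids |= index.get(word, set())
    return {doc_id: text_db[doc_id] for doc_id in sorted(doc_ids)}
```

Unregistered search words are still looked up, so a term such as a product name that no lexicon knows can match an unregistered keyword stored at registration time.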

[0021] In the multilingual document registration/retrieval apparatus of the present invention having these features, the apparatus according to the first feature operates as follows. The multilingual document storage means (3) stores documents containing sentences in a plurality of languages, and the keyword extraction means (2) performs morphological analysis on each document separately for the sentences of each language and extracts the keywords of the document. The index registration means (4) registers the keywords extracted by the keyword extraction means as an index together with the identifier of the corresponding document.

[0022] When a document is to be retrieved and search conditions are input through the search condition input means (11), the index collating means (12) cuts out words from the input search conditions and collates the cut-out words with the keywords of the index. The reading means (14) then reads out the documents that match the search conditions according to the result of collating the keywords and the words. In this way, an index used for retrieval is created and registered for documents containing sentences in a plurality of languages, and documents are retrieved by means of that index.

[0023] In the multilingual document registration/retrieval apparatus according to the second feature of the present invention, the multilingual document registration apparatus that creates the index has a plurality of word cutout means (21a, 22a, 23a), each targeting a different language. When the setting means (27) sets the processing priority of the word cutout means, the keyword extraction control means (26) controls them according to that priority, cuts out words from the document, and extracts keywords. The index registration means (28) registers each extracted keyword in the index in association with the identifier of the document from which the word of that keyword was cut out. For a document containing sentences in a plurality of languages, the word cutout means for each target language can thus perform morphological analysis and cut out the keyword words. An index for use at retrieval time can therefore be created with unregistered words kept to a minimum, even for documents written in multiple languages.

[0024] In the multilingual document registration/retrieval apparatus according to the third feature of the present invention, when the word cutout means for each target language perform morphological analysis on a document containing sentences in a plurality of languages and cut out keyword words, the keyword extraction control means passes a word that the word cutout means of one processing priority could not identify to the word cutout means of the next processing priority. For a word that is successfully cut out, the identifier of that word is used as the keyword. For a word that none of the word cutout means can identify, the word itself is used as the keyword. Words can thus be cut out of a document written in a plurality of languages by the morphological analysis corresponding to each language, and keywords can be extracted without overlapping analysis ranges. The index size when keywords are registered can thereby be kept to a minimum.

[0025] In the multilingual document registration/retrieval apparatus according to the fourth feature of the present invention, the document registration apparatus consists of the input means (1), the keyword extraction means (2), the registration means (4), and the holding means (3, 5). When the input means (1) inputs a document to be registered and instructs keyword extraction, the keyword extraction means (2), which has dictionaries used for cutting out words, extracts the keywords of the document by morphological analysis, and the registration means (4) registers the extracted keywords in the index together with the identifier of the corresponding document. As a result, the holding means (3, 5) hold the registered documents, the index, and the words not registered in the dictionary files. An index used for retrieval can thus be created and registered for documents containing sentences in a plurality of languages, and documents can be retrieved by means of that index.

[0026] In the multilingual document registration/retrieval apparatus according to the fifth feature of the present invention, the plurality of word cutout means (21a, 22a, 23a) in the keyword extraction means cut out words by morphological analysis from the sentences of each language in a document composed of sentences in a plurality of languages. The plurality of dictionary files (21b, 22b, 23b) store the dictionaries for the languages to which the respective word cutout means (21a, 22a, 23a) refer. Accordingly, once the order setting means (27) sets the order in which the word cutout means are applied, the control means (26) controls the word cutout means in that order so as to cut out the words of the sentences of each corresponding language from the document.

[0027] In the multilingual document registration/retrieval apparatus according to the sixth feature of the present invention, the unregistered keyword candidate holding means (25) temporarily holds, as unregistered keyword candidates, words that a word cutout means has judged to be unregistered words, and the keyword candidate holding means (24) temporarily holds, as keyword candidates, the other words, namely those extracted from a dictionary. When keywords are extracted for multiple languages, the control means (26, 27) controls the first-stage word cutout means so that it takes a document containing sentences in a plurality of languages as input and cuts out words by morphological analysis. Words judged to be unregistered are temporarily held in the unregistered keyword candidate holding means as unregistered keyword candidates, and words extracted from the dictionary are held in the keyword candidate holding means as keyword candidates.

【００２８】続いて、順次に各々の単語切出し手段を制
御して、前段の単語切り出し手段により、前記未登録キ
ーワード候補保持手段に保持された未登録語候補を入力
として、形態素解析により単語の切り出しを行い、未登
録語と判断された単語に関しては、そのまま前記未登録
キーワード候補保持手段に残し、辞書から抽出された単
語に関しては前記未登録キーワード候補保持手段より削
除し、前記キーワード候補保持手段に追加保持する処理
を行う。 Subsequently, the control means controls each remaining word segmentation means in turn. Each one takes as input the unregistered word candidates left in the unregistered keyword candidate holding means by the preceding word segmentation means and segments words by morphological analysis. Words still judged to be unregistered remain in the unregistered keyword candidate holding means, while words extracted from the dictionary are deleted from the unregistered keyword candidate holding means and added to the keyword candidate holding means.

【００２９】そして、最終的に前記キーワード候補保持
手段に保持されたキーワード候補をキーワードとし、前
記未登録キーワード候補保持手段に保持された未登録キ
ーワードを未登録キーワードとして対応する文書の識別
子と共にインデックスに登録する。このようにして、順
次に形態素解析により単語の切り出しを行うので、複数
の言語に対応するそれぞれの複数の形態素解析の組み合
わせによる対象テキストの重複した解析を避けて、でき
る限り効率的に最適に形態素解析を行うことができ、多
言語で記述された文を含む文書に対して未登録語を最小
限に押さえるようにできる。 Finally, the keyword candidates held in the keyword candidate holding means are taken as keywords, the unregistered keyword candidates held in the unregistered keyword candidate holding means are taken as unregistered keywords, and both are registered in the index together with the identifier of the corresponding document. Because words are segmented by applying the morphological analyses one after another in this way, redundant analysis of the target text by combinations of the morphological analyses for the individual languages is avoided, morphological analysis is performed as efficiently as possible, and unregistered words are kept to a minimum for documents containing sentences written in multiple languages.
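The staged extraction described in the preceding paragraphs can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function name `cascade_extract`, the representation of a stage as a `(segment, dictionary)` pair, and the use of whitespace splitting in place of real morphological analysis are all assumptions made for illustration.

```python
def cascade_extract(text, stages):
    """Run each (segment, dictionary) stage in order and sort words into two holders."""
    keywords = []       # plays the role of the keyword candidate holding means (24)
    unregistered = []   # plays the role of the unregistered keyword candidate holding means (25)
    for stage_no, (segment, dictionary) in enumerate(stages):
        # The first stage reads the whole document; later stages re-read only
        # the words the preceding stage left unregistered.
        pending = [text] if stage_no == 0 else unregistered
        unregistered = []
        for chunk in pending:
            for word in segment(chunk):
                if word in dictionary:
                    keywords.append(word)
                else:
                    unregistered.append(word)
    return keywords, unregistered


# Toy stages (hypothetical): whitespace splitting stands in for morphological analysis.
japanese = (str.split, {"検索"})
english = (str.split, {"retrieval", "index"})
keywords, unregistered = cascade_extract("retrieval 検索 index xyzzy", [japanese, english])
print(keywords, unregistered)
```

In this toy run the Japanese stage recognizes only 検索, the English stage then recovers the two English words from the leftovers, and only the word found in neither dictionary stays unregistered.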

【００３０】また、本発明の第７の特徴とする多言語文
書登録検索装置によれば、文書検索装置は、検索条件入
力手段（１１）と、多言語対応のインデックス照合手段
（１２）と、抽出手段（１４）とを備えており、検索条
件入力手段（１１）が、検索条件を入力すると、多言語
対応のインデックス照合手段（１２）が、前記検索条件
入力手段によって入力された検索条件から単語を切り出
してインデックスと照合する。抽出手段（１４）は、前
記インデックス照合手段の照合結果により、対応する文
書をテキストデータベースから抽出する。これより、複
数の言語の文を含む文書に対して検索に用いるインデッ
クスが作成して登録してある場合に、該インデックスに
より文書の検索を行うことができる。 According to the multilingual document registration and retrieval device of the seventh aspect of the present invention, the document retrieval device comprises search condition input means (11), multilingual index collation means (12), and extraction means (14). When a search condition is entered through the search condition input means (11), the multilingual index collation means (12) segments words from the entered search condition and collates them with the index. The extraction means (14) extracts the corresponding documents from the text database according to the collation result. Thus, once an index for retrieval has been created and registered for documents containing sentences in a plurality of languages, documents can be retrieved by means of that index.

【００３１】また、文書検索を行う場合、本発明の第８
の特徴とする多言語文書登録検索装置によれば、前記イ
ンデックス照合手段において、複数の単語切出し手段
（１３１ａ，１３２ａ，１３３ａ）が、複数の言語の文
から構成される文書からそれぞれ対応の言語の文に対し
て形態素解析を行って単語を切り出すので、順序設定手
段（１３７）により、１以上の単語切出し手段を組み合
わせて当該前記単語切出し手段を適用する順番を設定
し、制御手段（１３６）によって、前記順序設定手段に
より設定した順に検索条件入力手段によって入力された
検索条件から単語を切り出す制御を行い、そして、検索
条件の単語により文書の検索を行う。 When a document search is performed, according to the multilingual document registration and retrieval device of the eighth aspect of the present invention, the index collation means includes a plurality of word segmentation means (131a, 132a, 133a), each of which performs morphological analysis on the sentences of its corresponding language in a document composed of sentences in a plurality of languages and segments words. The order setting means (137) combines one or more of the word segmentation means and sets the order in which they are applied, and the control means (136) causes words to be segmented, in the order set by the order setting means, from the search condition entered through the search condition input means. Documents are then retrieved using the words of the search condition.

【００３２】また、本発明の第９の特徴とする多言語文
書登録検索装置においては、未登録検索語候補保持手段
（１３４）が、単語切出し手段により未登録語として判
断された単語に関しては一時的に未登録検索語候補とし
て保持しており、それ以外の辞書から抽出された単語に
関しては、検索語候補保持手段（１３５）により、一時
的に検索語候補として保持しておく。 In the multilingual document registration and retrieval device of the ninth aspect of the present invention, the unregistered search word candidate holding means (134) temporarily holds, as unregistered search word candidates, words that a word segmentation means has judged to be unregistered words, and the search word candidate holding means (135) temporarily holds, as search word candidates, the other words, that is, those extracted by matching a dictionary.

【００３３】前記制御手段（１３６，１３７）が、１段
目の単語切出し手段を制御して、複数の言語の文を含む
文書を入力として、形態素解析により単語の切り出しを
行い、未登録語と判断された単語に関しては、一時的に
未登録検索語候補として前記未登録検索語候補保持手段
に保持し、辞書から抽出された単語に関しては検索語候
補として、前記検索語候補保持手段に保持する処理を行
う。 The control means (136, 137) controls the first-stage word segmentation means so that it takes a document containing sentences in a plurality of languages as input and segments words by morphological analysis. Words judged to be unregistered are temporarily held in the unregistered search word candidate holding means as unregistered search word candidates, and words extracted from the dictionary are held in the search word candidate holding means as search word candidates.

【００３４】続いて、順次に各々の単語切出し手段を制
御して、前段の単語切り出し手段により前記未登録検索
語候補保持手段に保持された未登録語候補を入力とし
て、形態素解析により単語の切り出しを行い、未登録語
と判断された単語に関しては、そのまま前記未登録検索
語候補保持手段に残し、辞書から抽出された単語に関し
ては、前記未登録検索語候補保持手段より削除し、前記
検索語候補保持手段に追加保持する処理を行う。 Subsequently, the control means controls each remaining word segmentation means in turn. Each one takes as input the unregistered word candidates left in the unregistered search word candidate holding means by the preceding word segmentation means and segments words by morphological analysis. Words still judged to be unregistered remain in the unregistered search word candidate holding means, while words extracted from the dictionary are deleted from the unregistered search word candidate holding means and added to the search word candidate holding means.

【００３５】そして、最終的に前記検索語候補保持手段
に保持された検索語候補を検索語とし、前記未登録検索
語候補保持手段に保持された未登録検索語を未登録検索
語として、インデックス照合し、対応する文書をテキス
トデータベース部により抽出して結果情報を出力する。 Finally, the search word candidates held in the search word candidate holding means are taken as search words, the unregistered search word candidates held in the unregistered search word candidate holding means are taken as unregistered search words, and both are collated with the index. The corresponding documents are extracted from the text database unit, and the result information is output.

【００３６】このようにして、複数の言語の文を含む文
書に対して検索に用いるインデックスを作成して登録
し、該インデックスにより文書の検索を行うので、複数
の言語で記述された文書に対して、自国語以外の単語の
語形変化や、表記の揺れにも対応でき、検索精度の向上
が計れる。また、不必要な未登録語の抽出を最小限に押
さえており、インデックスサイズを最小に押さえること
ができる。更に、また、複数の言語で記述された文書に
対して、複数の言語に対応するそれぞれの複数の形態素
解析の組み合わせによる対象テキストの重複した解析を
避けて、できる限り効率的に最適に形態素解析を行うこ
とができ、多言語で記述された文書に対して未登録語を
最小限に押さえるようにできる。 In this way, an index for retrieval is created and registered for documents containing sentences in a plurality of languages, and documents are retrieved by means of that index. For documents written in a plurality of languages, the device can therefore cope with inflection and spelling variation in words of languages other than the user's native language, which improves retrieval accuracy. Unnecessary extraction of unregistered words is kept to a minimum, so the index size can also be kept to a minimum. Furthermore, for documents written in a plurality of languages, redundant analysis of the target text by combinations of the morphological analyses for the individual languages is avoided, morphological analysis is performed as efficiently as possible, and unregistered words are kept to a minimum.

【００３７】[0037]

【発明の実施の形態】以下、本発明を実施する形態につ
いて、図面を参照して具体的に説明する。図１は本発明
の一実施例にかかる多言語文書登録検索装置の構成を示
すブロック図である。図１において、１は入力処理部、
２は多言語キーワード抽出部、３はテキストデータベー
ス部、４はインデックス登録部、５はインデックスファ
イル部、１１は検索条件入力部、１２は多言語インデッ
クス登録部、１３は表示部、１４はテキスト抽出部であ
る。 BEST MODE FOR CARRYING OUT THE INVENTION Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a multilingual document registration and retrieval device according to one embodiment of the present invention. In FIG. 1, 1 is an input processing unit, 2 is a multilingual keyword extraction unit, 3 is a text database unit, 4 is an index registration unit, 5 is an index file unit, 11 is a search condition input unit, 12 is a multilingual index collation unit, 13 is a display unit, and 14 is a text extraction unit.

【００３８】テキストデータベース部３には、例えば、
英語の記述された文および日本語で記述された文などの
複数の言語で記述された文を含む文書（１１０：図１
１）が格納されている。入力処理部１が、ユーザからの
文書の登録の入力指示を受け付けると、多言語キーワー
ド抽出部２は、ユーザにより指示された文書に対して、
そのキーワードを抽出する処理を行う。ここで、キーワ
ードを抽出する文書は、例えば、入力処理部１から入力
されて、テキストデータベース部３に登録された文書で
あり、または、テキストデータベース部３に既に登録さ
れている文書である。これに文書に対して、キーワード
を抽出する文書が指定され、その文書に対して、キーワ
ードを抽出する処理が行われる。 The text database unit 3 stores documents (110: FIG. 11) containing sentences written in a plurality of languages, for example sentences written in English and sentences written in Japanese. When the input processing unit 1 accepts a document registration instruction from the user, the multilingual keyword extraction unit 2 extracts keywords from the document designated by the user. The document from which keywords are extracted is, for example, a document entered through the input processing unit 1 and registered in the text database unit 3, or a document already registered in the text database unit 3. From among these documents, the document for keyword extraction is designated, and keywords are extracted from it.

【００３９】多言語キーワード抽出部２には、後述する
ように、各国の言語に対応して形態素解析を行う複数の
形態素解析部（２１ａ〜２３ａ：図２）が備えられてお
り、この複数の各国の言語対応の形態素解析部を順次に
制御して、効率的に文書中の異なる言語の文に対応して
それぞれに形態素解析を行い、単語を切り出し、キーワ
ードを抽出する処理を行う。複数の形態素解析部には、
それぞれに解析する言語に対応して言語対応に辞書ファ
イルが設けられており、形態素解析を行い単語を切り出
す場合に、各言語に対応する単語は、各々の国の言語に
対応する該当の辞書と比較される。 As described later, the multilingual keyword extraction unit 2 is provided with a plurality of morphological analysis units (21a to 23a: FIG. 2), each performing morphological analysis for a particular language. These language-specific morphological analysis units are controlled one after another so that the sentences of the different languages in the document are each analyzed efficiently, words are segmented, and keywords are extracted. Each morphological analysis unit has a dictionary file for the language it analyzes, and when words are segmented by morphological analysis, the words of each language are compared with the dictionary for that language.

【００４０】ある言語の辞書ファイルに登録されている
単語は、キ−ワード候補として、キーワード候補保持部
に一時的に保持され、ある言語の辞書に登録されていな
い単語は、未登録キーワード候補として一時的に未登録
キーワード候補保持部に保持される。そして、未登録キ
ーワード候補の単語は、次の言語の形態素解析部により
形態素解析が行われる。このようにして、次の国の言語
に対応する辞書と比較する際に、先の未登録キーワード
候補の単語を含めて、形態素解析を行い、単語を切り出
して、キーワードを抽出する処理を行う。 A word registered in the dictionary file of a given language is temporarily held in the keyword candidate holding unit as a keyword candidate, and a word not registered in that dictionary is temporarily held in the unregistered keyword candidate holding unit as an unregistered keyword candidate. The unregistered keyword candidates are then subjected to morphological analysis by the morphological analysis unit for the next language. Thus, when the dictionary for the next language is consulted, the earlier unregistered keyword candidates are included in the morphological analysis, words are segmented, and keywords are extracted.

【００４１】このようにして抽出されたキーワードは、
インデックス登録部４により、当該キーワードに対応す
る文書の識別子と共にインデックスとして、インデック
スファイル部５のインデックテーブルに登録される。 The index registration unit 4 registers each keyword extracted in this way, together with the identifier of the document it corresponds to, as an index entry in the index table of the index file unit 5.

【００４２】ユーザが所望する文書の検索を行う場合、
ユーザは検索条件入力部１１から検索条件を入力する。
検索条件が入力されると、多言語インデックス照合部１
２では、入力された検索条件から単語を切り出し、切り
出した単語とインデックテーブルのインデックスのキー
ワードとを照合する。そして、その照合結果によってテ
キスト抽出部１４により、キーワードと単語の照合結果
により検索条件に適合する文書を読み出し、表示部１３
において、読み出された文書を表示する。 To retrieve a desired document, the user enters a search condition through the search condition input unit 11. When the search condition is entered, the multilingual index collation unit 12 segments words from it and collates the segmented words with the keywords of the index in the index table. Based on the result of collating keywords and words, the text extraction unit 14 reads out the documents that satisfy the search condition, and the display unit 13 displays them.
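As a rough illustration of how keywords registered with document identifiers can later be collated with search words, the following Python sketch keeps an in-memory inverted index. The names `index`, `register`, and `search` are hypothetical, and requiring every search word to match (AND semantics) is an assumption; the text above does not specify how several search words are combined.

```python
from collections import defaultdict

# keyword -> identifiers of the documents it was extracted from
# (a stand-in for the index table of the index file unit 5)
index = defaultdict(set)


def register(doc_id, keywords):
    """Record each extracted keyword together with the document identifier."""
    for word in keywords:
        index[word].add(doc_id)


def search(terms):
    """Return the identifiers of documents indexed under every search word (assumed AND)."""
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()


register("D1", ["検索", "index"])
register("D2", ["index"])
print(search(["index", "検索"]))
```

A text extraction step would then read the documents with the returned identifiers from the text database.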

【００４３】図２は、多言語キーワード抽出部の要部の
構成を示すブロック図である。図２には、多言語キーワ
ード抽出部における各々の要素のブロックと共に、その
データの流れが示されている。図２において、１は入力
処理部、２１ａは第１番目の形態素解析部、２１ｂは第
１番目の辞書ファイル部、２２ａは第２番目の形態素解
析部、２２ｂは第２番目の辞書ファイル部、２３ａは第
Ｎ番目の形態素解析部である。２４はキーワード候補保
持部、２５は未登録キーワード候補保持部、２６はキー
ワード／未登録キーワード決定部、２７は順序設定部、
２８はインデックス登録部、２９はインデックスファイ
ル部である。 FIG. 2 is a block diagram showing the configuration of the main part of the multilingual keyword extraction unit. FIG. 2 shows the blocks of the individual elements of the multilingual keyword extraction unit together with the flow of data among them. In FIG. 2, 1 is the input processing unit, 21a is the first morphological analysis unit, 21b is the first dictionary file unit, 22a is the second morphological analysis unit, 22b is the second dictionary file unit, and 23a is the Nth morphological analysis unit. 24 is the keyword candidate holding unit, 25 is the unregistered keyword candidate holding unit, 26 is the keyword/unregistered keyword determination unit, 27 is the order setting unit, 28 is the index registration unit, and 29 is the index file unit.

【００４４】図２に示すように、多言語キーワード抽出
部には、各々の国の言語の文の形態素解析を行うための
それぞれの言語に対応する複数の形態素解析部（２１ａ
〜２３ａ）と、各形態素解析部に各々の言語の辞書デー
タを供給する各国語対応の複数の辞書ファイル部（２１
ｂ〜２３ｂ）とが備えられており、これらの複数の形態
素解析部（２１ａ〜２３ａ）を制御して、効率的に多言
語の文の形態素解析を行うために、その多言語文書の形
態素解析を行う順序を設定する順序設定部２７と、その
作業メモリとして、解析された単語をキーワード候補と
して一時的に登録しておくキーワード候補保持部２４
と、１つ言語に対応する形態素解析部では解析されなか
った単語については、別の言語に対応する形態素解析部
での形態素解析を行うために、一時的に登録しておく未
登録キーワード候補保持部２５が設けられている。そし
て、形態素解析が終了した場合に、キーワード／未登録
キーワード決定部２６において、登録するキーワードと
する単語と、未登録キーワードとしておく単語を決定
し、インデックス登録部２８において、キーワードが抽
出された文書の識別子と対応づけて、インデックスファ
イル部２９に登録する。 As shown in FIG. 2, the multilingual keyword extraction unit includes a plurality of morphological analysis units (21a to 23a), one for each language, for performing morphological analysis on the sentences of the respective languages, and a plurality of language-specific dictionary file units (21b to 23b) that supply dictionary data for each language to the corresponding morphological analysis unit. To control these morphological analysis units (21a to 23a) so that multilingual text is analyzed efficiently, the unit also includes an order setting unit 27 that sets the order in which the morphological analyses are applied to the multilingual document. As working memory, it includes a keyword candidate holding unit 24, in which analyzed words are temporarily registered as keyword candidates, and an unregistered keyword candidate holding unit 25, in which words that the morphological analysis unit for one language could not analyze are temporarily registered so that the morphological analysis unit for another language can analyze them. When morphological analysis is finished, the keyword/unregistered keyword determination unit 26 determines which words are registered as keywords and which are kept as unregistered keywords, and the index registration unit 28 registers them in the index file unit 29 in association with the identifier of the document from which they were extracted.

【００４５】図３は、キーワード抽出処理を行う場合に
用いられる制御テーブルの内容を説明する図である。図
３（ａ）にキーワード抽出管理テーブル３０を示してお
り、図３（ｂ）に、形態素解析管理テーブル３６を示し
ており、また、図３（ｃ）に解析対象文字列タイプ設定
テーブル３７を示している。 FIG. 3 illustrates the contents of the control tables used in the keyword extraction process. FIG. 3(a) shows the keyword extraction management table 30, FIG. 3(b) shows the morphological analysis management table 36, and FIG. 3(c) shows the analysis target character string type setting table 37.

【００４６】キーワード抽出管理テーブル３０は、図３
（ａ）に示すように、多言語キーワード抽出部に備えら
れている各々の形態素解析部の使用状態を管理するテー
ブルであり、番号フィールド３１，対応言語種別フィー
ルド３２，順番フィールド３３，使用フラグフィールド
３４，および解析対象文字列タイプフィールド３５から
構成されている。各々のフィールドに使用する形態素解
析部の条件データを設定する。例えば、上から２番目の
エントリには、日本語対応で形態素解析を行う形態素解
析部の条件が設定されており、番号フィールド３１には
“２”が設定され、対応言語種別フィールド３２には
“日本語対応”と設定され、順番フィールド３３には
“１”が設定され、使用フラグフィールド３４は“Ｏ
Ｎ”が設定されている。また、解析対象文字列タイプフ
ィールド３５には“テキスト−ＡＬＬ”が設定されて、
ここでの条件データが設定されている。つまり、この条
件データからは、「第２番目の形態素解析部は、日本語
対応に用いられ、多言語文書の形態素解析を１番目に行
い、その解析対象文字列の範囲を、テキスト全部として
行う」ことを意味している。 As shown in FIG. 3(a), the keyword extraction management table 30 manages the usage state of each morphological analysis unit provided in the multilingual keyword extraction unit. It consists of a number field 31, a supported language type field 32, an order field 33, a use flag field 34, and an analysis target character string type field 35, and the condition data for each morphological analysis unit is set in these fields. For example, the second entry from the top holds the conditions for the morphological analysis unit that handles Japanese: the number field 31 is set to "2", the supported language type field 32 to "Japanese", the order field 33 to "1", the use flag field 34 to "ON", and the analysis target character string type field 35 to "text-ALL". This condition data means that the second morphological analysis unit handles Japanese, is the first to analyze the multilingual document, and takes the entire text as the range of its analysis target character string.
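The five fields of the keyword extraction management table can be pictured as a small record type. The Python sketch below is only an illustration: the class name `AnalyzerEntry`, the helper `entries_in_order`, and the English entry are hypothetical, and only the Japanese entry reproduces the values given in the text.

```python
from dataclasses import dataclass


@dataclass
class AnalyzerEntry:
    number: int        # number field 31
    language: str      # supported language type field 32
    order: int         # order field 33
    use_flag: bool     # use flag field 34 (True corresponds to "ON")
    target_type: str   # analysis target character string type field 35


table_30 = [
    # Hypothetical first entry, added only so the ordering can be shown.
    AnalyzerEntry(1, "English", 2, True, "unregistered word group"),
    # The second entry from the top, as described in the text.
    AnalyzerEntry(2, "Japanese", 1, True, "text-ALL"),
]


def entries_in_order(table):
    """Return the enabled entries sorted by the order in which they analyze."""
    return sorted((e for e in table if e.use_flag), key=lambda e: e.order)


print([e.language for e in entries_in_order(table_30)])
```

Keeping the entry number separate from the analysis order is what lets the second entry run first, as in the Japanese example.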

【００４７】キーワード抽出処理で用いられる形態素解
析部は、形態素解析管理テーブル３６により管理され
る。形態素解析管理テーブル３６においては、図３
（ｂ）に示すように、使用可能な各国語対応の形態素解
析部の個数ｎと、現在使われている形態素解析部の番号
ｉとのデータが管理されている。また、解析対象文字列
の範囲の設定のために、図３（ｃ）に示すように、解析
対象文字列タイプ設定テーブル３７が設けられている。
解析対象文字列タイプ設定テーブル３７には、解析対象
文字列タイプに応じて、その解析対象文字列の範囲の規
定されている。例えば、解析対象文字列タイプが“未登
録語群”である場合には、解析対象文字列の範囲を「未
登録キーワード候補保持部あるいは未登録検索語保持部
のキーワード候補の全て」とするように設定され、ま
た、解析対象文字列タイプが“テキスト−ＡＬＬ”であ
る場合には、解析対象文字列の範囲を「登録文書あるい
は検索式の全てのテキスト」とするように設定されてい
る。 The morphological analysis units used in the keyword extraction process are managed by the morphological analysis management table 36. As shown in FIG. 3(b), this table holds the number n of available language-specific morphological analysis units and the number i of the morphological analysis unit currently in use. To set the range of the analysis target character string, the analysis target character string type setting table 37 is provided, as shown in FIG. 3(c). This table defines the range of the analysis target character string for each analysis target character string type. For example, when the type is "unregistered word group", the range is "all keyword candidates in the unregistered keyword candidate holding unit or the unregistered search word holding unit", and when the type is "text-ALL", the range is "the entire text of the registered document or the search expression".
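The mapping from an analysis target character string type to a concrete range might be sketched as follows. The function `resolve_target` and its arguments are hypothetical names; only the two type names and their meanings come from the description of table 37.

```python
def resolve_target(target_type, full_text, unregistered_candidates):
    """Return the list of strings a morphological analysis unit should analyze."""
    if target_type == "text-ALL":
        # The entire text of the registered document or the search expression.
        return [full_text]
    if target_type == "unregistered word group":
        # Everything currently held as an unregistered candidate.
        return list(unregistered_candidates)
    raise ValueError(f"unknown analysis target type: {target_type}")


print(resolve_target("unregistered word group", "検索 index", ["index"]))
```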

【００４８】次に、これらの制御テーブルを用いて、図
２に示すようなキーワード抽出処理部における多言語文
書登録処理について説明する。図４は多言語文書登録処
理の全体の処理フローを示すフローチャートである。図
４に示すフローチャートは、１ヶ国以上の言語で記述さ
れている文書をそれぞれの言語に対応している形態素解
析を行うことによって、キーワードとする単語を切り出
し、インデックスに登録する処理の全体の流れを示して
いる。また、図５は、多言語文書登録処理の中のキーワ
ード抽出管理テーブルの条件の設定処理の処理フローを
示すフローチャートであり、図６は、多言語文書登録処
理の中の解析対象文字列範囲の設定処理の処理フローを
示すフローチャートであり、図７は、多言語文書登録処
理の中のキーワード抽出処理の処理フローを示すフロー
チャートである。また、図８は、多言語文書登録処理の
中の未登録キーワード候補処理の処理フローを示すフロ
ーチャートである。 Next, the multilingual document registration process in the keyword extraction unit shown in FIG. 2 is described using these control tables. FIG. 4 is a flowchart of the overall multilingual document registration process. It shows the overall flow in which a document written in one or more languages is subjected to the morphological analysis for each language, the words to be used as keywords are segmented, and they are registered in the index. Within the multilingual document registration process, FIG. 5 is a flowchart of the process that sets the conditions in the keyword extraction management table, FIG. 6 is a flowchart of the process that sets the analysis target character string range, FIG. 7 is a flowchart of the keyword extraction process, and FIG. 8 is a flowchart of the unregistered keyword candidate process.

【００４９】まず、図４のフローチャートを参照して、
多言語文書登録処理の全体の処理を説明する。処理を開
始すると、まず、ステップ４１において、入力処理部１
により、文書の登録指示を行う。次に、ステップ４２に
おいて、キーワード抽出条件の設定処理（図５）を行
い、続いて、次のステップ４３において、形態素解析の
解析対象の文字列集合の設定処理（図６）を行う。つま
り、キーワード抽出処理で用いる形態素解析部の条件を
設定し、続いて、条件を設定した形態素解析部を用いて
解析を行う対象の文字列集合の設定を行う。文字列集合
の設定処理では、例えば、登録文書のテキストの全て
か、未登録キーワード候補群の文字列の集合か等を設定
し、具体的に形態素解析を行う解析対象の文字列集合の
設定を行う。 First, the overall multilingual document registration process is described with reference to the flowchart of FIG. 4. When the process starts, a document registration instruction is given through the input processing unit 1 in step 41. In step 42 the keyword extraction conditions are set (FIG. 5), and in step 43 the set of character strings to be analyzed by morphological analysis is set (FIG. 6). That is, the conditions for the morphological analysis units used in keyword extraction are set first, and then the set of character strings that each configured morphological analysis unit is to analyze is set. In the character string set setting process, the concrete analysis target is chosen, for example the entire text of the registered document or the set of character strings in the unregistered keyword candidate group.

【００５０】これらの設定の処理が終ると、次に、キー
ワード抽出処理の制御を行うため、ステップ４４におい
て、現在使っている形態素解析部の番号（順序）を示す
変数ｉを“１”と設定し、文書登録の処理の最初に使う
形態素解析部をセットする。次に、ステップ４５におい
て、変数ｉに対応する言語の形態素解析部の使用フラグ
がオンであるか否かを判定する。使用フラグがオンでな
ければ、変数ｉの番号の形態素解析部による形態素解析
の処理は行わないので、次の番号の形態素解析部の処理
に進めるため、ステップ５２に進む。 After these settings are made, in order to control the keyword extraction process, step 44 sets the variable i, which indicates the number (order) of the morphological analysis unit currently in use, to "1", thereby selecting the morphological analysis unit used first in document registration. Step 45 then determines whether the use flag of the morphological analysis unit for the language corresponding to variable i is on. If the flag is not on, the morphological analysis unit numbered i performs no analysis, so the process goes to step 52 to move on to the next morphological analysis unit.

【００５１】また、ステップ４５の判定において、変数
ｉに対応する言語の形態素解析部の使用フラグがオンで
ある場合、すなわち、キーワード抽出管理テーブル３０
で順序を示す変数ｉに対応する言語（「解析する順番」
がｉとなっている言語）の形態素解析のエントリーの
「使用フラグ」がＯＮになっている場合、次のステップ
４６に進み、文字列集合に対し、第ｉ番目の対応の形態
素解析部による順次のキーワード抽出処理（図７）を行
う。このキーワード抽出処理では、後述するように、キ
ーワード抽出管理テーブル３０の条件データにより、第
ｉ番目の順序の形態素解析部のエントリの解析対象文字
列のタイプで設定された文字列集合に対して第ｉ番目の
対応する形態素解析部によりキーワードを抽出する。 If step 45 determines that the use flag is on, that is, if in the keyword extraction management table 30 the "use flag" of the entry whose "order of analysis" equals i is ON, the process proceeds to step 46, where the i-th morphological analysis unit performs the keyword extraction process (FIG. 7) on the character string set. As described later, according to the condition data in the keyword extraction management table 30, the i-th morphological analysis unit extracts keywords from the character string set specified by the analysis target character string type in its entry.

【００５２】次に、ステップ４７において、抽出された
キーワードが（第ｉ番目の）形態素解析用辞書に登録さ
れているか否かを判定し、登録されていなければ、ステ
ップ４８に進み、未登録キーワード候補として対応する
文書ＩＤと共に、未登録キーワード候補保持部に記憶
し、ステップ５１に進む。また、ステップ４７の判定に
おいて、抽出されたキーワードが形態素解析用辞書に登
録されていると判定された場合、ステップ４９に進み、
キーワード候補として対応する文書ＩＤと共に、キーワ
ード候補保持部に記憶する。そして、次に、ステップ５
０において、未登録キーワード候補処理（図８）を行
う。この未登録キーワード候補処理では、後述するよう
に、先の形態素解析部の処理では未登録キーワード候補
とされたが、後の形態素解析部の処理でキーワード候補
とされた単語について、文字列照合を行い、照合された
単語については未登録キーワードから外す処理を行う。
この未登録キーワード候補処理が終ると、次に、ステッ
プ５１に進む。 Next, step 47 determines whether the extracted keyword is registered in the (i-th) morphological analysis dictionary. If it is not registered, the process proceeds to step 48, stores the keyword together with the corresponding document ID in the unregistered keyword candidate holding unit as an unregistered keyword candidate, and then proceeds to step 51. If step 47 determines that the extracted keyword is registered in the dictionary, the process proceeds to step 49 and stores the keyword together with the corresponding document ID in the keyword candidate holding unit as a keyword candidate. Step 50 then performs the unregistered keyword candidate process (FIG. 8). As described later, this process performs string matching on words that an earlier morphological analysis unit left as unregistered keyword candidates but that a later morphological analysis unit has made keyword candidates, and removes the matched words from the unregistered keyword candidates. When the unregistered keyword candidate process finishes, the process proceeds to step 51.

【００５３】ステップ５１においては、登録文書の第ｉ
番目の形態素解析部によるキーワード抽出が終了したか
否かを判定する。キーワード抽出が終了していなけれ
ば、ステップ４６に戻り、ステップ４６からの処理を繰
り返し行う。また、このステップ５１の判定処理によ
り、第ｉ番目に対応する形態素解析部によるキーワード
抽出処理の終了が確認できれば、次の形態素解析部によ
るキーワード抽出処理を行うため、次のステップ５２に
おいて、使用する形態素解析部の順序を示す変数ｉをイ
ンクリメントして、つまり、変数ｉを（ｉ＝ｉ＋１とし
て）カウントアップし、次のステップ５３において、使
用可能な各国語対応の形態素解析部の個数ｎと次に使用
する形態素解析部の順序を示す変数ｉと比較する。 Step 51 determines whether keyword extraction from the registered document by the i-th morphological analysis unit has finished. If it has not, the process returns to step 46 and repeats from there. If step 51 confirms that the i-th morphological analysis unit has finished, then, to move on to keyword extraction by the next morphological analysis unit, step 52 increments the variable i indicating the order of the morphological analysis unit in use (i = i + 1), and step 53 compares the number n of available language-specific morphological analysis units with the variable i indicating the next morphological analysis unit to be used.

【００５４】この比較の結果、ｎ≧ｉであれば、第ｉ番
目の形態素解析部によるキーワード抽出処理は完了して
いないので、ステップ４５に戻り、ステップ４５からの
処理を繰り返し行う。また、ｎ＜ｉであれば、キーワー
ド抽出管理テーブルに設定された条件により使用可能状
態になっている形態素解析部による解析はすべて終了し
たことなので、次に、ステップ５４に進み、キーワード
候補群と未登録キーワード群の中からインデックスを作
成する処理を行う。これにより、一通りの文書の登録処
理は終了するので、次に、ステップ５５において、文書
の登録を終了するか否かを判定し、その外の文書の登録
処理を行う場合には、ステップ４１に処理を戻し、ステ
ップ４１からの処理を繰り返し行う。また、文書の登録
を終了する場合には、ここでの一連の処理を終了する。 If the comparison shows n ≥ i, keyword extraction by the i-th morphological analysis unit has not yet been completed, so the process returns to step 45 and repeats from there. If n < i, every morphological analysis unit enabled by the conditions in the keyword extraction management table has finished its analysis, so the process proceeds to step 54 and creates the index from the keyword candidate group and the unregistered keyword group. This completes registration of one document. Step 55 then determines whether document registration is to end. If another document is to be registered, the process returns to step 41 and repeats from there; otherwise the series of processes ends.
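The control flow of steps 44 to 54 can be traced in a short Python sketch. It rests on several assumptions: `register_document` and the `(use_flag, segment, dictionary)` tuple layout are hypothetical, the analyzers are assumed to be listed in analysis order, whitespace splitting stands in for morphological analysis, and the first enabled analyzer is assumed to read the full text while later ones read the unregistered candidates.

```python
def register_document(doc_id, text, analyzers):
    """analyzers: list of (use_flag, segment, dictionary) tuples in analysis order."""
    n = len(analyzers)
    keywords, unregistered = [], []
    text_analyzed = False
    i = 1                                             # step 44
    while i <= n:                                     # step 53: repeat while n >= i
        use_flag, segment, dictionary = analyzers[i - 1]
        if use_flag:                                  # step 45
            targets = list(unregistered) if text_analyzed else [text]
            text_analyzed = True
            for chunk in targets:
                for word in segment(chunk):           # step 46
                    if word in dictionary:            # step 47
                        keywords.append(word)         # step 49
                        if word in unregistered:      # step 50
                            unregistered.remove(word)
                    elif word not in unregistered:
                        unregistered.append(word)     # step 48
        i += 1                                        # step 52
    # Step 54: both groups go into the index with the document identifier.
    return {"doc_id": doc_id, "keywords": keywords, "unregistered": unregistered}


analyzers = [
    (True, str.split, {"検索"}),      # enabled, runs first on the full text
    (False, str.split, {"xyzzy"}),    # use flag off, skipped at step 45
    (True, str.split, {"index"}),     # enabled, runs on the leftovers
]
result = register_document("D1", "検索 index xyzzy", analyzers)
print(result)
```

Because the second analyzer is switched off, the word that only its dictionary contains remains an unregistered keyword, which is the behavior the use flag is meant to control.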

【００５５】次に、多言語文書登録処理の中のキーワー
ド抽出管理テーブルの条件の設定処理について説明す
る。この処理は、図４の多言語文書登録処理の全体の処
理フローのステップ４２において実行される処理であ
る。図５のフローチャートを参照する。ここでの処理が
開始されると、ステップ６１に進み、キーワード抽出管
理テーブルに記憶させている各国語対応の形態素解析部
の数をカウントし、この形態素解析部の数を示す変数ｎ
に設定する。つまり、キーワード抽出管理テーブル３０
に登録されている形態素解析部の数がｍであったとする
と、ｎ＝ｍと設定される。次に、ステップ６２ににおい
て、各国語対応の形態素解析部を用いて、解析する順番
をキーワード抽出管理テーブル３０の順番フィールド３
３に設定する。そして、次のステップ６３において、こ
の文書登録時に使用する各国語対応の形態素解析を設定
するため、各国の言語対応の形態素解析部を使用するか
使用しないかを、キーワード抽出管理テーブル３０の使
用フラグフィールド３４においてＯＮ／ＯＦＦフラグに
よって設定する。これにより、キーワード抽出管理テー
ブルの条件の設定処理が終了する。 Next, the process that sets the conditions in the keyword extraction management table within the multilingual document registration process is described. This is the process executed in step 42 of the overall flow of FIG. 4; refer to the flowchart of FIG. 5. When the process starts, step 61 counts the language-specific morphological analysis units stored in the keyword extraction management table and sets the variable n to that number. That is, if m morphological analysis units are registered in the keyword extraction management table 30, n = m is set. Step 62 then sets, in the order field 33 of the keyword extraction management table 30, the order in which the language-specific morphological analysis units are applied. Step 63 sets, by means of the ON/OFF flag in the use flag field 34 of the keyword extraction management table 30, whether each language-specific morphological analysis unit is used for this document registration. This completes the condition setting process for the keyword extraction management table.

[0056] Next, the process of setting the analysis target character string range in the multilingual document registration process will be described. As described above, this process is executed in step 43 of the overall flow of the multilingual document registration process of FIG. 4. Refer to the flowchart of FIG. 6. When this process starts, first, in step 71, the variable j, which indicates the entry number of a language-specific morphological analysis unit stored in the keyword extraction management table 30, is set to "1". Next, in step 72, the analysis target character string type for the language-specific morphological analysis unit of the j-th entry in the keyword extraction management table is selected from the analysis target character string type setting table. As described above, the analysis target character string type setting table 37 defines, for each analysis target character string type, the range of character strings subjected to morphological analysis. One of the types defined in this table 37 is selected and set in the analysis target character string type field 35 of the keyword extraction management table 30.

[0057] Next, the flow proceeds to step 73, where it is determined whether the "order of analysis" of the j-th morphological analysis unit is first. If it is first, the flow proceeds to step 74, where the previously set analysis target character string type is ignored and forcibly set to "text-ALL", and the flow then proceeds to step 75. As a result, the range of character strings analyzed by the first morphological analysis unit is always the entire text of the registered document. If the determination in step 73 finds that the unit is not first, the flow proceeds directly to step 75.

[0058] Next, in order to set the analysis target character string type for the entry of the next morphological analysis unit, the variable j is incremented in step 75, and in step 76 it is determined whether j ≤ n. If j ≤ n, there remains an entry for a morphological analysis unit that has not yet been set, so the flow returns to step 72 and the processing from step 72 is repeated. If j ≤ n does not hold, this process ends. In other words, the variable j is compared with the number n of morphological analysis units stored in the keyword extraction management table 30; if j is less than or equal to n, the flow returns to step 72, and otherwise the process ends.
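
The loop of steps 71 to 76 can be sketched in Python as follows. This is only an illustrative sketch: the `Entry` class, the `set_target_types` function, and the `chosen` mapping (standing in for the selection made from the analysis target character string type setting table 37) are assumed names that do not appear in the specification.

```python
from dataclasses import dataclass

TEXT_ALL = "text-ALL"


@dataclass
class Entry:
    """One row of a keyword extraction management table (hypothetical layout)."""
    language: str          # language handled by the morphological analysis unit
    order: int             # order field 33 (1 = analyzed first)
    target_type: str = ""  # analysis target character string type field 35


def set_target_types(table, chosen):
    """Apply steps 71-76 of FIG. 6 using a 1-based entry number j, as in the text."""
    n = len(table)                                   # n as counted in step 61
    j = 1                                            # step 71
    while j <= n:                                    # step 76 (tested first so an empty table is safe)
        entry = table[j - 1]
        entry.target_type = chosen[entry.language]   # step 72
        if entry.order == 1:                         # step 73
            entry.target_type = TEXT_ALL             # step 74: override the chosen type
        j += 1                                       # step 75
    return table
```

Even if "unregistered word group" is chosen for the unit that runs first, the sketch resets it to "text-ALL", which mirrors the forced setting of step 74.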

[0059] This completes the setting, in the keyword extraction management table 30, of the set of character strings to be analyzed by each morphological analysis unit. Since it has now been set whether the set of character strings to be analyzed is, for example, the entire text of the registered document or the set of character strings in the unregistered keyword candidate group, each morphological analysis unit performs keyword extraction according to these settings.

[0060] Next, the keyword extraction process in the multilingual document registration process will be described. Refer to the flowchart of FIG. 7. As described above, this process is executed in step 46 of the overall flow of the multilingual document registration process of FIG. 4. When the keyword extraction process starts and proceeds to step 81, morphological analysis is performed on the set of character strings specified by the analysis target character string type in the keyword extraction management table, and words are segmented. That is, words are segmented by morphological analysis starting from the position in the character string set where morphological analysis has not yet been completed.

[0061] Next, in step 82, it is determined whether the segmented words include an unnecessary word. If no unnecessary word is included, the flow proceeds directly to step 88, the words other than unnecessary words are immediately taken as the extracted keywords, and the process ends. If the segmented words include an unnecessary word, the flow proceeds to step 83, where it is determined whether the variable i is "1". If the variable i is "1", the morphological analysis unit currently in use is the first one, so there is no processing for unregistered keyword candidates; in this case too, the flow proceeds to step 88, the words other than unnecessary words are taken as the extracted keywords, and the process ends.

[0062] If the determination in step 83 finds that the variable i is not "1", the flow proceeds to step 84, where it is determined whether the analysis target character string of the i-th morphological analysis in the analysis order is the unregistered word group. If the analysis target character string is not the unregistered word group, the flow proceeds to step 85, where the segmented words are string-matched against the unregistered keyword candidates. Then, in step 86, the result of this string matching is evaluated. If a match is found, the flow proceeds to step 87, where the segmented word or the matched word is removed from the unregistered keyword candidates; then, in step 88, the words other than unnecessary words are taken as the extracted keywords, and the process ends.

[0063] If the determination in step 86 does not confirm a match, the flow proceeds to step 88 without performing step 87, the words other than unnecessary words are taken as the extracted keywords, and the process ends.

[0064] Next, the unregistered keyword candidate process in the multilingual document registration process will be described. Refer to the flowchart of FIG. 8. As described above, this process is executed in step 50 of the overall flow of the multilingual document registration process of FIG. 4. When the unregistered keyword candidate process starts and proceeds to step 91, it is first determined whether the variable i, which indicates the order of the morphological analysis currently in use, is "1". If the variable i is "1", the morphological analysis unit currently in use is the first one, as described above, so there is no processing for unregistered keyword candidates and the process ends immediately.

[0065] If the determination in step 91 confirms that the variable i is not "1", the flow proceeds to step 92, where it is determined whether the analysis target character string of the i-th morphological analysis in the analysis order is the unregistered word group. That is, it is determined whether the analysis target character string type field 35 of the entry in the keyword extraction management table 30 that corresponds to the i-th morphological analysis unit is set to "unregistered word group". If the analysis target character string type is "unregistered word group", the extracted keyword is removed from the unregistered keyword candidates in step 95, and the process ends.

[0066] If the determination in step 92 finds that the analysis target character string type is not "unregistered word group", the flow proceeds to step 93, where the extracted keyword is string-matched against the unregistered keyword candidates, and in step 94 the result of this string matching is evaluated. If a match is found, the flow proceeds to step 95, where the extracted keyword is removed from the unregistered keyword candidates, and the process ends. If no match is found, the process ends as is.

[0067] In this way, the unregistered keyword candidates are processed. As a result, for a word that was treated as an unregistered keyword candidate by an earlier morphological analysis unit but was identified as a keyword candidate by a later morphological analysis unit, string matching is performed, and matched words are removed from the unregistered keywords.
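
The decision flow of steps 91 to 95 (FIG. 8) might be sketched as below. The function name, its parameters, and the use of a Python `set` with exact string equality as the "string matching" are assumptions made for illustration; the specification does not say how the matching is implemented.

```python
UNREGISTERED_GROUP = "unregistered word group"


def process_unregistered_candidates(i, target_type, keyword, unregistered):
    """Return True if step 95 (removal from the candidates) was reached.

    i            -- order of the morphological analysis currently in use
    target_type  -- value of field 35 for the i-th unit
    keyword      -- keyword extracted by the i-th unit
    unregistered -- set of unregistered keyword candidates (modified in place)
    """
    if i == 1:                              # step 91: the first unit has nothing to process
        return False
    if target_type == UNREGISTERED_GROUP:   # step 92
        unregistered.discard(keyword)       # step 95
        return True
    if keyword in unregistered:             # steps 93-94: exact match (assumption)
        unregistered.discard(keyword)       # step 95
        return True
    return False                            # no match: end without changes
```

`set.discard` is used so that the call is harmless if the keyword is stored in a different surface form than the one extracted.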

[0068] Next, the registration of a document containing sentences written in multiple languages will be described through a concrete example of such a document. FIG. 9 shows an example of a multilingual document. As shown in FIG. 9, the multilingual document 99 contains both Japanese and English text, and the case of newly registering the multilingual document 99 is described. In this case, morphological analysis in Japanese is first applied to the entire document; next, morphological analysis in English is applied to the portions that were not analyzed; keywords are extracted; and the document (its document identification number) is registered together with the keywords.

[0069] If this multilingual document registration/retrieval device is provided with morphological analysis units capable of morphological analysis for each of four languages, "English", "Japanese", "Chinese", and "Arabic", the keyword extraction management table 30, which defines the conditions for keyword extraction, holds the control conditions for each morphological analysis unit as shown in FIG. 3(a). In this case, therefore, the number of morphological analysis units registered in the keyword extraction management table 30 is counted as "4" (n = 4) (step 61: FIG. 5) and, as shown in FIG. 3(b), is temporarily stored in record form (or as a variable) in the morphological analysis management table 36.

[0070] In setting the conditions of the keyword extraction management table 30, the order in which the morphological analysis units analyze a document is to be, for example, "Japanese", "English", "Arabic", "Chinese". Accordingly, in the order field 33 of the keyword extraction management table 30, the order is set to "2", "1", "4", "3" from the top, corresponding to the morphological analysis unit for each language (step 62: FIG. 5).

[0071] The languages of the morphological analysis units used for keyword extraction during document registration are here "Japanese", "English", and "Chinese". Therefore, in the keyword extraction management table 30, the use flag field 34 of the entries for "Japanese", "English", and "Chinese" is set to "ON", and the use flag field 34 of the entry for "Arabic" is set to "OFF".

[0072] Furthermore, the range of character strings analyzed by each morphological analysis unit during keyword extraction in document registration is specified so that keyword extraction runs efficiently. For the morphological analysis units for the four languages, the morphological analysis set second or later does not necessarily take the entire document being registered as its range; instead, the range of text or character strings to be analyzed, or a set of them, is specified.

[0073] For this purpose, as shown in FIG. 3(c), an analysis target character string type corresponding to a range of character strings for morphological analysis, predefined in the analysis target character string type setting table 37, is set in the analysis target character string type field 35 of the keyword extraction management table 30. In this example, the analysis target character string type field 35 is set, from the top, to "unregistered word group", "text-ALL", "unregistered word group", and "text-range specification". The Japanese morphological analysis unit analyzes the entire document, while the English and Chinese morphological analysis units analyze the unregistered word group. Note that the morphological analysis unit that runs first is forcibly reset to "text-ALL" by default, so that the entire registered document is always analyzed first (steps 73 to 74: FIG. 6).

[0074] Once conditions such as the order of each morphological analysis unit and the analysis target character string range have been set in the keyword extraction management table 30 in this way, each morphological analysis unit is controlled according to the set conditions and keyword extraction is executed. When keyword extraction starts, morphological analysis is first performed using the morphological analysis unit whose order is set to first. In this example, for the "Japanese" morphological analysis unit, which is first in order, it is confirmed that its "use flag" is "ON" (step 45), and then keywords are extracted from the character string set specified for this unit. Since "text-ALL" is set in this case, keywords are extracted from all text of the registered document (step 46).

[0075] In the keyword extraction process (FIG. 7), the segmented words other than those judged to be unnecessary words are taken as keywords. Because the "Japanese" morphological analysis is the first analysis, no unnecessary-word processing is performed on unregistered keyword candidates. Next, it is determined whether each extracted keyword is registered in the dictionary for Japanese morphological analysis (step 47), and those that are registered are stored as keyword candidates together with the document ID (step 48).

[0076] FIGS. 10(a) and 10(b) show, side by side, the contents of the keyword candidate holding unit and the unregistered keyword candidate holding unit at the stage where keyword extraction by the Japanese morphological analysis unit has finished. For example, suppose the identification number (document ID) of the document being registered (FIG. 9) is "20204". Once registration of all keyword candidates by the "Japanese" morphological analysis has finished, as shown in FIG. 10(a), the keyword candidates 102 for the document (fileID) 101 in the keyword candidate holding unit 100 store the words segmented by morphological analysis, "Iraq" (イラク), "Kuwait" (クウェート), "United Nations" (国連), ..., "bomber" (爆撃機), as keyword candidates for the document with document ID = 20204. Meanwhile, for words not registered in the morphological analysis dictionary, once registration of all unregistered keyword candidates by the "Japanese" morphological analysis has finished, as shown in FIG. 10(b), the unregistered keyword candidates 105 for the document (fileID) 104 in the unregistered keyword candidate holding unit 103 store the words likewise segmented by morphological analysis, "Patriot" (パトリオット), "The", "Ministry", "of", ..., "recently", as unregistered keyword candidates for the document with document ID = 20204.

[0077] When processing by the "Japanese" morphological analysis has finished, processing by the morphological analysis for the next language, "English", begins. As before, according to the conditions of the keyword extraction management table 30, for the "English" morphological analysis unit, which is second in order, it is confirmed that its "use flag" is also "ON" (step 45), and keywords are then extracted from the character string set specified for this unit. In this case, "unregistered word group" is set as the analysis target character string type in the analysis target character string type field 35 of the keyword extraction management table 30, so keyword extraction is performed on the character strings stored in the unregistered keyword candidates 105 for the document (fileID) 104 in the unregistered keyword candidate holding unit 103, as shown in FIG. 10(b). That is, for the document currently being registered (document ID = 20204), keyword extraction is performed on the character strings previously extracted as unregistered keyword candidates, without applying English morphological analysis to the whole document (step 46).

[0078] In this keyword extraction process (FIG. 7) as well, the segmented words other than those judged to be unnecessary words are taken as keywords. That is, as unnecessary-word processing on the unregistered keyword candidates, words judged to be unnecessary, such as "The" and "of", are removed from the unregistered keyword candidates. It is then determined whether each extracted keyword is registered in the dictionary for English morphological analysis (step 47), and those that are registered are stored as keyword candidates together with the document ID (step 48).

[0079] FIGS. 11(a) and 11(b) show, side by side, the contents of the keyword candidate holding unit and the unregistered keyword candidate holding unit at the stage where keyword extraction by the English morphological analysis unit has finished. As before, the identification number (document ID) of the document being registered (FIG. 9) is "20204". By the time all registration processing by the "English" morphological analysis has finished for the unregistered keyword candidates, the character strings "Ministry", "Education", ..., "said", "recently" stored in the unregistered keyword candidates 105 for the document (fileID) 104 in the unregistered keyword candidate holding unit 103 shown in FIG. 10(b) have been analyzed by the English morphological analysis unit. Of the resulting segmented words, those registered in the dictionary for English morphological analysis are additionally stored in the keyword candidates 112 for the document (fileID) 111 in the keyword candidate holding unit 110, as shown in FIG. 11(a). That is, "ministry", "education", ..., "say", "recent" are additionally stored in the corresponding entry as keyword candidates for the document with document ID = 20204.

[0080] Although not specifically mentioned in the description of morphological analysis so far, when words are segmented by morphological analysis, variant forms of a word are also normalized to a standard form at the same time. For example, uppercase letters are converted to lowercase, as in "Ministry" → "ministry" and "Education" → "education", and inflected forms are reduced to their base form, as in "said" → "say". In this way, a word treated as an unregistered keyword candidate by the morphological analysis of one language is extracted as a keyword candidate by morphological analysis in another language, and the keyword thus extracted is removed from the unregistered keyword candidates.
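
The normalization described here can be illustrated with a short Python sketch. The `BASE_FORMS` lookup table is a hypothetical stand-in for whatever dictionary-based reduction the English morphological analysis unit actually performs; only the example pairs mentioned in the description are included.

```python
# Hypothetical lookup of inflected or derived forms to base forms.
BASE_FORMS = {
    "said": "say",
    "recently": "recent",
}


def normalize(word):
    """Lowercase a segmented word, then reduce it to a base form if one is known."""
    lower = word.lower()
    return BASE_FORMS.get(lower, lower)


if __name__ == "__main__":
    for w in ["Ministry", "Education", "said", "recently"]:
        print(w, "->", normalize(w))
```

Words without an entry in the lookup table are only lowercased, so the sketch never invents a base form it does not know.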

[0081] Words not registered in the dictionary for English morphological analysis, "Monbushou" in this example, remain, so they are stored as unregistered keyword candidates. At the stage where all keyword extraction by the "English" morphological analysis has finished, as shown in FIG. 11(b), the unregistered keyword candidates 115 for the document (fileID) 114 in the unregistered keyword candidate holding unit 113 hold "Patriot" (パトリオット) and "Monbushou" in the corresponding entry as unregistered keyword candidates for the document with document ID = 20204.
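
As a rough illustration of the second (English) pass in this example, the sketch below runs over the unregistered keyword candidates left by the Japanese pass. The dictionary contents, the unnecessary-word list, the base-form table, and the function name `second_pass` are all assumptions for illustration; they contain only the words named in the description, and the real dictionaries are of course far larger.

```python
def second_pass(unregistered, dictionary, stop_words, base_forms):
    """Split unregistered candidates into new keyword candidates and a remainder.

    Words judged unnecessary are dropped, words found in the dictionary (after
    normalization) become keyword candidates, and everything else stays an
    unregistered keyword candidate in its original surface form.
    """
    keywords, remaining = [], []
    for word in unregistered:
        lower = word.lower()
        norm = base_forms.get(lower, lower)
        if norm in stop_words:
            continue                 # unnecessary word: removed from the candidates
        if norm in dictionary:
            keywords.append(norm)    # registered in the dictionary
        else:
            remaining.append(word)   # still unregistered after this pass
    return keywords, remaining


if __name__ == "__main__":
    candidates = ["パトリオット", "The", "Ministry", "of", "Education",
                  "Monbushou", "said", "recently"]
    english_dictionary = {"ministry", "education", "say", "recent"}
    unnecessary = {"the", "of"}
    bases = {"said": "say", "recently": "recent"}
    print(second_pass(candidates, english_dictionary, unnecessary, bases))
```

Under these assumed inputs, the words left over after the pass are "パトリオット" and "Monbushou", in line with the state described for FIG. 11(b).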

[0082] When processing by the "English" morphological analysis has finished, processing moves to the morphological analysis unit for the language that is third in order, namely "Arabic". However, the "use flag" of the "Arabic" morphological analysis unit is "OFF" in the keyword extraction management table 30. Therefore, when the conditions of the keyword extraction management table 30 are checked as before, the "use flag" of the third-ordered "Arabic" morphological analysis unit cannot be confirmed to be "ON" (step 45), and processing by the "Arabic" morphological analysis unit is skipped.

[0083] When processing by the morphological analysis unit for the third language has been skipped, processing moves to the morphological analysis unit for the language that is fourth in order. Here too, processing proceeds according to the conditions of the keyword extraction management table 30. The "use flag" of the "Chinese" morphological analysis unit, which is fourth in order, is confirmed to be "ON" (step 45), so this "Chinese" morphological analysis unit extracts keywords from the character string set specified for it. In this case, "unregistered word group" is set as the analysis target character string type in the analysis target character string type field 35 of the keyword extraction management table 30, so morphological analysis and keyword extraction are then performed on the character strings stored in the unregistered keyword candidates 115 for the document (fileID) 114 in the unregistered keyword candidate holding unit 113 shown in FIG. 11(b). That is, for the document currently being registered (document ID = 20204), keyword extraction continues on the character strings that currently remain as unregistered keyword candidates (step 46).

[0084] In this keyword extraction process (FIG. 7) as well, the segmented words other than those judged to be unnecessary words are taken as keywords, but no such words are found. In addition, none of the character strings stored as the unregistered keyword candidates 115 in the unregistered keyword candidate holding unit 113 is Chinese. Therefore, even after keyword extraction by the "Chinese" morphological analysis has finished, the contents of the keyword candidate holding unit 110 and the unregistered keyword candidate holding unit 113 remain unchanged, as shown in FIGS. 11(a) and 11(b).

【００８５】When keyword extraction by morphological analysis has finished for all languages, the index table (120: FIG. 12) is created from the keywords extracted so far. As shown in FIG. 12, the index table 120 serves as the index for multilingual document retrieval: for each extracted keyword 121, it registers the document (fileID) 122 and the unregistered word flag 123 of that document. The index entries registered in the index table 120 are keyed on the word of keyword 121; the document IDs of the corresponding documents (fileID) 122 are sorted, and when the document with a given document ID has an unregistered keyword, the unregistered flag is set to indicate this. Once the index table 120 has been created, registration of the multilingual document is complete.
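
As a rough illustration of the table described in this paragraph, the following Python sketch builds a keyword-to-document index with a per-document unregistered word flag. The names `IndexEntry` and `build_index`, the dictionary-based layout, and the sample document IDs are assumptions made for the example; they are not taken from FIG. 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    file_id: int            # document (fileID) 122
    has_unregistered: bool  # unregistered word flag 123


def build_index(docs):
    """Build {keyword: [IndexEntry, ...]} from per-document extraction results.

    `docs` maps file_id -> (keywords, unregistered_keywords). This input
    shape is hypothetical; the source only describes the resulting table.
    """
    index = {}
    for file_id, (keywords, unregistered) in docs.items():
        # The flag records that the document still has unregistered keywords.
        flag = bool(unregistered)
        for word in set(keywords):
            index.setdefault(word, []).append(IndexEntry(file_id, flag))
    # Document IDs are kept sorted under each keyword.
    for entries in index.values():
        entries.sort(key=lambda entry: entry.file_id)
    return index
```

Keeping the flag next to each document ID lets a later search step tell at a glance whether a hit comes from a document whose analysis left words unresolved.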

【００８６】Next, multilingual document retrieval using the index of the index table 120 created in this way is described.

【００８７】To retrieve a desired document, the user enters search conditions through the search condition input unit (11: FIG. 1), as described above. When the search conditions are entered, the multilingual index matching unit (12: FIG. 1) extracts words from the search condition expression and matches them against the keywords in the index of the index table. Based on the matching result, the text extraction unit (14: FIG. 1) reads out the documents that satisfy the search conditions, and the display unit (13: FIG. 1) displays them.
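
The matching step in this paragraph can be pictured with a small Python sketch. It assumes an index that maps each keyword to a list of document IDs and assumes that a document must contain every query word (AND semantics); the source does not say how multiple words are combined, and the function name `matching_documents` is hypothetical.

```python
def matching_documents(index, words):
    """Return the sorted document IDs whose index entries cover every word.

    Assumption: AND semantics across the query words.
    """
    result = None
    for word in words:
        ids = set(index.get(word, ()))
        # Intersect with the IDs found so far.
        result = ids if result is None else result & ids
    return sorted(result or [])
```

The returned IDs are what a text extraction step would use to read the documents out of the text database.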

【００８８】FIG. 13 is a block diagram showing the configuration of the main part of the multilingual index matching unit. It shows the blocks of the individual elements of the multilingual index matching unit together with the data flow when morphological analysis is applied to the search conditions to determine a search expression. In FIG. 13, 3 is the text database unit, 11 is the search condition input unit, 14 is the text extraction unit, 131a is the first morphological analysis unit, 131b is the first dictionary file unit, 132a is the second morphological analysis unit, 132b is the second dictionary file unit, and 133a is the Nth morphological analysis unit. Further, 134 is the search word candidate holding unit, 135 is the unregistered search word candidate holding unit, 136 is the search expression determination unit, and 137 is the order setting unit.

【００８９】As shown in FIG. 13, the multilingual index matching unit includes a plurality of morphological analysis units (131a to 133a), one per language, for analyzing the sentences in each language contained in the search condition expression entered as the search conditions, and a plurality of language-specific dictionary file units (131b to 133b) that supply dictionary data for each language to the corresponding morphological analysis unit. To control these morphological analysis units (131a to 133a) so that multilingual search conditions are analyzed efficiently, the matching unit also includes an order setting unit 137, which sets the order in which morphological analysis is applied to the search conditions. As working memory, it includes a search word candidate holding unit 134, which temporarily registers the words analyzed from the search conditions as search word candidates, and an unregistered search word candidate holding unit 135, which temporarily registers words that the morphological analysis unit for one language could not analyze so that the morphological analysis unit for another language can analyze them. When morphological analysis of the search condition expression is complete, the search expression determination unit 136 determines the words to be used in the search expression, and the text extraction unit 14 extracts the documents that satisfy the search conditions from the text database unit 3 using the document IDs in the index.

【００９０】FIG. 14 is a flowchart showing the overall flow of the multilingual document retrieval processing. It shows the whole sequence in which morphological analysis for each language is applied to a search condition expression written in one or more languages, the words to be used as search words are extracted, a search expression is created, and documents are retrieved. FIG. 15 is a flowchart showing the search word extraction processing within the multilingual document retrieval processing, and FIG. 16 is a flowchart showing the unregistered search word candidate processing within it.

【００９１】First, the overall multilingual document retrieval processing is described with reference to the flowchart of FIG. 14. When processing starts, in step 141 the multilingual search condition expression for the document search is entered through the search condition input unit 11. Next, in step 142, the search word extraction conditions are set, and then in step 143 the set of character strings to be subjected to morphological analysis is set. That is, as with the keyword extraction management table used in the keyword extraction processing described above, a search word extraction management table (not shown) is used to set the conditions of the morphological analysis units for extracting search words, and then the set of character strings to be analyzed by each configured morphological analysis unit is set. In setting the character string set, the concrete target of morphological analysis is specified, for example the entire text of the search condition expression or the set of character strings in the unregistered search word candidate group (unregistered word group). A separate search word extraction management table need not be provided; the keyword extraction management table 30 described above may be used as the search word extraction management table.

【００９２】When these settings are complete, in order to control the extraction of the search words from which the search expression is created, in step 144 the variable i, which indicates the position in the analysis order of the morphological analysis unit currently in use, is set to 1, selecting the morphological analysis unit to be used first on the search condition expression. Next, in step 145, it is determined whether the use flag of the morphological analysis unit for the language corresponding to i is ON. If the use flag is not ON, the i-th morphological analysis unit performs no analysis, and processing proceeds to step 152 to move on to the next morphological analysis unit.

【００９３】If step 145 finds that the use flag of the morphological analysis unit for the language corresponding to i is ON, that is, if the use flag of the morphological analysis entry whose order of analysis in the search word extraction management table is i is ON, processing proceeds to step 146, where the i-th morphological analysis unit extracts search words one after another from the character string set (FIG. 15). In this search word extraction processing, as described later, the i-th morphological analysis unit extracts search words (keywords) from the character string set specified by the analysis target character string type of its entry in the condition data of the search word extraction management table.

【００９４】Next, in step 147, it is determined whether the extracted search word is registered in the (i-th) morphological analysis dictionary. If it is not registered, processing proceeds to step 148, where the word is stored in the unregistered search word candidate holding unit as an unregistered search word candidate together with the corresponding search condition expression, and then proceeds to step 151. If step 147 finds that the extracted search word is registered in the morphological analysis dictionary, processing proceeds to step 149, where the word is stored in the search word candidate holding unit as a search word candidate together with the corresponding search condition expression. Then, in step 150, the unregistered search word candidate processing (FIG. 16) is performed. As described later, this processing applies string matching to words that an earlier morphological analysis unit classified as unregistered search word candidates but a later morphological analysis unit has classified as search word candidates, and removes the matched words from the unregistered search words. When the unregistered search word candidate processing is complete, processing proceeds to step 151.

【００９５】In step 151, it is determined whether search word extraction by the i-th morphological analysis unit has finished. If it has not, processing returns to step 146 and repeats from there. If step 151 confirms that search word extraction by the i-th morphological analysis unit has finished, then, in order to proceed to search word extraction by the next morphological analysis unit, the variable i is incremented (i = i + 1) in step 152, and in step 153 the number n of available language-specific morphological analysis units is compared with the variable i, which now indicates the morphological analysis unit to be used next.

【００９６】If the comparison shows n ≥ i, search word extraction by the i-th morphological analysis unit has not yet been completed, so processing returns to step 145 and repeats from there. If n < i, analysis by every morphological analysis unit enabled by the conditions in the search word extraction management table has finished, so processing proceeds to step 154, where a search expression is created from the search word candidate group and the unregistered search word group. A search expression for multilingual document retrieval has now been created, so in step 155 the search expression is matched against the keywords of the index, the corresponding documents are extracted, and processing ends.
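
The control flow of steps 144 to 153 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the flowchart of FIG. 14 itself: the `Analyzer` class, its field names, the `"text-all"` and `"unregistered"` type strings, and the pluggable `segment` function are all hypothetical, string matching is modeled as exact equality, and the sketch stops before the search expression is built in step 154.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Analyzer:
    """One row of a hypothetical search word extraction management table."""
    language: str
    order: int                      # position in the analysis order
    enabled: bool                   # use flag
    target: str                     # "text-all" or "unregistered" (assumed names)
    segment: Callable[[str], List[str]]
    dictionary: Set[str] = field(default_factory=set)
    stop_words: Set[str] = field(default_factory=set)


def extract_search_words(query: str, analyzers: List[Analyzer]):
    candidates: List[str] = []      # search word candidate holding unit 134
    unregistered: List[str] = []    # unregistered candidate holding unit 135
    ordered = sorted(analyzers, key=lambda a: a.order)
    n = len(ordered)
    i = 1                                            # step 144
    while i <= n:                                    # step 153: loop while n >= i
        analyzer = ordered[i - 1]
        if analyzer.enabled:                         # step 145: use flag ON?
            # The first analyzer always sees the whole query text.
            if i == 1 or analyzer.target == "text-all":
                strings = [query]
            else:
                strings = list(unregistered)
            for text in strings:
                for word in analyzer.segment(text):  # step 146
                    if word in analyzer.stop_words:
                        # Unnecessary words never become search words.
                        if word in unregistered:
                            unregistered.remove(word)
                    elif word in analyzer.dictionary:    # step 147
                        if word not in candidates:
                            candidates.append(word)      # step 149
                        if word in unregistered:         # step 150
                            unregistered.remove(word)
                    elif word not in candidates and word not in unregistered:
                        unregistered.append(word)        # step 148
        i += 1                                       # step 152
    return candidates, unregistered
```

Each analyzer either claims a word for the candidate list or leaves it in the unregistered list for a later analyzer, which is why the order and the use flags matter.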

【００９７】Next, the search word extraction processing within the multilingual document retrieval processing is described with reference to the flowchart of FIG. 15. As noted above, this processing is executed in step 146 of the overall flow of FIG. 14. When search word extraction starts, in step 161 morphological analysis is applied to the character string set of the search condition expression specified by the analysis target character string type in the search word extraction management table, and a word is extracted. That is, a word is extracted by morphological analysis starting from the position in the character string set where the previous analysis left off.

【００９８】Next, in step 162, it is determined whether the extracted words include an unnecessary word. If not, processing proceeds directly to step 168, where the words other than unnecessary words are taken as the extracted search words, and this processing ends. If the extracted words do include an unnecessary word, processing proceeds to step 163, where it is determined whether the variable i is 1. If i is 1, the morphological analysis unit currently in use is the first one, so there is nothing to do for the unregistered search word candidates; in this case too, processing proceeds to step 168, the words other than unnecessary words are taken as the extracted search words, and this processing ends.

【００９９】If step 163 finds that i is not 1, processing proceeds to step 164, where it is determined whether the analysis target character string of the i-th morphological analysis in the analysis order is the unregistered word group. If it is not, processing proceeds to step 165, where the extracted word is string-matched against the unregistered search word candidates, and in step 166 the result of the matching is evaluated. If a match is found, processing proceeds to step 167, where the extracted word, or the word that matched, is removed from the unregistered search word candidates; then, in step 168, the words other than unnecessary words are taken as the extracted search words, and this processing ends.

【０１００】If step 166 does not find a match, processing skips step 167 and proceeds to step 168, where the words other than unnecessary words are taken as the extracted search words, and this processing ends.

【０１０１】Next, the unregistered search word candidate processing within the multilingual document retrieval processing is described with reference to the flowchart of FIG. 16. As noted above, this processing is executed in step 150 of the overall flow of FIG. 14. When it starts, step 171 first determines whether the variable i, which indicates the position in the analysis order of the morphological analysis unit currently in use, is 1. If i is 1, the current morphological analysis unit is the first one, as explained above, so there is nothing to do for the unregistered search word candidates and the processing ends immediately.

【０１０２】If step 171 confirms that i is not 1, processing proceeds to step 172, where it is determined whether the analysis target character string of the i-th morphological analysis in the analysis order is the unregistered word group. That is, it is determined whether the analysis target character string type field of the entry for the i-th morphological analysis unit in the control table of search word extraction conditions is set to "unregistered word group". If it is, then in step 175 the extracted search word is removed from the unregistered search word candidates, and this processing ends.

【０１０３】If step 172 finds that the analysis target character string type is not "unregistered word group", processing proceeds to step 173, where the extracted search word is string-matched against the unregistered search word candidates, and in step 174 the result of the matching is evaluated. If a match is found, processing proceeds to step 175, where the extracted search word is removed from the unregistered search word candidates, and this processing ends. If no match is found, this processing ends without further action.
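
The branching of steps 171 to 175 is compact enough to sketch directly. In the Python sketch below, the function name, the argument layout, and the `"unregistered"` type string are hypothetical, and string matching is modeled as exact equality against the stored candidates, which is an assumption; the source does not specify the matching rule.

```python
def process_unregistered_candidates(i, target_type, word, unregistered):
    """Sketch of FIG. 16; mutates `unregistered` and reports whether a word was removed."""
    if i == 1:
        # Step 171: the first analyzer has no earlier candidates to reconcile.
        return False
    if target_type != "unregistered":
        # Steps 173-174: match the extracted word against the candidates.
        if word not in unregistered:
            return False
    # Step 175: drop the extracted search word from the unregistered candidates.
    if word in unregistered:
        unregistered.remove(word)
        return True
    return False
```

When the analyzer's input is the unregistered word group itself, the matching step is skipped because the extracted word is already known to come from that group.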

【０１０４】In this way the unregistered search word candidates are processed: words that an earlier morphological analysis unit classified as unregistered search word candidates but a later morphological analysis unit classified as search word candidates are string-matched, and the matched words are removed from the unregistered search words.

【０１０５】Next, retrieval of multilingual documents using a search condition expression that contains text in more than one language is described through a concrete example. FIG. 17 shows an example of such a search condition expression. As shown in FIG. 17, the search condition 179 is a search condition expression in which Japanese and English text are mixed; the following describes how search words are extracted from the text of search condition 179 and a document search is performed. In this case, Japanese morphological analysis is first applied to the whole of the multilingual search condition expression, and English morphological analysis is then applied to the parts that were not analyzed. The search words are extracted in this way, a search expression is generated from them, and the matching documents are retrieved.

【０１０６】Suppose the multilingual document registration and retrieval apparatus has four morphological analysis units, for English, Japanese, Chinese, and Arabic. The search word extraction management table, which defines the search word extraction conditions, is then assumed to hold control conditions for each morphological analysis unit with the same contents as the keyword extraction management table 30 shown in FIG. 3(a). The contents of the search word extraction management table are not illustrated separately; where necessary, the keyword extraction management table 30 of FIG. 3(a) is referred to as its equivalent. Accordingly, the number of morphological analysis units registered in the search word extraction management table in this case is counted as 4 (n = 4) and, as before, is temporarily stored in the morphological analysis management table (36: FIG. 3(b)) in record form (or as a variable).

【０１０７】In setting the conditions of the search word extraction management table, the order in which the morphological analysis units analyze the text is set to, for example, Japanese, English, Arabic, Chinese. To do so, the "order of analysis" in the order field of the search word extraction management table (keyword extraction management table 30) is set to 2, 1, 4, and 3 for the English, Japanese, Chinese, and Arabic morphological analysis units, respectively.

【０１０８】The morphological analysis units used in search word extraction here are those for Japanese, English, and Chinese. In the search word extraction management table (keyword extraction management table 30), the use flags of the Japanese, English, and Chinese entries are therefore set to ON, and the use flag of the Arabic entry is set to OFF.

【０１０９】Furthermore, to execute search word extraction efficiently, the range of character strings analyzed by each morphological analysis unit is specified. For the morphological analysis units of the four languages, analysis set to run second or later need not always cover the entire text of the search condition expression; instead, the range of text or character strings to be analyzed, or a set of them, is specified.

【０１１０】For this purpose, as in keyword extraction, the analysis target character string types predefined in the analysis target character string type setting table 37 shown in FIG. 3(c), each corresponding to a range of character strings to be analyzed, are set as the analysis target character string type of each entry in the search word extraction management table (keyword extraction management table 30). In this example, the analysis target character string types for the English, Japanese, Chinese, and Arabic morphological analysis units are set to "unregistered word group", "text-ALL", "unregistered word group", and "text-range specification", respectively.

【０１１１】In this case, therefore, the Japanese morphological analysis unit analyzes the entire text, whereas the English and Chinese morphological analysis units analyze only the unregistered word group. Note that the morphological analysis unit that runs first is forcibly reset to "text-ALL" by default, so that the first pass always analyzes the entire text.
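
The settings walked through in this example can be written down as a small table. In the Python sketch below, the order values, use flags, and target types follow the text; the field names, the lowercase type strings, and the helper `analysis_plan` are assumptions made for illustration.

```python
# Settings taken from the example; field names and type strings are hypothetical.
MANAGEMENT_TABLE = [
    {"language": "English",  "order": 2, "use": True,  "target": "unregistered"},
    {"language": "Japanese", "order": 1, "use": True,  "target": "text-all"},
    {"language": "Chinese",  "order": 4, "use": True,  "target": "unregistered"},
    {"language": "Arabic",   "order": 3, "use": False, "target": "text-range"},
]


def analysis_plan(table):
    """Return (language, target) pairs in execution order, skipping disabled units."""
    plan = []
    rows = sorted(table, key=lambda row: row["order"])
    for position, row in enumerate(rows, start=1):
        if not row["use"]:
            continue
        # The unit that runs first is always forced to analyze the whole text.
        target = "text-all" if position == 1 else row["target"]
        plan.append((row["language"], target))
    return plan
```

With these settings the plan is Japanese on the whole text, then English and Chinese on the unregistered word group, with Arabic skipped.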

【０１１２】Once conditions such as the order of the morphological analysis units to be used and their analysis target character string ranges have been set in the search word extraction management table (keyword extraction management table 30), each morphological analysis unit is controlled according to those conditions and search word extraction is executed. When search word extraction starts, morphological analysis is first performed by the morphological analysis unit set to run first. In this example that is the Japanese morphological analysis unit: after confirming that its use flag is ON (step 145), search words are extracted from the character string set specified for it. Since "text-ALL" is set here, search words are extracted from the entire text of the search condition expression (step 146).

【０１１３】In the search word extraction processing (FIG. 15), the extracted words other than those judged to be unnecessary words are taken as search words. Because Japanese morphological analysis is the first analysis pass, no unnecessary-word processing is applied to the unregistered search word candidates. It is then determined whether each extracted search word is registered in the dictionary for Japanese morphological analysis (step 147), and the registered ones are stored as search word candidates together with the corresponding search condition expression (step 148).

【０１１４】FIGS. 18(a) and 18(b) show, side by side, the contents of the search word candidate holding unit and the unregistered search word candidate holding unit at the point where search word extraction by the Japanese morphological analysis unit has finished. Take the multilingual text of search condition expression 179 shown in FIG. 17, 「イラク部隊の撤退とパトリオットミサイルとMinistry of Education」 ("withdrawal of Iraqi troops and Patriot missiles and Ministry of Education"). When Japanese morphological analysis has finished registering all search word candidates, the search word candidates 181 in the search word candidate holding unit contain the words extracted by morphological analysis, 「イラク」 (Iraq), 「部隊」 (troops), 「撤退」 (withdrawal), and 「ミサイル」 (missile), as shown in FIG. 18(a). The words not registered in the morphological analysis dictionary, namely 「パトリオット」 (Patriot), "Ministry", "of", and "Education", which were likewise extracted by Japanese morphological analysis, are stored as unregistered search word candidates 182 in the unregistered search word candidate holding unit once all unregistered candidates have been registered, as shown in FIG. 18(b).

[0115] When processing by the Japanese morphological analysis has finished in this way, processing by the morphological analysis for the next language, English, begins. As before, processing follows the conditions in the search word extraction management table (keyword extraction management table 30). The English morphological analysis unit, second in order, is run only after confirming that its "use flag" is also "ON" (step 145), and search word extraction is then performed on the character string set specified for this morphological analysis unit. In this case, because "unregistered word group" is set as the analysis target character string type in the search word extraction management table (keyword extraction management table 30), search word extraction is performed on the character strings of the unregistered search word candidates 182 in the unregistered search word candidate holding unit, shown in FIG. 18(b) (step 146).

[0116] In the search word extraction processing in this case (FIG. 15) as well, the cut-out words other than those judged to be unnecessary words are taken as search words, as before. That is, as the unnecessary-word processing for the unregistered search word candidates, "of", which is judged to be an unnecessary word, is removed from the unregistered search word candidates. It is then determined whether each extracted search word is registered in the dictionary for English morphological analysis (step 147), and each registered word is stored as a search word candidate together with the corresponding search condition expression (step 148).

[0117] FIGS. 19(a) and 19(b) show, side by side, the contents of the search word candidate holding unit and the unregistered search word candidate holding unit at the stage where search word extraction by the next morphological analysis unit, the one for English, has finished. Once all registration processing by the English morphological analysis has finished for the unregistered search word candidates 182 of FIG. 18(b), the character strings "Ministry" and "Education" have been analyzed by the English morphological analysis unit. Of the words cut out, those registered in the dictionary for English morphological analysis, "ministry" and "education", are additionally stored as search word candidates in the search word candidates 191 of the search word candidate holding unit, as shown in FIG. 19(a). Although this description of the morphological analysis processing does not go into it, when words are cut out by morphological analysis, uppercase letters are unified to lowercase and word forms are unified to a standard form at the same time, as before. In this way, a word left as an unregistered search word candidate by the morphological analysis for one language is extracted as a search word candidate by the morphological analysis for another language, and each search word so extracted is removed from the unregistered search word candidates.

[0118] As a result, what remains is whatever is not registered in the dictionary for English morphological analysis, in this example 「パトリオット」, and it is stored as an unregistered search word candidate. Once all search word extraction by the English morphological analysis has finished, 「パトリオット」 is stored as the unregistered search word candidate 192 in the unregistered search word candidate holding unit, as shown in FIG. 19(b).
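The English pass described in paragraphs [0115] to [0118] can be sketched roughly as follows. This is a minimal illustration, not the apparatus itself: the function name `analysis_pass`, the toy dictionary, and the toy unnecessary-word list are assumptions made for the example, and the input is taken to be already segmented, whereas a real morphological analysis unit would also perform the segmentation and standard-form unification.

```python
def analysis_pass(strings, dictionary, unnecessary_words):
    """Run one (hypothetical) analyzer pass over unregistered candidates.

    Returns (found, remaining): words registered in the dictionary, and
    strings that stay unregistered after this pass.
    """
    found, remaining = [], []
    for s in strings:
        word = s.lower()  # unify uppercase to lowercase, as noted in [0117]
        if word in unnecessary_words:
            continue  # unnecessary word: removed from the candidates
        if word in dictionary:
            found.append(word)  # corresponds to steps 147 and 148
        else:
            remaining.append(s)  # still an unregistered candidate
    return found, remaining


# Toy data (assumed); reference numerals follow FIGS. 18 and 19.
english_dictionary = {"ministry", "education"}
english_unnecessary = {"of"}
candidates_181 = ["イラク", "部隊", "撤退", "ミサイル"]
unregistered_182 = ["パトリオット", "Ministry", "of", "Education"]

found, unregistered_192 = analysis_pass(
    unregistered_182, english_dictionary, english_unnecessary
)
candidates_191 = candidates_181 + found
```

With this toy data, `candidates_191` gains "ministry" and "education", and only 「パトリオット」 is left in `unregistered_192`, which mirrors the state shown in FIGS. 19(a) and 19(b).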

[0119] When processing by the English morphological analysis has finished in this way, processing would next move to the morphological analysis unit for the language third in order, Arabic. However, in the search word extraction management table (keyword extraction management table 30), the "use flag" of the Arabic morphological analysis unit is "OFF". Accordingly, when the conditions of the table are followed as before, the "use flag" of the Arabic morphological analysis unit, third in order, cannot be confirmed to be "ON" (step 145), and processing by the Arabic morphological analysis unit is skipped.

[0120] When processing by the morphological analysis unit for the third language has been skipped in this way, processing moves to the morphological analysis unit for the language fourth in order. Here too, processing proceeds according to the conditions of the search word extraction management table (keyword extraction management table 30), as before. In this case the "use flag" of the Chinese morphological analysis unit, fourth in order, is confirmed to be "ON" (step 145), so the Chinese morphological analysis unit performs search word extraction on the character string set specified for it. Because "unregistered word group" is set as the analysis target character string type in the search word extraction management table (keyword extraction management table 30), morphological analysis is next applied to the character strings stored in the unregistered search word candidates 192 of the unregistered search word candidate holding unit shown in FIG. 19(b), and search word extraction is performed (step 146).

[0121] In the search word extraction processing in this case (FIG. 15) as well, the cut-out words other than those judged to be unnecessary words are taken as search words, as before, but no word qualifies. Moreover, none of the character strings stored as the unregistered search word candidates 192 in the unregistered search word candidate holding unit is Chinese. Consequently, even after search word extraction by the Chinese morphological analysis has finished, the contents of the search word candidate storage unit and the unregistered search word candidate storage unit remain unchanged, as shown in FIGS. 19(a) and 19(b).
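The table-driven control flow of paragraphs [0115] to [0121] (apply each analysis unit in its set order, skip any unit whose use flag is OFF, and hand the leftover unregistered words to the next unit) might be sketched as below. All names (`AnalyzerEntry`, `run_cascade`), the toy dictionaries, and the pre-segmented token list are assumptions for illustration. The "analysis target character string type" field is omitted for brevity: every pass here simply consumes the current unregistered list, which for the first pass is the whole input.

```python
from dataclasses import dataclass


@dataclass
class AnalyzerEntry:
    """One row of a (hypothetical) keyword extraction management table."""
    language: str
    order: int
    use_flag: bool
    dictionary: frozenset
    unnecessary_words: frozenset = frozenset()


def run_cascade(tokens, table):
    """Apply the enabled analyzers in order; return the final state."""
    candidates, unregistered, executed = [], list(tokens), []
    for entry in sorted(table, key=lambda e: e.order):
        if not entry.use_flag:  # use flag check (step 145): skip if OFF
            continue
        executed.append(entry.language)
        remaining = []
        for token in unregistered:  # extraction on the target set (step 146)
            word = token.lower()
            if word in entry.unnecessary_words:
                continue
            if word in entry.dictionary:
                candidates.append(word)
            else:
                remaining.append(token)
        unregistered = remaining
    return candidates, unregistered, executed


# Toy table and toy segmentation of search condition expression 179 (assumed).
table = [
    AnalyzerEntry("Japanese", 1, True,
                  frozenset({"イラク", "部隊", "撤退", "ミサイル"}),
                  frozenset({"の", "と"})),
    AnalyzerEntry("English", 2, True,
                  frozenset({"ministry", "education"}), frozenset({"of"})),
    AnalyzerEntry("Arabic", 3, False, frozenset()),
    AnalyzerEntry("Chinese", 4, True, frozenset()),
]
tokens = ["イラク", "部隊", "の", "撤退", "と", "パトリオット", "ミサイル",
          "と", "Ministry", "of", "Education"]
candidates, unregistered, executed = run_cascade(tokens, table)
```

In this sketch the Arabic row is never executed, and the Chinese pass runs but changes nothing, so the final state matches the one reached after the English pass.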

[0122] When search word extraction by morphological analysis has finished for all languages in this way, a search expression 200 is created from the search words extracted so far, as shown in FIG. 20. The search expression 200 is an array of pairs, each consisting of a search key 201 for multilingual document search and an unregistered word flag 202. Using the search expression 200, each search key 201 and its unregistered word flag 202 are collated with the index (keyword, document ID, unregistered word flag) of, for example, the index table 120 shown in FIG. 12, and the multilingual document (FIG. 9) is read out using the corresponding document ID.

[0123]

[Effects of the Invention] As described above, according to the multilingual document registration/retrieval apparatus of the present invention, when documents or search condition expressions containing sentences written in one or more languages are registered or used for a search, the morphological analysis units for the languages in which they are written are combined as far as possible, so that words are cut out as accurately as possible. As a result, the index created at registration can be kept compact, and the accuracy (recall) of collation against the index at search time can be improved. In addition, control information for the keyword extraction conditions, such as the "order of analysis", the "use flag", and the "type of analysis target character string", is embedded in the keyword extraction management table. The target text is therefore not simply analyzed redundantly by every morphological analysis unit; it can instead be analyzed efficiently in a way suited to the situation. Because the user can customize these conditions, multilingual documents can be registered and searched in whatever way suits each user.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a multilingual document registration/retrieval apparatus according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the main part of the multilingual keyword extraction unit.

FIG. 3 is a diagram explaining the contents of the control tables used in keyword extraction processing.

FIG. 4 is a flowchart showing the overall flow of multilingual document registration processing.

FIG. 5 is a flowchart showing the flow of the processing that sets the conditions of the keyword extraction management table within the multilingual document registration processing.

FIG. 6 is a flowchart showing the flow of the processing that sets the analysis target character string range within the multilingual document registration processing.

FIG. 7 is a flowchart showing the flow of keyword extraction processing within the multilingual document registration processing.

FIG. 8 is a flowchart showing the flow of unregistered keyword candidate processing within the multilingual document registration processing.

FIG. 9 is a diagram showing an example of a multilingual document.

FIGS. 10(a) and 10(b) are diagrams showing, side by side, the contents of the keyword candidate holding unit and the unregistered keyword candidate holding unit at the stage where keyword extraction by the Japanese morphological analysis unit has finished.

FIGS. 11(a) and 11(b) are diagrams showing, side by side, the contents of the keyword candidate holding unit and the unregistered keyword candidate holding unit at the stage where keyword extraction by the next morphological analysis unit, the one for English, has finished.

FIG. 12 is a diagram showing an example of a created multilingual index table.

FIG. 13 is a block diagram showing the configuration of the main part of the multilingual index matching unit.

FIG. 14 is a flowchart showing the overall flow of multilingual document search processing.

FIG. 15 is a flowchart showing the flow of search word extraction processing within the multilingual document search processing.

FIG. 16 is a flowchart showing the flow of unregistered search word candidate processing within the multilingual document search processing.

FIG. 17 is a diagram showing an example of a search condition expression for a search condition that contains sentences written in multiple languages.

FIGS. 18(a) and 18(b) are diagrams showing, side by side, the contents of the search word candidate holding unit and the unregistered search word candidate holding unit at the stage where search word extraction by the Japanese morphological analysis unit has finished.

FIGS. 19(a) and 19(b) are diagrams showing, side by side, the contents of the search word candidate holding unit and the unregistered search word candidate holding unit at the stage where search word extraction by the next morphological analysis unit, the one for English, has finished.

FIG. 20 is a diagram showing an example of a search expression generated from a multilingual search condition expression.

[Explanation of symbols]

1: input processing unit; 2: multilingual keyword extraction unit; 3: text database unit; 4: index registration unit; 5: index file unit; 11: search condition input unit; 12: multilingual index registration unit; 13: display unit; 14: text extraction unit; 21a: first morphological analysis unit; 21b: first dictionary file unit; 22a: second morphological analysis unit; 22b: second dictionary file unit; 23a: N-th morphological analysis unit; 24: keyword candidate holding unit; 25: unregistered keyword candidate holding unit; 26: keyword/unregistered keyword determination unit; 27: order setting unit; 28: index registration unit; 29: index file unit; 30: keyword extraction management table; 31: number field; 32: supported language type field; 33: order field; 34: use flag field; 35: analysis target character string type field; 36: morphological analysis management table; 37: analysis target character string type setting table; 99: multilingual document; 100: keyword candidate storage unit; 101: document (fileID); 102: keyword candidates; 103: unregistered keyword candidate storage/holding unit; 104: document (fileID); 105: unregistered keyword candidates; 110: keyword candidate storage unit; 111: document (fileID); 112: keyword candidates; 113: unregistered keyword candidate storage/holding unit; 114: document (fileID); 115: unregistered keyword candidates; 120: index table; 121: extracted keyword; 122: document (fileID); 123: unregistered word flag; 131a: first morphological analysis unit; 131b: first dictionary file unit; 132a: second morphological analysis unit; 132b: second dictionary file unit; 133a: N-th morphological analysis unit; 134: search word candidate holding unit; 135: unregistered search word candidate holding unit; 136: search expression determination unit; 137: order setting unit; 179: search condition; 181: search word candidates; 182: unregistered search word candidates; 191: search word candidates; 192: unregistered search word candidates; 200: search expression; 201: search key; 202: unregistered word flag.

Continuation of the front page

(72) Inventor: Kazuo Aihara, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Tatsuomi Kita, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Naomi Hiraoka, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Hiroko Matsuo, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Hiroshi Yamaguchi, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Shinji Kawamoto, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa

Claims

1. A multilingual document registration/retrieval apparatus that creates and registers an index used for searching documents containing sentences in a plurality of languages, and searches the documents using the index, the apparatus comprising: multilingual document storage means for storing documents containing sentences in a plurality of languages; keyword extraction means for performing, on a document, morphological analysis corresponding to the sentences of the different languages and extracting keywords of the document; index registration means for registering the keywords extracted by the keyword extraction means as an index together with an identifier of the corresponding document; search condition input means for inputting a search condition; index matching means for cutting out words from the search condition input by the search condition input means and matching the cut-out words against the keywords of the index; and reading means for reading out a document that matches the search condition in accordance with the result of matching the keywords and the words.

2. A multilingual document registration/retrieval apparatus that creates and registers an index used for searching documents containing sentences in a plurality of languages, and searches the documents using the index, wherein a multilingual document registration device that creates the index comprises: a plurality of word cutout means whose cutout target languages differ from one another; setting means for setting processing priorities of the plurality of word cutout means; keyword extraction control means for controlling the plurality of word cutout means in accordance with the processing priorities to cut out words and extract keywords; and index registration means for registering, in the index, each extracted keyword in association with the identifier of the document from which the word of that keyword was cut out.

3. The multilingual document registration/retrieval apparatus according to claim 2, wherein the keyword extraction control means has a word that cannot be identified by the word cutout means of a given processing priority cut out by the word cutout means of the next processing priority, uses the identifier of each word so cut out as a keyword, and, for a word that cannot be identified to the end by the plurality of word cutout means, uses the word itself as a keyword.

4. A multilingual document registration/retrieval apparatus that creates and registers an index used for searching documents containing sentences in a plurality of languages, and searches the documents using the index, wherein a document registration device comprises: input means for inputting a document to be registered and instructing keyword extraction; keyword extraction means, including a dictionary file used for cutting out words, for extracting keywords of the document by morphological analysis; registration means for registering the keywords extracted by the keyword extraction means together with the identifier of the corresponding document; and holding means for holding the document to be registered, the index, and words not registered in the dictionary file.

5. The multilingual document registration/retrieval apparatus according to claim 4, wherein the keyword extraction means comprises: a plurality of word cutout means, each for cutting out words by morphological analysis from the sentences of its corresponding language in a document composed of sentences in a plurality of languages; a plurality of dictionary files that store dictionaries for the respective languages and are referred to by the plurality of word cutout means; order setting means for setting the order in which the plurality of word cutout means are applied; and control means for controlling the plurality of word cutout means in the order set by the order setting means so as to cut out, from the document, the words of the sentences of the corresponding languages.

6. The multilingual document registration/retrieval apparatus according to claim 5, further comprising: unregistered keyword candidate holding means for temporarily holding, as unregistered keyword candidates, words judged by the word cutout means to be unregistered words; and keyword candidate holding means for temporarily holding, as keyword candidates, the other words, which have been extracted from a dictionary, wherein the control means: controls the first-stage word cutout means to take as input a document containing sentences in a plurality of languages and cut out words by morphological analysis, temporarily holding words judged to be unregistered words in the unregistered keyword candidate holding means as unregistered keyword candidates and holding words extracted from the dictionary in the keyword candidate holding means as keyword candidates; sequentially controls each subsequent word cutout means to take as input the unregistered keyword candidates held in the unregistered keyword candidate holding means by the preceding-stage word cutout means and cut out words by morphological analysis, leaving words judged to be unregistered words in the unregistered keyword candidate holding means as they are, and deleting words extracted from the dictionary from the unregistered keyword candidate holding means and additionally holding them in the keyword candidate holding means; and finally registers in the index, together with the identifier of the corresponding document, the keyword candidates held in the keyword candidate holding means as keywords and the unregistered keyword candidates held in the unregistered keyword candidate holding means as unregistered keywords.

7. A multilingual document registration/retrieval apparatus that creates and registers an index used for searching documents containing sentences in a plurality of languages, and searches the documents using the index, the apparatus comprising: search condition input means for inputting a search condition; multilingual index matching means for cutting out words from the search condition input by the search condition input means and matching them against the index; and means for reading out a matching document in accordance with the matching result of the index matching means.

8. The multilingual document registration/retrieval apparatus according to claim 7, wherein the index matching means comprises: a plurality of word cutout means, each for cutting out words by performing morphological analysis on the sentences of its corresponding language in a document composed of a plurality of languages; order setting means for combining one or more of the word cutout means and setting the order in which the word cutout means are applied; and control means for controlling the word cutout means, in the order set by the order setting means, to cut out the words of the search condition input by the search condition input means.

9. The multilingual document registration/retrieval apparatus according to claim 8, further comprising: unregistered search word candidate holding means for temporarily holding, as unregistered search word candidates, words judged by the word cutout means to be unregistered words; and search word candidate holding means for temporarily holding, as search word candidates, the other words, which have been extracted from a dictionary, wherein the control means: controls the first-stage word cutout means to take as input a document containing sentences in a plurality of languages and cut out words by morphological analysis, temporarily holding words judged to be unregistered words in the unregistered search word candidate holding means as unregistered search word candidates and holding words extracted from the dictionary in the search word candidate holding means as search word candidates; sequentially controls each subsequent word cutout means to take as input the unregistered search word candidates held in the unregistered search word candidate holding means by the preceding-stage word cutout means and cut out words by morphological analysis, leaving words judged to be unregistered words in the unregistered search word candidate holding means as they are, and deleting words extracted from the dictionary from the unregistered search word candidate holding means and additionally holding them in the search word candidate holding means; and finally performs index matching using the search word candidates held in the search word candidate holding means as search words and the unregistered search word candidates held in the unregistered search word candidate holding means as unregistered search words, extracts the corresponding document by means of the text database unit, and outputs the result information.