JP2010250389A

JP2010250389A - Information retrieval system, method and program, and index generation system, method, and program

Info

Publication number: JP2010250389A
Application number: JP2009096383A
Authority: JP
Inventors: Masaki Yonetani; 雅樹米谷; Fumihiko Terui; 文彦照井
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2009-04-10
Filing date: 2009-04-10
Publication date: 2010-11-04
Anticipated expiration: 2029-04-10
Also published as: JP5285491B2

Abstract

PROBLEM TO BE SOLVED: To provide a retrieval technology that gives appropriate retrieval results while eliminating retrieval omissions. SOLUTION: When an index used for retrieval is built, each document is divided into tokens by two methods, namely morphological analysis and an N-gram method. The boundaries of all tokens are calculated from the plurality of token division results, and the appearance position information of each token and of the token that follows it is calculated, so that indexing is performed. The computer system then stores a token string made up of all the token boundaries calculated from the plurality of token division results, in order to restore the original document and highlight hit positions. The start position number and end position number of each token are stored together with it. At retrieval time, a retrieval word input by a user is divided by each token division method, and the resulting token strings are combined by OR and retrieved. Since a document is included in the retrieval result when either token string matches, retrieval omissions are prevented. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a technique for retrieving information, such as text information, stored in a storage device of a computer.

With the recent spread of computers and the Internet, an enormous number of unstructured documents such as news articles, blogs, and other text content are being created, and there is a growing need for search systems that can find the required documents quickly and accurately.

To accurately find the document a user intends among a large volume of text, search systems identify the language and analyze the text to achieve higher precision than simple string matching. At present, however, languages without clear word delimiters, such as Japanese, and compound words, which are found in many languages including Japanese and German, cannot be handled efficiently because of the nature of text search indexes.

Morphological analysis and N-gram analysis are conventionally known as typical techniques for dealing with such compound words.

When morphological analysis is used to analyze documents, grammar and vocabulary are taken into account, so documents closer in meaning to the search terms entered by the user can be ranked higher. However, depending on the search terms entered, the query may be divided into tokens differently from the document text, and this causes search omissions.

When N-gram analysis is used instead, no search omissions occur, but because grammar and vocabulary are not taken into account, these factors cannot be reflected in the ranking of the search results.

To compensate for the drawbacks of both methods, one could hold search indexes built with both methods at the same time and merge the results. This is inefficient, however, because not only the index but also the token string used for purposes such as summary generation must be stored twice.

Typical prior art is now described in somewhat more detail. A conventional search system divides the original document and the text entered by the user into units (hereinafter, tokens) and decides whether to include the document in the search results according to whether those tokens match. When tokens are stored in an inverted index, each token is given a position number in the order of its appearance in the original document.
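
As an illustrative sketch only, the conventional arrangement described above can be pictured as follows. The function names, the zero-based position numbers, and the rule that query tokens must appear at consecutive positions are assumptions made for this example and are not taken from the cited prior art.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a list of (doc_id, position) pairs.

    docs is {doc_id: [token, ...]}; positions are numbered from 0 in
    order of appearance (an assumption of this sketch).
    """
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token].append((doc_id, position))
    return index


def matching_docs(index, query_tokens):
    """Return ids of documents that contain the query tokens consecutively."""
    if not query_tokens:
        return set()
    hits = set()
    for doc_id, pos in index.get(query_tokens[0], []):
        # Every following query token must sit at the next position number.
        if all((doc_id, pos + k) in index.get(tok, [])
               for k, tok in enumerate(query_tokens)):
            hits.add(doc_id)
    return hits
```

A document is returned only when the query and the document were divided into the same tokens, which is why a difference in token division leads to a search omission.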

Two main methods are used to create these tokens. One is the morphological analysis mentioned above, which divides a sentence into meaningful words and uses those words as tokens. The other is the N-gram method, also mentioned above, which divides the text into overlapping units of N characters.
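
For illustration, the overlapping N-character division can be sketched as below. The function name `ngram_tokens`, the default of N = 2, and the handling of text shorter than N are assumptions of this sketch.

```python
def ngram_tokens(text: str, n: int = 2) -> list[tuple[str, int]]:
    """Divide text into overlapping N-character tokens.

    Each token is returned with its starting character offset.
    Text shorter than n is returned as a single token (an assumption).
    """
    if len(text) < n:
        return [(text, 0)] if text else []
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


# The bigrams of 東京都 (Tokyo-to) are 東京 and 京都.
bigrams = ngram_tokens("東京都")
```

Because the bigram 京都 (Kyoto) occurs inside 東京都, a purely mechanical division of this kind also matches text that the user did not mean.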

Morphological analysis uses a dictionary to perform meaningful segmentation and can take word inflection into account, so it enables high-quality search. On the other hand, words that are not in the dictionary may fail to be segmented or may be segmented incorrectly, so that even a document containing exactly the same word as the search string can be omitted from the search results.

The N-gram technique, by contrast, segments text mechanically, so any document that exactly matches the search string can be included in the results. Its drawbacks are noise from partial matches (for example, the query 「京都」 (Kyoto) matching 「東京都」 (Tokyo-to)) and the inability to handle word inflection. Conventional systems generally adopt one of these two methods exclusively; each has strengths and weaknesses, and in many cases neither provides sufficient search quality. Searching the indexes built by each method separately and merging the results is sometimes done, but the merging is very complicated and costly. Another well-known approach improves ranking accuracy by recording morpheme word-boundary information in the N-grams, but it cannot capture inflection or orthographic variation, which are major features of morphological analysis, and so it is not a good solution.

Japanese Patent Application Laid-Open No. 11-204867 discloses a technique that improves search accuracy by storing word-boundary information in each N-gram, using N-grams as the basis. In this technique, the character strings used for word division and for N-gram division must be exactly identical, so it cannot perform searches that take into account inflection and orthographic variation, which are the distinguishing features of morphological analysis. Furthermore, with this kind of technique it is common to display the content of a retrieved document to the user without fetching the original document from where it is actually stored, or to create a summary tailored to the user's search request for each document in the search results, so that the user can judge whether a document meets the request without reading it in full. In that case, to create the summary quickly without fetching the actual original document, the token division result of the original document is retained at indexing time. When the search results are presented, the original document is reproduced from it, the position in the reproduced document is calculated from the hit position obtained from the search index, and that position is highlighted. However, when several analysis methods such as morphological analysis and N-gram analysis are used, different token division results are produced. Because the search index holds position numbers assigned in the order in which tokens appear, each token division result must be stored separately, and the search results must also be generated by each method and then merged, which is not efficient.

Japanese Patent Laid-Open No. 8-249346 discloses a technique that enables restoration of the original document and highlighting of hit positions by holding the word sequence in the document's index together with an index for reproducing words whose notation in the original document differs from their notation in the index. This technique is designed to handle a single token division sequence (mainly from morphological analysis); when a plurality of token division methods are used, it cannot restore the original document or highlight hit positions.

In the technique disclosed in Japanese Patent Application Laid-Open No. 2006-99427, two indexes, an N-gram index and a morpheme index, are created in advance. When a search request is processed, the technique determines how close the hit count of a primary search using the N-gram index (fast but not exact) is to the hit count of a search using the morpheme index. If the counts are close, the secondary search using the N-gram index (exact but slower than the primary search) is skipped, which speeds up full-text search while maintaining a certain level of accuracy. This technique requires the degree of approximation to be determined first.

The specification of Japanese Patent Application No. 2008-46582, which is related to the present application, describes a method of building a search system that uses a plurality of token division methods simultaneously: token strings generated by the other division methods are mapped onto a token string generated by one reference division method, and the information required for the mapping is added to each token. However, this method cannot be applied when there is no reference division method.

JP-A-11-204867
JP-A-8-249346
JP 2006-99427 A
Japanese Patent Application No. 2008-46582

Accordingly, an object of the present invention is to provide a search technique that gives appropriate search results while eliminating search omissions.

To achieve this object, the present invention uses at least two token division methods. Although the invention is not limited to them, two methods are preferably used: morphological analysis and the N-gram method.

When building the index used for search, a computer system according to the present invention first divides each document into tokens with each of the methods. It then calculates the boundaries of all tokens from the plurality of token division results, calculates for each token its appearance position and the appearance position of the token that follows it, and performs indexing.
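
One way to picture this step is the following minimal sketch. It assumes that each token division result is given as a list of (start, end) character offsets and that a position number is simply the rank of a boundary in the merged, sorted set of boundaries. These representations and the function names are assumptions of this example, not details confirmed by the text above.

```python
def merge_boundaries(segmentations):
    """Return the sorted list of all boundaries used by any division result."""
    return sorted({offset
                   for spans in segmentations
                   for span in spans
                   for offset in span})


def index_tokens(text, segmentations):
    """Return (token, position, next_position) for every token of every method.

    position is the number of the boundary where the token starts;
    next_position is the number of the boundary where it ends, that is,
    where the following token starts.
    """
    boundaries = merge_boundaries(segmentations)
    number_of = {offset: i for i, offset in enumerate(boundaries)}
    entries = []
    for spans in segmentations:
        for start, end in spans:
            entries.append((text[start:end], number_of[start], number_of[end]))
    return boundaries, entries


# Two hypothetical division results for the same six-character text.
text = "abcdef"
method_a = [(0, 3), (3, 6)]   # "abc" | "def"
method_b = [(0, 2), (2, 6)]   # "ab" | "cdef"
boundaries, entries = index_tokens(text, [method_a, method_b])
```

With a shared numbering of this kind, tokens from different division methods can be stored in one index, since each token records both where it starts and where its successor starts.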

Next, for restoring the original document and highlighting hit positions, the computer system stores a token string made up of all the token boundaries calculated from the plurality of token division results. Each token is stored together with its start position number and end position number.
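
As a hedged illustration, the stored token string might look like the following, assuming the merged boundaries are available as sorted character offsets and that position numbers count those boundaries from 0. The names `restoration_tokens` and `restore` are hypothetical.

```python
def restoration_tokens(text, boundaries):
    """Cut text at every boundary and keep each piece with its
    start and end position numbers."""
    return [(text[a:b], i, i + 1)
            for i, (a, b) in enumerate(zip(boundaries, boundaries[1:]))]


def restore(tokens):
    """Rebuild the original text by concatenating the stored pieces."""
    return "".join(surface for surface, _start, _end in tokens)


pieces = restoration_tokens("abcdef", [0, 2, 3, 6])
```

Only one such token string needs to be stored per document, however many token division methods contributed boundaries to it.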

At search time, the computer system divides the search term entered by the user with each token division method and searches with the resulting token strings combined by OR. A document is included in the search results if either token string matches, which prevents search omissions.
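
The OR combination can be sketched as follows. To keep the example short, the index here maps a token only to the set of documents containing it and ignores position numbers, which is a simplification assumed for this sketch; the function name `search_or` is hypothetical.

```python
def search_or(index, token_strings):
    """Return documents matched by at least one of the query token strings.

    index maps token -> set of document ids. A token string matches a
    document when the document contains all of its tokens (positions are
    ignored in this simplified sketch).
    """
    results = set()
    for tokens in token_strings:
        if not tokens:
            continue
        matched = set.intersection(*(index.get(t, set()) for t in tokens))
        results |= matched  # OR: union over the token strings
    return results


# Hypothetical index: document 1 holds the word token 東京都,
# while both documents hold the bigrams 東京 and 京都.
index = {"東京都": {1}, "東京": {1, 2}, "京都": {1, 2}}
found = search_or(index, [["東京都"], ["東京", "京都"]])
```

In this example document 2 is found through the bigram token string even though the word token 東京都 was never indexed for it.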

When the computer system restores the original document and highlights hit positions, it uses the position information of each hit token and the appearance position of the token that follows it. Even though a plurality of token divisions have been applied, this allows the original document to be restored and the hit positions to be highlighted accurately and efficiently from the stored token string made up of all token boundaries.
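
A minimal sketch of such highlighting is given below. It assumes the stored token string is a list of (surface, start position number, end position number) pieces and that each hit is reported as a pair (position, next position). The bracket markers and the function name are choices made for this example only.

```python
def highlight(pieces, hits, open_mark="[", close_mark="]"):
    """Rebuild the text, wrapping pieces covered by a hit in markers.

    pieces: list of (surface, start_pos, end_pos) in document order.
    hits:   list of (position, next_position) for the hit tokens.
    """
    out = []
    in_hit = False
    for surface, start, end in pieces:
        covered = any(h_start <= start and end <= h_end
                      for h_start, h_end in hits)
        if covered and not in_hit:
            out.append(open_mark)
        if not covered and in_hit:
            out.append(close_mark)
        out.append(surface)
        in_hit = covered
    if in_hit:
        out.append(close_mark)
    return "".join(out)


pieces = [("ab", 0, 1), ("c", 1, 2), ("def", 2, 3)]
marked = highlight(pieces, [(0, 2)])
```

Because a hit carries both its own position and that of the following token, the span to highlight is known regardless of which division method produced the hit token.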

As described above, by combining a plurality of token division methods such as morphological analysis and N-gram analysis, the present invention can search correctly even when a search term is expanded into a plurality of token strings by the different token division methods. In addition, even when a document has multiple candidate token divisions, such as those produced by morphological analysis and N-gram analysis, the original document can be restored and hit positions highlighted accurately without holding the results of each method separately.

A diagram showing a configuration in which a plurality of client computers are connected to a web server via the Internet.
A block diagram showing the configuration of a client computer.
A block diagram showing the configuration of the web server.
A functional block diagram of the logical configuration.
A flowchart of the index creation process.
A diagram showing the process of applying a plurality of division methods.
A flowchart of the process of outputting tokens for original document restoration and tokens for indexing from pending tokens.
A diagram showing an example of tokens for indexing.
A diagram showing an example of a token string for original document restoration.
A flowchart of the search process.
A flowchart of the original document restoration process.
A diagram showing an example of calculating the importance of search results.

Embodiments of the present invention are described below with reference to the drawings. Unless otherwise noted, the same reference numerals refer to the same objects throughout the drawings. The embodiment described below uses the search system of the present invention on a Web server, but this is only one embodiment; it should be understood that the invention can equally be implemented on a stand-alone computer.

In FIG. 1, a plurality of client computers 106a, 106b ... 106z are connected to a web server 102 via the Internet 104. In the system of FIG. 1, a user of a client computer logs in to the web server 102 through a Web browser over an Internet 104 connection. Specifically, the user types a predetermined URL into the Web browser to display a predetermined page. Instead of a Web browser, a dedicated client application program may be used to log in.

Having accessed the web server 102 via the Internet 104, the user of the client computer enters the desired keywords and searches for documents.

Next, with reference to FIG. 2, a hardware block diagram of the client computers denoted by reference numerals 106a, 106b ... 106z in FIG. 1 is described. In FIG. 2, the client computer has a main memory 206, a CPU 204, and an IDE controller 208, which are connected to a bus 202. A display controller 214, a communication interface 218, a USB interface 220, an audio interface 222, and a keyboard/mouse controller 228 are also connected to the bus 202. A hard disk drive (HDD) 210 and a DVD drive 212 are connected to the IDE controller 208. The DVD drive 212 is used to install programs from a CD-ROM or DVD as needed. A display device 216, preferably with an LCD screen, is connected to the display controller 214. Web pages are displayed on the display device 216 through a web browser.

Devices such as a dedicated controller or an acceleration sensor device can be connected to the USB interface 220 as needed. These devices can be used to improve operability within the web.

A keyboard 230 and a mouse 232 are connected to the keyboard/mouse controller 228. The keyboard 230 is used, for example, to type the text the user wants to search for into a search dialog (not shown) displayed on the display device 216.

The CPU 204 may be any CPU based on, for example, a 32-bit or 64-bit architecture, such as Intel's Pentium (a trademark of Intel Corporation) 4 or Core (trademark) 2 Duo, or AMD's Athlon (trademark).

The hard disk drive 210 stores at least an operating system and a Web browser (not shown) that runs on the operating system. When the system starts, the operating system is loaded into the main memory 206. Windows XP (a trademark of Microsoft Corporation), Windows Vista (a trademark of Microsoft Corporation), Linux (a trademark of Linus Torvalds), or the like can be used as the operating system.

The communication interface 218 uses the TCP/IP communication function provided by the operating system to communicate with the web server 102 via the Ethernet (trademark) protocol or the like.

FIG. 3 is a schematic block diagram of the hardware configuration on the web provider side. As shown in FIG. 3, the client computers 106a, 106b ... 106z are connected to the communication interface 302 of the web server 102 via the Internet 104. The communication interface 302 is in turn connected to a bus 304, to which a CPU 306, a main memory (RAM) 308, and a hard disk drive (HDD) 310 are connected.

Although not shown, a keyboard, a mouse, and a display may also be connected to the web server 102 and used for overall management and maintenance of the web server 102.

The hard disk drive 310 of the web server 102 stores an operating system and a table that maps user IDs to passwords for login management of the client computers 106a, 106b ... 106z. The hard disk drive 310 also stores software such as Apache that makes the web server 102 function as a Web server; this software is loaded into the main memory 308 and runs when the web server 102 starts up. This allows the client computers 106a, 106b ... 106z to access the web server 102 using the TCP/IP protocol.

As described in detail later, the hard disk drive 310 stores the content of the documents to be searched, the document index, an index creation module, a search module, and so on, which are loaded into the main memory 308 as needed.

The web server 102 may be a server such as an IBM (a trademark of International Business Machines Corporation) System X, System i, or System p, available from International Business Machines Corporation. Usable operating systems include AIX (a trademark of International Business Machines Corporation), UNIX (a trademark of The Open Group), Linux (trademark), and Windows (trademark) 2003 Server.

FIG. 4 is a functional logic block diagram according to an embodiment of the present invention. The diagram consists of a portion 402 included in the web server 102, a portion 404 included in the client computer 106, the Internet 406 as an external content source, a relational database (RDB) 408 that specifies the data collection range, and a file server 410.

In the web server 102, the content collection unit 412 crawls information sources such as the Internet 406, the RDB 408, and the file server 410 to collect the content to be searched, and temporarily stores the collected content, associated with a pointer to that content, in the content storage unit 414.

The content storage unit 414 is configured in a storage device such as the hard disk drive 310 and provides a temporary storage area for the processing that makes the collected content searchable. At least document data is extracted from the content collected here by crawling; content other than text, such as images, audio, and video, is also collected as needed.

The content collection unit 412 crawls for content at predetermined times, for example according to a preset schedule.

The web server 102 further has a character string analysis unit 416. The character string analysis unit 416 extracts document data from the content in various data formats stored in the content storage unit 414. For content written in HTML, XML, or the like, it removes the tags and extracts the document data. It can also extract the document information embedded in PDF files and, using an OCR function, extract characters from image data.
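
For the HTML case, tag removal can be sketched with the `html.parser` module of the Python standard library. The class and function names below are hypothetical, and PDF extraction and OCR are outside the scope of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    """Return the text of an HTML fragment with its tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


plain = strip_tags("<p>full <b>text</b> search</p>")
```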

The character string analysis unit 416 includes a token position definition unit 418, which contains the rules and dictionaries for dividing the documents stored in the content storage unit 414 into tokens.

According to the present invention, the character string analysis unit 416 divides a document into tokens by at least two methods. In this embodiment, one method is morphological analysis and the other is N-gram division.

The token division unit 420 cuts tokens out of the character string by morphological analysis while referring to the dictionary stored in the token position definition unit 418. The granularity of the initial division by the token division unit 420 is arbitrary, but words of large semantic units, such as compound words, are preferably left intact.

Similarly, the token expansion unit 422 refers to the dictionaries stored in the token position definition unit 418 and, as needed, expands the divided tokens for orthographic variants, inflected forms, synonyms, compound words, and abbreviations, adding the derived tokens. Dictionaries usable for token expansion include a synonym dictionary, which registers synonymous tokens in association with one another; a compound word dictionary, which registers a compound word token in association with the tokens of the smaller semantic units that make it up; and an abbreviation dictionary, which registers an abbreviation in association with the tokens of the word it stands for. Expanding tokens with these dictionaries and assigning the derived tokens to the character string improves the recall of information retrieval. Furthermore, other text mining, such as syntactic analysis (for example, dependency analysis), semantic analysis, context analysis, and named entity extraction, can be performed to extract relationships between tokens and to assign additional tokens representing the related information.
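
The dictionary-based expansion can be pictured with the following sketch. The dictionary contents are invented for illustration, and representing each dictionary as a plain mapping from a token to its derived tokens is an assumption of this example.

```python
def expand_token(token, dictionaries):
    """Return the token followed by every token derived from it.

    dictionaries is a list of mappings, token -> list of derived tokens.
    The original token comes first and duplicates are dropped.
    """
    derived = [token]
    for dictionary in dictionaries:
        for candidate in dictionary.get(token, []):
            if candidate not in derived:
                derived.append(candidate)
    return derived


# Hypothetical dictionary contents, for illustration only.
synonyms = {"car": ["automobile"]}
compounds = {"searchengine": ["search", "engine"]}
abbreviations = {"PC": ["personal computer"]}
all_dictionaries = [synonyms, compounds, abbreviations]

expanded = expand_token("searchengine", all_dictionaries)
```

Indexing the derived tokens alongside the original one lets a query for a component word or a synonym reach the same document, which is how recall improves.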

The token division unit 424 cuts tokens out of the character string by N-gram division. The resulting tokens are stored in the token expansion unit 426.

Besides morphological analysis and N-gram division, the character string analysis unit 416 can also use a token division method that simply splits words at tabs and spaces.
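
Such a delimiter-based division might be sketched as below. Returning character offsets with each token is an assumption made so that the result has the same shape as the output of other division methods; the function name is hypothetical.

```python
import re


def whitespace_tokens(text: str) -> list[tuple[str, int, int]]:
    """Split text at tabs and spaces, returning (token, start, end) offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[^ \t]+", text)]


tokens = whitespace_tokens("full text\tsearch")
```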

It should be understood that the present invention can employ essentially any known token division method, and that any two or more kinds of token division methods can be used together.

During token analysis in the character string analysis unit 416, the tokens divided from the character string and the derived tokens are written to the analysis data storage unit 428 in a data structure that preserves the positional relationships between tokens. The data structure of the analysis data is not particularly limited; for example, it can be a structure in which pairs of a token in the document data and its character position in the document data are linked roughly in order of appearance. In addition, during token analysis, the importance of each token can be determined from HTML or XML tags and the like, and associated with the token for use in ranking at search time.

The character string analysis unit 416 may start processing in response to an operation by the system administrator, according to a preset schedule, or when a predetermined amount of content has been newly added to or updated in the content storage unit 414. The token division units 420 and 424 and the token expansion units 422 and 426 are preferably loaded into the RAM 308, which provides the execution space for the CPU 306, and executed.

ウェブ・サーバ１０２はさらに、索引構築部４３０を有する。索引構築部４３０は、解析データ格納部４２８に書き込まれた解析データを読み出して、索引付けを施して、索引格納部４３６に格納する。索引構築部４３０は、トークン位置定義部４３２と文書内索引作成部４３４のモジュールを有し、これらのモジュールは、好適にはＣＰＵ３０６の実行空間を提供するＲＡＭ３０８にロードされて、実行される。 The web server 102 further includes an index construction unit 430. The index construction unit 430 reads the analysis data written in the analysis data storage unit 428, performs indexing, and stores it in the index storage unit 436. The index construction unit 430 includes modules of a token position definition unit 432 and an in-document index creation unit 434. These modules are preferably loaded into the RAM 308 that provides the execution space of the CPU 306 and executed.

索引格納部４３６は、ハードディスク・ドライブ３１０上に、データベースまたはファイルとして構成される。 The index storage unit 436 is configured as a database or file on the hard disk drive 310.

索引格納部４３６が格納する索引データは、好適には、文書中のトークンの出現位置を示す情報を含んだ転置インデックス(inverted index)として構成することができる。しかし、これは、一例であって、索引データのデータ構造は、トークンと、そのトークンを文書内に有するコンテンツと、その文書データ中の当該トークンの出現位置とが対応付けられる限り、任意のデータ構造でよい。 The index data stored in the index storage unit 436 is preferably configured as an inverted index that includes information indicating where each token appears in a document. This is only one example, however: the index data may have any data structure as long as it associates a token, the content whose document contains that token, and the position at which the token appears in that document data.

トークン位置定義部４３２は、各解析データに表れる各トークン間の位置関係から、各トークンに対して、文書内位置番号を定義して、割り当てる。文書内位置番号は、コンテンツ内の位置を識別する。 The token position definition unit 432 defines and assigns an in-document position number to each token based on the positional relationship between the tokens appearing in each analysis data. The position number in the document identifies the position in the content.

文書内索引作成部４３４は、コンテンツ毎に、トークン、文書内位置番号、及び適宜付加情報を対応づけて、文書データ内での索引である、索引エントリを作成する。索引エントリのデータ構造は、例えば、各トークン毎に、文書内位置番号及び適宜付加情報を整理した配列に、コンテンツ識別値を関連付けたものとして構成することができる。 The in-document index creation unit 434 creates an index entry that is an index in the document data by associating the token, the in-document position number, and appropriate additional information for each content. The data structure of the index entry can be configured, for example, by associating a content identification value with an array in which the position numbers in the document and appropriate additional information are arranged for each token.
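A minimal sketch of an index entry of this kind is shown below in Python: for each token, a mapping from a content identification value to the list of in-document position numbers. The function name, the content IDs, and the sample position numbers are hypothetical, and additional information such as importance is omitted for brevity.

```python
from collections import defaultdict


def build_index(documents):
    """Build {token: {content_id: [position, ...]}} from per-content token lists.

    documents: {content_id: [(token, position), ...]}
    """
    index = defaultdict(dict)
    for content_id, entries in documents.items():
        for token, position in entries:
            index[token].setdefault(content_id, []).append(position)
    return index


# Hypothetical sample data: two contents that share the token "IBM".
docs = {
    "doc1": [("日本", 1), ("IBM", 2), ("株式", 3)],
    "doc2": [("IBM", 1)],
}
index = build_index(docs)
print(dict(index))
```

Keying the outer mapping by token is what makes this an inverted index: a lookup by token returns every content that contains it, together with the positions inside that content.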

ウェブ・サーバ１０２は更に、検索部４３８を有する。検索部４３８は、クライアント・コンピュータ１０６からの検索要求に応じて、その検索要求に含まれる検索式について、索引格納部４３６からの索引データと照合しながら検索処理を実行し、検索結果をクライアント・コンピュータ１０６に返す。 The web server 102 further includes a search unit 438. In response to a search request from the client computer 106, the search unit 438 executes search processing for the search expression included in that request by matching it against the index data in the index storage unit 436, and returns the search results to the client computer 106.

検索部４３８は、検索要求受付部４４０、検索結果作成部４４２、検索処理部４４６及び元文書復元部４４８を有し、これらのモジュールは、好適にはＣＰＵ３０６の実行空間を提供するＲＡＭ３０８にロードされて、実行される。 The search unit 438 includes a search request reception unit 440, a search result creation unit 442, a search processing unit 446, and an original document restoration unit 448. These modules are preferably loaded into the RAM 308, which provides the execution space of the CPU 306, and executed there.

一方、クライアント・コンピュータ１０６側では、ウェブ・ブラウザ、プロセッサ２０４、プロセッサ２０４の実行空間を与えるメイン・メモリ２０６などが協働して機能する検索照会部４０４が構成される。検索照会部４０４は、検索要求部４５０と、検索結果表示部４５２とからなる。 On the client computer 106 side, a search inquiry unit 404 is formed by the cooperation of the web browser, the processor 204, the main memory 206 that provides the execution space of the processor 204, and the like. The search inquiry unit 404 consists of a search request unit 450 and a search result display unit 452.

検索時には、検索要求部４５０は、ユーザが、クライアント・コンピュータ１０６のウェブ・ブラウザの画面のテキスト入力領域に検索したい文字を打ち込んで、所定のボタンをマウスでクリックすることに応答して、検索要求を、検索要求受付部４４０に送る。この際、ＣＧＩなどの周知の技術を使うことができる。 At search time, the search request unit 450 sends a search request to the search request reception unit 440 in response to the user typing the characters to be searched for into the text input area of the web browser screen of the client computer 106 and clicking a predetermined button with the mouse. A well-known technique such as CGI can be used for this.

検索要求受付部４４０では、検索要求部４５０からの検索要求を受け取ると、その受け取った検索要求を解析して、その解析結果を、検索処理部４４６に発行する。検索要求は、検索文字列、及び検索文字列間を接続する論理演算子などの検索式を含むことができる。検索文字列はさらに、形態素解析などにより、検索トークンに分割することができる。この分割の際に、文字列解析部４１６を呼び出すことができる。 On receiving a search request from the search request unit 450, the search request reception unit 440 analyzes the request and issues the analysis result to the search processing unit 446. A search request can include a search expression made up of search character strings and logical operators connecting them. A search character string can be further divided into search tokens by morphological analysis or the like, and the character string analysis unit 416 can be called to perform this division.

検索処理部４４６は、検索要求に基づき、索引格納部４３６の索引データに対する照会を実行して、その照会に対する照会集合を取得する。元文書復元部４４８は、当該照会集合に基づき、解析データ格納部４２８のデータを参照して、元文書を復元する。復元された元文書は、検索結果作成部４４２を介して、クライアント・コンピュータ１０６の検索結果表示部４５２に送られ、その内容は、クライアント・コンピュータ１０６のウェブ・ブラウザの画面に表示される。このとき、元文書復元部４４８は、元文書を復元しつつ、ヒット位置が強調表示可能であるように、ウェブ・ブラウザの画面上の属性を設定可能である。典型的には、元文書復元部４４８は、ヒット位置に〜などのタグを付与して、復元した元文書を、クライアント・コンピュータ１０６に送出してもよい。 Based on the search request, the search processing unit 446 executes a query against the index data in the index storage unit 436 and obtains the query set for that query. The original document restoration unit 448 restores the original document on the basis of this query set, referring to the data in the analysis data storage unit 428. The restored original document is sent via the search result creation unit 442 to the search result display unit 452 of the client computer 106, and its content is displayed on the web browser screen of the client computer 106. While restoring the original document, the original document restoration unit 448 can set display attributes for the web browser so that the hit positions can be highlighted. Typically, the original document restoration unit 448 may attach markup tags to the hit positions and send the restored original document to the client computer 106.

次に、図５のフローチャートを参照して、インデックスを作成する処理について説明する。 Next, processing for creating an index will be described with reference to the flowchart of FIG.

先ず、ステップ５０２では、コンテンツ収集部４１２が、インターネット４０６、ファイルサーバ４１０などから、文書データを取得して、コンテンツ格納部４１４に格納する。 First, in step 502, the content collection unit 412 acquires document data from the Internet 406, the file server 410, etc., and stores it in the content storage unit 414.

ステップ５０４では、文字列解析部４１６が、コンテンツ格納部４１４に格納された文書データに含まれる文字列を、用意されている複数のトークン分割手法で分割する。この実施例では、トークン分割部４２０による形態素解析手法と、トークン分割部４２４によるＮグラム手法である。こうして分割されたトークンはさらに、トークン展開部４２２、４２６でそれぞれ適宜展開される。これらの展開は、好適には、主記憶３０８に展開されるが、主記憶３０８に十分な容量がない場合は、ハードディスク・ドライブ３１０に展開される。 In step 504, the character string analysis unit 416 divides the character strings contained in the document data stored in the content storage unit 414 using each of the prepared token division methods. In this embodiment, these are the morphological analysis method performed by the token division unit 420 and the N-gram method performed by the token division unit 424. The tokens obtained in this way are then expanded as appropriate by the token expansion units 422 and 426, respectively. The expanded tokens are preferably held in the main memory 308, but are written to the hard disk drive 310 if the main memory 308 does not have sufficient capacity.

図６に、「日本IBM株式会社」という文字列を、形態素解析手法である分割方法１と、Ｎグラム手法である分割方法２で分割する様子が示されている。図６にはまた、「株式会社」というトークンに、「(株)」という同義語が展開によって付加されることも示されている。 FIG. 6 shows how the character string 「日本IBM株式会社」 ("IBM Japan, Ltd.") is divided by division method 1, a morphological analysis method, and by division method 2, an N-gram method. FIG. 6 also shows that the synonym 「(株)」 is added by expansion to the token 「株式会社」.
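The two division methods and the synonym expansion can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the tokenizers of the embodiment: the morphological analyzer is replaced by a longest-match lookup in a tiny hypothetical lexicon, and the N-gram splitter is assumed to produce character bigrams while keeping runs of ASCII letters and digits whole. The exact tokens drawn in FIG. 6 may differ.

```python
def char_class(ch):
    """Coarse script class: ASCII letters/digits versus everything else."""
    return "latin" if ch.isascii() and ch.isalnum() else "other"


def runs(text):
    """Split text into (start, end) runs of characters of the same class."""
    out, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or char_class(text[i]) != char_class(text[start]):
            out.append((start, i))
            start = i
    return out


def ngram_split(text, n=2):
    """Assumed N-gram rule: overlapping n-grams inside non-Latin runs."""
    tokens = []
    for start, end in runs(text):
        if char_class(text[start]) == "latin" or end - start <= n:
            tokens.append((text[start:end], start, end))
        else:
            for i in range(start, end - n + 1):
                tokens.append((text[i:i + n], i, i + n))
    return tokens


def dictionary_split(text, lexicon):
    """Stand-in for morphological analysis: greedy longest match."""
    tokens, i = [], 0
    while i < len(text):
        match = max((w for w in lexicon if text.startswith(w, i)),
                    key=len, default=text[i])
        tokens.append((match, i, i + len(match)))
        i += len(match)
    return tokens


def expand(tokens, synonyms):
    """Add each synonym as a derived token covering the same span."""
    out = []
    for tok, start, end in tokens:
        out.append((tok, start, end))
        out.extend((syn, start, end) for syn in synonyms.get(tok, []))
    return out


text = "日本IBM株式会社"
morph = expand(dictionary_split(text, {"日本", "IBM", "株式会社"}),
               {"株式会社": ["(株)"]})
ngram = ngram_split(text)
print(morph)
print(ngram)
```

Each token is carried as a (text, start, end) triple so that the results of both methods can later be merged by position.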

ステップ５０６では、文字列解析部４１６が、分割されたトークンを出現順に選択する。 In step 506, the character string analysis unit 416 selects the divided tokens in the order of appearance.

ステップ５０８では、トークンの開始境界と異なるかどうか、文字列解析部４１６が判断する。もしそうでないなら、処理は、ステップ５３２に進む。ステップ５３２については、後で説明する。 In step 508, the character string analysis unit 416 determines whether the token boundary is different from the start boundary. If not, processing proceeds to step 532. Step 532 will be described later.

ステップ５０８での判断が肯定的だと、処理は、ステップ５１０に進み、そこで、位置決定保留中トークンがあるかどうかが判断される。 If the determination at step 508 is affirmative, processing proceeds to step 510 where it is determined whether there is a pending position token.

ステップ５１０で、位置決定保留中トークンがあると判断されると、ステップ５１２で、位置決定保留中トークンが選択される。後で詳しく説明するが、位置決定保留中トークンとは、好適には主記憶３０８上の所定の領域に保持され、元文書復元用トークンを作成する基となるものである。 If it is determined at step 510 that there is a positioning pending token, a positioning pending token is selected at step 512. As will be described in detail later, the position determination pending token is preferably held in a predetermined area on the main memory 308 and serves as a basis for creating an original document restoration token.

ステップ５１０で、位置決定保留中トークンがないと判断されると、処理は、ステップ５２４に進む。ステップ５２４については、後で説明する。 If it is determined at step 510 that there is no pending positioning token, processing proceeds to step 524. Step 524 will be described later.

戻って、ステップ５１２の後は、ステップ５１４で、選択中トークンの開始境界が、選択した保留中トークンの終了境界より大きいかどうかが判断される。もしそうであるなら、処理はステップ５２４に進む。 Returning to step 512, after step 512, it is determined whether the start boundary of the selected token is greater than the end boundary of the selected pending token. If so, processing proceeds to step 524.

ステップ５１４で、選択中トークンの開始境界が、選択した保留中トークンの終了境界より大きくないと判断されると、ステップ５１６で、記憶している境界が、選択した保留中トークンの終了境界より小さいかどうかが判断される。ここでいう記憶している境界とは、後述するステップ５２０で記憶される保留中トークンの終了位置と、ステップ５２８におけるトークンの開始位置とで規定されるものである。なお、記憶している境界、保留中トークンの終了位置、トークンの開始位置などは、好適には主記憶３０８の所定の共有メモリ領域に書き換え可能に維持されて、文字列解析部４１６などのモジュールによってアクセス可能な変数である。 If it is determined in step 514 that the start boundary of the selected token is not greater than the end boundary of the selected pending token, step 516 determines whether the stored boundary is smaller than the end boundary of the selected pending token. The stored boundary referred to here is defined by the end position of the pending token stored in step 520, described later, and the start position of the token stored in step 528. The stored boundary, the end position of the pending token, the start position of the token, and the like are variables that are preferably kept rewritable in a predetermined shared memory area of the main memory 308 and are accessible by modules such as the character string analysis unit 416.

ステップ５１６で、記憶している境界が、選択した保留中トークンの終了境界より小さいと判断されると、ステップ５１８で、記憶している境界と保留中トークンの終了境界より、元文書復元用トークンが生成され、次のステップ５２０で、位置カウンタを増加し、保留中トークンの終了位置を記憶し、位置決定保留中全トークンの終了位置の再計算を行なう処理が行なわれ、ステップ５２２で、選択した保留中トークンを索引用トークンとして出力し、位置決定保留中トークンから除くという処理が行なわれて、処理はステップ５１０の判断に戻る。 If it is determined in step 516 that the stored boundary is smaller than the end boundary of the selected pending token, step 518 generates an original document restoration token from the stored boundary and the end boundary of the pending token. The next step, 520, increments the position counter, stores the end position of the pending token, and recalculates the end positions of all tokens pending position determination. Step 522 then outputs the selected pending token as an index token and removes it from the tokens pending position determination, and the process returns to the determination in step 510.

ステップ５１６で、記憶している境界が、選択した保留中トークンの終了境界より小さくないと判断されると、直ちにステップ５２２に進んで、そこでの処理の後、ステップ５１０の判断に戻る。 If it is determined in step 516 that the stored boundary is not smaller than the end boundary of the selected pending token, the process immediately proceeds to step 522 and returns to the determination in step 510 after processing there.

さて、ステップ５２４の判断は、ステップ５１０で位置決定保留中トークンがないと判断された場合、あるいはステップ５１４で選択中トークンの開始位置が選択した保留中トークンの終了境界より大きいと判断される場合に、実行される。 The determination in step 524 is executed when step 510 finds no token pending position determination, or when step 514 finds that the start position of the selected token is greater than the end boundary of the selected pending token.

そのステップ５２４で、トークンの開始境界が記憶している境界より大きいと判断されると、ステップ５２６では、記憶している境界とトークンの開始境界より、元文書復元用トークンが生成される。次にステップ５２８で、位置カウンタを増加し、トークンの開始位置を記憶する処理が行なわれ、ステップ５３０では、位置決定保留中全トークンの終了位置の再計算が行なわれる。 If it is determined in step 524 that the token start boundary is larger than the stored boundary, an original document restoration token is generated in step 526 from the stored boundary and the token start boundary. Next, in step 528, the position counter is incremented and the start position of the token is stored. In step 530, the end positions of all tokens pending position determination are recalculated.

ステップ５２４で、トークンの開始境界が記憶している境界より大きくないと判断されると、ステップ５２６とステップ５２８をスキップして、直接ステップ５３０の処理に進む。 If it is determined in step 524 that the token start boundary is not larger than the stored boundary, step 526 and step 528 are skipped and the process proceeds directly to step 530.

次のステップ５３２では、選択したトークンが、位置決定保留中トークンに追加される。なお、ステップ５０８での処理が否定的である場合にも、処理は直接ステップ５３２に来る。 In the next step, 532, the selected token is added to the tokens pending position determination. When the determination in step 508 is negative, the process also comes directly to step 532.

ステップ５３４では、最後のトークンかどうかが判断され、そうであれば処理はステップ５３６に進み、そうでなければ、処理はステップ５０８の判断に戻る。 In step 534, it is determined whether it is the last token. If so, the process proceeds to step 536; otherwise, the process returns to the determination in step 508.

ステップ５３６では、保留中トークンから、元文書復元用トークンと、索引用トークンを出力する処理が行なわれる。このとき、出力された元文書復元用トークンと、索引用トークンの情報は、後で検索及び文書の表示に利用するため、ハードディスク・ドライブ３１０に書き出されて保存される。 In step 536, processing for outputting the original document restoration token and the index token from the pending token is performed. At this time, the output information of the original document restoration token and the index token is written and stored in the hard disk drive 310 for later use in search and document display.

図７は、図５のステップ５３６の処理をより詳細に示すフローチャートである。図７のステップ７０２では、位置決定保留中トークンを調べて、位置決定保留中トークンがある限り、ステップ７０６〜７１０を実行する。 FIG. 7 is a flowchart showing the process of step 536 of FIG. 5 in more detail. In step 702 of FIG. 7, the position determination pending token is checked, and steps 706 to 710 are executed as long as there is a position determination pending token.

すなわち、ステップ７０４では、位置決定保留中トークンが選択され、ステップ７０６では記憶している境界と保留中トークンの終了境界により、元文書復元用トークンが生成される。 That is, in step 704, the position determination pending token is selected, and in step 706, the original document restoration token is generated based on the stored boundary and the end boundary of the pending token.

ステップ７０８では、位置カウンタを増加し、位置決定保留中トークンの終了位置を記憶し、位置決定保留中全トークンの終了位置の再計算する処理が行なわれ、次のステップ７１０では、選択した位置決定保留中トークンを索引用トークンとして出力し、位置決定保留中トークンから除く処理が行なわれる。 Step 708 increments the position counter, stores the end position of the token pending position determination, and recalculates the end positions of all tokens pending position determination. The next step, 710, outputs the selected token as an index token and removes it from the tokens pending position determination.

こうして、位置決定保留中トークンがなくなると、ステップ７０２の判断が否定的となって、処理が終了する。 Thus, when there is no position determination pending token, the determination in step 702 is negative and the process ends.

図８は、「日本IBM株式会社」という文字列に対して、図５及び図７のフローチャートの処理で生成された索引用トークンを示す図である。これは、採用されている全種類のトークン分割方法（この実施例では、形態素解析とＮグラム）で分割された全てのトークンを含む。例えば、「株式」は、位置番号3で登録され、その次に来るトークンの位置は5である、という情報をトークンがもつようにする。より効率的には、トークン位置の差分を保持するようにしてもよい。実際は、トークンの先頭に、そのトークンがあらわれる文書の文書ＩＤが付けられることになる。 FIG. 8 shows the index tokens generated by the processing of the flowcharts of FIGS. 5 and 7 for the character string 「日本IBM株式会社」. They include every token produced by every token division method in use (in this embodiment, morphological analysis and N-gram). For example, the token 「株式」 is registered with position number 3 and carries the information that the position of the token following it is 5. For efficiency, the difference between token positions may be stored instead. In practice, the document ID of the document in which the token appears is attached to the front of the token.

図９は、元文書復元用トークン列を示す図である。元文書復元用トークンは、図８に示すような、全種類のトークン分割方法で得られたトークンの分割結果から算出された全トークン境界を元にしたトークン列を、各トークンの開始位置番号、終了位置番号と共に保管したものである。この場合も、より効率的には、トークン位置の差分を保持するようにしてもよい。 FIG. 9 shows an original document restoration token string. As shown in FIG. 8, the original document restoration token includes a token string based on all token boundaries calculated from token division results obtained by all types of token division methods, and a start position number of each token. Stored with the end position number. Also in this case, the difference between the token positions may be held more efficiently.
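The relationship between the index tokens and the original document restoration tokens can be sketched in Python as below. This is a batch simplification, not the streaming procedure of FIGS. 5 and 7: it collects every start and end boundary from all division results, numbers the boundaries from 1, and derives both token kinds from that numbering. The token spans are assumed for illustration; they were chosen so that 「株式」 receives position 3 and next position 5, as stated in the text.

```python
def all_boundaries(tokens):
    """Sorted character offsets at which any token starts or ends."""
    return sorted({b for _, start, end in tokens for b in (start, end)})


def index_tokens(tokens):
    """Give each token its position number and the position that follows it."""
    number = {b: i + 1 for i, b in enumerate(all_boundaries(tokens))}
    return [(tok, number[start], number[end]) for tok, start, end in tokens]


def restoration_tokens(text, tokens):
    """Cut the text at every boundary; keep start and end position numbers.

    Assumes the tokens together cover the text from offset 0 to its end.
    """
    bounds = all_boundaries(tokens)
    return [(text[s:e], i + 1, i + 2)
            for i, (s, e) in enumerate(zip(bounds, bounds[1:]))]


text = "日本IBM株式会社"
# Assumed division results as (token, start offset, end offset).
morph = [("日本", 0, 2), ("IBM", 2, 5), ("株式会社", 5, 9)]
ngram = [("日本", 0, 2), ("IBM", 2, 5), ("株式", 5, 7),
         ("式会", 6, 8), ("会社", 7, 9)]

idx = index_tokens(morph + ngram)
rest = restoration_tokens(text, morph + ngram)
print(idx)
print(rest)
```

Because the restoration tokens are the pieces between consecutive boundaries, concatenating them in order reproduces the original string, while any index token corresponds to a contiguous range of them.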

次に、図１０のフローチャートを参照して、作成された索引に基づき、検索を行う処理を説明する。この処理は、図４の検索処理部４４６によって実行される。図１０においてまず、ステップ１００２では、ユーザが入力し、検索要求受付部４４０が受領した検索語が、全解析手法で解析され、トークンに分解され、分解されたトークンは更に展開される。このとき、好適には、図５のステップ５０４と同じ処理が行なわれる。 Next, a process for performing a search based on the created index will be described with reference to the flowchart of FIG. This processing is executed by the search processing unit 446 in FIG. In FIG. 10, first, in step 1002, the search term input by the user and received by the search request receiving unit 440 is analyzed by all analysis methods, decomposed into tokens, and the decomposed tokens are further expanded. At this time, preferably, the same processing as step 504 in FIG. 5 is performed.

ステップ１００４では、索引作成と同様に、各トークン位置の検出が行なわれる。ここでは、好適には、図５で実行されるトークン生成方法と同様にして、検索トークン列が生成され、ステップ１００６では、そうして生成された複数の検索トークン列が順次選ばれる。 In step 1004, each token position is detected as in the index creation. Here, preferably, a search token string is generated in the same manner as the token generation method executed in FIG. 5. In step 1006, a plurality of search token strings generated in this manner are sequentially selected.

ステップ１００８では、選ばれた検索トークン列を文書データ内に含むコンテンツの集合１〜Ｓが検索により取得される。ここでＳは、１以上の整数である。この検索は、好適には、索引格納部４３６を検索することによって行われる。 In step 1008, a set of contents 1 to S including the selected search token string in the document data is acquired by searching. Here, S is an integer of 1 or more. This search is preferably performed by searching the index storage unit 436.

ステップ１０１０では、集合１〜Ｓが交わりをもつかどうか判断され、もし交わりをもたないなら、ステップ１０２６に進み、最後のトークン列かどうかが判断され、もし最後のトークン列であるなら、処理は終わり、もしそうでないなら、ステップ１００６に戻る。 In step 1010, it is determined whether the sets 1 to S have a non-empty intersection. If they do not, the process proceeds to step 1026, where it is determined whether the current token string is the last one. If it is the last token string, the process ends; otherwise, the process returns to step 1006.

ステップ１０１０で、集合１〜Ｓが交わりをもつと判断されると、ステップ１０１２で、集合１〜Ｓの積集合が中間照会集合として作成される。 If it is determined in step 1010 that the sets 1 to S have a non-empty intersection, the intersection of the sets 1 to S is created in step 1012 as an intermediate query set.

ステップ１０１４で、中間照会集合に含まれるコンテンツが選択される。 At step 1014, content included in the intermediate query set is selected.

ステップ１０１６で、選択したコンテンツの文書データ内での検索トークン列の連続性が検証される。これは、検索トークン列が文書データの境界に跨らないかどうかの検証である。 In step 1016, the continuity of the search token string in the document data of the selected content is verified. This is verification of whether the search token string does not cross the boundary of the document data.

ステップ１０１８では、連続が少なくとも１つ維持されたかどうかが判断され、もしそうなら、ステップ１０２０で、選択したコンテンツが中間照会集合に維持され、そうでなければ、ステップ１０２２で選択したコンテンツが中間照会集合から削除される。 In step 1018, it is determined whether at least one continuation has been maintained, and if so, in step 1020, the selected content is maintained in the intermediate query set, otherwise the content selected in step 1022 is the intermediate query. Removed from the set.

ステップ１０２４では、照会結果集合と中間照会集合の和集合が、照会結果集合とされる。そして、ステップ１０２６に進み、最後のトークン列かどうかが判断され、もし最後のトークン列であるなら、処理は終わり、もしそうでないなら、ステップ１００６に戻る。 In step 1024, the union of the query result set and the intermediate query set is taken as the query result set. Then, the process proceeds to step 1026, where it is determined whether or not it is the last token string. If it is the last token string, the process ends. If not, the process returns to step 1006.
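The loop of FIG. 10 can be sketched in Python as follows. The index layout ({token: {content: [(position, next position), ...]}}), the sample contents, and the search token strings are assumptions for illustration. Continuity is checked here by requiring each token to start at the position where the previous one ended, which is one plausible reading of the verification in step 1016.

```python
def is_continuous(index, doc, sequence):
    """True if the tokens of the sequence occur back to back in doc."""
    ends = {nxt for _, nxt in index[sequence[0]][doc]}
    for token in sequence[1:]:
        ends = {nxt for pos, nxt in index[token][doc] if pos in ends}
        if not ends:
            return False
    return True


def search(index, sequences):
    """Union, over all search token strings, of the contents that match."""
    result = set()
    for seq in sequences:                                       # step 1006
        sets = [set(index.get(tok, {})) for tok in seq]         # step 1008
        candidates = set.intersection(*sets) if sets else set() # steps 1010-1012
        kept = {d for d in candidates
                if is_continuous(index, d, seq)}                # steps 1014-1022
        result |= kept                                          # step 1024
    return result


# Hypothetical index: doc1 holds 「日本IBM株式会社」, doc2 holds 「会社の株式」.
index = {
    "株式会社": {"doc1": [(3, 7)]},
    "株式": {"doc1": [(3, 5)], "doc2": [(4, 6)]},
    "会社": {"doc1": [(5, 7)], "doc2": [(1, 3)]},
}

# Two assumed token strings for the search word 「株式会社」, joined by OR.
hits = search(index, [["株式会社"], ["株式", "会社"]])
print(hits)
```

In this sample, doc2 contains both 「株式」 and 「会社」 but not adjacently in that order, so it survives the intersection and is then dropped by the continuity check.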

図１１は、元文書復元処理のフローチャートを示す図である。この処理は、図４の元文書復元部４４８によって実行される。図１１のステップ１１０２では、表示する文書に含まれる全ての元文書復元用トークンが取得される。 FIG. 11 is a flowchart of the original document restoration process. This process is executed by the original document restoration unit 448 of FIG. In step 1102 of FIG. 11, all original document restoration tokens included in the document to be displayed are acquired.

ステップ１１０４では、例えば図１０の検索結果として得られた、ヒットしたトークンの索引格納部４３６に保持されている開始位置情報及び終了位置情報が取得される。 In step 1104, for example, start position information and end position information held in the index storage unit 436 of the hit token obtained as a search result of FIG. 10 are acquired.

ステップ１１０６では、ステップ１１０２で選択されたトークンが順次選択される。 In step 1106, the tokens selected in step 1102 are sequentially selected.

ステップ１１０８では、選択されたトークンの開始位置が、ヒット範囲としての開始位置情報及び終了位置情報の間に含まれるかどうかが判断され、もしそうなら、ステップ１１１０で、強調表示が開始される。 In step 1108, it is determined whether the start position of the selected token is included between the start position information and the end position information as a hit range. If so, in step 1110, highlighting is started.

ステップ１１１２では、トークンが、元文書復元文字列に追加される。 In step 1112, the token is added to the original document restoration character string.

ステップ１１１４では、選択されたトークンの終了位置が、ヒット範囲としての開始位置情報及び終了位置情報の間に含まれるかどうかが判断され、もしそうなら、ステップ１１１６で、強調表示が終了される。 In step 1114, it is determined whether or not the end position of the selected token is included between the start position information and the end position information as the hit range. If so, in step 1116, the highlighting is ended.

そして、ステップ１１１８に進み、最後のトークン列かどうかが判断され、もし最後のトークン列であるなら、処理は終わり、もしそうでないなら、ステップ１１０６に戻る。 Then, the process proceeds to step 1118, where it is determined whether or not it is the last token string. If it is the last token string, the process ends. If not, the process returns to step 1106.
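The restoration loop of FIG. 11 can be sketched in Python as follows. This is a simplified reading: a restoration token is highlighted when its whole span lies inside the hit range, and adjacent highlighted tokens share one pair of tags. The `<em>` tags, the function name, and the sample position numbers are assumptions; the text does not state which markup is used.

```python
def restore_with_highlight(tokens, hit_start, hit_end,
                           open_tag="<em>", close_tag="</em>"):
    """Rebuild the document text, wrapping the hit range in tags.

    tokens: [(text, start position number, end position number), ...]
    """
    parts, highlighting = [], False
    for text, start, end in tokens:                 # step 1106
        inside = hit_start <= start and end <= hit_end
        if inside and not highlighting:             # steps 1108-1110
            parts.append(open_tag)
            highlighting = True
        elif not inside and highlighting:           # steps 1114-1116
            parts.append(close_tag)
            highlighting = False
        parts.append(text)                          # step 1112
    if highlighting:                                # hit runs to the end
        parts.append(close_tag)
    return "".join(parts)


# Hypothetical restoration tokens for 「日本IBM株式会社」.
tokens = [("日本", 1, 2), ("IBM", 2, 3), ("株", 3, 4),
          ("式", 4, 5), ("会", 5, 6), ("社", 6, 7)]

print(restore_with_highlight(tokens, 3, 5))
print(restore_with_highlight(tokens, 3, 7))
```

Because the restoration tokens are cut at every boundary from every division method, a hit produced by either method always begins and ends on a token edge.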

結果の元文書復元文字列は、検索結果作成部４４２から、検索結果表示部４５２に送られて、ユーザのクライアント・コンピュータ１０６のディスプレイ２１６に表示される。 The resulting restored original document character string is sent from the search result creation unit 442 to the search result display unit 452 and displayed on the display 216 of the user's client computer 106.

図１２は、検索結果を重要度で重み付けするための処理の例を示す図である。この例では、「東京都」という元検索文字列が、形態素解析と所定の論理操作によって、(東京都 OR (東京 AND 京都))と論理式に分解される。 FIG. 12 shows an example of processing for weighting search results by importance. In this example, the original search character string 「東京都」 is decomposed by morphological analysis and a predetermined logical operation into the logical expression (東京都 OR (東京 AND 京都)).

論理式中で、「東京都」のところは、元検索文字列と同一なので1と置かれ、「東京」のところは、形態素解析によって「東京都」と同一視されるので1と置かれ、「京都」のところは、「東京都」の文脈では、「東」＋「京都」と読まれる場合なので、可能性は1/3であると看做すと、(東京都 OR (東京 AND 京都)) = (1 + (1 * 1/3)) = 4/3で、この値で正規化すると、「東京都」の部分の重みは0.75、(東京 AND 京都)の部分の重みは0.25となる。 In the logical expression, 「東京都」 is assigned 1 because it is identical to the original search string, and 「東京」 is assigned 1 because morphological analysis identifies it with 「東京都」. 「京都」 arises only when 「東京都」 is read as 「東」 + 「京都」, so its likelihood is taken to be 1/3. Then (東京都 OR (東京 AND 京都)) = (1 + (1 * 1/3)) = 4/3, and normalizing by this value gives a weight of 0.75 for the 「東京都」 part and 0.25 for the (東京 AND 京都) part.
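The arithmetic of this example can be checked with a short Python sketch. It treats OR as addition and AND as multiplication, as the worked example does; the function name and the dictionary layout are assumptions for illustration.

```python
from fractions import Fraction


def normalized_weights(terms):
    """Divide each OR-branch value by the sum of all branch values."""
    total = sum(terms.values())
    return {name: value / total for name, value in terms.items()}


# Branch values from the example: 東京都 = 1, (東京 AND 京都) = 1 * 1/3.
terms = {
    "東京都": Fraction(1),
    "東京 AND 京都": Fraction(1) * Fraction(1, 3),
}
weights = normalized_weights(terms)

for name, weight in weights.items():
    print(name, float(weight))
```

Exact fractions are used so that the sum 4/3 and the normalized values 3/4 and 1/4 are reproduced without rounding error.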

索引中では、例えば、その個々のキーワード毎の、ヒットした文書中の出現率に、その重みを掛けた値を個々のキーワードの重みと看做して、元の論理式に代入して全体の重みを計算するなどして、検索結果の重要度を計算することにより、複数の検索結果を重要度順にソートして表示することが可能となる。 In the index, for example, the value obtained by multiplying the appearance rate in the hit document for each individual keyword by the weight is regarded as the weight of the individual keyword, and is substituted into the original logical expression. By calculating the importance of search results by calculating weights, etc., it becomes possible to sort and display a plurality of search results in order of importance.

これは検索結果の重要度を計算するための、本発明に適用し得る一例に過ぎないことを理解されたい。 It should be understood that this is only one example that can be applied to the present invention for calculating the importance of search results.

なお、この実施例では、クライアント・コンピュータは、通常のパーソナル・コンピュータとして示されているが、携帯、ＰＤＡなどのモバイル・デバイスからアクセスして検索できるようにしてもよい。 In this embodiment, the client computer is shown as a normal personal computer. However, the client computer may be accessed and searched from a mobile device such as a mobile phone or a PDA.

また、トークン分割方法は、２つでなく３つ以上を適用してもよい。その際、形態素解析分割法とＮグラム分割法を含んでいてもよく、そうでなくてもよい。すなわち、単にタブやスペースで単語を分割する、という分割方法も適宜使用することができる。本発明によれば、どのような分割方法で得られたトークンでも、統一的に扱うことができる。 Also, three or more token dividing methods may be applied instead of two. At that time, the morphological analysis division method and the N-gram division method may be included or not. That is, a dividing method of simply dividing words by tabs or spaces can be used as appropriate. According to the present invention, tokens obtained by any division method can be handled uniformly.

また、上記実施例では、文書として、日本語の文書を検索するものであったが、英語、ドイツ語、フランス語などの任意の印欧語、アラビア語、ヘブライ語などのセム語、あるいは文字をもつ任意の言語で書かれた文書の検索に本発明を適用することが可能である。その際、本発明における、任意のトークン分割方法を使用できるという柔軟性が、多様な言語への適用性を高めることが理解されるだろう。 In the above embodiment, Japanese documents are searched, but the present invention can also be applied to searching documents written in any Indo-European language such as English, German, or French, in a Semitic language such as Arabic or Hebrew, or in any other language that has a writing system. It will be understood that the flexibility of the present invention in allowing any token division method increases its applicability to diverse languages.

さらに、本発明を、ウェブ・サーバ上で索引の作成及び検索を行う構成の実施例で説明したが、本発明は、スタンドアロンのコンピュータ・システム上でも実現できることは当業者に明らかであろう。 Further, while the present invention has been described with an embodiment of an arrangement for creating and retrieving an index on a web server, it will be apparent to those skilled in the art that the present invention can be implemented on a stand-alone computer system.

４２０・・・第１のトークン分割部 420 ... first token division unit
４２４・・・第２のトークン分割部 424 ... second token division unit
４２８・・・解析データ格納部 428 ... analysis data storage unit
４３０・・・索引構築部 430 ... index construction unit
４３６・・・索引格納部 436 ... index storage unit

Claims

An index creation system for retrieving a document stored in a storage device by computer processing,
A first token splitting unit that reads the document and generates a token by a first splitting method;
A second token splitting unit that reads the document and generates a token by a second splitting method different from the first splitting method;
Each of the generated tokens is provided with a start position and an end position in the document and stored as an index in the storage device.
Index creation system.

The index creation system according to claim 1, wherein the first division method is a morphological analysis method, and the second division method is an N-gram method.

The index creation system according to claim 1, further comprising means for generating original document restoration tokens, which hold a token string based on all token boundaries obtained by the first and second token division methods together with the start position number and end position number of each token, and for storing them in the storage device.

An index creation method for searching a document stored in a storage device by computer processing,
Reading the document and generating a token with a first segmentation technique;
Reading the document and generating a token with a second splitting technique different from the first splitting technique;
Each of the generated tokens is provided with a start position and an end position in the document and stored as an index in the storage device.
Index creation method.

The index creation method according to claim 4, wherein the first division method is a morphological analysis method, and the second division method is an N-gram method.

The index creation method according to claim 4, further comprising a step of generating original document restoration tokens, which hold a token string based on all token boundaries obtained by the first and second token division methods together with the start position number and end position number of each token, and storing them in the storage device.

An index creation program for retrieving documents stored in a storage device by computer processing,
The computer,
Reading the document and generating a token with a first segmentation technique;
Reading the document and generating a token with a second splitting technique different from the first splitting technique;
Each of the generated tokens is provided with a start position and an end position in the document and stored in the storage device as an index.
Index creation program.

The index creation program according to claim 7, wherein the first division method is a morphological analysis method, and the second division method is an N-gram method.

The index creation program according to claim 7, further causing the computer to execute a step of generating original document restoration tokens, which hold a token string based on all token boundaries obtained by the first and second token division methods together with the start position number and end position number of each token, and storing them in the storage device.

A search system for searching a document stored in a storage device by computer processing,
The document is read, and a first token generated by the first dividing method and a token generated by a second dividing method different from the first dividing method include a start position in the document An index file assigned an end position and stored in the storage device;
A means of accepting the string to be searched;
Means for obtaining a plurality of tokens to be searched by dividing the accepted character string by the first dividing method and the second dividing method;
Means for calculating, as a search result, the union of the documents containing the individual tokens by searching the index file with the tokens to be searched.
Search system.

The search system according to claim 10, wherein the first division method is a morphological analysis method, and the second division method is an N-gram method.

The search system according to claim 10, further comprising: an original document restoration token file, stored in the storage device, that holds a token string based on all token boundaries obtained by the first and second token division methods together with the start position number and end position number of each token; and an original document restoration unit for restoring the original document from the original document restoration tokens.

A search method for searching a document stored in a storage device by computer processing,
The document is read, and a first token generated by the first dividing method and a token generated by a second dividing method different from the first dividing method include a start position in the document Providing an index file stored in the storage device with an end position; and
Accepting a string to search for;
Obtaining a plurality of tokens to be searched by dividing the accepted character string by the first dividing method and the second dividing method;
Calculating, as a search result, the union of the documents containing the individual tokens by searching the index file with the tokens to be searched.
Search method.

The search method according to claim 13, wherein the first division method is a morphological analysis method, and the second division method is an N-gram method.

The search method according to claim 13, further comprising the steps of: storing, in the storage device as original document restoration tokens, a token string based on all token boundaries obtained by the first and second token division methods together with the start position number and end position number of each token; and restoring the original document from the original document restoration tokens.

A search program for searching a document stored in a storage device by computer processing,
The computer,
The document is read, and a first token generated by the first dividing method and a token generated by a second dividing method different from the first dividing method include a start position in the document Providing an index file stored in the storage device with an end position; and
Accepting a string to search for;
Obtaining a plurality of tokens to be searched by dividing the accepted character string by the first dividing method and the second dividing method;
Calculating, as a search result, the union of the documents containing the individual tokens by searching the index file with the tokens to be searched.
Search program.

The search program according to claim 16, wherein the first division method is a morphological analysis method, and the second division method is an N-gram method.

The search program according to claim 16, further causing the computer to execute the steps of: storing, in the storage device as original document restoration tokens, a token string based on all token boundaries obtained by the first and second token division methods together with the start position number and end position number of each token; and restoring the original document from the original document restoration tokens.